URL Rewriting Rules (Apache)
Apache provides functionality for rewriting URLs, for purposes such as removing characters and redirection.
This is often a two-way process:
- Rewrite internal URLs to user-friendly or SEF (Search Engine Friendly) external URLs
- Convert external URLs back to internal URLs.
This article focuses on Apache's RewriteRule directive, but also covers related directives. It is a continuation of various earlier articles on this wiki related to Apache URL rewriting rules. See the section See also below. This article applies only to WordPress sites.
Means & locations
There are a couple of places around an Apache webserver where these directives can be deployed. As far as I know (2023.09), the syntax is the same everywhere.
.htaccess files
.htaccess files are files that are placed in directories within a website
- These allow for the most fine-grained control of any of the means discussed in this chapter
- Changes apply directly - No need to restart Apache
- Using .htaccess files is also often the most intuitive location, as they are widely used for all kinds of settings
- It is also the slowest approach
- Example: Disallowing the execution of files in upload folders.
Virtual host definition files
Apache directives can be included in the virtual host definition files that are used to define hosts on a webserver. These are files like /etc/apache2/sites-available/example.com.conf and /etc/apache2/sites-available/example.com-le-ssl.conf
- Settings in these files apply to the domain as a whole. They are less fine-grained than what is possible with .htaccess files
- Settings only apply after restarting or reloading Apache
- Probably faster than .htaccess files
- I don't think I ever used these files for making domain-specific settings.
As an example, here is part of the default virtual host definition file that I use, that actually includes a rewrite:
RewriteEngine on
RewriteBase /
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ index.php?q=$1 [L,QSA]
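To make the effect of the RewriteRule line concrete, here is a minimal Python sketch of what it does to a request that is neither an existing file nor an existing directory. The function name front_controller is hypothetical, and the sketch assumes per-directory context, where the path is matched without its leading slash.

```python
def front_controller(path, query=""):
    """Mimic: RewriteRule ^(.*)$ index.php?q=$1 [L,QSA]"""
    # $1 is the whole requested path, because ^(.*)$ captures everything
    target = "index.php?q=" + path
    # QSA appends the original query string to the newly created one
    if query:
        target += "&" + query
    return target


print(front_controller("about/team"))
print(front_controller("search", "page=2"))
```

The second call shows the role of QSA: the original query string page=2 is kept and appended after q=search.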
Apache configuration files
The last and most 'fundamental' way to use directives is in the Apache configuration files. Settings here apply to all domains. /etc/apache2/apache2.conf is probably the most commonly used configuration file, and probably the only one I know off the top of my head
- Settings only apply after restarting or reloading Apache
- Probably the fastest way to implement directives
- Example: Preventing browsers from listing the contents of folders.
Syntax
Syntax of Apache's RewriteRule directive:
RewriteRule <pattern> <substitution> [flags]
This might seem counter-intuitive, as pattern and substitution seem to be reversed, but they aren't when you read the syntax from Apache's point of view:
RewriteRule <pattern> <substitution> [flags]
<=> RewriteRule <read this> <translate into this> [flags]
<=> RewriteRule <match this> <translate into this> [flags]
What might add to the confusion: The pattern argument is usually a regular expression, to make its match as narrow or as wide as desired, but it doesn't have to be, as you can see in the first example below.
Another reason why this might be non-intuitive: To me, this is very much an example of a 4GL. Rather than spelling out the steps, you state a filter, and I have never found 4GL stuff intuitive.
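The "match this, translate into this" reading can be sketched in a few lines of Python. This is only an illustration, not how mod_rewrite is implemented: the function rewrite and its parameters are hypothetical, and Python's re module is used as a stand-in for Apache's regex engine, which should behave the same for patterns this simple.

```python
import re


def rewrite(pattern, substitution, path, ignore_case=False):
    """Return the substitution if pattern matches path, else path unchanged."""
    flags = re.IGNORECASE if ignore_case else 0  # rough equivalent of [NC]
    match = re.search(pattern, path, flags)      # <match this>
    if match is None:
        return path                              # the rule does not apply

    # <translate into this>: $N is replaced by capture group N
    def backref(m):
        n = int(m.group(1))
        if n > match.re.groups:
            return ""
        return match.group(n) or ""

    return re.sub(r"\$(\d)", backref, substitution)


print(rewrite("bye-world", "hello-world.html", "bye-world", ignore_case=True))
print(rewrite(r"^([^/]+)$", "$1.html", "hello-world"))
```

Note that the whole path is replaced by the substitution, not just the part that matched the pattern.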
Example DigitalOcean
Inspired by [1]:
Create a file hello-world.html with some arbitrary HTML and verify that it works. Let's now create a RewriteRule so that you can invoke it with bye-world (without .html):
RewriteEngine on
RewriteRule bye-world hello-world.html [NC]
- This only worked after I commented out the existing stuff in this .htaccess file (a local WordPress development site)
- This also seems to work without the RewriteEngine On directive
- When you submit the original URL, it will not be rewritten to the first argument of the RewriteRule directive. Probably this is related to rewriting being a two-way process: from and to Apache. We are dealing with only one way here.
Without starting & closing regex
Actually, as this example doesn't contain the usual ^ and $ regex symbols, this RewriteRule is probably wider than intended. For example, it would match these URLs as well:
- bye-world
- bye-world.htm
- bye-world.html
- goodbye-world
- goodbye-worldchampion
- goodbye-worldchampion.html
In all these cases, you just fetch the intended file, and the URL stays as it was typed. This might be fine, but it is a problem if some of these URLs actually exist.
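The list above can be checked with a small Python sketch. It assumes that Python's re.search behaves like an unanchored RewriteRule pattern, which should hold for a plain string such as bye-world. re.IGNORECASE plays the role of the [NC] flag.

```python
import re

urls = [
    "bye-world",
    "bye-world.htm",
    "bye-world.html",
    "goodbye-world",
    "goodbye-worldchampion",
    "goodbye-worldchampion.html",
]

# Unanchored pattern: it matches anywhere inside the URL
pattern = re.compile("bye-world", re.IGNORECASE)
matches = [url for url in urls if pattern.search(url)]

print(matches == urls)
```

Every URL in the list contains the string bye-world somewhere, so every one of them is matched.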
An example where this would really be a problem:
- RewriteRule: RewriteRule drill drill.html
- Existing URL: drill.html
- Also an existing URL: angular-drill.html
- Page angular-drill.html cannot be reached anymore: All traffic is diverted to page drill.html.
So, let's clean this up:
With starting & closing regex
Our shiny new RewriteRule directive:
RewriteRule ^bye-world$ hello-world.html [NC]
Now it only matches the exact URL bye-world. For example, it doesn't match any of these:
- bye-world.html
- goodbye-world
- goodbye-world.html
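The effect of the ^ and $ anchors can be verified in the same way. Again a sketch that uses Python's re module as a stand-in for Apache's regex engine; the candidate list is made up for illustration.

```python
import re

# Anchored pattern: the whole URL must be exactly "bye-world" (case-insensitive)
anchored = re.compile(r"^bye-world$", re.IGNORECASE)

candidates = [
    "bye-world",
    "Bye-World",
    "bye-world.html",
    "goodbye-world",
    "goodbye-world.html",
]
matching = [c for c in candidates if anchored.search(c)]

print(matching)
```

Only bye-world and Bye-World remain. The latter still matches because of the [NC] (no case) flag.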
Make it more generic: Strip all .html suffixes (1)
Strip all .html suffixes (thanks to ChatGPT):
RewriteRule ^([^/]+)$ $1.html [L]
What it does:
- ^([^/]+)$: The pattern that the incoming URL is compared against
  - ^: Start of the URL path. The scheme (http:// or https://) and the hostname are not part of what is matched, and in a .htaccess file the leading slash is stripped as well
  - ([^/]+): Capture one or more characters except / and store these in a capturing group, to be reused as $1 in the substitution part of the directive
  - $: End of the URL path
- $1.html: The substitution consists of the earlier capture group plus the string .html
- [L]: This flag stands for Last, meaning that Apache shouldn't apply any more rewrite rules after this one, concerning this specific URL. Maybe to avoid complications like recursion.
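As an illustration of the capture group, here is a minimal Python sketch of this rule. The function name add_html_suffix is hypothetical, and the paths are written the way a .htaccess rule sees them, without a leading slash.

```python
import re

# Same pattern as in the RewriteRule: one or more characters, none of them "/"
PATTERN = re.compile(r"^([^/]+)$")


def add_html_suffix(path):
    """Mimic: RewriteRule ^([^/]+)$ $1.html [L]"""
    match = PATTERN.match(path)
    if match is None:
        return path                    # pattern does not match: leave as-is
    return match.group(1) + ".html"    # $1.html


print(add_html_suffix("hello-world"))
print(add_html_suffix("map1/hello-world"))
```

The second path contains a slash, so the pattern doesn't match and the path is returned unchanged.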
Make it more generic: Strip all .html suffixes (2)
In the example before, I don't understand why the pattern is restricted to URLs without /. Doesn't this seem more natural?
RewriteRule ^(.+)$ $1.html [L]
What it does:
- Pattern: Match any URL and store it in a capture group
- Substitution: The aforementioned capture group plus .html
I've checked this by putting the HTML file in a directory map1. URL http://redirect.s0/map1/hello-world works just fine.
Apparently, it doesn't add .html to URLs that already have this extension. ChatGPT:
- The pattern part of the RewriteRule directive does not include the file extension by default because of how Apache's mod_rewrite module processes URLs and how regular expressions are used in this context.
- When you define a pattern in a RewriteRule, you are essentially specifying a regular expression that Apache uses to match against the incoming URL. In regular expressions, . has a special meaning—it matches any character except for a newline. Therefore, if you include . in the pattern, it will match any character, including the dot used in file extensions.
- To create a pattern that matches URLs without their extensions, you typically avoid using . directly because it matches dots. Instead, you use patterns that specify what you want to match. For example:
- [^.]+ matches one or more characters that are not dots. This effectively matches a filename without the dot and extension.
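So, when you use ^(.+)$ as the pattern in the RewriteRule, it captures everything in the URL path except for the query string, up to the last slash. It effectively matches URLs without their extensions because it doesn't include . in the capturing group. This is a common approach when you want to perform URL rewriting to add or modify extensions or other parts of the URL.
Note that this last paragraph is at odds with the earlier ones: in a regular expression, . matches any character, dots included, so ^(.+)$ also matches hello-world.html. The pattern by itself therefore doesn't explain why .html isn't added twice. A condition like RewriteCond %{REQUEST_FILENAME} !-f (see below), which skips requests for files that already exist, would be a more plausible explanation.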
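Whether ^(.+)$ really leaves out the extension can be checked by trying the pattern on a few paths. A small sketch with Python's re module, used here as a stand-in for Apache's regex engine (an assumption that should hold for patterns this simple):

```python
import re

wide = re.compile(r"^(.+)$")        # the pattern from the RewriteRule above
no_dots = re.compile(r"^([^.]+)$")  # a pattern that really excludes dots

paths = ["hello-world", "hello-world.html", "map1/hello-world"]

results = {
    path: (bool(wide.match(path)), bool(no_dots.match(path)))
    for path in paths
}

for path, (matches_wide, matches_no_dots) in results.items():
    print(path, matches_wide, matches_no_dots)
```

Under that assumption, ^(.+)$ matches all three paths, including hello-world.html. Only the [^.]+ variant rejects the path that contains a dot.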
Don't rewrite file names
Let's make sure that a URL that refers to an existing file won't be rewritten:
RewriteCond %{REQUEST_FILENAME} !-f
- RewriteCond stands for Rewrite Condition - Do we need to delve into that here?
- %{REQUEST_FILENAME}: This variable captures the full filesystem path. E.g., URL http://example.com/images/pic.jpg may correspond with a full filesystem path like /var/www/example.com/images/pic.jpg. This variable is always generated, even when it doesn't refer to an actual filename path
- %{REQUEST_FILENAME} !-f: True if the full filename path doesn't refer to an actual file.
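The !-f test boils down to "this path is not an existing regular file". A minimal Python sketch of that check, using a temporary directory as a hypothetical document root; the function name passes_not_f is made up for illustration.

```python
import os
import tempfile


def passes_not_f(filesystem_path):
    """Mimic: RewriteCond %{REQUEST_FILENAME} !-f"""
    return not os.path.isfile(filesystem_path)


with tempfile.TemporaryDirectory() as docroot:
    existing = os.path.join(docroot, "pic.jpg")
    open(existing, "w").close()                     # create an empty file
    missing = os.path.join(docroot, "hello-world")  # never created

    results = (passes_not_f(existing), passes_not_f(missing))

print(results)
```

The condition is false for the existing file, so a following RewriteRule is skipped, and true for the missing path, so the rule is applied.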
Don't rewrite directory names
This condition excludes directory names from being rewritten:
RewriteCond %{REQUEST_FILENAME} !-d
However, I find it difficult to understand when this would be a relevant use case. Maybe this is the situation when you visit a site's home page, like http://example.com. Here, the file index.html (or something similar) isn't included in the URL, so it's effectively a directory.
ChatGPT offers some more info, but I still find it far-fetched:
- Clean URLs: Websites often use URL rewriting to create clean and user-friendly URLs. While visitors request pages or resources, the URLs they see in their browsers may be designed to look like directory-style paths for aesthetic or semantic reasons. For example, instead of http://example.com?page=about, you might want to display http://example.com/about, which appears as if "about" is a directory but is actually handled by a script or routing mechanism.
- Routing: In web applications and content management systems (CMS), requests are often routed to specific scripts or controllers based on the URL. Even though visitors are accessing individual pages or resources, the application may internally interpret the URLs as routes to particular actions or content sections.
- Modular Structure: Some websites have a modular structure where different sections or components of the site are organized into directories for organization and maintenance purposes. However, when users request these sections, they are usually interested in the content within those directories, not the directory itself.
- Fallbacks: The condition can also be used to handle cases where a URL might match both a file and a directory. It ensures that if a directory with the same name as a file exists, the URL is not treated as a directory but as a file.
- In summary, while site visitors don't typically request directories directly, the use of RewriteCond %{REQUEST_FILENAME} !-d is more about how the server and web application handle URLs to create clean, structured, and user-friendly URLs and to route requests to the appropriate scripts or content. It's a common practice in web development to provide a better user experience and organization of resources.
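One concrete case where !-d seems to matter: a request for an existing directory passes the !-f test, because a directory is not a regular file. Without !-d, a rule like RewriteRule ^(.+)$ $1.html would then also be applied to directories. The Python sketch below illustrates this with a temporary directory as a hypothetical document root; would_be_rewritten is a made-up helper.

```python
import os
import tempfile


def would_be_rewritten(filesystem_path, check_directories):
    """Mimic the !-f condition, optionally followed by the !-d condition."""
    if os.path.isfile(filesystem_path):
        return False        # RewriteCond %{REQUEST_FILENAME} !-f fails
    if check_directories and os.path.isdir(filesystem_path):
        return False        # RewriteCond %{REQUEST_FILENAME} !-d fails
    return True             # all conditions hold: the RewriteRule applies


with tempfile.TemporaryDirectory() as docroot:
    images = os.path.join(docroot, "images")
    os.mkdir(images)

    without_d = would_be_rewritten(images, check_directories=False)
    with_d = would_be_rewritten(images, check_directories=True)

print(without_d, with_d)
```

With only !-f, the existing directory would be rewritten. Adding !-d leaves it alone.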
Rewrite incoming URLs
Let's now do it the other way around: Rewriting incoming URLs to their SEF variants. Incoming refers to URLs generated by Apache as well as URLs entered by users:
RewriteCond %{THE_REQUEST} \s/([^/]+)\.html [NC]
RewriteRule ^ /%1 [R=301,L]
- %{THE_REQUEST}: Full HTTP request line sent by the client, including HTTP method & URI. E.g.: GET /example.html HTTP/1.1
- \s/([^/]+)\.html: The pattern argument. It matches a space, followed by a slash, some stuff, and .html
  - \s stands for a whitespace character, namely the space after GET
  - \. means a literal dot. It had to be escaped
- %1 refers to the part that matched the first capture group of the RewriteCond pattern
- R=301 indicates that this is a 301 redirect
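The RewriteCond pattern can be tried out on a few request lines. The Python sketch below uses the re module as a stand-in for Apache's regex engine; redirect_target is a hypothetical helper that returns what %1 would produce, prefixed with a slash as in the RewriteRule.

```python
import re

# Same pattern as in the RewriteCond; re.IGNORECASE plays the role of [NC]
PATTERN = re.compile(r"\s/([^/]+)\.html", re.IGNORECASE)


def redirect_target(request_line):
    """Return the redirect target ("/" + %1), or None if the condition fails."""
    match = PATTERN.search(request_line)
    if match is None:
        return None
    return "/" + match.group(1)


print(redirect_target("GET /example.html HTTP/1.1"))
print(redirect_target("GET /example HTTP/1.1"))
print(redirect_target("GET /map1/hello-world.html HTTP/1.1"))
```

The third request line doesn't match: [^/]+ cannot cross the slash after map1, so with this pattern pages in subdirectories are not redirected.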
Will it work internally or externally?
I have the impression that you can't easily tell if a RewriteRule works internally or externally. There are some clues but that's it:
- When an [R=...] flag is included, it is an external redirection
- [L] flags are typically used for internal stuff.
Complete bi-directional example
Stuff from various earlier chapters combined:
#
# Bi-directionally remove ".html" from the file
#
# External » Internal
########################################
#
# * Make Apache understand requests without ".html"
# * When this part is inactive, a URL like "hello-world"
#   will generate an error
#
RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.+)$ $1.html [L]

# Internal » External
########################################
#
# Output URLs without ".html". When you enter a URL with extension,
# the site will be redirected to the same URL but without the extension.
#
RewriteEngine On
RewriteCond %{THE_REQUEST} \s/([^/]+)\.html [NC]
RewriteRule ^ /%1 [R=301,L]
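How the two parts interact can be sketched as a very simplified Python simulation. This is not how Apache processes requests: it ignores directories and the fact that .htaccess rules may run more than once, EXISTING_FILES is a hypothetical document root, and handle and serve are made-up names.

```python
import re

EXISTING_FILES = {"hello-world.html"}  # hypothetical contents of the document root


def serve(path):
    """Return what the server would do with the (possibly rewritten) path."""
    if path in EXISTING_FILES:
        return ("serve", path)
    return ("not found", path)


def handle(request_line):
    """Apply both rule blocks, in the same order as in the configuration."""
    path = request_line.split()[1].lstrip("/")

    # External » Internal: not an existing file, so append ".html" and stop ([L])
    if path and path not in EXISTING_FILES:
        return serve(path + ".html")

    # Internal » External: the client asked for ".html" explicitly, so redirect
    match = re.search(r"\s/([^/]+)\.html", request_line, re.IGNORECASE)
    if match:
        return ("redirect 301", "/" + match.group(1))

    return serve(path)


print(handle("GET /hello-world HTTP/1.1"))
print(handle("GET /hello-world.html HTTP/1.1"))
print(handle("GET /missing HTTP/1.1"))
```

A request for hello-world is served from hello-world.html, while a request for hello-world.html is redirected to /hello-world. After the redirect, the browser requests /hello-world and ends up in the first case.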
RewriteEngine On
Why include the RewriteEngine On directive? So far, everything works fine when omitted.
ChatGPT:
- You're correct that in some configurations, the RewriteEngine On directive may not be necessary if the mod_rewrite module is already enabled globally or in the server/virtual host configuration. However, including RewriteEngine On explicitly is a good practice because it ensures that the rewrite engine is enabled for the specific context (e.g., .htaccess file or Virtual Host).
- Including RewriteEngine On is especially important when you're working with .htaccess files because the configuration in these files is typically more isolated and can be subject to server-level or virtual host-level settings. Explicitly enabling the rewrite engine within the .htaccess file ensures that your rewrite rules are processed.
- So, while it may work without RewriteEngine On in some configurations, it's considered a best practice to include it to make your configuration more explicit and avoid potential issues when moving your configuration between different environments or servers.
RewriteBase
The directive RewriteBase is often used in conjunction with RewriteRule - See the example of the Apache virtual host definition file, earlier in this article.
It specifies the base URL path for relative path references within rewrite rules. E.g.: RewriteBase / indicates that the root directory of the website is the base for further relative paths.
A more complete example: Consider these directives concerning domain http://example.com. Note that the substitution is a relative path (no leading slash), as RewriteBase only applies to relative substitutions:
RewriteBase /
RewriteRule ^page$ somefolder/page.html [L]
Page page will become http://example.com/somefolder/page.html
With the directives
RewriteBase /migration/
RewriteRule ^page$ somefolder/page.html [L]
the same page page now becomes http://example.com/migration/somefolder/page.html.
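As far as I understand the mod_rewrite documentation, RewriteBase is only prepended to relative substitutions in per-directory (.htaccess) context. A substitution that starts with a slash is used as-is. The Python sketch below illustrates that reading; apply_rewrite_base is a hypothetical helper, not an Apache function.

```python
def apply_rewrite_base(base, substitution):
    """Prefix a relative substitution with the RewriteBase value."""
    # Absolute URL paths and full URLs are left untouched
    if substitution.startswith("/") or "://" in substitution:
        return substitution
    # Relative path: prepend the base, avoiding a double slash
    return base.rstrip("/") + "/" + substitution


print(apply_rewrite_base("/", "somefolder/page.html"))
print(apply_rewrite_base("/migration/", "somefolder/page.html"))
print(apply_rewrite_base("/migration/", "/somefolder/page.html"))
```

In this sketch, only the first two calls are affected by the base. The third substitution starts with a slash and comes out unchanged.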
See also
- .htaccess - Not relevant
- .htaccess-beveiligde mappen - Not relevant
- .htaccess-redirects - Only indirectly related; Doesn't contain any Apache Redirects
- Apache virtuele host-definitie - Quite related!
- Redirection & Apache Virtual Host files - Quite some examples and with a large overlap with this article
- URL Redirection of complete domain names - Including examples of Apache Redirects
- URL Redirection - Onbedoeld - Not much info.