HTTP caching in Hiawatha

The concept of caching in HTTP should be familiar to just about everybody who’s ever worked with web technology. It’s far more efficient to load an asset from a client-side cache than to request and re-download the same asset on every subsequent page view. While proper caching strategies are sometimes overlooked, they’re an important element of a well-optimized website and worth implementing correctly.


A few examples

Say, for the sake of this example, that you’re hosting a website based on the Laravel framework. You might have the following UrlToolkit (the Hiawatha equivalent of mod_rewrite or ngx_http_rewrite_module):

UrlToolkit {
    ToolkitID = laravel
    RequestURI exists Return
    Match .* Rewrite /index.php
}

This allows Laravel to use its pretty, SEO-friendly URLs and to parse URIs correctly through its internal routing module. Notice how the first rule in this toolkit is “RequestURI exists Return”. That line tells Hiawatha that, rather than making your PHP interpreter parse and then serve a static asset such as a picture or a cascading style sheet, Hiawatha should just serve the file directly.
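The decision this toolkit encodes can be modeled in a few lines of Python. This is only a sketch of the logic, not Hiawatha's implementation: the function name `route`, its return values, and the choice to ignore the query string when checking the disk are all assumptions made for illustration.

```python
import os
import tempfile


def route(request_uri, website_root):
    """Model the two toolkit rules: serve what exists, rewrite the rest."""
    # Assumption: the query string is ignored when looking on disk.
    path = request_uri.split("?", 1)[0]
    candidate = os.path.join(website_root, path.lstrip("/"))
    # Rule 1: "RequestURI exists Return" leaves the request untouched.
    if os.path.exists(candidate):
        return ("static", path)
    # Rule 2: "Match .* Rewrite /index.php" hands everything else to Laravel.
    return ("rewrite", "/index.php")


if __name__ == "__main__":
    # Build a throwaway document root containing one stylesheet.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "css"))
        open(os.path.join(root, "css", "site.css"), "w").close()
        print(route("/css/site.css", root))
        print(route("/blog/my-first-post", root))
```

The stylesheet exists under the document root, so it is served as a static file, while the pretty URL has no file behind it and is rewritten to the front controller.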

Serving static assets from a well-optimized webserver is much faster and requires far fewer resources than passing them through a high-level interpreter. However, since Laravel is no longer handling static assets directly, it can no longer affect how the client side caches them. Thus, the task of setting cache expiry falls to Hiawatha as well.

To manage how long the files should be kept around by client browsers, I’ll also add a second UrlToolkit. This one will match a series of file suffixes. If the pattern matches, the server will send a Cache-Control header, a Last-Modified header, and a custom Expires header which tells the client how long it should keep the asset in its local cache. As a simple illustration of how we might set a custom expiry header, take a look at the following UrlToolkit:

UrlToolkit {
    ToolkitID = cache-control
    Match ^/.*\.(css|eot|gif|htm|html|ico|jpeg|jpg|js|otf|pdf|png|ps|psd|svg|swf|ttf|txt|woff|woff2)(/|$) Expire 2 weeks
}

In the above, we’re essentially matching a whole bunch of typical static assets at once, and telling the browser to keep them around for a maximum of two weeks.

Coping with cache busting

Sometimes developers use “cache busting” techniques, appending version tags to newly changed static assets that keep the same filename. This trick causes the user’s browser to treat the asset as a new file and download it again, even if an older version is already in its cache and hasn’t expired yet. We’ll want to set the cache lifetime of these versioned assets as well, so our pattern matching needs to take this case into account. For instance, if we use ?v= or ?ver= strings to set a version number, so that our asset URLs with cache busting end up looking like “http://www.example.tld/img/lolcat.gif?v=2”, we could use the following regex:

UrlToolkit {
    ToolkitID = cache-busting-control
    Match ^/.*\.(css|eot|gif|htm|html|ico|jpeg|jpg|js|otf|pdf|png|ps|psd|svg|swf|ttf|txt|woff|woff2)(\?v=.*|\?ver=.*)?(/|$) Expire 1 months
}
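To see what the optional version group changes, the two patterns can be tried against a few sample URIs in Python. This is a rough check rather than a reproduction of Hiawatha's matcher: it assumes the pattern is tested against the request URI including its query string, and that Python's `re` module treats these particular expressions the same way Hiawatha's regex engine does. The sample URIs and the helper name `matches` are made up for the example.

```python
import re

EXTENSIONS = (
    "css|eot|gif|htm|html|ico|jpeg|jpg|js|otf|pdf|png|ps|psd|"
    "svg|swf|ttf|txt|woff|woff2"
)

# The plain pattern and the cache-busting pattern from the toolkits above.
PLAIN = re.compile(r"^/.*\.(" + EXTENSIONS + r")(/|$)")
BUSTING = re.compile(r"^/.*\.(" + EXTENSIONS + r")(\?v=.*|\?ver=.*)?(/|$)")

# Hypothetical request URIs, chosen to exercise each branch.
SAMPLES = [
    "/img/lolcat.gif",
    "/img/lolcat.gif?v=2",
    "/css/site.css?ver=1.4.2",
    "/img/lolcat.gif?foo=1",
    "/blog/my-first-post",
]


def matches(pattern, uri):
    """Return True if the compiled pattern matches the URI."""
    return pattern.search(uri) is not None


if __name__ == "__main__":
    for uri in SAMPLES:
        print(f"{uri:26} plain={matches(PLAIN, uri)} "
              f"busting={matches(BUSTING, uri)}")
```

Under those assumptions, the plain pattern only accepts the unversioned URI, while the cache-busting pattern also accepts the ?v= and ?ver= forms. A query string with any other parameter name is matched by neither.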

Mixing different expiries

Using this method, you could also set only certain files to be cached by the client, set different durations for different types of assets, or whatever best suits your situation. Here’s an example of a tiered cache control strategy:

UrlToolkit {
    ToolkitID = tiered-cache-control
    Match ^/.*\.(gif|htm|html|jpeg|jpg|png)(\?v=.*|\?ver=.*)?(/|$) Expire 1 weeks
    Match ^/.*\.(css|js|svg|swf|txt)(\?v=.*|\?ver=.*)?(/|$) Expire 2 weeks
    Match ^/.*\.(eot|ico|otf|pdf|ps|psd|ttf|woff|woff2)(/|$) Expire 2 months
}
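A quick way to reason about a tiered setup is to ask which tier a given URI lands in. The sketch below does that in Python under a few assumptions: each extension is listed in exactly one tier (ttf is placed only in the two-month tier here), the first matching rule decides the result, and Python's `re` module behaves like Hiawatha's matcher for these patterns. The function name `expiry_for` and the sample URIs are hypothetical.

```python
import re

# (pattern, lifetime) pairs mirroring the tiered toolkit above.
TIERS = [
    (r"^/.*\.(gif|htm|html|jpeg|jpg|png)(\?v=.*|\?ver=.*)?(/|$)", "1 weeks"),
    (r"^/.*\.(css|js|svg|swf|txt)(\?v=.*|\?ver=.*)?(/|$)", "2 weeks"),
    (r"^/.*\.(eot|ico|otf|pdf|ps|psd|ttf|woff|woff2)(/|$)", "2 months"),
]


def expiry_for(uri):
    """Return the lifetime of the first tier that matches, else None."""
    for pattern, lifetime in TIERS:
        if re.search(pattern, uri):
            return lifetime
    return None


if __name__ == "__main__":
    for uri in ("/img/photo.jpg", "/js/app.js?v=3",
                "/fonts/body.woff2", "/fonts/body.woff2?v=3"):
        print(uri, "->", expiry_for(uri))
```

Note that the last tier has no version group, so under these assumptions a versioned font URI such as /fonts/body.woff2?v=3 falls through all three rules and gets no expiry at all.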

Calling the UrlToolkits

So when we define our vhost stanza, we’ll end up calling both our Laravel and our cache control URL toolkits:

VirtualHost {
    Hostname = www.example.tld, *.example.tld
    EnforceFirstHostname = yes
    WebsiteRoot = /srv/www/vhosts/www_example_tld
    UseFastCGI = PHP5
    UseToolkit = laravel, cache-control
    ShowIndex = no
    ExecuteCGI = yes
    UseGZfile = yes
    PreventXSS = yes
    PreventCSRF = yes
    CustomHeader = Vary: Accept-Encoding
    CustomHeader = X-Frame-Options: sameorigin
}

You can string together multiple toolkits in this manner, or even reference one from inside another. I’ll cover UrlToolkits in much greater detail in a later article.

The result

Since we’re specifying expiry information from within our Hiawatha configuration, the end user receives the appropriate HTTP cache headers:

:~$ curl --head http://dotbalm.org/wp-content/uploads/2012/10/paolo-pannini.jpg
HTTP/1.1 200 OK
Date: Fri, 16 Jan 2015 10:09:10 GMT
Server: Hiawatha
Accept-Ranges: bytes
Connection: keep-alive
Content-Type: image/jpeg
Cache-Control: private
Expires: Sun, 15 Feb 2015 22:09:10 GMT
Vary: Accept-Encoding
X-Frame-Options: sameorigin
Last-Modified: Tue, 28 Oct 2014 02:32:45 GMT
Content-Length: 32728
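The lifetime the browser is being given can be read off the response by subtracting the Date header from the Expires header. Here is a small Python sketch that does so with the standard library; the header values are copied from the curl output above, and the function name `freshness_lifetime` is just an illustrative choice.

```python
from email.utils import parsedate_to_datetime

# Header values copied from the curl output above.
HEADERS = {
    "Date": "Fri, 16 Jan 2015 10:09:10 GMT",
    "Expires": "Sun, 15 Feb 2015 22:09:10 GMT",
}


def freshness_lifetime(headers):
    """Return Expires minus Date as a timedelta."""
    served = parsedate_to_datetime(headers["Date"])
    expires = parsedate_to_datetime(headers["Expires"])
    return expires - served


if __name__ == "__main__":
    print(freshness_lifetime(HEADERS))  # 30 days, 12:00:00
```

For this response the difference works out to 30 days and 12 hours, or 2,635,200 seconds.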

This'll make a big difference on any repeat page views, or if the cached assets happen to exist on multiple pages. For example, the first page visit will entail downloading each asset on the page, thus necessitating several concurrent requests:

[Image: hiawatha-first-hit]

But with HTTP caching in place, subsequent views only involve a single connection:

[Image: hiawatha-second-hit]