Static blogs and HTTP caching

As you can see in the footer, this blog is powered by Pelican, a static blog generator written in Python. It's really simple to use and fits my requirements nicely: I can write posts offline on my notebook and preview the results in my browser with the included web server. It doesn't require any insecure server-side software (the output is plain HTML, CSS and a bit of JavaScript for browsers that are not quite up to date). And it is very easy on server resources because, by default, almost everything can be cached by web browsers.

However, there is one annoying side effect of everything being cached: since that also includes the landing page, new posts can stay invisible to returning visitors for quite a while. In a bit more detail, here is what is going on at the HTTP level:

By default, my web server, lighttpd, delivers all static HTML pages without explicit caching headers, but it does include the modification time of the resource and an ETag (only the relevant headers are shown):

Date: Fri, 15 Mar 2013 10:03:43 GMT
ETag: "4531062"
Last-Modified: Thu, 14 Mar 2013 20:12:06 GMT

The ETag is good to have (browsers can use it to unambiguously revalidate cached content with the server, as I'll explain later), but a Last-Modified header combined with no explicit statement about cacheability triggers the heuristic caching that HTTP permits and that most browsers implement. Basically, browsers calculate the difference between the time the resource was retrieved and the time it was last modified on the server, and typically cache the resource for 10% of that value without revalidating with the server.

This means that for a blog that is updated daily, visitors will see new posts within a few hours of their last visit, but for a blog that hasn't been updated for several weeks or months, ten percent of that time can be pretty significant.
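To make the heuristic concrete, here is a minimal Python sketch that applies the 10% rule to the Date and Last-Modified values shown above. The function name heuristic_freshness and the fixed fraction of 0.1 are assumptions for illustration; real browsers differ in the details and may cap the result.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime


def heuristic_freshness(date_header, last_modified_header, fraction=0.1):
    """Estimate how long a cache may reuse a response without revalidating.

    Hypothetical helper: applies the common "10% of the age at retrieval"
    heuristic. Actual browser behaviour may differ.
    """
    retrieved = parsedate_to_datetime(date_header)
    modified = parsedate_to_datetime(last_modified_header)
    age_at_retrieval = retrieved - modified
    # Guard against a negative lifetime if the two timestamps are swapped
    # or the server clock is off.
    return max(age_at_retrieval, timedelta(0)) * fraction


lifetime = heuristic_freshness(
    "Fri, 15 Mar 2013 10:03:43 GMT",   # Date
    "Thu, 14 Mar 2013 20:12:06 GMT",   # Last-Modified
)
print(lifetime)  # roughly 1 hour 23 minutes for the headers above
```

With a page last modified the evening before, the heuristic lifetime is already well over an hour; for a page untouched for 30 days, the same calculation gives 3 days.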

A simple solution is to define the cache lifetime explicitly in the HTTP headers for some or all resources. lighttpd's mod_expire module does just that. Here is the relevant line in my lighttpd.conf:

expire.url = ( "/theme/" => "access plus 7 days", "" => "access plus 1 hours" )

The effect is that all resources in the theme subdirectory get an Expires header 7 days in the future, and everything else is valid for just an hour. This is a tradeoff between server and client resource usage on the one hand and immediate updates on the other: for me, an hour of delay is not a big deal, and users jumping back and forth between blog posts will be able to do so without any further HTTP requests. Here are the response headers of the main blog page:

Cache-Control: max-age=3600
Date: Fri, 15 Mar 2013 10:23:06 GMT
Expires: Fri, 15 Mar 2013 11:23:06 GMT
Last-Modified: Thu, 14 Mar 2013 20:12:06 GMT

As you can see, the max-age directive explicitly states a validity of 3600 seconds, and the Expires header also points to a time one hour in the future.
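As a quick sanity check, the following Python sketch derives the explicit freshness lifetime from the response headers above, once from max-age and once from Expires minus Date. The helper name freshness_lifetime is hypothetical, and the parsing is deliberately simplified (exact header names, no other Cache-Control directives considered).

```python
import re
from email.utils import parsedate_to_datetime


def freshness_lifetime(headers):
    """Return the explicit freshness lifetime in seconds, or None if absent.

    Hypothetical helper for illustration; expects a plain dict with
    exactly-cased header names.
    """
    match = re.search(r"\bmax-age=(\d+)", headers.get("Cache-Control", ""))
    if match:
        # When both are present, max-age takes precedence over Expires.
        return int(match.group(1))
    if "Expires" in headers and "Date" in headers:
        expires = parsedate_to_datetime(headers["Expires"])
        date = parsedate_to_datetime(headers["Date"])
        return int((expires - date).total_seconds())
    return None


headers = {
    "Cache-Control": "max-age=3600",
    "Date": "Fri, 15 Mar 2013 10:23:06 GMT",
    "Expires": "Fri, 15 Mar 2013 11:23:06 GMT",
    "Last-Modified": "Thu, 14 Mar 2013 20:12:06 GMT",
}
without_cache_control = {k: v for k, v in headers.items() if k != "Cache-Control"}

from_max_age = freshness_lifetime(headers)
from_expires = freshness_lifetime(without_cache_control)
print(from_max_age, from_expires)  # both are 3600 for the headers above
```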

Even when that time is reached, the whole resource doesn't have to be transferred again: browsers can perform a conditional HTTP request using the ETag or Last-Modified values that they cache together with the resource itself. If the content is still the same, the server can deduce that from the request headers and reply with a 304 Not Modified response. As long as your site is not very highly frequented and doesn't reference many additional resources, cache revalidation is not too expensive.
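The decision the server makes during revalidation can be sketched in a few lines of Python. This is only an illustration of the logic, not how lighttpd implements it: the function name revalidation_status is made up, weak ETags (the W/ prefix) are ignored, and malformed dates are not handled.

```python
from email.utils import parsedate_to_datetime


def revalidation_status(request_headers, current_etag, current_last_modified):
    """Return 304 if the client's cached copy is still valid, else 200.

    Hypothetical helper: request_headers is a plain dict holding the
    validators the browser sent along with its conditional request.
    """
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # If-None-Match takes precedence over If-Modified-Since.
        client_tags = [tag.strip() for tag in if_none_match.split(",")]
        if "*" in client_tags or current_etag in client_tags:
            return 304
        return 200

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        since = parsedate_to_datetime(if_modified_since)
        modified = parsedate_to_datetime(current_last_modified)
        if modified <= since:
            return 304

    # No validators sent, or the resource changed: send the full response.
    return 200


etag = '"4531062"'
last_modified = "Thu, 14 Mar 2013 20:12:06 GMT"

print(revalidation_status({"If-None-Match": '"4531062"'}, etag, last_modified))
print(revalidation_status({"If-None-Match": '"0000000"'}, etag, last_modified))
```

The first call corresponds to an unchanged page and yields 304; the second simulates a stale ETag (an invented value) and yields 200, meaning the full page is sent again.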

One thing that also helped me tremendously in understanding HTTP caching was an answer on Stack Overflow that explains how to force the various browsers to revalidate a resource or to bypass the cache completely. For debugging, it's very useful to know that there is a big difference between pressing F5 and Ctrl + F5 in most browsers.
