In my previous post on the matter I incorrectly blamed NewsBlur – which I still recommend as the best feed reader I’ve ever used! – for not correctly supporting HTTP features to avoid wasting bandwidth for fetching repeatedly unmodified content.
As Daniel and Samuel pointed out immediately, NewsBlur does support those features, and indeed I even used it as an example four years ago, oops for my memory being terrible that way, and me assuming the behaviour from the logs rather than inspecting the requests. And indeed the requests were not only correct, but matched perfectly what Apache reported:
--6190ee48-B--
GET /index.xml HTTP/1.1
Host: blog.flameeyes.eu
Connection: keep-alive
Accept-Encoding: gzip, deflate
Accept: application/atom+xml, application/rss+xml, application/xml;q=0.8, text/xml;q=0.6, */*;q=0.2
User-Agent: NewsBlur Feed Fetcher - 59 subscribers - http://www.newsblur.com/site/195958/flameeyess-weblog (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36)
A-IM: feed
If-Modified-Since: Wed, 16 Aug 2017 04:22:52 GMT
If-None-Match: "27dc5-556d73fd7fa43-gzip"
--6190ee48-F--
HTTP/1.1 200 OK
Strict-Transport-Security: max-age=31536000; includeSubDomains
Last-Modified: Wed, 16 Aug 2017 04:22:52 GMT
ETag: "27dc5-556d73fd7fa43-gzip"
Accept-Ranges: bytes
Vary: Accept-Encoding
Content-Encoding: gzip
Cache-Control: max-age=1800
Expires: Wed, 16 Aug 2017 18:56:33 GMT
Content-Length: 54071
Keep-Alive: timeout=15, max=99
Connection: Keep-Alive
Content-Type: application/xml
Code language: PHP (php)
So what is going on here? Well, I started looking around, both because I now felt silly, and because I owed more than just an update on the post and an apology to Samuel. And a few searches later, I found Apache bug #45023 that reports how mod_deflate
prevents all 304 responses from being issued. This is a bit misleading (as you can still have them in some situations), but it is indeed what is happening here, and it is a breakage introduced by Apache 2.4.
What’s going on? Well, let’s first start to figure out why I could see some 304, but not from NewsBlur. Willreadit was one of the agents that received 304 responses at least some of the time and in the landing page it says explicitly that it supports If-Modified-Since
. In particular, it does not support If-None-Match
.
The If-None-Match
header in the request compares with the ETag
header (Entity Tag) in the response coming from Apache. This header is generally considered opaque, and the client should have no insights in what it is meant to do. The server generally calculates its value based on either a checksum of the file (e.g. md5) or based on file size and last-modified time. On Apache HTTP Server, the FileETag
directive is used to define which properties of the served files are used to generate the value provided in the response. The default that I’m using is MTime Size
, which effectively means that changing the file in any way causes the ETag to change. The size part might actually be redundant here, since modification time is usually enough for my use cases, but this is the default…
The reason why I’m providing both Last-Modifed
and ETag
headers in the response is that HTTP client can just as well only implement one of the two methods, rather than both, particularly as they may think that handling ETag
is easier as it’s an opaque string, rather than information that can be parsed — but it really should be considered opaquely as well as it’s noted in RFC2616. Entity Tags are also more complicated because they can be used to collapse caching of different entities (identifed by an URL) within the same space (hostname) by caching proxies. I have lots of doubts that this usage is in use, so I’m not going to consider it a valid one, but your mileage may vary. In particular, since the default uses size and modification time, it ends up always matching the Last-Modified
header, for a given entity, and the If-Modified-Since
request would be just enough.
But when you provide both If-Modified-Since
and If-None-Match
, you’re asking for both conditions to be true, and so Apache will validate both. And here is where the problem happens: the -gzip
suffix – which you can see in the header of the sample request above – is added at different times in the HTTPD process, and in particular it makes it so that the If-None-Match
will never match the generated ETag
, because the comparison is with the version without -gzip
appended. This makes sense in context, because if you have a shared caching proxy, you may have different user agents that support different compression algorithms. Unfortunately, this effectively makes it so that entity tags disable Not Modified states for all the clients that do care about the tags. Those few clients that received 304 responses from my blog before were just implementing If-Modified-Since
, and were getting the right behaviour (which is why I thought the title of the bug was misleading).
So how do you solve this? In the bug I already noted above, there is a suggestion by Joost Dekeijzer to use the following directive in your Apache config:
RequestHeader edit "If-None-Match" '^"((.*)-gzip)"$' '"$1", "$2"'
Code language: JavaScript (javascript)
This adds a version of the entity tag without the suffix to the list of expected entity tags, which “fools” the server into accepting that the underlying file didn’t change and that there is no need to make any change there. I tested with that and it does indeed fix NewsBlur and a number of other use cases, including browsers! But it has the side effect of possibly poisoning shared caches. Shared caches are not that common, but why risking it? So I decided onto a slightly different option
FileETag None
This disable the generation of Entity Tags for file-based entities (i.e. static files), forcing browsers and feed readers to rely on If-Modified-Since
exclusively. If clients only implement If-None-Match
semantics, then this second option loses the ability to receive 304 responses. I have actually no idea which clients would do that, since this is the more complicated semantics, but I guess I’ll find out. I decided to give a try to this option for two reasons: it should simplify Apache’s own runtime, because it does not have to calculate these tags at any point now, and because effectively they were encoding only the modification time, which is literally what Last-Modified
provides! I had for a while assumed that the tag was calculated based on a (quick and dirty) checksum, instead of just size and modification time, but clearly I was wrong.
There is another problem at this point, though. For this to work correctly, you need to make sure that the modification time of files is consistent with them actually changing. If you’re using a static site generator that produces multiple outputs for a single invocation, which includes both Hugo and FSWS, you would have a problem, because the modification time of every file is now the execution time of the tool (or just about).
The answer to this is to build the output in a “staging” directory and just replace the files that are modified, and rsync
sounds perfect for the job. But the more obvious way to do so (rsync -a
) will do exactly the opposite of what you want, as it will preserve the timestamp from the source directory — which mean it’ll replace the old timestamp with the new one for all files. Instead, what you want to use is rsync -rc
: this uses a checksum to figure out which files have changed, and will not preserve the timestamp but rather use the timestamp of rsync
, which is still okay — theoretically, I think rsync -ac
should work, since it should only preserve the timestamp only of the files that were modified, but since the serving files are still all meant to have the same permissions, and none be links, I found being minimal made sense.
So anyway, I’ll hopefully have some more data soon about the bandwidth saving. I’m also following up with whatever may not be supporting properly If-Modified-Since
, and filing bugs for those software/services that allow it.
Update (2017-08-23): since now it’s a few days since I fixed up the Apache configuration, I can confirm that the daily bandwidth used by “viewed hits” (as counted by Awstats) went down to ⅓ of what it used to be, to around 60MB a day. This should be accounting not only for the feed readers now properly getting a 304, but also for browsers of readers who no longer have to fetch the full page when, for instance, replying to comments. Googlebot also is getting a lot more 304, which may actually have an impact on its ability to keep up with the content, so I guess I will report back.