Please note: this post was written quite some time ago, before the Typo upgrade among other things. I’m going to re-read it and fix it up, but some details might be out of sync, sorry.
Following some of the older changes to the feeds, namely removing the RSS feeds and replacing all of them with Atom feeds, I started looking at the behaviour of news readers to make sure they work as expected. This made me notice quite a few behaviours that I really wonder about.
First of all, most newsreaders seem to properly implement the HTTP/1.1 rules that allow for 304 (Not Modified) responses, which avoid re-fetching the whole feed if there have been no changes. This is very good because it saves bandwidth on both sides. On the other hand, none seems to record 301 (Moved Permanently) replies, so the server still receives requests on the old URLs after a move (and since I migrated from an old Typo to a new one, I have lots of URL rewriting in place). Crawlers and aggregators like Google’s or Yahoo’s also fail to record that.
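To make the 304 mechanism concrete, here is a minimal sketch of the client side of a conditional GET. It does no networking: the `CachedFeed` type and both function names are hypothetical, invented for illustration, and header names are assumed to arrive already normalised.

```python
from dataclasses import dataclass, replace
from typing import Dict, Mapping, Optional


@dataclass(frozen=True)
class CachedFeed:
    """What a reader remembers about a feed between polls (hypothetical type)."""
    url: str
    body: bytes = b""
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(cache: CachedFeed) -> Dict[str, str]:
    """Build the validators to send along with the next GET."""
    headers: Dict[str, str] = {}
    if cache.etag:
        headers["If-None-Match"] = cache.etag
    if cache.last_modified:
        headers["If-Modified-Since"] = cache.last_modified
    return headers


def apply_response(cache: CachedFeed, status: int,
                   headers: Mapping[str, str], body: bytes) -> CachedFeed:
    """Update the cache from a response (header names assumed normalised)."""
    if status == 304:
        # Not Modified: the server sent no body, so keep the stored copy.
        return cache
    if status == 200:
        # Full response: store the new body together with its validators.
        return replace(cache, body=body,
                       etag=headers.get("ETag"),
                       last_modified=headers.get("Last-Modified"))
    # Errors and redirects are left to the caller in this sketch.
    return cache
```

The first poll has nothing to send, so it is a plain GET. Every later poll carries the validators from the last 200 response, and a 304 reply costs only a few hundred bytes of headers.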
While 302 is a temporary move that should not be recorded, one could argue that a permanent move should be saved, at least in an application that keeps a collection of URLs, like a feed reader. Of course it’s also true that if someone could hijack the DNS of a domain and send a Moved Permanently to a different server, the result would be quite nasty, but I think it is something that should be looked into.
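As a rough sketch of that distinction, the function below decides which URL to fetch next and which URL to keep in the subscription list. The function name and the example URLs are hypothetical. Treating 308 as permanent and 303/307 as temporary alongside 301 and 302 is my addition, based on the HTTP status code definitions.

```python
from typing import Optional, Tuple
from urllib.parse import urljoin

PERMANENT = {301, 308}       # the stored URL should be updated
TEMPORARY = {302, 303, 307}  # follow once, keep the stored URL


def follow_redirect(stored_url: str, status: int,
                    location: Optional[str]) -> Tuple[str, str]:
    """Return (url_to_fetch_now, url_to_store) for one response."""
    if not location or status not in PERMANENT | TEMPORARY:
        # Not a redirect (or no Location header): nothing changes.
        return stored_url, stored_url
    # Location may be relative, so resolve it against the requested URL.
    target = urljoin(stored_url, location)
    if status in PERMANENT:
        return target, target
    return target, stored_url
```

This sketch does nothing about the hijacking concern: a cautious reader might want to see the same permanent redirect on several polls before overwriting the stored URL.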
But one thing that I find disturbing is that some feed readers don’t implement these HTTP/1.1 features. For instance newsbeuter (the cause is actually the librss library it uses; I already asked its author about this), instead of using the If-Modified-Since or If-None-Match headers, runs a HEAD request for the feed repeatedly, and a GET only when something has indeed changed. It’s not like I have a problem with that, since a HEAD request is still better than having a GET repeated over and over and over, which is what some services seem to be doing, especially some “enterprise” services that seem to re-sell search services on a per-keyword basis.
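The three polling styles can be compared with a toy model of how a server answers each kind of request. This is a simplified sketch, not how any particular server is implemented: `respond` is a hypothetical name, and the ETag comparison is a plain string match that ignores weak validators and lists of tags.

```python
from typing import Mapping, Tuple


def respond(method: str, headers: Mapping[str, str],
            etag: str, body: bytes) -> Tuple[int, int]:
    """Return (status, body_bytes_sent) for a feed with the given ETag."""
    if headers.get("If-None-Match") == etag:
        # The client already has this version: headers only.
        return 304, 0
    if method == "HEAD":
        # HEAD never carries a body, whatever the state of the feed.
        return 200, 0
    # Unconditional (or outdated) GET: the whole feed goes over the wire.
    return 200, len(body)


feed = b"x" * 200_000  # stand-in for a feed of roughly 200KB
print(respond("GET", {}, '"v1"', feed))                           # (200, 200000)
print(respond("HEAD", {}, '"v1"', feed))                          # (200, 0)
print(respond("GET", {"If-None-Match": '"v1"'}, '"v1"', feed))    # (304, 0)
```

Both HEAD and a conditional GET avoid the body while nothing has changed. The difference is that the HEAD approach needs a second request once something does change, while the conditional GET gets the new body in the same round trip.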
In general, I’m now looking for a way to identify the “rogue” agents that request the feeds without conditional GETs, so I can contact their technical support and get the thing fixed. It is appalling that nowadays there are still people writing “enterprise” crawlers who don’t know that HTTP/1.1 provides features to avoid wasting others’ bandwidth! If you’re using some free feed reader and you don’t know how it behaves, you can check with Wireshark which kind of requests it makes, and if needed tell upstream about these features.
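One possible way to find such agents is to scan the access log for user agents that fetch the feed repeatedly and never receive a 304. The sketch below assumes the common "combined" access log format. The function name, the `/articles.atom` path and the agent names in the usage example are made up for illustration.

```python
import re
from collections import defaultdict
from typing import Iterable, List

# Combined log format: host ident user [time] "request" status size "referer" "agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def unconditional_agents(lines: Iterable[str], feed_path: str,
                         min_requests: int = 3) -> List[str]:
    """List agents that GET the feed often but never get a 304 back."""
    counts = defaultdict(lambda: {"200": 0, "304": 0})
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or m["method"] != "GET" or m["path"] != feed_path:
            continue
        if m["status"] in ("200", "304"):
            counts[m["agent"]][m["status"]] += 1
    return sorted(agent for agent, c in counts.items()
                  if c["304"] == 0 and c["200"] >= min_requests)
```

For example, with hypothetical log lines:

```python
sample = [
    '1.2.3.4 - - [10/Oct/2010:13:55:36 +0000] "GET /articles.atom HTTP/1.1" 200 198000 "-" "EnterpriseBot/1.0"',
    '1.2.3.4 - - [10/Oct/2010:14:25:36 +0000] "GET /articles.atom HTTP/1.1" 200 198000 "-" "EnterpriseBot/1.0"',
    '1.2.3.4 - - [10/Oct/2010:14:55:36 +0000] "GET /articles.atom HTTP/1.1" 200 198000 "-" "EnterpriseBot/1.0"',
    '5.6.7.8 - - [10/Oct/2010:13:56:00 +0000] "GET /articles.atom HTTP/1.1" 200 198000 "-" "PoliteReader/2.0"',
    '5.6.7.8 - - [10/Oct/2010:14:26:00 +0000] "GET /articles.atom HTTP/1.1" 304 - "-" "PoliteReader/2.0"',
]
print(unconditional_agents(sample, "/articles.atom"))
```

This is only a heuristic, since the access log does not normally record request headers: an agent that polls less often than the feed changes would legitimately see nothing but 200 responses.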
Remember that it doesn’t just save my bandwidth, it also saves yours, and the whole Internet’s. It’s also why feeds are much more useful than web pages when you just want to read an article, if it’s in the feed, that is. And don’t think the feed is very small: my articles feed is always slightly under 200KB of data, in Atom.
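To put a number on it, here is a back-of-the-envelope calculation. The polling interval (every 30 minutes) and the update rate (one new article a day) are assumptions picked for illustration, 200,000 bytes is a round upper bound for the feed size, and header overhead is ignored.

```python
def daily_body_bytes(feed_bytes: int, polls_per_day: int,
                     updates_per_day: int, conditional: bool) -> int:
    """Bytes of feed body sent to one reader per day."""
    if conditional:
        # Only polls that find a new version download the body.
        full_responses = min(updates_per_day, polls_per_day)
    else:
        # Every poll downloads the whole feed again.
        full_responses = polls_per_day
    return full_responses * feed_bytes


FEED = 200_000  # bytes
POLLS = 48      # assumed: one poll every 30 minutes
print(daily_body_bytes(FEED, POLLS, 1, conditional=False))  # 9600000
print(daily_body_bytes(FEED, POLLS, 1, conditional=True))   # 200000
```

Under these assumptions a single reader without conditional GETs pulls about 9.6MB a day, against 200KB for one that uses them.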