I’m a content publisher, whether I like it or not. This blog is relatively well followed, and I write quite a lot in it. While my hosting provider does not give me grief for my bandwidth usage, optimizing it is something I’m always keen on, especially since I have been Slashdotted once before. This is one of the reasons why my ModSecurity Ruleset validates and filters crawlers as much as spammers.
Blogs’ feeds, be they RSS or Atom (this blog only supports the latter), are a very neat way to optimize bandwidth: they get you the content of the articles without styles, scripts or images. But they can also be quite big. The average feed for my blog’s articles is 100KiB, which is a fairly big page if you consider that feed readers are supposed to keep pinging the blog to check for new items. Luckily for everybody, the authors of HTTP did consider this problem, and solved it with two main features: conditional requests and compressed responses.
Okay, there’s a sense of déjà vu in all of this, because I have already complained about software not using these features even when it’s designed to monitor web pages constantly.
By using conditional requests, even if you poke my blog every fifteen minutes, you won’t use more than 10KiB an hour, if no new article has been posted. By using compressed responses, instead of a 100KiB response you’ll just have to download 33KiB. With Google Reader, things were even better: instead of 113 requests for the feed, a single request was made by the FeedFetcher, and that was it.
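To make the two features concrete, here is a minimal sketch of a polite feed poller using only the Python standard library. It is an illustration under my own assumptions, not the code of any reader mentioned in this post: the function names (`build_headers`, `decode_body`, `poll_feed`) are hypothetical, and a real reader would also need to persist the validators between runs.

```python
import gzip
import urllib.error
import urllib.request


def build_headers(etag=None, last_modified=None):
    """Headers for a polite poll: ask for gzip, send validators if we have any."""
    headers = {"Accept-Encoding": "gzip"}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def decode_body(raw, content_encoding):
    """Undo gzip if the server says it applied it; otherwise return the bytes as-is."""
    if (content_encoding or "").strip().lower() == "gzip":
        return gzip.decompress(raw)
    return raw


def poll_feed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None when the feed is unchanged."""
    request = urllib.request.Request(url, headers=build_headers(etag, last_modified))
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            body = decode_body(response.read(), response.headers.get("Content-Encoding"))
            return body, response.headers.get("ETag"), response.headers.get("Last-Modified")
    except urllib.error.HTTPError as error:
        # urllib reports "304 Not Modified" as an HTTPError; it just means
        # nothing changed, so keep the validators we already had.
        if error.code == 304:
            return None, etag, last_modified
        raise
```

The first call is made without validators; every later call passes back the `ETag` and `Last-Modified` values from the previous response, so an unchanged feed costs only a small 304 response instead of the full document.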
But now Google Reader is no more (almost). What happens now? Well, of the 113 subscribers, a few will most likely not re-subscribe to my blog at all. Others have migrated to NewsBlur (35 subscribers); the rest seem to have installed their own feed reader or aggregator, including tt-rss, owncloud, and so on. This was obvious looking at the statistics from either AWStats or Munin, both showing a higher volume of requests and delivered content compared to last month.
I then decided to look into reducing my bandwidth usage a bit more than before, among other things by providing WebP alternatives for images, but that does not really work as intended. I have enough material for a rant post or two, so I won’t discuss it now. But while doing so I found out something else.
One of the changes I made while hoping to use WebP was to serve the image files from a different domain (assets.flameeyes.eu), which meant that the access log for the blog, while still not perfect, is decidedly cleaner than before. From there I noticed that a new feed reader started requesting my blog’s feed every half an hour. Without compression. In full every time. That’s just shy of 5MiB of traffic per day, but that’s not the worst part. The worst part is that said 5MiB are for a single reader, as the requests come from a commercial, proprietary feed reader webapp.
And this is not the only one! Gwene also does the same, even though I sent a pull request to get it to use compressed responses, which hasn’t had a single reply. Even Yandex’s new product has the same issue.
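To put a number on what one such reader costs, the arithmetic can be sketched from the figures quoted earlier in this post (a 100KiB feed, roughly 33KiB when compressed, one request every half hour). The helper name `daily_traffic_mib` is made up for the example.

```python
KIB_PER_MIB = 1024


def daily_traffic_mib(polls_per_day, kib_per_response):
    """Traffic generated in one day by a client that downloads the feed on every poll."""
    return polls_per_day * kib_per_response / KIB_PER_MIB


polls_per_day = 24 * 2  # one request every half hour

uncompressed = daily_traffic_mib(polls_per_day, 100)  # full 100KiB feed each time
compressed = daily_traffic_mib(polls_per_day, 33)     # same polls, gzip-sized feed

print(f"uncompressed: {uncompressed:.2f} MiB/day")  # uncompressed: 4.69 MiB/day
print(f"compressed:   {compressed:.2f} MiB/day")    # compressed:   1.55 MiB/day
```

Compression alone would cut that single client to about a third of what it uses now, before conditional requests even enter the picture.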
While 5MiB/day is not too much taken on its own, my blog’s traffic averages 50-60 MiB/day, so that is basically 10% of the traffic for less than 1% of the users, just because their authors do not follow best practices when writing web software. I’ve now added these crawlers to the list of stealth robots, which means that they will receive a “406 Not Acceptable” unless they finally implement at least support for compressed responses (which is the easy part in all this).
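The logic of that filter can be sketched in a few lines. This is not my actual ModSecurity rule, just an illustration of the decision in Python: the user-agent fragments in `WASTEFUL_READERS` are placeholders, the function names are hypothetical, and the `Accept-Encoding` parsing is deliberately simplified (it only looks for an explicit `gzip` entry and ignores wildcards).

```python
# Placeholder user-agent fragments; the real list is not reproduced here.
WASTEFUL_READERS = ("ExampleReader", "AnotherFetcher")


def accepts_gzip(accept_encoding):
    """True if the header lists gzip explicitly with a non-zero quality value."""
    for item in (accept_encoding or "").split(","):
        parts = [part.strip().lower() for part in item.split(";")]
        if parts[0] != "gzip":
            continue
        quality = 1.0
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0  # malformed quality: treat as "not accepted"
        return quality > 0
    return False


def status_for(user_agent, accept_encoding):
    """Return 406 for a listed reader that does not ask for gzip, 200 otherwise."""
    agent = (user_agent or "").lower()
    listed = any(name.lower() in agent for name in WASTEFUL_READERS)
    if listed and not accepts_gzip(accept_encoding):
        return 406
    return 200
```

With this shape, a listed reader gets back in the moment it starts sending `Accept-Encoding: gzip`, and everybody else is unaffected.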
This has an unfortunate implication for the users of those services who were reading me: they won’t get any new updates. If I were a commercial entity, I couldn’t afford this at all. The big problem, to me, is that with Google Reader going away, I expect more and more issues of this kind to crop up. Even NewsBlur, which is now my feed reader of choice, hasn’t fixed their crawlers yet, which I commented upon before; the code is open-source but I don’t want to deal with Python just yet.
Seriously, why are there so many people who expect to be able to deal with web software and yet have no idea how the web works at all? And I wonder if anybody expected this kind of fallout from the simple shutdown of a relatively minor service like Google Reader.