I’m a content publisher, whether I like it or not. This blog is relatively well followed, and I write quite a lot in it. While my hosting provider does not give me grief for my bandwidth usage, optimizing it is something I’m always keen on, especially since I have been Slashdotted once before. This is one of the reasons why my ModSecurity Ruleset validates and filters crawlers as much as spammers.
Blogs’ feeds, be they RSS or Atom (this blog only supports the latter), are a very neat way to optimize bandwidth: they get you the content of the articles without styles, scripts or images. But they can also be quite big. The average feed for my blog’s articles is 100KiB, which is a fairly big page if you consider that feed readers are supposed to keep polling the blog to check for new items. Luckily for everybody, the authors of HTTP did consider this problem, and solved it with two main features: conditional requests and compressed responses.
Okay, there’s a sense of déjà vu in all of this, because I have already complained about software not using these features even when it’s designed to monitor web pages constantly.
By using conditional requests, even if you poke my blog every fifteen minutes, you won’t use more than 10KiB an hour if no new article has been posted. By using compressed responses, instead of a 100KiB response you’ll only have to download about 33KiB. With Google Reader, things were even better: instead of 113 separate requests for the feed (one per subscriber), the FeedFetcher made a single request, and that was it.
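Just to make the mechanics concrete, this is roughly what a well-behaved fetch looks like from the command line; the feed URL is a placeholder and the file names are just scratch space, so take it as a sketch rather than what any given reader actually does:

# First fetch: ask for a compressed response (--compressed sends the
# Accept-Encoding header and decompresses transparently) and keep the
# validators the server hands back (ETag and Last-Modified).
url=https://example.org/articles.atom
curl --silent --compressed --dump-header headers.txt --output feed.xml "$url"

etag=$(sed -n 's/^[Ee][Tt]ag: *//p' headers.txt | tr -d '\r')
lastmod=$(sed -n 's/^[Ll]ast-[Mm]odified: *//p' headers.txt | tr -d '\r')

# Later fetches: replay both validators verbatim. If nothing changed, the
# server answers 304 Not Modified with no body, and we keep the old copy.
status=$(curl --silent --compressed --output feed.new.xml \
              --header "If-None-Match: ${etag}" \
              --header "If-Modified-Since: ${lastmod}" \
              --write-out '%{http_code}' "$url")
[ "$status" = 200 ] && mv feed.new.xml feed.xml

The second request either costs a few hundred bytes for the 304, or downloads the compressed feed only when something has actually changed.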
But now Google Reader is no more (almost). What happens now? Well, of the 113 subscribers, a few will most likely not re-subscribe to my blog at all. Others have migrated to NewsBlur (35 subscribers so far), while the rest seem to have installed their own feed reader or aggregator: tt-rss, ownCloud, and so on. This was obvious from the statistics in both AWStats and Munin, which show a higher volume of requests and delivered content compared to last month.
I’ve since decided to look into reducing the bandwidth usage a bit further, among other things by providing WebP alternatives for images, but that does not really work as intended; I have enough material there for a rant post or two, so I won’t discuss it now. While doing so, though, I found out something else.
One of the changes I made while hoping to use WebP was to serve the image files from a different domain (assets.flameeyes.eu), which means that the access log for the blog, while still not perfect, is decidedly cleaner than before. From there I noticed that a new feed reader has started requesting my blog’s feed every half an hour. Without compression. In full, every time. That’s just shy of 5MiB of traffic per day, but that’s not the worst part. The worst part is that those 5MiB serve a single reader, as the requests come from a commercial, proprietary feed reader webapp.
And it’s not the only one! Gwene does the same, even though the pull request I sent to make it use compressed responses hasn’t received a single reply. Even Yandex’s new product has the same issue.
While 5MiB/day is not much on its own, my blog’s traffic averages 50-60 MiB/day, so that’s roughly 10% of the traffic going to less than 1% of the users, just because they do not follow best practices when writing web software. I’ve now added these crawlers to the list of stealth robots, which means they will receive a “406 Not Acceptable” response unless they finally implement at least support for compressed responses (the easy part in all this).
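If you happen to maintain one of these readers, a quick comparison from the command line is enough to see how your requests fare; the URL is again a placeholder, and whether you actually get a 406 back depends on the rules of the server you’re testing against:

# Without Accept-Encoding: a server that graylists compression-unaware
# clients may answer 406 here (otherwise you'll see the full feed size).
curl --silent --output /dev/null \
     --write-out 'plain: %{http_code}, %{size_download} bytes\n' \
     https://example.org/articles.atom

# With gzip advertised: smaller transfer, and it passes the filter.
curl --silent --compressed --output /dev/null \
     --write-out 'gzip:  %{http_code}, %{size_download} bytes\n' \
     https://example.org/articles.atom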
This has an unfortunate implication for the users of those services who were reading me, as they won’t get any new updates. If I were a commercial entity, I couldn’t afford this at all. The big problem, to me, is that with Google Reader going away, I expect more and more of these issues to crop up. Even NewsBlur, which is now my feed reader of choice, hasn’t fixed its crawler yet, which I commented upon before; the code is open source, but I don’t want to deal with Python just yet.
Seriously, why are there so many people who expect to be able to deal with web software and yet have no idea how the web works at all? And I wonder whether anybody expected this kind of fallout from the shutdown of a relatively minor service like Google Reader.
Never even realised the impact on publishers and hosters, hm!

Compressed responses were known to me, as well as lossless PNG and lossless JPEG compression (the latter with /usr/bin/jpegtran -copy all -progressive -verbose -outfile "${1}" "${1}") saving quite a bit, which you talked about on Twitter; it’s something I planned to offer as an automated solution for the people hosting on my server. Not sure if it’s something usable, but metalink (implemented in curl) looks cool as well, even though it’d be of limited use.

Conditional requests, however, are new to me. I started looking into them as I did indeed plan to implement something of my own, since IFTTT is way too limited for my taste (fetching articles once a week instead of daily or whatever it is, among other preferences) and things like NewsBlur aren’t really my taste or a direct solution to push things to my Kindle in a proper format automatically.

But that’s where I lost it… my mind went a bit spaghetti, as it drifted from partial RSS/Atom downloads and Range requests to “Hey, but Readability doesn’t want an RSS feed, so I need to parse the difference there as well! Why am I even trying this?”.

Oh well, it’ll get somewhere, hopefully soon.
Range requests are probably not worth it. Compressed responses and conditional requests based on ETag/If-Modified-Since are definitely a must. In Ruby, the feedzirra gem takes care of most of the behind-the-scenes work. Otherwise you could look at the approach taken by "Harvester":http://github.com/Flameeyes… but keep in mind that version is not Ruby 1.9 compatible and it’s rotting fast (I need to work on it).

As for IFTTT: it should fetch RSS quite quickly, given that you see the posts coming from my blog almost in realtime… make sure you didn’t get the wrong recipe!
Not the Ruby type, but there’s probably quite a bit around for Python, looking at the Google results. The following curl line works as well, I just have to adapt it to add the ETag somehow. It only overwrites the test.html file if there’s new content:

curl -H "Accept: text/xml" -H "If-Modified-Since: $(date --rfc-2822 -r test.html 2>/dev/null || date --date="01/01/1970")" --remote-time --compressed -o test.html <url>

How’s IFTTT doing it, then? I mean, almost realtime sounds like it’s polling your site every 15 minutes or so? The recipe for your blog has been working quite fine. It’s a generated IBM feed that randomly times out, more often than not. And it can’t generate a weekly collection of articles, and is instead sending me one every day ;).
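For the ETag part, one possible way (only a sketch, keeping the placeholder URL and adding a small state file next to the downloaded copy) is to dump the response headers on a successful fetch and replay the saved ETag with If-None-Match on the next run:

#!/bin/bash
# A sketch only: the URL is a placeholder and the test.* files are scratch,
# extending the curl line above with ETag handling via If-None-Match.
url=https://example.org/articles.atom

args=(--silent --compressed --remote-time
      --header "Accept: text/xml"
      --header "If-Modified-Since: $(date --rfc-2822 -r test.html 2>/dev/null || date --rfc-2822 --date='01/01/1970')"
      --dump-header test.headers --output test.new.html
      --write-out '%{http_code}')

# Replay the ETag from the previous successful fetch, if we have one.
[ -s test.etag ] && args+=(--header "If-None-Match: $(cat test.etag)")

status=$(curl "${args[@]}" "$url")

if [ "$status" = 200 ]; then
    mv test.new.html test.html
    # Store the ETag string verbatim for the next run; no parsing needed.
    sed -n 's/^[Ee][Tt]ag: *//p' test.headers | tr -d '\r' > test.etag
fi

Storing the header value verbatim also sidesteps the question of date parsing that comes up further down in the comments.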
IFTTT uses feedzirra every fifteen minutes, yes, but I think it also uses PubSubHubbub so that the blog notifies it of new posts, even though I never confirmed whether that’s working or not…
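For reference, the publisher side of PubSubHubbub is just a form-encoded ping to the hub whenever a new post goes out; a sketch with curl, assuming the public Google hub and a placeholder feed URL:

# Ping the hub after publishing; subscribers registered with the hub get
# the update pushed to them instead of having to poll the feed.
# Both the hub and the feed URL here are placeholders.
curl --silent --data "hub.mode=publish" \
     --data-urlencode "hub.url=https://example.org/articles.atom" \
     https://pubsubhubbub.appspot.com/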
Just out of curiosity, how are self-hosted solutions like tt-rss or Fever at this?
Fever _is_ a commercial, proprietary, closed-source, self-hosted solution, and it sucks as noted in the post itself.

In the case of tt-rss: it does support compressed responses, which is a very good thing. On the other hand, it does not seem to keep track of the ETag or modification date, so it does not use conditional requests. Akregator, instead, I’m pretty sure considers both, as I see lots of 304s for it.

Basically we’ve got a ton of software that is not friendly to content providers and needs to be improved and fixed. Some of it can be fixed because it is open source (including NewsBlur); the rest is closed source and proprietary, and I have no option but to point the problem out to the authors, hope they decide to fix it as intended, and in the meantime graylist the crawler.
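As a rough way of checking this kind of behaviour from the server side, you can count how many 304s each user agent gets for the feed compared to its total requests; this assumes Apache’s combined log format, and both the feed path and the log location are placeholders to adapt:

# Hits and 304s per user agent for the feed. In the combined log format,
# the request line is the first quoted field, the status is the first
# token after it, and the user agent is the last quoted field.
awk -F'"' '$2 ~ /^GET \/articles\.atom/ {
    split($3, f, " ")            # $3 is " 304 12345 " between quotes
    total[$6]++
    if (f[1] == 304) notmod[$6]++
}
END {
    for (ua in total)
        printf "%6d hits  %6d 304s  %s\n", total[ua], notmod[ua] + 0, ua
}' /var/log/apache2/access.log | sort -rn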
Thanks for writing such a nice, detailed post about this. We’re currently developing one of those proprietary readers, and we’ve already made sure we’re using compression and If-Modified-Since, but had forgotten about ETag. We’ll be sure to implement that, and we appreciate the work you do, especially in documenting your findings here and elsewhere.

I think it’s somewhat important to note that, to cope with terrible server implementations, the spec for If-Modified-Since actually reads more like the one for ETag: the only reliable value to send is the value of the “Last-Modified” header from the previous response, both because some servers aren’t time-synchronized and because the server isn’t even required by the spec to parse the header as a datetime.
Thanks for helping make the web friendlier!

I didn’t even think about the date parsing; I assumed that it would be properly parsed. I do rely on it for my hwids package, using the @--time-cond@ and @--remote-time@ parameters to @curl@, and there it works.

But I suppose that there are just as many broken server-side implementations as there are client-side ones.
Indeed, in a past version of our product we were generating that field from our own timestamps of when we downloaded something, and we managed to miss updates on too many feeds whose servers were running with a time deficit or something similar; once we switched to storing the exact “Last-Modified” string, as per the spec, most of our issues went away. There are still an alarming number of feeds that don’t support If-Modified-Since at all, but hopefully FeedBurner will stick around, since the massive number of feeds served through it support it properly. 🙂
*Rolling my eyes.*

The If-Modified-Since header and the 304 status code have been part of HTTP since version 1.0. Let’s get updated to 1996, folks!

At my company, it drove us nuts for a long time that Firefox wouldn’t bother doing conditional fetches of stylesheets, thinking somehow that stylesheets would never change. It led web developers to stupid kludges like putting a revision counter on stylesheet URLs so that they would get fetched when they changed. Finally Firefox caught on to sending If-Modified-Since requests for stylesheets.

Gary Keith had a similar problem with people fetching his content repeatedly, except in his case the content wasn’t little bits of XML but files that approached half a megabyte. He finally gave up, evidently largely because of having to pay for all his website bandwidth. It should be noted, though, that much of the blame lay with his lame server-side implementation, which didn’t send a Last-Modified header.

Oh, would that everyone catch on to a 17-year-old standard!
So, you are basically saying “fuck off” to some readers to save 5 cents per month? Just because some software has bugs?

Looks like the service I use is fine (The Old Reader), but your approach puzzles me.
No, I’m saying “fuck off” to products that are poisonous for the web, and only if they don’t fix the handling.

Since the 406 answer is automatic if compressed responses are not supported and the request comes from a bot or an Atom reader, the fix for it is to… support compressed responses! It’s easy and painless.

For what it’s worth, software that used not to support this, such as newsbeuter, has since been fixed, which is an improvement for both consumers and producers… sometimes the only way is the hard way.
Just yesterday I started subscribing to feeds again… through feedly.