More HTTP misbehaviours

Today I have been having some fun: while looking at the backlog on IRCCloud, I found out that it auto-linked Makefile.am, which I promptly decided to register with Gandi. Unfortunately I couldn’t get Makefile.in or configure.ac, as they are both already registered. After that I decided to set up Google Analytics to report how many referrals arrive at my websites through some of the many vanity domains I registered over time.

After doing that, I spent some time staring at the web server logs to make sure that everything was okay, and I found out some more interesting things: it looks like a lot of people have been fetching my blog Atom feed through very bad feed readers. This is the reification of my forecast last year when Google Reader got shut down.

Some of the fetchers are open source, so I ended up opening issues for them, but that is not the case for all of them. And even when they are open source, sometimes they don’t even accept pull requests implementing the feature, for whatever reason.

So this post is a bit of a name-and-shame, which can be positive for open-source projects when they can fix things, or negative for closed source services that are trying to replace Google Reader and failing to implement HTTP properly. It will also serve as a warning for my readers from those services, as they’ll stop being able to fetch my feed pretty soon, as I’ll update my ModSecurity rules to stop these people from fetching my blog.

As I noted above, both Stringer and Feedbin fail to properly use compressed responses (gzip compression), which means that they fetch over 90KiB every time instead of just 25KiB. The Stringer devs already reacted and seem to be looking into fixing this very soon. I have had no answer from Feedbin yet (though it is still early), but it worries me for another reason too: it does not do any caching at all. And somebody set up a Feedbin instance at the Prague University that fetches my feed, without compression, without caching, every two minutes. I’m going to blacklist it soon.
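
For reference, this is roughly what a polite fetcher has to do. It is a minimal sketch in Python using only the standard library, not the code of any of the readers named here: the feed URL, the User-Agent string and the `cache` dictionary are placeholders I made up for illustration. It asks for gzip, sends back the validators from the previous fetch, and reuses the cached body when the server answers 304.

```python
import gzip
import urllib.error
import urllib.request

# Hypothetical values, for illustration only.
FEED_URL = "https://blog.example.org/articles.atom"
USER_AGENT = "ExampleReader/0.1 (+https://reader.example.org/about)"


def build_request(url, etag=None, last_modified=None):
    """Build a GET that accepts gzip and revalidates a cached copy."""
    headers = {"Accept-Encoding": "gzip", "User-Agent": USER_AGENT}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return urllib.request.Request(url, headers=headers)


def decode_body(raw, content_encoding):
    """Decompress the body if the server says it is gzip-encoded."""
    if (content_encoding or "").strip().lower() == "gzip":
        return gzip.decompress(raw)
    return raw


def fetch(url, cache):
    """Return the feed body; `cache` is a dict kept between fetches."""
    request = build_request(url, cache.get("etag"), cache.get("last_modified"))
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            body = decode_body(response.read(),
                               response.headers.get("Content-Encoding"))
            cache.update(etag=response.headers.get("ETag"),
                         last_modified=response.headers.get("Last-Modified"),
                         body=body)
    except urllib.error.HTTPError as error:
        # urllib reports 304 Not Modified as an HTTPError; anything else
        # is a real failure and is re-raised.
        if error.code != 304:
            raise
    return cache["body"]
```

Calling `fetch(FEED_URL, cache)` with the same `cache` dictionary on every poll means an unchanged feed costs one 304 response with no body, instead of the full 90KiB.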

Gwene still has not replied to the pull request I sent in October 2012, but on the bright side, it has not fetched my blog in a long time. Feedzirra (now Feedjira), used by IFTTT, still does not enable compressed responses by default, even though it seems to support the option (Stringer is also based on it, it seems).

It’s not just plain feed readers that fail at implementing HTTP. Distributed social network Friendica – which aims at doing a better job than Diaspora at that – also seems to forget about implementing either compressed responses or caching. At least it seems to only fetch my feed every twelve hours. On the other hand, it seems to also get someone’s timeline from Twitter, so when it encounters a link to my blog it first sends a HEAD request, and then fetches the page. Three times. Also uncompressed.

On the side of non-open-source services, FeedWrangler has probably one of the worst implementations of HTTP I’ve ever seen: it does not support compressed responses (90KiB feed), does not do caching (every time!), and while it fetches at one-hour intervals, it does not understand that a 301 is a permanent redirection, and that there’s no point in keeping around two feed IDs for /articles.rss and /articles.atom (each with one subscriber). That’s 4MiB a day, which is around 2% of the bandwidth my website serves over a day. While this is not an important amount, and I don’t have a limit on the server’s egress, it seems silly that 2% of my bandwidth is consumed by two subscribers, when the site has over a thousand visitors a day.
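
To put numbers on that, and to show how little a 301 asks of a client, here is a small sketch in Python. The function names and the subscription table are hypothetical, and I am assuming for the example that /articles.rss is the URL that redirects to /articles.atom. The first function reproduces the figure above: 90KiB, 24 fetches a day, two feed IDs gives 4320KiB, or about 4.2MiB. The second one folds the subscribers of the old URL into the new one as soon as a 301 is seen, so only one feed is polled from then on.

```python
def daily_kib(feed_kib, fetches_per_day, feed_ids):
    """Bandwidth used per day, in KiB, by a reader polling without caching."""
    return feed_kib * fetches_per_day * feed_ids


def merge_on_permanent_redirect(subscriptions, old_url, status, location):
    """Follow a 301 for good: move subscribers to the new URL.

    `subscriptions` maps a feed URL to the set of its subscribers.
    Returns the URL that should be polled from now on.
    """
    if status == 301 and location and location != old_url:
        moved = subscriptions.pop(old_url, set())
        subscriptions.setdefault(location, set()).update(moved)
        return location
    return old_url


# Two feed IDs, one subscriber each (names are made up).
subscriptions = {"/articles.rss": {"alice"}, "/articles.atom": {"bob"}}
merge_on_permanent_redirect(subscriptions, "/articles.rss", 301, "/articles.atom")

uncompressed_two_feeds = daily_kib(90, 24, 2)   # 4320 KiB, about 4.2 MiB
compressed_one_feed = daily_kib(25, 24, 1)      # 600 KiB, before any caching
```

With both subscribers on a single feed ID and gzip enabled, the same hourly polling drops from about 4.2MiB to 600KiB a day, and caching would cut it further whenever the feed has not changed.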

But what takes the biscuit is definitely FeedMyInbox: while it fetches only every six hours, it implements neither caching nor compression. And I found it only when looking into the requests coming from bots without a User-Agent header. The requests come from 216.198.247.46 which is svr.feedmyinbox.com. I’m soon also going to blacklist this until they stop being douches and provide a valid user agent string.

They are far from the only ones though; there is another bot that fetches my feed every three hours that will soon meet the same fate. But this one does not have an obvious service attached to it, so if whatever you’re using to read my blog tells you it can’t fetch my blog anymore, try to figure out if you’re using a douchereader.

Please remember that software on the net should be implemented for collaboration between client and server, not for exploitation. Everybody’s bandwidth gets worse when you heavily use a service that is not doing its job at optimizing bandwidth usage.

9 thoughts on “More HTTP misbehaviours”

  1. I’m afraid it shows my lack of knowledge of Czech geography and institutions despite working with plenty of people from the city :P

     “Ceske vysoke uceni technicke v Praze” is the netblock owner, which is also named in English as “Czech Technical University”. I mistakenly applied the Italian rule of naming the university after the city; I apologize.

  2. I run a private feed fetcher myself, since Google Reader went away. Interestingly, a lot of feed providers “misbehave” as well, not providing any caching mechanism: no ETag, no Last-Modified, nothing. It’s a really annoying waste of bandwidth, especially for something that is explicitly a timestamped feed. Shouldn’t be that hard!

     And yeah, implementing a proper reaction to a 301 is something that is sadly easy to miss. I guess it’s just too easy to ignore the semantics of HTTP and treat everything non-2xx as “let’s just try it later”.

  3. You’re violating the robustness principle (Postel’s law) (http://en.wikipedia.org/wik…) by being over-strict in what you accept as client requests. While it is definitely good to name such implementation shortcomings, blocking them is plain stupid. Because new users will try with their software and find *your* site to be broken. And move on. And you lost the chance to educate these people further over time and left them stuck at reading $mediaoutlet “news” every day because they – as commercial entities – follow the robustness principle. Because it increases their reach and their income. Every additional reader is another set of ads served.

     Or, more technically: there is no RFC that says you need to accept compressed responses or you are not allowed to get a blog’s resources via HTTP.

  4. While I have always done my best to follow Postel’s law, I think we are at a point where it’s no longer good practice for the web. If we were to follow it strictly, we’d end up still catering to a minority of users who won’t use a browser supporting modern TLS technology, instead of just disabling SSLv2 and calling it a day. And in the case of email, where Postel’s law was formulated, we wouldn’t be able to start providing really secure email.

     Yes, it is indeed the case that I could turn away a lot of readers by being too strict on the feed readers, but I do believe that my requirements are not *that* impossible to follow. In particular, most HTTP base libraries support gzip compression; people just forget to enable it.

     My reason for ranting-and-blocking is that I don’t want the thing to be just ignored simply because I could lose a reader or two. I want people to figure out which services have good implementations and which ones don’t.

     In my previous linked post I was complaining about YandexBlogs, by the way. While I may now finally be able to get in touch with them and ask them to please fix it, I see that they try to fetch my blog even if it doesn’t have any subscribers. I don’t know if it’s because whoever was reading me through that stopped using the service, or stopped reading my blog. I hope the former, but I cannot really care about the latter either.

     I think we should all force our content providers and consumers to do things the right way, especially in today’s Web.

  5. Are the feed readers you mentioned only examples from a bigger list, or are they the only “bad” ones? I’m asking because I’m on your side when it comes to readers not using the “better” methods. I haven’t checked my feed reader, and I probably don’t know enough about HTTP to check it, but I’d like to have one that uses the “latest” features (I think gzip isn’t so “latest” at all…). So it would be nice if you could post a complete list of “bad” readers, or at least tell me whether mine (http://tt-rss.org/) is fine (more or less).

  6. This is the current list of bad readers that access my blog. It is probably longer, as there are a few more that try to access the feed without compression, but they have no `User-Agent` field.

     As for tt-rss, there’s a note in my previous post about it: yes, it does support compression, and it does support caching. In the case of compression things are a bit complex, as it does not seem to *always* support compression. It probably depends (on Gentoo) on the USE flags used. The same happens with Liferea.

  7. Rather than blocking the bad readers, could you redirect them (or server side alias, to avoid sending HTTP redirect) to a very simple feed that just tells the receiver that their feed reader is banned for misbehaviour and they need to switch to one that doesn’t suck? Such a feed would not show the regular post listings, and so could be short no matter how many posts your blog contains.
