So today’s frenzy is all about Google’s dismissal of the Reader service. While I’m also upset about that, I’m afraid I cannot really get into discussing that at this point. On the other hand, I can talk once again of my ModSecurity ruleset and in particular of the rules that validate HTTP robots all over the Internet.
One of the Google Reader alternatives that are being talked about is NewsBlur — which actually looks cool at first sight, but I (and most other people) don’t seem to be able to try it out yet because their service – I’m not going to call them servers as it seems they at least partially use AWS for hosting – fails to scale.
While I’m pretty sure that it’s an exceptional amount of load they are receiving now as everybody and their droid are trying to register to the service and import their whole Google Reader subscription list, which then needs to be fetched and added to the database, – subscriptions to my blog’s feed went from 5 to 23 in the matter of hours! – there are a few things that I can infer from the way it behaves that makes me think that somebody overlooked the need for a strong HTTP implementation.
First of all what happened was that I got a report on twitter that NewsBlur was getting a 403 fetching my blog, and that was obviously caused by my rules’ validation of the request. Looking at my logs, I found out that NewsBlur sends requests with three different User-Agents, which show a likeliness that they are implemented by three different codepaths altogether:
User-Agent: NewsBlur Feed Fetcher - 5 subscribers - http://www.newsblur.com (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/536.2.3 (KHTML, like Gecko) Version/5.2)
User-Agent: NewsBlur Page Fetcher (5 subscribers) - http://www.newsblur.com (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1 Safari/534.48.3)
User-Agent: NewsBlur Favicon Fetcher - http://www.newsblur.com
The third is the most conspicuous string, because it’s very minimal and does not follow your average string format, using the dash as separator instead of adding the URL in parenthesis next to the fetcher name (and version, more on that later).
The other two strings show that they have been taken by the string reported by Safari on OSX — but interestingly enough from two different Safari version, and one of the two has been actually stripped as well. This is really silly. While I can understand that they might want to look like Safari when fetching a page to display – mostly because there are bad hacks like PageSpeed that serve different HTML to different browsers, messing up caching – I doubt that is warranted for feeds; and even getting the Safari HTML might be a bad idea if then it’s displayed by the user with a different browser.
The code that fetches feeds and pages is likely quite different as it can be seen by the full request. From the feed fetcher:
GET /articles.atom HTTP/1.1
A-Im: feed
Accept-Encoding: gzip, deflate
Connection: close
Accept: application/atom+xml,application/rdf+xml,application/rss+xml,application/x-netcdf,application/xml;q=0.9,text/xml;q=0.2,*/*;q=0.1
User-Agent: NewsBlur Feed Fetcher - 5 subscribers - http://www.newsblur.com (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/536.2.3 (KHTML, like Gecko) Version/5.2)
Host: blog.flameeyes.eu
If-Modified-Since: Tue, 01 Nov 2011 23:36:35 GMT
If-None-Match: "a00c0-18de5-4d10f58aa91b5"
This is a very sophisticated fetching code, as it not only properly supports compressed responses (Accept-Encoding
header) but it also uses the If-None-Match
and If-Modified-Since
headers to not re-fetch an unmodified content. The fact that it’s pointing to November 1st of two years ago is likely due to the fact that since then my ModSecurity ruleset refused to speak with this fetcher, because of the fake User-Agent string. It also includes a proper Accept
header that lists the feed types they prefer over the generic XML and other formats.
The A-Im
header is not a fake or a bug; it’s actually part of RFC3229 Delta encoding in HTTP and stands for Accept-Instance-Manipulation
. I’ve never seen that before, but a quick search turned it out, even though the standardized spelling would be A-IM
. Unfortunately, the aforementioned RFC does not define the “feed” manipulator, even though it seems to be used in the wild, and I couldn’t find a proper formal documentation of how it should work. The theory from what I can tell is that the blog engine would be able to use the If-Modified-Since
header to produce on the spot a custom feed for the fetcher, that only includes entries that has been modified since that date. Cool idea, too bad it lacks a standard as I said.
The request coming in from the page fetcher is drastically different:
GET / HTTP/1.1
Host: blog.flameeyes.eu
Connection: close
Content-Length: 0
Accept-Encoding: gzip, deflate, compress
Accept: */*
User-Agent: NewsBlur Page Fetcher (5 subscribers) - http://www.newsblur.com (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1 Safari/534.48.3)
So we can tell two things from the comparison: this code is older (there is an earlier version of Safari being used), and not the same care has been spent as it has been on the feed fetcher (which dropped the Safari identifier itself, at least). It’s more than likely that if libraries are used to send the request, a completely different library is used here, as this request declares support for the compress
encoding, not supported by the feed fetcher (and as far as I can tell, never ever used). It also is much less choosy on the formats to receive, as it accepts whatever you want to give it.
For the Italian readers: yes I intentionally picked the word choosy. While I can find Fornero an idiot as much as the next guy, I grew tired of copy-paste statuses on Facebook and comments that she should have said picky. Know your English, instead of complaining on idiocies.
The lack of If-Modified-Since
here does not really mean much, because it’s also possible that they were never able to fetch the page, as they might have introduced the feature later (even though the code is likely older). But the Content-Length
header sticks out like a sore thumb, and I would expect to have been put there by whatever HTTP access library they’re using.
The favicon fetcher is the one that is the most naïve and possibly the code that needs to be cleaned up the most:
GET /favicon.ico HTTP/1.1
Accept-Encoding: identity
Host: blog.flameeyes.eu
Connection: close
User-Agent: NewsBlur Favicon Fetcher - http://www.newsblur.com
Here we start with nigh protocol violations, by not providing an Accept
header — especially facepalming considering that this is where a static list of mime types would be the most useful, to restrict the image formats that will be handled properly! But what happens with my rules is that the Accept-Encoding
there is not suitable for a bot at all! Since it does not support any compressed response, the code will now respond with a 406 Not Acceptable status code, instead of providing the icon.
I can understand that a compressed icon is more than likely to not be useful — indeed most images should not be compressed at all to be sent over HTTP, but why should you explicitly refuse it? Especially since the other two fetches properly support a sophisticated HTTP?
All in all, it seems like some of the code in NewsBlur has been bolted on after the fact, and with different levels of care. It might not be the best of time for them now to look at the HTTP implementation, but I would still suggest for it. A single pipelined request of the three components they need – instead of using Connection: close
– could easily reduce the number of connections to blogs, and that would be very welcome to all the bloggers out there. And using the same HTTP code would make it easier for people like me to handle NewsBlur properly.
I would also like to have a way to validate that a given request comes from NewsBlur — like we do with GoogleBot and other crawlers. Unfortunately this is not really possible, because they use multiple servers, both on standard hostings and AWS, both on IPv4 and (possibly, one time) IPv6, so using FcRDNS is not an option.
Oh well, let’s see how this thing pans out.