Modern feed readers

Four years ago, I wrote a musing about the negative effects of the Google Reader shutdown on content publishers. Today I can definitely confirm that some of the problems I foretold have materialized. Indeed, thanks to the fortuitous fact that people have started posting my blog articles to Reddit and Hacker News (neither of which I’m fond of, but let’s leave that aside), I can declare that the vast majority of the bandwidth used by my blog is consumed by bots, and in particular by feed readers. But let’s start from the beginning.

This blog is syndicated over a feed, whose URL and format have changed a number of times over the years, mostly when the blogging software changed or was updated. The most recent change came with the switch from Typo to Hugo, which also changed the feed name. I could have kept the original feed name, but it made little sense at the time, so instead I set up permanent redirects from the old URLs to the new ones, as I always do. I say “as I always do” because I still keep the original URLs working, going all the way back to when I ran the blog off my home DSL.

Some services and feed reading software know how to deal with permanent redirects correctly, and will (eventually) replace the old feed URL with the new one. For instance, NewsBlur will replace a URL after ten fetches have replied with a permanent redirect (which is sensible: it avoids accepting a redirection that was set up by mistake and soon rolled back, and it avoids data poisoning attacks). Unfortunately, this behaviour seems to be extremely rare, and so on August 14th I received over three thousand requests for the old Typo feed URL (admittedly, that was the most persistent URL I used). In addition, I also received over 300 requests for the very old Typo /xml/ feeds, of which 122 still pointed at my old dynamic domain, which now redirects to my blog’s normal domain. This has been the case for almost ten years, and yet some people still have subscriptions to those URLs! There is at least one Liferea and one Akregator instance still pointing at them.
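As an aside, the logic NewsBlur applies is simple enough to sketch out. The following is not their actual code, just a minimal illustration in Python (the requests library and the subscription dict are my own assumptions) of how a reader can promote a stable permanent redirect to a new stored feed URL:

import requests

REDIRECT_THRESHOLD = 10  # NewsBlur reportedly waits for ten consistent fetches

def fetch_feed(subscription):
    # subscription is assumed to be a dict with 'url' and 'redirect_count' keys.
    response = requests.get(subscription['url'], allow_redirects=False, timeout=30)
    if response.status_code == 301 and 'Location' in response.headers:
        subscription['redirect_count'] += 1
        if subscription['redirect_count'] >= REDIRECT_THRESHOLD:
            # The redirect has been stable long enough: update the stored URL.
            subscription['url'] = response.headers['Location']
            subscription['redirect_count'] = 0
        # Follow the redirect for this fetch either way.
        return requests.get(response.headers['Location'], timeout=30)
    subscription['redirect_count'] = 0
    return response

The threshold is arbitrary; the point is that a redirect only becomes permanent in the reader’s database once it has proven itself.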

But while NewsBlur implements sane semantics for handling permanent redirects, its implementation is far from perfect. In particular, even though I have brought this up many times, NewsBlur does not actually send If-Modified-Since or If-None-Match headers, which means it takes a full copy of the feed at every request. It does support compressed responses (no fetch of the feed is allowed without them anyway), but NewsBlur requests the same URL more than twice an hour, because it seems to have two “sites” described by the same URL. At 50KiB per request, that makes up about 1% of the total bandwidth usage of the blog. To be fair, this is not bad at all, but one has to wonder why they can’t save the Last-Modified or ETag values — I guess I could install my own instance of NewsBlur and figure out how to do that myself, but who knows when I would find the time for that.

Update (2017-08-16): Turns out that, as Samuel pointed out in the comments and on Twitter, I wrote something untrue: NewsBlur does send the headers, and handles them correctly. The problem is an Apache bug that causes a 304 to never be issued when If-None-Match is used together with mod_deflate.

To be fair, even rawdog, which I use for Planet Multimedia, does not appear to support these headers properly. Oh, and speaking of Planet Multimedia: if someone were interested in providing a more modern template, so that Monty’s pictures don’t take over the page, that would be awesome!

There actually are a few other readers that do support these values correctly, and indeed receive a 304 (Not Modified) status code most of the time. These include Lighting (somebody appears to be still using it!) and at least one yet-another-reader-service, Willreadit.com — the latter appears to be in beta and invite only; it’s probably the best HTTP implementation I’ve seen for a service with such a rough website. Indeed, its bot landing page points out that it supports If-Modified-Since and gzip-compressed responses. Alas, it does not appear to learn from permanent redirects, so it’s currently fetching my blog’s feed twice, probably because there are at least two subscribers for it.

Also note that supporting If-Modified-Since is a prerequisite for supporting delta feeds, which are an interesting way to save even more bandwidth (although I don’t think this is feasible to do with a static website at all).
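For reference, this is roughly what a well-behaved fetcher has to do on every poll. A minimal sketch, assuming Python’s requests library and a simple per-feed cache dict (neither taken from any of the readers mentioned above):

import requests

def conditional_fetch(url, cache):
    headers = {'Accept-Encoding': 'gzip'}
    if 'etag' in cache:
        headers['If-None-Match'] = cache['etag']
    if 'last_modified' in cache:
        headers['If-Modified-Since'] = cache['last_modified']

    response = requests.get(url, headers=headers, timeout=30)
    if response.status_code == 304:
        # Nothing changed: the body is empty, nothing to parse or store.
        return None

    # Remember the validators so the next poll can be conditional.
    if 'ETag' in response.headers:
        cache['etag'] = response.headers['ETag']
    if 'Last-Modified' in response.headers:
        cache['last_modified'] = response.headers['Last-Modified']
    return response.content

Two header values stored per feed is all it takes to turn most polls into a handful of bytes.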

At the very least it looks like we won the battle for supporting compressed responses. The only 406 (Not Acceptable) responses for the feed URL are for Fever, which is no longer developed or supported. Even Gwene, which I pointed out was hammering my blog the last time I wrote about this, is now content to get the compressed version. Unfortunately, it does not appear that my pull request was ever merged, which likely means the repository itself is completely out of sync with what is actually being run.

So in 2017, what is the state of the art in feed reader support? NewsBlur has recently added support for JSON Feed, which is not particularly exciting – when I read the post, the screenshot chosen there reminded me where I had heard of Brent Simmons before: Vesper, which is an interesting connection, but I should not go into that now – but it at least shows that Samuel Clay is paying attention to the development of the format, even though that development right now appears to amount to just avoiding XML. Which, to be honest, is not that bad of an idea: since HTML (even HTML5) does not have to be well-formed XML, you need to embed it as escaped character data (usually CDATA) in an XML feed, and the way you do that makes it very easy to get wrong.
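To illustrate the pitfall (this is just a demonstration in Python, not code from any feed generator): a CDATA section is terminated by the first "]]>" it contains, so naively wrapping arbitrary HTML in CDATA produces a malformed feed, while entity escaping stays well formed.

from xml.sax.saxutils import escape

html = '<p>Example with a stray CDATA terminator: ]]> oops</p>'

# Naive CDATA wrapping: the "]]>" inside the content closes the CDATA
# section early and the resulting feed is no longer well-formed XML.
broken = '<content type="html"><![CDATA[' + html + ']]></content>'

# Entity escaping (or splitting the "]]>" across two CDATA sections)
# keeps the document well formed.
safe = '<content type="html">' + escape(html) + '</content>'

print(broken)
print(safe)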

Also, as I wrote this post, I realized what else I would like from NewsBlur: the ability to subscribe to an OPML feed as a folder. I still subscribe to lots of Planets, even though they seem to have lost their charm, but a few people are aggregated in multiple planets and it would make sense to be able to avoid duplicate posts. If I could tell NewsBlur «I want to subscribe to this Planet, aggregate it into a folder», it would be able to tell which feeds are duplicated, and mark the posts as read on all of them at the same time. Note that what I’d like is something different from just importing an OPML description of the planet! I would like the folder to be kept in sync with the OPML feed, so that if new feeds are added, they also get added to the folder, and the same for removed feeds. I should probably file that on GetSatisfaction at some point.

Planets, feeds and blogs

You have probably noticed that last month I replaced Harvester with rawdog for Planet Multimedia. The reason was easy to explain: Harvester requires libraries that only work with Ruby 1.8 — and while moving to Ruby 1.9 or 2 would mean being able to use the feedfetcher (the same one used by IFTTT), my attempts at updating the code to work with a more modern version of Ruby have all been failures.

Since I did not intend to be swamped with one more custom tool to maintain, I turned to another standard tool to implement the aggregator: rawdog — holding my nose about its use of darcs for source control, and about its name (please don’t google for it at work without safe search on). The nice part about using this tool is that it’s already packaged in Gentoo, so it’s handled straight by Portage with binary packages. Unfortunately, the default templates are terrible and the settings non-obvious, but Luca was able to make the best of it.

But more and more problems became obvious with time. The first is that the tool does not use meaningful exit codes — it always returns zero (success) even if the processing was incomplete. It took me two weeks to figure out that the script was failing when run from cron because the environment lacked the locale settings: the cron logs said that everything was alright, and since I use fcron, which I set to mail me only on errors, it did not send me any email either.

A couple of days ago, I got complaints again that the Planet was not updating; again, no error in the cron logs, no error in my email. I ran the command manually, and it told me that Luca’s feed, on blogs.gentoo.org, was unreachable. Okay, sure. But the problem did not solve itself when the feed came back up. Today I looked into it again, and J-B’s and Rémi’s blog feeds were unreachable. Once again, no non-zero exit status, thus no mail and no error in the logs. This is not the way it should behave.
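Until that is fixed upstream (or I move away from rawdog), a workaround along these lines is probably the best I can do: a wrapper that treats any output at all as an error, so that fcron’s mail-on-error behaviour actually kicks in. This is just a sketch, and it assumes the usual rawdog -u -w invocation and that rawdog reports its fetch problems on stdout or stderr the way it does when run by hand.

import subprocess
import sys

# The usual update-and-write invocation: -u fetches the feeds, -w writes the output.
result = subprocess.run(['rawdog', '-u', '-w'],
                        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                        universal_newlines=True)

output = result.stdout.strip()
if result.returncode != 0 or output:
    # Re-emit whatever rawdog printed and fail loudly so cron mails about it.
    sys.stderr.write(output + '\n')
    sys.exit(1)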

But that’s not enough. The other problem with rawdog is that it does not, by default, support generating a feed for the aggregation, like Harvester (and Planet/Venus) does. I found that Jonathan Riddell actually built a plugin for Planet KDE to generate such a feed, but I haven’t tested it yet because I have not found its authoritative source, just multiple copies of it on different websites. It also produces RSS feeds rather than Atom feeds — and I’m sorry to say, but I much prefer Atom.

So where does that leave us? I’m not going to try fixing rawdog, I’m afraid, mostly because I don’t intend to spend time with darcs. My options are either to go back to Harvester and fix it to not use DBI and to support Ruby 1.9, or to try to adapt the parts of NewsBlur that already deal with aggregating feeds and producing new ones into an alternative to rawdog. If I do something like that, though, I’m most likely going to take my dear time and make it a web-configurable tool, rather than something that needs to be configured on the command line or with configuration files.

The reason for that is, very simply, that I’m growing fond of doing most of my work in a browser when I can, and this looks like a perfect fit for that approach. Even more so if you can give someone else access to look into it — and if you can avoid storing passwords.

So, any takers to help me with this project?

SNI Quest: how’s the support?

After yesterday’s incident, my blog and all the other apps I host have moved to SNI certificates (a downgrade from Class 2 to Class 1, but that’s okay).

SNI is still considered a partially experimental feature nowadays because Windows XP is unfortunately still a thing. Luckily for me, it doesn’t seem like I have many Windows XP users — and the few that are there are probably fine using Chrome, Firefox or Opera, all of which ship their own SSL implementation (two of them using NSS), which supports SNI just fine.

Internet Explorer, on the other hand, uses the operating system’s libraries, which on XP are not capable of SNI at all, even if you upgrade to IE8. With a bit of luck, this will also mean fewer spammers using real WinXP-based browsers will be able to post. I don’t hold my breath, but it’s still possible. A few spammers were kicked off by the HTTPS move after all, so who knows.

What turned out to be interesting is how various web apps out there handle links to SNI-backed sites — the kind of test I’ve run many times before while testing my ModSecurity Rules. The results have been interesting: all the major websites and RSS readers seem to handle this pretty well, with two main exceptions.

LinkedIn has probably the worst HTTP client implementation I’ve seen in a serious web app. I had already opened a ticket with them because their fetcher does not accept compressed responses. This is pretty bad, considering that uncompressed responses are several times larger and, since this is upstream traffic from your server, you end up paying for LinkedIn’s laziness.

Due to this, LinkedIn links to my blog were already showing a (wrong) 403 message (the actual error they get is 406, but they then process it wrongly, and I don’t care much about that). With the new SNI certificate, the LinkedIn fetcher can now only report the hostname of my blog, and no matching log can be found in Apache, which makes me guess that they try to validate the connection’s certificate, and fail.

NewsBlur is interesting as well. At first it seemed like it did not support SNI, as the settings page for my blog’s feed showed “401 Bad URL” error messages — without any matching log in Apache, which meant that the SSL connection was not being completed either. On the other hand, the feed itself is fetched fine. While Samuel at first said that he did not care enough to implement SNI support for just one customer – which made me look for alternatives for half an hour – he has been very helpful with debugging it a bit. It turns out the problem only affects real-page fetching, and I haven’t spent much more time than this working on it. If somebody wants to look at it, I’m happy to point you to what’s going on.

Luckily, Python’s httplib does not verify certificates, which means Planet Gentoo still works. I’ve not checked Planet Multimedia yet — but at least that one I can fix myself if it breaks.
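For the curious, this is roughly what an SNI-aware, certificate-verifying fetch looks like with Python 3’s ssl module (3.4 or later); it is only a sketch of mine, but it shows that the server_hostname argument is all it takes to get the hostname into the TLS handshake. The older httplib code paths that neither verify nor send SNI are exactly what keeps the planets working for now.

import socket
import ssl

def fetch_over_sni(host, path='/'):
    context = ssl.create_default_context()  # verifies certificate and hostname
    with socket.create_connection((host, 443), timeout=30) as sock:
        # server_hostname is what ends up in the TLS SNI extension.
        with context.wrap_socket(sock, server_hostname=host) as tls:
            request = 'GET %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n' % (path, host)
            tls.sendall(request.encode('ascii'))
            chunks = []
            while True:
                data = tls.recv(4096)
                if not data:
                    break
                chunks.append(data)
    return b''.join(chunks)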

The Google Reader Exodus and its effect on content publishers

I’m a content publisher, whether I like it or not. This blog is relatively well followed, and I write quite a lot in it. While my hosting provider does not give me grief for my bandwidth usage, optimizing it is something I’m always keen on, especially since I have been Slashdotted once before. This is one of the reasons why my ModSecurity Ruleset validates and filters crawlers as much as spammers.

Blogs’ feeds, be they RSS or Atom (this blog only supports the latter), are a very neat way to optimize bandwidth: they get you the content of the articles without styles, scripts or images. But they can also be quite big. The average feed for my blog’s articles is 100KiB, which is a fairly big page if you consider that feed readers are supposed to keep polling the blog to check for new items. Luckily for everybody, the authors of HTTP did consider this problem, and solved it with two main features: conditional requests and compressed responses.

Okay, there’s a sense of déjà vu in all of this, because I have already complained about software not using these features even when it’s designed to monitor web pages constantly.

By using conditional requests, even if you poke my blog every fifteen minutes, you won’t use more than 10KiB an hour if no new article has been posted. By using compressed responses, instead of a 100KiB response you’ll only have to download 33KiB. With Google Reader, things were even better: instead of 113 subscribers each requesting the feed, a single request was made by the FeedFetcher on behalf of all of them, and that was it.

But now Google Reader is no more (almost). What happens now? Well, of the 113 subscribers, a few will most likely not re-subscribe to my blog at all. Others have migrated to NewsBlur (35 subscribers); the rest seem to have installed their own feed reader or aggregator, including tt-rss, ownCloud, and so on. This is obvious when looking at the statistics from either AWStats or Munin, both of which show a higher volume of requests and delivered content compared to last month.

I then decided to look into improving bandwidth usage a bit further, among other things by providing WebP alternatives for images, but that did not really work as intended — I have enough material for a rant post or two, so I won’t discuss it now. But while doing so, I found out something else.

One of the changes I made while hoping to use WebP was to serve the image files from a different domain (assets.flameeyes.eu), which means that the access log for the blog, while still not perfect, is decidedly cleaner than before. From there I noticed that a new feed reader had started requesting my blog’s feed every half an hour. Without compression. In full, every time. That’s just shy of 5MiB of traffic per day, but that’s not the worst part. The worst part is that said 5MiB are for a single reader, as the requests come from a commercial, proprietary feed reader webapp.

And this is not the only one! Gwene does the same, even though I sent a pull request to make it use compressed responses, which hasn’t had a single reply. Even Yandex’s new product has the same issue.

While 5MiB/day is not too much taken on its own, my blog’s traffic averages 50-60 MiB/day, so that’s basically 10% of the traffic for less than 1% of the users, just because they do not follow best practices when writing web software. I’ve now added these crawlers to the list of stealth robots, which means that they will receive a “406 Not Acceptable” until they finally implement at least support for compressed responses (which is the easy part in all this).
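The kind of rule involved is not complicated. This is a simplified example in ModSecurity 2.x syntax, not the literal rules from my ruleset (which are more selective about which clients they apply to, and the rule IDs here are made up), denying the feed to anything that does not advertise gzip support:

# Deny the feed to clients that send no Accept-Encoding header at all.
SecRule REQUEST_FILENAME "@streq /articles.atom" \
    "id:430010,phase:1,deny,status:406,msg:'Feed fetcher without compression support',chain"
    SecRule &REQUEST_HEADERS:Accept-Encoding "@eq 0"

# Deny it to clients whose Accept-Encoding does not include gzip.
SecRule REQUEST_FILENAME "@streq /articles.atom" \
    "id:430011,phase:1,deny,status:406,msg:'Feed fetcher does not accept gzip',chain"
    SecRule REQUEST_HEADERS:Accept-Encoding "!@contains gzip"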

This has an unfortunate implication for users of those services who were reading me: they won’t get any new updates. If I were a commercial entity, I couldn’t afford this at all. The big problem, to me, is that with Google Reader going away, I expect more and more issues of this kind to crop up. Even NewsBlur, which is now my feed reader of choice, hasn’t fixed their crawlers yet, which I commented upon before — the code is open source, but I don’t want to deal with Python just yet.

Seriously, why are there so many people who expect to be able to deal with web software and yet have no idea how the web works at all? And I wonder whether anybody expected this kind of fallout from the simple shutdown of a relatively minor service like Google Reader.

Stop inventing a new ontology for each service!

Last month I wrote a post noting who makes use of semantic data on the web, in particular pointing out that Facebook, Google, Readability and Flattr all use different ways to provide context to content: OpenGraph, Schema.org, hNews and their own version of microformats, respectively.

Well, NewsBlur – which, even though I have criticized its HTTP implementation, is still my best suggestion for a Google Reader replacement, if anything because it’s open source even though it’s a premium service – seems to have come up with its own way to get semantic data.

The FAQ for publishers states that you can use one of a number of possible selectors to give NewsBlur an idea of how your content is structured — completely ignoring the fact that schema.org already describes all of that structure, and it would be relatively easy to get the data from it explicitly. Even better, since NewsBlur has a way to show public comments within its interface, it would be possible for it to display the comments on the posts themselves, as they are also tagged and structured with the same ontology. I’ve opened an idea about it — hopefully somebody, if not the author, will feel like implementing this.

But this is far from limited to NewsBlur! While Readability added a special case for my blog so that it actually gets the right data out of it, their content guide still only describes support for the hNews format, even though Schema.org provides all the same data and more. And Flattr, well, still does not seem to care about getting data via semantic information — the best match would be support for the link relation in feeds, which can be autodiscovered, but then I don’t really have an idea of where Flattr would find the metadata to create the “thing” on their side.

Please, all you guys who work on these services — can we all get behind the same ontology, so that we don’t have to add the same information to our pages four times over, increasing their size for no advantage? Please!

Why you should care about your HTTP implementation

So today’s frenzy is all about Google’s dismissal of the Reader service. While I’m also upset about that, I’m afraid I cannot really get into discussing it at this point. On the other hand, I can talk once again about my ModSecurity ruleset, and in particular about the rules that validate HTTP robots all over the Internet.

One of the Google Reader alternatives that are being talked about is NewsBlur — which actually looks cool at first sight, but I (and most other people) don’t seem to be able to try it out yet because their service – I’m not going to call them servers as it seems they at least partially use AWS for hosting – fails to scale.

While I’m pretty sure the load they are receiving right now is exceptional – everybody and their droid are trying to register for the service and import their whole Google Reader subscription list, which then needs to be fetched and added to the database; subscriptions to my blog’s feed went from 5 to 23 in a matter of hours! – there are a few things I can infer from the way it behaves that make me think somebody overlooked the need for a strong HTTP implementation.

First of all, what happened was that I got a report on Twitter that NewsBlur was getting a 403 when fetching my blog, which was obviously caused by my rules’ validation of the request. Looking at my logs, I found out that NewsBlur sends requests with three different User-Agent strings, which suggests they are implemented by three different codepaths altogether:

User-Agent: NewsBlur Feed Fetcher - 5 subscribers - http://www.newsblur.com (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/536.2.3 (KHTML, like Gecko) Version/5.2)
User-Agent: NewsBlur Page Fetcher (5 subscribers) - http://www.newsblur.com (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1 Safari/534.48.3)
User-Agent: NewsBlur Favicon Fetcher - http://www.newsblur.com

The third is the most conspicuous string, because it’s very minimal and does not follow the usual format, using a dash as separator instead of adding the URL in parentheses next to the fetcher name (and version, more on that later).

The other two strings show that they have been taken from the strings reported by Safari on OS X — but, interestingly enough, from two different Safari versions, and one of the two has actually been stripped down as well. This is really silly. While I can understand that they might want to look like Safari when fetching a page to display – mostly because there are bad hacks like PageSpeed that serve different HTML to different browsers, messing up caching – I doubt that is warranted for feeds; and even getting the Safari-targeted HTML might be a bad idea if it’s then displayed by the user in a different browser.

The code that fetches feeds and pages is likely quite different, as can be seen from the full requests. From the feed fetcher:

GET /articles.atom HTTP/1.1
A-Im: feed
Accept-Encoding: gzip, deflate
Connection: close
Accept: application/atom+xml,application/rdf+xml,application/rss+xml,application/x-netcdf,application/xml;q=0.9,text/xml;q=0.2,*/*;q=0.1
User-Agent: NewsBlur Feed Fetcher - 5 subscribers - http://www.newsblur.com (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/536.2.3 (KHTML, like Gecko) Version/5.2)
Host: blog.flameeyes.eu
If-Modified-Since: Tue, 01 Nov 2011 23:36:35 GMT
If-None-Match: "a00c0-18de5-4d10f58aa91b5"

This is very sophisticated fetching code: not only does it properly support compressed responses (the Accept-Encoding header), but it also uses the If-None-Match and If-Modified-Since headers to avoid re-fetching unmodified content. The fact that it’s pointing to November 1st of two years ago is likely because, since then, my ModSecurity ruleset has refused to speak with this fetcher on account of the fake User-Agent string. It also includes a proper Accept header that lists the feed types they prefer over generic XML and other formats.

The A-Im header is not a fake or a bug; it’s actually part of RFC 3229, Delta encoding in HTTP, and stands for Accept-Instance-Manipulation. I had never seen it before, but a quick search turned it up, even though the standardized spelling would be A-IM. Unfortunately, the aforementioned RFC does not define the “feed” instance manipulation, even though it seems to be used in the wild, and I couldn’t find a proper formal description of how it should work. The theory, from what I can tell, is that the blog engine would be able to use the If-Modified-Since header to produce on the spot a custom feed for the fetcher, one that only includes the entries modified since that date. Cool idea; too bad it lacks a standard, as I said.
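Just to make the theory concrete, this is how I imagine such an engine would handle it. This is entirely hypothetical Python (the entry objects with an aware updated timestamp are made up), not tied to any real implementation; RFC 3229 says a delta response should use the 226 (IM Used) status code.

from email.utils import parsedate_to_datetime

def delta_feed(request_headers, all_entries):
    # Fall back to the full feed unless the client asked for the 'feed'
    # instance manipulation and gave us a date to work from.
    since_header = request_headers.get('If-Modified-Since')
    if not since_header or 'feed' not in request_headers.get('A-IM', ''):
        return 200, all_entries

    since = parsedate_to_datetime(since_header)
    changed = [entry for entry in all_entries if entry.updated > since]
    if not changed:
        return 304, []      # nothing changed at all
    return 226, changed     # 226 IM Used: only the changed entries

Which is also why I don’t see how to pull this off with a statically generated blog.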

The request coming in from the page fetcher is drastically different:

GET / HTTP/1.1
Host: blog.flameeyes.eu
Connection: close
Content-Length: 0
Accept-Encoding: gzip, deflate, compress
Accept: */*
User-Agent: NewsBlur Page Fetcher (5 subscribers) - http://www.newsblur.com (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1 Safari/534.48.3)

So we can tell two things from the comparison: this code is older (an earlier version of Safari is being referenced), and it has not received the same care as the feed fetcher (which at least dropped the Safari identifier itself). It’s more than likely that, if libraries are used to send the requests, a completely different library is used here, as this request declares support for the compress encoding, which the feed fetcher does not support (and which, as far as I can tell, is never actually used). It is also much less choosy about the formats it receives, as it accepts whatever you want to give it.

*For the Italian readers: yes, I intentionally picked the word choosy. While I can find Fornero an idiot as much as the next guy, I grew tired of copy-paste statuses on Facebook and comments that she should have said picky. Know your English, instead of complaining about idiocies.*

The lack of If-Modified-Since here does not really mean much, because it’s also possible that they were never able to fetch the page before, and they might have introduced the feature later (even though this code is likely older). But the Content-Length header sticks out like a sore thumb on a GET request, and I would expect it to have been put there by whatever HTTP access library they’re using.

The favicon fetcher is the most naïve of the three, and possibly the code that needs to be cleaned up the most:

GET /favicon.ico HTTP/1.1
Accept-Encoding: identity
Host: blog.flameeyes.eu
Connection: close
User-Agent: NewsBlur Favicon Fetcher - http://www.newsblur.com

Here we start with near protocol violations: no Accept header is provided — especially facepalm-worthy considering that this is where a static list of MIME types would be most useful, to restrict the image formats that will be handled properly! But what trips my rules is that the Accept-Encoding there is not suitable for a bot at all: since it does not accept any compressed response, the rules now respond with a 406 Not Acceptable status code instead of providing the icon.

I can understand that a compressed icon is more than likely not useful — indeed, most images should not be compressed at all when sent over HTTP — but why explicitly refuse it? Especially since the other two fetchers properly support fairly sophisticated HTTP?

All in all, it seems like some of the code in NewsBlur has been bolted on after the fact, and with different levels of care. It might not be the best of times for them to look at the HTTP implementation right now, but I would still suggest it. Pipelining the three requests they need over a single connection – instead of using Connection: close for each – could easily reduce the number of connections to blogs, and that would be very welcome to all the bloggers out there. And sharing the same HTTP code across the fetchers would make it easier for people like me to handle NewsBlur properly.
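True pipelining support is rare in client libraries, but even plain keep-alive reuse of one connection gets most of the benefit. A sketch (not actual NewsBlur code) using Python’s http.client, assuming the server keeps the connection open:

import http.client

def fetch_all(host):
    # One connection for all three resources instead of three
    # connections each torn down with Connection: close.
    conn = http.client.HTTPSConnection(host, timeout=30)
    bodies = {}
    try:
        for path in ('/articles.atom', '/', '/favicon.ico'):
            conn.request('GET', path, headers={'Accept-Encoding': 'gzip'})
            response = conn.getresponse()
            # The body must be read in full before the connection can be reused.
            bodies[path] = response.read()
    finally:
        conn.close()
    return bodies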

I would also like to have a way to validate that a given request actually comes from NewsBlur — like we can do with GoogleBot and other crawlers. Unfortunately, this is not really possible, because they use multiple servers, both on standard hosting and on AWS, both over IPv4 and (possibly, at one point) IPv6, so using FcRDNS is not an option.
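For comparison, this is what forward-confirmed reverse DNS validation looks like when a crawler does support it, GoogleBot-style; just a sketch of mine with the standard library, and the allowed suffixes are only an example:

import socket

def validate_crawler(ip_address, allowed_suffixes=('.googlebot.com', '.google.com')):
    try:
        hostname, _, _ = socket.gethostbyaddr(ip_address)        # reverse lookup
        _, _, forward_ips = socket.gethostbyname_ex(hostname)    # forward lookup
    except (socket.herror, socket.gaierror):
        return False
    # The reverse name must belong to the crawler's domain and resolve
    # back to the address the request actually came from.
    return hostname.endswith(allowed_suffixes) and ip_address in forward_ips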

Oh well, let’s see how this thing pans out.