NewsBlur Review

One of the very common refrains I hear in my circles, probably because they are full of ex-users of it, as well as of Googlers and Xooglers, is that the Internet changed when Google Reader was shut down, and that we will never be able to go back. This is something that I don’t quite buy outright: Google Reader, like most similar tools, was used only by a subset of the general population, while other tools, such as social networks, started being widely used right around the same time.

But among all the moaning about Google Reader not existing anymore, I rarely hear enough willingness to look for alternatives. Sure, there was a lot of noise about options back then, which I previously called the “Google Reader Exodus”, but I rarely hear of much else. I see tweets going by of people wishing that Reader still existed, but I don’t think I have seen many willing to go out of their way to do something about it.

Important aside here: while I did work at Google when Reader was actually shut down, the plan was announced in between me signing my contract and my start date. And obviously it was not something that was decided there and then, but rather a long-term decision taken who knows how long before. So while I was at Google for the “funeral”, I had no say in, or knowledge of, any of it.

Well, the good news is that NewsBlur, which I started using right before the Reader shutdown, is still my favourite tool for this. It’s open source, and it has a hosted service that costs a reasonable $36/year. And it doesn’t even have a referral program, so if you had any suspicion of me shilling, you can drop it now.

So first of all, NewsBlur has enough layout options to look very much like Google Reader “of back then”, before Google+ and before losing the “Shared Stories” feature. Indeed, it supports both its own list of followers/following, and global sharing of stories on the platform. And you don’t even need to be a user to follow what I share on it, since it also automatically creates a blurblog, which you can subscribe to with whatever you want.

I have in the past used IFTTT to integrate further features, including saving stories to Pocket, and sharing stories on Twitter, Facebook, and LinkedIn. Unfortunately while NewsBlur has great integration, IFTTT is now a $4/month service, which does not have nearly enough features for me to consider subscribing to, sorry. So for now I’m talking about direct features only.

In addition to the sharing features, NewsBlur has what is for me one of the killer features: the “Intelligence Trainer”. Which is not any type of machine learning system, but rather a way for you to tell NewsBlur to hide, or highlight, certain content. This is very similar to a feature I would have wanted twelve years ago: filtering. Indeed, this allowed me to hide my own posts from Gentoo Universe – back when I was involved in the project – and to only read Matthew’s blog posts in one of the many Planets he’s syndicated, like I wanted. But there’s much more to it.

I have used this up to this day to hide repetitive posts (e.g. status updates for certain projects being aggregated together with blogs), to stop reading authors that didn’t interest me, or wrote in languages I couldn’t read. But I also used the “highlighting” feature to know when a friend posted on another Planet, or to get information about new releases or tours from metal bands I followed, through some of the dedicated websites’ feeds.

But where this becomes extremely interesting is when you combine it with another feature that nowadays I couldn’t go without, particularly as so much content that used to be available as blogs, sites, and feeds is becoming newsletters: the ability to receive email newsletters and turn them into a feed. I do this for quite a few of them: the Adafruit Python for Microcontrollers newsletter (which admittedly is also available through their blog), the new ticket alerts from a bunch of different venues (admittedly not very useful this year), and Patreon.

And since the intelligence trainer does not need to have tags or authors to go along, but can match a substring in the title (subject), this makes it an awesome tool to filter out certain particular messages from a newsletter. For instance, while I do support a number of creators on Patreon, a few of them share all their public videos as updates — I don’t need to see those in the Patreon feed, as I get them directly at source, so I can hide those particular series from the Patreon feed for myself. And instead, while I can wait for most of the releases, I do want to know quickly if they are giving away a free book, or if there’s a new release from John Scalzi that I missed. And again, the highlighting helps me there: it makes a green counter appear next to the “feed”, that tells me there’s something I want to look at sooner, rather than later.
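To make the title matching concrete, here is a minimal Python sketch of substring-based training. It is not NewsBlur’s implementation: the `classify` function, the rule lists, and the sample subjects are all hypothetical, and the choice that a hide rule wins over a highlight rule is an assumption of this sketch. The only thing taken from the description above is that a rule is a substring of the title (subject) that either hides or highlights a story.

```python
from typing import Iterable


def classify(title: str, hide: Iterable[str], highlight: Iterable[str]) -> str:
    """Return "hidden", "focus" or "neutral" for a story title.

    Matching is a case-insensitive substring test. In this sketch a hide
    rule takes precedence over a highlight rule (an assumption).
    """
    lowered = title.lower()
    if any(rule.lower() in lowered for rule in hide):
        return "hidden"
    if any(rule.lower() in lowered for rule in highlight):
        return "focus"
    return "neutral"


# Made-up newsletter subjects, as they would show up as story titles.
subjects = [
    "New public video: Weekly Series #12",
    "Giving away a free book this week",
    "Monthly update for patrons",
]
hide_rules = ["weekly series"]
highlight_rules = ["free book"]

labels = [classify(subject, hide_rules, highlight_rules) for subject in subjects]
print(labels)
```

Stories labelled "focus" here would be the ones feeding the green counter, while "hidden" ones would simply not show up in the feed.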

As I said the intelligence trainer doesn’t have to use tags — but it can use them if they are there at all. So for instance for this very blog, if I were to post something in Italian and you wouldn’t be able to read it, you could train NewsBlur to hide posts in Italian. Or if you think my opinions are useless, you can just hide those, too.

But this is not where it ends. Besides having an awesome implementation of HTTP, which supports all the bandwidth-saving optimizations I know of, NewsBlur thinks about the user a lot more than Google Reader would have. Whenever you decide to do some spring cleaning of your subscriptions, NewsBlur will email you an OPML file with all of your subscribed feeds as they were before you made the first change (for the day, I think). That way you never risk deleting a subscription without having a way to find it again. And it supports muting sites, so you don’t need to unsubscribe to avoid a high count of unread posts from, say, a frequent flyers’ blog during a pandemic.

Plus it’s extremely tweakable and customizable — you can choose to see the stories as they appear in the feed, or load into a frame the original website linked by the story, or try to extract the story content from the linked site (the “reader mode”).

Overall, I can only suggest to those who keep complaining about Google Reader’s demise that it’s always a good time to join NewsBlur instead.

Planets, Clouds, Python

Half a year ago, I wrote some thoughts about writing a cloud-native feed aggregator. I actually started drawing some ideas of how I would design this myself since, and I even went through the (limited) trouble of having it approved for release. But I have not actually released any code, or to be honest, I have not written any code either. The repository has been sitting idle.

Now, with the Python 2 demise coming soon, and me not interested in keeping around a server nearly only to run Planet Multimedia, I started looking into this again. The first thing that I realized is that I both want to reuse as much existing code as I can, and I want to integrate with “modern” professional technologies such as OpenTelemetry, which I appreciate from work, even if it sounds like overkill.

But that’s where things get complicated: while going full “left-pad” of having a module for literally everything is not something you’ll find me happy about, a quick look at feedparser, probably the most common module to read feeds in Python, shows just how much code is spent trying to cover for old Python versions (before 2.7, even), or to implement minimal-viable-interfaces to avoid mandatory dependencies at all.

Thankfully, as Samuel from NewsBlur pointed out, it’s relatively trivial to just fetch the feed with requests, and then pass it down to feedparser. And since there are integration points for OpenTelemetry and requests, having an instrumented feed fetcher shouldn’t be too hard. That’s going to probably be my first focus when writing Tanuga, next weekend.
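A minimal sketch of that approach might look like the following. It assumes the third-party `requests` and `feedparser` packages are installed; the function names, the User-Agent string, and the returned tuple are hypothetical choices for illustration, and the OpenTelemetry instrumentation is left out entirely.

```python
from typing import Optional


def conditional_headers(etag: Optional[str], last_modified: Optional[str]) -> dict:
    """Build headers for a conditional GET from previously stored validators."""
    headers = {"User-Agent": "example-feed-fetcher/0.1"}  # hypothetical name
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_feed(url: str, etag: Optional[str] = None,
               last_modified: Optional[str] = None):
    """Fetch a feed with requests, then hand the body to feedparser.

    Returns None when the server answers 304 (Not Modified), otherwise a
    tuple of (parsed feed, new ETag, new Last-Modified).
    """
    # Third-party imports are kept local so that the pure helper above can
    # be used (and tested) without them being installed.
    import feedparser
    import requests

    response = requests.get(
        url, headers=conditional_headers(etag, last_modified), timeout=30
    )
    if response.status_code == 304:
        return None
    response.raise_for_status()
    parsed = feedparser.parse(response.content)
    return parsed, response.headers.get("ETag"), response.headers.get("Last-Modified")
```

Keeping the HTTP part in `requests` means the fetcher controls caching headers and timeouts itself, and feedparser is only asked to parse bytes it is given.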

Speaking of NewsBlur, the chat with Samuel also made me realize how much of it is still tied to Python 2. Since I’ve gathered quite a bit of experience in porting to Python 3 at work, I’m trying to find some personal time to contribute smaller fixes to run this in Python 3. The biggest hurdle I’m having right now is to set it up on a VM so that I can start it up in Python 2 to begin with.

Why am I back looking at this pseudo-actively? Well, the main reason is that rawdog is still using Python 2, and that is going to be a major pain with security next year. But it’s also the last non-static website that I run on my own infrastructure, and I really would love to get rid of it entirely. Once I do that, I can at least stop running my own (dedicated or virtual) servers. And that’s going to save me time (and money, but time is the more important one here).

My hope is that once I find a good solution to migrate Planet Multimedia to a Cloud solution, I can move the remaining static websites to other solutions, likely Netlify like I did for my photography page. And after that, I can stop the last remaining server, and be done with sysadmin work outside of my flat. Because honestly, it’s not worth my time to run all of these.

I can already hear a few folks complaining with the usual remarks of “it’s someone else’s computer!” — but the answer is that yes, it’s someone else’s computer, but a computer of someone who’s paid to do a good job with it. This is possibly the only way for me to manage to cut away some time to work on more Open Source software.

“Planets” in the World of Cloud

As I have written recently, I’m trying to reduce the number of servers I directly manage, as it’s getting annoying and, honestly, out of touch with what my peers are doing right now. I already hired another company to run the blog for me, although I do keep access to all its information at hand and can migrate where needed. I also gave Firebase Hosting a try for my tiny photography page, to see if it would be feasible to replace my homepage with that.

But one of the things that I still definitely need a server for is to keep running Planet Multimedia, despite its tiny userbase and dwindling content (if you work in FLOSS multimedia, and you want to be added to the Planet, drop me an email!)

Right now, the Planet is maintained through rawdog, which is a Python script that works locally with no database. This is great to run on a vserver, but in a world where most of the investments and improvements go into Cloud services, that’s not really viable as an option. And to be honest, the fact that this is still using Python 2 worries me more than a little, particularly when the author insists that Python 3 is a different language (it isn’t).

So, I’m now in the market to replace the Planet Multimedia backend with something that is “Cloud native” — that is, designed to be run on some cloud, and possibly lightweight. I don’t really want to start dealing with Kubernetes, running my own PostgreSQL instances, or setting up Apache. I really would like something that looks more like the redirector I blogged about before, or like the stuff I deal with for a living at work. Because it is 2019.

So sketching this “on paper” very roughly, I expect such software to be along the lines of a single binary with a configuration file, that outputs static files that are served by the web server. Kind of like rawdog, but long-running. Changing the configuration would require restarting the binary, but that’s acceptable. No database access is really needed, as caching can be maintained at process level, although that would mean that permanent redirects couldn’t be rewritten in the configuration. So maybe some configuration database would help, but it seems most clouds support some simple unstructured data storage that would solve that particular problem.

From experience with work, I would expect the long running binary to be itself a webapp, so that you can either inspect (read-only) what’s going on, or make changes to the database configuration with it. And it should probably have independent parallel execution of fetchers for the various feeds, that then store the received content into a shared (in-memory only) structure, that is used by the generation routine to produce the output files. It may sound like over-engineering the problem, but that’s a bit of a given for me, nowadays.
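As a rough illustration of that shape, here is a small Python sketch with parallel fetchers writing into a shared in-memory store that a generation routine then reads. Everything in it is hypothetical: the `FeedStore` class, the `refresh` and `render` functions, the example URLs, and the fake fetcher that stands in for real HTTP requests. It runs a single refresh pass, where the long-running binary would do this on a timer.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


class FeedStore:
    """Shared, in-memory only structure holding the latest entries per feed."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._entries: Dict[str, List[str]] = {}

    def update(self, url: str, entries: List[str]) -> None:
        with self._lock:
            self._entries[url] = list(entries)

    def snapshot(self) -> Dict[str, List[str]]:
        # Return a copy so the generator never sees a half-updated view.
        with self._lock:
            return {url: list(items) for url, items in self._entries.items()}


def refresh(store: FeedStore, urls: List[str],
            fetch: Callable[[str], List[str]]) -> None:
    """Run one independent fetcher per feed, in parallel."""

    def worker(url: str) -> None:
        try:
            store.update(url, fetch(url))
        except Exception:
            # A failing feed keeps its previous content and does not
            # prevent the other feeds from being refreshed.
            pass

    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(worker, urls))


def render(store: FeedStore) -> str:
    """Generation routine: turn the current snapshot into static output."""
    lines: List[str] = []
    for url, entries in sorted(store.snapshot().items()):
        lines.append(url)
        lines.extend(f"  - {title}" for title in entries)
    return "\n".join(lines)


def fake_fetch(url: str) -> List[str]:
    """Stand-in for a real fetcher; returns one made-up entry title."""
    return [f"post from {url}"]


store = FeedStore()
refresh(store, ["https://a.example/feed", "https://b.example/feed"], fake_fetch)
page = render(store)
print(page)
```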

To be fair, the part that makes me most uneasy of all is authentication, but Identity-Aware Proxy might be a good solution for this. I have not looked into that, but I have used something similar at work.

I’m explicitly ignoring the serving-side problem: serving static files is a problem that has mostly been solved, and I think all cloud providers have some service that allows you to do that.

I’m not sure if I will be able to work more on this, rather than just providing a sketched-out idea. If anyone knows of something like this already, or feels like giving a try to building this, I’d be happy to help (employer-permitting of course). Otherwise, if I find some time to build stuff like this, I’ll try to get it released as open-source, to build upon.

Tiny Tiny RSS: don’t support Nazi sympathisers

XKCD #1357 — Free Speech

After complaining about the lack of cache hits from feed readers, figuring out why NewsBlur (which was doing the right thing) was affected, and then fixing the problem, I started looking at which other readers remained unfixed. It turned out that about a dozen people used to read my blog using Tiny Tiny RSS, a PHP-based personal feed reader for the web. I say “used to” because, as of 2017-08-17, TT-RSS is banned from accessing anything from my blog via a ModSecurity rule.

The reason why I went to this extent is not merely technical, which is why the title of this post is the way it is. But it all started with me filing requests to support modern HTTP features for feeds, particularly regarding the semantics of permanent redirects, but also about the lack of If-Modified-Since, which allows a significant reduction in the bandwidth usage of a blog [1]. Now, the first response I got about the permanent redirect request was disappointing, but it was a technical answer, so I provided more information. After that?

After that the responses stopped being focused on the technical issues, but rather appeared to be – and that’s not terribly surprising in FLOSS of course – “not my problem”. Except, the answers also came from someone with a Pepe the Frog avatar [2]. And this is August of 2017, when America has been shown to have a real Nazi problem, and willingly associating oneself with the alt-right effectively means being a Nazi sympathiser. The tone of the further answers also shows that it is no mistake or misunderstanding.

You can read the two bugs here: and . Trigger warning: extreme right and ableist views ahead.

While I try to not spend too much time on political activism on my blog, there is a difference between debating whether universal basic income (or even universal health care) is a right or not, and arguing for ethnic cleansing and the death of part of a population. So no, no way I’ll refrain from commenting or throwing a light on this kind of toxic behaviour from developers in the Free Software community. Particularly when they are not even keeping these beliefs to themselves but effectively boasting about them by using a loaded avatar on their official support forum.

So what can you do about this? If you get to read this post, and have subscribed to my blog through TT-RSS, you now know why you don’t get any updates from it. I would suggest you look for a new feed reader. I will as usual suggest NewsBlur, since its implementation is the best one out there. You can set it up by yourself, since it’s open source. Not only will you be cutting your support to Nazi sympathisers, but you will also save bandwidth for the web as a whole, by using a reader that actually implements the protocol correctly.

Update (2017-08-06): as pointed out in the comments by candrewswpi, FreshRSS is another option if you don’t want to set up NewsBlur (which admittedly may be a bit heavy). It uses PHP so it should be easier to migrate given the same or similar stack. It supports at least proper caching, but I’m not sure about the permanent redirects, it needs testing.

You could of course, as the developers said on those bugs, change the User-Agent string that TT-RSS reports, and keep using it to read my blog. But in that case, you’d be supporting Nazi sympathisers. If you don’t mind doing that, I would ask you to do me a favour and stop reading my blog altogether. And maybe reconsider your life choices.

I’ll repeat here that the reason why I’m going to this extent is that there is a huge difference between the political opinions and debates that we can all have, and supporting Nazis. You don’t have to agree with my political point of view to read my blog, you don’t have to agree with me to talk with me or be my friend. But if you are a Nazi sympathiser, you can get lost.

  1. You could try to argue that in this day and age there is no point in worrying about bandwidth, but then you don’t get to ever complain about the existence of CDNs, or the fact that AMP and similar tools are “undemocratizing” the web.
  2. Update (2017-08-03): as many people have asked: no, it’s not just any frog or any Pepe that automatically makes you a Nazi sympathiser. But the avatar was not one of the original illustrations, and the attitude of the commenter made it very clear what their “alignment” was. I mean, if they were fans of the original character, they would probably have the funeral scene as their avatar instead.

Apache, ETag and “Not Modified”

In my previous post on the matter I incorrectly blamed NewsBlur – which I still recommend as the best feed reader I’ve ever used! – for not correctly supporting HTTP features to avoid wasting bandwidth for fetching repeatedly unmodified content.

As Daniel and Samuel pointed out immediately, NewsBlur does support those features, and indeed I even used it as an example four years ago, oops for my memory being terrible that way, and me assuming the behaviour from the logs rather than inspecting the requests. And indeed the requests were not only correct, but matched perfectly what Apache reported:

GET /index.xml HTTP/1.1
Connection: keep-alive
Accept-Encoding: gzip, deflate
Accept: application/atom+xml, application/rss+xml, application/xml;q=0.8, text/xml;q=0.6, */*;q=0.2
User-Agent: NewsBlur Feed Fetcher - 59 subscribers - (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36)
A-IM: feed
If-Modified-Since: Wed, 16 Aug 2017 04:22:52 GMT
If-None-Match: "27dc5-556d73fd7fa43-gzip"

HTTP/1.1 200 OK
Strict-Transport-Security: max-age=31536000; includeSubDomains
Last-Modified: Wed, 16 Aug 2017 04:22:52 GMT
ETag: "27dc5-556d73fd7fa43-gzip"
Accept-Ranges: bytes
Vary: Accept-Encoding
Content-Encoding: gzip
Cache-Control: max-age=1800
Expires: Wed, 16 Aug 2017 18:56:33 GMT
Content-Length: 54071
Keep-Alive: timeout=15, max=99
Connection: Keep-Alive
Content-Type: application/xml

So what is going on here? Well, I started looking around, both because I now felt silly, and because I owed more than just an update on the post and an apology to Samuel. And a few searches later, I found Apache bug #45023 that reports how mod_deflate prevents all 304 responses from being issued. This is a bit misleading (as you can still have them in some situations), but it is indeed what is happening here, and it is a breakage introduced by Apache 2.4.

What’s going on? Well, let’s first start to figure out why I could see some 304, but not from NewsBlur. Willreadit was one of the agents that received 304 responses at least some of the time and in the landing page it says explicitly that it supports If-Modified-Since. In particular, it does not support If-None-Match.

The If-None-Match header in the request compares with the ETag header (Entity Tag) in the response coming from Apache. This header is generally considered opaque, and the client should have no insights in what it is meant to do. The server generally calculates its value based on either a checksum of the file (e.g. md5) or based on file size and last-modified time. On Apache HTTP Server, the FileETag directive is used to define which properties of the served files are used to generate the value provided in the response. The default that I’m using is MTime Size, which effectively means that changing the file in any way causes the ETag to change. The size part might actually be redundant here, since modification time is usually enough for my use cases, but this is the default…
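To see what such a tag carries, the one from the sample response above can be taken apart with a few lines of Python. This is only a sketch under an assumption the post does not state: that with `FileETag MTime Size` Apache 2.4 writes the size in bytes followed by the modification time in microseconds, both in hexadecimal. The `decode_apache_etag` helper is hypothetical, and clients should still treat the tag as opaque.

```python
from email.utils import formatdate
from typing import Tuple


def decode_apache_etag(etag: str) -> Tuple[int, int]:
    """Split a size-and-mtime entity tag into (size in bytes, mtime in seconds).

    Assumes the layout "<hex size>-<hex mtime in microseconds>", optionally
    followed by the "-gzip" suffix that mod_deflate appends.
    """
    value = etag.strip('"')
    if value.endswith("-gzip"):
        value = value[: -len("-gzip")]
    size_hex, mtime_hex = value.split("-")
    size = int(size_hex, 16)
    mtime_seconds = int(mtime_hex, 16) // 1_000_000
    return size, mtime_seconds


size, mtime = decode_apache_etag('"27dc5-556d73fd7fa43-gzip"')
print(size)                             # 163269
print(formatdate(mtime, usegmt=True))   # Wed, 16 Aug 2017 04:22:52 GMT
```

Under that assumption, the decoded time is exactly the Last-Modified value from the same response, and the size is that of the uncompressed file rather than the gzip Content-Length.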

The reason why I’m providing both Last-Modified and ETag headers in the response is that HTTP clients can just as well implement only one of the two methods, rather than both, particularly as they may think that handling ETag is easier as it’s an opaque string, rather than information that can be parsed (but it really should be treated as opaque as well, as noted in RFC 2616). Entity Tags are also more complicated because they can be used to collapse caching of different entities (identified by a URL) within the same space (hostname) by caching proxies. I have lots of doubts that this is actually in use, so I’m not going to consider it a valid one, but your mileage may vary. In particular, since the default uses size and modification time, it ends up always matching the Last-Modified header, for a given entity, and the If-Modified-Since request would be just enough.

But when you provide both If-Modified-Since and If-None-Match, you’re asking for both conditions to be true, and so Apache will validate both. And here is where the problem happens: the -gzip suffix – which you can see in the header of the sample request above – is added at different times in the HTTPD process, and in particular it makes it so that the If-None-Match will never match the generated ETag, because the comparison is with the version without -gzip appended. This makes sense in context, because if you have a shared caching proxy, you may have different user agents that support different compression algorithms. Unfortunately, this effectively makes it so that entity tags disable Not Modified states for all the clients that do care about the tags. Those few clients that received 304 responses from my blog before were just implementing If-Modified-Since, and were getting the right behaviour (which is why I thought the title of the bug was misleading).

So how do you solve this? In the bug I already noted above, there is a suggestion by Joost Dekeijzer to use the following directive in your Apache config:

RequestHeader edit "If-None-Match" '^"((.*)-gzip)"$' '"$1", "$2"'
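The effect of that substitution can be tried outside of Apache with the same regular expression in Python. This is only an illustration: the `add_unsuffixed_tag` helper is hypothetical, and the replacement uses Python’s `\1` and `\2` where the Apache directive uses `$1` and `$2`.

```python
import re

# Same pattern as in the directive: group 1 is the full tag including the
# "-gzip" suffix, group 2 is the tag without it.
PATTERN = re.compile(r'^"((.*)-gzip)"$')


def add_unsuffixed_tag(if_none_match: str) -> str:
    """Rewrite an If-None-Match value to list both forms of the tag."""
    return PATTERN.sub(r'"\1", "\2"', if_none_match)


rewritten = add_unsuffixed_tag('"27dc5-556d73fd7fa43-gzip"')
print(rewritten)  # "27dc5-556d73fd7fa43-gzip", "27dc5-556d73fd7fa43"
```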

This adds a version of the entity tag without the suffix to the list of expected entity tags, which “fools” the server into accepting that the underlying file didn’t change and that there is no need to make any change there. I tested with that and it does indeed fix NewsBlur and a number of other use cases, including browsers! But it has the side effect of possibly poisoning shared caches. Shared caches are not that common, but why risk it? So I decided on a slightly different option:

FileETag None

This disables the generation of Entity Tags for file-based entities (i.e. static files), forcing browsers and feed readers to rely on If-Modified-Since exclusively. If clients only implement If-None-Match semantics, then this second option loses the ability to receive 304 responses. I have actually no idea which clients would do that, since this is the more complicated semantics, but I guess I’ll find out. I decided to give a try to this option for two reasons: it should simplify Apache’s own runtime, because it does not have to calculate these tags at any point now, and because effectively they were encoding only the modification time, which is literally what Last-Modified provides! I had for a while assumed that the tag was calculated based on a (quick and dirty) checksum, instead of just size and modification time, but clearly I was wrong.

There is another problem at this point, though. For this to work correctly, you need to make sure that the modification time of files is consistent with them actually changing. If you’re using a static site generator that produces multiple outputs for a single invocation, which includes both Hugo and FSWS, you would have a problem, because the modification time of every file is now the execution time of the tool (or just about).

The answer to this is to build the output in a “staging” directory and just replace the files that are modified, and rsync sounds perfect for the job. But the more obvious way to do so (rsync -a) will do exactly the opposite of what you want, as it will preserve the timestamp from the source directory, which means it’ll replace the old timestamp with the new one for all files. Instead, what you want to use is rsync -rc: this uses a checksum to figure out which files have changed, and will not preserve the timestamp but rather use the time of the rsync run, which is still okay. Theoretically, I think rsync -ac should work, since it should preserve the timestamp only of the files that were modified, but since the served files are still all meant to have the same permissions, and none are links, I found being minimal made sense.
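For those who want to see the idea spelled out, this is roughly what a checksum-based sync does, sketched in Python. It is not a replacement for rsync and not how rsync is implemented: the `sync_changed` and `file_digest` functions are hypothetical, SHA-256 is an arbitrary choice of checksum, and like plain rsync -rc it never deletes files that disappeared from the staging directory.

```python
import hashlib
import shutil
from pathlib import Path
from typing import List


def file_digest(path: Path) -> str:
    """Checksum of the file content (SHA-256 is an arbitrary choice here)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def sync_changed(staging: Path, live: Path) -> List[str]:
    """Copy only the files whose content differs from the live copy.

    Returns the relative paths that were copied. Unchanged files are not
    touched at all, so they keep their old modification time.
    """
    copied: List[str] = []
    for source in sorted(staging.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(staging)
        target = live / relative
        if target.is_file() and file_digest(target) == file_digest(source):
            continue  # same content: leave the old timestamp alone
        target.parent.mkdir(parents=True, exist_ok=True)
        # copyfile copies content only, so the target gets a fresh mtime
        # instead of inheriting the one from the staging directory.
        shutil.copyfile(source, target)
        copied.append(relative.as_posix())
    return copied
```

Calling `sync_changed(Path("staging"), Path("public"))` after each build (both directory names are placeholders) would leave Last-Modified stable for every page the generator rewrote without actually changing.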

So anyway, I’ll hopefully have some more data soon about the bandwidth saving. I’m also following up with whatever may not be supporting properly If-Modified-Since, and filing bugs for those software/services that allow it.

Update (2017-08-23): since now it’s a few days since I fixed up the Apache configuration, I can confirm that the daily bandwidth used by “viewed hits” (as counted by Awstats) went down to ⅓ of what it used to be, to around 60MB a day. This should be accounting not only for the feed readers now properly getting a 304, but also for browsers of readers who no longer have to fetch the full page when, for instance, replying to comments. Googlebot also is getting a lot more 304, which may actually have an impact on its ability to keep up with the content, so I guess I will report back.

Modern feed readers

Four years ago, I wrote a musing about the negative effects of the Google Reader shutdown on content publishers. Today I can definitely confirm that some of the problems I foretold materialized. Indeed, thanks to the fortuitous fact that people have started posting my blog articles to reddit and hacker news (neither of which I’m fond of, but let’s leave that aside), I can declare that the vast majority of the bandwidth used by my blog is consumed by bots and in particular by feed readers. But let’s start from the start.

This blog is syndicated over a feed, the URL and format of which changed a number of times before, mostly with the software, or with the update of the software. The most recent change was due to switching from Typo to Hugo, and the feed name changing. I could have kept the original feed name, but it made little sense at the time, so instead I set up permanent redirects from the old URLs to the new URLs, as I always do. I say I always do because I still keep even the original URLs working, from when I ran the blog off my home DSL.

Some services and feed reading software know how to deal with permanent redirects correctly, and will (eventually) replace the old feed URL with the new one. For instance NewsBlur will replace URLs after ten fetches replied with a permanent redirect (which is sensible, to avoid accepting a redirection that was set up by mistake and soon rolled back, and to avoid data poisoning attacks). Unfortunately, it seems like this behaviour is extremely rare, and so on August 14th I received over three thousand requests for the old Typo feed URL (admittedly, that was the most persistent URL I used). In addition to that, I also received over 300 requests for the very old Typo /xml/ feeds, of which 122 were still pointing at my old dynamic domain, which is now pointing at the normal domain for my blog. This has been the case now for almost ten years, and yet some people still have subscriptions to those URLs! At least one Liferea and one Akregator are pointing at those URLs.
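The kind of cautious redirect handling described above could be sketched in Python as follows. This is not NewsBlur’s code: the `Subscription` class and the example URLs are hypothetical, and both the requirement that the ten redirects be consecutive and the handling of 308 alongside 301 are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Subscription:
    url: str
    pending_target: Optional[str] = None
    redirect_count: int = 0

    def record_fetch(self, status: int, location: Optional[str] = None,
                     threshold: int = 10) -> None:
        """Adopt a new URL only after `threshold` consecutive permanent
        redirects pointing at the same location."""
        if status in (301, 308) and location:
            if location == self.pending_target:
                self.redirect_count += 1
            else:
                # A different target restarts the count.
                self.pending_target = location
                self.redirect_count = 1
            if self.redirect_count >= threshold:
                self.url = location
                self.pending_target = None
                self.redirect_count = 0
        else:
            # Any other answer, including a temporary redirect, resets the
            # streak, so a redirect rolled back quickly is never adopted.
            self.pending_target = None
            self.redirect_count = 0


old_url = "https://blog.example/old-feed.xml"
new_url = "https://blog.example/index.xml"

sub = Subscription(old_url)
for _ in range(9):
    sub.record_fetch(301, new_url)
url_after_nine = sub.url      # still the old URL
sub.record_fetch(301, new_url)
url_after_ten = sub.url       # now the new URL
```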

But while NewsBlur implements sane semantics for handling permanent redirects, it is far from a perfect implementation. In particular, even though I have brought this up many times, NewsBlur is not actually sending If-Modified-Since or If-None-Match headers, which means it will take a copy of the feed at every request. Even though it does support compressed responses (no fetch of the feed is allowed without compressed responses), NewsBlur is requesting the same URL more than twice an hour, because it seems to have two “sites” described by the same URL. At 50KiB per request, that makes up about 1% of the total bandwidth usage of the blog. To be fair, this is not bad at all, but one has to wonder why they can’t be saving the last modified or etag values. I guess I could install my own instance of NewsBlur and figure out how to do that myself, but who knows when I would find the time for that.

Update (2017-08-16): Turns out that, as Samuel pointed out in the comments and on Twitter, I wrote something untrue. NewsBlur does send the headers, and supports this correctly. The problem is an Apache bug that causes 304 never to be issued when using If-None-Match and mod_deflate.

To be fair, even rawdog, which I use for Planet Multimedia, does not appear to support these properly. Oh and speaking of Planet Multimedia, would someone be interested in providing a more modern template so that Monty’s pictures don’t take over the page, that would be awesome!

There actually are a few other readers that do support these values correctly, and indeed receive 304 (Not Modified) status code most of the time. These include Lighting (somebody appears to be still using it!) and at least yet-another-reader-service — this latter appears to be in beta and being invite only; it’s probably the best HTTP implementation I’ve seen for a service with such a rough website. Indeed the bot landing page points out how it supports If-Modified-Since and gzip-compressed responses. Alas it does not appear to learn from persistent redirects though, so it’s currently fetching my blog’s feed twice, probably because there are at least two subscribers for it.

Also note that supporting If-Modified-Since is a prerequisite for supporting delta feeds which is an interesting way to save even more bandwidth (although I don’t think this is feasible to do with a static website at all).

At the very least it looks like we won the battle for supporting compressed responses. The only 406 (Not Acceptable) responses for the feed URL are for Fever, which is no longer developed or supported. Even Gwene, which I pointed out was hammering my blog last time I wrote about this, is now content to get the compressed version. Unfortunately it does not appear like my pull request was ever merged, which means it’s likely the repository itself is completely out of sync with what is being run.

So in 2017, what is the current state of the art in feed reader support? NewsBlur has recently added support for JSON Feed, which is not particularly exciting – when I read the post I was reminded, by the screenshot of choice there, where I heard of Brent Simmons before: Vesper, which is an interesting connection, but I should not go into that now – but at least shows that Samuel Clay is actually paying attention to the development of the format, even though that development right now appears to be just avoiding XML. Which to be honest is not that bad of an idea: since HTML (even HTML5) does not have to be well-formed XML, you need to provide it as CDATA in an XML feed. And the way you do that makes it very easy to implement it incorrectly.

Also, as I wrote this post I realized what else I would like from NewsBlur: the ability to subscribe to an OPML feed as a folder. I still subscribe to lots of Planets, even though they seem to have lost their charm, but a few people are aggregated in multiple planets and it would make sense to be able to avoid duplicate posts. If I could tell NewsBlur «I want to subscribe to this Planet, aggregate it into a folder», it would be able to tell the duplicated feeds, and mark the posts as read on all of them at the same time. Note that what I’d like is something different from just importing an OPML description of the planet! I would like for the folder to be kept in sync with the OPML feed, so that if new feeds are added, they also get added to the folder, and same for removed feeds. I should probably file that on GetSatisfaction at some point.

More HTTP misbehaviours

Today I have been having some fun: while looking at the backlog on IRCCloud, I found out that it had auto-linked a domain name, which I promptly decided to register with Gandi; unfortunately I couldn’t get two other names I wanted, as they are both already registered. After that I decided to set up Google Analytics to report how many referrals arrive at my websites through some of the many vanity domains I registered over time.

After doing that, I spent some time staring at the web server logs to make sure that everything was okay, and I found out some more interesting things: it looks like a lot of people have been fetching my blog Atom feed through very bad feed readers. This is the reification of my forecast last year when Google Reader got shut down.

Some of the fetchers are open source, so I ended up opening issues for them, but that is not the case for all of them. And even when they are open source, sometimes they don’t even accept pull requests implementing the feature, for whichever reason.

So this post is a bit of a name-and-shame, which can be positive for open-source projects when they can fix things, or negative for closed source services that are trying to replace Google Reader and failing to implement HTTP properly. It will also serve as a warning for my readers from those services, as they’ll stop being able to fetch my feed pretty soon, as I’ll update my ModSecurity rules to stop these people from fetching my blog.

As I noted above, both Stringer and Feedbin fail to properly use compressed responses (gzip compression), which means that they fetch over 90KiB every time instead of just 25KiB. The Stringer devs already reacted and seem to be looking into fixing this very soon. I have no answer from Feedbin yet (but it’s still early), and it worries me for another reason too: it does not do any caching at all. And somebody set up a Feedbin instance at the Prague University that fetches my feed, without compression, without caching, every two minutes. I’m going to blacklist it soon.

Gwene still has not replied to the pull request I sent in October 2012, but on the bright side, it has not fetched my blog in a long time. Feedzirra (now Feedjira), used by IFTTT, still does not enable compressed responses by default, even though it seems to support the option (Stringer is also based on it, it seems).

It’s not just plain feed readers that fail at implementing HTTP. The distributed social network Friendica – which aims at doing a better job than Diaspora at that – also seems to forget about implementing either compressed responses or caching. At least it seems to only fetch my feed every twelve hours. On the other hand, it seems to also get someone’s timeline from Twitter, so when it encounters a link to my blog it first sends a HEAD request, and then fetches the page. Three times. Also uncompressed.

On the side of non-open-source services, FeedWrangler has probably one of the worst implementations of HTTP I’ve ever seen: it does not support compressed responses (90KiB feed), does not do caching (every time!), and while it only fetches at one-hour intervals, it does not understand that a 301 is a permanent redirection, and that there’s no point in keeping around two feed IDs for /articles.rss and /articles.atom (each with one subscriber). That’s 4MiB a day, which is around 2% of the bandwidth my website serves in a day. While this is not an important amount, and I don’t have any limit on the server’s egress, it seems silly that 2% of my bandwidth is consumed by two subscribers, when the site has over a thousand visitors a day.
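The 4MiB figure is easy to check. A quick back-of-the-envelope calculation in Python, using the 90KiB and 25KiB sizes quoted above (the helper name is just for illustration):

```python
KIB_PER_MIB = 1024

def daily_traffic_mib(feed_kib, fetches_per_day, feeds=1):
    """Traffic per day, in MiB, for a fetcher that downloads the full feed every time."""
    return feed_kib * fetches_per_day * feeds / KIB_PER_MIB

# FeedWrangler as described: hourly, two feed IDs, 90KiB uncompressed, no caching.
actual = daily_traffic_mib(90, 24, feeds=2)
# Same schedule with gzip only (25KiB per fetch) and a single feed ID.
compressed = daily_traffic_mib(25, 24, feeds=1)
print(f"{actual:.2f} MiB/day now, {compressed:.2f} MiB/day with gzip and a single feed")
```

That is roughly 4.2MiB against 0.6MiB a day, before even considering conditional requests, which would reduce most of those fetches to a bodiless 304 response.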

But what takes the biscuit is definitely FeedMyInbox: while it fetches only every six hours, it implements neither caching nor compression. And I found it only when looking into the requests coming from bots without a User-Agent header. I’m soon going to blacklist the address these requests come from as well, until they stop being douches and provide a valid user agent string.

They are by far not the only ones though; there is another bot that fetches my feed every three hours that will soon follow the same destiny. But this does not have an obvious service attached to it, so if whatever you’re using to read my blog tells you it can’t fetch my blog anymore, try to figure out if you’re using a douchereader.

Please remember that software on the net should be implemented for collaboration between client and server, not for exploitation. Everybody’s bandwidth gets worse when you heavily use a service that is not doing its job at optimizing bandwidth usage.

Planets, feeds and blogs

You have probably noticed that last month I replaced Harvester with rawdog for Planet Multimedia. The reason is easy to explain: Harvester requires libraries that only work with Ruby 1.8, and while moving to Ruby 1.9 or 2 would mean being able to use the feedfetcher (the same one used by IFTTT), my attempts at updating the code to work with a more modern version of Ruby have all been failures.

Since I did not intend to be swamped with one more custom tool to maintain I turned to another standard tool to implement the aggregator, rawdog — holding my nose on the use of darcs for source control, and the name (please don’t google for it without safe search on, at work). The nice part about using this tool is that it’s packaged in Gentoo already, so it’s handled straight by portage with binary packages. Unfortunately, the default templates are terrible, and the settings non-obvious, but Luca was able to make the best out of it.

But more and more problems became obvious with time. The first is that the tool does not respect the convention for return codes at exit: it always returns zero (success) even if the processing was incomplete. It took me two weeks to figure out that the script failed when running in cron because the environment lacked the locale settings, as the cron logs said that everything was alright, and since I use fcron, it also did not send me any email, as I set it to mail me only for errors.
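One way to work around a tool like this is to wrap it in a stricter runner. The following is only a sketch, and it rests on an assumption I have not verified for rawdog: that errors are at least written to standard error. The `run_strict` name is made up.

```python
import subprocess
import sys

def run_strict(command):
    """Run command; return 1 if it failed OR wrote anything to stderr, else 0."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.stderr:
        # Pass the messages along so cron can mail them.
        sys.stderr.write(result.stderr)
    if result.returncode != 0 or result.stderr.strip():
        return 1
    return 0

# Stand-in commands for the demonstration: one quiet, one that complains but exits 0.
quiet = [sys.executable, "-c", "print('ok')"]
noisy = [sys.executable, "-c", "import sys; sys.stderr.write('feed unreachable')"]
print(run_strict(quiet), run_strict(noisy))
```

Calling the aggregator through such a wrapper from cron, and exiting with its return value, would make a "mail only on error" setup actually send the mail.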

A couple of days ago, I got complaints again that the Planet was not updating; again, no error in the cron logs, no error in my email. I ran the command manually, and it told me that Luca’s feed was unreachable. Okay, sure. But then it did not solve itself when the feed came back up. Today I looked back into it, and J-B’s and Rémi’s blog feeds were unreachable. Once again, no non-zero exit status, thus no mail, no error in the logs. This is not the way it should behave.

But that’s not all. The other problem with rawdog is that it does not, by default, support generating a feed for the aggregation, like Harvester (and Planet/Venus) does. I found that Jonathan Riddell actually built a plugin for Planet KDE to generate the feed, but I haven’t tested it yet because I have not found the authoritative source of it, just multiple copies of it on different websites. It also produces RSS feeds, rather than Atom feeds. And I’m sorry to say that I much prefer Atom.

So where does that leave us? I’m not going to try fixing rawdog, I’m afraid, mostly because I don’t intend to spend time with darcs. My options are either to go back to Harvester and fix it to not use DBI and to support Ruby 1.9, or to try to adapt parts of NewsBlur – which already deal with aggregating feeds and producing new feeds – to make up an alternative to rawdog. If I am to do something like that, though, I’m most likely going to take my sweet time and make it a web-configurable tool, rather than something that needs to be configured on the command line or with configuration files.

The reason for that is, very simply, that I’m growing fond of doing most of my work in a browser when I can, and this looks like a perfect solution to the problem. Even more so if you can give someone else access to look into it, and if you can avoid storing passwords.

So, any takers to help me with this project?

The Google Reader Exodus and its effect on content publishers

I’m a content publisher, whether I like it or not. This blog is relatively well followed, and I write quite a lot in it. While my hosting provider does not give me grief for my bandwidth usage, optimizing it is something I’m always keen on, especially since I have been Slashdotted once before. This is one of the reasons why my ModSecurity Ruleset validates and filters crawlers as much as spammers.

Blogs’ feeds, be they RSS or Atom (this blog only supports the latter), are a very neat way to optimize bandwidth: they get you the content of the articles without styles, scripts or images. But they can also be quite big. The average feed for my blog’s articles is 100KiB, which is a fairly big page if you consider that feed readers are supposed to keep pinging the blog to check for new items. Luckily for everybody, the authors of HTTP did consider this problem, and solved it with two main features: conditional requests and compressed responses.

Okay there’s a sense of déjà-vu in all of this, because I already complained about software not using the features even when it’s designed to monitor web pages constantly.

By using conditional requests, even if you poke my blog every fifteen minutes, you won’t use more than 10KiB an hour, if no new article has been posted. By using compressed responses, instead of a 100KiB response you’ll just have to download 33KiB. With Google Reader, things were even better: instead of 113 requests for the feed, a single request was made by the FeedFetcher, and that was it.
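For those writing a fetcher, both features fit in a few lines. Here is a minimal sketch with Python's standard library. The `cache` dictionary and the function names are my own invention, and a real reader would add error handling and persistence on top of this.

```python
import gzip
import urllib.error
import urllib.request

def build_request(url, etag=None, last_modified=None):
    """Build a feed request that asks for gzip and revalidates a cached copy."""
    request = urllib.request.Request(url)
    request.add_header("Accept-Encoding", "gzip")
    if etag:
        request.add_header("If-None-Match", etag)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    return request

def decode_body(content_encoding, raw):
    """Undo gzip compression when the server applied it."""
    return gzip.decompress(raw) if content_encoding == "gzip" else raw

def fetch(url, cache):
    """Return the feed body, reusing cache["body"] on 304 Not Modified."""
    request = build_request(url, cache.get("etag"), cache.get("last_modified"))
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            cache["etag"] = response.headers.get("ETag")
            cache["last_modified"] = response.headers.get("Last-Modified")
            cache["body"] = decode_body(response.headers.get("Content-Encoding"), response.read())
    except urllib.error.HTTPError as error:
        # urllib reports 304 as an error; it just means the cached copy is current.
        if error.code != 304:
            raise
    return cache.get("body")
```

Calling `fetch(url, cache)` with the same dictionary on every poll is enough: the first call downloads the compressed feed, and later ones cost only the headers of a 304 until a new article is posted.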

But now Google Reader is no more (almost). What happens now? Well, of the 113 subscribers, a few will most likely not re-subscribe to my blog at all. Others have migrated to NewsBlur (35 subscribers), the rest seem to have installed their own feed reader or aggregator, including tt-rss, owncloud, and so on. This was obvious looking at the statistics from either AWStats or Munin, both showing a higher volume of requests and delivered content compared to last month.

I then decided to look into improving the bandwidth usage a bit more than before, among other things by providing WebP alternatives for images, but that does not really work as intended; I have enough material for a rant post or two, so I won’t discuss it now. But while doing so I found out something else.

One of the changes I made while hoping to use WebP was to serve the image files from a different domain, which meant that the access log for the blog, while still not perfect, is decidedly cleaner than before. From there I noticed that a new feed reader started requesting my blog’s feed every half an hour. Without compression. In full every time. That’s just shy of 5MiB of traffic per day, but that’s not the worst part. The worst part is that said 5MiB are for a single reader, as the requests come from a commercial, proprietary feed reader webapp.

And this is not the only one! Gwene also does the same, even though I sent a pull request to get it to use compressed responses, which hasn’t had a single reply. Even Yandex’s new product has the same issue.

While 5MiB/day is not too much taken on its own, my blog’s traffic averages 50-60 MiB/day, so it’s basically 10% of the traffic for less than 1% of the users, just because they do not follow best practices when writing web software. I’ve now added these crawlers to the list of stealth robots, which means that they will receive a “406 Not Acceptable” unless they finally implement at least support for compressed responses (which is the easy part in all this).

This has an unfortunate implication for users of those services who were reading me, as they won’t get any new updates. If I were a commercial entity, I couldn’t afford this at all. The big problem, to me, is that with Google Reader going away, I expect more and more issues of this kind to crop up. Even NewsBlur, which is now my feed reader of choice, has not fixed their crawlers yet, which I commented upon before; the code is open-source but I don’t want to deal with Python just yet.

Seriously, why are there so many people who expect to be able to deal with web software and yet have no idea how the web works at all? And I wonder if somebody expected this kind of fallout from the simple shut down of a relatively minor service like Google Reader.

Why you should care about your HTTP implementation

So today’s frenzy is all about Google’s dismissal of the Reader service. While I’m also upset about that, I’m afraid I cannot really get into discussing that at this point. On the other hand, I can talk once again of my ModSecurity ruleset and in particular of the rules that validate HTTP robots all over the Internet.

One of the Google Reader alternatives that are being talked about is NewsBlur — which actually looks cool at first sight, but I (and most other people) don’t seem to be able to try it out yet because their service – I’m not going to call them servers as it seems they at least partially use AWS for hosting – fails to scale.

While I’m pretty sure that they are receiving an exceptional amount of load right now, as everybody and their droid are trying to register to the service and import their whole Google Reader subscription list, which then needs to be fetched and added to the database – subscriptions to my blog’s feed went from 5 to 23 in a matter of hours! – there are a few things that I can infer from the way it behaves that make me think that somebody overlooked the need for a strong HTTP implementation.

First of all, what happened was that I got a report on Twitter that NewsBlur was getting a 403 fetching my blog, and that was obviously caused by my rules’ validation of the request. Looking at my logs, I found out that NewsBlur sends requests with three different User-Agents, which suggests that they are implemented by three different codepaths altogether:

User-Agent: NewsBlur Feed Fetcher - 5 subscribers - (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/536.2.3 (KHTML, like Gecko) Version/5.2)
User-Agent: NewsBlur Page Fetcher (5 subscribers) - (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1 Safari/534.48.3)
User-Agent: NewsBlur Favicon Fetcher -

The third is the most conspicuous string, because it’s very minimal and does not follow your average string format, using the dash as separator instead of adding the URL in parentheses next to the fetcher name (and version, more on that later).

The other two strings show that they have been taken from the string reported by Safari on OS X, but interestingly enough from two different Safari versions, and one of the two has actually been stripped as well. This is really silly. While I can understand that they might want to look like Safari when fetching a page to display – mostly because there are bad hacks like PageSpeed that serve different HTML to different browsers, messing up caching – I doubt that is warranted for feeds; and even getting the Safari HTML might be a bad idea if it’s then displayed by the user with a different browser.

The code that fetches feeds and pages is likely quite different, as can be seen from the full requests. From the feed fetcher:

GET /articles.atom HTTP/1.1
A-Im: feed
Accept-Encoding: gzip, deflate
Connection: close
Accept: application/atom+xml,application/rdf+xml,application/rss+xml,application/x-netcdf,application/xml;q=0.9,text/xml;q=0.2,*/*;q=0.1
User-Agent: NewsBlur Feed Fetcher - 5 subscribers - (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/536.2.3 (KHTML, like Gecko) Version/5.2)
If-Modified-Since: Tue, 01 Nov 2011 23:36:35 GMT
If-None-Match: "a00c0-18de5-4d10f58aa91b5"

This is very sophisticated fetching code, as it not only properly supports compressed responses (Accept-Encoding header) but also uses the If-None-Match and If-Modified-Since headers to avoid re-fetching unmodified content. The fact that it’s pointing to November 1st of two years ago is likely due to the fact that since then my ModSecurity ruleset refused to speak with this fetcher, because of the fake User-Agent string. It also includes a proper Accept header that lists the feed types they prefer over the generic XML and other formats.

The A-Im header is not a fake or a bug; it’s actually part of RFC3229 Delta encoding in HTTP and stands for Accept-Instance-Manipulation. I’ve never seen that before, but a quick search turned it up, even though the standardized spelling would be A-IM. Unfortunately, the aforementioned RFC does not define the “feed” manipulator, even though it seems to be used in the wild, and I couldn’t find any proper formal documentation of how it should work. The theory, from what I can tell, is that the blog engine would be able to use the If-Modified-Since header to produce on the spot a custom feed for the fetcher, one that only includes entries that have been modified since that date. Cool idea, too bad it lacks a standard as I said.
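Since there is no formal documentation, the following Python sketch is only my reading of that theory, not a reference implementation. The 226 IM Used status and the IM response header do come from RFC3229; everything else (the function name, the entry tuples, answering 304 when nothing is new) is an assumption.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def select_entries(entries, headers):
    """Return (status, response_headers, entries) for a feed request.

    entries: list of (updated_datetime, entry_id) tuples, all timezone-aware.
    """
    # Header names are case-insensitive: NewsBlur sends "A-Im", the RFC says "A-IM".
    headers = {name.lower(): value for name, value in headers.items()}
    tokens = [token.strip().lower() for token in headers.get("a-im", "").split(",")]
    since = headers.get("if-modified-since")
    if "feed" not in tokens or not since:
        return 200, {}, entries
    cutoff = parsedate_to_datetime(since)
    fresh = [entry for entry in entries if entry[0] > cutoff]
    if not fresh:
        return 304, {}, []
    # 226 IM Used and the IM response header are defined by RFC 3229.
    return 226, {"IM": "feed"}, fresh

entries = [
    (datetime(2011, 10, 1, tzinfo=timezone.utc), "old-post"),
    (datetime(2012, 1, 1, tzinfo=timezone.utc), "new-post"),
]
request = {"A-Im": "feed", "If-Modified-Since": "Tue, 01 Nov 2011 23:36:35 GMT"}
print(select_entries(entries, request))
```

With the headers from the request above, only the entry updated after November 1st would be sent back, instead of the whole feed.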

The request coming in from the page fetcher is drastically different:

GET / HTTP/1.1
Connection: close
Content-Length: 0
Accept-Encoding: gzip, deflate, compress
Accept: */*
User-Agent: NewsBlur Page Fetcher (5 subscribers) - (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1 Safari/534.48.3)

So we can tell two things from the comparison: this code is older (there is an earlier version of Safari being used), and not the same care has been spent as it has been on the feed fetcher (which dropped the Safari identifier itself, at least). It’s more than likely that if libraries are used to send the request, a completely different library is used here, as this request declares support for the compress encoding, not supported by the feed fetcher (and as far as I can tell, never ever used). It also is much less choosy on the formats to receive, as it accepts whatever you want to give it.

*For the Italian readers: yes I intentionally picked the word choosy. While I can find Fornero an idiot as much as the next guy, I grew tired of copy-paste statuses on Facebook and comments that she should have said picky. Know your English, instead of complaining on idiocies.*

The lack of If-Modified-Since here does not really mean much, because it’s also possible that they were never able to fetch the page, as they might have introduced the feature later (even though the code is likely older). But the Content-Length header sticks out like a sore thumb, and I would expect it to have been put there by whatever HTTP access library they’re using.

The favicon fetcher is the one that is the most naïve and possibly the code that needs to be cleaned up the most:

GET /favicon.ico HTTP/1.1
Accept-Encoding: identity
Connection: close
User-Agent: NewsBlur Favicon Fetcher -

Here we start with nigh protocol violations, by not providing an Accept header — especially facepalming considering that this is where a static list of mime types would be the most useful, to restrict the image formats that will be handled properly! But what happens with my rules is that the Accept-Encoding there is not suitable for a bot at all! Since it does not support any compressed response, the code will now respond with a 406 Not Acceptable status code, instead of providing the icon.

I can understand that a compressed icon is more than likely not going to be useful (indeed most images should not be compressed at all to be sent over HTTP), but why should you explicitly refuse it? Especially since the other two fetchers properly support sophisticated HTTP?
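My actual rules are written for ModSecurity, but the logic behind that 406 can be illustrated with a short Python sketch. The function names are made up for the example, and this is a simplification rather than a translation of the ruleset.

```python
def accepted_codings(accept_encoding):
    """Parse Accept-Encoding into the set of codings with a non-zero q-value."""
    codings = set()
    for item in accept_encoding.split(","):
        name, _, params = item.strip().partition(";")
        name, params, q = name.strip().lower(), params.strip(), 1.0
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0
        if name and q > 0:
            codings.add(name)
    return codings

def status_for_bot(accept_encoding):
    """406 for bots that refuse every compressed coding, 200 otherwise."""
    compressed = {"gzip", "deflate", "*"}
    return 200 if accepted_codings(accept_encoding or "") & compressed else 406

# The three Accept-Encoding values sent by the NewsBlur fetchers above.
for header in ("gzip, deflate", "gzip, deflate, compress", "identity"):
    print(header, "->", status_for_bot(header))
```

The feed and page fetchers pass, while the favicon fetcher, which only offers `identity`, gets the 406.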

All in all, it seems like some of the code in NewsBlur has been bolted on after the fact, and with different levels of care. It might not be the best of times for them to look at the HTTP implementation, but I would still suggest it. A single pipelined request of the three components they need – instead of using Connection: close – could easily reduce the number of connections to blogs, and that would be very welcome to all the bloggers out there. And using the same HTTP code would make it easier for people like me to handle NewsBlur properly.

I would also like to have a way to validate that a given request comes from NewsBlur, like we do with GoogleBot and other crawlers. Unfortunately this is not really possible, because they use multiple servers, both on standard hosting and AWS, both on IPv4 and (possibly, one time) IPv6, so using FcRDNS is not an option.
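For reference, this is roughly what FcRDNS (forward-confirmed reverse DNS) validation looks like, as a Python sketch. The resolver arguments exist only so that the demonstration can run without a network, the host names and addresses are made-up documentation values, and `gethostbyname_ex` only covers IPv4.

```python
import socket

def fcrdns_ok(ip, allowed_suffixes, reverse=socket.gethostbyaddr, forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS: the PTR name must match a suffix and resolve back to ip."""
    try:
        hostname = reverse(ip)[0]
        if not hostname.lower().endswith(tuple(allowed_suffixes)):
            return False
        return ip in forward(hostname)[2]
    except OSError:
        # No PTR record, or the name does not resolve: treat as unverified.
        return False

# Offline demonstration with stand-in resolvers (made-up data).
fake_ptr = {"192.0.2.10": "crawler-10.bot.example"}
fake_a = {"crawler-10.bot.example": ["192.0.2.10"]}

def fake_reverse(ip):
    if ip not in fake_ptr:
        raise OSError("no PTR record")
    return fake_ptr[ip], [], [ip]

def fake_forward(name):
    return name, [], fake_a.get(name, [])

print(fcrdns_ok("192.0.2.10", [".bot.example"], fake_reverse, fake_forward))
```

This only works when the crawler's operator controls the reverse DNS of every address it fetches from, which is exactly what is missing when the servers are scattered across hosting providers.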

Oh well, let’s see how this thing pans out.