“Planets” in the World of Cloud

As I have written recently, I’m trying to reduce the amount of servers I directly manage, as it’s getting annoying and, honestly, out of touch with what my peers are doing right now. I already hired another company to run the blog for me, although I do keep access to all its information at hand and can migrate where needed. I also give it a try to use Firebase Hosting for my tiny photography page, to see if it would be feasible to replace my homepage with that.

But one of the things that I still definitely need a server for is keep running Planet Multimedia, despite its tiny userbase and dwindling content (if you work in FLOSS multimedia, and you want to be added to the Planet, drop me an email!)

Right now, the Planet is maintained through rawdog, which is a Python script that works locally with no database. This is great to run on a vserver, but in a word where most of the investments and improvements go on Cloud services, that’s not really viable as an option. And to be honest, the fact that this is still using Python 2 worries me no little, particularly when the author insists that Python 3 is a different language (it isn’t).

So, I’m now in the market to replace the Planet Multimedia backend with something that is “Cloud native” — that is, designed to be run on some cloud, and possibly lightweight. I don’t really want to start dealing with Kubernetes, running my own PostgreSQL instances, or setting up Apache. I really would like something that looks more like the redirector I blogged about before, or like the stuff I deal with for a living at work. Because it is 2019.

So sketching this “on paper” very roughly, I expect such a software to be along the lines of a single binary with a configuration file, that outputs static files that are served by the web server. Kind of like rawdog, but long-running. Changing the configuration would require restarting the binary, but that’s acceptable. No database access is really needed, as caching can be maintained to process level — although that would men that permanent redirects couldn’t be rewritten in the configuration. So maybe some configuration database would help, but it seems most clouds support some simple unstructured data storage that would solve that particular problem.

From experience with work, I would expect the long running binary to be itself a webapp, so that you can either inspect (read-only) what’s going on, or make changes to the database configuration with it. And it should probably have independent parallel execution of fetchers for the various feeds, that then store the received content into a shared (in-memory only) structure, that is used by the generation routine to produce the output files. It may sounds like over-engineering the problem, but that’s a bit of a given for me, nowadays.

To be fair, the part that makes me more uneasy of all is authentication, but Identity-Aware Proxy might be a good solution for this. I have not looked into that but used something similar at work.

I’m explicitly ignoring the serving-side problem: serving static files is a problem that has mostly been solved, and I think all cloud providers have some service that allows you to do that.

I’m not sure if I will be able to work more on this, rather than just providing a sketched-out idea. If anyone knows of something like this already, or feels like giving a try to building this, I’d be happy to help (employer-permitting of course). Otherwise, if I find some time to builds stuff like this, I’ll try to get it released as open-source, to build upon.

Tiny Tiny RSS: don’t support Nazi sympathisers


XKCD #1357 — Free Speech

After complaining about the lack of cache hits from feed readers, and figuring out why NewsBlur (that was doing the right thing), and then again fixing the problem, I started looking at what other readers kept being unfixed. It turned out that about a dozen people used to read my blog using Tiny Tiny RSS, a PHP-based personal feed reader for the web. I say “used to” because, as of 2017-08-17, TT-RSS is banned from accessing anything from my blog via ModSecurity rule.

The reason why I went to this extent is not merely technical, which is why you get the title of this blog the way it is. But it all started with me filing requests to support modern HTTP features for feeds, particularly regarding the semantics of permanent redirects, but also about the lack of If-Modified-Since, which allows significant reduction on the bandwidth usage of a blog1. Now, the first response I got about the permanent redirect request was disappointing but it was a technical answer, so I provided more information. After that?

After that the responses stopped being focused on the technical issues, but rather appear to be – and that’s not terribly surprising in FLOSS of course – “not my problem”. Except, the answers also came from someone with a Pepe the Frog avatar.2 And this is August of 2017, when America shown having a real Nazi problem, and willingly associating themselves to alt-right is effectively being Nazi sympathizers. The tone of the further answers also show that it is no mistake or misunderstanding.

You can read the two bugs here: and . Trigger warning: extreme right and ableist views ahead.

While I try to not spend too much time on political activism on my blog, there is a difference from debating whether universal basic income (or even universal health care) is a right nor not, and arguing for ethnic cleansing and the death of part of a population. So no, no way I’ll refrain from commenting or throwing a light on this kind of toxic behaviour from developers in the Free Software community. Particularly when they are not even holding these beliefs for themselves but effectively boasting them by using a loaded avatar on their official support forum.

So what you can do about this? If you get to read this post, and have subscribed to my blog through TT-RSS, you now know why you don’t get any updates from it. I would suggest you look for a new feed reader. I will as usual suggest NewsBlur, since its implementation is the best one out there. You can set it up by yourself, since it’s open source. Not only you will be cutting your support to Nazi sympathisers, but you also will save bandwidth for the web as a whole, by using a reader that actually implements the protocol correctly.

Update (2017-08-06): as pointed out in the comments by candrewswpi, FreshRSS is another option if you don’t want to set up NewsBlur (which admittedly may be a bit heavy). It uses PHP so it should be easier to migrate given the same or similar stack. It supports at least proper caching, but I’m not sure about the permanent redirects, it needs testing.

You could of course, as the developers said on those bugs, change the User-Agent string that TT-RSS reports, and keep using it to read my blog. But in that case, you’d be supporting Nazi sympathisers. If you don’t mind doing that, I may ask you a favour and stop reading my blog altogether. And maybe reconsider your life choices.

I’ll repeat here that the reason why I’m going to this extent is that there is a huge difference between the political opinions and debates that we can all have, and supporting Nazis. You don’t have to agree with my political point of view to read my blog, you don’t have to agree with me to talk with me or being my friend. But if you are a Nazi sympathiser, you can get lost.


  1. you could try to argue that in this day and age there is no point in worrying about bandwidth, but then you don’t get to ever complain about the existence of CDNs, or the fact that AMP and similar tools are “undemocratizing” the web.
    [return]
  2. Update (2017-08-03): as many people have asked: no it’s not just any frog or any Pepe that automatically makes you a Nazi sympathisers. But the avatar was not one of the original illustrations, and the attitude of the commenter made it very clear what their “alignment” was. I mean, if they were fans of the original character, they would probably have the funeral scene as their avatar instead.
    [return]

Apache, ETag and “Not Modified”

In my previous post on the matter I incorrectly blamed NewsBlur – which I still recommend as the best feed reader I’ve ever used! – for not correctly supporting HTTP features to avoid wasting bandwidth for fetching repeatedly unmodified content.

As Daniel and Samuel pointed out immediately, NewsBlur does support those features, and indeed I even used it as an example four years ago, oops for my memory being terrible that way, and me assuming the behaviour from the logs rather than inspecting the requests. And indeed the requests were not only correct, but matched perfectly what Apache reported:

--6190ee48-B--
GET /index.xml HTTP/1.1
Host: blog.flameeyes.eu
Connection: keep-alive
Accept-Encoding: gzip, deflate
Accept: application/atom+xml, application/rss+xml, application/xml;q=0.8, text/xml;q=0.6, */*;q=0.2
User-Agent: NewsBlur Feed Fetcher - 59 subscribers - http://www.newsblur.com/site/195958/flameeyess-weblog (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36)
A-IM: feed
If-Modified-Since: Wed, 16 Aug 2017 04:22:52 GMT
If-None-Match: "27dc5-556d73fd7fa43-gzip"

--6190ee48-F--
HTTP/1.1 200 OK
Strict-Transport-Security: max-age=31536000; includeSubDomains
Last-Modified: Wed, 16 Aug 2017 04:22:52 GMT
ETag: "27dc5-556d73fd7fa43-gzip"
Accept-Ranges: bytes
Vary: Accept-Encoding
Content-Encoding: gzip
Cache-Control: max-age=1800
Expires: Wed, 16 Aug 2017 18:56:33 GMT
Content-Length: 54071
Keep-Alive: timeout=15, max=99
Connection: Keep-Alive
Content-Type: application/xml

So what is going on here? Well, I started looking around, both because I now felt silly, and because I owed more than just an update on the post and an apology to Samuel. And a few searches later, I found Apache bug #45023 that reports how mod_deflate prevents all 304 responses from being issued. This is a bit misleading (as you can still have them in some situations), but it is indeed what is happening here, and it is a breakage introduced by Apache 2.4.

What’s going on? Well, let’s first start to figure out why I could see some 304, but not from NewsBlur. Willreadit was one of the agents that received 304 responses at least some of the time and in the landing page it says explicitly that it supports If-Modified-Since. In particular, it does not support If-None-Match.

The If-None-Match header in the request compares with the ETag header (Entity Tag) in the response coming from Apache. This header is generally considered opaque, and the client should have no insights in what it is meant to do. The server generally calculates its value based on either a checksum of the file (e.g. md5) or based on file size and last-modified time. On Apache HTTP Server, the FileETag directive is used to define which properties of the served files are used to generate the value provided in the response. The default that I’m using is MTime Size, which effectively means that changing the file in any way causes the ETag to change. The size part might actually be redundant here, since modification time is usually enough for my use cases, but this is the default…

The reason why I’m providing both Last-Modifed and ETag headers in the response is that HTTP client can just as well only implement one of the two methods, rather than both, particularly as they may think that handling ETag is easier as it’s an opaque string, rather than information that can be parsed — but it really should be considered opaquely as well as it’s noted in RFC2616. Entity Tags are also more complicated because they can be used to collapse caching of different entities (identifed by an URL) within the same space (hostname) by caching proxies. I have lots of doubts that this usage is in use, so I’m not going to consider it a valid one, but your mileage may vary. In particular, since the default uses size and modification time, it ends up always matching the Last-Modified header, for a given entity, and the If-Modified-Since request would be just enough.

But when you provide both If-Modified-Since and If-None-Match, you’re asking for both conditions to be true, and so Apache will validate both. And here is where the problem happens: the -gzip suffix – which you can see in the header of the sample request above – is added at different times in the HTTPD process, and in particular it makes it so that the If-None-Match will never match the generated ETag, because the comparison is with the version without -gzip appended. This makes sense in context, because if you have a shared caching proxy, you may have different user agents that support different compression algorithms. Unfortunately, this effectively makes it so that entity tags disable Not Modified states for all the clients that do care about the tags. Those few clients that received 304 responses from my blog before were just implementing If-Modified-Since, and were getting the right behaviour (which is why I thought the title of the bug was misleading).

So how do you solve this? In the bug I already noted above, there is a suggestion by Joost Dekeijzer to use the following directive in your Apache config:

RequestHeader edit "If-None-Match" '^"((.*)-gzip)"$' '"$1", "$2"'

This adds a version of the entity tag without the suffix to the list of expected entity tags, which “fools” the server into accepting that the underlying file didn’t change and that there is no need to make any change there. I tested with that and it does indeed fix NewsBlur and a number of other use cases, including browsers! But it has the side effect of possibly poisoning shared caches. Shared caches are not that common, but why risking it? So I decided onto a slightly different option

FileETag None

This disable the generation of Entity Tags for file-based entities (i.e. static files), forcing browsers and feed readers to rely on If-Modified-Since exclusively. If clients only implement If-None-Match semantics, then this second option loses the ability to receive 304 responses. I have actually no idea which clients would do that, since this is the more complicated semantics, but I guess I’ll find out. I decided to give a try to this option for two reasons: it should simplify Apache’s own runtime, because it does not have to calculate these tags at any point now, and because effectively they were encoding only the modification time, which is literally what Last-Modified provides! I had for a while assumed that the tag was calculated based on a (quick and dirty) checksum, instead of just size and modification time, but clearly I was wrong.

There is another problem at this point, though. For this to work correctly, you need to make sure that the modification time of files is consistent with them actually changing. If you’re using a static site generator that produces multiple outputs for a single invocation, which includes both Hugo and FSWS, you would have a problem, because the modification time of every file is now the execution time of the tool (or just about).

The answer to this is to build the output in a “staging” directory and just replace the files that are modified, and rsync sounds perfect for the job. But the more obvious way to do so (rsync -a) will do exactly the opposite of what you want, as it will preserve the timestamp from the source directory — which mean it’ll replace the old timestamp with the new one for all files. Instead, what you want to use is rsync -rc: this uses a checksum to figure out which files have changed, and will not preserve the timestamp but rather use the timestamp of rsync, which is still okay — theoretically, I think rsync -ac should work, since it should only preserve the timestamp only of the files that were modified, but since the serving files are still all meant to have the same permissions, and none be links, I found being minimal made sense.

So anyway, I’ll hopefully have some more data soon about the bandwidth saving. I’m also following up with whatever may not be supporting properly If-Modified-Since, and filing bugs for those software/services that allow it.

Update (2017-08-23): since now it’s a few days since I fixed up the Apache configuration, I can confirm that the daily bandwidth used by “viewed hits” (as counted by Awstats) went down to ⅓ of what it used to be, to around 60MB a day. This should be accounting not only for the feed readers now properly getting a 304, but also for browsers of readers who no longer have to fetch the full page when, for instance, replying to comments. Googlebot also is getting a lot more 304, which may actually have an impact on its ability to keep up with the content, so I guess I will report back.

Modern feed readers

Four years ago, I wrote a musing about the negative effects of the Google Reader shutdown on content publishers. Today I can definitely confirm that some of the problems I foretold materialized. Indeed, thanks to the fortuitous fact that people have started posting my blog articles to reddit and hacker news (neither of which I’m fond of, but let’s leave that aside), I can declare that the vast majority of the bandwidth used by my blog is consumed by bots and in particular by feed readers. But let’s start from the start.

This blog is syndicated over a feed, the URL and format of which changed a number of times before, mostly with the software, or with the update of the software. The most recent change was due to switching from Typo to Hugo, and the feed name changing. I could have kept the original feed name, but it made little sense at the time, so instead I set up permanent redirects from the old URLs to the new URLs, as I always do. I say I always do because I keep working even the original URLs from when I ran the blog off my home DSL.

Some services and feed reading software know how to deal with permanent redirects correctly, and will (eventually) replace the old feed URL with the new one. For instance NewsBlur will replace URLs after ten fetches replied with a permanent redirect (which is sensible, to avoid accepting a redirection that was set up by mistake and soon rolled back, and to avoid data poisoning attacks). Unfortunately, it seems like this behaviour is extremely rare, and so on August 14th I received over three thousands requests for the old Typo feed URL (admittedly, that was the most persistent URL I used). In addition to that, I also received over 300 requests for the very old Typo /xml/ feeds, of which 122 still pointing at my old dynamic domain, which is now pointing at the normal domain for my blog. This has been the case now for almost ten years, and yet some people still have subscription to those URLs! At least one Liferea and one Akregator pointing at those URLs.

But while NewsBlur implements sane semantics for handling permanent redirects, it is far from a perfect implementation. In particular even though I have brought this up many times, Newsblur is not actually sending If-Modified-Since or If-None-Match headers, which means it will take a copy of the feed at every request. Even though it does support compressed responses (non fetch of the feed is allowed without compressed responses), NewsBlur is requesting the same URL more than twice an hour, because it seems to have two “sites” described by the same URL. At 50KiB per request, that makes up about 1% of the total bandwidth usage of the blog. To be fair, this is not bad at all, but one has to wonder why they can’t be saving the last modified or etag values — I guess I could install my own instance of NewsBlur and figure out how to do that myself, but who knows when I would find the time for that.

Update (2017-08-16): Turns out that, as Samuel pointed out in the comments and on Twitter, I wrote something untrue. NewsBlur does send the headers, and supports this correctly. The problem is an Apache bug that causes 304 never to be issued when using If-None-Match and mod_deflate.

To be fair, even rawdog, which I use for Planet Multimedia, does not appear to support these properly. Oh and speaking of Planet Multimedia, would someone be interested in providing a more modern template so that Monty’s pictures don’t take over the page, that would be awesome!

There actually are a few other readers that do support these values correctly, and indeed receive 304 (Not Modified) status code most of the time. These include Lighting (somebody appears to be still using it!) and at least yet-another-reader-service Willreadit.com — this latter appears to be in beta and being invite only; it’s probably the best HTTP implementation I’ve seen for a service with such a rough website. Indeed the bot landing page points out how it supports If-Modified-Since and gzip-compressed responses. Alas it does not appear to learn from persistent redirects though, so it’s currently fetching my blog’s feed twice, probably because there are at least two subscribers for it.

Also note that supporting If-Modified-Since is a prerequisite for supporting delta feeds which is an interesting way to save even more bandwidth (although I don’t think this is feasible to do with a static website at all).

At the very least it looks like we won the battle for supporting compressed responses. The only 406 (Not Acceptable) responses for the feed URL are for Fever, which is no longer developed or supported. Even Gwene, which I pointed out was hammering my blog last time I wrote about this, is now content to get the compressed version. Unfortunately it does not appear like my pull request was ever merged, which means it’s likely the repository itself is completely out of sync with what is being run.

So in 2017, what is the current state of the art feed reader support? NewsBlur has recently added support for JSON Feed which is not particularly exciting – when I read the post I was reminded, by the screenshot of choice there, where I heard of Brent Simmons before: Vesper, which is an interesting connection, but I should not go into that now – but at least shows that Samuel Clay is actually paying attention to the development of the format — even though that development right now appears to just avoiding XML. Which to be honest is not that bad of an idea: since HTML (even HTML5) does not have to be well-formatted XML, you need to provide it as cdata in an XML feed. And the way you do that makes it very easy to implement it incorrectly.

Also, as I wrote this post I realized what else I would like from NewsBlur: the ability to subscribe to an OPML feed as a folder. I still subscribe to lots of Planets, even though they seem to have lost their charm, but a few people are aggregated in multiple planets and it would make sense to be able to avoid duplicate posts. If I could tell NewsBlur «I want to subscribe to this Planet, aggregate it into a folder», it would be able to tell the duplicated feeds, and mark the posts as read on all of them at the same time. Note that what I’d like is something different from just importing an OPML description of the planet! I would like for the folder to be kept in sync with the OPML feed, so that if new feeds are added, they also get added to the folder, and same for removed feeds. I should probably file that on GetSatisfaction at some point.

More HTTP misbehaviours

Today I have been having some fun: while looking at the backlog on IRCCloud, I found out that it auto-linked Makefile.am which I prompty decided to register it with Gandi — unfortunately I couldn’t get Makefile.in or configure.ac as they are both already registered. After that I decided to set up Google Analytics to report how many referrer arrive to my websites through some of the many vanity domains I registered over time.

After doing that, I spent some time staring at the web server logs to make sure that everything was okay, and I found out some more interesting things: it looks like a lot of people have been fetching my blog Atom feed through very bad feed readers. This is the reification of my forecast last year when Google Reader got shut down.

Some of the fetchers are open source, so I ended up opening issues for them, but that is not the case for all of them. And even when they are open source, sometimes they don’t even accept pull requests implementing the feature, for whichever reason.

So this post is a bit of a name-and-shame, which can be positive for open-source projects when they can fix things, or negative for closed source services that are trying to replace Google Reader and failing to implement HTTP properly. It will also serve as a warning for my readers from those services, as they’ll stop being able to fetch my feed pretty soon, as I’ll update my ModSecurity rules to stop these people from fetching my blog.

As I noted above, both Stringer and Feedbin fail to properly use compressed responses (gzip compression), which means that they fetch over 90KiB every turn instead of just 25KiB. The Stringer devs already reacted and seem to be looking into fixing this very soon now. Feedbin I have no answer from yet (but it’s pretty soon anyway), but it worries me for another reason too: it does not do any caching at all. And somebody set up a Feedbin instance in the Prague University that fetches my feed, without compression, without caching, every two minutes. I’m going to soon blacklist it.

Gwene still has not replied to the pull request I sent in October 2012, but on the bright side, it has not fetched my blog since a long time ago. Feedzirra (now Feedjira) used by IFTTT still does not enable compressed responses by default, even though it seems to support the option (Stringer is also based on it, it seems).

It’s not just plain feed readers that fail at implementing HTTP. Distributed social network Friendica – that aims at doing a better job than Diaspora at that – seem also to forget about implementing either compressed responses or caching. At least it seems to only fetch my feed every twelve hours. On the other hand, it seems to also get someone’s timeline from Twitter, so when it encounters a link to my blog it first send a HEAD request, and then fetches the page. Three times. Also uncompressed.

On the side of non-open-source services, FeedWrangler has probably one of the worst implementations of HTTP I’ve ever seen: it does not support compressed responses (90KiB feed), does not do caching (every time!), and while it would fetch at one hour intervals, it does not understand that a 301 is a permanent redirection, and there’s no point in keeping around two feed IDs for /articles.rss and /articles.atom (each with one subscriber). That’s 4MiB a day, which is around 2% of the bandwidth my website serves, over a day. While this is not an important amount, and I don’t have limitation on the server’s egress, it seems silly that 2% of my bandwidth is consumed on two subscribers, when the site has over a thousand visitors a day.

But what takes the biscuit is definitely FeedMyInbox: while it fetches only every six hours, it implements neither caching nor compression. And I found it only when looking into the requests coming from bots without a User-Agent header. The requests come from 216.198.247.46 which is svr.feedmyinbox.com. I’m soon also going to blacklist this until they stop being douches and provide a valid user agent string.

They are by far not the only ones though; there is another bot that fetches my feed every three hours that will soon follow the same destiny. But this does not have an obvious service attached to it, so if whatever you’re using to read my blog tells you it can’t fetch my blog anymore, try to figure out if you’re using a douchereader.

Please remember that software on the net should be implemented for collaboration between client and server, not for exploitation. Everybody’s bandwidth gets worse when you heavily use a service that is not doing its job at optimizing bandwidth usage.

Planets, feeds and blogs

You have probably noticed that last month I replaced Harvester with rawdog for Planet Multimedia. The reasons was easy to explain: Harvester requires libraries that only work with Ruby 1.8 — and while on one hand moving to Ruby 1.9 or 2 would mean being able to use the feedfetcher (the same one used by IFTTT), my attempts at updating the code to work with a more modern version of Ruby have been all failures.

Since I did not intend to be swamped with one more custom tool to maintain I turned to another standard tool to implement the aggregator, rawdog — holding my nose on the use of darcs for source control, and the name (please don’t google for it without safe search on, at work). The nice part about using this tool is that it’s packaged in Gentoo already, so it’s handled straight by portage with binary packages. Unfortunately, the default templates are terrible, and the settings non-obvious, but Luca was able to make the best out of it.

But more and more problems got obvious with time. The first is that the tool is does not respect the return codes at exit — it always returns zero (success) even if the processing was incomplete; it took me two weeks to figure out that the script failed when running in cron because the environment lacked the locale settings, as the cron logs said that everything was alright, and since I use fcron, it also did not send me any email, as I set it to mail me only for errors.

A couple of days ago, I got complains again that the Planet was not updating; again, no error in the cron logs, no error in my email. I ran the command manually, and I was told by it that Luca’s feed, on blogs.gentoo.org, was unreachable. Okay, sure. But then it did not solve itself when it came back up. Today I looked back into it and J-B’s and Rémi’s blogs feed were unreachable. Once again, no non-zero exit status, thus no mail, no error in the logs. This is not the way it should behave.

But that’s not enough. the other problem with rawdog is that it does not, by default, support generating a feed for the aggregation, like Harvester (and Planet/Venus) does. I found that Jonathan Riddell actually built a plugin for Planet KDE to generate the feed, but I haven’t tested it yet because I have not found the authoritative source of it, but just multiple copies of it in different websites. It also produces RSS feeds, rather than Atom feeds. And I’m sorry to say but Atom is much preferred, for me.

So where does it leave us? I’m not going to try fixing rawdog I’m afraid. Mostly because I don’t intend spending time with darcs. My options are either go back to Harvester and fix it to not use DBI and support Ruby 1.9, or try to adapt parts of NewsBlur – that already deal with aggregating feeds and producing new feeds – to make up an alternative to rawdog. If I am to do something like that, though, I’m most likely going to take my dear time and make it a web-configurable tool, rather than something that needs to be configured on the command line or with configuration files.

The reason for that is, very simply, that I’m growing fond of doing most of my work on a browser when I can, and this looks like a perfect solution to the problem. Even more so if you can give access to someone else to look into it — and if you can avoid storing passwords.

So, any takers to help me with this project?

The Google Reader Exodus and its effect on content publishers

I’m a content publisher, whether I like it or not. This blog is relatively well followed, and I write quite a lot in it. While my hosting provider does not give me grief for my bandwidth usage, optimizing it is something I’m always keen on, especially since I have been Slashdotted once before. This is one of the reasons why my ModSecurity Ruleset validates and filters crawlers as much as spammers.

Blogs’ feeds, be them RSS or Atom (this blog only supports the latter) are a very neat way to optimize bandwidth: they get you the content of the articles without styles, scripts or images. But they can also be quite big. The average feed for my blog’s articles is 100KiB which is a fairly big page, if you consider that feed readers are supposed to keep pinging the blog to check for new items. Luckily for everybody, the authors of HTTP did consider this problem, and solved it with two main features: conditional requests and compressed responses.

Okay there’s a sense of déjà-vu in all of this, because I already complained about software not using the features even when it’s designed to monitor web pages constantly.

By using conditional requests, even if you poke my blog every fifteen minutes, you won’t use more than 10KiB an hour, if no new article has been posted. By using compressed responses, instead of a 100KiB response you’ll just have to download 33KiB. With Google Reader, things were even better: instead of 113 requests for the feed, a single request was made by the FeedFetcher, and that was it.

But now Google Reader is no more (almost). What happens now? Well, of the 113 subscribers, a few will most likely not re-subscribe to my blog at all. Others have migrated to NewsBlur (35 subscribers), the rest seem to have installed their own feed reader or aggregator, including tt-rss, owncloud, and so on. This was obvious looking at the statistics from either AWStats or Munin, both showing a higher volume of requests and delivered content compared to last month.

I’ve then decided to look into improving the bandwidth a bit more than before, among other things, by providing WebP alternative for images, but that does not really work as intended — I have enough material for a rant post or two so I won’t discuss it now. But while doing so I found out something else.

One of the changes I made while hoping to use WebP is to serve the image files from a different domain (assets.flameeyes.eu) which meant that the access log for the blog, while still not perfect, is decidedly cleaner than before. From there I noticed that a new feed reader started requesting my blog’s feed every half an hour. Without compression. In full every time. That’s just shy of 5MiB of traffic per day, but that’s not the worst part. The worst part is that said 5MiB are for a single reader as the requests come from a commercial, proprietary feed reader webapp.

And this is not the only one! Gwene also does the same, even though I sent a pull request to get it to use compressed responses, which hasn’t had a single reply. Even Yandex’s new product has the same issue.

While 5MiB/day is not too much taken singularly, my blog’s traffic averages on 50-60 MiB/day so it’s basically a 10% traffic for less than 1% of users, just because they do not follow the best practices when writing web software. I’ve now added these crawlers to the list of stealth robots, this means that they will receive a “406 Unacceptable” unless they finally implement at least the compressed responses support (which is the easy part in all this).

This has an unfortunate implication on users of those services that were reading me, who won’t get any new updates. If I was a commercial entity, I couldn’t afford this at all. The big problem, to me, is that with Google Reader going away, I expect more and more of this kind of issues to crop up repeatedly. Even NewsBlur, which is now my feed reader of choice fixed their crawlers yet, which I commented upon before — the code is open-source but I don’t want to deal with Python just yet.

Seriously, why are there so many people who expect to be able to deal with web software and yet have no idea how the web works at all? And I wonder if somebody expected this kind of fallout from the simple shut down of a relatively minor service like Google Reader.

Why you should care about your HTTP implementation

So today’s frenzy is all about Google’s dismissal of the Reader service. While I’m also upset about that, I’m afraid I cannot really get into discussing that at this point. On the other hand, I can talk once again of my ModSecurity ruleset and in particular of the rules that validate HTTP robots all over the Internet.

One of the Google Reader alternatives that are being talked about is NewsBlur — which actually looks cool at first sight, but I (and most other people) don’t seem to be able to try it out yet because their service – I’m not going to call them servers as it seems they at least partially use AWS for hosting – fails to scale.

While I’m pretty sure that it’s an exceptional amount of load they are receiving now as everybody and their droid are trying to register to the service and import their whole Google Reader subscription list, which then needs to be fetched and added to the database, – subscriptions to my blog’s feed went from 5 to 23 in the matter of hours! – there are a few things that I can infer from the way it behaves that makes me think that somebody overlooked the need for a strong HTTP implementation.

First of all what happened was that I got a report on twitter that NewsBlur was getting a 403 fetching my blog, and that was obviously caused by my rules’ validation of the request. Looking at my logs, I found out that NewsBlur sends requests with three different User-Agents, which show a likeliness that they are implemented by three different codepaths altogether:

User-Agent: NewsBlur Feed Fetcher - 5 subscribers - http://www.newsblur.com (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/536.2.3 (KHTML, like Gecko) Version/5.2)
User-Agent: NewsBlur Page Fetcher (5 subscribers) - http://www.newsblur.com (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1 Safari/534.48.3)
User-Agent: NewsBlur Favicon Fetcher - http://www.newsblur.com

The third is the most conspicuous string, because it’s very minimal and does not follow your average string format, using the dash as separator instead of adding the URL in parenthesis next to the fetcher name (and version, more on that later).

The other two strings show that they have been taken by the string reported by Safari on OSX — but interestingly enough from two different Safari version, and one of the two has been actually stripped as well. This is really silly. While I can understand that they might want to look like Safari when fetching a page to display – mostly because there are bad hacks like PageSpeed that serve different HTML to different browsers, messing up caching – I doubt that is warranted for feeds; and even getting the Safari HTML might be a bad idea if then it’s displayed by the user with a different browser.

The code that fetches feeds and pages is likely quite different as it can be seen by the full request. From the feed fetcher:

GET /articles.atom HTTP/1.1
A-Im: feed
Accept-Encoding: gzip, deflate
Connection: close
Accept: application/atom+xml,application/rdf+xml,application/rss+xml,application/x-netcdf,application/xml;q=0.9,text/xml;q=0.2,*/*;q=0.1
User-Agent: NewsBlur Feed Fetcher - 5 subscribers - http://www.newsblur.com (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/536.2.3 (KHTML, like Gecko) Version/5.2)
Host: blog.flameeyes.eu
If-Modified-Since: Tue, 01 Nov 2011 23:36:35 GMT
If-None-Match: "a00c0-18de5-4d10f58aa91b5"

This is a very sophisticated fetching code, as it not only properly supports compressed responses (Accept-Encoding header) but it also uses the If-None-Match and If-Modified-Since headers to not re-fetch an unmodified content. The fact that it’s pointing to November 1st of two years ago is likely due to the fact that since then my ModSecurity ruleset refused to speak with this fetcher, because of the fake User-Agent string. It also includes a proper Accept header that lists the feed types they prefer over the generic XML and other formats.

The A-Im header is not a fake or a bug; it’s actually part of RFC3229 Delta encoding in HTTP and stands for Accept-Instance-Manipulation. I’ve never seen that before, but a quick search turned it out, even though the standardized spelling would be A-IM. Unfortunately, the aforementioned RFC does not define the “feed” manipulator, even though it seems to be used in the wild, and I couldn’t find a proper formal documentation of how it should work. The theory from what I can tell is that the blog engine would be able to use the If-Modified-Since header to produce on the spot a custom feed for the fetcher, that only includes entries that has been modified since that date. Cool idea, too bad it lacks a standard as I said.

The request coming in from the page fetcher is drastically different:

GET / HTTP/1.1
Host: blog.flameeyes.eu
Connection: close
Content-Length: 0
Accept-Encoding: gzip, deflate, compress
Accept: */*
User-Agent: NewsBlur Page Fetcher (5 subscribers) - http://www.newsblur.com (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1 Safari/534.48.3)

So we can tell two things from the comparison: this code is older (there is an earlier version of Safari being used), and not the same care has been spent as it has been on the feed fetcher (which dropped the Safari identifier itself, at least). It’s more than likely that if libraries are used to send the request, a completely different library is used here, as this request declares support for the compress encoding, not supported by the feed fetcher (and as far as I can tell, never ever used). It also is much less choosy on the formats to receive, as it accepts whatever you want to give it.

*For the Italian readers: yes I intentionally picked the word choosy. While I can find Fornero an idiot as much as the next guy, I grew tired of copy-paste statuses on Facebook and comments that she should have said picky. Know your English, instead of complaining on idiocies.*

The lack of If-Modified-Since here does not really mean much, because it’s also possible that they were never able to fetch the page, as they might have introduced the feature later (even though the code is likely older). But the Content-Length header sticks out like a sore thumb, and I would expect to have been put there by whatever HTTP access library they’re using.

The favicon fetcher is the one that is the most naïve and possibly the code that needs to be cleaned up the most:

GET /favicon.ico HTTP/1.1
Accept-Encoding: identity
Host: blog.flameeyes.eu
Connection: close
User-Agent: NewsBlur Favicon Fetcher - http://www.newsblur.com

Here we start with nigh protocol violations, by not providing an Accept header — especially facepalming considering that this is where a static list of mime types would be the most useful, to restrict the image formats that will be handled properly! But what happens with my rules is that the Accept-Encoding there is not suitable for a bot at all! Since it does not support any compressed response, the code will now respond with a 406 Not Acceptable status code, instead of providing the icon.

I can understand that a compressed icon is more than likely to not be useful — indeed most images should not be compressed at all to be sent over HTTP, but why should you explicitly refuse it? Especially since the other two fetches properly support a sophisticated HTTP?

All in all, it seems like some of the code in NewsBlur has been bolted on after the fact, and with different levels of care. It might not be the best of time for them now to look at the HTTP implementation, but I would still suggest for it. A single pipelined request of the three components they need – instead of using Connection: close – could easily reduce the number of connections to blogs, and that would be very welcome to all the bloggers out there. And using the same HTTP code would make it easier for people like me to handle NewsBlur properly.

I would also like to have a way to validate that a given request comes from NewsBlur — like we do with GoogleBot and other crawlers. Unfortunately this is not really possible, because they use multiple servers, both on standard hostings and AWS, both on IPv4 and (possibly, one time) IPv6, so using FcRDNS is not an option.

Oh well, let’s see how this thing pans out.

A déjà vu is usually a glitch in the… Rails?

All of you who read my blog through Gentoo Universe and either Google Reader or Liferea (not sure about other feed readers) will probably have noticed that over the last few months my post appeared duplicated as “new” from time to time. I have had this reported quite a few times but I couldn’t find any reason why that happened; each time the feed looked in order for me.

Thankfully, there are people with more attention to details than me, and Torsten found the one difference between the two feeds: the <id> tag reported blog.flameeyes.eu first and farragut.flameeyes.is-a-geek.org afterwards. You probably remember the latter as my original URL for the blog, and the one that, some time ago, I restored for access without adding further redirects. And that was a bad idea, as it happens.

The problem here is actually multi-layered, and that shows why the problem happens only on Gentoo Universe (and not Planet Gentoo). First of all, while Typo publishes feed for each category and each tag, the default feed (everything), and the Gentoo-specific one are wrapped in FeedBurner; the reason was mostly reducing the bandwidth spent on serving these feeds to subscribers, given that more than a couple of time I had people using broken feed-fetchers with one minute refresh time, and no If-Modified-Since, If-None-Match support which meant serving the same feed, well, too many times.

In this situation, the feed as seen by Planet Gentoo and that seen by any other requester is quite different, as the former is whitelisted not to be redirected to FeedBurner; this made it totally safe. Unfortunately, the same wasn’t true of the English category I use to skip over Italian posts on Gentoo Universe (by policy), so the same feed requested by users was the one given to the Venus software.

Since, starting September, I did allow straight requests without redirection between the old hostnames on the flameeyes.is-a-geek.org domain and the new ones, the requests on the feed for people who subscribed a long time ago appeared with both domains in the requests. Again, this is nothing particularly strange and nothing I’d expect to cause problem, if it wasn’t for Rails.

But before reaching Rails, my first thought was to check out the Typo source code; I was expecting to find there the source for the Atom feeds, and indeed a file that was at some point used to build the Atom feed was there; unfortunately that file is totally unlinked to all the rest nowadays, as ActionPack (2.3.8) takes care of the basis of Atom feed generation. This unfortunately means that it’s too generic to know about the Typo-configured canonical URL for the blog.

Indeed the code generating the start of the feed itself contains this:

        xml.feed(feed_opts) do
          xml.id(options[:id] || "tag:#{request.host},#{options[:schema_date]}:#{request.request_uri.split(".")[0]}")

You can see here what the problem is: request.host will make the feed content change between one request and the other depending on the URL used to reach the application. So what? Given Venus will always request the feed on the same URL it would receive always the same output! Yes that would be the default, if Typo wasn’t being smart slightly more than Rails is and wasn’t thus caching the results. But it is, and rightfully so most of the time because unless I’m posting something, changing the theme, or somebody comments, the blog’s pages are mostly static.

Unfortunately, by caching the results, the saved copy of the feed will cause requests coming from the two domains to have the same tag base: that of the latest request which was cache, which could very well be the “wrong” one. And for software that tracks down the posts by their reported IDs, such as Venus, this is quite a bother.

To solve the problem, so that you’re not further bothered by my posts appearing twice, I’ve restored the redirect and dropped the replies coming from the alias, although that bothers me a bit. As it turns out, it is not safe to run a Rails web application to answer to more than a single vhost, at least as of 2.3.8 (I haven’t upgraded to 2.3.10 because of some trouble with Typo 5, and I sure don’t want to update to Typo 6/Rails 3 any time soon).

Syndication rules

I have noticed that, beside Planet (and Universe) Gentoo and Linux-Planet — both of which I asked for my blog to be syndacated on, there are a number of other websites that seem to fetch and use my own posts, fetched from the atom feeds.

Some of these are community websites, other are “hubs” that supposedly focus content for users; quite a few though seem to be scams and websites that simply take others’ posts and snap a bunch of AdSense units over them. While I could go as far as pushing for DMCA violation for those – as often CC-BY-NC-ND is not respected – I’d rather not, and especially I’d rather go with technical solutions of some kind.

What I would like to do is to have different feeds depending on who’s requesting them; so have a “pure” feed, with the original unmangled content from Typo for the syndications I allow explicitly (with host-based detection is even better), one Flattr-enabled feed to provide to feed readers and Google, and one, eventually, with AdSense units to shove down the abusive websites.

Now, for the moment I set up the two most-subscribed feeds on Google’s FeedBurner; I’d still ask you to keep using my blog’s host URLs though, as that might not be a definitive solution; if I can find some software to deal with this, or even write a Typo plugin to handle this for me, I’ll drop FeedBurner again; I’m just not happy to have to outsource feeds handling for my own blog. I’ll call this an experiment. If you happen to subscribe to the FeedBurner feeds, don’t worry, I’ll make sure to keep them updated, and hopefully without mangling content if I do handle that on my side.

Now, the important information on this post relates to the “rules” for syndicating my content, so that it is clear whether you’ll get the Ad-encumbered feed or not. Obviously if I asked to be syndicated, as is the case for Planet Gentoo and Linux Planet, then you’ll get the absolutely clear feed: no ads, no flattr buttons, nothing else. If the website has no content, a lot of ads, and even worse no indication of where the content truly comes from, I’m going to consider technical means to hinder the syndication, such as pushing ads on the feed myself, or removing the content, or adding a footer making the author and the terms of the content’s license explicit.

For the rest of the websites, I’m generally not going to bother; if you’re a decent news aggregator, and don’t have advertisement as main content on your site, and you abide to the content’s license giving proper attribution rather than showing off my content as your own, I’ve got no problem. I’ll be happy if you were to flattr me and I might decide to push a Flattr button on the content pushed on.. if you wish to avoid that, you’re free to contact me and we can work out the details.. in general I have no objection to that as long as your aggregation software is decent enough to cache responses using If-None-Match and If-Modified-Since, and to use deflated (compressed) content transfer.