Four years ago, I wrote a musing about the negative effects of the Google Reader shutdown on content publishers. Today I can definitely confirm that some of the problems I foretold materialized. Indeed, thanks to the fortuitous fact that people have started posting my blog articles to Reddit and Hacker News (neither of which I’m fond of, but let’s leave that aside), I can declare that the vast majority of the bandwidth used by my blog is consumed by bots and in particular by feed readers. But let’s start from the start.
This blog is syndicated over a feed, the URL and format of which have changed a number of times, mostly when I switched blog software or updated it. The most recent change was due to switching from Typo to Hugo, and the feed name changing with it. I could have kept the original feed name, but it made little sense at the time, so instead I set up permanent redirects from the old URLs to the new URLs, as I always do. I say I always do because I still keep even the original URLs working, from when I ran the blog off my home DSL.
Some services and feed reading software know how to deal with permanent redirects correctly, and will (eventually) replace the old feed URL with the new one. For instance NewsBlur will replace URLs after ten fetches replied with a permanent redirect (which is sensible, to avoid accepting a redirection that was set up by mistake and soon rolled back, and to avoid data poisoning attacks). Unfortunately, it seems like this behaviour is extremely rare, and so on August 14th I received over three thousand requests for the old Typo feed URL (admittedly, that was the most persistent URL I used). In addition to that, I also received over 300 requests for the very old Typo /xml/ feeds, 122 of which were still pointing at my old dynamic domain, which is now pointing at the normal domain for my blog. This has been the case now for almost ten years, and yet some people still have subscriptions to those URLs! At least one Liferea and one Akregator are pointing at those URLs.
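The "replace the URL only after ten permanent redirects in a row" behaviour described above can be sketched roughly as follows. This is only an illustration of the idea, not NewsBlur's actual code: the Subscription class, its field names and the THRESHOLD constant are all hypothetical, and the caller is assumed to fetch without following redirects automatically.

```python
from dataclasses import dataclass
from typing import Optional

PERMANENT_REDIRECTS = {301, 308}
THRESHOLD = 10  # number of consecutive fetches mentioned in the post


@dataclass
class Subscription:
    """Hypothetical per-feed state kept by a reader."""
    url: str
    pending_target: Optional[str] = None
    streak: int = 0

    def record_fetch(self, status: int, location: Optional[str] = None) -> None:
        """Update the state after one fetch of self.url (redirects not followed)."""
        if status in PERMANENT_REDIRECTS and location:
            if location == self.pending_target:
                self.streak += 1
            else:
                # First sighting of this target, or the target changed.
                self.pending_target = location
                self.streak = 1
            if self.streak >= THRESHOLD:
                # The redirect has been stable long enough: adopt the new URL.
                self.url = location
                self.pending_target = None
                self.streak = 0
        else:
            # Any other answer (for example a rolled-back redirect) resets the count.
            self.pending_target = None
            self.streak = 0
```

Resetting the counter on any non-redirect answer is what protects against a redirect that was set up by mistake and rolled back shortly afterwards.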
But while NewsBlur implements sane semantics for handling permanent redirects, it is far from a perfect implementation. In particular, even though I have brought this up many times, NewsBlur is not actually sending If-Modified-Since or If-None-Match headers, which means it will take a full copy of the feed at every request. Even though it does support compressed responses (no fetch of the feed is allowed without compressed responses), NewsBlur is requesting the same URL more than twice an hour, because it seems to have two “sites” described by the same URL. At 50KiB per request, that makes up about 1% of the total bandwidth usage of the blog. To be fair, this is not bad at all, but one has to wonder why they can’t be saving the last modified or etag values. I guess I could install my own instance of NewsBlur and figure out how to do that myself, but who knows when I would find the time for that.
Update (2017-08-16): Turns out that, as Samuel pointed out in the comments and on Twitter, I wrote something untrue. NewsBlur does send the headers, and supports this correctly. The problem is an Apache bug that causes 304 never to be issued when using If-None-Match and mod_deflate.
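To illustrate how a 304 can end up never being issued, here is a deliberately simplified model, assuming the mismatch comes from a "-gzip" suffix that the compression layer appends to the ETag of compressed responses. This is not Apache's code; both function names and the sample ETag values are made up for the example.

```python
def strict_match(if_none_match: str, current_etag: str) -> bool:
    """Compare the validator sent by the client with the ETag of the file as-is."""
    return if_none_match == current_etag


def suffix_aware_match(if_none_match: str, current_etag: str,
                       suffix: str = "-gzip") -> bool:
    """Same comparison, but strip the compression suffix from the client's value first."""
    candidate = if_none_match
    if candidate.endswith(suffix + '"'):
        # Drop the suffix while keeping the closing quote of the ETag.
        candidate = candidate[:-len(suffix) - 1] + '"'
    return candidate == current_etag


file_etag = '"5a1-55f"'        # what the server computes for the static file
sent_back = '"5a1-55f-gzip"'   # what the client stored from the compressed response

print(strict_match(sent_back, file_etag))        # never matches, so always a full 200
print(suffix_aware_match(sent_back, file_etag))  # matches, so a 304 is possible
```

Under that assumption, a client that does everything right still gets the full feed every time, which is consistent with Last-Modified-based revalidation continuing to work.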
To be fair, even rawdog, which I use for Planet Multimedia, does not appear to support these properly. Oh and speaking of Planet Multimedia, would someone be interested in providing a more modern template so that Monty’s pictures don’t take over the page, that would be awesome!
There actually are a few other readers that do support these values correctly, and indeed receive a 304 (Not Modified) status code most of the time. These include Lighting (somebody appears to be still using it!) and at least yet-another-reader-service, Willreadit.com; the latter appears to be in beta and invite-only, and it’s probably the best HTTP implementation I’ve seen for a service with such a rough website. Indeed the bot landing page points out how it supports If-Modified-Since and gzip-compressed responses. Alas, it does not appear to learn from permanent redirects, so it’s currently fetching my blog’s feed twice, probably because there are at least two subscribers for it.
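For reference, this is roughly what a well-behaved fetch looks like on the client side: a minimal sketch using only the Python standard library. The CachedFeed type and the function names are assumptions made for this example, not taken from any of the readers mentioned.

```python
import gzip
import urllib.error
import urllib.request
from typing import Dict, NamedTuple, Optional


class CachedFeed(NamedTuple):
    """What the reader remembers from the last successful fetch."""
    body: bytes
    etag: Optional[str]
    last_modified: Optional[str]


def conditional_headers(cached: Optional[CachedFeed]) -> Dict[str, str]:
    """Build request headers, sending back the validators we stored."""
    headers = {"Accept-Encoding": "gzip"}
    if cached is not None:
        if cached.etag:
            headers["If-None-Match"] = cached.etag
        if cached.last_modified:
            headers["If-Modified-Since"] = cached.last_modified
    return headers


def fetch(url: str, cached: Optional[CachedFeed] = None) -> CachedFeed:
    """Fetch the feed, reusing the cached copy when the server answers 304."""
    request = urllib.request.Request(url, headers=conditional_headers(cached))
    try:
        with urllib.request.urlopen(request) as response:
            body = response.read()
            if response.headers.get("Content-Encoding") == "gzip":
                body = gzip.decompress(body)
            return CachedFeed(body,
                              response.headers.get("ETag"),
                              response.headers.get("Last-Modified"))
    except urllib.error.HTTPError as err:
        # urllib reports 304 (Not Modified) as an HTTPError.
        if err.code == 304 and cached is not None:
            return cached
        raise
```

The only state the reader has to keep between fetches is the pair of validators, which is why it is surprising to see them dropped.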
Also note that supporting If-Modified-Since is a prerequisite for supporting delta feeds, which are an interesting way to save even more bandwidth (although I don’t think this is feasible to do with a static website at all).
At the very least it looks like we won the battle for supporting compressed responses. The only 406 (Not Acceptable) responses for the feed URL are for Fever, which is no longer developed or supported. Even Gwene, which I pointed out was hammering my blog last time I wrote about this, is now content to get the compressed version. Unfortunately it does not appear that my pull request was ever merged, which means it’s likely the repository itself is completely out of sync with what is being run.
So in 2017, what is the current state of the art of feed reader support? NewsBlur has recently added support for JSON Feed, which is not particularly exciting (when I read the post I was reminded, by the screenshot of choice there, of where I had heard of Brent Simmons before: Vesper, which is an interesting connection, but I should not go into that now), but at least it shows that Samuel Clay is actually paying attention to the development of the format, even though that development right now appears to be just about avoiding XML. Which to be honest is not that bad of an idea: since HTML (even HTML5) does not have to be well-formed XML, you need to provide it as CDATA in an XML feed. And the way you do that makes it very easy to implement it incorrectly.
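One example of how easy it is to get this wrong: a CDATA section ends at the first "]]>", so HTML containing that sequence silently breaks a naive wrapper. The sketch below shows two ways of embedding HTML safely; the function names are made up for the illustration, and the round trip is checked with the standard library XML parser.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def as_cdata(html: str) -> str:
    """Wrap HTML in CDATA, splitting any "]]>" so it cannot close the section early."""
    return "<![CDATA[" + html.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def as_escaped_text(html: str) -> str:
    """Alternative: no CDATA at all, just escape the markup characters."""
    return escape(html)


html = "<p>Tom & Jerry<br>if (a[b[0]]> 1) { ... }</p>"

for encoded in (as_cdata(html), as_escaped_text(html)):
    element = ET.fromstring("<description>" + encoded + "</description>")
    # Both encodings must give back exactly the original HTML string.
    assert element.text == html
```

A generator that simply concatenates "<![CDATA[", the post body and "]]>" works until the first post that happens to contain the terminator, which is exactly the kind of bug that goes unnoticed for years.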
Also, as I wrote this post I realized what else I would like from NewsBlur: the ability to subscribe to an OPML feed as a folder. I still subscribe to lots of Planets, even though they seem to have lost their charm, but a few people are aggregated in multiple planets and it would make sense to be able to avoid duplicate posts. If I could tell NewsBlur «I want to subscribe to this Planet, aggregate it into a folder», it would be able to tell the duplicated feeds, and mark the posts as read on all of them at the same time. Note that what I’d like is something different from just importing an OPML description of the planet! I would like for the folder to be kept in sync with the OPML feed, so that if new feeds are added, they also get added to the folder, and same for removed feeds. I should probably file that on GetSatisfaction at some point.
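The "keep the folder in sync with the OPML" part could look something like the sketch below. It assumes the Planet publishes a standard OPML file in which each feed is an outline element with an xmlUrl attribute; the function names and the example URLs are hypothetical, and nothing here reflects how NewsBlur stores folders.

```python
import xml.etree.ElementTree as ET
from typing import Set, Tuple


def feed_urls(opml_text: str) -> Set[str]:
    """Collect the feed URLs listed in an OPML document."""
    root = ET.fromstring(opml_text)
    return {outline.attrib["xmlUrl"]
            for outline in root.iter("outline")
            if "xmlUrl" in outline.attrib}


def sync_folder(current: Set[str], opml_text: str) -> Tuple[Set[str], Set[str]]:
    """Return (feeds to add, feeds to remove) to match the OPML again."""
    wanted = feed_urls(opml_text)
    return wanted - current, current - wanted


OPML = """<opml version="2.0">
  <body>
    <outline text="Planet Example">
      <outline type="rss" text="Alice" xmlUrl="https://alice.example/feed.xml"/>
      <outline type="rss" text="Bob" xmlUrl="https://bob.example/atom.xml"/>
    </outline>
  </body>
</opml>"""

folder = {"https://alice.example/feed.xml", "https://carol.example/rss"}
to_add, to_remove = sync_folder(folder, OPML)
print(sorted(to_add))     # feeds newly listed by the Planet
print(sorted(to_remove))  # feeds the Planet no longer lists
```

Since the feeds are kept as a set of URLs, a person aggregated by more than one Planet shows up once, which is the starting point for marking their posts as read everywhere at the same time.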
I was using Planet to aggregate a few feeds I was interested in, including Gentoo Universe, which is how I receive this blog. One thing that got me to shift off Planet was its Unicode handling. It kept buggering up the ó in your surname… and that annoyed me intensely. I’m not sure it’s actually being maintained anymore, so I wound up writing my own using TornadoWeb’s asynchronous framework. (More or less as a teach-myself project… python-requests+jinja2 could do the same job.) https://github.com/sjlongla… was the result, intended to perform the same aggregation function as Planet. By the sounds of things though, it’d be worth my while reading up on OPML and figuring out some mechanism for de-duplicating posts.
To be fair, your feed’s Cache-Control response header suggests fetching your feed every 30 minutes even though you don’t publish more than a few times per week. I actually did a cleaning spree in open source feed readers and fixed bugs and added support for HTTP cache revalidation and delta feeds in every open source feed reader of any note. The situation today, now that updated clients have started reaching end-users, should be much better than it was a year ago. At least when it comes to the various open source clients and services. NewsBlur does indeed send both If-Modified-Since and If-None-Match request headers (HTTP cache revalidation request headers). I’m looking at my logs right now and seeing those headers. NewsBlur even supports delta feed updates. Regarding OPML subscriptions and NewsBlur, you can find NewsBlur on GitHub. Subscribing to OPML feeds and polling them for updates shouldn’t be hard to add. Speaking of HTTP redirects, software is actually expected to update linked references when encountering permanent redirects. Yet almost no software actually treats a permanent redirect like an actual permanent redirect. It’s even in the HTTP specification:
For anyone interested in the topic, I wrote an article titled “Best practices for caching of syndication feeds for feed readers and publishers” that deals with the topics discussed here in greater detail. Lastly, I’d like to add that JSON Feed is a pile of U+1F4A9. The format adds no value over existing formats, and the authors really didn’t know much about content delivery nor about the existing capabilities of HTTP. Which should be a required minimum when proposing a new syndication format.
NewsBlur does support If-Modified-Since and If-None-Match headers: https://github.com/samuelcl…
Thanks for the code pointer! That looks indeed like it should be working fine, I’m not sure why NewsBlur is not getting the 304 status then. I’ve added ModSecurity auditlog for NewsBlur requests, and will try to figure out why the requests appear not to trigger the caching. This is Apache with static files, and as I noted it is handled correctly with some other feed readers, so it feels strange it’s not triggering. I’ll update the blog post once I know exactly what’s going on!
I know you don’t like the JSON Feed spec (neither do I), but can you really argue that the semantics of providing HTML code within an XML feed are clear? I can’t. I would have preferred a more sensible delivery format myself, but that’s a different problem, I think. And yes I know NewsBlur is on GitHub 🙂 It’s one of the reasons why I’ve been a happy customer since the early days 😀 I just don’t know if I’ll have time to implement it myself any time soon 🙂
I’ve updated the post to remove the blame on NewsBlur and put it where it should lie: Apache ☹ I’ll see if the workaround I found on the bug works.
On the proactive side (!) I filed bugs with TT-RSS for implementing what is needed: https://discourse.tt-rss.or… https://discourse.tt-rss.or… And Thomas (who wrote WillReadIt) is already fixing up the permanent redirect handling 🙂
You can work around it with mod_disk_cache. I don’t recall the exact details, but mod_cache runs after mod_deflate in the output chain. That way, the validating etag and the cached etag will match. Otherwise the [stupid] gzip suffix can cause problems with cache revalidation. It was introduced to solve problems, but I do believe it has done more harm than good. You could of course use any other HTTP-aware cache like Varnish.
I’ve just disabled etags altogether for now. Since I was using it with the default (mtime+size), it didn’t really encode anything that last-modified wasn’t encoding already, so this avoids the whole problem. Of course there could be software that only supported etags-based revalidation, but… that would be just as broken.
Speaking of the cache control: turns out that I meant to have it set to 1hr but that was set for when I was providing the atom+rss feeds explicitly; since Hugo just calls it .xml, they get the generic mime-type and then ignore the 1hr setting for the other two, oops! But that’s the request for re-validation rather than for how often to request it anew. Responding 304 every hour would be perfectly fine, particularly if it’s going to pick up new posts more quickly 😉
And the poopy attitude of tt-rss upstream means that software is now banned from fetching from my websites.
He really isn’t a pleasant fellow.
I think the words you’re looking for are “Nazi sympathiser” 😉