Free Idea: structured access logs for Apache HTTPD

This post is part of a series of free ideas that I’m posting on my blog in the hope that someone with more time can implement them. It’s effectively a very sketchy proposal that comes with no design attached, but if you have time you would like to spend learning something new and no idea what to work on, it may be a good fit for you.

I have been commenting on Twitter a bit about the lack of decent tooling to deal with Apache HTTPD’s Combined Log Format (inherited from NCSA). For those who do not know about it, this is the format used by standard access_log files, which record information about each request: the source IP, the time, the requested path, the status code and the User-Agent used.

These logs are useful for debugging but are also consumed by tools such as AWStats to produce useful statistics about the request patterns of a website. I used these extensively when writing my ModSecurity rulesets, and I still keep an eye on them, for instance to report wasteful feed readers.

The files are simple text files, and that makes it easy to act on them: you can use tail and grep, and logrotate needs no special code besides moving the file and reloading Apache to have it re-open the paths. On the other hand, it makes it hard to query for particular values in specific fields, such as getting the list of User-Agent strings present in a log. Some of the suggestions I got over Twitter were to use awk, but as it happens, these logs cannot actually be parsed with a straightforward field separation: the quoted request line, referrer and User-Agent fields can all contain spaces.
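
To make the point concrete, here is a minimal sketch of what correctly splitting a Combined Log Format line actually takes (the sample line is made up). It also ignores the escaping Apache applies to quotes inside those fields, so even this is only an approximation:

import re

# Combined Log Format: %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"
# The quoted request, referrer and User-Agent fields may contain spaces, so a
# regular expression (or a real parser) is needed instead of field splitting.
CLF = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def parse_line(line):
    match = CLF.match(line)
    return match.groupdict() if match else None

sample = ('203.0.113.7 - - [16/Aug/2017:04:22:52 +0000] '
          '"GET /index.xml HTTP/1.1" 200 54071 "-" "Example Reader/1.0"')
print(parse_line(sample)["agent"])  # Example Reader/1.0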

Not having found a good set of tools to handle this format directly, I have been arguing that we should probably start moving away from simple text files to more structured log formats. Indeed, I know there used to be at least some support for logging directly to MySQL and other relational databases, and there is more complicated machinery, often used by companies and startups, that processes these access logs into analysis software and so on. But all of these tend to be high overhead, much more than what I or anyone else with a small personal blog would care to set up.

Instead I think it’s time to start using structured log files. A few people, including thresh from VideoLAN, suggested using JSON to write the log files. This is not a terrible idea, as the format is at least well understood and easy to interface with most other software, but honestly I would prefer something with an actual structure, a schema that can be followed. Of course I don’t mean XML; I would rather suggest a standardized schema for proto3. Part of that, I guess, is because I’m used to using it at work, but also because I like the idea of just defining my schema and having it generate the code to parse the messages.
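
For comparison, this is roughly what the JSON option would look like, one object per line; the field names here are made up for illustration, not a proposed standard:

import json
from datetime import datetime, timezone

# Hypothetical field names, chosen for illustration only.
entry = {
    "timestamp": datetime(2017, 8, 16, 4, 22, 52, tzinfo=timezone.utc).isoformat(),
    "remote_ip": "203.0.113.7",
    "method": "GET",
    "path": "/index.xml",
    "protocol": "HTTP/1.1",
    "status": 200,
    "response_bytes": 54071,
    "referrer": None,
    "user_agent": "Example Reader/1.0",
}

# One JSON object per line keeps the file append-friendly and grep-able.
with open("access_log.json", "a") as log:
    log.write(json.dumps(entry) + "\n")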

Unfortunately, on the protocol buffers side, there is currently no support or library for reading back a sequence of messages from a file. Using a single message with repeated sub-messages would work, but it is not append-friendly: there is no way to just keep writing to a file, truncate it, and resume writing, which is a property a structured log format needs if it is to fit in the space previously occupied by text formats. This is something I don’t usually have to deal with at work, but I would assume that a simple LV (Length-Value) or LVC (Length-Value-Checksum) encoding would be enough to solve this problem.
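
What I have in mind for the framing is something like the following sketch, with a made-up LVC layout (4-byte little-endian length, CRC32, then the serialized message); the checksum is what makes a truncated tail detectable, so writing can simply resume after it:

import struct
import zlib

def append_record(path, payload):
    """Append one serialized message with a length + CRC32 header."""
    with open(path, "ab") as log:  # append-only, so logrotate can just move the file away
        log.write(struct.pack("<II", len(payload), zlib.crc32(payload)))
        log.write(payload)

def read_records(path):
    """Yield payloads back, stopping cleanly at a truncated or corrupted tail."""
    with open(path, "rb") as log:
        while True:
            header = log.read(8)
            if len(header) < 8:
                return
            length, crc = struct.unpack("<II", header)
            payload = log.read(length)
            if len(payload) < length or zlib.crc32(payload) != crc:
                return  # half-written record at the end: ignore it
            yield payload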

But what about the other properties of the current format? Well, the obvious answer is that, assuming your structured log contains at least as much information as the current log (and possibly more), you can always have tools that convert on the fly to the old format. This would, for instance, allow for a special tail-like command and a grep-like command that provide compatibility with the way the files are currently inspected by your friendly sysadmin.

Having more structured information would also allow easier, or deeper, analysis of the logs. For instance, you could log the full set of headers (like ModSecurity does) instead of just the referrer and User-Agent, and customize the output on the conversion side rather than losing the details at write time.

Of course this is just one possible way to solve the problem, and just because I would prefer working with technologies I’m already familiar with, it does not mean I wouldn’t take another format that is similarly low-dependency and easy to deal with. I just think that the change-averse solution of keeping logs in plain text may be counterproductive in this situation.

Apache, ETag and “Not Modified”

In my previous post on the matter I incorrectly blamed NewsBlur – which I still recommend as the best feed reader I’ve ever used! – for not correctly supporting the HTTP features that avoid wasting bandwidth on repeatedly fetching unmodified content.

As Daniel and Samuel pointed out immediately, NewsBlur does support those features, and indeed I even used it as an example four years ago. Oops for my terrible memory, and for assuming the behaviour from the logs rather than inspecting the requests. Indeed, the requests were not only correct, but matched perfectly what Apache reported:

--6190ee48-B--
GET /index.xml HTTP/1.1
Host: blog.flameeyes.eu
Connection: keep-alive
Accept-Encoding: gzip, deflate
Accept: application/atom+xml, application/rss+xml, application/xml;q=0.8, text/xml;q=0.6, */*;q=0.2
User-Agent: NewsBlur Feed Fetcher - 59 subscribers - http://www.newsblur.com/site/195958/flameeyess-weblog (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36)
A-IM: feed
If-Modified-Since: Wed, 16 Aug 2017 04:22:52 GMT
If-None-Match: "27dc5-556d73fd7fa43-gzip"

--6190ee48-F--
HTTP/1.1 200 OK
Strict-Transport-Security: max-age=31536000; includeSubDomains
Last-Modified: Wed, 16 Aug 2017 04:22:52 GMT
ETag: "27dc5-556d73fd7fa43-gzip"
Accept-Ranges: bytes
Vary: Accept-Encoding
Content-Encoding: gzip
Cache-Control: max-age=1800
Expires: Wed, 16 Aug 2017 18:56:33 GMT
Content-Length: 54071
Keep-Alive: timeout=15, max=99
Connection: Keep-Alive
Content-Type: application/xml

So what is going on here? Well, I started looking around, both because I now felt silly, and because I owed more than just an update on the post and an apology to Samuel. A few searches later, I found Apache bug #45023, which reports how mod_deflate prevents 304 responses from being issued. That description is a bit misleading (you can still get them in some situations), but it is indeed what is happening here, and it is a breakage introduced with Apache 2.4.

What’s going on? Let’s first figure out why I could see some 304 responses, just not for NewsBlur. Willreadit was one of the agents that received 304 responses at least some of the time, and its landing page explicitly says that it supports If-Modified-Since. Notably, it does not support If-None-Match.

The If-None-Match header in the request is compared against the ETag (Entity Tag) header in the response coming from Apache. This header is generally considered opaque, and the client should have no insight into what it contains. The server typically calculates its value either from a checksum of the file (e.g. MD5) or from the file size and last-modified time. In the Apache HTTP Server, the FileETag directive defines which properties of the served files are used to generate the value provided in the response. The default I’m using is MTime Size, which effectively means that changing the file in any way causes the ETag to change. The size part might actually be redundant, since modification time is usually enough for my use cases, but this is the default…
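
As a rough illustration only (Apache’s exact formatting and field order differ), an MTime Size tag boils down to something like this, with no content checksum involved:

import os

def mtime_size_etag(path):
    """Approximation of a FileETag MTime Size entity tag (not Apache's exact format)."""
    st = os.stat(path)
    mtime_usec = int(st.st_mtime * 1_000_000)  # sub-second resolution, no hashing of the content
    return '"%x-%x"' % (st.st_size, mtime_usec)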

The reason why I provide both Last-Modified and ETag headers in the response is that HTTP clients may well implement only one of the two methods rather than both, particularly if they think that handling ETag is easier since it’s an opaque string rather than information that needs parsing — though Last-Modified should really be treated opaquely as well, as noted in RFC 2616. Entity tags are also more complicated because caching proxies can use them to collapse caching of different entities (identified by a URL) within the same space (hostname). I doubt this usage is common, so I’m not going to consider it a valid one, but your mileage may vary. In particular, since the default uses size and modification time, the ETag ends up always matching the Last-Modified header for a given entity, and the If-Modified-Since request alone would be just enough.

But when you provide both If-Modified-Since and If-None-Match, you’re asking for both conditions to be true, and so Apache will validate both. And here is where the problem happens: the -gzip suffix – which you can see in the header of the sample request above – is added at a different point of HTTPD’s request processing than where the conditional check happens, so If-None-Match will never match the generated ETag: the comparison uses the version without -gzip appended. The suffix itself makes sense in context, because with a shared caching proxy you may have different user agents that support different compression algorithms. Unfortunately, the effect is that entity tags disable Not Modified responses for all the clients that do care about the tags. The few clients that received 304 responses from my blog before were only implementing If-Modified-Since, and were getting the right behaviour (which is why I thought the title of the bug was misleading).
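
A toy model of the failing comparison; this is not Apache’s code, just the shape of the problem:

# What the core computed for the file, and what mod_deflate delivered to the client:
computed_etag = '"27dc5-556d73fd7fa43"'
delivered_etag = '"27dc5-556d73fd7fa43-gzip"'

def if_none_match_satisfied(request_etags, current_etag):
    # A 304 is only possible when one of the client's tags matches the current one.
    return current_etag in request_etags

# The client faithfully echoes back the tag it was given, and the match fails:
print(if_none_match_satisfied([delivered_etag], computed_etag))  # False -> full 200 response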

So how do you solve this? In the bug I already noted above, there is a suggestion by Joost Dekeijzer to use the following directive in your Apache config:

RequestHeader edit "If-None-Match" '^"((.*)-gzip)"$' '"$1", "$2"'

This adds a version of the entity tag without the suffix to the list of expected entity tags, which “fools” the server into accepting that the underlying file didn’t change and that a 304 is appropriate. I tested it and it does indeed fix NewsBlur and a number of other use cases, including browsers! But it has the side effect of possibly poisoning shared caches. Shared caches are not that common, but why risk it? So I settled on a slightly different option:

FileETag None

This disables the generation of entity tags for file-based entities (i.e. static files), forcing browsers and feed readers to rely on If-Modified-Since exclusively. Clients that only implement If-None-Match semantics lose the ability to receive 304 responses with this second option. I actually have no idea which clients would do that, since it’s the more complicated of the two, but I guess I’ll find out. I decided to give this option a try for two reasons: it should simplify Apache’s own runtime, since it no longer has to calculate these tags at all, and because the tags were effectively encoding only the modification time, which is literally what Last-Modified provides! I had assumed for a while that the tag was calculated from a (quick and dirty) checksum rather than just size and modification time, but clearly I was wrong.

There is another problem at this point, though. For this to work correctly, you need to make sure that the modification time of files is consistent with them actually changing. If you’re using a static site generator that regenerates every output file on a single invocation, which includes both Hugo and FSWS, you have a problem: the modification time of every file becomes the execution time of the tool (or just about).

The answer to this is to build the output in a “staging” directory and only replace the files that were modified, and rsync sounds perfect for the job. But the more obvious way to do so (rsync -a) will do exactly the opposite of what you want, as it preserves the timestamps from the source directory — which means it replaces the old timestamp with the new one for every file. Instead, what you want is rsync -rc: this uses a checksum to figure out which files have changed, and does not preserve the source timestamps, using the time of the transfer instead, which is still okay. Theoretically, I think rsync -ac should work too, since it should preserve the timestamps only of the files that were actually transferred, but since the served files are all meant to have the same permissions, and none of them are links, I found it made sense to keep the options minimal.
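
For the record, the same replace-only-what-changed logic sketched in Python (with placeholder directory names) shows why the unchanged files keep their old Last-Modified:

import hashlib
import pathlib
import shutil

def digest(path):
    return hashlib.sha256(path.read_bytes()).hexdigest()

def deploy(staging_dir, live_dir):
    """Copy only files whose content changed; untouched files keep their old mtime."""
    staging, live = pathlib.Path(staging_dir), pathlib.Path(live_dir)
    for src in staging.rglob("*"):
        if not src.is_file():
            continue
        dst = live / src.relative_to(staging)
        if dst.exists() and digest(dst) == digest(src):
            continue  # identical content: leave the old file (and its timestamp) alone
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(src, dst)  # shutil.copy does not preserve the source mtime

# deploy("public-staging", "public")  # hypothetical directory names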

So anyway, I’ll hopefully have some more data soon about the bandwidth savings. I’m also following up on whatever does not properly support If-Modified-Since, and filing bugs for the software and services that allow it.

Update (2017-08-23): now that it’s been a few days since I fixed up the Apache configuration, I can confirm that the daily bandwidth used by “viewed hits” (as counted by AWStats) went down to a third of what it used to be, around 60MB a day. This accounts not only for feed readers now properly getting a 304, but also for readers’ browsers, which no longer have to fetch the full page when, for instance, replying to comments. Googlebot is also getting a lot more 304s, which may actually have an impact on its ability to keep up with the content, so I guess I will report back.

Free Idea: a filtering HTTP proxy for securing web applications

This post is part of a series of free ideas that I’m posting on my blog in the hope that someone with more time can implement them. It’s effectively a very sketchy proposal that comes with no design attached, but if you have time you would like to spend learning something new and no idea what to work on, it may be a good fit for you.

Going back to a previous topic I wrote about, and the fact that I’m trying to set up a secure WordPress instance, I would like to throw out another idea I won’t have time to implement myself any time soon.

When running complex web applications, such as WordPress, defense-in-depth is a good security practice. This means that in addition to locking down what the code itself can do to the state of the local machine, it also makes sense to limit what it can do to external state and the Internet at large. Indeed, even for an attacker who cannot drop a shell on a remote server, there is value (negative for the world, positive for the attacker) in at least being able to use it for DDoS (e.g. through an amplification attack).

With that in mind, if your app does not require network access at all, or the network dependency can be sacrificed (like I did for Typo), just blocking the user from making outgoing connections with iptables would be enough. The --uid-owner option makes it very easy to figure out who’s trying to open new connections, and thus stop a single user from transmitting unwanted traffic. Unfortunately, this does not always work, because sometimes the application really needs network support. In the case of WordPress, there is a definite need to contact the WordPress servers, both to install plugins and to check if it should self-update.

You could try to limit which hosts the user can access, but that’s not easy to implement right either. Take WordPress as the example still: if you wanted to limit access to the WordPress infrastructure, you would effectively have to allow access to *.wordpress.org, and this can’t really be done in iptables, as far as I know, since those connections go to literal IP addresses. You could rely on FcRDNS to verify the connections, but that can be slow, and if an attacker manages to poison the server’s DNS cache, they are effectively in control of this kind of ACL. I’m ignoring the option of just using “standard” reverse DNS resolution, because in that case you don’t even need to poison DNS; you can just decide what your IP will reverse-resolve to.

So what you need to do is filter at the connection-request level, which is what proxies are designed for. I’ll assume we want a non-terminating proxy (because terminating proxies are hard), but even then the proxy knows the (forward) address the client wants to connect to, so *.wordpress.org becomes a valid ACL to use. And this is something you can actually do relatively easily with Squid, for instance; indeed, this is the whole point of tools such as ufdbguard (which I used to maintain for Gentoo) and the ICP protocol. But Squid is designed primarily as a caching proxy, it’s not lightweight at all, and it can easily become a liability to have in your server stack.

Up to now, what I have done to reduce the attack surface of my webapps is to set them behind tinyproxy, which does not really allow for per-connection ACLs. This only provides isolation against random non-proxied connections, but it’s a starting point. And here is where I want to provide a free idea for anyone who has the time and would like to build better security tools for server-side defense-in-depth.

A server-side proxy for this kind of security usage would have to provide ACLs with both allow and deny lists. You may want to allow all access to *.wordpress.org, but at the same time block all non-TLS-encrypted traffic, to avoid the possibility of a downgrade (given that WordPress silently downgrades requests to api.wordpress.org, which I talked about before).

Even better, such a proxy should have the ability to distinguish the ACLs based on which user (i.e. which webapp) is making the request. The obvious way would be to provide separate usernames to authenticate to the proxy — which again Squid can do, but it’s designed for clients where validating the username and password actually matters. For this use case I would ignore the password altogether and take the username at face value, since the connection should always be local only. I would be even happier if, instead of pseudo-authenticating to the proxy, the proxy could figure out which (local) user the connection came from by inspecting the TCP socket, kind of like querying the ident protocol used to work for IRC.
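
To make the ident-like part less abstract, here is a rough sketch of the two building blocks such a proxy would need on Linux: mapping a loopback connection back to a local UID via /proc/net/tcp, and a per-UID host ACL. The UID and the patterns are made-up examples, and a real proxy would still need the request parsing and forwarding around it:

import fnmatch

# Example ACL: UID of the webapp -> host patterns it may talk to (made up).
ACL = {
    33: ["*.wordpress.org"],
}

def peer_uid(peer_port):
    """Find the UID owning the local IPv4 socket bound to peer_port."""
    with open("/proc/net/tcp") as table:
        next(table)  # skip the header line
        for line in table:
            fields = line.split()
            local_port = int(fields[1].split(":")[1], 16)
            if local_port == peer_port:
                return int(fields[7])  # the uid column
    return None

def connection_allowed(uid, host):
    return any(fnmatch.fnmatch(host, pattern) for pattern in ACL.get(uid, []))

# In the proxy's accept loop, something along these lines:
#   client, (_, port) = listener.accept()
#   uid = peer_uid(port)
#   host = parse_request_target(client)   # hypothetical: read the CONNECT/GET line
#   if not connection_allowed(uid, host):
#       reject(client)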

So to summarise, what I would like is an HTTP(S) proxy that focuses on securing server-side web applications. It does not have to support TLS transport (because it should only accept local connections), nor should it be a terminating proxy. It should support ACLs that allow or deny access to a subset of hosts, possibly per user, without needing a user database of any sort; even better if it can tell by itself which user the connection came from. I’m more than happy if someone tells me this already exists, or, if not, if someone starts writing it… thank you!

Modern feed readers

Four years ago, I wrote a musing about the negative effects of the Google Reader shutdown on content publishers. Today I can definitely confirm that some of the problems I foretold have materialized. Indeed, thanks to the fortuitous fact that people have started posting my blog articles to Reddit and Hacker News (neither of which I’m fond of, but let’s leave that aside), I can state that the vast majority of the bandwidth used by my blog is consumed by bots, and in particular by feed readers. But let’s start from the beginning.

This blog is syndicated over a feed, the URL and format of which have changed a number of times, mostly together with the software or its updates. The most recent change was the switch from Typo to Hugo, which changed the feed name. I could have kept the original feed name, but it made little sense at the time, so instead I set up permanent redirects from the old URLs to the new ones, as I always do. I say I always do because even the original URLs from when I ran the blog off my home DSL still work.

Some services and feed reading software know how to deal with permanent redirects correctly, and will (eventually) replace the old feed URL with the new one. For instance, NewsBlur will replace a URL after ten fetches have replied with a permanent redirect (which is sensible, both to avoid accepting a redirection that was set up by mistake and soon rolled back, and to avoid data poisoning attacks). Unfortunately, this behaviour appears to be extremely rare, and so on August 14th I received over three thousand requests for the old Typo feed URL (admittedly, that was the most persistent URL I used). In addition, I received over 300 requests for the very old Typo /xml/ feeds, of which 122 still pointed at my old dynamic domain, which has been redirecting to the normal domain for my blog for almost ten years now; and yet some people still have subscriptions to those URLs! At least one Liferea and one Akregator are still pointing at them.

But while NewsBlur implements sane semantics for handling permanent redirects, it is far from a perfect implementation. In particular, even though I have brought this up many times, NewsBlur is not actually sending If-Modified-Since or If-None-Match headers, which means it takes a full copy of the feed at every request. It does support compressed responses (no fetch of the feed is allowed without them), but NewsBlur requests the same URL more than twice an hour, because it seems to have two “sites” described by the same URL. At 50KiB per request, that makes up about 1% of the blog’s total bandwidth usage. To be fair, this is not bad at all, but one has to wonder why they can’t save the Last-Modified or ETag values — I guess I could install my own instance of NewsBlur and figure out how to do that myself, but who knows when I would find the time for that.

Update (2017-08-16): Turns out that, as Samuel pointed out in the comments and on Twitter, I wrote something untrue. NewsBlur does send the headers and supports this correctly. The problem is an Apache bug that causes a 304 to never be issued when If-None-Match and mod_deflate are involved.

To be fair, even rawdog, which I use for Planet Multimedia, does not appear to support these properly. Oh, and speaking of Planet Multimedia: if someone were interested in providing a more modern template, so that Monty’s pictures don’t take over the page, that would be awesome!

There actually are a few other readers that do support these values correctly, and indeed receive a 304 (Not Modified) status code most of the time. These include Lighting (somebody appears to be still using it!) and at least yet another reader service, Willreadit.com — the latter appears to be in beta and invite-only; it’s probably the best HTTP implementation I’ve seen for a service with such a rough website. Indeed, the bot’s landing page points out how it supports If-Modified-Since and gzip-compressed responses. Alas, it does not appear to learn from permanent redirects, so it’s currently fetching my blog’s feed twice, probably because it has at least two subscribers.

Also note that supporting If-Modified-Since is a prerequisite for supporting delta feeds, which are an interesting way to save even more bandwidth (although I don’t think this is feasible with a static website at all).
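
For reference, a well-behaved fetcher needs surprisingly little code to use these validators; a minimal sketch with Python’s standard library (the User-Agent string is obviously made up):

import gzip
import urllib.error
import urllib.request

def fetch_feed(url, last_modified=None, etag=None):
    """Conditionally fetch a feed, reusing the validators from the previous poll."""
    request = urllib.request.Request(url, headers={
        "User-Agent": "ExampleReader/0.1 (+https://reader.example.org)",
        "Accept-Encoding": "gzip",
    })
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    if etag:
        request.add_header("If-None-Match", etag)
    try:
        with urllib.request.urlopen(request) as response:
            body = response.read()
            if response.headers.get("Content-Encoding") == "gzip":
                body = gzip.decompress(body)
            # Store these two values alongside the feed for the next fetch.
            return body, response.headers.get("Last-Modified"), response.headers.get("ETag")
    except urllib.error.HTTPError as error:
        if error.code == 304:
            return None, last_modified, etag  # nothing changed, keep the cached copy
        raise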

At the very least it looks like we won the battle for supporting compressed responses. The only 406 (Not Acceptable) responses for the feed URL are for Fever, which is no longer developed or supported. Even Gwene, which I pointed out was hammering my blog last time I wrote about this, is now content to get the compressed version. Unfortunately, it does not appear that my pull request was ever merged, which likely means the repository itself is completely out of sync with what is being run.

So in 2017, what is the state of the art in feed reader support? NewsBlur has recently added support for JSON Feed, which is not particularly exciting – when I read the post, the screenshot of choice there reminded me where I had heard of Brent Simmons before: Vesper, which is an interesting connection, but I should not go into that now – but it at least shows that Samuel Clay is actually paying attention to the development of the format — even though that development right now appears to consist mostly of avoiding XML. Which, to be honest, is not that bad an idea: since HTML (even HTML5) does not have to be well-formed XML, you need to provide it as CDATA in an XML feed, and the way you do that makes it very easy to get wrong.

Also, as I wrote this post, I realized what else I would like from NewsBlur: the ability to subscribe to an OPML feed as a folder. I still subscribe to lots of Planets, even though they seem to have lost their charm, but a few people are aggregated in multiple Planets and it would make sense to avoid duplicate posts. If I could tell NewsBlur «I want to subscribe to this Planet, aggregated into a folder», it could recognize the duplicated feeds and mark the posts as read on all of them at the same time. Note that what I’d like is different from just importing an OPML description of the Planet! I would like the folder to be kept in sync with the OPML feed, so that if new feeds are added they also get added to the folder, and the same for removed feeds. I should probably file that on GetSatisfaction at some point.

More HTTP misbehaviours

Today I have been having some fun: while looking at the backlog on IRCCloud, I found out that it auto-linked Makefile.am, which I promptly decided to register with Gandi — unfortunately I couldn’t get Makefile.in or configure.ac, as they are both already registered. After that, I decided to set up Google Analytics to report how many referrals arrive at my websites through some of the many vanity domains I registered over time.

After doing that, I spent some time staring at the web server logs to make sure that everything was okay, and I found some more interesting things: it looks like a lot of people have been fetching my blog’s Atom feed through very bad feed readers. This is the realization of the forecast I made last year when Google Reader was shut down.

Some of the fetchers are open source, so I ended up opening issues for them, but that is not the case for all of them. And even when they are open source, sometimes they don’t accept pull requests implementing the feature, for whatever reason.

So this post is a bit of a name-and-shame, which can be positive for open-source projects when they can fix things, or negative for closed-source services that are trying to replace Google Reader while failing to implement HTTP properly. It will also serve as a warning for my readers on those services: they’ll stop being able to fetch my feed pretty soon, as I’ll update my ModSecurity rules to stop these fetchers.

As I noted above, both Stringer and Feedbin fail to properly use compressed responses (gzip compression), which means that they fetch over 90KiB each time instead of just 25KiB. The Stringer devs have already reacted and seem to be looking into fixing this very soon. From Feedbin I have no answer yet (but it has only been a short while anyway), and it worries me for another reason too: it does not do any caching at all. And somebody set up a Feedbin instance at the Prague University that fetches my feed, without compression, without caching, every two minutes. I’m going to blacklist it soon.

Gwene still has not replied to the pull request I sent in October 2012, but on the bright side, it has not fetched my blog in a long time. Feedzirra (now Feedjira), used by IFTTT, still does not enable compressed responses by default, even though it seems to support the option (Stringer also appears to be based on it).

It’s not just plain feed readers that fail at implementing HTTP. The distributed social network Friendica – which aims to do a better job than Diaspora at that – also seems to forget to implement either compressed responses or caching. At least it seems to fetch my feed only every twelve hours. On the other hand, it also seems to fetch someone’s timeline from Twitter, so when it encounters a link to my blog it first sends a HEAD request, and then fetches the page. Three times. Also uncompressed.

On the side of non-open-source services, FeedWrangler has probably one of the worst HTTP implementations I’ve ever seen: it does not support compressed responses (90KiB per feed fetch), does not do any caching (every time!), and while it fetches at one-hour intervals, it does not understand that a 301 is a permanent redirection, so there is no point in keeping around two feed IDs for /articles.rss and /articles.atom (each with one subscriber). That’s 4MiB a day, which is around 2% of the bandwidth my website serves in a day. While this is not a significant amount, and I have no limitation on the server’s egress, it seems silly that 2% of my bandwidth is consumed by two subscribers, when the site has over a thousand visitors a day.

But what takes the biscuit is definitely FeedMyInbox: while it fetches only every six hours, it implements neither caching nor compression. I only found it when looking into the requests coming from bots without a User-Agent header. The requests come from 216.198.247.46, which is svr.feedmyinbox.com. I’m also going to blacklist this soon, until they stop being douches and provide a valid User-Agent string.

They are far from the only ones, though; there is another bot fetching my feed every three hours that will soon meet the same fate. But it has no obvious service attached to it, so if whatever you’re using to read my blog tells you it can’t fetch it anymore, try to figure out whether you’re using a douchereader.

Please remember that software on the net should be implemented for collaboration between client and server, not for exploitation. Everybody’s bandwidth gets worse when you rely heavily on a service that is not doing its job of optimizing bandwidth usage.

WebP and the effect on my antispam

When I published my notes about WebP, I found out that I could not post on my blog any more. The reason was in my own ModSecurity rules: quite a long time ago I added an antispam rule that stops POST requests if they include image/webp in the Accept header.

Unfortunately, for whatever reason, instead of just adding image/webp to image requests, they added it to every single request Chrome makes, including the POST requests sent when submitting a form… That does not entirely sound correct, to be honest, but there probably was a reason for it.

So I dropped the WebP check from my rules. Today I checked my comments and found four pieces of spam. It turns out that particular check was very effective, and losing it is going to be a pain. On the other hand, the spam seems to accept image/x-bitmap and claim to come from Firefox, two conditions that I expect are never met together by real-life browsers, so I can probably look into adding a rule for that.

Another interesting rule I added recently, and had not discussed yet, is related to the fact that this blog is now only available over HTTPS. Most of the spam comments I receive are posted directly over HTTPS, but they report the original post’s URL over plain HTTP as the referrer. Filter these out, and most of my spam is gone.
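
Stripped of the ModSecurity syntax, the check amounts to something like this (a sketch of the logic, not my actual rule):

from urllib.parse import urlparse

def suspicious_comment_post(request_scheme, referrer):
    """Flag comment POSTs whose claimed referrer uses plain HTTP on an HTTPS-only blog."""
    return request_scheme == "https" and urlparse(referrer or "").scheme == "http"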

Long live ModSecurity — the problem will come when HTTP/2 is out, as it’s binary and leaves much less room for request fingerprinting.

The WebP experiment

You might have noticed over the last few days that my blog underwent some surgery, and that even now, on some browsers, the home page does not really look all that good. In particular, I’ve removed all but one of the background images and replaced them with CSS3 linear gradients. Users browsing the site with the latest version of Chrome, or with Firefox, will have no problem and will see a “shinier” and faster website; others will see something “flatter”. I’m debating whether I want to provide them with a better-looking fallback or not; for now, not.

But this was also a plan B — the original plan I had in mind was to leverage HTTP content negotiation to provide WebP variants of the images on the website. This was a win-win situation because, ludicrous as it seemed when WebP was announced, it turns out that with its dual mode, lossy and lossless, it can in one case or the other outperform both PNG and JPEG without a substantial loss of quality. In particular, lossless works like a charm with “art” images, such as the CC logos or my diagrams, while lossy works great for logos, like the Autotools Mythbuster one you see on the sidebar, or the (previous) gradient images you’d see as backgrounds.

So my obvious instinct was to set up content negotiation — I’ve used it before for multiple-language websites, and I expected it to work for multiple media types as well, since that’s what it’s designed for… but after setting it all up, it turns out that most modern web browsers still do not support WebP *at all*… and they don’t handle content negotiation as intended either. For this to work, we would need one of two things.

The first, and best, option would be for browsers to only Accept the image formats they support, or at least to prefer them — this is what Opera for Android does: Accept: text/html, application/xml;q=0.9, application/xhtml+xml, multipart/mixed, image/png, image/webp, image/jpeg, image/gif, image/x-xbitmap, */*;q=0.1 — but that seems to be the only browser doing it properly. In particular, in this listing you can see that it supports PNG, WebP, JPEG, GIF and X bitmap, and then accepts whatever else with a lower preference. If WebP were not in the list, even if the server gave it a higher preference, it would not be sent to the client. Unfortunately, this is not going to work, as most browsers send Accept: */* without explicitly providing the list of supported image formats. This includes Safari, Chrome, and MSIE.

Point of interest: Firefox does explicitly list one image format before the others: PNG.

The other alternative is for the server to default to the “classic” image formats (PNG, JPEG, GIF) and expect browsers that support WebP to prioritize it over the other image formats. Again, this is not the case: as shown above, Opera lists it but does not prioritize it, and Firefox prioritizes PNG over everything else, making no special exception for WebP.
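
Either way, the server-side half of the negotiation is trivial, which is what makes the bare Accept: */* so frustrating. A minimal sketch, with made-up file naming, of the decision I wanted the server to be able to take:

def negotiate_image(accept_header, base_name):
    """Serve the WebP variant only if the client explicitly lists image/webp."""
    offered = [part.split(";")[0].strip() for part in accept_header.split(",")]
    if "image/webp" in offered:
        return base_name + ".webp"
    return base_name + ".png"

# Opera for Android's explicit listing makes the right choice possible...
print(negotiate_image("text/html, image/png, image/webp, */*;q=0.1", "logo"))  # logo.webp
# ...while a bare */* gives the server nothing to go on, so it has to play it safe.
print(negotiate_image("*/*", "logo"))  # logo.png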

Issues are open with both Chrome and Mozilla to improve this, but the changes haven’t reached mainstream releases yet. Google’s own suggested solution is to use mod_pagespeed instead — but this module – which I already mentioned in passing in my post about unfriendly projects – does something else: it changes the served content on the fly, based on the reported User-Agent.

Given that I’ve spent some time on user agents, I’d say I have the experience to call this a huge Pandora’s box. If I already have trouble with some barely-maintained browsers reporting themselves as Chrome to fake their way past sites that check the user agent field in JavaScript, you can guess how many of those will actually support the features that PageSpeed thinks they support.

I’ll come back to PageSpeed in another post; for now I’ll just say that WebP has the numbers to become the next-generation format out there, but unless browser developers, as well as web app developers, get their act together, we’re going to have hacks upon hacks upon hacks for years to come… Currently, my blog uses a CSS3 feature with the standardized syntax — not all browsers understand it, and those that don’t will see a flat website without gradients; I don’t care and I won’t start adding workarounds just for that (although I might use SCSS, which would fix it for Safari)… new browsers will fix the problem, so just upgrade, or use a sane browser.

User-Agent strings and entropy

It was 2008 when I first got the idea of filtering User-Agents as an antispam measure. It worked on its own for quite a while, but recently my ruleset has had to grow more sophisticated fingerprinting to catch spammers. It still works better than a captcha, but it has gotten a bit worse.

One of the reasons why the User-Agent by itself is not enough anymore is that my filtering has been hindered by a more important project: EFF’s Panopticlick has shown that the uniqueness of User-Agent strings is actually an easy way to track a specific user across requests. This became important enough that Mozilla standardized their User-Agents starting with Firefox 4, to reduce their size and thus their entropy. Among other things, the “trail” component has been fixed to 20100101 on the desktop, and to the same version as Firefox on mobile.

Unfortunately, Mozilla lies on that page. Not only is the trail not fixed for Firefox Aurora (i.e. the alpha version), which meant that my first set of rules was refusing access to all users of that version, but their own Lightning extension for SeaMonkey also appends to the User-Agent, even though they said that wasn’t supported anymore.

A number of spambots seem to get this wrong, by the way. My guess is that they have code that generates the User-Agent by combining a bunch of fragments and randomizing it, so you can’t just kick a particular agent. Damn smart, if you ask me, and unfortunate, as ModSecurity hashes the IP collection by remote address and user agent, so if they cycle through different user agents, it’s harder for ModSecurity to understand that it’s actually the same IP address.

I do have some reservations about Mozilla’s handling of extension identification. First they say that extensions and plugins should not edit the agent string anymore – but Lightning does! – then they suggest that extensions can instead send an extra header to identify themselves. But that just means that fingerprinting systems only need to start counting those headers in addition to the generic ones that Panopticlick already considers.

On the other hand, other browsers don’t seem to have gotten the memo yet — indeed, both Safari’s and Chrome’s strings are long and include a bunch of almost-independent version numbers (AppleWebKit, Chrome, Safari — and Mobile on the iOS versions). It gets worse on Android, as both the standard browser and Chrome provide a full build identifier, which differs not only from one device to the next, but also from one firmware to the next. Given that each mobile provider has its own builds, I would be very surprised if I could find two of my friends with the same identifier in their browsers. Firefox is a bit better on that front, but it sucks in other ways, so I’m not using it as my main browser there anymore.

Why you should care about your HTTP implementation

So today’s frenzy is all about Google’s dismissal of the Reader service. While I’m also upset about that, I’m afraid I can’t really get into discussing it at this point. On the other hand, I can talk once again about my ModSecurity ruleset, and in particular about the rules that validate HTTP robots all over the Internet.

One of the Google Reader alternatives being talked about is NewsBlur — which actually looks cool at first sight, but I (and most other people) can’t seem to try it out yet because their service – I’m not going to say servers, as it seems they at least partially use AWS for hosting – fails to scale.

While I’m pretty sure they are receiving an exceptional amount of load right now, as everybody and their droid try to register for the service and import their whole Google Reader subscription list, which then needs to be fetched and added to the database – subscriptions to my blog’s feed went from 5 to 23 in a matter of hours! – there are a few things I can infer from the way it behaves that make me think somebody overlooked the need for a solid HTTP implementation.

First of all, what happened is that I got a report on Twitter that NewsBlur was getting a 403 when fetching my blog, which was obviously caused by my rules’ validation of the request. Looking at my logs, I found that NewsBlur sends requests with three different User-Agents, which suggests they are implemented by three different codepaths altogether:

User-Agent: NewsBlur Feed Fetcher - 5 subscribers - http://www.newsblur.com (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/536.2.3 (KHTML, like Gecko) Version/5.2)
User-Agent: NewsBlur Page Fetcher (5 subscribers) - http://www.newsblur.com (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1 Safari/534.48.3)
User-Agent: NewsBlur Favicon Fetcher - http://www.newsblur.com

The third is the most conspicuous string, because it’s very minimal and does not follow the average format: it uses dashes as separators instead of adding the URL in parentheses next to the fetcher name (and version, but more on that later).

The other two strings show that they have been taken from the string reported by Safari on OS X — but interestingly enough from two different Safari versions, and one of the two has been partially stripped as well. This is really silly. While I can understand that they might want to look like Safari when fetching a page to display – mostly because there are bad hacks like PageSpeed that serve different HTML to different browsers, messing up caching – I doubt that is warranted for feeds; and even getting the Safari HTML might be a bad idea if it’s then displayed to the user with a different browser.

The code that fetches feeds and the code that fetches pages are likely quite different, as can be seen from the full requests. From the feed fetcher:

GET /articles.atom HTTP/1.1
A-Im: feed
Accept-Encoding: gzip, deflate
Connection: close
Accept: application/atom+xml,application/rdf+xml,application/rss+xml,application/x-netcdf,application/xml;q=0.9,text/xml;q=0.2,*/*;q=0.1
User-Agent: NewsBlur Feed Fetcher - 5 subscribers - http://www.newsblur.com (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/536.2.3 (KHTML, like Gecko) Version/5.2)
Host: blog.flameeyes.eu
If-Modified-Since: Tue, 01 Nov 2011 23:36:35 GMT
If-None-Match: "a00c0-18de5-4d10f58aa91b5"

This is fairly sophisticated fetching code: it not only supports compressed responses properly (Accept-Encoding header) but also uses the If-None-Match and If-Modified-Since headers so as not to re-fetch unmodified content. The fact that the date points to November 1st of two years ago is likely because, since then, my ModSecurity ruleset has refused to speak with this fetcher on account of the fake User-Agent string. It also includes a proper Accept header that lists the feed types they prefer over generic XML and other formats.

The A-Im header is not fake or a bug; it’s actually part of RFC 3229, Delta encoding in HTTP, and stands for Accept-Instance-Manipulation. I had never seen it before, but a quick search turned it up, even though the standardized spelling would be A-IM. Unfortunately, the aforementioned RFC does not define the “feed” manipulator, even though it seems to be used in the wild, and I couldn’t find proper formal documentation of how it should work. The theory, as far as I can tell, is that the blog engine would use the If-Modified-Since header to produce on the spot a custom feed for the fetcher that only includes entries modified since that date. Cool idea; too bad it lacks a standard, as I said.
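
As I understand it, the manipulation would boil down to something like this on the blog engine’s side; entries is a hypothetical list of objects with an updated timestamp, and the rest of the machinery (serializing the partial feed, answering with the 226 IM Used status code that RFC 3229 defines) is left out:

from email.utils import parsedate_to_datetime

def delta_feed_entries(entries, if_modified_since):
    """Keep only the entries changed after the date the fetcher already has."""
    since = parsedate_to_datetime(if_modified_since)
    return [entry for entry in entries if entry.updated > since]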

The request coming in from the page fetcher is drastically different:

GET / HTTP/1.1
Host: blog.flameeyes.eu
Connection: close
Content-Length: 0
Accept-Encoding: gzip, deflate, compress
Accept: */*
User-Agent: NewsBlur Page Fetcher (5 subscribers) - http://www.newsblur.com (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1 Safari/534.48.3)

We can tell two things from the comparison: this code is older (an earlier version of Safari is being used), and it has not received the same care as the feed fetcher (which at least dropped the Safari identifier itself). It’s more than likely that, if libraries are used to send the requests, a completely different library is used here, as this request declares support for the compress encoding, which the feed fetcher does not (and which, as far as I can tell, is never actually used). It is also much less choosy about the formats it receives, as it accepts whatever you want to give it.

For the Italian readers: yes, I intentionally picked the word choosy. While I find Fornero an idiot as much as the next guy, I grew tired of copy-paste statuses on Facebook and comments that she should have said picky. Know your English, instead of complaining about idiocies.

The lack of If-Modified-Since here does not mean much, because it’s also possible that they were never able to fetch the page before, or that they introduced the feature later (even though this code is likely older). But the Content-Length header on a GET request sticks out like a sore thumb, and I would expect it to have been put there by whatever HTTP access library they’re using.

The favicon fetcher is the most naïve of the three, and possibly the code that needs the most cleaning up:

GET /favicon.ico HTTP/1.1
Accept-Encoding: identity
Host: blog.flameeyes.eu
Connection: close
User-Agent: NewsBlur Favicon Fetcher - http://www.newsblur.com

Here we start with near protocol violations: no Accept header is provided — especially facepalm-worthy considering this is where a static list of MIME types would be most useful, to restrict the image formats that will be handled properly! But what trips my rules is that the Accept-Encoding there is not suitable for a bot at all: since it does not support any compressed response, my configuration now responds with a 406 Not Acceptable status code instead of providing the icon.

I can understand that a compressed icon is more than likely not useful — indeed, most images should not be compressed at all when sent over HTTP. But why explicitly refuse it? Especially since the other two fetchers support reasonably sophisticated HTTP?

All in all, it seems like some of the code in NewsBlur has been bolted on after the fact, with different levels of care. It might not be the best time for them to look at their HTTP implementation right now, but I would still suggest it. A single pipelined connection for the three requests they need – instead of using Connection: close – could easily reduce the number of connections to blogs, and that would be very welcome to all the bloggers out there. And using the same HTTP code paths throughout would make it easier for people like me to handle NewsBlur properly.

I would also like a way to validate that a given request comes from NewsBlur — like we can with Googlebot and other crawlers. Unfortunately, this is not really possible, because they use multiple servers, both on standard hosting and on AWS, both on IPv4 and (possibly, at one point) IPv6, so using FcRDNS is not an option.

Oh well, let’s see how this thing pans out.

Passive web log analysis. Replacing AWStats?

You probably don’t know this, but I analyse my blog’s Apache logs with AWStats over and over again. This is especially useful at the start of the month to identify referrer spam and similar issues, which in turn lets me update my ModSecurity ruleset so that more spammers are caught and dealt with.

To do so I’ve been using, for years at this point, AWStats, which is a Perl analyzer, generator and CGI application. It used to work nicely, but nowadays it’s definitely lacking. It doesn’t recognize referring search engines as well as it used to (it’s still important to filter out requests coming from Google, but newer search engines are not recognized), and most of the newer “social bookmark” websites are not there at all — yes, it’s possible to keep adding them, but with upstream not moving, this is getting harder and harder.

Even more important for my ruleset work is the lack of identification of modern browsers. Things like Android versions and other fringe OSes would be extremely useful to me, but adding support for all of them is a pain, and I have enough things on my plate that this is not something I’m looking forward to tackling myself. It’s even more bothersome when you consider that there is no way to re-process the already analyzed data if a new URL is later identified as a search engine, or a user agent as a bot.

One of the most obvious choices for this kind of work is Google Analytics — unfortunately, it only works if it’s not blocked on the user’s side, which excludes NoScript users and, of course, most of the spammers. So this is not a job for it; it’s something that has to be done on the backend, on the logs side.

The obvious next step is to find something capable of importing the data out of the AWStats datafiles I already have, and of continuing to import data from the Apache log files. Hopefully it would store the data in a PostgreSQL database (which is what I usually use); native support for per-vhost data, with the ability to collapse it into a single view, would also be nice.
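
To give an idea of the shape I have in mind, here is a rough sketch using psycopg2 and a made-up table layout; the hits would come from whatever parses the Apache log files (and, ideally, the old AWStats datafiles):

import psycopg2  # assumed driver; connection string and schema are made up

def store_hits(hits):
    """Insert parsed log records (one dict per request) into PostgreSQL."""
    conn = psycopg2.connect("dbname=weblogs")
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS hits (
                vhost text,
                remote_ip inet,
                ts timestamptz,
                request text,
                status integer,
                bytes bigint,
                referrer text,
                user_agent text
            )
        """)
        cur.executemany("""
            INSERT INTO hits VALUES (
                %(vhost)s, %(remote_ip)s, %(ts)s, %(request)s,
                %(status)s, %(bytes)s, %(referrer)s, %(user_agent)s
            )
        """, hits)
    conn.close()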

If somebody knows of such a piece of software, I’d love to give it a try — something written in Ruby or Perl would probably be best for me (because I can hack on those), but I wouldn’t say no to Python or even Java (the latter only if somebody helps me make sure the dependencies are all packaged up properly). This will bring you better ModSecurity rules, I’m sure!