LOLprivacy, or Misunderstanding Panopticlick for the Worst

So Sebastian posted recently about Panopticlick, but I’m afraid he has not grasped just how many subtleties are involved in tracking by User-Agent, nor the limitations of the tool as it stands.

First of all, let’s take a moment to realize what «Your browser fingerprint appears to be unique among the 5,207,918 tested so far.» (emphasis mine) means. If I try the exact same request in Incognito, the message is «Within our dataset of several million visitors, only one in 2,603,994 browsers have the same fingerprint as yours.» (emphasis mine). I’m not sure why EFF does not expose the numbers in the second situation, hiding the five million under the word “several”. I also can’t tell how they keep further requests from the same browser from counting as brand new hits. So I’m not sure what the number represents.

Understanding what the number represents is a major problem, too: consider that, even just for his post, Sebastian tried at least three browsers, and I tried twice just to write this one — so one thing the number does not count is unique users. I would venture a guess that the number of users is well below a million, and that matters for multiple reasons: Panopticlick was born in 2010, and if fewer than a million real users hit it in five years, it might not be that statistically relevant.

Indeed, according to the current reading, just the Accept headers — encoding and language — would be enough to boil me down to one in four sessions. I doubt it is that clear-cut, as I’m most definitely not one of only four people in the UKIE area speaking Italian. A lot of this has to do with the self-selection of “privacy conscious” people who use this tool from EFF.

But what worries me is the reaction from Sebastian and, even more so, the first comment on his post. Suggesting that you can hide in the crowd by looking for a “more popular” User-Agent, or by using a random bunch of extensions and disabling JavaScript or blocking certain domains, is naïve to say the least, and most likely misses the point that Panopticlick tries to make.

The whole idea of browser fingerprinting is the ability to identify a user across a set of sessions — it responds to a threat model similar to Tor’s. While I already pointed out that I disagree with the threat model, I would like to point out again that the kind of “surveillance” this counters is, ideally, the one executed by an external entity able to monitor your communications across different source connections — if you don’t use Tor and you only use a desktop PC from the same connection, then it doesn’t really matter: you can just check the IP address! And if you use different devices, then it also does not really matter, because you’re now using different profiles anyway; the power is in the correlation.

In particular, when trying to tweak the User-Agent or other headers to make them “more common”, you’re dealing with something that is more likely to backfire than not; as my ModSecurity Ruleset shows quite well, it’s not difficult to tell a real Chrome request apart from Firefox masquerading as Chrome, or IE masquerading as Safari: they have different Accept-Encoding values, and other differences in request-header style, which make it quite straightforward to check for them. And while you could mix up the Accept headers enough to “look the part”, it’s more than likely you’d be served bad data (e.g. sdch to IE, or webp to Firefox), and that would make your browsing useless.
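
To give an idea of the kind of cross-check involved — a minimal sketch in the spirit of my ruleset, not the actual rules, and assuming that desktop Chrome of this generation always advertises sdch in its Accept-Encoding — flagging a Chrome User-Agent that lacks a Chrome-style Accept-Encoding could look like this:

SecRule REQUEST_HEADERS:User-Agent "chrome/" \
    "t:lowercase,chain,log,deny,status:403,msg:'Chrome User-Agent without Chrome-style Accept-Encoding.'"
SecRule REQUEST_HEADERS:Accept-Encoding "!sdch" "t:lowercase"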

More importantly, the then-unique combination of, say, a Chrome User-Agent on an obviously IE-generated request would make it very easy to follow a session aggregated across different websites with a similar fingerprint. The answer I got from Sebastian is not good either: even if you tried to use a “more common” version string, you could still, very easily, create unwanted unique fingerprints. Take Firefox 37: it started supporting the Alt-Svc extension to use HTTP/2 when available; if you were to report your browser as Firefox 28 and it then followed Alt-Svc, the version string would clearly be fake, and again an easy one to follow. Similar version-dependent request fingerprinting, paired with a modified User-Agent string, would make you light up like a Christmas tree on Earth Day.

There are more problems, though; the suggestion of installing extensions such as AdBlock also adds to the fingerprint rather than subtracting from it; as long as JavaScript is allowed to run, it can detect AdBlock’s presence, and with a bit of work you can identify which of the different blocking lists is in use, too. You could use NoScript to avoid running JavaScript at all, but given that this is by far not something most users do, it also adds to the entropy of your browser’s fingerprint rather than removing from it, even though it prevents client-side fingerprinting from accessing things like the list of available plugins (which in my case is not that common, either!)

But even ignoring the fact that Panopticlick does not try to identify the set of installed extensions (finding Chrome’s Readability is trivial, as it injects content into the DOM, and so do plenty of others), there is one more aspect that it almost entirely ignores: server-side fingerprinting. Besides not trying to correlate the purported User-Agent against the request fingerprint, it does not seem to use a custom server at all, so it does not leverage TLS handshake fingerprints! As can be seen through Qualys’ analysis, there are some almost-unique handshake sequences on a given server depending on the client used; while this does not add much more data when matched against a vanilla User-Agent, a faked User-Agent paired with a somewhat rarer TLS handshake would be just as easy to track.

Finally, there is the problem of self-selection: Sebastian blogged about this while using Firefox 37.0.1, which had just been released, and tested with that; I assume he also had the latest Chrome. While Mozilla has increased Firefox’s release cadence, Chrome’s is definitely hectic, with many people updating all the time. Most people wouldn’t go to Panopticlick every time they update their browser, so two entries that are exactly the same apart from the User-Agent version would be reported as unique… even though it’s most likely that the person who tried two months ago has updated since, and now has the same fingerprint as the person who tried recently with the same browser and settings.

Now this is a double-edged sword: if you rely on the User-Agent to track someone across connections, an ephemeral User-Agent that changes every other day due to updates is going to disrupt your plans quickly; on the other hand, lagging behind or jumping ahead of the update train for a browser makes it more likely that you’ll have a quite unique version number, even more so if you’re tracking the beta or developer channels.

Interestingly, though, Mozilla has thought about this before, and their Gecko user agent string reference shows which restricted fields are used, and references the bugs that disallowed extensions and various software to inject into the User-Agent string — funnily enough I know of quite a few badware cases in which a unique identifier was injected into the User-Agent for fake ads and other similar websites to recognize a “referral”.

Indeed, especially on mobile, I think that User-Agents are a bit too liberal with the information they push; not only do they include the full build number of the mobile browser such as Chrome, but they usually include the model of the device and the build number of the operating system: do you want to figure out whether a new build of Android is available for some random device out there? Make sure you have access to the HTTP logs of big enough websites and look for new build IDs. I think that in this particular sub-topic, Chrome and Safari could help a lot more by reducing the amount of detail about the engine version as well as the underlying operating system.

So, for my parting words, I would like to point out that Panopticlick is a nice proof-of-concept that shows how powerful browser fingerprinting is, without having to rely on tracking cookies. I think lots of people both underestimate the power of fingerprinting and overestimate the threat. On one side, because Panopticlick does not have enough current data to make it feasible to evaluate the actual uniqueness of a session across the world; on the other, because you get the wrong impression that if Panopticlick can’t put you down as unique, you’re safe — you’re not; there are many more techniques that Panopticlick does not even try!

My personal advice is to stop worrying about the NSA and instead start looking after yourself: using click-to-play for Flash and Java is good prophylaxis for security, not just privacy, and NoScript can be useful too, in some cases, but don’t just kill everything on sight. Even using the Data Saver extension for non-HTTPS websites can help (unfortunately I know of more than a few websites blocking it, and then there is the problem of captive portals forcing everything down to clear-text HTTP, too).

Browser fingerprinting

I posted some notes about browser fingerprinting back in March, noting how easy it is to identify a given user across requests just with the few passive scans that are possible without even having Flash enabled. Indeed, EFF’s Panopticlick considers my browser unique even with Flash disabled.

But even though Panopticlick only counts me among the people who actually ran it — just a fraction of all the possible users out there — it is also not exercising the full force of fingerprinting. In particular, it does not try to detect the installed Chrome extensions, which is actually trivial to do in JavaScript for some of them. In my case, I can easily identify the presence of the Readability extension, because it injects an “indicator” as an iframe with a fixed ID. Similarly, it’s relatively easy to identify adblock users, as you have probably noticed already on the bunch of different sites that beg you to disable the adblocker so they can make some money from the ads.

Given how paranoid some of my readers are, I’m looking forward to somebody adding Chrome and Firefox extension identification to Panopticlick; it’ll definitely be interesting going forward.

User-Agent strings and entropy

It was 2008 when I first got the idea to filter User-Agents as an antispam measure. It worked for quite a while on its own, but recently my ruleset had to grow more sophisticated fingerprinting to catch spammers. It still works better than a CAPTCHA, but it did worsen a bit.

One of the reasons why the User-Agent itself is not enough anymore is that my filtering has been hindered by a more important project. EFF’s Panopticlick has shown that the uniqueness of User-Agent strings is actually an easy way to track a specific user across requests. This became important enough that Mozilla standardized their User-Agent strings starting with Firefox 4, to reduce their size and thus their entropy. Among other things, the “trail” component has been fixed to 20100101 on the desktop, and to the same version as Firefox on mobile.

_Unfortunately, Mozilla lies on that page. Not only is the trail not fixed for Firefox Aurora (i.e. the alpha version), which meant my first set of rules was refusing access to all users of that version, but their own Lightning extension for SeaMonkey also appends to the User-Agent, when they said that wasn’t supported anymore._

A number of spambots seem to get this wrong, by the way. My guess is that they have some code that generates the User-Agent by stitching together a bunch of fragments and randomizes it, so you can’t just block one particular agent. Damn smart, if you ask me — and unfortunate, as ModSecurity keys its IP collection on the remote address plus the user-agent, so if they cycle through different user agents, it’s harder for ModSecurity to understand that it’s actually the same IP address.
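
For reference, the collection keying looks roughly like this — a sketch of the standard initialization, not my configuration verbatim:

SecAction "phase:1,nolog,pass,\
    setvar:tx.ua_hash=%{request_headers.user-agent},\
    initcol:ip=%{remote_addr}_%{tx.ua_hash}"

Every new agent string therefore lands the same address in a fresh, empty collection, which is exactly why cycling user agents blunts the per-IP counters.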

I do have some reservations about Mozilla’s handling of extension identification. First they say that extensions and plugins should not edit the agent string anymore – but Lightning does! – then they suggest that they can instead send an extra header to identify themselves. But that just means that fingerprinting systems only need to start counting those headers as well as the generic ones that Panopticlick already considers.

On the other hand, other browsers don’t seem to have gotten the memo yet — indeed, both Safari’s and Chrome’s strings are long and include a bunch of almost-independent version numbers (AppleWebKit, Chrome, Safari — and Mobile on the iOS versions). It gets worse on Android, as both the standard browser and Chrome provide a full build identifier, which is not only different from one device to the next, but also from one firmware to the next. Given that each mobile provider has its own builds, I would be very surprised if I could find two of my friends with the same identifier in their browsers. Firefox is a bit better on that, but it sucks in other ways, so I’m not using it as my main browser there anymore.

Why you should care about your HTTP implementation

So today’s frenzy is all about Google’s dismissal of the Reader service. While I’m also upset about that, I’m afraid I cannot really get into discussing that at this point. On the other hand, I can talk once again of my ModSecurity ruleset and in particular of the rules that validate HTTP robots all over the Internet.

One of the Google Reader alternatives that are being talked about is NewsBlur — which actually looks cool at first sight, but I (and most other people) don’t seem to be able to try it out yet because their service – I’m not going to call them servers as it seems they at least partially use AWS for hosting – fails to scale.

While I’m pretty sure they are receiving an exceptional amount of load right now, as everybody and their droid try to register for the service and import their whole Google Reader subscription list, which then needs to be fetched and added to the database – subscriptions to my blog’s feed went from 5 to 23 in a matter of hours! – there are a few things I can infer from the way it behaves that make me think somebody overlooked the need for a strong HTTP implementation.

First of all, what happened was that I got a report on Twitter that NewsBlur was getting a 403 when fetching my blog, and that was obviously caused by my rules’ validation of the request. Looking at my logs, I found out that NewsBlur sends requests with three different User-Agents, which suggests they are implemented by three different codepaths altogether:

User-Agent: NewsBlur Feed Fetcher - 5 subscribers - http://www.newsblur.com (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/536.2.3 (KHTML, like Gecko) Version/5.2)
User-Agent: NewsBlur Page Fetcher (5 subscribers) - http://www.newsblur.com (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1 Safari/534.48.3)
User-Agent: NewsBlur Favicon Fetcher - http://www.newsblur.com

The third is the most conspicuous string, because it’s very minimal and does not follow the average format, using a dash as separator instead of adding the URL in parentheses next to the fetcher name (and version — more on that later).

The other two strings show that they have been taken from the string reported by Safari on OS X — but interestingly enough from two different Safari versions, and one of the two has actually been stripped as well. This is really silly. While I can understand that they might want to look like Safari when fetching a page to display – mostly because there are bad hacks like PageSpeed that serve different HTML to different browsers, messing up caching – I doubt that is warranted for feeds; and even getting the Safari HTML might be a bad idea if it’s then displayed by the user in a different browser.

The code that fetches feeds and the code that fetches pages are likely quite different, as can be seen from the full requests. From the feed fetcher:

GET /articles.atom HTTP/1.1
A-Im: feed
Accept-Encoding: gzip, deflate
Connection: close
Accept: application/atom+xml,application/rdf+xml,application/rss+xml,application/x-netcdf,application/xml;q=0.9,text/xml;q=0.2,*/*;q=0.1
User-Agent: NewsBlur Feed Fetcher - 5 subscribers - http://www.newsblur.com (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/536.2.3 (KHTML, like Gecko) Version/5.2)
Host: blog.flameeyes.eu
If-Modified-Since: Tue, 01 Nov 2011 23:36:35 GMT
If-None-Match: "a00c0-18de5-4d10f58aa91b5"

This is very sophisticated fetching code, as it not only properly supports compressed responses (Accept-Encoding header) but also uses the If-None-Match and If-Modified-Since headers to avoid re-fetching unmodified content. The fact that it’s pointing to November 1st of two years ago is likely because my ModSecurity ruleset has refused to speak with this fetcher since then, because of the fake User-Agent string. It also includes a proper Accept header that lists the feed types they prefer over generic XML and other formats.

The A-Im header is not a fake or a bug; it’s actually part of RFC 3229, Delta encoding in HTTP, and stands for Accept-Instance-Manipulation. I’ve never seen it before, but a quick search turned it up, even though the standardized spelling would be A-IM. Unfortunately, the aforementioned RFC does not define the “feed” manipulator; it seems to be used in the wild anyway, but I couldn’t find proper formal documentation of how it should work. The theory, from what I can tell, is that the blog engine would be able to use the If-Modified-Since header to produce, on the spot, a custom feed for the fetcher that only includes entries modified since that date. Cool idea; too bad it lacks a standard, as I said.

The request coming in from the page fetcher is drastically different:

GET / HTTP/1.1
Host: blog.flameeyes.eu
Connection: close
Content-Length: 0
Accept-Encoding: gzip, deflate, compress
Accept: */*
User-Agent: NewsBlur Page Fetcher (5 subscribers) - http://www.newsblur.com (Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko) Version/5.1 Safari/534.48.3)

So we can tell two things from the comparison: this code is older (an earlier version of Safari is being used), and not the same care has been spent on it as on the feed fetcher (which at least dropped the Safari identifier itself). It’s more than likely that, if libraries are used to send the requests, a completely different library is used here, as this request declares support for the compress encoding, which the feed fetcher does not (and which, as far as I can tell, is never actually used). It is also much less choosy about the formats it receives, as it accepts whatever you want to give it.

*For the Italian readers: yes, I intentionally picked the word choosy. While I can find Fornero an idiot as much as the next guy, I grew tired of the copy-paste statuses on Facebook and the comments that she should have said picky. Know your English, instead of complaining about idiocies.*

The lack of If-Modified-Since here does not really mean much, because it’s also possible that they were never able to fetch the page, as they might have introduced the feature later (even though the code is likely older). But the Content-Length header sticks out like a sore thumb, and I would expect it to have been put there by whatever HTTP access library they’re using.

The favicon fetcher is the one that is the most naïve and possibly the code that needs to be cleaned up the most:

GET /favicon.ico HTTP/1.1
Accept-Encoding: identity
Host: blog.flameeyes.eu
Connection: close
User-Agent: NewsBlur Favicon Fetcher - http://www.newsblur.com

Here we start with near protocol violations, by not providing an Accept header — especially facepalm-worthy considering that this is where a static list of MIME types would be most useful, to restrict the image formats that will be handled properly! But what trips my rules is that the Accept-Encoding there is not suitable for a bot at all: since it does not declare support for any compressed response, my server now responds with a 406 Not Acceptable status code instead of providing the icon.
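
The check that trips here is conceptually simple — a sketch of it, not the rule verbatim: anything presenting itself as a fetcher or bot has to accept at least gzip or deflate, or it gets the 406.

SecRule REQUEST_HEADERS:User-Agent "(bot|crawler|fetcher)" \
    "t:lowercase,chain,log,deny,status:406,msg:'Bot refusing compressed responses.'"
SecRule REQUEST_HEADERS:Accept-Encoding "!(gzip|deflate)" "t:lowercase"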

I can understand that a compressed icon is more than likely not to be useful — indeed most images should not be compressed at all when sent over HTTP — but why should you explicitly refuse it? Especially since the other two fetchers properly support a rather sophisticated HTTP?

All in all, it seems like some of the code in NewsBlur has been bolted on after the fact, and with different levels of care. It might not be the best of times for them to look at the HTTP implementation right now, but I would still suggest it. A single pipelined request for the three components they need – instead of using Connection: close – could easily reduce the number of connections to blogs, and that would be very welcome to all the bloggers out there. And using the same HTTP code throughout would make it easier for people like me to handle NewsBlur properly.

I would also like to have a way to validate that a given request comes from NewsBlur — like we do with GoogleBot and other crawlers. Unfortunately this is not really possible, because they use multiple servers, both on standard hosting and on AWS, both on IPv4 and (possibly, at one point) IPv6, so using FcRDNS is not an option.

Oh well, let’s see how this thing pans out.

The importance of HTTP request fingerprinting

I started looking at ModSecurity when I wanted to implement a User-Agent based antispam method, which has proven time and time again to work quite well — to the point that I started publishing the ruleset, which serves not only as an antispam measure but also as a way to keep tons of bad crawlers from finding my email addresses, and so on.

When I first proposed this kind of filtering I received quite a few complaints that the HTTP protocol didn’t intend the User-Agent to be used in such a way, but thanks first to EFF’s Panopticlick – demonstrating clearly that the “anonymised” requests are not as anonymous as their perpetrators would expect them to be – and most recently to SpiderLabs’s work, I am now fully certain that I took the right road.

I’ve spent a bit more work on the rules this week, to make them further resilient to faked requests such as those coming from script kiddies’ tools like the HOIC tool described in the SpiderLabs blog post linked above. One of the most interesting detections I came up with is for real Chrome requests: while it seems to me that Google itself does not leverage it, Chrome as of version 18 still implements their own proposed Shared Dictionary Compression over HTTP (sdch), even though I don’t think it’ll ever be used in the real world. Being the only browser actually requesting such an encoding, I can easily assume a connection between the two — the only exception being Epiphany, which in its most recent versions declares itself to be Chrome… which means you then have a browser claiming to be another (Chrome), which in turn claims to be a third (Safari), which uses an engine (KHTML) claiming to be the same as another (Gecko), all the while declaring it’s all compatible with Mozilla/5.0.

One issue I found while doing this work has to do with Android. For both versions 2 and 3 (is somebody really hoping to use Android 1?), the (default, AOSP) browser sends a full-fledged HTTP request, which among other things includes an Accept header. This is what every browser I ever tried does, to the point that ModSecurity’s own Core Rule Set assigns negative points to requests coming without one; in my ruleset this is further tightened by checking whether the request purports to come from a known browser, and if so rejecting it when it doesn’t include that header. This worked up to now — note that requests coming through a proxy, when that is made explicit through a Via header, are not validated against these checks, simply because many proxies are known to muck with the headers.

Anyway, as I was saying, this assumption is badly broken by Android 4 (up to 4.0.3, and CyanogenMod as well); it might have started as a way to minimise bandwidth usage, but for whatever reason, in this version the AOSP browser does not send an Accept header at all — actually, it seems to have dropped most of the headers it was sending before that are not strictly necessary for the server to process the request. I could have sworn that Accept was mandatory in the HTTP protocol, but it seems that either I was totally mistaken, or it was only noted in some recommendation that never made it into the standard. The ruleset now exonerates Android 4 from that particular test, but I’m not really too happy about it.
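
Put together, the logic described above looks roughly like this — a sketch rather than the ruleset verbatim: known browsers must send an Accept header, unless the request came through a declared proxy or from the Android 4 stock browser.

SecRule REQUEST_HEADERS:User-Agent "(mozilla|opera)" \
    "t:lowercase,chain,log,deny,status:403,msg:'Known browser request without Accept header.'"
SecRule &REQUEST_HEADERS:Via "@eq 0" "chain"
SecRule REQUEST_HEADERS:User-Agent "!android 4\." "t:lowercase,chain"
SecRule &REQUEST_HEADERS:Accept "@eq 0"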

But that’s definitely not the only thing that is out of place with Android. Indeed, if you take an HTC Android device, the browser you open is not the AOSP one but HTC’s own implementation. This version … does not declare itself as an Android device using a browser compatible with Mobile Safari at all. Instead, it reports itself as a complete Safari — and not in the way that Chrome does it, but by pretending to be Mac OS X 10.6.3 running on an Intel Mac. Honestly, that’s a crazy thing to do.

There are a few more things that I hope to be able to handle in my ruleset to make it even tighter, without adding substantial false positives. This means not only fewer spam comments, but also fewer crawlers finding our email addresses, and fewer risks associated with Denial of Service attacks, distributed or not.

If you would like to help with the ruleset, you can find it on Flattr where it’s depressingly stopping at only two clicks. If you would like to use the ruleset, you can find it on GitHub and you can use it for free, obviously.

Why I check your user agents

I’m one of the few Free Software activists who actually endorses the use of the User-agent header, I’m afraid. The reason so few do is that, while in general that header is used to implement various types of policies, it is often used as part of lock-in schemes (sometimes paper-thin lock-ins, by the way), and we all agree that lock-ins are never nice. It is a different discussion whether those lock-ins are something to simply attack, or something to comprehend and accept — I sincerely think that Apple has every right to limit access to their trailers to QuickTime, or at least to try, as they are providing the service, and it’s a platform for them to show off their software; on the other hand, BBC and RAI using it to lock in their public-service TV is something nasty!

So basically we have two reasons to use User-agent: policies and statistics. In the former category I also count the implementation of workarounds of various species. Statistics are mostly useful to decide what to focus on; policies can be used for good or evil: lock-ins are generally evil, but you can also use policies to improve the quality of the service for users.

One of the most commonly used workarounds applied through user agent declarations relates to MSIE’s missing features; for instance, there is one to handle properly serving XHTML files through the application/xhtml+xml MIME type, which MSIE doesn’t support:

RewriteCond %{REQUEST_URI} ^/[a-z_/]*$
RewriteCond %{HTTP_USER_AGENT} MSIE [OR]
RewriteCond %{HTTP_USER_AGENT} facebookexternalhit [OR]
RewriteCond %{HTTP_ACCEPT} application/xhtml\+xml\s*;\s*q=0\.?0*(\s|,|$)
RewriteRule ^/[a-z_/]*$ - [T=text/html]

Yes, this has one further check compared to most copies of the same snippet found around the Internet; the reason is that I noticed experimentally that Facebook does not handle XHTML properly: if you attach a link to a webpage that has images and is served as XHTML, it won’t get you the title nor let you choose an image to use for the link. This was true at least up to last December, and I assume it still is, which is why I have that extra line.

In a different situation, feng uses the User-agent field to identify buggy software and implement specific workarounds (such as ignoring the RTSP/1.0 standard and seeking on subsequent PLAY requests without a PAUSE).

Stepping away from workarounds, policies that can be implemented this way include warning about insecure, unsupported browsers and trojan-infected systems, providing users with an informational message telling them what to do to get something better/cleaner (I do that on a few websites to tell users that they are running something very broken — such as Internet Explorer 6). This is policy, and it’s generally a good policy in my opinion. *On a different note, if somebody can suggest a way to use cookies to add a static way to bypass the check, I’d be happy.*
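
A hypothetical sketch of such a policy, including the kind of cookie-based bypass I’m asking about (the page path and cookie name are made up, not what I actually run):

RewriteEngine On
# Send IE6 users to an informational page, unless they carry the
# acknowledgement cookie or are already requesting the page itself.
RewriteCond %{HTTP_USER_AGENT} "MSIE 6\.0"
RewriteCond %{HTTP_COOKIE} !oldbrowser-ack=1
RewriteCond %{REQUEST_URI} !^/oldbrowser\.html$
RewriteRule ^ /oldbrowser.html [R=302,L]

The informational page would then be responsible for setting the oldbrowser-ack cookie, giving users a static way back in.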

There are many more things you can do with agent-specific policies, including providing lower-quality images to smartphones without implementing mobile-specific website vhosts; beyond the quick sketch below, though, I won’t go into deeper details right now.
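
As a quick, hypothetical sketch (the lowres/ tree of pre-scaled copies is an assumption, not something I actually serve):

# Serve a pre-scaled copy of each image to phone browsers, when one
# exists, without a separate mobile vhost.
RewriteCond %{HTTP_USER_AGENT} "(iPhone|iPod|Android|Opera Mini)" [NC]
RewriteCond %{DOCUMENT_ROOT}/lowres%{REQUEST_URI} -f
RewriteRule ^/images/(.+)$ /lowres/images/$1 [L]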

As far as statistics are concerned, they usually provide a way for developers and designers to focus on what’s really being used by the targets of their software. Again, some activists dislike this because it shows that it’s not worth considering non-Firefox, non-IE browsers for most websites — and sometimes not even Firefox — but extreme cases aside, statistics are, in the real working world, very important.

Some people feel they are smarter than the average programmer and want to skew the statistics by claiming to use “Commodore 64” or “MS-DOS” as their operating system. They pretend to defend their privacy, to camouflage themselves in the big bad Internet. What they are actually doing is trying to hide on a plane by wearing a balaclava, which you might guess is pretty conspicuous. In fact, if you try EFF’s Panopticlick, you can see that a unique, “novelty” User-agent actually makes you stand out among Internet users. Which means that if you’re trying to hide in a crowd with a balaclava, you’re not smarter than anybody; you’re actually dumber than average.

Oh, and by the way, there is no way that faking being Googlebot will work out well for you; on my webserver, for instance, you’ll get 403 responses for all your requests… unless your reverse resolution properly forward-confirms as coming from the Googlebot server farm…
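
For the curious, the check is conceptually along these lines — a sketch, not my actual configuration, and it assumes HostnameLookups Double so that REMOTE_HOST is only populated when the reverse name forward-confirms:

SecRule REQUEST_HEADERS:User-Agent "googlebot" \
    "t:lowercase,chain,log,deny,status:403,msg:'Fake Googlebot, failing FcRDNS.'"
SecRule REMOTE_HOST "!\.googlebot\.com$" "t:lowercase"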

Technology use and abuse

When I started working on my antispam filtering based on the user agent strings provided by browsers, I got quite a bit of feedback from people who complained that user agent strings weren’t meant to be used for that, that they should be used just for statistical purposes, and other stuff like that. Indeed, reading around free software planets, you find lots of people maintaining the position that no code logic should be conditional on the user agent string; this usually involves people working on Debian-based systems where Firefox is banned and Iceweasel is the way.

Now, I understand that what I’m doing is borderline valid for the protocol, and that discriminating users based on their user agent string is not ethically perfect; but let me say that the thing works pretty nicely. I observe from time to time the comments that get denied (mod_security can keep them in the log) and I haven’t found a single false positive; there are a few false negatives (that is, spam that passes through the mod_security filter and reaches the blog), but luckily the antispam features in Typo itself are good enough at that point. Lately this has been happening because a few spambots started declaring themselves as an almost credible IE7 on Windows XP or Vista; while IE8 has been released, I’d rather give it a few more months before starting to reject those, too.

These results are starting to make me wonder how much of what I’m doing is abuse and how much is use; there are some questionable reasons behind logic switches between Firefox and Iceweasel, but that does not involve me. And at the same time, one would expect that stuff like this is bound to happen; both Apple and Google seem to have accepted that, and you can see that Safari still declares itself KHTML, and Google Chrome declares itself as Safari, too. Sure, most of the code that tries to identify one of the three should just match on WebKit (well, that is, if KDE were to finally decide to go with the one engine that is getting support out there), but at the same time they try to be pragmatic and accept that there is code logic based on user agents.

Back to my usage: since publishing the rules on this blog is starting to get messy — because of mod_security itself, funnily enough — I’m probably going to post them on a git repository or something in the next few days; I’ll also be adding the public-service rules that I’ve been using for a while now, at least on my friend’s site (and which actually found a couple of friends of his who had dialers on their systems and never noticed).

So maybe I’m using it for something it wasn’t designed for; on the other hand, it works, and it really does not differ much from running statistical analysis on the headers of email messages — and you know that your mail server, or client, or proxy, or whatever, is doing something like that with SpamAssassin!

Yes, again spam filtering

You might remember that I reported success with my filters based on the User-Agent value as reported by clients; unfortunately, it seems like I was really speaking way too soon. While the amount of spam I had to manually remove from the blog decreased tremendously, which allowed me to disable the 45-day limit on commenting, and comment moderation as well, it still didn’t cut it, and it caused a few false positives.

The main problem is that the filter on HTTP/1.0 behaviour was hitting almost anybody who tried to comment through a proxied connection: the default squid configuration doesn’t use HTTP/1.1 and so downgrades everything to 1.0; thanks to binki and moesasji I was able to track down the issue, and now my ruleset (which I’m going to attach at the end of the post) checks for the Via header to identify proxies. Unfortunately, the result is that now I get much more spam; indeed, lots and lots of comment spam comes through open proxies, which are far from uncommon.

I guess one option would be to use the SORBS DNSBL blacklists to filter out known open proxies; unfortunately, either I misconfigured the dnsbl lookup module for Apache (which I had hoped was already working) or the proxies I’m receiving spam from are not listed there at all. I was also told that mod_security can handle the lookup itself, which is probably good, since I can then limit the open-proxy lookups to the cases where a proxy is actually used.
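
A sketch of the mod_security side of it, building on the Via detection in the ruleset below (I would still have to double-check which SORBS zone is the right one for open proxies):

SecRule TX:FLAMEEYES_VIA_PROXY "1" \
    "chain,log,msg:'Comment posted through a blacklisted open proxy.',deny,status:403"
SecRule REMOTE_ADDR "@rbl dnsbl.sorbs.net"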

I was also told to look at the rules from Got Root, which also include some user agent based filtering; I haven’t done so yet, though, because I’m starting to get worried: my rules already run a number of regular expression matches on the User-Agent header, and I’m trying to do my best to make sure the expressions are generic enough without being too broad; Got Root’s rules, on the other hand, seem to provide a straight match against a long series of user agents, which means lots and lots of added checks. The rules also seem to be either absolute (for any requested URL) or only for WordPress-based blogs, which means I’d have to adapt or tinker with them, since I’m currently limiting the antispam measures through Apache’s Location block (previously LocationMatch, but the new Typo version uses a single URL for all comment posting).

What I’d like to see is some kind of Apache module able to match a User-Agent against a known list of bad User-Agents, as well as against a list of regular expressions compiled into some kind of bytecode, so as to be much, much faster than the “manual” matching that is done now. Unfortunately, I have neither the time nor the Apache expertise to take care of that myself, which means either someone else does it, or I’m going to stick with mod_security for a while longer.

Anyway here’s the beef!

SecDefaultAction "pass,phase:2,t:lowercase"

# Ignore get requests since they cannot post comments.
SecRule REQUEST_METHOD "^get$" "pass,nolog"

# 2009-02-27: Kill comments where there is no User-Agent at all; I
# don't care if people like to be "anonymous" in the net, but the
# whole thing about anonymous browsers is pointless.
SecRule REQUEST_HEADERS:User-Agent "^$" 
    "log,msg:'Empty User-Agent when posting comments.',deny,status:403"

# Since we cannot check for _missing_ user agent we have to check if
# it's present first, and then check whether the variable is not
# set. Yes it is silly but it seems to be the only way to do this with
# mod_security.
SecRule REQUEST_HEADERS_NAMES "^user-agent" 
    "setvar:tx.flameeyes_has_ua=1"
SecRule TX:FLAMEEYES_HAS_UA "!1" 
    "log,msg:'Missing User-Agent header when posting comments.',deny,status:403"

# Check if the comment arrived from a proxy; if that's the case we
# cannot rely on the HTTP version that is provided because it's not
# the one of the actual browser. We can, though, check it against an
# open proxy blacklist.
SecRule REQUEST_HEADERS_NAMES "^via" 
    "setvar:tx.flameeyes_via_proxy=1,log,msg:'Commenting via proxy'"

# If we're not going through a proxy, and it's not lynx, and yet we
# have an HTTP/1.0 comment request, then it's likely a spambot with a
# fake user agent.
#
# Note the order of the rules is explicitly set this way so that the
# majority of requests from HTTP/1.1 browsers (legit) are ignored
# right away; then all the requests from proxies, then lynx.
SecRule REQUEST_PROTOCOL "!^http/1.1$" 
    "log,msg:'Host has to be used but HTTP/1.0, posting spam comments.',deny,status:403,chain"
SecRule TX:FLAMEEYES_VIA_PROXY "!1" "chain"
SecRule REQUEST_HEADERS:User-Agent "!lynx"


# Ignore very old Mozilla versions (not modern browsers, often versions
# that never even existed) and pre-2 versions of Firefox.
#
# Also ignore comments coming from IE 5 or earlier since we don't care
# about such old browsers. Note that Yahoo feed fetcher reports itself
# as MSIE 5.5 for no good reason, but we don't care since it cannot
# _post_ comments anyway.
#
# 2009-02-27: Very old Gecko versions should not be tolerated; grant a
# grace period to 2007-2009 for now.
#
# 2009-03-01: Ancient Opera versions usually posting spam comments.
#
# 2009-04-22: Some spammers seem to send requests with "Opera "
# instead of "Opera/", so list that as an option.
SecRule REQUEST_HEADERS:User-Agent "(mozilla/[0123]|firefox/[01]|gecko/200[0123456]|msie ([12345]|7.0[ab])|opera[/ ][012345678])" 
    "log,msg:'User-Agent too old to be true, posting spam comments.',deny,status:403"

# The Mozilla/4.x and /5.x agents have 0 as minor version, nothing
# else.
SecRule REQUEST_HEADERS:User-Agent "(mozilla/[45].[1-9])" 
    "log,msg:'User-Agent sounds fake, posting spam comments.',deny,status:403"

# Malware and spyware that advertises itself on the User-Agent string,
# since a lot of spam comments seem to come out of browsers like that,
# make sure we don't accept their comments.
SecRule REQUEST_HEADERS:User-Agent "(funwebproducts|myie2|maxthon)" 
    "log,msg:'User-Agent contains spyware/adware references, posting spam comments.',deny,status:403"

# Bots usually provide an http:// address to look up their
# description, but those don't usually post comments. Consider any
# comment coming from a similar User-Agent as spam.
SecRule REQUEST_HEADERS:User-Agent "http://" 
    "log,msg:'User-Agent spamming URLs, posting spam comments.',deny,status:403"

SecRule REQUEST_HEADERS:User-Agent 
    "^mozilla/4.0+" "log,msg:'Spaces converted to + symbols, posting spam comments.',deny,status:403"

# We expect Windows XP users to upgrade at least to IE7. Or use
# Firefox (even better) or Safari, or Opera, ...
#
# All the comments coming from the old default OS browser have a high
# chance of being spam, so reject them.
#
# 2009-04-22: Note that we shouldn't check for 5.0 and 6.0 NT versions
# specifically, since Server and x64 editions can have different minor
# versions.
SecRule REQUEST_HEADERS:User-Agent "msie 6.0;( .+;)? windows nt [56]." 
    "log,msg:'IE6 on Windows XP or Vista, posting spam comments.',deny,status:403"

# List of user agents only ever used by spammers
#
# 2009-04-22: the "Windows XP" declaration is never used by official
# MSIE agent strings, it uses "Windows NT 5.0" instead, so if you find
# it, just kill it.
SecRule REQUEST_HEADERS:User-Agent "(libwen-us|msie .+; .*windows xp)" 
    "log,msg:'Confirmed spam User-Agent posting spam comments.',deny,status:403"

Spam attacks

I have on my TODO list (expected to happen eventually, though I have no idea when) to update the mod_security rules that I posted some time ago; while the ones I posted mostly work, I had to add one more exception for HTTP/1.0 posting (Opera, in some configurations), and I’ve added a few more blacklist entries for known spamming User-agents (Project Honeypot seems quite useful for double-checking those, and is why you’ll actually find Project Honeypot-induced hidden links on my blog; another item on my TODO list is adding this to the xine Bugzilla too).
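
As a sketch of how the Project Honeypot double-check could be automated (this assumes a ModSecurity build with http:BL support and a registered API key — the key below is a placeholder):

SecHttpBlKey myhttpblapikey
SecRule REMOTE_ADDR "@rbl dnsbl.httpbl.org" \
    "log,deny,status:403,msg:'Client address listed in Project Honeypot http:BL.'"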

With the filtering on, I had only one person reporting false positives (moesasji) and, from time to time, some spam passing through mod_sec and hitting the Typo anti-spam measure (which is not perfect, but can deal with the lower rate of spam I receive now). Today, though, I found a strangely large hit of spam. Note that by my new standards, “strangely large hit” means nine spam comments on three posts. So I ran the usual script to fetch the new data from the access log on the server, and it started getting interesting.

The one comment that stood out from the rest — because it was otherwise the absolutely usual spam comment — reports the Opera for Wii browser as its user agent. It’s a first for me, in both spam and non-spam, for that user agent. I do use the PSP browser from time to time, and I tried blogging from the PlayStation 3, but at this point I don’t doubt the User-Agent header is being forged, because I can’t see someone easily hijacking a Wii to post spam comments around.

The remaining comments are much more interesting. First of all, they come with no User-Agent at all, which means I forgot to ban that particular case with mod_sec (just checking that it matches ^$ does not work, probably because that expects an empty User-Agent: header rather than no header at all), and I’ll have to fix that in a moment. But there is one other interesting issue, which wouldn’t have been that interesting if I didn’t read Planet Debian almost daily.
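
For reference, one way to catch the header being absent altogether — a sketch of the fix, as opposed to the REQUEST_HEADERS_NAMES workaround in the ruleset above — is to count the header instances instead of matching their value:

SecRule &REQUEST_HEADERS:User-Agent "@eq 0" \
    "log,deny,status:403,msg:'No User-Agent header at all when posting comments.'"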

The other day I read (and shared on Google Reader) a post by Steve Kemp about how spammers don’t know the syntax of your site and will try to link their website with different methods all at once. In particular, he reports that his comment anti-spam service now takes care of identifying that too (which reminds me that I have to find or write a plugin for Typo to check comments for that — again on my TODO list).

How does that make the spam I received today interesting? Well, instead of one spam comment with three different link methods, different IPs in the same C-class posted four comments on the same article, with the usual “Very nice site” text: one without a link, and three with the three different link methods. Quite a nice way to avoid the detection Kemp reported. Which brings me to the final question of the post: are spammers monitoring us? Or is it just strange luck that, as soon as Kemp found a mostly “no false positive” rule to identify spam, they started working around it?

At any rate, please remember to disable browser anonymisers when you want to post comments on my blog. I don’t like them, and you’d have no reason to use them here, since I’m not an evildoer who records users’ browser preferences — I just use the headers to avoid filling the net with spam.

My idea works: filtering by User-agent

You might remember that some time ago I proposed blocking old user agents; while I wasn’t able to get around to implementing the idea Typo-side, with proper warnings and an interface for users, the Apache move that followed allowed me to implement it for real using mod_security.

While I think the default ruleset in mod_security is quite anal-retentive and keeps me from posting most of my technical entries (and related comments) by disallowing strings like /etc, the thing is tremendously powerful. I’m (ab)using it to stop requests for PHP pages from hitting Typo (the server is not going to use PHP any time soon), which together with mod_rewrite reduces the load on the server itself.

To implement my idea (which has actually been live on this blog for quite a while, and was refined further today), I first observed the behaviour of most spam comments; it turned out that I could identify some common patterns, which made it really easy to write some rules. While they cannot remove all the spam, they have a near-zero false-positive rate, and they increased the signal-to-noise ratio to the point that I was able to restore comments on all the thousand (actually, nearly a thousand, but that’s good enough for me) posts on this blog, spanning about three years of my Gentoo and Free Software work. Before, it was a stretch to keep comments enabled even on posts older than 45 days, and it was difficult to manage.

Anyway, the first point to make is that only comment posting should be blocked. I don’t care about spammers browsing my blog; at worst they would poison my AWStats output, but that’s password-protected and will not cause Google spam. So I wrote all the SecRule entries directly in the virtual host definition, inside a LocationMatch block. This should also reduce the per-request work that Apache and the module have to do.
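
The scoping itself is nothing fancy — a sketch of it, with an illustrative path rather than my actual comment URL:

<LocationMatch "^/comments">
    # The antispam SecRule entries only apply where comments are posted.
    SecRule REQUEST_HEADERS:User-Agent "http://" \
        "log,deny,status:403,msg:'User-Agent spamming URLs, posting spam comments.'"
    # ... the rest of the rules go here ...
</LocationMatch>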

Now, as for the actual rules, I first decided to disallow posting for blatantly too-old browsers, like the ones describing themselves as Mozilla/1 through Mozilla/3, or Firefox/0 and Firefox/1 (besides, didn’t Firefox change name after release 1?):

SecRule REQUEST_HEADERS:User-Agent "(mozilla/[123])|(firefox/[01])" 
    "log,auditlog,msg:'User-Agent too old to be true, posting spam comments.',deny,status:403"

Then I started removing “strange and fake” User-Agents, like the ones reporting a Mozilla version with a non-zero minor version, and then User-Agents which included a certain piece of spyware.

SecRule REQUEST_HEADERS:User-Agent "(mozilla/[45].[1-9]|FunWebProducts)" 
    "log,auditlog,msg:'User-Agent sounds fake, posting spam comments.',deny,status:403"

I sincerely wonder how many false positives the above rule produces — none on my blog, but on more Windows-focused blogs it might not work that well. I’m not sure whether the spyware on the system causes IE to be hijacked to produce spam comments, or whether the spam comments just happen to use the same User-Agent, but on the whole I guess a user who browses with such software is a user I don’t really want to hear comments from.

Together with that spyware there seem to be more (jeez, do people on Windows really install any crap sent their way? I’m glad I’m using Linux and OS X!); again, I’m not sure whether the bots use generated User-Agents that include those strings, whether they hijack the browsers directly, or whether systems that already have that kind of spyware are simply more likely to be subject to other kinds too.

The next rule kills a lot more spam bots and spyware-ridden browsers, by rejecting any User-Agent with a URL in it. I haven’t found any legit browser User-Agent that lists a URL. Crawlers do, but they don’t post comments.

# Bots usually provide an http:// address to look up their
# description, but those don't usually post comments. Consider any
# comment coming from a similar User-Agent as spam.
SecRule REQUEST_HEADERS:User-Agent "http://" 
    "log,auditlog,msg:'User-Agent spamming URLs, posting spam comments.',deny,status:403"

Then I noticed a huge amount of spam comments coming in with HTTP version 1.0, but with the User-Agent of browsers that support HTTP/1.1 perfectly well and which, I’m sure, request pages with that version. The only browser I could find that legitimately uses HTTP/1.0 to post comments is lynx, so I whitelisted it explicitly:

SecRule REQUEST_PROTOCOL "!^http/1.1$" 
    "log,auditlog,msg:'Host has to be used but HTTP/1.0, posting spam comments.',deny,status:403,chain"
SecRule REQUEST_HEADERS:User-Agent "!lynx"

The next observation showed that a lot of User-Agents used to post comments had a common error in them: spaces were URL-encoded, not with the usual %20, but with +, as is sometimes done. So I decided to kill those as well:

SecRule REQUEST_HEADERS:User-Agent "^mozilla/4\.0\+" \
    "log,auditlog,msg:'Spaces converted to + symbols, posting spam comments.',deny,status:403"

This already removed a huge amount of the spam, and I used these rules until today. Then, after one more month of observation, I found that a lot of spam, and no good comments, came from old default browsers on Windows — or at least pretended to. This included IE6 on Windows XP and IE5 on Windows 2000. So I decided to disallow all posts from the first case (I expect Windows XP users to get a decent browser or, if they cannot, at least IE7), and then all the older versions of Internet Explorer, from 2 (yes, it still shows up sometimes!) to 5:

# We expect Windows XP users to upgrade at least to IE7. Or use
# Firefox (even better) or Safari, or Opera, ...
#
# All the comments coming from the old default OS browser have a high
# chance of being spam, so reject them.
SecRule REQUEST_HEADERS:User-Agent "msie 6.0; windows nt 5.1" 
    "log,msg:'IE6 on Windows XP, posting spam comments.',deny,status:403"

# Also ignore comments coming from IE 5 or earlier since we don't care
# about such old browsers. Note that Yahoo feed fetcher reports itself
# as MSIE 5.5 for no good reason, but I don't care since it cannot
# post comments anyway.
SecRule REQUEST_HEADERS:User-Agent "msie [12345]" 
    "log,msg:'.',deny,status:403"

Now, describing these rules can be a bit controversial, since making them public also means that the developers of spam bots can learn some more things to avoid — but I decided to do it anyway, for a few reasons I deem good enough.

The first is that I’m sure a lot of spam-bot users don’t care to update their code at all, and rely on sheer volume of posting. Anybody with a minimum of knowledge of the web can figure out how to reduce the difference between the User-Agents they use and the ones actually used by real users. Then there is the hope that knowing about these problems can help someone else reduce their amount of spam just as well.

Finally, today Reinhard and Darren, while discussing the new xine website, brought up the bus factor, which in my case actually morphs into the pancreas factor. It is actually true that, given my past two years, I could disappear, literally dead, without notice. While thinking of this depresses me to the point where I wish I had never worked in Free Software, I need to work around the problem by documenting processes and so on.

In the next week, given that I don’t have job-related tasks to direct my attention towards, I’ll try to document all the scripts used for site generation, the configuration files for Apache, the cron jobs regenerating the site, and so on and so forth. It’s going to be a massive amount of documentation to write, but I have been doing that for Gentoo-related stuff for a while already.

Sigh. Now I really wish I had never embarked on this quest to begin with.