My idea works: filtering by User-agent

You might remember that some time ago I proposed blocking old user agents; while I wasn’t able to get around to implementing this idea Typo-side, with proper warnings and an interface for the users, the Apache move that followed allowed me to implement my idea for real using mod_security.

While I think the default ruleset in mod_security is quite anal-retentive and prevents me from posting most of my technical blogs (and related comments) by disallowing strings like /etc, the thing is tremendously powerful. I’m (ab)using it to stop requests for PHP pages from hitting Typo (the server is not going to use PHP any time soon), which together with mod_rewrite reduces the load on the server itself.
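As a rough illustration of the PHP filtering mentioned above, a rule along these lines would do the job. This is only a sketch and not necessarily the rule running on this server: the pattern, the message and the 404 status are all assumptions made for the example.

```apache
# Hypothetical sketch: refuse any request whose path ends in .php
# before it is handed over to the blog application.
# The 404 status is an assumption; a 403 would work just as well.
SecRule REQUEST_FILENAME "\.php$" \
    "log,msg:'PHP page requested, but this server does not run PHP.',deny,status:404"
```

Rejecting these requests inside Apache means the application never has to render an error page for them.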

To implement my idea (which has actually been live on this blog for quite a while, and was refined further today), I first observed the behaviour of most spam comments; it turned out that I could identify some common patterns, which made it really easy to write some rules. While they cannot remove all the spam, they have a near-zero false positive rate, and they increased the signal-to-noise ratio to the point that I was able to restore comments on all the thousand (actually, nearly a thousand, but that’s good enough for me) posts on this blog, spanning about three years of my Gentoo and Free Software work. Before, I had to stretch it to be able to keep them enabled for posts older than 45 days, and it was difficult to manage.

Anyway the first point to make is that only the comment posting should be blocked. I don’t care about the spammers browsing my blog, at the worst they would poison my AWStats output, but that’s password protected and will not cause Google spam. So I wrote all the SecRule entries directly in the virtual host definition inside a LocationMatch block. This should also reduce the per-request work that Apache and the module have to do.
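To give an idea of the layout just described, here is a minimal sketch of such a virtual host. The host name and the URL pattern for comment posting are hypothetical placeholders, since the real pattern depends on how the blog engine routes comment submissions.

```apache
<VirtualHost *:80>
    # Hypothetical host name, for illustration only.
    ServerName blog.example.com

    # Hypothetical pattern: adjust it to whatever URL the blog engine
    # uses to receive comment submissions.
    <LocationMatch "^/articles/.*/comments$">
        SecRuleEngine On
        # The SecRule entries that filter comment posting go here, so
        # that they are evaluated only for requests matching the
        # pattern above.
    </LocationMatch>
</VirtualHost>
```

Requests for any other URL never reach these rules, so ordinary page views are not filtered at all.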

Now, as for the actual rules, I first decided to disallow postings from blatantly too old browsers, like the ones describing themselves as Mozilla/1 to Mozilla/3 or Firefox/0 and Firefox/1 (besides, didn’t Firefox change name after release 1?):

SecRule REQUEST_HEADERS:User-Agent "(mozilla/[123])|(firefox/[01])" \
    "log,auditlog,msg:'User-Agent too old to be true, posting spam comments.',deny,status:403"

Then I started removing “strange and fake” User-Agents, like the ones reporting a Mozilla version with a non-zero decimal value, and then User-Agents which included a certain piece of spyware:

SecRule REQUEST_HEADERS:User-Agent "(mozilla/[45]\.[1-9]|FunWebProducts)" \
    "log,auditlog,msg:'User-Agent sounds fake, posting spam comments.',deny,status:403"

I sincerely wonder how many false positives the above rule produces; none on my blog, but on more Windows-focused blogs it might not work that well. I’m not sure whether the spyware on the system causes IE to be hijacked to produce spam comments, or whether the spam comments just appear to use the same User-Agent, but on the whole I guess a user that browses with such software is a user I don’t really want to hear comments from.

Together with that spyware there seems to be more (jeez, do people on Windows really install any crap sent their way? I’m glad I’m using Linux and OS X!); again, I’m not sure whether the spam bots use generated User-Agents that include them, whether they hijack the browser directly, or whether systems that already have those kinds of spyware are more likely to be subject to other kinds of spyware too.

The next rule kills a lot more spam bots and more spyware-laden browsers, by rejecting any User-Agent with a URL in it. I haven’t found any legit User-Agent that lists a URL, at least not for browsers. Crawlers do, but they don’t post comments.

# Bots usually provide an http:// address to look up their
# description, but those don't usually post comments. Consider any
# comment coming from a similar User-Agent as spam.
SecRule REQUEST_HEADERS:User-Agent "http://" \
    "log,auditlog,msg:'User-Agent spamming URLs, posting spam comments.',deny,status:403"

Then I noticed a huge amount of spam comments arriving over HTTP/1.0, but carrying the User-Agent of browsers that support HTTP/1.1 well, and which I’m sure request pages with that version. The only browser I could find that legitimately uses HTTP/1.0 to post comments is lynx, so I whitelisted it explicitly:

SecRule REQUEST_PROTOCOL "!^http/1\.1$" \
    "log,auditlog,msg:'Host has to be used but HTTP/1.0, posting spam comments.',deny,status:403,chain"
SecRule REQUEST_HEADERS:User-Agent "!lynx"

The next observation showed that a lot of User-Agents used to post comments had a common error in them: the space was URL-encoded, not with the usual %20, but with +, as is sometimes done. So I decided to kill those at once as well:

SecRule REQUEST_HEADERS:User-Agent "^mozilla/4\.0\+" \
    "log,auditlog,msg:'Spaces converted to + symbols, posting spam comments.',deny,status:403"

This already reduced a huge amount of the spam, and I used it till today. Then after one more month of observation I found that a lot of spam, and no good comment, came from old default browsers on Windows, or at least pretended to. This included IE6 under Windows XP and IE5 under Windows 2000. So I decided to disallow all the posts from the first case (I’m expecting Windows XP users to get a decent browser, or if they cannot, get at least IE7), and then all the older versions of Internet Explorer, from 2 (yes sometimes it still hits!) to 5:

# We expect Windows XP users to upgrade at least to IE7. Or use
# Firefox (even better) or Safari, or Opera, ...
# All the comments coming from the old default OS browser have a high
# chance of being spam, so reject them.
SecRule REQUEST_HEADERS:User-Agent "msie 6.0; windows nt 5.1" \
    "log,msg:'IE6 on Windows XP, posting spam comments.',deny,status:403"

# Also ignore comments coming from IE 5 or earlier since we don't care
# about such old browsers. Note that Yahoo feed fetcher reports itself
# as MSIE 5.5 for no good reason, but I don't care since it cannot
# post comments anyway.
SecRule REQUEST_HEADERS:User-Agent "msie [12345]" 

Now, describing these rules can be a bit controversial, since making them public also means that the developers of spam bots can now learn some more things to avoid; but I decided to do it anyway, for a few reasons I deem good enough.

The first is that I’m sure that a lot of spam bot users don’t care to update their code at all, and rely on the sheer amount of posting. Anybody with a minimum amount of knowledge of the web can figure out how to reduce the difference between the User-Agents the bots send and the ones that are actually used by users. Then there is the hope that knowing these problems can help someone else reduce the amount of spam just as well.

Finally, today Reinhard and Darren, when discussing the new xine website, brought up the bus factor, which in my case actually morphs into the pancreas factor. It is actually true that, given my past two years, I could disappear, literally dead, without notice. While thinking of this actually depresses me to a point where I wish I had never worked in Free Software, I need to work around the problem, by documenting processes and so on.

In the next week, given that I don’t have job-related tasks to direct my attention towards, I’ll try to document all the scripts used for the site generation, the configuration files for Apache, the cron jobs regenerating the script, and so on and so forth. It’s going to be a massive amount of documentation I have to write, but I have been doing that for Gentoo-related stuff for a while already.

Sigh, now I really wish I had never embarked on this quest to begin with.