This Time Self-Hosted

My idea works: filtering by User-agent

You might remember that some time ago I proposed blocking old user agents; while I never got around to implementing that idea Typo-side, with proper warnings and an interface for the users, the Apache move that followed allowed me to implement it for real using mod_security.

While I think the default ruleset in mod_security is quite anal-retentive and keeps me from posting most of my technical blogs (and the related comments) by rejecting strings like /etc, the thing is tremendously powerful. I’m (ab)using it to stop requests for PHP pages from ever hitting Typo (the server is not going to use PHP any time soon), which, together with mod_rewrite, reduces the load on the server itself.
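
Something along these lines (a sketch, not my exact rule) is enough to reject PHP requests before they reach the application:

# Sketch only: this server serves no PHP at all, so any request for a
# PHP page can be rejected outright instead of being handed to Typo.
SecRule REQUEST_URI "\.php" \
    "log,auditlog,msg:'Request for a PHP page on a PHP-less server.',deny,status:403"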

To implement my idea (which has actually been live on this blog for quite a while, and was refined further today), I first observed the behaviour of most spam comments; it turned out that I could identify some common patterns, which made it really easy to write some rules. While they cannot remove all the spam, they have a near-zero false positive rate, and they increased the signal-to-noise ratio to the point where I was able to restore comments on all the thousand (actually, nearly a thousand, but that’s good enough for me) posts on this blog, spanning about three years of my Gentoo and Free Software work. Before, I had to stretch things just to keep comments enabled on posts older than 45 days, and it was difficult to manage.

Anyway, the first point to make is that only the comment posting should be blocked. I don’t care about the spammers browsing my blog; at worst they would poison my AWStats output, but that’s password protected and will not cause Google spam. So I wrote all the SecRule entries directly in the virtual host definition, inside a LocationMatch block. This should also reduce the per-request work that Apache and the module have to do.
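
To give an idea of the structure, this is more or less how it looks (a sketch only; the /comments pattern is an assumption, the actual expression depends on the URLs Typo uses for posting comments):

<VirtualHost *:80>
    ServerName blog.example.com
    # ... the rest of the virtual host definition ...

    # Only the comment-posting locations go through the anti-spam
    # rules; plain browsing is left alone.
    <LocationMatch "/comments">
        SecRuleEngine On
        # The SecRule entries described below go here.
    </LocationMatch>
</VirtualHost>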

Now, as for the actual rules, I first decided to disallow posting for blatantly too-old browsers, like the ones describing themselves as Mozilla/1 to Mozilla/3 or Firefox/0 and Firefox/1 (besides, didn’t Firefox change name after release 1?):

SecRule REQUEST_HEADERS:User-Agent "(mozilla/[123])|(firefox/[01])" 
    "log,auditlog,msg:'User-Agent too old to be true, posting spam comments.',deny,status:403"
Code language: JavaScript (javascript)

Then I started removing “strange and fake” User-Agents, like the ones reporting a Mozilla version with a non-zero decimal part, and then User-Agents which include a certain spyware.

SecRule REQUEST_HEADERS:User-Agent "(mozilla/[45].[1-9]|FunWebProducts)" 
    "log,auditlog,msg:'User-Agent sounds fake, posting spam comments.',deny,status:403"
Code language: JavaScript (javascript)

I sincerely wonder how many false positives the above rule produces; none on my blog, but on more Windows-focused blogs it might not work that well. I’m not sure whether the spyware on the system causes IE to be hijacked into posting spam comments, or whether the spam comments just happen to use the same User-Agent, but on the whole I guess a user who browses with such software is a user I don’t really want to hear comments from.

Together with that spyware there seem to be more (jeez, do people on Windows really install any crap sent their way? I’m glad I’m using Linux and OSX!); again, I’m not sure whether the bots generate User-Agents that include them, whether the spyware hijacks the browser directly, or whether systems that already have that kind of spyware are simply more likely to pick up other kinds of spyware too.

The next rule kills a lot more spam bots and spyware-ridden browsers, by rejecting any User-Agent with an URL in it. I haven’t found any legit browser User-Agent that lists an URL; crawlers do, but they don’t post comments.

# Bots usually provide an http:// address to look up their
# description, but those don't usually post comments. Consider any
# comment coming from a similar User-Agent as spam.
SecRule REQUEST_HEADERS:User-Agent "http://" \
    "log,auditlog,msg:'User-Agent spamming URLs, posting spam comments.',deny,status:403"

Then I noticed a huge amount of spam comments coming in with HTTP version 1.0, but with User-Agents of browsers that support HTTP/1.1 perfectly well and which, I’m sure, request pages with that version. The only browser I could find that legitimately uses HTTP/1.0 to post comments is lynx, so I whitelisted it explicitly:

SecRule REQUEST_PROTOCOL "!^http/1.1$" \
    "log,auditlog,msg:'HTTP/1.0 used by a browser that should speak HTTP/1.1, posting spam comments.',deny,status:403,chain"
SecRule REQUEST_HEADERS:User-Agent "!lynx"

The next observation showed that a lot of the User-Agents used to post comments had a common error in them: spaces were URL-encoded, not with the usual %20 but with +, as sometimes it’s done. So I decided to kill those as well:

SecRule REQUEST_HEADERS:User-Agent "^mozilla/4\.0\+" \
    "log,auditlog,msg:'Spaces converted to + symbols, posting spam comments.',deny,status:403"

This already removed a huge amount of the spam, and I used these rules as they were until today. Then, after one more month of observation, I found that a lot of spam, and no good comment, came from old default browsers on Windows, or at least pretended to; this included IE6 under Windows XP and IE5 under Windows 2000. So I decided to disallow all posts from the first case (I expect Windows XP users to get a decent browser or, if they cannot, at least IE7), and then all the older versions of Internet Explorer, from 2 (yes, sometimes it still hits!) to 5:

# We expect Windows XP users to upgrade at least to IE7. Or use
# Firefox (even better) or Safari, or Opera, ...
#
# All the comments coming from the old default OS browser have a high
# chance of being spam, so reject them.
SecRule REQUEST_HEADERS:User-Agent "msie 6.0; windows nt 5.1" 
    "log,msg:'IE6 on Windows XP, posting spam comments.',deny,status:403"

# Also ignore comments coming from IE 5 or earlier since we don't care
# about such old browsers. Note that Yahoo feed fetcher reports itself
# as MSIE 5.5 for no good reason, but I don't care since it cannot
# post comments anyway.
SecRule REQUEST_HEADERS:User-Agent "msie [12345]" 
    "log,msg:'.',deny,status:403"
Code language: PHP (php)

Now, publishing these rules can be a bit controversial, since making them public also means that the developers of spam bots can learn some more things to avoid; but I decided to do it anyway, for a few reasons I deem good enough.

The first is that I’m sure a lot of spam bot users don’t care to update their code at all, and rely on sheer posting volume instead. Besides, anybody with a minimum of knowledge of the web could figure out on their own how to make the faked User-Agents closer to the ones actual users send. And then there is the hope that knowing about these patterns can help someone else reduce their amount of spam just as well.

Finally, today Reinhard and Darren, while discussing the new xine website, brought up the bus factor, which in my case actually morphs into the pancreas factor. It is actually true that, given my past two years, I could disappear, literally dead, without notice. While thinking of this depresses me to the point where I wish I had never worked on Free Software, I need to work around the problem by documenting processes and so on.

In the next week, given that I don’t have job-related tasks to direct my attention towards, I’ll try to document all the scripts used for the site generation, the configuration files for Apache, the cron jobs that run the regeneration, and so on and so forth. It’s going to be a massive amount of documentation to write, but I have been doing that for Gentoo-related stuff for a while already.

Sigh, now I really wish I had never embarked on this quest to begin with.

Comments 11
  1. First of all, I haven’t said that encoding spaces with + is an error, I said “as sometimes it’s done”; I agree it’s not an error in most cases. As for the link, it refers to HTML, which is far from anything I care about in this post, since the whole thing is handled within the HTTP protocol instead. At any rate, no legit User-Agent is using that type of encoding, since the header need not be encoded at all; it’s just a bug in some spam bots which I’m taking advantage of.

  2. I’ve noticed a few bizarre user-agents in my logs in the past – there’s even one that identifies itself as ancient versions of Konqueror (3.0-3.2). The best way I’ve found of catching bots is to hide a “data:” URL somewhere on the page. Legitimate bots like Google won’t touch it, but spam bots nearly always try to access it as a filename. (A sketch of this idea follows the comments.)

  3. Thanks Diego, that’s really useful… I’ve noticed in the past week or so I’ve had a Win 98 machine with IE 6 trying to hit random pages on my website, so this blog came in at a really good time for me!

  4. Do you intend to roll these rules out on the xine website as well? Something that caught my eye was: (1) Reading your xine web article, I saw: “Probably because of the bad way the PHP code was written, the site had all the crawlers stopped by robots.txt, which is a huge setback for a site aiming to be public.” (2) And now in this post (I think I’m reading out of order, bear with me), I read: “[…] Note that Yahoo feed fetcher reports itself as MSIE 5.5 for no good reason, but I don’t care since it cannot post comments anyway.”

  5. Ugh, why didn’t I click preview… the formatting of that comment is atrocious.

  6. The “xine website”:http://www.xine-project.org/ is totally static, so no, these rules don’t apply. The old xine website, the one that is currently down, used PHP, and probably for that reason it kept all the robots out of the way through robots.txt. The new site is not stopping any robot from accessing it, since it cannot have vulnerability issues (the sources are parsed and processed before being published). The comment regarding the Yahoo feed fetcher (by the way, unless Darren implemented it this week, I haven’t had time to look at it, there is no feed on xine’s website) is related to the requests I’ve seen in my blog’s logs. In general the crawlers don’t send comments to the blog, so I don’t care if their requests would be rejected for the locations reserved for posting comments. I have some SecRule entries for the general sites (including the xine website); I’ll see to document them further, but in general they are just stopping PHP requests and killing off a few known spam bots.

  7. This is probably the most elegant anti-spam solution I have ever seen. I wonder if it would be more user-friendly to display a warning to affected www clients in place of the comment form. (A related sketch follows the comments.)

  8. While this post is pretty old, I wanted to note that Google now uses http:// in the header of their Google Bot. Many of the rules here, while good, are probably going to be outdated and in need of revision soon.

  9. Actually, GoogleBot and other fetch-only agents already don’t get filtered just for the presence of an URL in the string; it’s a bit more complex than that. While [the ruleset](https://github.com/Flameeye… will actually get a big update, probably tonight or tomorrow, it is fairly stable at this point.
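
To give an idea of the “a bit more complex” part mentioned in the last comment, the URL-in-User-Agent check can be chained so that it only fires on actual comment postings and skips known fetch-only agents. This is just a sketch with assumed crawler names, not the actual published ruleset:

# Sketch: reject URL-bearing User-Agents only on POST requests, and
# let known fetch-only crawlers through. The crawler names are an
# assumption, adjust them to whatever you want to allow.
SecRule REQUEST_METHOD "^POST$" \
    "log,auditlog,msg:'URL in User-Agent on a comment posting.',deny,status:403,chain"
SecRule REQUEST_HEADERS:User-Agent "http://" "t:lowercase,chain"
SecRule REQUEST_HEADERS:User-Agent "!(googlebot|yahoo|msnbot)" "t:lowercase"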
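
As for the data: honeypot suggested in comment 2, one way to act on it (a sketch, not the commenter’s actual setup) is to hide a link to a data: URL somewhere in the page, for instance inside an invisible element, and then reject whoever requests it as if it were a path:

# Misbehaving bots request the hidden data: link as if it were a
# filename, so the request URI ends up containing "data:"; legit
# browsers and well-behaved crawlers never ask for it.
SecRule REQUEST_URI "data:" \
    "log,auditlog,msg:'Requested the hidden data: honeypot link.',deny,status:403"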
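
And regarding the friendlier warning suggested in comment 7, a simple related approach (sketched here with an assumed page path) is to keep the deny rules as they are, but serve an explanation page instead of the bare 403 for the comment-posting locations:

<LocationMatch "/comments">
    # Serve a human-readable explanation instead of the bare 403
    # error when a comment is rejected; the path is an assumption.
    ErrorDocument 403 /errors/comment-blocked.html
</LocationMatch>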
