This Time Self-Hosted
dark mode light mode Search

Yes, again spam filtering

You might remember that I reported success with my filters using User-Agent value as reported by the clients; unfortunately it seems like I was really speaking way too soon. While the amount of spam I had to manually remove from the blog decreased tremendously, which allowed me to disable the 45 days limits on commenting, and also comment moderation, it still didn’t cut it, and caused a few false positives.

The main problem is that the filter on HTTP/1.0 behaviour was hitting almost anybody that tried to comment with a proxied connection: default squid configuration doesn’t use HTTP/1.1 and so downgrades everything to 1.0; thanks to binki and moesasji I was able to track down the issue and now my ruleset (which I’m going to attach at the end of the post) checks for the Via header to identify proxies. Unfortunately, the result is that now I get much more spam; indeed lots and lots of comment spam comes through open proxies, which is far from an uncommon thing.

I guess one option would be to use the SORBS DNSBL blacklists to filter out known open proxies; unfortunately either I misconfigured the dnsbl lookup module for Apache (which I hoped I got already working) or the proxies I’m receiving spam from are not listed there at all. I was also told that mod_security can handle the lookup itself, which is probably good since I can reduce the lookups for the open proxy to the case when a proxy is actually used.

I was also told to look at the rules from Got Root which also list some user agent based filtering; I haven’t done so yet though, because I start to get worried: my rules are already executing a number of regular expression matching on the User-Agent header, and I’m trying to do my best to make sure that the expressions are generic enough but not too broad; on the other hand Got Root’s rules seems to provide a straight match of a series of user agents, which means lots and lots of checks added; the rules also seems to either be absolute (for any requested URL) or just for WordPress-based blogs, which means I’d have to adapt or tinker with them since I’m currently limiting the antispam measures through the use of Apache’s Location block (previously LocationMatch, but the new Typo version uses a single URL for all the comments posting).

What I’d like to see is some kind of Apache module that is able to match an User-Agent against a known list of bad User-Agents, as well as a list of regular expressions, compiled in some kind of bytecode, to be much much faster than the “manual” parsing that is done now. Unfortunately I don’t have neither time nor expertise with Apache to take care of that myself, which means either someone else does it, or I’m going to keep with mod_security for a while longer.

Anyway here’s the beef!

SecDefaultAction "pass,phase:2,t:lowercase"

# Ignore get requests since they cannot post comments.
SecRule REQUEST_METHOD "^get$" "pass,nolog"

# 2009-02-27: Kill comments where there is no User-Agent at all; I
# don't care if people like to be "anonymous" in the net, but the
# whole thing about anonymous browsers is pointless.
SecRule REQUEST_HEADERS:User-Agent "^$" 
    "log,msg:'Empty User-Agent when posting comments.',deny,status:403"

# Since we cannot check for _missing_ user agent we have to check if
# it's present first, and then check whether the variable is not
# set. Yes it is silly but it seems to be the only way to do this with
# mod_security.
SecRule REQUEST_HEADERS_NAMES "^user-agent" 
    "log,msg:'Missing User-Agent header when posting comments.',deny,status:403"

# Check if the comment arrived from a proxy; if that's the case we
# cannot rely on the HTTP version that is provided because it's not
# the one of the actual browser. We can, though, check if it's an open
# proxy blacklist.
    "setvar:tx.flameeyes_via_proxy=1,log,msg:'Commenting via proxy'"

# If we're not going through a proxy, and it's not lynx, and yet we
# have an HTTP/1.0 comment request, then it's likely a spambot with a
# fake user agent.
# Note the order of the rules is explicitly set this way so that the
# majority of requests from HTTP/1.1 browsers (legit) are ignored
# right away; then all the requests from proxies, then lynx.
SecRule REQUEST_PROTOCOL "!^http/1.1$" 
    "log,msg:'Host has to be used but HTTP/1.0, posting spam comments.',deny,status:403,chain"
SecRule TX:FLAMEEYES_VIA_PROXY "!1" "chain"
SecRule REQUEST_HEADERS:User-Agent "!lynx"

# Ignore very old Mozilla versions (not modern browsers, often never
# exiting) and pre-2 versions of Firefox.
# Also ignore comments coming from IE 5 or earlier since we don't care
# about such old browsers. Note that Yahoo feed fetcher reports itself
# as MSIE 5.5 for no good reason, but we don't care since it cannot
# _post_ comments anyway.
# 2009-02-27: Very old Gecko versions should not be tollerated, grace
# the period 2007-2009 for now.
# 2009-03-01: Ancient Opera versions usually posting spam comments.
# 2009-04-22: Some spammers seem to send requests with "Opera "
# instead of "Opera/", so list that as an option.
SecRule REQUEST_HEADERS:User-Agent "(mozilla/[0123]|firefox/[01]|gecko/200[0123456]|msie ([12345]|7.0[ab])|opera[/ ][012345678])" 
    "log,msg:'User-Agent too old to be true, posting spam comments.',deny,status:403"

# The Mozilla/4.x and /5.x agents have 0 as minor version, nothing
# else.
SecRule REQUEST_HEADERS:User-Agent "(mozilla/[45].[1-9])" 
    "log,msg:'User-Agent sounds fake, posting spam comments.',deny,status:403"

# Malware and spyware that advertises itself on the User-Agent string,
# since a lot of spam comments seem to come out of browsers like that,
# make sure we don't accept their comments.
SecRule REQUEST_HEADERS:User-Agent "(funwebproducts|myie2|maxthon)" 
    "log,msg:'User-Agent contains spyware/adware references, posting spam comments.',deny,status:403"

# Bots usually provide an http:// address to look up their
# description, but those don't usually post comments. Consider any
# comment coming from a similar User-Agent as spam.
SecRule REQUEST_HEADERS:User-Agent "http://" 
    "log,msg:'User-Agent spamming URLs, posting spam comments.',deny,status:403"

    "^mozilla/4.0+" "log,msg:'Spaces converted to + symbols, posting spam comments.',deny,status:403"

# We expect Windows XP users to upgrade at least to IE7. Or use
# Firefox (even better) or Safari, or Opera, ...
# All the comments coming from the old default OS browser have a high
# chance of being spam, so reject them.
# 2009-04-22: Note that we shouldn't check for 5.0 and 6.0 NT versions
# specifically, since Server and x64 editions can have different minor
# versions.
SecRule REQUEST_HEADERS:User-Agent "msie 6.0;( .+;)? windows nt [56]." 
    "log,msg:'IE6 on Windows XP or Vista, posting spam comments.',deny,status:403"

# List of user agents only ever used by spammers
# 2009-04-22: the "Windows XP" declaration is never used by official
# MSIE agent strings, it uses "Windows NT 5.0" instead, so if you find
# it, just kill it.
SecRule REQUEST_HEADERS:User-Agent "(libwen-us|msie .+; .*windows xp)" 
    "log,msg:'Confirmed spam User-Agent posting spam comments.',deny,status:403"
Comments 5
  1. You should give recaptcha a look ( theres plenty of implementation examples available for all the popular web scripting languages and its a quite good (for now) comment spam solution

  2. I’d sooner enable full comment moderation than using captchas, I sincerely hate the concept itself.Besides, right now I’m spending the time I “take my coffee”:… to grep the Apache logs to find the new comments and encoding new signatures into the mod_sec antispam system.If I could get a decent open proxy check, it’d be much simpler (for what it’s worth, IRC networks performed such checks even years ago, so there has to be a way to deal with that) since all these comments wouldn’t filter through.And in any case, Typo with Akismet identify all of them (Akismet seems to have that list of open proxies, because if I comment through one of them, as a test, it gets filtered down even if the comment is proper).

  3. Well the thing is that the spammers are clever enough to fool many of your regexp solutions. Captchas as is is a novelty solution that can be broken,but the method used by recaptcha is quite good since it uses old scanned books as template, rather than computer-generated hard-to-read text.Your approach with blocking by all these regexps will most certainly block out unintended legit visitors, as you already noted.

  4. My method does not only base itself on regexps, and the solution I’m hoping to implement asap is to check for open proxies which is _not_ based on regexps at all. Also I don’t block visitors but just commenters.Regarding captchas, it does not really solve anything, re-captcha or not, regarding blocking lecit users; I happen to have _lots_ of troubles to read some captchas; all the ones that I can read easily are the ones that bots can break just as easily, the remaining one are obstacles for users too. And they are boring, wastes of time and so on so forth.I can deal with captchas in registration forms, but not every time I want to write a comment. The method I have now is having the regexp alongside with Akismet and other systems; Akismet can stop the comments from appearing on the site, I just need to clean them up, the ones I filter beforehand is just an extra.

  5. I understand your points & i don’t like captchas either, but i’ve struggled with the same problems myself and thought to share some of what I learnt.Besides, if you look at it in a different perspective; using captchas helps accelerate the birth of skynet

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.