The presence of awstats in the title here can be misunderstood for a moment; as I’ll show in the rest of the post, I am putting it there as an experiment. Bear with me.
A new year and a new month, and as with each new month, the statistics I gather from my awstats interface (which is protected by a password and SSL to be on the safe side) show me a huge amount of referrer spam. If you don’t know what referrer spam is, then you have probably never gathered raw statistics from access logs, so I’ll try to explain.
Spammers wish to increase the pagerank of their websites, so that there are more chances that random Google searches will bring in visitors; one common method is to leave links to their websites in spam comments on blogs, which Google will index and traverse. Luckily for us, spam comments on blogs are relatively easy to stop; most of the time I don’t even see them, thanks to the ModSecurity antispam system I wrote for my blog (more on that in a moment).
But at the same time, a number of websites leave their reported statistics open to be seen by everybody, including Google! A quick search for the right terms shows about 126 thousand results, and that is just by looking for AWStats, which is but one of the many applications that do so. This means that the spammers have a huge chance of actually getting their pagerank boosted. It helps them even more that even the latest version of AWStats does not use rel=nofollow on the list of referrers!
Referrer spam also requires a lot less work than comment spam does: you don’t need to send POST requests with the right fields, you just need to request the pages, sometimes even if they don’t exist, with the URL you want to spam in the Referer (not my typo!) header. Now, there are a few patterns to be found even among these referrer spammers, most of whom seem to optimise their requests’ bandwidth by targeting URLs that seem to show the presence of the software they are after (such as AWStats). Indeed, the one that has received the most hits since is the one that had AWStats in the title, which I’ve been using lately to detect these patterns.
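To make this a bit more concrete, here is a minimal Python sketch of how one could tally external referrer hosts from an access log. It assumes the Apache "combined" log format; the function name `referrer_hosts`, the sample lines, the addresses and the `spam.example` domain are all made up for illustration, and this is not how AWStats itself does it.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Apache "combined" format (assumed):
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def referrer_hosts(lines, own_host):
    """Count external referrer hosts seen in combined-format log lines."""
    counts = Counter()
    for line in lines:
        match = COMBINED.match(line.strip())
        if not match:
            continue  # not a combined-format line, skip it
        # An empty referrer is logged as "-", which has no hostname.
        host = urlsplit(match.group("referer")).hostname
        if host and host != own_host:
            counts[host] += 1
    return counts


# Invented sample lines: two spam hits on missing pages, one normal visit.
SAMPLE = [
    '192.0.2.1 - - [01/Jan/2011:10:00:00 +0000] "HEAD /awstats-post HTTP/1.1" '
    '404 - "http://spam.example/pills" "Mozilla/5.0"',
    '192.0.2.1 - - [01/Jan/2011:10:00:01 +0000] "HEAD /other HTTP/1.1" '
    '404 - "http://spam.example/pills" "Mozilla/5.0"',
    '198.51.100.7 - - [01/Jan/2011:10:00:02 +0000] "GET / HTTP/1.1" '
    '200 512 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for host, hits in referrer_hosts(SAMPLE, "blog.example").most_common():
        print(host, hits)
```

Note that the spammer never needed the page to exist: both sample requests get a 404, and the referrer still ends up in the log.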
In particular, I have noticed that a number of these spammers seem to be even more interested in reducing the traffic they generate than most legit sites, to the point that they use the HEAD method rather than the GET method. This allowed me to create a simple rule to ban almost all of them: I simply deny HEAD requests coming from real browsers’ user agents. As far as I can see, no browser ever sends HEAD requests; if they want to validate their cache, they use the If-None-Match or If-Modified-Since headers instead.
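The logic of that rule is simple enough to sketch in a few lines of Python. This is only an illustration of the idea, not the actual ModSecurity rule from my ruleset; the function name `looks_like_head_spam` and the list of browser-like User-Agent prefixes are assumptions I picked for the example.

```python
# Prefixes that browser User-Agent strings commonly start with
# (an assumption for this sketch, not an exhaustive list).
BROWSER_PREFIXES = ("Mozilla/", "Opera/")


def looks_like_head_spam(method, user_agent):
    """Return True for a HEAD request that claims to come from a browser."""
    return method.upper() == "HEAD" and user_agent.startswith(BROWSER_PREFIXES)


if __name__ == "__main__":
    print(looks_like_head_spam("HEAD", "Mozilla/5.0 (X11; Linux x86_64)"))
    print(looks_like_head_spam("GET", "Mozilla/5.0 (X11; Linux x86_64)"))
    print(looks_like_head_spam("HEAD", "curl/7.21.0"))
```

Tools that legitimately use HEAD, such as link checkers or command-line clients, normally announce themselves with their own User-Agent, so they are not caught by this check.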
Trying to reduce the impact of these spammers on my own websites, I’ve been improving and upgrading the rules in my ModSecurity Ruleset to detect both the patterns and the known spammers; generally speaking the added logic shouldn’t be excessively taxing. By the way, if you are using my rules and like them, please do flattr them! At least I’ll know somebody does make use of them now that I have published them.
Interestingly enough, you might remember a previous post of mine, where I noted that always letting users register is bad. More and more, it seems that a number of these spammers try to work around the easy checks done through URL blacklists (similar to the DNSBLs used to block IPs) by going through third parties that might be legit but ignore the nofollow rule, which means that their pagerank is, once again, ensured.
I guess I really should start learning Lua so that I can write more complex but thorough checks for my rules; since it seems like ModSecurity now has hashing capabilities, it wouldn’t be too bad if I could simply check each domain in the referrers once every two hours against a live blacklist, and then skip over it in between, to avoid repeating the same test over and over again.
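Until I get around to writing it in Lua, here is a rough Python sketch of the caching idea: remember the verdict for each domain for two hours and only query the blacklist again once it has expired. Everything here is hypothetical; `CachedBlacklist` and its `lookup` callable are names I made up, and the actual blacklist query (a DNS lookup or whatever else) is deliberately left to the caller.

```python
import time


class CachedBlacklist:
    """Cache per-domain blacklist verdicts for a fixed time-to-live."""

    def __init__(self, lookup, ttl_seconds=2 * 60 * 60, clock=time.monotonic):
        self._lookup = lookup      # callable: domain -> truthy if listed
        self._ttl = ttl_seconds    # two hours by default
        self._clock = clock        # injectable so it can be tested
        self._cache = {}           # domain -> (expires_at, verdict)

    def is_listed(self, domain):
        now = self._clock()
        cached = self._cache.get(domain)
        if cached is not None and cached[0] > now:
            return cached[1]       # still fresh, skip the live check
        verdict = bool(self._lookup(domain))
        self._cache[domain] = (now + self._ttl, verdict)
        return verdict


if __name__ == "__main__":
    # Stand-in for a real blacklist query; the domain is invented.
    blacklist = CachedBlacklist(lambda domain: domain == "spam.example")
    print(blacklist.is_listed("spam.example"))
    print(blacklist.is_listed("good.example"))
```

Both positive and negative verdicts are cached, so a legit referrer that shows up a thousand times in two hours costs a single lookup as well.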
For those interested in follow-ups on the issue, I sent the AWStats author a patch to add rel=nofollow to the links, so that even public statistics stop boosting spammers’ pagerank. The same patch is now applied by the latest ~arch version of awstats, which I hope to get stable sometime next month.
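The idea behind the patch is tiny. AWStats is written in Perl and the real patch touches its own code, so take the following Python only as a sketch of the principle; `referrer_link` is a made-up helper, not anything from AWStats.

```python
from html import escape


def referrer_link(url):
    """Render a referrer URL as a link that search engines should not follow."""
    safe = escape(url, quote=True)  # also guards against markup in the URL
    return f'<a href="{safe}" rel="nofollow">{safe}</a>'


if __name__ == "__main__":
    print(referrer_link("http://spam.example/?a=1&b=2"))
```

With rel="nofollow" in place the link is still clickable for whoever reads the statistics, but it is no longer supposed to count towards the target’s pagerank.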