As I noted earlier, I’ve been doing some more housecleaning of bad HTTP crawlers and feed readers. While it matters very little for me and my blog (I don’t pay for bandwidth), I find it’s a good exercise and, since I do publish my ModSecurity rules, it is a public service for many.
For those who think that I may be losing real readership in this, the number of visits on my site as seen by Analytics increased (because of me sharing the links to that post over to Twitter and G+, as well as in the GitHub issues and the complaint email I sent to the FeedMyInbox guys), yet the daily traffic was cut in half. I think this is what is called a win-win.
But one thing that became clear from both AWStats and Analytics is that there was one more crawler that I had not stopped yet. The crawler name is Semalt, and I’m not doing them the favour of linking to their website. Those of you who follow me on Twitter have probably seen what they categorized as “free PR” for them, while I was ranting about them. I called them a cancer for the Internet; I then realized that the right categorization would be bacteria.
If you look around, you’ll find unflattering reviews and multiple instructions to remove them from your website.
Funnily, once I tweeted about my commit, one of their people, who I assume is in their PR department rather than engineering given the blatant stupidity of their answers, told me that it’s “easy” to opt out of their scanner: you just have to go on their website and tell them your websites! Sure, sounds like a plan, right?
But why on earth am I spending my time attacking one particular company that, to be honest, is not wasting that much of my bandwidth to begin with? Well, as you can imagine from me comparing them to shigella bacteria, I do have a problem with their business idea. And given that on Twitter they completely missed my point (when I pointed out the three spammy techniques they use, their answer was “people don’t complain about Google or Bing”; well, yes, neither of the two uses any of their spammy techniques!), it’ll be difficult for me to consider them as merely mistaken. They are doing this on purpose.
Let’s start with the technicalities, although that’s not why I noticed them to begin with. As I said earlier, their way to “opt out” from their services is to go to their website and fill in a form. They completely ignore robots.txt, they don’t even fetch it. And given this is an automated crawler, that’s bad enough.
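For comparison, this is roughly what a well-behaved crawler is expected to do before fetching anything. It is only a sketch using Python’s standard `urllib.robotparser`; the `ExampleBot` name, the `ROBOTS_TXT` content and the `allowed` helper are all made up for illustration, and a real crawler would download the file from the site rather than use a hard-coded string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for this example); a real
# crawler would fetch it from https://<host>/robots.txt before anything else.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleBot", "https://example.com/blog/post"),
        ("ExampleBot", "https://example.com/private/notes"),
        ("OtherBot", "https://example.com/blog/post"),
    ]:
        print(agent, url, allowed(agent, url))
```

The point is that the opt-out lives on the site being crawled, not in a form on the crawler’s own website.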
The second is that they don’t advertise themselves in the User-Agent header. Instead all their fetches report Chrome/35, and given that they can pass through my ruleset, they probably use a real browser with something like WebDriver. So you have no real way to identify their requests among a number of others, which is not how a good crawler should operate.
The third and most important point is the reason why I consider them just spammers, and so, it seems, do others, given the links I posted earlier. Instead of using the user agent field to advertise themselves, they subvert the Referer header, which means that all their requests, even those that have been 301’d and 302’d around, will report their website as referrer. And if you know how AWStats works, you know that it doesn’t take that many crawls for them to be one of the “top referrers” for your website, and thus appear prominently in your stats, whether they are public or not.
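Since the Referer header is the one thing that gives them away, it is also the obvious thing to filter on. The following is a minimal sketch of that idea in Python, not my actual ModSecurity rule: the `is_referrer_spam` helper and the blocklist are hypothetical, and the `semalt.com` domain is an assumption derived from the crawler’s name.

```python
from typing import Iterable, Optional
from urllib.parse import urlsplit

# Hypothetical blocklist; the domain is an assumption based on the crawler name.
BLOCKED_REFERRER_DOMAINS = frozenset({"semalt.com"})


def is_referrer_spam(
    referer: Optional[str],
    blocked: Iterable[str] = BLOCKED_REFERRER_DOMAINS,
) -> bool:
    """Return True if the Referer header points at a blocked domain."""
    if not referer:
        return False
    # hostname is None when the value has no scheme or is otherwise malformed;
    # such values are simply treated as "not spam" in this sketch.
    host = (urlsplit(referer).hostname or "").lower()
    # Match the domain itself and any of its subdomains, but not unrelated
    # hosts that merely mention the name in the path or query string.
    return any(host == domain or host.endswith("." + domain) for domain in blocked)


if __name__ == "__main__":
    print(is_referrer_spam("http://semalt.semalt.com/crawler.php?u=http://example.org/"))
    print(is_referrer_spam("https://www.google.com/search?q=semalt.com"))
```

Matching on the parsed hostname rather than doing a substring search over the whole header avoids blocking a legitimate search engine referral whose query merely contains the name.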
As their Twitter person thanked me for my “free PR” for them, I wanted to expand on it further, with the hope that people will get to know them. And to avoid them. My ModSecurity ruleset, as I said already, is set up to filter them out; other solutions for those who don’t want to use ModSecurity are linked above.
And Nabble found even more interesting things about this bacteria: http://blog.nabble.nl/post/93306955157/semalt-infecting-computers-to-spam-the-web
Ah, nice technical post on the issue. Another Australian marketer, Matthew Forzan, posted a 101 guide to blocking them with htaccess: http://matthewforzan.com.au… I’ve also had some interesting interactions with their PR person on Twitter today, along with getting threatened with legal action… they seem to have a very, ummm, interesting/aggressive stance on discussions about their platform.
The link to your published mod_security set points to a 404 page. You should have a look at my handling overly-slashed URLs post. 😉
Hah, will double-check that ;) In this case I blame a bad sed on my part, I’ll fix that tonight.