What’s up with Semalt, then?

In my previous post on the matter, I called for a boycott of Semalt by blocking access to your servers from their crawler, after a very bad-looking exchange on Twitter with a supposed representative of theirs.

After I posted that, the same representative threatened to sue me for libel, even though the post was documenting their current practices rather than shaming them. It got enough attention from other people who have been following the Semalt situation that I could actually gather some more information on the matter.

In particular, there are two interesting blog posts by Joram van den Boezen about the company and its tactics. It turns out that what I thought was a very strange private cloud setup – coming as it was from Malaysia – was actually a botnet. Indeed, what appears from Joram’s investigations is that the people behind Semalt use sidecar malware both to gather URLs to crawl and to crawl them. And this, according to their hosting provider, is allowed because they make it clear in their software’s license.

This is consistent with what I have seen of Semalt on my server: rather than my blog – which fares pretty well on the web as a source of information – I found them requesting my website, which is almost dead. Looking at all the websites on all my servers, the only other one affected is my friend’s, which is by no means an important one. But if we start by accepting Joram’s findings (and I have no reason not to), then I can see how that can happen.

My friend’s website is visited mostly by people from the area we grew up in, and general friends of his. I know how bad their computers can be, as I have been doing tech support on them for years, and paid my bills that way. Computers that were bought either without a Windows license or with Windows Vista, that got XP installed on them so badly that they couldn’t get updates even when they were available. Windows 7 updates done without actually possessing a license, and so on and so forth. I have, at some point, added a ModRewrite-based warning for a few known viruses that would alter the Internet Explorer User-Agent field.
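For the curious, that kind of rule is nothing fancy. Here is a minimal sketch, with an illustrative adware token and a made-up warning page rather than the exact ones I used back then:

```apache
# Minimal sketch: redirect browsers whose User-Agent carries a known
# adware/toolbar token to a local warning page.
# "FunWebProducts" and /malware-warning.html are illustrative examples.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} FunWebProducts [NC]
RewriteCond %{REQUEST_URI} !^/malware-warning\.html$
RewriteRule ^ /malware-warning.html [R=302,L]
```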

Add to this that even those who shouldn’t be strapped for cash will want to avoid paying for anything if they can, and you can see why software such as SoundFrost and other similar “tools” to download YouTube videos into music files is quite likely to be found on the computers that end up browsing my friend’s site.

What remains unclear from all this information is why they are doing it. As I said in my previous post, there is no reason to abuse the referrer field other than to spam the statistics of the websites. Since the company is selling SEO services, one assumes that they do so to attract more customers. After all, if you spend time checking your Analytics output, you probably are the target audience for SEO services.

But after that, there are still questions that have no answer. How can that company do any analytics when they don’t really seem to have any infrastructure but rather use botnets for finding and accessing websites? Do they only make money with their subscriptions? And here is where things can get tricky, because I can only hypothesize and speculate, words that are dangerous to begin with.

What I can tell you is that out there, many people have no scruples, and I’m not referring to Semalt here. When I tried to raise awareness about them on Reddit (a site that I don’t generally like, but that can be put to good use sometimes), I stopped by the subreddit to get an idea of what kind of people would be around there. It was not what I was expecting, not at all. Indeed, what I found is that there are people out there seriously considering using black hat SEO services. Again, this is speculation, but my assumption is that these are consultants who basically want to show their clients that their services are worth it by inflating the websites’ access statistics.

So either these consultants just buy the services from companies like Semalt, or the final site owners themselves don’t understand that a company promising “more accesses” does not really mean “more people actually looking at your website and considering your services”. It’s hard for people who don’t understand the technology to discern between “accesses” and “eyeballs”. It’s not much different from the fake Twitter followers studied by Barracuda Labs a couple of years ago — I know I read a more thorough study of one of the websites selling this kind of service, but I can’t find it. That’s why I usually keep that stuff on Readability.

So once again, give some antibiotics to the network, and help cure the web of people like Semalt and those who would buy their services.

Antibiotics for the Internet, or why I’m blocking Semalt crawlers

As I noted earlier, I’ve been doing some more housecleaning of bad HTTP crawlers and feed readers. While it matters very little for me and my blog (I don’t pay for bandwidth), I find it’s a good exercise and, since I do publish my ModSecurity rules, it is a public service for many.

For those who think that I may be losing real readership with this: the number of visits on my site as seen by Analytics increased (because I shared the link to that post on Twitter and G+, as well as in the GitHub issues and the complaint email I sent to the FeedMyInbox guys), yet the daily traffic was cut in half. I think this is what is called a win-win.

But one thing that became clear from both AWStats and Analytics is that there was one more crawler that I had not stopped yet. The crawler’s name is Semalt, and I’m not doing them the favour of linking to their website. Those of you who follow me on Twitter have probably seen what they categorized as “free PR” for them, while I was ranting about them. I first called them a cancer for the Internet, but then realized that the right categorization would be bacteria.

If you look around, you’ll find unflattering reviews and multiple instructions to remove them from your website.

Funnily enough, once I tweeted about my commit, one of their people, who I assume is in their PR department rather than engineering given the blatant stupidity of their answers, told me that it’s “easy” to opt out of their scanner: you just have to go to their website and tell them your websites! Sure, sounds like a plan, right?

But why on earth am I spending my time attacking one particular company that, to be honest, is not wasting that much of my bandwidth to begin with? Well, as you can imagine from me comparing them to Shigella bacteria, I do have a problem with their business idea. And given that on Twitter they completely missed my point (when I pointed out the three spammy techniques they use, their answer was “people don’t complain about Google or Bing” — well, yes, because neither of the two uses any of their spammy techniques!), it’ll be difficult for me to consider them merely mistaken. They are doing this on purpose.

Let’s start with the technicalities, although that’s not why I noticed them to begin with. As I said earlier, their way to “opt out” of their services is to go to their website and fill in a form. They completely ignore robots.txt; they don’t even fetch it. And given this is an automated crawler, that’s bad enough.

The second is that they don’t advertise themselves in the User-Agent header. Instead all their fetches report Chrome/35 — and given that they can pass through my ruleset, they probably use a real browser with something like WebDriver. So you have no real way to identify their requests among a number of others, which is not how a good crawler should operate.

The third and most important point is the reason why I consider them just spammers, and so, it seems, do others, given the links I posted earlier. Instead of using the User-Agent field to advertise themselves, they subvert the Referer header. Which means that all their requests, even those that have been 301’d and 302’d around, will report their website as the referrer. And if you know how AWStats works, you know that it doesn’t take that many crawls for them to become one of the “top referrers” for your website, and thus appear prominently in your stats, whether those are public or not.
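To give you an idea, catching them on that Referer with ModSecurity is trivial. This is just a sketch along the lines of what my ruleset does, with a made-up rule id, not the rule itself:

```apache
# Sketch of a ModSecurity rule that rejects any request whose Referer
# claims to come from semalt.com (the rule id is arbitrary).
SecRule REQUEST_HEADERS:Referer "@contains semalt.com" \
    "id:990001,phase:1,deny,status:403,log,msg:'Semalt referrer spam'"
```

If you would rather not serve a 403, swapping the deny action for ModSecurity’s drop works just as well and wastes even less bandwidth on them.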

At this point it would be easy to say that they are clueless and are not doing this on purpose, but then there is the other important part. Their crawler executes JavaScript, which means that it gets tracked by Google Analytics, too! Analytics has no access to the server logs, so for it to display their referrer, as reported by the people looking to filter it out, the crawler has to make an effort. Again, this could easily be a mistake, given that they are using something like WebDriver, right?

The problem is that whatever they use, it does not fetch either images or CSS. But it does fetch the Analytics JavaScript and execute it, as I said. And the only reason I can think of for them to want to do so is to spam the referrer list in there as well.

As their Twitter person thanked me for my “free PR” for them, I wanted to expand on it further, with the hope that people will get to know them. And avoid them. My ModSecurity ruleset, as I said, is already set up to filter them out; other solutions, for those who don’t want to use ModSecurity, are linked above.

More on browser fingerprinting, and Analytics

A few days ago I pointed out how it’s possible to use some of the Chrome extensions (and likely just as many of the Firefox ones) to gather extra entropy in addition to what Panopticlick already knows about. But this is not the only source of identification that Panopticlick is not considering, and that can be used to track users.

I originally intended to write a full proof of concept for it, but since I’m currently in Mountain View, my time is pretty limited, so I’ll limit myself to a description. Panopticlick factors in the Accept header that the browser sends with the page’s request, but there is one thing that it does not check for, as it’s a bit more complex to do: the Accept header for images. Indeed, different browsers support different image formats, as I’ve found before, and even browsers that do support, for instance, WebP, such as Opera and Chrome, will send widely different Accept headers.
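If you want to verify this yourself, you don’t need a proof of concept at all: logging the Accept header next to the User-Agent on your own server is enough to compare what different browsers advertise for image requests. A purely illustrative Apache configuration sketch, not something I actually run:

```apache
# Log the Accept header alongside the User-Agent so the per-browser
# differences for image (and page) requests become visible.
LogFormat "%h %t \"%r\" \"%{Accept}i\" \"%{User-Agent}i\"" acceptlog
CustomLog logs/accept.log acceptlog
```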

What does this mean? Well, if you were trying to replace, let’s say, your Chrome user agent with a Firefox one, you’d now have a nearly unique combination of a Firefox user agent accepting WebP images. Your hope of hiding by muddying the waters just made you stand out much more easily. The same goes if you were trying to disable WebP requests to make your images’ Accept header more like Firefox’s: now you have a given version of Chrome that does not appear to support WebP — the likelihood of being unique is even higher.

So why am I talking so much about browser fingerprinting lately? Well, you may or may not have noticed, but both my blog and Autotools Mythbuster are now using Google Analytics. The reason is that, after my doubts on whether to keep running the blog or not, I want to know exactly how useful my blog is to people, and how many people end up reading it at any given time. I was originally a bit unsure about whether this was going to be a problem for my readers, but seeing how easy it is to track people stealthily, tracking people explicitly shouldn’t be considered a problem — thus why I’m going to laugh at your expense if you start complaining about this being a “web bug”.