What’s up with Semalt, then?

In my previous post on the matter, I called for a boycott of Semalt, by blocking their crawler’s access to your servers, after a very bad-looking exchange on Twitter with a supposed representative of theirs.

After I posted that, the same representative threatened to sue me for libel, even though the post was documenting their current practices rather than shaming them. The exchange got enough attention from other people who have been following the Semalt situation that I was able to gather some more information on the matter.

In particular, there are two interesting blog posts by Joram van den Boezen about the company and its tactics. It turns out that what I thought was a very strange private cloud setup – coming as it was from Malaysia – was actually a botnet. Indeed, what appears from Joram’s investigations is that the people behind Semalt use sidecar malware both to gather URLs to crawl and to crawl them. And this, according to their hosting provider, is allowed because they make it clear in their software’s license.

This is consistent with what I have seen of Semalt on my server: rather than my blog – which fares pretty well on the web as a source of information – I found them requesting my website, which is almost dead. Looking at all the websites on all my servers, the only other one affected is my friend’s, which is far from an important one. But if we accept Joram’s findings (and I have no reason not to), then I can see how that can happen.

My friend’s website is visited mostly by people from the area we grew up in, and general friends of his. I know how bad their computers can be, as I have been doing tech support on them for years, and paid my bills that way. Computers that were bought either without a Windows license or with Windows Vista, that got XP installed on them so badly that they couldn’t get updates even when they were available. Windows 7 updates that were done without actually owning a license, and so on and so forth. I have, at some point, added a ModRewrite-based warning for a few known viruses that would alter the Internet Explorer User-Agent field.
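
That warning was nothing fancy: just a rule that matches the tell-tale User-Agent and sends the visitor to a warning page. A sketch of the idea follows; the marker string and the page path are made up for illustration, not what I actually deployed.

RewriteEngine On
# hypothetical marker that a known virus appends to MSIE's User-Agent
RewriteCond %{HTTP_USER_AGENT} INFECTED-UA-TOKEN
# avoid looping on the warning page itself
RewriteCond %{REQUEST_URI} !^/virus-warning\.html$
RewriteRule ^ /virus-warning.html [R,L]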

Add to this that even people who aren’t strapped for cash want to avoid paying for anything if they can, and you can see why software such as SoundFrost and other similar “tools” for turning YouTube videos into music files is quite likely to be found on the computers that end up browsing my friend’s site.

What remains unclear from all this information is why they are doing it. As I said in my previous post, there is no reason to abuse the referrer field other than to spam websites’ statistics. Since the company sells SEO services, one assumes they do it to attract more customers. After all, if you spend time checking your Analytics output, you probably are the target audience for SEO services.

But beyond that, there are still questions with no answer. How can the company do any analytics when they don’t seem to have any infrastructure of their own, but rather use botnets to find and access websites? Do they only make money from their subscriptions? And here is where things get tricky, because I can only hypothesize and speculate, which is dangerous to begin with.

What I can tell you is that out there many people have no scruples, and I’m not referring to Semalt here. When I tried to raise awareness about them on Reddit (a site that I don’t generally like, but that can be put to good use sometimes), I stopped by the subreddit to get an idea of what kind of people would be around there. It was not what I was expecting, not at all. What I found is that there are people out there seriously considering using black hat SEO services. Again, this is speculation, but my assumption is that these are consultants who basically want to show their clients that their services are worth it by inflating the websites’ access statistics.

So either these consultants just buy the services from companies like Semalt, or the final site owners themselves don’t understand that a company promising “more accesses” does not really mean “more people actually looking at your website and considering your services”. It’s hard for people who don’t understand the technology to discern between “accesses” and “eyeballs”. It’s not much different from the fake Twitter followers studied by Barracuda Labs a couple of years ago — I know I read a more thorough study of one of the websites selling this kind of service, but I can’t find it. That’s why I usually keep that stuff on Readability.

So once again, give the network some antibiotics, and help cure the web of people like Semalt and of those who would buy their services.

Blog pages numbering

This post is mostly trivial and useless; you can skip it. Seriously.

I was musing about something the other day: Typo lets you browse the whole history of my blog over time through complete archives, for the whole content as well as for tags and categories. These are numbered as “pages” in the archives, but the numbering is not permanent.

Indeed, the homepage you see is counted as “page 1” — so as the pages grow further and further, the content always moves. A post that is on page 12 today will not be there in a couple of months. Sure, it’s still possible to find it in the monthly archives (once the month is complete), but it’s far from obvious.

This page numbering is common on most systems where you want the most recent, or most relevant, content first, such as search engines and, indeed, most news sites and blogs. But while the bottom-up order of the posts within a single page makes sense to me, the numbering still doesn’t.

What I would like is for pages to start from page 1 (the oldest posts) and continue onwards, ten at a time, until reaching page 250 (which is pretty near at this point, for this blog) around post number 2501 — unfortunately the naive version of this breaks badly: if the homepage corresponded to that last page, it might show only a single article. So what is it that I would like?

Well, first of all, I would say that the homepage (as well as the landing pages for tags and categories) is “page 0”, and page 0 sits outside the ordering of the archives altogether. Page 0 is bottom-up, just like we have now, and has a fixed number of entries. Page 1 is the oldest ten (or fewer) posts, top-down (in ascending date order), and so forth.
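
To make the idea concrete, here is a small sketch of the numbering scheme (in Python rather than Typo’s actual Ruby; the names are made up):

PER_PAGE = 10

def page_number(post_index):
    """Stable page for a post, where post_index counts from the oldest
    post (0-based); page 1 holds the ten oldest posts."""
    return post_index // PER_PAGE + 1

def page_contents(posts, page):
    """posts is the full list in ascending date order; page 0 is the
    homepage and shows the newest posts first, as it does today."""
    if page == 0:
        return list(reversed(posts[-PER_PAGE:]))
    start = (page - 1) * PER_PAGE
    return posts[start:start + PER_PAGE]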

What does this achieve? Well, first of all, a given post will always be on a given page. Old posts no longer slide around, which makes page links actually useful; it also lets search engines return meaningful results pointing to those pages, instead of to an ever-moving target — even though I would say they should probably check the semantic data when reading the archive pages.

At first I thought this would reduce cache use as well, since stopping the sliding means the content of a given page no longer changes with every single post… unfortunately, at most it can help with cache fragments, as adding more pages means there will be a different “last page” number (or link) at the bottom of each page. Of course, it would be possible to use a /page/last link and only list the pages immediately before and after the current one.

Oh well, I guess this adds to the list of changes I’d like to make to Typo (but can’t right now, for lack of time).

Who consumes the semantic web?

In my previous post I noted that I was adding support for the latest fad in semantic tagging of data on web pages, but it was not at all clear who actually consumes that data. So let’s see.

Among the changes to Typo that I’ve been sending to support a fully SSL-compatible blog install (mine isn’t entirely there yet, mostly because most of the internal links from one post to the next are not protocol-relative), I’ve added one commit to provide a bit more OpenGraph data — OpenGraph is used almost exclusively by Facebook. The only metadata I provide through that protocol, though, is an image for the blog – since I don’t have a logo, I’m sending my gravatar – plus the title of the single page and the global site title.
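
Concretely, that amounts to a handful of meta tags in the page’s head; a rough sketch of the result (the exact property mapping and the values below are illustrative, not copied from the commit):

<head>
  <!-- OpenGraph metadata, read by Facebook when a link is shared -->
  <meta property="og:site_name" content="Site title goes here" />
  <meta property="og:title"     content="Who consumes the semantic web?" />
  <meta property="og:image"     content="https://www.gravatar.com/avatar/HASH" />
</head>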

Why that? Mostly because this way, if you do post a link to my blog on Facebook, it will appear with the title of the post itself instead of whatever is visible on the page. This sidesteps the question of whether the blog’s own title should be dropped from the <title> tag.

As far as Google is concerned, instead, the most important piece of metadata you can provide seems to be authorship tagging, which uses Google+ to connect content by the same author. Is this going to be useful? Not sure yet, but at least it shows up in a less anonymous way in the search results, and that can’t be bad. Contrary to what they say on the linked page, it’s possible to use an invisible <link> tag to connect the two, which is why you won’t find a G+ logo anywhere on my blog.
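
For reference, the invisible variant boils down to a rel="author" link in the page’s head pointing at the Google+ profile (the profile URL below is a placeholder):

<head>
  <!-- ties the page to a Google+ profile without showing a badge -->
  <link rel="author" href="https://plus.google.com/PROFILE_ID" />
</head>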

What else do search engines do with the remaining semantic data? Not sure; they don’t seem to explain it, and since I don’t know what happens behind the scenes it’s hard for me to give a proper answer. But I can guess, and hope, that they use it to reduce the redundancy of the current index. For instance, pages that are actually lists of posts, such as the main index, the category/tag pages and the archives, will now properly declare that they are describing blog postings whose URLs are, well, somewhere else. My hope is that the search engines will then link to the declared post URL instead of the index page, and possibly boost the results for the more popular posts (given that they can then count the comments). What I’m certainly counting on is for the descriptions in search results to be more human-oriented.
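
The markup in question is schema.org microdata; roughly, each entry on an index page gets wrapped like this (a sketch with a placeholder URL, not the exact markup Typo emits):

<article itemscope itemtype="http://schema.org/BlogPosting">
  <h2 itemprop="name">Who consumes the semantic web?</h2>
  <!-- the canonical URL of the post, even when shown on an index page -->
  <link itemprop="url" href="https://blog.example.net/2012/semantic-web" />
</article>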

Now, in the case of Google you can use their Rich Snippets testing tool to get an idea of what it finds. I’m pretty sure they take all this data with a grain of salt, though, seeing how many players there are in the “SEO” world trying to game the system. But at least I can hope that things will move in the right direction.

Interestingly, when I first implemented the new semantic data, Readability did not support it, and would show my blog’s title instead of the post’s title when reading articles from there — after some feedback on their site they added a workaround for my case, so you can enjoy their app with my content just fine. Hopefully, with time, the microformat will be supported more generally.

On the other hand, Flattr still makes no use of metadata, as far as I can see. They require you to add a button manually, repeating the kind of metadata (content type, language, tags) that could easily be inferred from the microformat already provided. I’d like to reiterate my plea to the Flattr developers to read OpenGraph and other microformat data, and at least use it to augment the manually-inserted buttons. Supporting the schema.org format, by the way, should make it relatively easy to add per-fragment buttons — i.e., I wouldn’t mind a per-comment Flattr button to reward constructive comments, like they have on their own blog, but without the overhead of adding them manually.

Right now this is all the semantic data that I have figured out is actually being consumed. Hopefully things will become more useful in the future.

A story of bad suggestions

You might have noticed that my blog was down for a little while today. The reason is that I was trying to get Google Webmaster Tools to work again, as I’ve been spending some more time lately cleaning up my web presence — I’ll soon post more news about Autotools Mythbuster and the direction it’s going to take.

How did that take my blog down, though? Well, the new default for GWT’s validation of website ownership is a DNS TXT record, instead of the old meta header on the homepage or the file in the document root. Unfortunately, it doesn’t work as well.

First, it tries to be smart by checking which DNS servers the domain is assigned to — which meant it showed me instructions on how to log in to my OVH account (great). On the other hand, it told me to create the new TXT record without setting a subdomain — too bad it will not accept a validation on flameeyes.eu for blog.flameeyes.eu.

The other problem is that the moment I added a TXT record for blog.flameeyes.eu, resolution of the host no longer followed the CNAME, which meant the host became unreachable altogether. I haven’t checked the DNS documentation to learn whether this is a bug in OVH or whether the GWT suggestion is completely broken; either way, it was a bad suggestion.
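
As far as I understand, a CNAME is not supposed to coexist with any other record at the same name, which would explain the breakage; the zone ended up looking roughly like this (the CNAME target and the verification token are placeholders):

; the existing record pointing the blog at its web host
blog.flameeyes.eu.  IN  CNAME  webhost.example.net.
; what GWT asked to add at the very same name, which is not allowed
; next to a CNAME
blog.flameeyes.eu.  IN  TXT    "google-site-verification=TOKEN"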

Also, if you can’t reach individual posts and always end up on the homepage, please flush your cache: I made a mess while fixing the redirects that fix links from all over the Internet — it should all be fine now, and links should all work, even those that were previously mangled due to non-ASCII-compatible URLs.

Finally, I’ve updated the few posts where a YouTube video was linked; they now use the iframe-based embed, which means they are viewable without Adobe Flash, via HTML5. That should not cause any issues.

Do I hate bots too much?

Since I’ve been playing a bit more with ModSecurity after reviewing the book, I’ve decided to implement an idea I had been toying with for some time but never got around to. But let’s start with some background.

I’ve had quite a few pet peeves with crawlers and generic bots. The main problem I have is their sheer number. Once upon a time there would be a fairly limited number of bots floating around, but nowadays you get quite a few of them at once: some are the usual search engines, others are more “amateurish” things, and “of course” there are the usual spam crawlers, but the ones that really upset me are the marketing-focused crawlers. I’ll split my post by type, in reverse order.

Marketing crawlers are those deployed by companies that sell services such as analysing blog posts to find bad publicity and the like. I definitely hate these crawlers: they keep downloading my posts for their fancy analysis when they could already use a search engine’s data. Since most of them also seem to focus only on profit, instead of developing their technology first, they tend to ignore the robots exclusion protocol and the HTTP/1.1 features meant to avoid wasting bandwidth, and they don’t bother with any delay between requests.

I usually want to kill these bots outright; some don’t even look for, let alone respect, the robots.txt file, and a huge robots.txt would be impractical anyway. So I use a list of known bot names and a simple ModSecurity rule that denies them access. Again, thanks to Magnus’s book, I’ve been able to make the rule much faster by using the pmFromFile matcher instead of the previous regular expression.
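
The rule boils down to something like the following; the data file name is just an example, and the list itself obviously changes over time:

# bad-bots.data holds one bot name per line; @pmFromFile matches them
# against the User-Agent header with a set-based, case-insensitive scan
SecRule REQUEST_HEADERS:User-Agent "@pmFromFile bad-bots.data" \
    "deny,status:403,msg:'Known marketing or spam crawler.'"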

The second category is the spam crawlers, something we’re unfortunately quite used to seeing nowadays. In this category you actually have more than one type: those that crawl your site to find email addresses (which is what Project Honey Pot tries to look out for), those that send requests to your site to spam the referrer counter (to gain extra credit if your referrer statistics are public – one more reason why my statistics are secured by obscurity and by a shallow password), and those that use your feeds to get content to post on their own site, without a link back but with lots of advertising.

These are nasty, but they are more difficult to kill; I’ll get to that later.

The third category is the amateurish crawlers: new technologies being developed, “experimental” search engines and the like. I understand that it’s quite difficult for their developers to have something to work with if we all block them. But on the other hand, they really should start by respecting protocols and conventions, as well as by describing their work and where the heck they are trying to go with it.

One funny thought here: if a start-up wanted to develop new crawler technology, widely distributing rules and documentation to filter its requests out would be quite an evil way to kill the company off. A suggestion for those who might find themselves in that situation: try to get a number of affiliates who will let you crawl their sites. To do that you need to either show a lot of good intent or bribe them. It’s up to you which you choose, but lacking both, it’s likely going to be hard to get your stuff together.

The last category is the search engine crawlers: Googlebot, msnbot, Yahoo! Slurp. If you want to keep them out you usually just do it through robots.txt, and there is not much else to say about them in general. The whole point of talking about them here is that crawlers from the previous categories sometimes try to pass themselves off as one of the more famous crawlers in order to be let in. For this reason, all of them suggest that you verify their identity through double resolution of the IP address: take the IP address of the request, reverse-resolve it to a hostname (checking that it falls within the right domain: for Googlebot, for instance, it’s simply .googlebot.com), and then resolve that hostname to make sure it maps back to the same address.

The double resolution makes sure that a fake bot cannot simply point its reverse resolution at the correct domain. Luckily, Apache already has code to handle this properly for host-based authorization: you just need to set HostnameLookups to Double. Once that’s enabled, the REMOTE_HOST variable becomes available to ModSecurity. The result is the following snippet of Apache configuration:

HostnameLookups Double

# each pair chains a User-Agent check with a reverse-DNS check: deny
# requests that claim to be a well-known crawler but do not resolve
# back into that crawler's domain
SecRule REQUEST_HEADERS:User-Agent "@contains googlebot" \
    "chain,t:lowercase,deny,status:403,msg:'Fake Googlebot crawler.'"
SecRule REMOTE_HOST "!@endsWith .googlebot.com"

SecRule REQUEST_HEADERS:User-Agent "@contains feedfetcher-google" \
    "chain,t:lowercase,deny,status:403,msg:'Fake Google feed fetcher.'"
SecRule REMOTE_HOST "!@endsWith .google.com"

# msnbot has no dedicated (sub)domain, so a regular expression is needed
SecRule REQUEST_HEADERS:User-Agent "@contains msnbot" \
    "chain,t:lowercase,deny,status:403,msg:'Fake msnbot crawler.'"
SecRule REMOTE_HOST "!@rx msnbot-[0-9]+-[0-9]+-[0-9]+\.search\.msn\.com$"

SecRule REQUEST_HEADERS:User-Agent "@contains yahoo! slurp" \
    "chain,t:lowercase,deny,status:403,msg:'Fake Yahoo! Slurp crawler.'"
SecRule REMOTE_HOST "!@endsWith .crawl.yahoo.net"

At that point, requests claiming to come from the three main bots will only get through if they really do. You might notice that a more complex regular expression is used to validate the Microsoft bot. The reason is that both Google and Yahoo!, to be safe, give their crawling hosts their own (sub)domain, while Microsoft and (at a quick check, as I haven’t implemented the test for it, since it doesn’t get as many hits as the rest) Ask Jeeves don’t have dedicated domains (the regexp for Ask Jeeves would be crawler[0-9]+.ask.com). And of course changing that is going to be tricky for them, because many people are already validating them this way. So, learn from their mistakes.

Hopefully, the extra rules I’m loading into ModSecurity are actually saving me bandwidth rather than wasting it; given that some fake bots seem to make hundreds of requests a day, that’s quite likely. Also, thankfully, I have nscd running (so Apache does not have to send every lookup to the DNS server), and the DNS server is within the local network (so the bandwidth used to contact it is not as precious as the bandwidth used to send data out).

My next step is probably going to be optimisation of the rules, although I’m not sure how to proceed with that; I’ll get to it when I push this to an actual repository for a real project, though.

Link checking

I started writing a blog just to keep users updated on the development of Gentoo/FreeBSD and other projects I worked on; it was never my intention to make it my biggest project, but one thing leads to another and I’m afraid to say that lately my biggest contribution to free software is this very blog. I’m not proud of this, it really shouldn’t be this way, but lacking time (and having a job that has me working on proprietary rather than free software), this is the best I can come up with.

But I still think a contribution is only worth as much as it is properly done, and for this reason it bothers me that I cannot go over all the current posts and make sure there are no factual mistakes in them. Usually, if I know I got something wrong and want to explain and fix the mistake a long time after publication, I just write a new “correction” entry and link to the older post; originally this worked out nicely because Typo would handle the internal trackback, so the two posts would be cross-linked automatically. Unfortunately, trackbacks don’t seem to work any more, even though I did enable them when I started the User-Agent filtering (so that the spam could be reduced to a manageable amount).

In addition, there are quite a few posts that for now are only available on the older blog, which bothers me quite a bit: it’s full of spam, gets my name wrong, and forces users to search two places for the first topics I wrote about. Unfortunately, migrating the posts out of the b2evolution install is quite cumbersome, and I guess I should try to bribe Steve again about that.

Update (2016-04-29): I actually imported the old blog in 2012. I have also started merging in every other post I wrote anywhere else in the meantime.

Anyway, besides the factual errors in the content, there are a few other things I can and should deal with on the blog, and one of them is the validity of external and internal links. Now, I know this is the sort of stuff that falls into the so-called “Search Engine Optimisation” field, and I don’t care. I dislike the whole idea, and I find that calling it “SEO” is just a way for script kiddies to feel important, like a “CEO”; I don’t do this for the search engines, I do it for the users. I don’t like finding broken links on a site, so I’d like my own sites not to have any.

Google Webmaster Tools is very interesting in this regard, since it lets you find broken inbound links; I already commented on OSGalaxy breaking my links (and in the meantime I no longer get published there, because they don’t handle Atom feeds). For that and other sites, I keep a whole table of redirections for the blog’s URLs, as well as a series of normalisations for URLs that often carry trailing garbage characters (semicolons and the like).
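
The normalisation amounts to little more than a rewrite rule plus the redirection table itself; a sketch of what I mean (the pattern and paths are illustrative, not my actual configuration):

RewriteEngine On
# strip trailing punctuation that other sites glue onto copied links
RewriteRule ^(.+?)[;,)]+$ $1 [R=301,L]

# one entry of the redirection table, for a post whose URL changed
Redirect permanent /articles/old-permalink /articles/new-permalink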

Unfortunately, as far as I can see, what GWT lacks is a way to check outbound links; it would be a very useful tool for that, because Google has to index the content anyway, so adding such checks shouldn’t be much of a problem for them. The nicest thing would be for Typo (the software running my blog) to check links before publishing and alert me about errors (an as-you-type check would help, but it would require a proxy to cache requests for at least a few hours, otherwise I would be hitting the same servers many times while writing). Since that does not seem to be happening for now, and I don’t foresee it happening in the near future, I’m trying to find an alternative approach.

At the time I’m writing this (which is not when you’re going to read it), I’m running a local copy of the W3C LinkChecker (I should package it for Gentoo, but I don’t have much experience with Perl packaging) over my blog; I already ran it over my site and xine’s, and fixed a few of the entries the link checker spewed out.
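
For reference, the invocation is roughly the following; the exact options depend on the LinkChecker version installed, so take it as a sketch rather than a recipe:

# crawl the blog recursively and report the broken links it finds
checklink --recursive http://blog.flameeyes.eu/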

Again, this is not the final solution I need; the problem is that it does not allow me to run an actual incremental scan. While I’m currently caching all the pages through polipo, that is only going to work for today’s spree, not in the long run. There are quite a few problems with the current setup:

  • it does not allow removing the 1-second delay between requests, not even for localhost (when I’m testing my own static site locally I don’t need any delay at all; I can actually pipeline lots of requests together);
  • it has no way to provide a whitelist of known-unreachable URLs (like my Amazon wishlist, which does not respond to HEAD requests);
  • while the output is quite suitable for sending via email (so I can check each day for new entries), I would have preferred it to output XML, with an XSL provided to convert it to something user-friendly; that would have let me handle the URL filtering in a more semi-automatic way;
  • finally, it does not support IDN, and I like IDN, which makes me a bit sad.

For now, what I’ve gathered from the checker output is that my use of Textile for linking causes most of the bad links in the blog (because it keeps semicolons, closing parentheses and so on as part of the link), and I dislike the effect of the workaround of adding spaces (no, the “invisible space” is not a solution, since the parser doesn’t recognise it as whitespace and adds it to the link as well). And there are lots of broken links because, after the Typo update, the amazon: links don’t work any longer. This actually gives me a bit of an opportunity: they used to be referral links (even though they never made any difference), and since the change of styles I don’t need those any longer, so I’ll just replace them database-wide with direct links.

Reducing the blog

Since I wrote about my stance on the superabundance of web content, I’ve been thinking about how to reduce the duplication and wheel reinvention on my blog.

The first chance has been the refined mod_security rules, which are now a bit tighter and actually allowed me to disable comment pre-moderation. This is not only one less thing for the blog to take care of, but more importantly one less thing I have to do, which frees up my time for more important things. Like writing posts.

The second chance has been my thinking about feeds, which has now brought me to the point where the blog no longer provides RSS feeds at all, only Atom feeds all over the place: for tags, categories, articles and comments, just Atom feeds. If your reader is not broken, it should switch automatically, since the old RSS URLs are redirected to Atom; this way the amount of cache Typo has to maintain is considerably reduced, in both number of entries and size.
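
Whether it’s done in Apache or inside Typo itself, the redirect is conceptually a one-liner per feed; a mod_alias sketch with illustrative paths (Typo’s real feed routes may differ):

# send readers still asking for the RSS version over to the Atom feed
Redirect permanent /articles.rss /articles.atom
RedirectMatch permanent ^/xml/rss20/(.*)$ /xml/atom10/$1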

This is all fine and dandy for reducing the amount of work done behind the scenes, but it does not reduce much of what the blog exposes to the outside. Sure, there is less for search engines to crawl, since there are now half as many feeds as before, but that’s barely user-visible. I’ve removed the links to the RSS version, so the Syndication box in the right column is also shorter, but that still does not limit what is exported to users.

In the spirit of “less is more”, I would really love my blog to be quite minimal in what it reimplements, and the one thing making me think right now is the search box. Months ago I already removed the in-Typo search method, since it wasn’t working that well for multi-page results, which are not that rare on a blog with over one thousand posts; when you press enter, you now get Google for the search results page, and that is actually quite nice. I do have to find a way to stop it from showing all the tag, category and archive pages, though. Today, I’m wondering whether the live search box is still useful.

Besides being buggy and sometimes searching for a short substring rather than what I kept typing, it uses JavaScript on the client side and makes some requests on the server side too. And it does not search the comments, just the articles’ bodies. On the other hand, the Google custom search is already set up to search the blog (comments included), the site and the gitweb archive, which usually makes much more sense. While it requires one full page load away from the site, it is probably much more likely to find what you’re looking for, if you’re searching my blog.

As for more visible changes, I’m probably going to convert the site to use the same “Beautiful Day” theme as the blog, so that the two blend together better. Although I would love it if a friend of mine were to prepare a design for me, since I love his; maybe not so minimal, but then I don’t see why I have so much graphics on mine (and yes, I know I don’t have any on the site).

Service entry: looking for information about a “Yeti bot”

Yesterday the stats for my blog were all messed up. With a fairly usual number of unique visits (about 900), the number of requested pages and of total hits skyrocketed, totalling more than 400 MB of traffic, against the usual ~100 MB (depending on the blog posts and whether they get picked up by other sites).

Sure, it wasn’t last year’s Slashdot-effect numbers, but it was still quite a bit of bandwidth being used. AWStats wasn’t picking up any new bot, nor a single specific IP biasing the stats, so I had to do some manual analysis to find the cause…

There are two possible culprits. One is a German IP (reporting Opera as its user agent), which seemed to refresh my last post on Gentoo’s “issue” constantly from 9 AM to 10 PM. It seems like a legitimate reader, although I’d suggest that reader, if (s)he’s reading this, use the RSS feed for the comments instead; it will save their bandwidth and mine ;)

The other is clearly a bot, as it advertises itself as such: “Yeti/0.01 (nhn/1noon, yetibot@naver.com, check robots.txt daily and follow it)”. The requests from this bot come from a single class B network, although mixed with a different “NaverBot” (which points to http://help.naver.com/delete_main.asp in its user agent).

The netblock owner is NHN Corporation, which seems to be the entity behind the Naver site, which in turn seems to be some kind of search engine, likely something similar to Technorati, but my Korean is… well, let’s just say the only Asian language I can barely understand is Japanese.

I don’t mind indexing: I don’t block any robot in my robots.txt, and right now bandwidth is far from being a problem (it would have been a very big problem if the blog were still hosted on my home connection, though). But they hit the robots.txt file 384 times just yesterday, out of 542 total hits for the day! At this point I’d very much like to write to them about this.
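
If it comes to that, the least drastic option would be asking it to slow down via robots.txt, assuming Yeti honours the nonstandard Crawl-delay directive, which I haven’t verified:

# ask Naver's crawler to wait between requests; Crawl-delay is
# nonstandard and whether Yeti respects it is an assumption
User-agent: Yeti
Crawl-delay: 60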

So the question is: am I the only one out there being hit by this “Yeti bot”? Do any of my readers understand Korean, and can you tell me what the NaverBot page linked above says?

Sorry for the service posting.