Google Maps: Spam and Scams

Last year I wrote an analysis of fishy ads on Facebook, in part because I had no way to escalate the level of scam I was seeing, and I thought it would be useful for others to be able to follow the breadcrumbs and recognize that type of scam. Since I switched companies this year, I once again find myself unable to escalate the massive amount of scams and spam I see around me on Google Maps… so let me try to show you what I mean.

For the past two and a half years I’ve been living in West London — I’m not going to show exactly where, but it’s in Brentford. When I moved into the apartment, I noticed that the various pictures I was taking at home (of food, the back of the router, and so on) kept popping up in my notifications with a suggestion from Google Maps to add them to a review of the building my apartment is in. The reason turned out to be that the building was marked as Lodging, which made it count as commercial accommodation rather than a residential building. Oops, but why?

Well, it turns out to be a smaller version of what’s going on across London and the rest of the world — something that became apparent a few months later, and is very clear now. On our floor there are not one but two “holiday rental” units, despite that apparently being against the leasehold agreement between the building and the owners. And indeed, that company and a few others appear to have tried taking over the Google Maps entities for the apartment buildings in our complex, and for a few nearby ones, in particular by adding phone numbers for inquiries about the buildings — buildings that have no central phone number.

Let me try to show you how deep that rabbit hole goes — on your right you can see “Pump House Crescent”, a road going around some tall apartment buildings not far from where we live. You can already notice that Green Dragon Estate has a pink pin instead of a gray one — that’s because it’s currently marked as a Lodge and has reviews, probably for the same reason I was prompted to add pictures from my own apartment. As far as I can tell, it’s actually part of the various blocks on the crescent, and holds residential apartments instead. I’ve sent an edit request.

When you zoom in a bit more, you can see interesting things. Turner House, Hyperion Tower, Cunningham House, Masson House, and Aitons House are all marked as either Apartment Building or Flat Complex, both of which are residential, and that’s correct. But then there’s Bridgeman House, which is marked as Lodging — and that is likely wrong (edit sent and published as I’m typing; it turns out Google Maps reviewers are very prompt at fixing these issues when you report them).

But not far from it is the possible scam you can see on the left. 360 Serviced Apartments is even showing availability for a residential building — one they are not actually located in. They give their address as Masson House so that Google Maps puts them there, but their actual company address, according to their website, is in London W5 — and the picture they posted was taken quite a way away, on the river, rather than where the pin sits. Can we call it a scam? I think I will.

It doesn’t end here — there’s another similar scam across the road for another serviced apartment that even managed to put their website’s domain in the name of the point of interest! In this case there’s no picture of the outside, so it is possible that the inside pictures are actually appropriate — but again we’re talking about a residential building, in a residential area, that, as far as I can tell, is not cleared for subletting. Does it end here? Of course it does not.

In the gallery above you can see one of the most blatant scams I could find on Maps in this area. As you can see from the first picture, down a lane from Kew Bridge station there’s an Apple Apartments Kew Bridge. Once you click on it, you can see that it is reported as a 4-star hotel. With a picture of a station — except that the station is definitely not the National Rail Kew Bridge station, but rather the London Underground Hammersmith station, which, as any resident of West London could tell you, is nowhere near Kew Bridge!

And as you scroll through the gallery, you can see more pictures uploaded as if they were taken from the hotel, but which were clearly taken in different places. There are pictures of Kew Bridge itself, as if it were visible from the property. And then you can see from the various reviews that this is not a 4-star hotel at all. Indeed, the star rating of UK hotels is defined by the AA, and for 4-star hotels they expect:

4 stars: Professional, uniformed staff respond to your needs or requests. Well-appointed public areas. The restaurant or dining room is open to residents and non-residents. Lunch is available in a designated eating area.

Let me remind you that those “Apple Apartments” are definitely holiday-let apartments, not a hotel.

I’m sure this kind of thing is not limited to Google Maps: I remember being a “SuperUser” for Foursquare years ago and having to review similar spam/scam situations, which is why I’ve been doing this on and off with Google Maps, both while working at Google and now. But at the same time, I don’t think it’s fair that we, the public, end up having to clean up after this level of spam and abuse.

Anyway, if you live in an area with lots of residential buildings, do take a look at whether many of them are marked as Lodging and have phone numbers attached (particularly mobile numbers), and consider cleaning them up if you can. It’s not just about protecting tourists from scams; it’s also about making sure a residential area isn’t peppered with commercial listings created by rule-bending (or rule-breaking) holiday lets.

What’s up with Semalt, then?

In my previous post on the matter, I called for a boycott of Semalt by blocking access to your servers from their crawler, after a very bad-looking exchange on Twitter with a supposed representative of theirs.
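
For anyone who wants to join in, here is a minimal sketch of such a block using ModSecurity; the rule id is arbitrary and the pattern is only an assumption on my part about how their traffic identifies itself, so adapt it to what you actually see in your logs:

    # Reject any request whose Referer mentions semalt; their "crawler"
    # traffic announces itself this way. Pick a rule id that is free in
    # your own configuration.
    SecRule REQUEST_HEADERS:Referer "@rx (?i)semalt\.com" \
        "id:4910001,phase:1,t:none,deny,status:403,msg:'Semalt referrer spam'"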

After I posted that, the same representative threatened to sue me for libel, even though the post was documenting their current practices rather than shaming them. This got enough attention from other people who have been following the Semalt situation that I could gather some more information on the matter.

In particular, there are two interesting blog posts by Joram van den Boezen about the company and its tactics. It turns out that what I thought was a very strange private cloud setup – coming as it was from Malaysia – was actually a botnet. Indeed, what emerges from Joram’s investigations is that the people behind Semalt use sidecar malware both to gather URLs to crawl and to crawl them. And this, according to their hosting provider, is allowed because they make it clear in their software’s license.

This is consistent with what I have seen of Semalt on my server: rather than my blog – which fares pretty well on the web as a source of information – I found them requesting my website, which is almost dead. Looking at all the websites across my servers, the only other one affected is my friend’s, which is hardly an important one either. But if we accept Joram’s findings (and I have no reason not to), I can see how that can happen.

My friend’s website is visited mostly by people in the area we grew up in, and by his friends in general. I know how bad their computers can be, as I did tech support on them for years, and paid my bills that way: computers that were bought either without a Windows license or with Windows Vista, that got XP installed on them so badly that they couldn’t get updates even when they were available; Windows 7 updates done without actually owning a license; and so on and so forth. At some point I even added a mod_rewrite-based warning for a few known viruses that alter the Internet Explorer User-Agent field.
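
For illustration only, that kind of warning boils down to a couple of mod_rewrite directives; the marker strings and the warning page below are made up, since the ones I actually matched are long gone:

    # Server-level configuration: send browsers whose User-Agent carries a
    # known malware marker to a static warning page, without looping on it.
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} "FunWebProducts|SomeMalwareToolbar" [NC]
    RewriteRule !^/virus-warning\.html$ /virus-warning.html [R=302,L]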

Add to this that even those who shouldn’t be strapped for cash will avoid paying for anything if they can, and you can see why software such as SoundFrost and other similar “tools” for turning YouTube videos into music files is quite likely to be found on the computers that end up browsing my friend’s site.

What remains unclear from all this is why they are doing it. As I said in my previous post, there is no reason to abuse the referrer field other than to spam websites’ statistics. Since the company sells SEO services, one assumes they do it to attract more customers. After all, if you spend time checking your analytics output, you are probably the target audience for SEO services.

But after that, there are still questions that have no answer. How can that company do any analytics when they don’t really seem to have any infrastructure but rather use botnets for finding and accessing websites? Do they only make money with their subscriptions? And here is where things can get tricky, because I can only hypothesize and speculate, words that are dangerous to begin with.

What I can tell you is that out there many people have no scruples, and I’m not referring to Semalt here. When I tried to raise awareness about them on Reddit (a site that I don’t generally like, but that can be put to good use sometimes), I stopped by the subreddit to get an idea of what kind of people would be around there. It was not what I was expecting, not at all. What I found is that there are people out there seriously considering using black hat SEO services. Again, this is speculation, but my assumption is that these are consultants who basically want to show their clients that their services are worth it by inflating the websites’ access statistics.

So either these consultants just buy the services from companies like Semalt, or the site owners themselves don’t understand that a company promising “more accesses” does not really mean “more people actually looking at your website and considering your services”. It’s hard for people who don’t understand the technology to tell “accesses” from “eyeballs”. It’s not much different from the fake Twitter followers studied by Barracuda Labs a couple of years ago — I know I read a more thorough study of one of the websites selling this kind of service, but I can’t find it. That’s why I usually keep that stuff on Readability.

So once again, give the network some antibiotics, and help cure the web of people like Semalt and of those who would buy their services.

WebP and the effect on my antispam

When I was posting my notes about WebP, I found out that I could no longer post on my own blog. The reason was in my own ModSecurity rules: quite a long time ago I added an antispam rule that blocks POST requests if they include image/webp in the Accept header.
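
The rule in question was, in essence, the following; it’s a sketch of the idea rather than the literal rule from the ruleset, and the id is made up:

    # At the time, no legitimate browser declared WebP support on a form
    # POST, so the combination was a reliable spam marker.
    SecRule REQUEST_METHOD "@streq POST" \
        "id:4910010,phase:1,t:none,deny,status:403,msg:'WebP Accept on POST',chain"
        SecRule REQUEST_HEADERS:Accept "@contains image/webp"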

Unfortunately, for whatever reason, instead of adding image/webp only to image requests, Chrome adds it to every single request it makes, including the POST requests generated when submitting a form… It doesn’t sound entirely correct, to be honest, but there was probably a reason for it.

So I dropped the WebP check from my rules. Today I checked my comments and found four pieces of spam. It turns out that particular check was very effective, and losing it is going to be a pain. On the other hand, the spam now seems to declare that it accepts image/x-bitmap while claiming to come from Firefox, two conditions that I expect are never met together by real-life browsers, so I can probably look into adding a rule for that.
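
Something along these lines should do it, assuming, as said, that no real browser ever combines the two (rule id made up):

    # Firefox never advertises image/x-bitmap in its Accept header, so a
    # "Firefox" that does is almost certainly a spam bot.
    SecRule REQUEST_HEADERS:User-Agent "@contains Firefox" \
        "id:4910011,phase:1,t:none,deny,status:403,msg:'Fake Firefox with x-bitmap Accept',chain"
        SecRule REQUEST_HEADERS:Accept "@contains image/x-bitmap"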

Another interesting rule I added recently, and have not discussed yet, relies on the fact that this blog is now only available over HTTPS. Most of the spam comments I receive are posted directly over HTTPS, but they report the original post’s URL over plain HTTP as the referrer. Filter these out, and most of my spam is gone.
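
A sketch of that filter follows; the hostname is a placeholder for the blog’s own, and the check only makes sense on a site served exclusively over HTTPS:

    # Comments are only ever posted from pages served over HTTPS, so a
    # Referer pointing at the plain-HTTP URL of this same site is a bot
    # replaying a pre-HTTPS copy of the form.
    SecRule REQUEST_METHOD "@streq POST" \
        "id:4910012,phase:1,t:none,deny,status:403,msg:'Plain-HTTP referrer on HTTPS-only site',chain"
        SecRule REQUEST_HEADERS:Referer "@beginsWith http://www.example.com/"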

Long live ModSecurity — the problem will come when HTTP/2 is out, as it’s binary and leaves much less room for request fingerprinting.

ModSecurity and my ruleset, a release

After the recent Typo update I had some trouble with Akismet not properly marking comments as spam — at least the very few spam comments that could get past my ModSecurity ruleset — so a couple of days ago I set out to find out why.

Well, to be honest, I didn’t really want to focus on the why at first. The first thing I found while looking at the way Typo uses Akismet is that it still bundled a hacked, ancient Akismet library. Given that my API key was valid, I jumped to the conclusion, right or wrong, that the code was simply using an ancient API that had been retired, and decided to look around for a newer Akismet library; lo and behold, a 1.0.0 gem was released not many months ago.

After a bit of fiddling, the new Akismet library worked like a charm, and spam comments getting past ModSecurity were again marked as such. A pull request and its comments later, I had a perfectly working Typo that marks comments as spam as well as before, with one fewer library bundled into it (and I also got the gem into Portage, so there is no problem there).

But this left me with the problem that some spam comments were still getting through my filters! Why did that happen? Well, if you remember, my idea was to validate the content of the User-Agent header… and it turns out that the latest Firefox versions have such a short User-Agent string that almost every spammer seems able to copy it just fine, so they weren’t killed off as intended. So, more digging into the requests.

Some work later, I was able to find two rules with which to validate Firefox and a bunch of other browsers: the first relies on checking for the Connection: keep-alive header, which Firefox always sends (I tried almost every possible combination), and the other relies on checking the Content-Type of the POST request for a declared charset: browsers define one, but whatever the spammers are using nowadays doesn’t.
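
Roughly, the two checks look like this; it’s a simplified sketch with made-up ids rather than the literal rules, and the first one only catches clients that omit the Connection header entirely:

    # 1) A client claiming to be Firefox that sends no Connection header
    #    at all is not the real thing.
    SecRule REQUEST_HEADERS:User-Agent "@contains Firefox" \
        "id:4910020,phase:1,t:none,deny,status:403,msg:'Fake Firefox without Connection header',chain"
        SecRule &REQUEST_HEADERS:Connection "@eq 0"

    # 2) Real browsers declare a charset on form POSTs; the spammers'
    #    tools of the day did not.
    SecRule REQUEST_METHOD "@streq POST" \
        "id:4910021,phase:1,t:none,deny,status:403,msg:'Form POST without charset',chain"
        SecRule REQUEST_HEADERS:Content-Type "!@rx charset="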

Of course, the problem is that once I actually describe and upload the rules, spammers will just improve their tools to stop making these mistakes, but in the meantime I’ll have a calm, spamless blog. I still won’t give in to captchas!

At any rate, besides adding these validations, another round of testing let me fix things for Opera Turbo users (they can now comment just fine), and that led me to tag the ruleset and… release it! You can now download it from GitHub or, if you use Gentoo, just install it as www-apache/modsec-flameeyes — there’s also a live ebuild for the bravest.

Telecom Italia, the Registro Pubblico delle Opposizioni, and the user

A note up front: this post deals with a purely Italian issue (Telecom Italia’s telemarketing and the national do-not-call registry) and may make little sense to anyone not familiar with the Italian phone market, so bear with me.

I hope this post can help others, but I fear it may only illustrate a situation that cannot be solved easily, and especially not by individuals acting alone. But let’s start from the beginning.

When I started being contacted by clients, employers and even friends without having to bother my parents, and wanted to be able to make and receive calls even while my mother was on the phone for hours with her friends, I decided to get a personal VoIP number to use as an “office number”. At the time the cheapest and most common option was Eutelia’s Skypho, which was later simply renamed Eutelia VoIP.

I have been using that number for years, and for a while (and still now, in fact, though not for much longer) it was listed in the phone directory, without consent to commercial calls. Until shortly after I registered for VAT, I used it for any correspondence that required a number other than my mobile.

Unfortunately Eutelia has always had a terrible VoIP system, and while it was acceptable at first, at some point it became so annoying that I decided to move my official number to another provider, run by a friend and colleague of mine. Again unfortunately, Eutelia did not allow its phone numbers to be ported, which was even more of a problem because an 0418 number is reserved for non-Telecom operators. So, to avoid dealing with Eutelia’s SIP server, my only option was to forward calls to the new number, which meant every forwarded call was paid for, d’oh!

In any case, a year or two ago I had already started receiving calls from Telecom Italia offering to bring me back to Telecom (after I had already done so!), or to put me in touch with the new area manager (who doesn’t exist, since Telecom Italia has been working through agencies for quite some time). For a while they also kept offering to bring that number back to Telecom (every time I told them the number wasn’t Telecom’s), even though I knew perfectly well that, since the 0418 range is reserved for operators other than Telecom, they didn’t have the slightest chance of porting the number in, even if they had been able to reach an agreement with Eutelia to do so.

In the end, the number was ported out of Eutelia and is now handled by my preferred provider, thanks to the new regulations that came into force in February 2011, which require operators to provide portability of non-mobile numbers for their customers. This means I no longer need to forward every call, although I still set up call forwarding when I’m not at home (or rather, in the office), so that I don’t miss calls but receive them on my mobile instead.

The problem is that since last year, every two weeks, I receive a call from Telecom Italia: the script has changed over time, first offering to bring me back to Telecom, then to switch to TIM, and yesterday they started with “we’re letting you know about the new rates because you are a long-standing customer who has never changed operator” (before last year, the last Telecom bill I can find, definitely not in my name, is from 2001 or thereabouts). Moreover, for a while now the calls have been coming in with caller ID 191, as if they came from Telecom’s own call centre.

With the new operator I asked not to be listed in the directory, mainly because being listed would have cost extra, and it makes neither sense nor difference to me. And as soon as I heard about it, I registered the number with the Registro Pubblico delle Opposizioni, the public opt-out registry that lets you explicitly mark a number as one that must not be called. According to their own status check, the registration has been validated since at least 27 May 2011.

Yesterday I got fed up: after telling the operator, irritated, for the umpteenth time that I’m not interested and do not wish to be called back (no, it wasn’t the first time I told them!), I went ahead and called 191 (after all, that’s the number the calls report as coming from). Once I explained the situation, they pointed me to two main ways of handling it:

  • ask my provider to file a complaint with the Ministry (of Economic Development, I suppose), stating the date and time of the last call and requesting a check on the call centre’s licence;
  • send a complaint directly to Telecom Italia, if I have an idea of which agency is running the call centre.

The first route is a bit awkward, mainly because my provider told me he would rather not file the complaint, since the first response would be a counter-complaint against him that he would then have to defend against. And it might not lead anywhere good anyway, because there are a few ways to work around the problem and keep calling even though the Registro is supposed to prevent it.

The second one… will have to wait at least until next week, because I’m currently without a printer (mine broke and my supplier is putting together a “semi-new” one for me). So I tried to go for a third route.

First point: I think I know which agency is involved. Last July, hoping to make these calls stop (and with the excuse that I actually needed it), I made an appointment to “switch to TIM” (in reality I only needed two SIMs, one for this laptop and one for the iPad; while Tre is definitely the cheapest, TIM has better coverage, which makes it irreplaceable in certain situations). They immediately set me up with an appointment with a local agent. Now, there’s no doubt that this appointment created a long list of other problems, but the fact that they didn’t have someone call me back means they were working directly with this person’s agency: Serenissima Informatica of Padua. Yes, I’ve moved on to name and shame, and you’ll soon understand why.

As I said, there was a long series of problems with my request for two SIMs; suffice it to say that in the end I had to wait exactly one month to get TIM on my iPad. For that reason I had the agent’s number at hand and could call him right away. Too bad he didn’t answer yesterday and I had to call him back today; luckily he remembered me and the mess that had happened. Okay, a good start. It went downhill from there, though: instead of taking a direct interest, his answer was rather evasive, saying that there are lots of agencies in Padua, that even if his stopped the others would start, and so on. Let’s say he didn’t convince me much.

In the end we agreed that I would send him an email with a summary of the problem and my subscriber code for the RPO, and that he would forward it to his managers. Too bad that… the email address he left on his business card doesn’t exist. The domain he gave me doesn’t even appear to be registered to the agency. Hooray.

The result? For now I’ll wait for my provider to put the new system into production, which includes a configurable blacklist for incoming calls, so that I can block the calls that appear to come from 191. After all, I always give my mobile number as the primary contact anyway; the office number is mostly a fallback in case the mobile isn’t reachable.

ModSecurity, changing times

You probably remember my recent rant about Debian’s ModSecurity packaging, which started with me trying to get my ruleset working for VideoLAN to help them fight back the spam. Well, thanks to the folks behind the ModSecurity Twitter account I was able to get in touch with the Debian maintainer (Alberto), and it now looks like the story will have a happy ending.

Alberto is working on a similar split between the ModSecurity module and the Core Rule Set configuration files, so that they can be managed with the Debian package manager just as they can already be managed with Portage. And to make it easier to administer both distributions, I’ve decided to make a few changes to the Gentoo ebuilds so that the installed layouts of the two differ as little as possible.

The first change relates to the internal name of the package; while I haven’t decided on a package move yet, mod_security is a Gentooish spelling: the package is actually called ModSecurity upstream, the tarball is named modsecurity-apache, and you can already see that the CRS is modsecurity-crs. Configuration files and storage directories now also use modsecurity — I’ll see when I feel like renaming the package altogether to www-apache/modsecurity.

The second change relates to the way the rule configuration files are installed; until now the rules were installed in a subdirectory of the Apache configuration tree, which is not suitable for Debian and looked awkward even in Gentoo — the new directory for the ModSecurity CRS rules is /etc/modsecurity. Furthermore, what was once modsecurity_crs_10_config.conf is now /etc/apache2/modules.d/80_modsecurity-crs.conf and contains the include masks for the rest of the rules. This will allow the ebuild to enable or disable rules depending on USE flags in the future.

And to make it as easy to deal with as possible, I’ve now added a geoip USE flag to mod_security — which does nothing more than add dev-libs/geoip to its runtime dependencies and set the configuration file to use the database installed by that ebuild. The reason for this dependency is twofold: on one side, declaring the dependency helps ensure that the database is installed and kept updated by Portage; on the other, if you already have a license for MaxMind’s GeoIP databases, the package provides you with all the updater scripts you need to fetch the updated data from MaxMind.
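
For reference, wiring the database into ModSecurity comes down to two directives; the path below is where the dev-libs/geoip database usually ends up, but treat it as an assumption and adjust it to your system:

    # Load MaxMind's country database once at startup...
    SecGeoLookupDb /usr/share/GeoIP/GeoIP.dat

    # ...then resolve the client address on each request; the GEO collection
    # (e.g. GEO:COUNTRY_CODE) becomes available to the rules that follow.
    SecRule REMOTE_ADDR "@geoLookup" "id:4910030,phase:1,t:none,pass,nolog"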

A little digression about GeoIP: I think it might be a good idea to consider changing the GeoIP ebuild to instead have a virtual that provides the database, either in the form of the updater scripts for the paid versions, or as GeoLite packages that can be updated regularly. Unfortunately I don’t have the time to pursue something like this for now.

Going back to my personal favourite subject on the ModSecurity topic, my ruleset has gained a number of fake-browser matching patterns with a fairly low risk of false positives – thanks to the testing that you helped me with – and should now filter almost any spam you’re going to receive. I’m now updating the documentation to provide examples of how to debug the rules themselves; in the coming days I might try to find some extra time to tag all the rules so that they can be disabled in blocks when the new ModSecurity 2.6 is released.

Don’t forget to recommend my ruleset, report problems and … flattr it!

Help me test the rules!

I’ve been working even harder to make my ModSecurity ruleset as strong as possible without causing too many false positives. With the current git master I’m sure I can reject most of the spam sources before they hit at all… but.

But I’m not sure whether I made it a bit too strong. So please leave a comment on this post with each of the browsers you usually use, and if it doesn’t work, drop me an email with your browser’s version (and if you can include your IP address, it’ll be easier for me to find the request in the logs).

Thanks!

ModSecurity and Debian, let the challenge begin

Some of you might have already read about the personal ruleset I developed to protect my blog from the tons of spam comments it receives daily. It is a set of configuration files for ModSecurity for Apache that denies crawlers, spammers and other malicious clients access to my websites.

I have been talking with Jean-Baptiste of VLC fame over the past two days about using the same ruleset to protect their wiki, which has even worse spam problems than my blog. Judging from the logs j-b has shown me, my rules already cover most of the requests he’s seeing (which is a very positive sign for my ruleset); on the other hand, configuring their web host to actually make use of them is proving quite tricky.

In Gentoo, when you install ModSecurity you get both the Apache module, with its basic configuration, and a separate package with the Core Rule Set (CRS). This split is an idea of mine to solve the problem of updating the rules, which are sometimes updated even when the code itself is unchanged — that’s the whole point of keeping the rules independent of the engine. With the split package layout, the updater script that is designed to be used together with ModSecurity is not useful on Gentoo, so it isn’t even installed — even though it is supposedly flexible enough that I could make it usable with my ruleset as well.

In Debian, though, the situation is rather more complex. First of all, no configuration is installed with the libapache-mod-security package, which only ships the file to load the module, and the module itself. At a minimum, for ModSecurity to work you have to configure the SecDataDir directive, and then give it a set of rules to use. The CRS files, including the basic configuration files, are installed by the Debian packages as part of the documentation, in /usr/share/doc/mod-security-common/examples/rules/.
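
To give an idea of how bare that is, this is roughly the minimum an admin has to write by hand before a single rule fires (the directory paths are examples of mine, not anything the Debian package provides):

    # The package only ships the LoadModule stanza; everything else is up
    # to the admin.
    <IfModule mod_security2.c>
        SecRuleEngine On
        SecDataDir /var/cache/modsecurity

        # Copy the CRS out of the documentation directory into a real
        # location, then include it (or your own rules) explicitly:
        Include /etc/modsecurity/*.conf
    </IfModule>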

I’ve now improved the code to provide an initial configuration file that can be used without the CRS… but it seriously makes me wonder how Debian admins deal with ModSecurity at all.

Finally, a consideration: the next version of ModSecurity will have support for looking posted URLs up in the Google Safe Browsing database, which is a very good antispam measure… I have hopes that either the next release or the one after will also bring Project Honey Pot http:BL support, given that the dedicated Apache module was totally messed up and unusable. That would make it a sweet tool to block crawlers and spammers!

Those strange, fake MSIE agents used by… MSN?

On my blog I have been seeing a lot of requests coming from a User-Agent string that definitely didn’t look right; I noticed it, of course, during my usual antispam analysis, in which fake agents are usually the mark of sloppy spammers, easy to kill on sight. In this case, though, the requests neither carried nasty referrers nor posted comments, which was definitely unusual. What did it all mean? Let’s start from the beginning.

The User-Agent string I was seeing is the following:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4325; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30707; MS-RTC LM 8)

It isn’t, at first glance, anything special: simply a very old Internet Explorer version (6.0) on Windows XP; this configuration is already banned from posting comments on my blog, as nobody should really stick with that ancient MSIE version unless they are forced to, and if they are forced to for whatever reason, they had better comment on my blog from a different system (or get a sane browser).

What makes the string smell funny is the presence of a double space after one of the semicolons. No official User-Agent adds more than one space after the semicolon character, least of all the official MSIE strings. Of course it’s a nit, but it’s precisely on such nits that my ModSecurity ruleset is built.
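
The corresponding check is tiny: something like the following, with an arbitrary id. Whether to deny or merely log the match is up to you; as the rest of this post shows, at least one legitimate crawler trips it.

    # \x20\x20 is two literal spaces: no genuine browser doubles the space
    # after a semicolon in its User-Agent string.
    SecRule REQUEST_HEADERS:User-Agent "@rx ;\x20\x20" \
        "id:4910040,phase:1,t:none,pass,log,msg:'User-Agent with doubled space after semicolon'"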

As I said, though, the requests are tremendously strange for a spammer: they appear to come from a browser, or something rigged up as one (they request the missing favicon, the stylesheets, the images and the JavaScript), and at the same time they neither try to post spam comments nor bring suspicious referrers. The next step was to see where the requests come from on the network: a single IP address, whose reverse resolution points to a search.msn.com subdomain.

At first, my thought was that some spammer was feigning to come from the MSN network so that the most basic filtering wouldn’t trigger (my ruleset documentation indeed suggests enabling double resolution of hostnames, so that you get forward-confirmed reverse DNS). But a manual FCrDNS check confirmed that the IP address is indeed one of those assigned to msnbot, and a quick Whois lookup shows that the IP block is assigned to Microsoft.
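
With double resolution enabled in Apache, the same verification can be automated; this is a rough sketch, and the hostname suffix to trust is my own assumption:

    # HostnameLookups Double only fills in the remote hostname when the
    # reverse and forward lookups agree (forward-confirmed reverse DNS).
    HostnameLookups Double

    # Anything claiming to be msnbot that does not resolve back into
    # search.msn.com is lying about its identity.
    SecRule REQUEST_HEADERS:User-Agent "@contains msnbot" \
        "id:4910041,phase:1,t:none,deny,status:403,msg:'Forged msnbot',chain"
        SecRule REMOTE_HOST "!@rx \.search\.msn\.com$"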

What’s going on with these requests? Googling (or Binging — erm) for the user agent string is obviously not going to preserve the extra space, so I couldn’t find any existing explanation of why they decided to go this way. My best guess is that they are trying to see which websites still support their old browser, or that they are trying to render a screenshot of the pages to show in the search results. What I can’t understand is why they would present such a blatantly false string rather than use a real one, or make up a new one for the render bot itself.

At any rate, if you see those false strings you now know that you’re not alone.

The monthly spam analysis

New month, clean slate of Awstats-generated statistics for my blog and website to analyse.

When looking at the statistics generated by Awstats, it’s much easier to spot referrer spammers at the start of the month: the sites that produce more than a few referrals in a matter of hours are usually not real links but spam links. Unfortunately I misjudged at least one – sorry Bruno! – but on the whole the method proves quite useful.

But looking out for these spammers doesn’t just populate the list of bad referrers, it also provides the opportunity to look at the fake browser user-agent strings they use. I’m not sure why they do it, but rather than simply gathering realistic user-agent lists (which would probably be much harder to counter), they seem to Google them up; and since most statistics generators mangle the strings, the copies get mangled too. This is the only explanation I can find for some of them being so blatantly broken that it’s a piece of cake to filter them out with ModSecurity, so that they can’t post comments or spam my referrer lists.

Today I found another of these common cases: some spammers try to pass themselves off as the Opera browser, but in doing so they drop the space between the agent’s short name (Opera/9.62) and the opening parenthesis of the agent’s details. No real browser ever does this, so it can safely be considered one of the telltale marks of spammers. My ruleset, linked above, also contains checks for fake Opera strings reporting themselves as “Mozilla” (which the real Opera never does), for fake strings that convert spaces into the + symbol, and for strings that never close the details’ parenthesis at all.
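
For illustration, the missing-space check boils down to something like this (the id is, as usual, made up):

    # Real Opera sends "Opera/9.62 (...)": version, a space, then the
    # parenthesised details. The spammers' strings skip the space.
    SecRule REQUEST_HEADERS:User-Agent "@rx ^Opera/[0-9.]+\(" \
        "id:4910050,phase:1,t:none,deny,status:403,msg:'Fake Opera UA without space before details'"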

While finding referrer spammers at the beginning of the month is quite easy, spammers also seem to try much harder at this time, probably for the same reason: it’s much easier for a spammer to hit the top ten of the statistics page within a day, and that’s where the real PageRank comes from. Ah, the hard life of spammers and antispam developers.

Speaking of statistics software, I previously noted a shortcoming in Awstats: the lack of rel=nofollow on the referrer links. I sent the patch (already applied in Gentoo) to add that attribute upstream, and although Laurent accepted and merged it, he pointed out that the page carries a global noindex, nofollow directive in the <head> tag, which should already cover the issue of handing PageRank to spammers. While in theory that should settle it, a quick search found me a number of indexed Awstats-generated pages all over the network. Don’t ask me why that is, given they shouldn’t be indexed to begin with. At any rate, spammers seem to count on their presence.

Finally, I’d like to remind you that my ModSecurity Ruleset is available on GitHub for free; if you do use it and are a Flattr user, though, I’d invite you to flattr it — it will also give me a way to roughly track how many people are affected by my changes.