
More crawler hatred

Since announcing it I’ve not stopped working on my ModSecurity RuleSet; even with a bit of clumsiness in handling changes, I’ve started using the repository as a way to document how new crawlers are discovered.

This let me reflect on a number of things, besides the concerns about EC2 that I have already posted about. For instance, there is still the problem of discovering new bots, and of identifying more complex patterns beyond the User-Agent and the IP address themselves. Analysing each request one by one is definitely out of the question: it would require a huge amount of time, especially with the kind of traffic my website gets. I guess the most useful thing to do here would be to apply the kind of testing that Project Honey Pot has been doing for the past years.

I did use Project Honey Pot myself in the past, but a couple of things caused trouble with it before, and at the end of the day ModSecurity already filters enough of those bots that it wouldn’t make much sense to submit my data. On the other hand, I guess it might provide more “pointed” data: only the most obnoxious of the crawlers would get past my first line of defence in the first place. At any rate, I’m now considering setting up Project Honey Pot on my website again and seeing how it works out; maybe I’ll just add an auditlog call to the ModSecurity rules when a request hits the honeypot script, and analyse those requests to find more common patterns that can help me solve the problem.
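As a rough sketch, a rule along these lines should do the trick; the honeypot path and the choice of phase are placeholders rather than what the ruleset actually ships:

    # Hedged sketch: "/honeypot.php" stands in for the real honeypot
    # script path. The nolog,auditlog combination keeps the error log
    # quiet while still writing a full audit-log entry for analysis.
    SecRule REQUEST_FILENAME "@streq /honeypot.php" \
        "phase:2,pass,nolog,auditlog"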

Speaking of Project Honey Pot, I’ll be masking the http:BL access filter for Apache for removal, because it turned out to be totally useless as it is; it’s probably outdated, or worse: it hasn’t been touched for the past three years, which doesn’t look good. For those who don’t know it, http:BL is the service Project Honey Pot provides to its active users; it uses the collected data to export, via special DNS requests, information about a given IP address, for instance whether it is a known search engine crawler, a non-crawler, or a spam crawler. The data itself is pretty valuable, and I have seen very few false positives before, so it should be quite helpful; unfortunately, the API to access it is not the standard DNSBL one, so you cannot simply use the code already provided in ModSecurity to check it.
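For reference, the lookup itself is simple enough to sketch; the Lua snippet below assumes the luasocket library is available, uses a placeholder access key, and follows the query and response encoding described in the http:BL documentation:

    -- Sketch of an http:BL lookup; "my-access-key" is a placeholder.
    local dns = require("socket").dns

    local ACCESS_KEY = "my-access-key"

    -- The query name is the access key followed by the IP address with
    -- its octets reversed, under the dnsbl.httpbl.org zone.
    local function httpbl_query(ip)
       local a, b, c, d = ip:match("^(%d+)%.(%d+)%.(%d+)%.(%d+)$")
       if not a then return nil end
       return dns.toip(string.format("%s.%s.%s.%s.%s.dnsbl.httpbl.org",
                                     ACCESS_KEY, d, c, b, a))
    end

    -- A record is only returned for listed IPs, and encodes
    -- 127.<days since last activity>.<threat score>.<visitor type>;
    -- the type is a bitmask (1 suspicious, 2 harvester, 4 comment
    -- spammer), with 0 reserved for known search engines.
    local function httpbl_decode(result)
       if not result then return nil end
       local _, days, score, kind = result:match("^(%d+)%.(%d+)%.(%d+)%.(%d+)$")
       return tonumber(days), tonumber(score), tonumber(kind)
    end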

What I’m thinking of right now is making good use of ModSecurity’s Lua extension capability. Unfortunately, I don’t have much experience with Lua as it is, and I don’t have much time to dedicate to this task at the moment (if you want, I can be hired to dedicate next week to it). The same Lua capabilities are likely to come into play if I decide to implement an open proxy check, or adapt one of the IRC servers’ checks, to work as antispam.
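Just to give an idea of what I mean: ModSecurity loads a Lua file through the SecRuleScript directive and calls its main() function, treating a returned string as a match and nil as no match. Assuming the helper functions from the http:BL sketch above live in the same file, and picking an entirely arbitrary threat-score threshold, it could look like this:

    -- Hypothetical httpbl.lua, loaded with something like:
    --   SecRuleScript /etc/apache2/modsec/httpbl.lua "phase:1,deny"
    function main()
       local ip = m.getvar("REMOTE_ADDR")
       local days, score, kind = httpbl_decode(httpbl_query(ip))
       -- Type 0 is a known search engine; anything else with a high
       -- enough threat score is flagged (the threshold is arbitrary).
       if kind and kind > 0 and score and score >= 25 then
          return "http:BL-listed client: type=" .. kind .. ", score=" .. score
       end
       return nil
    end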

For the moment, I’ve updated the repository with a few more rules, and I’ll soon be pushing an optional/ directory containing my antispam configuration, as well as a few extra rules that only apply in particular situations and should only be included within the vhosts or locations where they are meant to apply; see the sketch below.
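The idea is that, instead of loading those globally, you include them only where they matter; something along these lines, with paths that are obviously just placeholders:

    # Hypothetical example: apply the antispam rules to the blog only,
    # rather than in the server-wide configuration.
    <Location /blog>
        Include /etc/apache2/rules.d/optional/antispam.conf
    </Location>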

In the meantime, if you like the ruleset and you use it, there is a Flattr widget on the ruleset’s website.
