On Friday I updated my Autotools Mythbuster guide to add support for 2.68 portability notes (all the releases between 2.65 and 2.67 have enough regressions to make them a very bad choice for generic use — for some of those we’ve applied patches that make projects build nonetheless, but those releases should really just disappear from the face of Earth). When I did so, I announced the change on my identi.ca and then looked at the log, for a personal experiment of mine.
In a matter of a couple of hours, I could see a number of bots coming my way; some simply declared themselves outright (such as StatusNet that checked the link to produce the shortened version), while others tried more or less sophisticated ways to pass themselves for something else. On the other hand it is important to note that many times when a bot declares itself to be something like a browser, it’s simply to get served what the browser would see, for browser-specific hacks are still way too common, but that’s a digression I don’t care about here.
This little experiment of mine was actually aimed at refining my ModSecurity ruleset since I had some extra free time; the results of it are actually already available on the GitHub repository in form of updated blacklists and improved rules. But it made me think about a few more complex problems.
Amazon’s “Elastic Computer Cloud” (or EC2) is an interesting idea to make the best use of all the processing power of modern server hardware; this makes the phrase of a colleague of mine last year, sound even more true (“Recently we faced the return of clustered computing under the new brand of cloud computing, we faced the return of time sharing systems under the software as a service paradigm […]”) when you think of them introducing a “t1.micro” size for EBS-backed instance, for non-CPU-hungry tasks, that can be run with minimal CPU, but need more storage space.
But at the same time, the very design of the EC2 system gets troublesome in many ways; earlier this year I encountered troubles with hostnames when calling back between different EC2 instances, which ended up being resolved by using a dynamic hostname, like we were all used to use at the time of dynamic IP connections such as home ADSL (which for me has been basically till a couple of years ago). A very old technique, almost forgotten by many people, but pretty much necessary here.
It’s not the only thing that EC2 brought back from the time of ADSL though; any service based on it will lack a proper FcRDNS verification, which is very important to make sure that a bot request hasn’t been forged (that is until somebody creates a RobotKeys standard similar to DomainKeys standard), leaving it possible to non-legit bots to pass for legit ones, unless you can actually find a way to discern between the two with deep inspection of the requests. At the same time, it makes it very easy to pass for anything at all, since you can just judge by the User-Agent to find out who is making a request, as the IP address are dynamic and variable.
This situation lead to an obvious conclusion in the area of DNSBL (DNS-based black lists): all of the AWS network block is marked down as a spam source and is thus mostly unable to send email (or in the case of my blog, to post comments). Unfortunately this has a huge disadvantages: Amazon’s own internal network faces Internet from the same netblock, which means that Amazon employers can’t post comments on my blog either.
But the problem doesn’t stop there. As it was, my ruleset cached the result of robots analysis based on IP for a week. This covers the situation pretty nicely for most bots that are hosted on a “classic” system, but for those running on Amazon AWS, the situation is quite different: the same IP address can change “owner” in a matter of minutes, leading to false positives as well as using up an enormous amount of cache entries. To work around this problem, instead of hardcoding the expiration date of any given IP-bound test, I use a transaction variable, which defaults to the previous week, but gets changed to an hour in the case of AWS.
Unfortunately, it seems like EC2 is bringing us back in time, in the time of “real-time block lists” that need to list individual IPs rather than whole netblocks. What’s next, am I going to see again construction signs in websites “under construction”?