You probably don’t know this, but I regularly analyse my blog’s Apache logs with AWStats. This is especially useful at the start of the month to identify referrer spam and similar issues, which in turn lets me update my ModSecurity ruleset so that more spammers are caught and dealt with.
For years now I’ve been using AWStats for this; it is a Perl log analyzer, report generator and CGI application. It used to work nicely, but nowadays it’s definitely lacking. It doesn’t filter search engine referrers as well as it used to (it’s still important to filter out requests coming from Google, but newer search engines are not recognized), and most of the new “social bookmark” websites are not there at all. Yes, it’s possible to keep adding them, but with upstream not moving, this is getting harder and harder.
Even more important for my ruleset work is the lack of identification of modern browsers. Recognizing things like Android versions and other fringe OSes would be extremely useful for me, but adding support for all of them is a pain, and I have enough on my plate that this is not something I’m looking forward to tackling myself. It’s even more bothersome when you consider that there is no way to re-evaluate data that has already been analyzed when a new URL is identified as a search engine, or a user agent as a bot.
One of the most obvious choices for this kind of work is Google Analytics. Unfortunately, it only works if it’s not blocked on the user’s side, which rules out NoScript users and, of course, most of the spammers. So this is not a job for it: it has to be done on the backend, from the logs.
The obvious next step is to find something capable of importing the data from my current AWStats data files, and of continuing to import data from the Apache log files. Ideally it would store the data in a PostgreSQL database (which is what I usually use); native support for per-vhost data, with the ability to collapse it into a single view, would also be nice.
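To make the wish list a bit more concrete, here is a minimal sketch of the shape such a tool could take; it is not an existing piece of software. It assumes Apache’s standard “combined” log format, uses Python’s built-in `sqlite3` as a stand-in for PostgreSQL so that it runs without a server, and the rule lists and function names are all made up for illustration. The point is that the raw referrer and user agent are stored, and classification only happens when reading, so adding a new search engine or bot marker also applies to hits that were imported earlier.

```python
import re
import sqlite3

# Apache "combined" log format:
# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"
# Simplification: quoted fields containing escaped quotes are not handled.
COMBINED = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical classification rules. They live outside the stored data,
# so editing them changes the result for old rows too.
SEARCH_ENGINES = ("google.", "bing.com", "duckduckgo.com")
BOT_MARKERS = ("bot", "crawler", "spider")


def open_store(path=":memory:"):
    """Open the database (SQLite here, standing in for PostgreSQL)."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS hits ("
        "vhost TEXT, host TEXT, time TEXT, request TEXT, "
        "status INTEGER, referrer TEXT, agent TEXT)"
    )
    return db


def import_lines(db, vhost, lines):
    """Store the raw fields of each parseable line; return how many were stored."""
    rows = []
    for line in lines:
        m = COMBINED.match(line)
        if m is None:
            continue  # skip lines that are not in combined format
        rows.append((vhost, m["host"], m["time"], m["request"],
                     int(m["status"]), m["referrer"], m["agent"]))
    db.executemany("INSERT INTO hits VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    db.commit()
    return len(rows)


def classify(referrer, agent):
    """Label a hit using the current rules; nothing is cached in the database."""
    if any(marker in agent.lower() for marker in BOT_MARKERS):
        return "bot"
    if any(engine in referrer.lower() for engine in SEARCH_ENGINES):
        return "search"
    return "other"


def summary(db, vhost=None):
    """Count hits per class, for one vhost or collapsed across all of them."""
    query = "SELECT referrer, agent FROM hits"
    args = ()
    if vhost is not None:
        query += " WHERE vhost = ?"
        args = (vhost,)
    counts = {}
    for referrer, agent in db.execute(query, args):
        kind = classify(referrer, agent)
        counts[kind] = counts.get(kind, 0) + 1
    return counts
```

Calling `summary(db)` gives the collapsed view across all vhosts, while `summary(db, "some.vhost")` restricts it to one. Importing the existing AWStats data files is not covered here, since those only hold aggregated figures rather than raw requests.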
If somebody knows of a piece of software like this, I’d love to give it a try. Something written in Ruby or Perl would be best for me (because I can hack on those), but I wouldn’t say no to Python or even Java (the latter if somebody helped me make sure the dependencies are all packaged properly). This will bring you better modsec rules, I’m sure!