Passive web log analysis. Replacing AWStats?

You probably don’t know this, but for my blog I analyse the Apache logs with AWStats over and over again. This is especially useful at the start of the month to identify referrer spam and other similar issues, which in turn allows me to update my ModSecurity ruleset so that more spammers are caught and dealt with.

To do so I’ve been using, for years at this point, AWStats, which is a Perl log analyzer, report generator and CGI application. It used to work nicely, but nowadays it’s definitely lacking. It doesn’t identify search engine referrers as well as it used to (it’s still important to filter out requests coming from Google, but newer search engines are not recognized), and most of the new “social bookmark” websites are not there at all — yes, it’s possible to keep adding them, but with upstream not moving, this is getting harder and harder.

Even more important, for my ruleset work, is the lack of identification of modern browsers. Things like Android versions and other fringe OSes would be extremely useful for me, but adding support for all of them is a pain, and I have enough things on my plate that this is not something I’m looking forward to tackling myself. It’s even more bothersome when you consider that there is no way to re-process the already-analyzed data if a new URL is identified as a search engine, or a user agent as a bot.

One of the most obvious choices for this kind of work is to use Google Analytics — unfortunately, it only works if it’s not blocked on the user’s side, which rules out NoScript users and of course most of the spammers. So this is not a job it can do. It’s something that has to be done on the backend, on the logs side.

The obvious next step is to find something capable of importing the data out of the current AWStats datafiles I have, and of continuing to import data from the Apache log files. Ideally it would save the data in a PostgreSQL database (which is what I usually use); native support for per-vhost data, with the ability to collapse it into a single view, would also be nice.

If somebody knows of such a piece of software, I’d love to give it a try — hopefully something written in Ruby or Perl might be the best for me (because I can hack on those), but I wouldn’t say no to Python or even Java (the latter if somebody helped me make sure the dependencies are all packaged properly). This will bring you better modsec rules, I’m sure!

New AWStats ebuild notes

There has been quite a bit of noise on the stable request bug for the newest AWStats version. I had actually wanted to write this down before it went stable, but the security issues there forced my hand a little bit.

It is obvious, when you look at the installed layout of the new awstats, that this version installs quite differently from the previous one: for one, it doesn’t use webapp-config any longer, but it also changes the paths used to install the tools and other scripts. There are a few reasons for this, so let me go through them a bit.

The webapp-config system was designed to help manage web application deployment in Gentoo by allowing the installation of multiple versions on different vhosts. This was a generally good approach for software designed around the concepts of old-style web applications, written in Perl/CGI or PHP, which are self-contained in a single installation directory, but it fails badly when you deal with modern application designs such as Rails, TurboGears, and so on. For this reason you can see it starting to “wither”, as modern applications are just not usable with such a system.

In the case of AWStats, while it is a web application written in Perl/CGI, it is far from being a self-contained one: it provides a number of tools that are used at the system level, and has a system-wide configuration directory in /etc/awstats. Also, it strictly doesn’t need to be used as a CGI, as you can easily run it daily and then serve static web pages with the results. With all these considerations, the requirement to use webapp-config is pretty much pointless; as an added detail, none of the AWStats ebuilds ever allowed slotted, side-by-side installation of multiple versions, which is the primary use of webapp-config.

Since you lose the previously-installed version when upgrading anyway, webapp-config becomes a burden rather than a help in management. I have been running AWStats without that system for quite a long while, but upgrades were still bothering me, as the path to the CGI scripts changed with each version, being tied to the ebuild’s version and revision. Thus, the new system.

First of all, the tools are now installed in the standard paths, so you don’t need to use the full path to them. Then the whole application is installed under a single path: /usr/share/awstats/wwwroot (the path is in line with the upstream documentation); you can then use your webserver configuration to either alias those paths to the exposed ones, or simply symlink them into your webserver’s document root. As a reference, this is the kind of configuration I use on my statistics vhost (which is, as I have written before, secured and password-protected to avoid giving pagerank to spammers):

  <Directory /usr/share/awstats/wwwroot>
    Allow from all
  </Directory>

  <Directory /usr/share/awstats/wwwroot/cgi-bin>
    SetHandler cgi-script
    Options +ExecCGI
  </Directory>
Yes, that’s it.

Referrer spam and awstats

The presence of awstats in the title here could be misunderstood for a moment; as I’ll show in the rest of the post, I am putting it there as an experiment. Bear with me.

New year and new month, and as with each new month, the statistics I gather from my awstats interface (which is protected by password and SSL to be on the safe side) show me a huge amount of referrer spam. If you don’t know what referrer spam is, then you probably never gathered raw statistics from access logs, so I’ll try to explain.

Spammers wish to increase the pagerank of their websites, so that there are more chances that random Google searches will bring in visitors; to do so one common method is to leave links to their websites in spam comments on blogs, that Google will index and traverse. Luckily for us, spam comments on blogs are relatively easy to stop; most of the times I don’t even see them, thanks to the ModSecurity antispam system I wrote for my blog (more on that in a moment).

But at the same time, a number of websites leave their reported statistics open for everybody to see, including Google! A quick search for the right terms shows about 126 thousand results, and that is just by looking for AWStats, which is but one of the many applications that do so. This means the spammers have a huge chance of actually getting their pagerank boosted. It gets even better for them given that even the latest version of AWStats does not use rel=nofollow on the list of referrers!

Referrer spam also requires a lot less work than comment spam does: you don’t need to send POST requests with the right fields, you just need to request the pages, sometimes even ones that don’t exist, with the URL you want to spam in the Referer (not my typo!) header. Now, there are a few patterns to be found even among these referrer spammers, most of whom seem to optimise their requests’ bandwidth by targeting URLs that suggest the presence of the software they are targeting (such as AWStats). Indeed, the post of mine that has since gotten the most of these hits is the one with AWStats in the title, which I’ve been using lately to detect these patterns.

In particular, I have noticed that a number of these spammers seem to be even more interested than most legit clients in reducing the traffic they generate, to the point that they use the HEAD method rather than the GET method. This allowed me to create a simple rule to ban almost all of them: I simply deny HEAD requests that claim to come from a real browser’s user agent — as far as I can tell, no browser ever sends HEAD requests; if they want to validate their cache, they use the If-None-Match or If-Modified-Since headers instead.
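As a rough sketch of what such a check looks like in ModSecurity terms (this is not the actual ruleset; the rule id, message, and the browser pattern are made up for illustration):

```apache
# Hypothetical sketch: deny HEAD requests whose User-Agent claims to
# be a real browser. Browsers validate caches with conditional GETs
# (If-None-Match / If-Modified-Since), not with HEAD requests.
SecRule REQUEST_METHOD "@streq HEAD" \
    "id:990001,phase:1,log,deny,status:403,msg:'HEAD request from alleged browser',chain"
    SecRule REQUEST_HEADERS:User-Agent "@rx (?:Mozilla|Opera)"
```

The `chain` action means the deny only triggers when both conditions match, so legitimate non-browser tools that use HEAD are left alone.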

While trying to reduce the impact of these spammers on my own websites, I’ve been improving and updating the rules in my ModSecurity Ruleset to detect both the patterns and the known spammers; generally speaking, the added logic shouldn’t be excessively taxing. By the way, if you are using my rules and like them, please do flattr them! At least I’ll know somebody makes use of them now that I’ve published them.

Interestingly enough, you might remember a previous post of mine, where I noted that letting users always register is bad — more and more it seems like a number of these spammers try to work around the easy checks done through URL blacklists (similar to the DNSBL used to block IPs) by using third parties that might be legit, but ignore the nofollow rule, which means that their pagerank is, once again, ensured.

I guess I really should start learning Lua so that I can write more complex but thorough checks for my rules; since it seems like ModSecurity now has hashing capabilities, it wouldn’t be too bad if I could simply check each domain in the referrers once every two hours against a live blacklist, and then skip over it, to avoid repeating the same test over and over again.
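The caching part of that idea could look something like this Lua sketch (the names and the two-hour constant are mine; in ModSecurity this would be wired up through a SecRuleScript rule, and the actual blacklist query is left out):

```lua
-- Hypothetical sketch: remember when each referrer domain was last
-- checked, so the live-blacklist lookup runs at most once every two
-- hours per domain.
local CACHE_TTL = 2 * 60 * 60   -- two hours, in seconds

local last_checked = {}         -- domain -> timestamp of last lookup

-- Returns true when `domain` is due for a fresh blacklist lookup.
function needs_check(domain, now)
  local last = last_checked[domain]
  if last ~= nil and (now - last) < CACHE_TTL then
    return false                -- checked recently, skip the lookup
  end
  last_checked[domain] = now
  return true
end
```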

Making awstats run as an unprivileged user

I admit I never was a great sysadmin. My most important sysadmin work was almost a decade ago, for a site that has been dead for a long time now too, on a Windows 2000 Server system on the other side of Italy (in Genova), which was running the site itself (quite simple), a forum (and I tried quite a few packages at the time) and the Sphere emulator for Ultima Online. If you were an Italian player at the time you might have heard of Dragons’ Land, or *Heaven*…

Anyway, since last November I’ve started sysadmining a vserver to keep this blog running, and the xine bugtracker too (and by the way, thanks again IOS for the hosting). There were a few things that I left as TODO before, and I’m now doing them as I find time for them.

One of these things is to let awstats run as an unprivileged user, instead of as root as it was doing before. I’m writing down what I did here, so that I’ll remember if I ever have to do this again.

The first step is of course to create an awstats user and give it full access to its home directory:

# useradd -d /var/lib/awstats -s /sbin/nologin awstats
# chown -R awstats /var/lib/awstats

As the configuration files need to be read by that user too, let’s make them accessible read-only to it, leaving write access to root:

# chown -R awstats:root /etc/awstats
# chmod 570 /etc/awstats
# chmod 460 /etc/awstats/*

Now, you need to let this user access the webserver logs. In my case I’m using lighttpd, so the logs are in /var/log/lighttpd, which is owned by the webserver user. To restrict the awstats user’s access, I need to use ACLs:

# setfacl -m u:awstats:r /var/log/lighttpd/*
# setfacl -m u:awstats:rx /var/log/lighttpd
# setfacl -d -m u:awstats:rx /var/log/lighttpd

Now it’s time to change the script that runs awstats, in my case over multiple virtual hosts. I’m not posting the whole script as it’s quite fugly; the general rule is:

# su -s /bin/sh -c 'perl $path_to_awstats/hostroot/cgi-bin/awstats.pl -config=$your_config -update' awstats
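For what it’s worth, the general shape of such a wrapper script could be something like the following sketch, where the install path and the vhost config names are placeholders of mine, not my actual ones:

```shell
#!/bin/sh
# Hypothetical sketch of the per-vhost update wrapper; adjust
# AWSTATS_ROOT and the config list to your own setup.
AWSTATS_ROOT=${AWSTATS_ROOT:-/var/www/awstats/hostroot}

# Build the command that updates one vhost's statistics as the
# unprivileged awstats user.
update_cmd() {
    printf "su -s /bin/sh -c 'perl %s/cgi-bin/awstats.pl -config=%s -update' awstats\n" \
        "$AWSTATS_ROOT" "$1"
}

# Print (or pipe to sh, to actually run) the update for each vhost.
for config in blog.example.com bugs.example.com; do
    update_cmd "$config"
done
```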

Now of course your awstats CGI has to access both the configuration and the datafiles. ACLs come in useful again:

# setfacl -m u:lighttpd:r /etc/awstats/* /var/lib/awstats/*
# setfacl -m u:lighttpd:rx /etc/awstats /var/lib/awstats
# setfacl -d -m u:lighttpd:rx /etc/awstats /var/lib/awstats

Now of course you can guess that you cannot ask the CGI to parse the logs to regenerate the data, because it doesn’t have permission to write to the datafiles, but that’s exactly what I want 🙂

Now you have awstats running with the lowest privilege possible, but still able to access what it has to; hopefully this should be a nice mitigation strategy.

[Now of course if someone knows I made a mistake, I’d very much like to hear about it :)]