Passive web log analysis. Replacing AWStats?

You probably don’t know this, but I regularly analyse my blog’s Apache logs with AWStats. This is especially useful at the start of the month to identify referrer spam and other similar issues, which in turn allows me to update my ModSecurity ruleset so that more spammers are caught and dealt with.

To do so I’ve been using AWStats for years at this point: a Perl log analyzer, report generator and CGI application. It used to work nicely, but nowadays it’s definitely lacking. It no longer filters search engine referrers as well as it used to (it’s still important to filter out requests coming from Google, but newer search engines are not recognized), and most of the new “social bookmark” websites are not there at all — yes, it’s possible to keep adding them, but with upstream not moving, this is getting harder and harder.

Even more important for my ruleset work is the lack of identification of modern browsers. Things like Android versions and other fringe OSes would be extremely useful for me, but adding support for all of them is a pain and I have enough things on my plate that tackling it myself is not something I’m looking forward to. It’s even more bothersome when you consider that there is no way to re-process the already analyzed data if a new URL is identified as a search engine, or a user agent as a bot.

One of the most obvious choices for this kind of work is Google Analytics — unfortunately it only works when it’s not blocked on the user side, which rules out NoScript users and of course most of the spammers. So this is not a job for Analytics: it’s something that has to be done on the backend, on the logs side.

The obvious next step is to find something capable of importing the data out of the current AWStats datafiles I have, and of continuing to import data from the Apache log files. Ideally it would store the data in a PostgreSQL database (which is what I usually use); native support for per-vhost data, combined with the ability to collapse it into a single view, would also be nice.

If somebody knows of such a piece of software, I’d love to give it a try — something written in Ruby or Perl would be best for me (because I can hack on those), but I wouldn’t say no to Python or even Java (the latter if somebody helped me make sure the dependencies are all packaged properly). This will bring you better ModSecurity rules, I’m sure!

Tinderbox logging away!

So finally, after spending a full day alone at my sister’s yesterday, I finished at least the part of the tinderbox log analysis code that takes care of gathering the logs and highlighting the bad stuff in them.

The code is all available and you can easily improve upon it if you want; in hindsight I should probably have put everything together in a single git repository with the tinderboxing script, but that can be arranged at a later point, hopefully.

What is interesting, though, is discussing the design, because it’s by no means a simple one, and you could be forgiven for thinking I was out of my mind when I wrote it.

First of all there is the fact that the analysis itself is offloaded to a different system; in this case a different container. The reason is simply the reliability of the tinderbox itself: due to the way it works, even system-level software can easily break during an upgrade, which is one of the reasons why the process is not entirely easy to automate. Because of this I’m not interested in adding analysis logic or complex “nodes” to the tinderbox host. My solution has been relatively simple: I just rely on tar and nc6 — I would have loved to use busybox for the whole of it, but the busybox implementation of netcat does not come with the -q option, which is required to close the connection once standard input is exhausted.

Using tar gives me a very bare protocol that provides pairs of filename and content, which can then be analysed on the other side by the Ruby script in the repository linked at the top of the post; this script uses archive-tar-minitar, in the specially patched version we carry in Gentoo, as otherwise it wouldn’t be able to access “streamed” archives. I’m tempted to fork it and release a version 0.6 with a few more fixes and more modern code, but I just don’t have the time right now; if you are interested, ping me.
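To give an idea of how bare that protocol is, here is a minimal sketch of what the receiving side could look like, assuming the Gentoo-patched archive-tar-minitar that can read non-seekable archives; process_log is just a hypothetical placeholder for the actual analysis, not code from the repository.

```ruby
# Minimal sketch of the receiving side: read a tar archive from a socket
# (or any IO) and hand each file to a hypothetical process_log helper.
# Assumes the Gentoo-patched archive-tar-minitar, which can read
# non-seekable ("streamed") archives.
require 'archive/tar/minitar'

def process_log(name, content)
  # placeholder for the real analysis (regexp matching, HTML output, ...)
  puts "#{name}: #{content.lines.count} lines"
end

def handle_tar_stream(io)
  Archive::Tar::Minitar::Reader.open(io) do |reader|
    reader.each_entry do |entry|
      next unless entry.file?
      process_log(entry.full_name, entry.read)
    end
  end
end
```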

One important thing to note here is that the script uses Amazon’s AWS, in particular S3 and SimpleDB. This might not be an obvious choice, as the system has enough space to store the log files for a very long time. So why did I do it that way? While storage abounds, Excelsior resides on the same network as my employer’s production servers (well, on the same pipe, not on the same trusted network of course!), so to avoid swamping it, I don’t want to give anybody access to the logs on the system itself. Using S3 should be cheap enough that I can keep them around for a very long time!

Originally I planned on having the script called as a one-off by xinetd, spawning multiple processes to avoid threading (which is not Ruby’s forte), but the time taken to initialise AWS on every connection wasn’t worth it, so I wrote it as it is now. Yes, there is one bad thing: the script expects Ruby 1.9 and won’t work with Ruby 1.8. Why? Mainly because this way it was easier to write, but then again, since I’m going to need concurrent processing at some point, which means making the script multithreaded, Ruby 1.9 is a good choice. After all, I can decide what I run, no?
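The outer loop of such a standalone receiver could then look something like this sketch (again, not the actual repository code; the port number is made up, and the thread-per-connection part is exactly the kind of concurrency where Ruby 1.9 fares better than 1.8):

```ruby
# Rough sketch of the standalone listener: accept connections forever and
# feed each socket to the tar reader shown earlier. One Thread per
# connection; the AWS setup happens once at startup rather than per call.
require 'socket'

server = TCPServer.new('::', 4000)   # hypothetical port, IPv6 any-address
loop do
  Thread.new(server.accept) do |sock|
    begin
      handle_tar_stream(sock)        # the hypothetical reader sketched above
    ensure
      sock.close
    end
  end
end
```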

After a log is received, the file is split line by line and a regexp is applied to each line – an extra thanks to blackace and Joachim for helping with a human OCR over a very old, low-res screenshot of my emacs window with the tinderbox logs grep command – and any matching lines are marked in red. This obviously creates very big HTML files, but they should be fine; if they start to pile up, I’ll see about compressing them before storing them on Amazon.
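As a rough illustration of that step (the pattern below is a stand-in, not the actual expression recovered from the screenshot):

```ruby
require 'cgi'

# Hypothetical pattern standing in for the real tinderbox grep expression.
ERROR_PATTERN = /error:|undefined reference|QA Notice/i

# Turn a raw log into HTML, wrapping matching lines in a red span and
# counting how many lines matched.
def highlight(log_text)
  matches = 0
  body = log_text.each_line.map do |line|
    escaped = CGI.escapeHTML(line.chomp)
    if line =~ ERROR_PATTERN
      matches += 1
      "<span style=\"color: red\">#{escaped}</span>\n"
    else
      "#{escaped}\n"
    end
  end.join
  [matches, "<pre>#{body}</pre>"]
end
```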

The use of SimpleDB is simply because I didn’t want to set up two different connections: since all AWS services use the same credentials, I only need one login to reach both the storage and the database. While SimpleDB’s “eventual consistency” makes it more of a toy than a reliable database, I don’t care much; the main issue is with concurrent requests, and which one is served first makes no difference to me, as I only have to refresh my window to fetch a new batch of logs to scan.

In the database I’m storing very few attributes for each file: the link to the file itself, the package it belongs to, the date of the log, and how many matches there have been. My intention is to extend this to show some legend of what happened: did it fail testing? did the ebuild fail? are they simply warnings? For now I went with the simplest option, though.
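Putting the S3 upload and the SimpleDB attributes together, storing one log could look more or less like this sketch; it assumes the aws-sdk gem’s interfaces, and the bucket, domain and attribute names are made up rather than the ones the script actually uses.

```ruby
require 'aws-sdk'   # the v1 aws-sdk gem covers both S3 and SimpleDB
require 'time'

AWS.config(access_key_id: ENV['AWS_KEY'], secret_access_key: ENV['AWS_SECRET'])

# Hypothetical helper: upload the highlighted HTML and record its metadata.
def store_log(hostname, package, html, matches)
  key = "#{hostname}/#{package}-#{Time.now.to_i}.html"   # made-up layout

  obj = AWS::S3.new.buckets['tinderbox-logs'].objects[key]
  obj.write(html, content_type: 'text/html')

  AWS::SimpleDB.new.domains['tinderbox'].items[key].attributes.set(
    'url'     => obj.public_url.to_s,
    'package' => package,
    'date'    => Time.now.utc.iso8601,
    'matches' => matches.to_s)
end
```

Note that SimpleDB only stores strings, which is why even the match count goes in as text; for a "toy" index like this, that is good enough.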

To see what’s going on, though, I wrote a completely different script. Sinatra-based, it only provides a single entrypoint on localhost, and gives you the last hundred entries in SimpleDB which have matches. I’m going to try making this more extensible in the future as well.
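A minimal sketch of what that single Sinatra entrypoint could look like follows; the select expression and attribute names are assumptions matching the description above, not a copy of the real script.

```ruby
require 'sinatra'
require 'aws-sdk'

set :bind, '127.0.0.1'   # only listen on localhost

get '/' do
  domain = AWS::SimpleDB.new.domains['tinderbox']   # made-up domain name
  rows = domain.items.where('matches != ?', '0').limit(100).map do |item|
    data = item.data.attributes                     # one GetAttributes per item
    "<li><a href=\"#{data['url'].first}\">#{item.name}</a> " \
      "(#{data['matches'].first} matches)</li>"
  end
  "<ul>#{rows.join}</ul>"
end
```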

One thing I skipped over in all this: to make it easier to apply this to different systems, I’m organising the logs by hostname, simply by checking where the connection is coming from (over IPv6, I have access to the reverse DNS for them). This is why I want to introduce threaded processing: quite soon Excelsior will run another tinderbox (I’m not yet sure whether to use AMD64 Hardened or x86 Hardened — x32 is also waiting, to work as a third arch), which means the logs will be merged and “somebody” will have to sift through three copies of them. At least with this system it’s feasible.
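The hostname lookup itself is the easy part; something along these lines, using Ruby’s standard library resolver, would do (sketch only):

```ruby
require 'resolv'

# Derive the per-host directory prefix from the connection's reverse DNS;
# fall back to the bare address if the PTR record is missing.
def host_prefix(socket)
  addr = socket.peeraddr(false)[3]        # numeric remote address (IPv6 here)
  Resolv.getname(addr).split('.').first
rescue Resolv::ResolvError
  addr
end
```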

Anyway, I guess I’ll sign off now and go watch something on Sky, which I’ll probably miss for a couple of weeks when I come back to the US, just enough time for me to find a place and get some decent Internet.

I need help publishing tinderbox logs

I’ve had a bit more time lately, since I decided not to keep one of the gigs I’ve been working on over the past year or so… this does not mean I’m around more than usual for now, though, mostly because I’m so tired that I need to recharge before I can even start answering the email I’ve received in the past ten days or so.

While tinderboxing has been troublesome lately – the dev-lang/tendra package build causes Yamato to run out of memory, hosing the whole system; it actually looks like a fork bomb – I’ve also had some ideas on how to make it easier for me to report bugs, and in general to get other people to help out with analysing the results.

My original hope for the log analysis was to make it possible to push out the raw log, find the issues, and report them one by one. I see now that this is pretty difficult, verging on infeasible, so I’d rather try a different approach. The final result I’d like to have now is a web-based equivalent of my current emacs+grep combination: a list of log names, highlighting those that hit any trouble at all, and then, within those logs, highlights on the lines showing trouble.

To get to this point, I’d like to start small, in a series of little steps… and since I honestly don’t have the time to work on it, I’m asking for the help of anybody who’s interested in helping Gentoo. The first step would be to find a way to process a build log file and translate it into HTML. While I know this increases its size tremendously, it is the simplest way to read the log and to layer data on top of it. What I’d be looking for is a line-numbered page akin to the various pastebins you can find around. This means having one span per line to make sure they stay aligned, which will be important in the following steps.

The main issue is that some build logs include escape sequences, even though I disable Portage’s colours as well as the build systems’, which means whatever converts the logs should also take care of stripping those sequences away. There are also logs that include output such as wget’s or curl’s, which use the carriage-return code to overwrite the output line; this creates a mess when viewing the log outside a terminal — I’m not sure why the heck they don’t check whether they’re writing to a tty. And there are some (usually Java) packages whose log appears to grep as a binary file, which the conversion code would also have to deal with.
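Putting the last two paragraphs together, a first stab at such a converter could look like the following sketch; the regular expression for the escape sequences is a rough approximation, not an exhaustive one.

```ruby
require 'cgi'

# Convert a raw build log into line-numbered, one-span-per-line HTML,
# stripping ANSI escape sequences and collapsing carriage-return
# "progress bar" output down to the final state of each line.
def log_to_html(raw)
  raw = raw.force_encoding('BINARY')          # some logs look binary to grep
  raw = raw.gsub(/\e\[[0-9;]*[A-Za-z]/, '')   # rough ANSI colour/cursor strip
  lines = raw.split("\n").map do |line|
    line.split("\r").last || ''               # keep only what CR left visible
  end

  spans = lines.each_with_index.map do |line, i|
    "<span id=\"l#{i + 1}\">#{CGI.escapeHTML(line)}</span>\n"
  end

  "<pre>#{spans.join}</pre>"
end
```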

As a preview of what’s to come in the next few steps, I’ll need a way to match error messages in each line and highlight them. Once they are highlighted, using XPath expressions to count the number of matched lines should make it much easier to produce an index of relevant logs… then it’s a matter of where to publish them. I think it might be possible to just upload everything to Amazon’s S3 and use that, but I might be optimistic for no good reason, so it has to be discussed and designed.
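For the counting step, something like Nokogiri makes it a one-liner, assuming the converter marks troublesome lines with a dedicated class (the class="error" here is hypothetical):

```ruby
require 'nokogiri'

# Count how many highlighted lines a converted log contains; a log with
# zero matches does not need to be surfaced in the index at all.
def error_count(html)
  doc = Nokogiri::HTML(html)
  doc.xpath('count(//span[@class="error"])').to_i
end
```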

Are you up for helping me with this task? If so, join in the comments!

Tinderbox and logs, some thoughts

While looking at today’s logs I started to wonder whether the current approach is actually sustainable. The main problem is that most of the tinderbox was just created piece by piece as needed to complete a specific task, and it really has never undergone a complete overhaul.

Right before moving to containers it consisted only of a chroot, which caused quite a few issues in the long run, though none extreme (still, some important ones remain, such as the libvirt testsuite seemingly triggering something bad in my kernel, which in turn blocks evolution and gwibber).

Besides the change from a simple chroot to a container (which is reminiscent of BSD jails, by the way), the rest of the tinderbox has been kept almost identical: there is a script, written by Zac, that produces a list of packages following these few rules:

  • the version is the highest available for each slot of a package (it lists all slots for all packages);
  • neither the package nor its dependencies are masked (this is important because I do mask stuff locally when I know it fails, for instance libvirt above);
  • the ebuild in that particular version is not installed, or hasn’t been re-installed in the past ten weeks (originally six weeks, now ten because tests need to be executed as well).

This is the core of the tinderbox; besides that there is a mostly-simple bashrc file for Portage that takes care of a few more tests, which Portage itself ignores for now:

  • check for calls to AC_CANONICAL_TARGET; this is a personal beef of mine against over-canonicalisation in autotools;
  • check for bundled common libraries (zlib, libpng, jpeg, ffmpeg, …); this makes it easier to identify possibly bundled libraries, although it has quite a high rate of false positives; having an actual database of the code present in the various packages would be easier;
  • check for the use of insecure functions; this is just a little extra check for functions like tmpnam() and similar, which ld already warns about;
  • single-pass find(1) checks for OS X fork files (unfortunately common in Ruby packages!), for setuid and setgid binaries (to keep a list of them), and for invalid directories (like /usr/X11R6) in the packages;
  • an extra QA trick to identify packages calling make rather than emake (which turns out to be quite useful to spot packages that use make instead of emake -j1 to hide bugs).

Now, all this produces a few files in the temporary directory, which are then printed so that the actual build log retains them. Unfortunately this doesn’t work tremendously well, since it requires quite a bit of work to extract them again properly.

So I was thinking: what if, instead of just running emerge, I ran a wrapper around emerge? This would help me gather important information, at the cost of spending more time on each merge. The wrapper could then write up a “report card” after each emerge, containing the following information in a suitable format (XML comes to mind; a rough sketch follows the list below):

  • status of the emerge (completed or not);
  • build log of the emerge (the filename; the log itself gets copied along with the report card);
  • emerge info for the current merge (likewise) — right now the emerge info I provide is the generic information of the tinderbox, which might differ from the specific settings used by the merge;
  • versions of the whole dependency tree — likewise, something that might change between the time the package fails and the time I file the bug;
  • CVS revision of the current ebuild — very important to debug possibly fixed and duplicate bugs; this is one of the crucial points to consider before moving to git, where such a revision is not available, as far as I know;
  • if the package failed within functions that we know leave a detailed log, those files should be copied together with the report card and listed — this works for epatch, the functions from autotools.eclass, and econf;
  • all the important information returned by the checks in the bashrc and in Portage proper (pre-stripped files, setXid files, use of make and AC_CANONICAL_TARGET, …).
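To make the idea a bit more concrete, here is what writing such a report card could look like with REXML from the standard library; none of the element names are final, this is only meant to show the kind of data to capture.

```ruby
require 'rexml/document'

# Hypothetical report card writer: the element names and the shape of the
# result hash are made up, only meant to illustrate the data to capture.
def write_report_card(path, result)
  doc = REXML::Document.new
  doc << REXML::XMLDecl.new

  card = doc.add_element('report-card', 'package' => result[:cpv])
  card.add_element('status').text       = result[:success] ? 'ok' : 'failed'
  card.add_element('build-log').text    = result[:log_file]
  card.add_element('cvs-revision').text = result[:cvs_revision]

  deps = card.add_element('dependencies')
  result[:deps].each do |dep|
    deps.add_element('atom').text = dep
  end

  File.open(path, 'w') { |f| doc.write(f, 2) }
end
```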

Once this report card is generated, the information in the workdir is pretty unimportant and can mostly be cleaned up; this would allow me to save space on disk, which isn’t exactly free.

More importantly, with this report it’s also possible to check for existing bugs and report new ones almost automatically, which would reduce the time needed to handle the tinderbox’s results, and thus the time I need to invest on the reporting side of the job rather than on the fixing side (which thankfully is currently handled very well by Samuli and Victor).

More on this idea in the next few days, hopefully, with some proof of concept as well.