
GSoC Proposal: a better log collector and analyzer

In my previous post I didn’t say much about Gentoo’s ideas for GSoC, and I didn’t really volunteer for anything. Well, this changes now, as I have a suggestion and I’m even ready to propose myself as a mentor — with the caveat that, as every other year (literally), my time schedule is not clear, so I will need a backup co-mentor.

You might or might not remember, but for my tinderbox I’m using a funky log collection and analysis toolset. Said toolset, although completely lacking chewing gum, is not exactly an example of good engineering: it’s really just a bunch of hacks that happen to work. On the other hand, since Gentoo tinderboxing has never been part of my job, I haven’t had the motivation to rewrite it into something decent.

While the (often proposed) “continuous integration solution” for Gentoo is a task that is in my opinion unsuitable for a GSoC student – which should be attested by the fact that we don’t have one yet! – I think that at least the collection-and-analysis part should be feasible. I’ll try to list and explain here the current design and how I’d like it to work, so if somebody feels like working on this, they can already look into what there is to do.

Yes, I know that Zorry and others involved in Gentoo Hardened have been working on something along these lines for a while — I still haven’t seen results, so I’m going to ignore it altogether.

Right now we have four actors (in the sense of computers/systems) involved: the tinderbox itself, a collector, a storage system, and finally a frontend.

Between the tinderbox and the collector, the only protocol is tar-over-TCP, thanks to, well, tar and netcat on one side, and Ruby on the other. My tinderbox script sends every completed log (encapsulated in tar) to the collector, which then extracts and parses it.
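To make the protocol concrete, here is a minimal sketch of what the receiving side amounts to — in Python rather than the Ruby the collector is actually written in, with a made-up port number and the parsing elided:

```python
import socket
import tarfile

# Hypothetical port; the post doesn't say which one the collector listens on.
PORT = 5555

def serve():
    # IPv6 listening socket (IPv6 support matters, see the final note below);
    # on most systems this also accepts IPv4-mapped connections.
    with socket.socket(socket.AF_INET6, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("::", PORT))
        srv.listen(1)
        while True:
            conn, peer = srv.accept()
            with conn, conn.makefile("rb") as stream:
                # The tinderbox runs `tar c ... | nc collector PORT`, so the
                # whole connection payload is a single tar archive, read here
                # as a non-seekable stream ("r|").
                with tarfile.open(fileobj=stream, mode="r|") as archive:
                    for member in archive:
                        if not member.isfile():
                            continue
                        log = archive.extractfile(member).read()
                        print(f"got {member.name} ({len(log)} bytes) from {peer[0]}")

if __name__ == "__main__":
    serve()
```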

The collector does most of the heavy lifting right now: it gets the package name and the timestamp of the log from the name of the file in the tar archive, then the maintainers (to assign the bug to) from the log itself. It scans for particular lines (Portage and custom warnings, among others), and creates a record with all of that together, to be sent to Amazon’s SimpleDB for querying. The log file itself is converted to HTML, split line by line, so that it can be viewed in a browser without downloading, and saved, once again, to Amazon’s S3.
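The scanning step boils down to pattern matching over each line. The sketch below shows the shape of it; the filename scheme, the maintainer marker, and the warning patterns are all assumptions for illustration, not the collector’s actual heuristics:

```python
import os
import re

# Assumed patterns, not the real collector's heuristics.
MAINTAINER_RE = re.compile(r"maintainers?:\s*(\S+@\S+)", re.IGNORECASE)
WARNING_RES = [
    re.compile(r"^ \* QA Notice:"),                # Portage QA warnings
    re.compile(r"warning: implicit declaration"),  # one of the custom checks
]

def parse_log(path):
    # Assume names of the form category/package-version:timestamp.log,
    # as carried in the tar member name (an assumed scheme).
    stem, _ = os.path.splitext(path)
    package, _, timestamp = stem.partition(":")

    maintainers, warnings = [], []
    with open(path, errors="replace") as log:
        for lineno, line in enumerate(log, start=1):
            match = MAINTAINER_RE.search(line)
            if match:
                maintainers.append(match.group(1))
            if any(rx.search(line) for rx in WARNING_RES):
                warnings.append((lineno, line.rstrip()))

    return {"package": package, "timestamp": timestamp,
            "maintainers": maintainers, "warnings": warnings}
```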

The frontend fetches from SimpleDB the records where at least one warning is to be found, and displays them in a table, with a link to see the log and one to open a bug (thanks to pre-filled templates). As it is, it’s implemented in Sinatra, and it’s definitely simplistic.
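The pre-filled templates are just links to Bugzilla’s enter_bug.cgi with fields filled in from the record; something along these lines, where the field values are made up and a real template would pick the right product and component:

```python
from urllib.parse import urlencode

# Sketch of a pre-filled bug template link; field names follow Bugzilla's
# enter_bug.cgi, values here are illustrative only.
def bug_url(package, maintainer, log_url):
    params = {
        "product": "Gentoo Linux",
        "short_desc": f"{package} fails to build (tinderbox)",
        "comment": f"Full build log: {log_url}",
        "assigned_to": maintainer,
    }
    return "https://bugs.gentoo.org/enter_bug.cgi?" + urlencode(params)
```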

Now there are quite a number of weak spots in this whole setup. The most obvious is the reliance on Amazon. It’s not just an ethical or theoretical question: the fact that S3 makes you pay per access is the reason why the list of logs is not visible to the public right now (the costs would add up quickly).

What I would like from a GSoC project would be a replacement for the whole thing, in which a single entity covers the collector, storage and frontend, without relying on Amazon at all. Without going all out on features that are difficult to manage, you need to find a way to store big binary data (the logs themselves can easily become many gigabytes in size), and then have a standard database with the list of entries like I have now. Technically, it would be possible to keep using S3 for the logs, but I’m not sure how good an idea that would be at this point.
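One obvious split, sketched below under my own assumptions (the schema and directory layout are not part of the proposal), is to keep the metadata in a small relational database while the log blobs live on the filesystem, content-addressed so they never collide:

```python
import hashlib
import pathlib
import sqlite3

# Minimal sketch: metadata in SQLite, log blobs on the filesystem.
# Schema and layout are assumptions for illustration.
LOG_ROOT = pathlib.Path("/var/lib/tinderbox/logs")

def init(db):
    db.execute("""CREATE TABLE IF NOT EXISTS logs (
                      id INTEGER PRIMARY KEY,
                      package TEXT NOT NULL,
                      timestamp TEXT NOT NULL,
                      path TEXT NOT NULL,
                      warnings INTEGER NOT NULL DEFAULT 0)""")

def store(db, package, timestamp, data, warnings):
    # The digest-based path keeps the multi-gigabyte blobs out of the
    # database while staying trivially addressable from its records.
    digest = hashlib.sha256(data).hexdigest()
    path = LOG_ROOT / digest[:2] / f"{digest}.log"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    db.execute("INSERT INTO logs (package, timestamp, path, warnings)"
               " VALUES (?, ?, ?, ?)",
               (package, timestamp, str(path), warnings))
    db.commit()

db = sqlite3.connect("tinderbox.db")
init(db)
```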

*Of course, one of the reasons why the collector and the frontend are split in my current setup, is that I thought that network connectivity between the tinderbox and the Internet couldn’t be entirely guaranteed; on the other hand, while the connection between the tinderbox and the collector is guaranteed (they are on the same physical host), the collector might not have an Internet connection to push to S3/SimpleDB, so…*

On the frontend side, a new frontend would have better integration with Bugzilla: for instance, it would be nice if I could just open the log and have a top frame (or div, I don’t care how it’s implemented) showing me a quick list of open bugs for the package, so I no longer have to search Bugzilla to see whether the problem has been reported already. It would also be nice to be able to attach the log to newly opened bugs, but that’s something I care about only relatively; if nothing else because some logs are so big that to attach them you’d have to compress them with xz, and even then they might not work.
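For the quick list of open bugs, modern Bugzilla installations expose a REST API that makes this kind of lookup straightforward; a minimal sketch, assuming a Bugzilla new enough for the REST endpoint and using a hypothetical package name:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Searches bug summaries for the package name and keeps only open bugs;
# "dev-libs/libfoo" below is a made-up example.
def bugs_for(package):
    query = urlencode({"summary": package,
                       "include_fields": "id,summary,is_open"})
    with urlopen(f"https://bugs.gentoo.org/rest/bug?{query}") as resp:
        return [bug for bug in json.load(resp)["bugs"] if bug["is_open"]]

for bug in bugs_for("dev-libs/libfoo"):
    print(bug["id"], bug["summary"])
```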

But possibly the most important part of this would be to change the way the tinderbox connects to the collector. Instead of keeping tar and netcat, I would like Portage to implement a “client” by itself. This would make it easier to deploy the collector for uses other than tinderboxing (such as an organization-wide deployment), and at the same time would allow (time permitting) expanding what is actually sent. For instance, right now I have to gather and append the autoconf, automake and similar error logs to the main build log, to make sure that we can read them when we check the build log on the bug… if Portage were able to submit the logs by itself, it would submit the failure logs as well as config.log and possibly other output files.
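What such a client could look like, very roughly: bundle the main log with the auxiliary files and push the archive to the collector. The endpoint URL and upload format below are made up for illustration:

```python
import io
import pathlib
import tarfile
import urllib.request

# Hypothetical endpoint; the point is what a built-in client could send
# beyond what tar|netcat carries today.
COLLECTOR = "http://collector.example.com/submit"

def submit(build_log, workdir):
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as archive:
        # The main build log, plus the auxiliary files mentioned above:
        # config.log and similar configure/automake output.
        archive.add(build_log, arcname=pathlib.Path(build_log).name)
        for extra in pathlib.Path(workdir).rglob("config.log"):
            archive.add(str(extra), arcname=str(extra.relative_to(workdir)))
    request = urllib.request.Request(
        COLLECTOR, data=buf.getvalue(),
        headers={"Content-Type": "application/gzip"})
    with urllib.request.urlopen(request):
        pass  # a real client would check the response and retry on failure
```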

Final note: no, you can’t store the logs in an SQL database. But if you’re going to take part in GSoC and want to help with the tinderbox, this would be what I (as the tinderbox runner) need the most, so… think about it! I can provide a few more insights and the reasoning behind them (for instance why I think this really has to be written in Python, and why it has to support IPv6), if there is interest.

Comments
  1. Not sure of the details of your logs, but have you looked into logstash/kibana/elasticsearch?
