So finally, after spending one full day alone at my sister’s yesterday, I finished at least the part of the Tinderbox log analysis code that takes care of gathering the logs and highlighting the bad stuff in them.
The code is all available and you can easily improve upon it if you want; now I think I should have put everything back together in a single git repository with the tinderboxing script, but that can be arranged at a different point, hopefully.
What is interesting though is discussing the design, because by all means it’s not a simple one, and you can be deceived into thinking I was out of my mind when I wrote it.
First of all there is the fact that the analysis itself is offloaded to a different system; in this case a different container. The reason for this is simply the reliability of the tinderbox scripts themselves. Due to the way the tinderbox works, even system-level software can easily break during an upgrade, which is one of the reasons why the process is not easy to automate. Because of this I’m not interested in adding either analysis logic or complex “nodes” to the tinderbox host. My solution has been relatively easy: I just rely on tar and nc6. I would have loved using just busybox for the whole of it, but the busybox implementation of netcat does not come with the -q option, which is required to complete the disconnection once the standard input is terminated.
Using tar gives me a very bare protocol I can use to provide pairs of filename and content, which can then be analysed on the other side with the Ruby script in the repository linked at the top of the post. This script uses archive-tar-minitar, with a specially patched version in Gentoo, as otherwise it wouldn’t be able to access “streamed” archives. I’m tempted to fork it and release a version 0.6 with a few more fixes and more modern code, but I just don’t have the time right now; if you are interested, ping me.
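To make the “streamed archive” point concrete, here is a minimal sketch in Python (the real script is Ruby and uses archive-tar-minitar, so this is only an illustration of the idea, not the actual code). It assumes the incoming connection is exposed as a file-like object that cannot seek; the helper names and the example filename are made up for the illustration.

```python
import io
import tarfile


def iter_streamed_logs(stream):
    """Yield (filename, content) pairs from a non-seekable tar stream."""
    # Mode "r|" makes tarfile read strictly sequentially, without seeking,
    # which is what a socket or a pipe requires.
    with tarfile.open(fileobj=stream, mode="r|") as archive:
        for member in archive:
            if not member.isfile():
                continue
            # In stream mode each member has to be read before moving on.
            data = archive.extractfile(member).read()
            yield member.name, data.decode("utf-8", errors="replace")


def build_example_archive(files):
    """Build an in-memory tar archive, standing in for what tar sends over nc6."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, text in files.items():
            payload = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    example = build_example_archive({"app-misc/foo-1.0.log": "line one\nline two\n"})
    for name, content in iter_streamed_logs(example):
        print(name, len(content))
```

The important detail is the sequential read mode: a reader that wants to seek back and forth in the archive cannot work on a network connection, which is the limitation the patched library works around.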
One important thing to note here is that the script uses Amazon’s AWS, in particular S3 and SimpleDB. The reason might not be obvious, as the system has enough space to store the log files for a very long time. So why did I do it that way? While storage abounds, Excelsior resides in the same network as my employer’s production servers (well, on the same pipe, not on the same trusted network of course!), so, to avoid swamping that pipe, I don’t want to give anybody access to the logs on the system itself. Using S3 should be cheap enough that I can keep them around for a very long time!
Originally I planned on having the script be called one-off by xinetd, spawning multiple processes to avoid threading (which is not Ruby’s forte), but the time taken for AWS to be initialised on every call wasn’t worth it, so I wrote it as it is now. Yes, there is one bad thing: the script expects Ruby 1.9 and won’t work with Ruby 1.8. Why? Well, mainly because it was easier to write this way, but then again, since I’m going to need concurrent processing at some point, which means I need to make the script multithreaded, Ruby 1.9 is a good choice. After all I can decide what I run, no?
After the log is received, the file is split line by line and a regexp is applied to each line (extra thanks to blackace and Joachim for helping with a human OCR over a very old, low-res screenshot of my emacs window with the tinderbox logs grep command); if there are any matches, the lines are marked in red. This creates very big HTML files, obviously, but they should be fine. If they start to pile up, I’ll look into compressing them before storing them on Amazon.
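The highlighting step can be sketched roughly like this in Python (again, the real script is Ruby). The pattern below is purely hypothetical, since the actual expression is not given in this post, and the HTML layout is an assumption as well.

```python
import html
import re

# Hypothetical pattern: the real regexp is not listed in the post.
BAD_LINE = re.compile(r"error:|\* ERROR:|QA Notice", re.IGNORECASE)


def render_log(text, pattern=BAD_LINE):
    """Return (html_page, match_count) for a plain-text build log."""
    matches = 0
    rows = []
    for line in text.splitlines():
        # Escape first, so that log content cannot break the markup.
        escaped = html.escape(line)
        if pattern.search(line):
            matches += 1
            rows.append(f'<span style="color: red">{escaped}</span>')
        else:
            rows.append(escaped)
    body = "\n".join(rows)
    return f"<pre>\n{body}\n</pre>", matches


if __name__ == "__main__":
    page, count = render_log("checking for gcc... yes\nfoo.c:1: error: oops\n")
    print(count)
    print(page)
```

Returning the match count together with the page is convenient, since that number is one of the attributes worth recording alongside the stored file.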
The use of SimpleDB is simply because I didn’t want to have to set up two different connections. Since all AWS services use the same login, I only need one, and it serves both the storage and the database. While SimpleDB’s “eventual consistency” makes it more of a toy than a reliable database, I don’t really care much; the main issue is with concurrent requests, and which one is served first makes no difference to me, as I only have to refresh my window to fetch a new batch of logs to scan.
In the database I’m adding very few attributes of the files: the link to the file itself, the package it belongs to, the date of the log, and how many matches there have been. My intention is to extend this to show me some legend on what happened: did it fail testing? did it fail the ebuild? are they simply warnings? For now I went with the simplest option though.
To see what’s going on, though, I wrote a completely different script. Sinatra-based, it only provides a single entry point on localhost, and gives you the last hundred entries in the SimpleDB which have matches. I’m going to try making this more extensible in the future as well.
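As a rough illustration of what that listing boils down to, here is a small in-memory Python sketch. It deliberately does not talk to SimpleDB or use Sinatra; the `LogEntry` record and its field names are assumptions that simply mirror the attributes described above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LogEntry:
    # Field names are assumptions mirroring the stored attributes.
    url: str
    package: str
    date: datetime
    matches: int


def latest_with_matches(entries, limit=100):
    """Return the newest `limit` entries that have at least one match."""
    flagged = [entry for entry in entries if entry.matches > 0]
    flagged.sort(key=lambda entry: entry.date, reverse=True)
    return flagged[:limit]


if __name__ == "__main__":
    entries = [
        LogEntry("https://example.invalid/a.html", "app-misc/a", datetime(2012, 5, 1), 0),
        LogEntry("https://example.invalid/b.html", "app-misc/b", datetime(2012, 5, 2), 3),
    ]
    for entry in latest_with_matches(entries):
        print(entry.package, entry.matches)
```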
One thing I skipped over in all this: to make it easier to actually apply this to different systems, I’m organising the logs by hostname, simply by checking where the connection is coming from (over IPv6, I have access to the reverse DNS for them). This is why I want to introduce threaded responses: quite soon, Excelsior will run some other tinderbox (I’m not yet sure whether to use AMD64 Hardened or x86 Hardened, and x32 is also waiting to work as a third arch), which means that the logs will be merged and “somebody” will have to sift through three copies of them. At least with this system it’s feasible.
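Organising by hostname could look something like the following Python sketch. The key layout (`hostname/filename`) and the function names are assumptions made for the example; the only thing taken from the description above is that the peer address of the connection is turned into a name through reverse DNS.

```python
import socket


def reverse_lookup(address):
    """Return the reverse-DNS name for an IP address, or the address itself."""
    try:
        # NI_NAMEREQD makes the lookup fail instead of echoing the address.
        hostname, _port = socket.getnameinfo((address, 0), socket.NI_NAMEREQD)
    except OSError:
        return address
    return hostname


def log_key(peer_address, filename, resolve=reverse_lookup):
    """Build a storage key that groups logs by the sending host."""
    return f"{resolve(peer_address)}/{filename}"


if __name__ == "__main__":
    # The resolver is injectable, so this runs without touching the network.
    print(log_key("2001:db8::1", "foo-1.0.log",
                  resolve=lambda addr: "tinderbox.example.org"))
```

Falling back to the bare address when no reverse record exists keeps logs from an unknown host grouped together rather than dropped.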
Anyway now I guess I’ll sign off, go watch something on Sky, which I’ll probably miss for a couple of weeks when I come back to the US, just the time for me to find a place and get some decent Internet.