The end of an era, the end of the tinderbox

I’m partly sad, but for the most part this is a weight off my shoulders, so I can’t say I’m not at least in part glad about it, even though the context in which this is happening is not exactly what I expected.

I turned off the Gentoo tinderbox, never to come back. The S3 storage of logs is still running, but I’ve asked Ian to see if he can attach everything at his own pace, so I can turn off the account and be done with it.

Why did this happen? Well, it’s a long story. I had already stopped running it a few months ago because I got tired of Mike behaving like a child, as I already reported in 2012: he kept closing my bugs because the logs are linked (from S3) rather than attached. I have already made my position clear that it’s a silly distinction, as the logs are not going to disappear into thin air (indeed I’ll keep the S3 bucket for them running until they are all attached to Bugzilla), but since he keeps insisting that it’s “trivial” to change the behaviour of the whole pipeline, I decided to give up.

Yes, it’s only one developer, and yes, lots of other developers took my side (thanks guys!), but it’s still aggravating to have somebody who can do whatever he likes without reporting to anybody, ignoring Council resolutions, QA (when I was the lead) and essentially using Gentoo as his personal playground. And the fact that only two people (Michał and Julian) have been pushing for a proper resolution is a bit disappointing.

I know it might feel like I’m taking my toys and going home — well, that’s what I’m doing. The tinderbox has been draining my time (a little) and my money (quite a bit more), but those I was willing to part with — having my motivation drained by assholes in the project was not in the plans.

In the past six years that I’ve been working on this particular project, things have evolved:

  • Originally, it was a simple chroot with a looping emerge, inspected with grep and Emacs, running on my desktop and intended to catch --as-needed failures. It went through lots of disks, and got me off XFS for good due to kernel panics.
  • It was moved to LXC, which is why the package entered the Gentoo tree, together with the OpenRC support and the first few crude hacks.
  • When I started spending time in Los Angeles for a customer, Yamato under my desk got replaced with Excelsior, which was crowdfunded and hosted, for two years straight, by my customer at the time.
  • This is where the rewrite happened, from attaching logs (which I could earlier do with more or less ease, thanks to NFS) to storing them away and linking to them instead. This had to do mostly with the ability to remote-manage the tinderbox.
  • This year, since I no longer work for the company in Los Angeles and instead work in Dublin for a completely different company, I decided Excelsior was better off in a personal space, and rented a full 42-unit cabinet with Hurricane Electric in Fremont, where the server is still running as I type this.

You can see that it’s not that I’m trying to avoid spending time to engineer solutions. It’s just that I feel that what Mike is asking is unreasonable, and the way he’s asking it makes it unbearable. Especially when he feigns to care about my expenses — as I noted in the previously linked post, S3 is dirt cheap, and indeed it now comes down to $1/month given to Amazon for log storage and access, compared to $600/month to rent the cabinet at Hurricane.

Yes, it’s true that the server is not doing only tinderboxing – it also runs some FATE instances, and I have been using it as a development server for my own projects, mostly open-source ones – but tinderboxing is the original use for it, and if it weren’t for that I wouldn’t be paying so much to rent a cabinet; I’d be renting a single dedicated server off, say, Hetzner.

So here we go: the end of the era of my tinderbox. Patrick and Michael are still continuing their efforts, so it’s not like Gentoo is left without integration testing, but I’m afraid it’ll be harder for at least some of the maintainers who leveraged the tinderbox heavily in the past. My contract with Hurricane expires in April; at that point I’ll get the hardware out of the cabinet and will decide what to do with it — it’s possible I’ll donate the server (minus hard drives) to the Gentoo Foundation or someone else who can use it.

My involvement in Gentoo might also suffer from this; I will hopefully be dropping one of the servers I maintain off the net pretty soon, which will be one less system to build packages for, but I still have a few to take care of. For the moment I’m taking a break: I’ll soon send an email declaring open season on my packages; I have already locked my Bugzilla account to avoid providing harsher responses in the bug linked at the top of this post.

The virt-manager pain

So I’m still trying to come up with a decent way to parse those logs. Actually I lost a whole batch of them because I forgot to rename them before processing, but that made me realize that the best thing I can do is process the logs outside of the tinderbox itself. But this is a topic for another time.

Another thing I’m trying to set up is a virtual machine to test the new x32 ABI that Mike has made available in Gentoo. This is important for the tinderbox as well as for FATE (which is already being run for a standard Gentoo Hardened setup on that very same hardware). Unfortunately I can’t use this via LXC yet — simply because the currently running kernel does not support x32 executables.
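For what it’s worth, a quick way to check this (assuming the kernel exposes its configuration through /proc/config.gz; otherwise grep the config under /usr/src/linux) is:

    # Quick check for x32 ABI support in the running kernel.
    zgrep CONFIG_X86_X32 /proc/config.gz
    # CONFIG_X86_X32=y  -> x32 binaries can run (and an x32 container can work)
    # "is not set"      -> the kernel has to be rebuilt first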

This means that I have to use KVM and go full-virtualisation. Thanks to Tim, I’ve got a modified SysRescueCD ISO that should let me take care of the install. Unfortunately, this is still not easy to deal with, for a number of reasons.

The first is that virt-manager is just slow and painful, like some kind of slow and painful death. The whole idea of using a frontend that connects through SSH is that you don’t want to “feel” the lag… but virt-manager makes you feel the lag even more than a command-line SSH connection. I’m under the impression that the guys who work on this kind of stuff only ever tried it on a local connection, and never from the other side of the world. I mean, I understand you might have concurrency issues, but do you really have to make me wait two minutes to switch from CPU settings to Memory settings to Disk settings when editing the VM?

The second issue is that even though I was able to set up a testing VM for x32… qemu doesn’t like the additional instruction sets (SIMD) that Bulldozer comes with; something within the C library causes every x32 binary to be killed with SIGILL (Illegal Instruction). The problem is likely in some of the indirect-binding functions being used — my guess, based on the NEWS file of the 2.15 release, is strcasecmp(), which has been optimised through the use of SSE4.2 and AVX (both of which are available on that server) — I have a ¾-written, half-drawn post about this kind of optimisation in my queue; I’ll see if I can post it over the weekend.

The end result is that I spent the better part of three hours on virt-manager before accepting that the way to go is to update the host’s kernel and just run the usual container. Just so you know, the final step that “creates” the VM (which is not the LVM allocation step!) took over half an hour. This is silly; what the heck was it doing during that time?

Oh, and yes: two years later virt-manager still keeps defunct ssh processes around because it never reaps them (check the comments).

Right now I’m trying to get this to work with LXC, but I’m not having much luck with it either; and yes, I did update the init script to handle x32 containers correctly, which didn’t work before… it might have some problems if you’re going to use this on SPARC, because I’m not handling that properly yet, but this is (again) a topic for another time.

Tinderboxing again

Finally Excelsior is doing what it was bought to do: tinderboxing. Well, not really at full capacity, but I’ve got enough pieces of the puzzle in place that it should soon start building regularly.

What the tinderbox setup is doing now is actually a limited run: it’s building the reverse dependencies of virtual/ffmpeg to find out which packages actually require libpostproc, so that we can fix their dependencies with libav — to follow up on Tomáš’s post, I’m also strongly attached to libav, being, together with Luca, part of the project’s development team.
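For those wondering how such a list gets put together, a rough sketch (not necessarily the exact command the tinderbox script runs) is to ask gentoolkit for the reverse dependencies of the virtual:

    # Just a sketch, not the actual tinderbox invocation: list the packages
    # that declare a dependency on virtual/ffmpeg, then feed that list to
    # the tinderbox for rebuilding.
    equery --quiet depends virtual/ffmpeg | sort -u > ffmpeg-revdeps.list
    wc -l ffmpeg-revdeps.list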

Speaking about libav, one of the things Excelsior has been doing since a couple of days ago is running FATE, so as to cover both AMD Bulldozer’s architecture and Gentoo Hardened. For those who do not know, FATE is an automated testing environment that configures, builds, and runs the tests for libav and then sends the results to a central server for analysis. The whole build takes three minutes on Excelsior, thanks to libav’s highly-parallelized makefiles.
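For the curious, a FATE run boils down to roughly the following (the samples path here is illustrative; the actual client also takes care of submitting the results to the central server):

    # Roughly what each FATE run does (paths illustrative):
    ./configure --samples=/var/cache/fate-samples
    make -j$(nproc)
    make fate    # builds and runs the test suite against the samples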

There is only one problem with the tinderbox, and it is once again log analysis. To be honest, this time I have some idea of how to proceed: the first step is to replace the grep command I have used before with a script that produces an HTML output of the log file. Yes, many people have said before that HTML is not a good idea for this kind of thing; but since nobody else has helped me write a better log analysis tool, that’s going to be enough.

So what has to happen is reading the input log line by line, then creating an HTML file with (numbered) lines, where error lines are marked with a specific CSS class so a stylesheet can make the row red. As I said before, it would be nice to have some kind of JavaScript to jump from one error line to the next. Until this is all set up, though, just creating the HTML file will be enough.
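Something along these lines is what I have in mind; a minimal sketch, where the error pattern, the stylesheet name and the markup are placeholders rather than the final thing:

    #!/bin/bash
    # Minimal sketch of the log-to-HTML conversion described above: number
    # every line, HTML-escape it, and tag lines matching an error pattern
    # with a CSS class so the stylesheet can paint the row red.
    log=$1
    echo '<html><head><link rel="stylesheet" href="log.css"/></head><body><table>'
    n=0
    while IFS= read -r line; do
        n=$((n + 1))
        esc=$(printf '%s' "$line" | sed 's/&/\&amp;/g; s/</\&lt;/g; s/>/\&gt;/g')
        cls=""
        if printf '%s' "$line" | grep -qiE 'error|undefined reference'; then
            cls=' class="error"'
        fi
        printf '<tr%s id="l%d"><td>%d</td><td><pre>%s</pre></td></tr>\n' \
            "$cls" "$n" "$n" "$esc"
    done < "$log"
    echo '</table></body></html>'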

The next step would probably be getting the HTML output onto S3 (for easy access), and writing it down in a database that actually gives you an index of the error logs — for those who wonder, using s3fs as PORT_LOGDIR is not a good idea at all. I guess tomorrow I might try to find some time to work on this. I could also find some time over the weekend, given that my original plan (biking around with Luca) is off the table after I fell and (lightly) injured myself last night.

When upstream lacks a testsuite

… you can make your own, or try at least.

I maintain in Portage a little ebuild for uif2iso; as you probably already know, the foo2iso tools are used to convert various types of proprietary disk images produced by Windows software into ISO9660 images that can be used under Linux. Quite obviously, making unit tests out of such a tool is pointless, but regression testing at the tool level might actually work. Unfortunately, for obvious reasons, upstream does not ship testing data.

Not exactly happy with this, I started considering what options I had, and thus my decision: if upstream does not ship any testsuite, I’ll make one myself. The good thing with ebuilds is that you can write whatever you want for the test in src_test. I finally decided to build a UIF image using MagicISO on my Windows XP vbox, download it (together with the MD5 digests of the files I put in it) conditionally on the test USE flag, and during the test phase convert it to ISO, extract the files, and check that the MD5 digests are correct.

Easier said than done.

To start with I had some trouble deciding what to put in the image; of course I could use some random data, but I thought that at that point I could at least make it fun for people to download the test data, if they wanted to look at it. My choice fell on finding some Creative Commons-licensed music and using a track from that. After some looking around, my choice went to the first track of Break Rise Blowing by Countdown on Jamendo.

Now, the first track is not too big, so it’s not a significant overhead to download the test data, but there is another point here: MagicISO has three types of algorithms it can use: default, best compression and best speed; most likely they are three compression levels in LZMA or something along those lines, but just to be safe I put all three of them to the test. The resulting file with the three UIF images and the MD5 checksums came in at less than 9MB, an acceptable size.

At that point I started writing the small testsuite, and the problems started: uif2iso always returns 1 at exit, which means you can’t use || die or it would always die. Okay, good enough, just check that the file was created. Then you have to get the files out; nothing could be easier when you’ve got libarchive, which can extract ISO files as if they were tarballs: just add that as a dependency with the test USE flag enabled. A bit of overhead, but at least I can easily extract the data to test.
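To give an idea, the src_test I have in mind looks roughly like this; the names of the test files and of the data directory are made up for the example:

    # Illustrative sketch only; test file names and layout are made up.
    src_test() {
        local algo
        for algo in default bestcompression bestspeed; do
            # uif2iso exits with status 1 even on success, so we cannot
            # rely on "|| die"; check that the ISO was actually produced.
            uif2iso "${WORKDIR}/testdata/${algo}.uif" "${T}/${algo}.iso"
            [[ -f ${T}/${algo}.iso ]] || die "no ISO produced for ${algo}"

            # bsdtar (from libarchive) can unpack ISO9660 images like tarballs.
            mkdir "${T}/${algo}" || die
            bsdtar -xf "${T}/${algo}.iso" -C "${T}/${algo}" || die "extraction failed"

            # Verify the extracted content against the known-good digests.
            cd "${T}/${algo}" || die
            md5sum -c "${WORKDIR}/testdata/${algo}.md5" || die "checksum mismatch"
        done
    }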

It turns out instead that the ISO file produced by uif2iso is going to be a test case for libarchive, since the latest release fails to extract it. I mailed Tim and I hope he can fix it up for the next release (Tim is fantastic with this: when 2.5.902a was released, I ended up finding a crasher on a Portage-generated binpkg; I just had to mail it to him, and in the next release it was fixed!). The ISO file itself seems fine, since loop-mounting it works just fine. The problem is that I know of no other tool that can extract ISO images quickly and without having to command it file by file (iso-read from libcdio can do that, it’s just too boring); if somebody has suggestions I’m open to them.

This is the fun that comes out of writing your own test cases, I guess. On the other hand, I think it’s probably a good idea to keep problematic archives around, if they have no licensing problems (Gentoo binpkgs might, since they are built from sources and you’d have to distribute the sources with the binaries, which is why I wanted some Creative Commons-licensed content for the images), since that allows you to test stuff that broke before and ensure it never breaks again. Which is probably the best part of unit, integration and system testing: you take a bug that was introduced in the past, fix it, and write a test so that, if it is ever reintroduced, it is caught by the tests rather than by the users again.

Has anybody said FATE?