Tinderbox problems

My tinderbox, for those who don’t know it, is a semi-automatic continuous integration testing tool. It repeatedly and continuously tests the ebuilds in tree, to identify failures and breakages as soon as they appear. The original scope of the testing was to make sure the tree was ready for using --as-needed by default – it became the default a few months ago – but since then it has expanded to a number of other testing conditions.

Last time I wrote about it, I was going to start reporting bugs related to detected overflows, and so I did; afterwards there have been two special requests for testing: one from the Qt team for the 4.7 version, and one from the GNOME team for the GTK+ 2.22 release — explicitly to avoid my rants, as it happens. The latter actually became a full-sweep tinderbox run, since there are just too many packages using GTK+ in tree.

Now, I already said before that running the tinderbox takes a lot of time, and thankfully Zac and Kevin are making my task there much easier; on the other hand, right now I’m starting to have trouble running it, mostly for purely economic reasons. On one side, running it 24/7 for the past years has started to take its toll on my power bills; on the other, I’m starting to need Yamato’s power to complete some job tasks, and during that time the tinderbox slows down or pauses altogether. This, in turn, causes the tinderbox to go quite out of sync with the rest of Gentoo, which requires even more time to catch up afterwards.

But there are a few more problems that are not just money-related. One is that the reason why I stopped posting regular updates about the tinderbox itself and its development is that I was asked by one developer not to write about it, as he didn’t feel it was right for me to ask for help with it publicly — but as it turns out, I really cannot keep doing it this way, so I have to choose between asking for help and stopping the tinderbox altogether. It also doesn’t help that very few developers seem to find the tinderbox useful for what it does: only a few people actually ask me to run it for their packages when a new version is out, and as I said, the GNOME team only seemed compelled to ask me about it to avoid another rant from me, rather than to actually identify and solve the related problems.

Not all users seem to be happy with the results, either. Since running the tinderbox also means finding packages that are broken, or that appear so, which usually end up being last-rited and scheduled for removal, it can upset users of those packages, whether they really need them or are just tied to them for legacy or historical reasons. Luckily, I found a way to turn this to my advantage, which is more or less what I already did with webmin some time ago: suggest that users either pick up the package themselves or, alternatively, hire someone to fix it up and get it up to speed with QA practices. So far nobody has hired me to do that, but at least both webmin and nmh seem to have gotten a proxy-maintainer to deal with them.

Talking about problems, the current workflow with the tinderbox is a very bad compromise. Since auto-filing bugs is a bit troublesome, I haven’t gotten around to it just yet; Kevin provided me with some scripts I’d have to try, but they have one issue, which is the actual reason why I haven’t started filing bugs with them: they don’t fit my current process. I skim the tinderbox build logs with a grep process running within an Emacs session; I check log by log for bugs I already reported (that I can remember); I search Bugzilla through Firefox in the background (one of the few uses I still have for Firefox; otherwise I mostly migrated to Chromium months ago) to see if someone else reported the same bug; then, if all that fails, I manually file the bug using templates.

This works out relatively well thanks to a long series of “coincidences”: the logs are available to be read as my user, and Emacs can show me a browsable report of a grep; Bugzilla allows a direct search query from the Firefox search bar; and most of the time the full build log is enough to report the bug. Unfortunately it has a number of shortcomings; for instance, for emerge --info I have to manually copy and paste the output from a screen terminal (also running on the same desktop in the background).

To actually add a self-reporting script to the workflow, what I’d be looking for is a way to launch it from within Emacs itself, picking out the package name and the maintainers from the log file itself (Portage reports maintainer information at the top of the log nowadays, one thing that Zac implemented and that made me very happy). Another thing that would help would be Bugzilla integration with Emacs to find the bugs; this may actually be something a Gentoo user who’s well versed in Emacs could help a lot with: adding a “search on Gentoo Bugzilla” command to Emacs, so that it identifies the package a build log or ebuild refers to, and reports within Emacs the list of known open bugs for that package. I’m sure other developers using Emacs would find that very useful.
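The first half of that — turning a build log into a package atom and a Bugzilla search — is simple enough to sketch in shell. The log path layout and the version-stripping regex here are assumptions for illustration, not the actual tinderbox naming scheme:

```shell
#!/bin/sh
# Sketch: derive a category/package atom from a build log path (assumed to
# look like /var/log/portage/<category>/<pkg>-<version>:<timestamp>.log)
# and print a Gentoo Bugzilla quicksearch URL for it.

log_to_atom() {
    echo "$1" | sed -E \
        -e 's:^.*/([^/]*/[^/]*)$:\1:' \
        -e 's/:.*$//' \
        -e 's/\.log$//' \
        -e 's/-[0-9][0-9.]*(-r[0-9]+)?$//'
    # keep the last two path components, drop the timestamp suffix,
    # the .log extension, and finally the version (with optional -rN)
}

atom=$(log_to_atom "/var/log/portage/app-misc/foo-1.2.3:20101010.log")
echo "https://bugs.gentoo.org/buglist.cgi?quicksearch=${atom}"
```

From there, wiring it into Emacs would only need a small wrapper command that runs the script on the log under point.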

I also tried using another feature that Zac implemented, to my delight: compressed logs. Having gzip-compressed logs makes the whole process faster for the tinderbox (logs are definitely smaller on disk, and thus require less I/O), and makes it easier to store older data; unfortunately, Bugzilla does not hand those logs out to browsers transparently. Worse, it seems to double-compress them for Firefox even though they are served with a proper MIME type declaration, resulting in difficult-to-grok logs (and I still have to compress some logs, because they are just too big to be uploaded to Bugzilla otherwise). This is very unfortunate, because scarabeus asked me before whether I could get the full logs available somewhere; serving them uncompressed is not going to be fun for any hosting service. Maybe Amazon S3 could turn out useful here.
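The double-compression failure mode is easy to reproduce locally; this sketch uses made-up file names, but shows why a log that gets gzip-compressed again on the way out of the server needs two decompression passes before it is readable:

```shell
#!/bin/sh
# An attachment stored as .gz, compressed a second time by the server,
# still looks like gzip data after the browser's single decompression pass.
tmp=$(mktemp -d)
printf 'build log contents\n' > "$tmp/build.log"
gzip -c "$tmp/build.log"    > "$tmp/build.log.gz"  # the attachment as uploaded
gzip -c "$tmp/build.log.gz" > "$tmp/double.gz"     # what the browser ends up saving
gunzip -c "$tmp/double.gz"  > "$tmp/once"          # one pass: still gzip data
gunzip -c "$tmp/once"                              # second pass: the actual log
rm -rf "$tmp"
```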

Actually, there is also one feature that the tinderbox lost lately: the flameeyestinderbox user on identi.ca, which, as a bot, logged everything the tinderbox was doing, was banned again. The first time, support unbanned it the same day; this time I didn’t even ask. An average of around 700 messages a day, with a single follower (me), probably doesn’t make it very palatable to the identi.ca admins. Jürgen suggested that I request my own private StatusNet instance, but I’m not really sure I want to ask them to store even more of my data; it’s unlikely to be of any use to them. Maybe if I ever end up having to coordinate more than one tinderbox instance, I’ll set up a proper StatusNet installation and let that one aggregate all the tinderboxes reporting their issues. Until a new solution is found, I have fully disabled bti in the tinderbox code.

Anyway, if you wish to help, feel free to leave comments with ideas, or discuss them on the gentoo-qa mailing list (Kevin, you too please; your mail ended up in my TODO queue, and lately I just don’t have time to go through that queue often enough, sigh!); help with log analysis, bug opening and CCing is all going to make running the tinderbox much smoother. And that’s going to be needed if I can’t run it 24/7. As for running the tinderbox for longer: if you run a company and you find the tinderbox worthwhile, you can decide to hire me to keep it running, or donate/provide access to boxes powerful enough to run another instance of it. You can even set up your own (but it might get tricky to handle, especially bug reporting, since if you’re not a Gentoo developer or a power user with editbugs capabilities, you cannot directly assign the filed bugs).

All help is appreciated, once again. Please don’t leave me here alone though…

Two magic words: “merged upstream”

The lives of distribution packagers are full of words that make them cringe – backport, regression, hotfix, custom patch, … – but there are two that can make your day truly shine: merged upstream.

When you’re maintaining a package for any kind of software, you have to mediate between the original upstream intentions and requests, and the distribution policy; in Gentoo, that policy includes respecting the user-chosen flags, especially since some of those are mandated by us and are needed to make sure that the software installed on the users’ systems is actually working as intended, but that’s just one of a long list of things you have to care about.

Most of the time, because of this, you end up patching the software itself: modifying the code so that it behaves the way the distribution wants, even if that diverges from upstream behaviour, or, in the best of cases, making it abide by both sets of restrictions. Sometimes you have to take a fix that has already been applied to the upstream repository, for instance in a development branch, and apply it to the currently packaged version (a backport). The least boring case is when users report a problem to you, which might or might not apply to other distributions, and you have to fix it anew.

When that’s the case, Gentoo’s guideline is to write patches so that they can be sent upstream (I also wrote about that — and I’m tempted to just re-publish my articles on my website rather than keep them there, especially given that the page is more broken each time I visit it). Unfortunately, some upstreams are more difficult to work with than others, and sometimes the patches are so Gentoo-specific that it makes no sense to send them upstream (for instance the S/Key patch for sudo, as it supports the Gentoo-specific port of OpenBSD’s S/Key support).

Now, as part of the Ruby team, I’ve already written over and over about our need to patch a huge number of gems and other Ruby libraries: a lot of the time simply to fix their Rakefile, less often to fix their tests… and on very rare occasions to hack or unhack the packages. Thankfully, these patches tend to be merged in quite quickly — I’m still not sure whether that’s due to the use of GitHub and of merge requests, or because releasing a Ruby gem is so quick and easy (which is a definite pro of RubyGems, even for us packagers).

With other kinds of software, the merge-and-release approach is not that common, but luckily there are exceptions; Robin Gareus has been terrific at merging my patches for liboauth and releasing a new version, which means that while I added it to the tree with three patches to be applied, you can get it now with no patch at all: vanilla 0.8.9!

Less quick to release, but that’s understandable considering its criticality, has been Linux-PAM; the newly released version, 1.1.2, which I added to the tree today, is down to two patches, from the previous six of 1.1.1-r2. The remaining two are a bit tricky: one is something I’ve been keeping around since the 0.99 series, so a very long time ago, and is just a simple way for us not to build a bunch of test programs that we wouldn’t be using anyway; the other is a fix for the Berkeley DB detection in configure, to make it work with the way the libraries are installed in Gentoo (with the prefix on the library name, but no prefix on the ELF symbols; we use versioning instead to avoid collisions between them). I’m now trying to find a compromise with Thorsten Kukuk (my upstream Linux-PAM contact) so that the patches can be applied, and we can stop patching it altogether.

Being able to ship packages that are not patched at all is important for many reasons, and at least one applies even to people who don’t use the package at all. Obviously, staying closer to upstream’s code is a positive thing because it means you don’t risk upstream counting your users out of support range (well, they can still do that if you are using non-default build options, but in general it’s much easier for them to help you out if you don’t touch their code). But sending your own fixes to the original developers has another important result: the same fix becomes available to all the users of the program, not just those using your particular distribution. And finally, even people not using the package get a positive result, as long as they use Gentoo: fewer patch files in the tree mean fewer files and less overhead — and trust me, there is lots of overhead in the tree as it is right now; I’m not sure whether it’s worse or better than what we had at the time I originally wrote that rundown, but it certainly needs some kind of proper solution to be devised.

So I’m happy to thank for their merges Robin (liboauth), Thorsten (Linux-PAM), Karel (util-linux-ng), Cole (virt-inst), Thibault (fcron), Ludovic (ccid), … — the list goes a lot further, they are just the most recent upstreams I’ve exchanged email with!

On a side note: yes, I picked up co-maintainership of bti with Greg; the reason is that I’ve been using it to dent the tinderbox results. I’ve also branched it upstream to clean up a few things in the build system, and to implement one feature I’d very much like to use (--background); those fixes will only land with the 028 release, though, as they are not critical.

Sealed tinderbox

I’ve been pushing the tinderbox one notch stricter from time to time; a few weeks ago I set up the tinderbox so that any network access besides the basic protocols (HTTP, HTTPS, FTP and rsync) was denied. The idea is that if the ebuilds try to access the network by themselves, something is wrong: once the files are fetched, that should be enough. Incidentally, this is why live ebuilds should not be in the tree.

Now, since I’ve received a request regarding the actual network traffic generated by the tinderbox, I decided to go one step further still, and make sure that, apart from the tasks that do require network access, the tinderbox does not connect to anything outside of the local network. To do so, I set up a local rsync mirror, then added a Squid pass-through proxy that does not cache anything; at that point, rather than allowing selected protocols through the router for the tinderbox, I simply reject any traffic from the tinderbox toward the Internet. All the outgoing connections originating from the tinderbox go through Yamato, so I have something like this in my make.conf:

FETCHCOMMAND="/usr/bin/curl --location --proxy yamato.local:3128 --output \"\${DISTDIR}/\${FILE}\" \"\${URI}\""
RESUMECOMMAND="/usr/bin/curl --location --proxy yamato.local:3128 --continue-at - --output \"\${DISTDIR}/\${FILE}\" \"\${URI}\""
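The router-side half of this sealing can be sketched with a couple of iptables rules; the addresses here are made up for illustration and are not my actual configuration:

```shell
#!/bin/sh
# Hypothetical addresses: the tinderbox host and the local network.
TINDERBOX_IP=192.168.0.20
LOCAL_NET=192.168.0.0/24

# let the tinderbox talk to the local network (rsync mirror, Squid on Yamato)...
iptables -A FORWARD -s "$TINDERBOX_IP" -d "$LOCAL_NET" -j ACCEPT
# ...and reject everything else it originates toward the Internet
iptables -A FORWARD -s "$TINDERBOX_IP" -j REJECT
```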

Note: when googling how to set those two variables up in Gentoo to use curl, I did find some descriptions on the Gentoo Forums that provide most of this; unfortunately, everything I found ignores the --location option, without which curl fails to fetch files from the SourceForge mirrors and any other mirroring system that uses 302 Moved responses.

I also modified the bti-calling script so that the identi.ca dents are sent properly through the proxy. I didn’t set the http_proxy variable globally, because that would have made the sealing moot. Instead, by setting it up this way, explicitly for the fetch and dent commands, any testsuite that tries to fetch something, even via HTTP, will be denied.
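The trick is just shell environment scoping: a variable assigned as a prefix to a single command exists only for that invocation. A minimal demonstration (the proxy address is the one from my setup; the commands are placeholders):

```shell
#!/bin/sh
unset http_proxy    # make sure nothing is exported globally

# per-command environment: the proxy exists only for this one invocation
http_proxy="http://yamato.local:3128" sh -c 'echo "inside: ${http_proxy}"'

# the rest of the environment never sees it, so stray fetches still fail
echo "outside: ${http_proxy:-unset}"
```

The dent wrapper does the same thing with bti in place of the inner command.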

But… why should it be a problem if testsuites were to access services on the network? Well, the answer is actually easy once you understand the rules of Gentoo: what is not in package.mask is supposed to work, any bug found needs to be fixable, and testsuite results need to be reproducible, to make sure that the package works. When you rely on external infrastructure like Git repositories, you have no way to make sure that a problem, if one appears, can be fixed; and when your testsuite relies on remote network services, it might fail because of connection problems, and it will fail outright if the remote service is shut down entirely.

I’ve also been tempted to remove IPv4 connectivity from the tinderbox altogether; IPv6 should well be enough, given that it only needs to connect to Yamato, and it would be behind NAT anyway.