Why I don’t trust most fully automatic CI systems

Some people complain that I should let the tinderbox work by itself and either open bugs or mail the maintainers when a failure is encountered. They say that it should make it faster for the bugs to be reported and so on. I resist the idea.

While it does take time, most of the error logs I see in the tinderbox are not reported as bug. They can be either the same bug happening over and over and over again ­– like in the case of openldap lately, which fails to build with gnutls 3 – or they can be known/false positives, or they simply might not be due to the current package but something else that broke, like if a KDE library is broken because one of its dependencies changed ABI.

I had a bad encounter with this kind of CI systems when, for a short while, I worked on the ChromiumOS project. When you commit anything, a long list of buildbots pick up the commits and validate them against their pre-defined configurations. Some of these bots are public, in the sense that you can check their configuration and see the build logs, but a number of them are private and only available on Google’s network (which means you either have to be at a Google office or connect through a VPN).

I wasn’t at a Google office, and I had no VPN. From time to time one of my commits would cause the buildbots to fail and then I had to look up the failure; I think more or less half the time, the problem wasn’t a problem at all but rather one of the bots going crazy, or breaking on another commit that wasn’t flagged. The big bother was that many times the problem appeared in the private buildbots, which meant that I had no way to check the log for myself. Worse still, it would close the tree making it impossible to commit anything else than a fix for that breakage… which, even if there was one, I couldn’t do simply because I couldn’t see the log.

Now when this happened, my routine was relatively easy, but a time waster: I’d have to pop in the IRC channel (I was usually around already), and ask if somebody from the office was around; this was not always easy because I don’t remember anybody in the CEST timezone to have access to the private build logs at the time, at least not on IRC; most where from California, New York or Australia. Then if the person who was around didn’t know me at least by name, they’d explain to me how to reach the link to the build log… to which I had to reply that I have the link, but the hostname is not public, then I’d have to explain that no, I didn’t have access to the VPN….

I think in the nine months I spent working on the project, my time ended up being mostly wasted on trying to track people down, either asking them to fetch the logs for me, review my changes, or simply “why did you do this at the time? Is it still needed?”. Add to this the time spent waiting for the tree to come “green” again so I could push my changes (which often times meant waiting for somebody in California to wake up, making half my European day useless), and the fact that I had no way to test most of the hardware-dependent code on real hardware because they wouldn’t ship me anything in Europe, and you can probably see why both I didn’t want to blog about it while I was at it and why I haven’t continued longer than I did.

Now how does this relate to me ranting about CI today? Well, yesterday while I was working on porting as many Ruby packages as possible to the new testing recipes for RSpec and Cucumber, I found a failure in Bundler — at first I thought about just disabling the tests if not using userpriv, but then I reconsidered and wrote a patch so that the tests don’t fail, and I submitted it upstream — it’s the right thing to do, no?

Well, it turns out that Bundler uses Travis CI for testing all the pull requests and – ‘lo and behold! – it reported my pull request as failing! “What? I did check it twice, with four Ruby implementations, it took me an hour to do that!” So I look into the failure log and I see that the reported error is an exception that is telling Travis that VirtualBox is being turned off. Of course the CI system doesn’t come back at you to say “Sorry, I had a {male chicken}-up”. So I had to comment myself showing that the pull request is actually not at fault, hoping that now upstream will accept it.

Hopefully, after relating my experiences, you can tell why the tinderbox still applies a manual filing approach, and why I prefer spending time to review the logs instead of spending time to attach them.

An appeal to upstream KDE developers

While I have stopped using KDE almost two years ago, I don’t hate KDE per-se… I have trouble with some of the directions KDE has been running in – and I fear the same will happen to GNOME, but we’ll see that in a few months – but I still find they have very nice ideas in many fields.

Unfortunately, there is still one huge technical problem with KDE, and KDE-based third party software (well, actually, with any Qt-based software): they are still mostly written in C++.

Now, it’s no mystery that I moved from having a liking at that language to hating it outright, preferring things like C# to it (I still have to find the time to learn ObjectiveC to be honest). One of the problems I have with it is the totally unstable nature of it, that is well shown by the number of problems we have to fix each time a new GCC minor version is released. I blogged last week about GCC 4.5 and what actually blew my mind out is what Harald said about the Foo::Foo() syntax:

The code was valid C++98, and it wasn’t a constructor call. In C++98, in every class C, the name C exists as both a constructor and a type name, but the constructor could only be used in certain contexts such as a function declaration. In other contexts, C::C simply means C. C::C::C::C::C was also a perfectly valid way to refer to class C. Only later did there turn out to be a need to refer to the constructor in more contexts, and at that point C::C was changed from a typename to a constructor. GCC 4.4 and earlier follow the original C++ standard in this respect, GCC 4.5 follows the updated standard.

Harald van Dijk

Now, I thank Harald for clearing this up, but even so I think this aliasing in the original standard was one of the most obnoxious things I ever read… it was actually making much more sense to me that it was an oversight! Anyway, enough with the C++ bashing for its own sake.

What I would ask the upstream KDE developers is to please set up some way to test your code on the newest, unstable, unusable for day-to-day coding upcoming GCC branches. This way we would avoid having the latest, very recent, release of a KDE-related tool to fail the day the new GCC exits beta status and gets released properly. I’m not pretending that you fiddle to make sure that there are no ICEs (Internal Compiler Errors) caused by your code (although finding them will probably help GCC upstream a lot), but at least make sure that the syntax changes won’t break as soon as a new GCC is released, pretty please.

Of course, this kind of continuous integration, even extended to all of the KDE software hosted by KDE.org is not going to be enough as there are huge numbers of packages that are being developed out of that… but it would at least give the idea that there is need for better CI in general in FLOSS!

From scratch: about my tinderbox

I know that quite a bit of people are still unsure on what a tinderbox is, on what it does, and especially on why they should care enough that I’m bothering them every other day about it on my blog (and by extension on Planet Gentoo and the Gentoo.org homepage which syndicate it). So I’m now trying to write a long-breathed article that hopefully will cover all the important details.

What is “a” tinderbox

The term tinderbox, as reported by Wikipedia has been extended to the general use starting from the software, developed by Mozilla, designed to perform continuous integration over a given software project. This doesn’t really help that much because the term “continuous integration“ might still not say anything to users. So I’ll try to explain that from scratch as well.

Gentoo users, especially those running the unstable branch, are unfortunately used to breakage caused by build failures after a library update or other environment changes like that. This problem is increasingly present also during the development phase of any complex software. It gets even worse because while developers might not see any build failure on their machine, they can introduce “buggy“ code that breaks on others, as not all distributions have the same environment, the same libraries, the same filesystem layout and so on so forth.

And this is just by limiting the scope of our consideration to a single operating system – Linux in this case – and not considering the further problems caused by the presence of other software platforms, such as FreeBSD (which is still a Unix system), Mac OS X, and Windows. This is one of the biggest problems with multi-platform development that make it definitely non-trivial to write software that works correctly between these systems. I could write more about that topic but I’ll leave that to another time.

To make it possible for the developers not to have to manually get their code on multiple machines with various combination of operating systems and versions to ensure that the code does build properly after every change is merged, the smart approach was to invent the concept of continuous integration. To sum it up, think about what I just said – getting the code on multiple machines, with different operating systems, or versions of operating systems, build and eventually execute it – but done by a software rather than having an human doing it.

Mozilla’s software for continuous integration is called Tinderbox and for whatever reason, the name has stuck in Gentoo to refer to the various tries we have had at setting up some continuous integration testing. I’m not sure on the reasons why the name stuck in the Gentoo development team, but so it is, and so I’m keeping it.

Why is it “my” tinderbox

As I just said, Gentoo has had in the past more than one try at making continuous integration a major player in its development model. I think at least four people separately have worked on various “tinderboxes“ in the past five years, both as proper Gentoo projects and as outside, autonomous efforts. So I cannot call it “the Tinderbox”, but rather “my tinderbox”, to qualify which one among those that were and are.

Right now, we have three actively maintainer continuous integration efforts called tinderboxes: my own is the latest one joining the party, before that we already had Patrick’s and before that the Gentoo tinderbox which runs on proper Gentoo infrastructure boxes.

As I already rambled on about this “duplication“ of efforts is not a problem, as the three current systems have different procedures, and different targets. In particular, the “infra“ Tinderbox is mostly there to ensure the packages in the system set work for most major architectures, to provide reverse dependency information, and to make available binary packages useful for system recovery in case of major screw-up. On the other hand, me and Patrick are using our tinderboxes to identify bugs and report them to developers who might have not known about them otherwise.

What my Tinderbox *does*

I’ve written more or less how it works although that was really going in the detail of what I do to have it working. For what concern users, those details are mostly unimportant.

What my Tinderbox does, put it simply, is building the whole Portage tree, at least that’s the theoretical target. Unfortunately, as many noted before, it’s a target that no one box in the world can likely reach in our lifetime, let alone continuously. Continuous integration tools are designed to take away from the developers the time-consuming task of trying their code on different platforms (software or hardware, it doesn’t matter); even though I’m now limiting the scope of them to testing Gentoo, the possible field of variations is limitless.

Not only we support at least two operating systems (Linux and FreeBSD), a number of architectures (over a dozen), two branches (stable and unstable), but for the very essence of Gentoo, each installed system is never identical to another one: USE flags, CFLAGS, package sets, accepted packages from the unstable branches, overlays… Testing all these variables is impossible, so my tinderbox reduces (a lot) the scope, while still trying to be as broadly applicable as possible.

It runs over the unstable x86-keyworded branch of the tree (~x86 keyword), along with some packages that are not available by default and are rather unmasked (mostly toolchain-related). It uses vary bland CFLAGS, forced --as-needed and some common USE flag setup. It runs tests but it doesn’t let their failure stop the merge. It enables USE flag (for dependencies) as needed to have as many packages installed as possible.

The big part of the work is done by a single loop that tries to install, one by one, all the packages that haven’t been tested in the past two and a half months. Portage is left to remove conflicting packages as it finds fit. Sometimes packages are rejected for various reasons (they collide with another package, there is a conflict in the needed USE flag settings between different packages, or they require very old software versions that would in turn cause other problems), so instead of having the full Portage tree available for ~x86, I can only get a subset, the biggest subset I can at least.

The results

Of course, to understand why the Tinderbox is an important project, and why I’m investing so much time on it, and looking for other people to invest part of their time (or resources) to help me, is necessary to look at what the by-products of its execution are.

I’m not making my binpkgs available to the public (like the infra-based tinderbox has), mostly because I’m not allowed to: firstly I don’t enable the bindist USE flag, which means I’m allowing build configurations that are non-redistributable, because of licenses or trademarks, secondly because that’s not my target at all, and they may very well not be functional, in some parts.

The accomplishments coming from my efforts are, generally speaking, bug reports. Not only I scan through the failure logs for those ebuilds that fail to build in this environment, but I also scan for failing test procedures, and other similar quality issues. But again, this is not the result final users are looking for, as bug reports don’t mean anything for them.

Results that are good for users are, instead, the fixes for these bug reports: you can be safe to have documentation installed in the right place where you expect it; you can be safe that if you need to debug a problem, you only need to follow the guide without the build system of the package coming in your way; you can be safe that a package that will report itself as installed is indeed installed and it does work.

While, obviously, there are still problems that my Tinderbox effort won’t be finding anytime soon (for instance the problem of missing dependencies, which is instead the speciality of Patrick’s Tinderbox), and test procedures are rarely that well thought, designed and realised that they don’t leave possible runtime problems to be found, I think that keeping it running, continuously integrating as many ebuilds as it can, is definitely going to be of help for Gentoo’s general quality and stability.

And there is another thing that users are most likely interested in: testing of the future toolchain. As I said before, beside using the ~x86 visible ebuilds, my Tinderbox is also running some masked packages, which include, nowadays, gcc-4.4 and autoconf-2.65. Using these versions, not yet available to the general public of users, allows me to find failure cases related to them before they are made available. This serves the double role of assessing the impact that any given change might have on the users (by knowing how many packages will be broken if the package is unmasked), and correcting those problems before they have the chance of wasting users’ time.

Helping

Continuous integration can’t be said to be a “boring” task, in the sense that every day you might stumble into a new problem, for which you’ll have to devise a solution. On the other hand, it’s an heavy duty, for both the machines performing the job and the people that have to look through the logs. Even the sole task of filing the bugs is time-consuming, and that’s without adding the problems dealing with the human factor which often take you much more time.

Furthermore there are a number of practical issues tied into running this kind of continuous integration: the time needed to build packages conditions the amount of packages they can test over a given time frame, and unfortunately to reduce that time often you have to push for higher end machines (a lot of the work is I/O bound, and not all the software can build or execute tests in parallel), which tend to have high price tags, as well as consuming more power (and thus costing more over the long term) than your average machine.

For this reason both me and Patrick tend to ask for material help: we are taking the costs of these efforts directly.

It has been suggested a number of time to host the Tinderboxes over Gentoo infrastructure hardware, but there have been problems before about this. On the other hand, I think that the way I’m currently proceeding, for my effort at least, could open the possibility for that to happen. The one big issue for which I don’t really feel comfortable with running the tinderbox remotely is easy access to the logs for identifying problems and, especially, to attach them to bug reports. I’m currently thinking how to close that gap but it needs another kind of help in the form of available developers.

And one kind of help that costs very little on users, but can help a lot our work is commenting (and criticising if needed) our design choices, our code, our methods. Constructive criticism of course: telling us that our method sucks and not giving any suggestion on how to make them better is not going to help anybody, and can actually lower our morale quite a bit.

So please, the greatest favour you can do me is continuing to listen my ramblings about QA, tests, Tinderbox and log analysis; and if you are still unsure how something works in this system, feel free to ask.