Tinderbox: explaining some works

Many people have asked me to explain better how my tinderbox works, so that they can propose changes and help out. Well, let me try to describe how the whole thing operates; I could actually use some suggestions, as right now I’m a bit stuck.

Right now the tinderbox is a Linux Container that runs on Yamato; since Linux Containers are pretty lightweight, it has all the power of Yamato at its disposal (Yamato is actually my workstation, but it’s an 8-way Opteron system with 16GB of registered ECC RAM and a couple of terabytes of disk space).

Processing power, memory and disk space are shared between the two, so you can guess that while I’m working with power-hungry software (like a virtual machine running a world rebuild for my job), the tinderbox is usually stopped or throttled down. On the other hand this makes it less of a problem to run, since Yamato is almost always running anyway; if I had to keep another box running just for the tinderbox, the cost in electrical power would probably be too high for me to sustain for long. The distfiles are also shared in a single tree (together with the virtual machines, other containers and chroots), which makes the tinderbox very lightweight for Yamato to run in the background.

Since it’s an isolated container, I access the tinderbox through SSH; from there I launch screen and then I start the tinderbox work (yes, it’s all manual for now). The main script is tinderbox4a.py, which Zac wrote for me some time ago; it lists all the packages that haven’t been merged within the given time limit (six weeks), or that have been bumped since the last time they were merged. It also spews on the standard error any USE-based dependencies that wouldn’t be satisfied with the current configuration (since the dependencies are brought in automatically, I just need to make sure the package.use file is set properly).
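The selection criterion described above can be sketched in a few lines of Python. This is a toy stand-in, not the real tinderbox4a.py: the field names and the dict-based package representation are my assumptions; the actual script reads merge times from Portage’s own databases.

```python
import time

SIX_WEEKS = 6 * 7 * 24 * 3600  # the post's rebuild time limit, in seconds


def needs_rebuild(pkg, now=None):
    """Decide whether a package should be queued again.

    pkg is a hypothetical dict with 'build_time' (epoch of the last
    merge) and 'bumped' (True if a newer version appeared since).
    A package qualifies if it was bumped, or if the last merge is
    older than the six-week limit.
    """
    now = now if now is not None else time.time()
    if pkg["bumped"]:
        return True
    return (now - pkg["build_time"]) > SIX_WEEKS
```

The real script additionally warns on standard error about unsatisfiable USE-based dependencies, which this sketch leaves out.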

The output of this script is sorted by category and name; unfortunately I noticed that doing so isolates too many of the packages at the bottom of the list, so to make it more useful, I sort it at random before saving it to a file. That file is then passed as argument to two xargs calls: the tinderbox itself, and the fetcher. The tinderbox runs this command: xargs --arg-file list -n1 emerge -1D --keep-going, which means that emerge tries to install each listed package in turn, bringing in its dependencies; if a new dependency fails to build (but an older version is already present), the failure is ignored.
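The “sort at random” step is the only unusual part of that pipeline, and it is tiny; a minimal Python equivalent could look like this (the seed parameter is purely my addition, for reproducible testing):

```python
import random


def randomized_queue(packages, seed=None):
    """Shuffle the package list so that packages late in the
    alphabetical order are not starved at the bottom of the queue.
    Mirrors the post's 'sort at random before saving' step."""
    queue = list(packages)
    random.Random(seed).shuffle(queue)
    return queue
```

In practice the same effect can be had from the shell with sort -R or shuf before writing the list file.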

Now that you’ve seen how the tinderbox is executed, you can see why I have a separate fetcher: if I were to download all the sources inline with the tinderbox (which is what I did a long time ago), I would end up waiting for the download to complete before the build could start, adding network-bound latency to a job that is already long enough. So the fetcher runs this: xargs --arg-file list emerge -fO --keep-going, which runs a single emerge to fetch all the packages. I didn’t use multiple calls here because the locks on the vdb would make the whole thing definitely slower than it is now; and thanks to --keep-going it at least doesn’t stop when one package is unfetchable.

Silly note here: I noticed tonight while looking at the output that sometimes it took more time to resolve the same host name than to fetch a smaller source file (Portage does not yet reuse connections, as that’s definitely non-trivial to implement; if somebody knows of some kind of download manager that keeps itself in the background to reuse connections without using proxies, I’d be interested!). The problem was that I forgot to start nscd inside the tinderbox… I took a huge hit from that; now it’s definitely faster.

This of course only covers the running interface; there are a couple of extra steps involved, though. There is another script that Zac wrote for me, unavailable_installed.py, which lists the packages that are installed in the tinderbox but are no longer available, for instance because they were masked or removed. This is important so I can keep the system clean of stuff that has been dropped because it was broken, and so on. I run it each time I sync, before starting the actual tinderbox list script.
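Conceptually that check is just a set difference between what is installed and what the tree still offers. The sketch below is a toy stand-in for Zac’s unavailable_installed.py, which actually queries Portage’s databases rather than taking plain lists:

```python
def unavailable_installed(installed, available):
    """Return the installed package atoms that the tree no longer
    offers (e.g. masked or removed), sorted for stable output.
    Both arguments are plain iterables of atom strings here; the
    real script pulls these sets from Portage itself."""
    return sorted(set(installed) - set(available))
```

Running this after every sync, as described above, keeps dropped packages from lingering on the system.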

In the past the whole tinderbox setup was much hairier; Zac provided me with patches that let me do some stuff the official Portage wouldn’t do, which made my job easier, but by now all of them are in the official Portage, and I just need to disable the unmerge-logs feature and enable the split-log one: the per-category split logs are optimal for submitting to Bugzilla, as Firefox does not end up choking while trying to load the list of files.

When it’s time to check the logs for failures (of either build/install or tests, since I’m running with FEATURES="test test-fail-continue"), I simply open my lovely emacs and run this grep command: egrep -nH -e "^ \*.* ERROR:.*failed" -r /media/chroots/logs/tinderbox/build/*/*:*.log | sort -k 2 -t : -r, which gives me the list of logs to look into. I then file bugs for them manually with Firefox and my bug templates, since I don’t know enough Python to make something useful out of pybugz.

Since I’m actually testing for a few extra things that are currently not checked by Portage, like documentation being installed in the proper path, or mis-installed man pages, I’ve got a few more grep rounds to run over the completed logs to identify and report them, also manually. I should clean up the list of checks, but for now you’ve got my bashrc if you want to take a peek.

The whole thing is long, boring, and heavy on maintenance; I still have to polish some rough edges, like a way to handle updates to the base system before starting the full run, or a way to remove the packages broken by ABI changes when they are not vital for the tinderbox operation (I need some extra stuff which is thus “blessed” and supposedly never going to be removed, like screen, or ruby so I can use ruby-elf).

There are currently two things I’d like to find a way to tweak in the scripts. The first is a way to identify collision problems: right now those failures get listed only in the elog output, and I have no way to get the correct data out without manually fiddling with the log a lot, which is suboptimal. The second problem is somewhat tied to the first: I need a scoring system that pushes all the packages that failed to merge, build failures and collisions alike, down the list of future merges. This would let me spend more time building untested packages rather than rebuilding those that failed before.

If you want to play with the scripts and send me improvements, that’s definitely going to be helpful; a better reporting system, or a bashrcng plugin for the QA tests (hint, this was for you Mauro!) would be splendid.

If you’d still like to contribute to the tinderbox effort without getting into the stuff behind it, there are a couple of things you can get me that would definitely be helpful; in particular, a pretty useful thing would be more RAM for Yamato. It has to be exactly the same as the RAM already in it, but luckily I got it from Crucial, so you can order it with the right part number: CT2KIT51272AB667. Yes, the price is definitely high; I paid an even higher price for it, though. If you’d like to contribute this, you should probably check the comments first, in the unlikely case I get four pairs of those. I should be able to run with 24GB, but the ideal would be to upgrade from the current 16 to 32GB; that way I would probably be able to build using tmpfs (and find any problems tied to that as well). Otherwise check the donations page or this post if you’re looking for more “affordable” contributions.

3 thoughts on “Tinderbox: explaining some works”

  1. “The first is a way to identify collision problems: right now those failures gets only listed in the elog output and I have no way to get the correct data out without manually fiddling a lot with the log, which is suboptimal.”

     If you don’t mind patching Portage, that’s pretty trivial to do; the logging is done after this line in dbapi/vartree.py: for pkg, owned_files in owners.items(): (somewhere around line 3700). I could probably cook up a patch if you tell me what format you’re after (though I can’t test it myself).


  2. @Jeremy: it’s not so much complex as time-consuming; and I forgot to add that there are a few prerequisites. For instance, if Python is updated I need to run python-updater, and Alexis told me today that after an ocaml update I have yet another script to run; Haskell and ADA have their scripts as well…

     @Genone: I guess it’d be enough if the collisions were logged in either the same log as the build, or in a separate log from the elog. After all I’m going to have to check stuff like that by hand, but at least I won’t have to sort through tons and tons of extra lines to do so.

