For A Parallel World: Again on directories

Parallel building, and parallel installation, are not hard in the sense that they usually don’t present new, undocumented challenges; the same problems seem to repeat over and over again. Sometimes it’s the exact same problem — it seems Ruby upstream applied the stupid Funtoo patch instead of mine, and that made it fail again on a parallel install. Luckily I was able to fix it again, hopefully for good, and the fix is now on the ruby-core mailing list.

Another issue came up today, when I noticed a bug for OpenSC which turned out to be a parallel install failure. While Michelangelo’s quick fix is actually a smart way to deal with it quickly, I preferred to apply the correct fix, which I also sent to the opensc-devel mailing list.

So this is just a quick post to remind you all: if you see failures such as “file already exists”, remember you’re looking at a parallel install failure, and you can check my previous posts to understand how to fix it properly.

For A Parallel World: Parallel building is not passé

It’s been a while since I last wrote about parallel building. This has only to do with the fact that the tinderbox hasn’t been running for a long time (I’m almost set up with the new one!), and not with the many people who complained to me that spending time getting parallel build systems to work is a waste of time.

This argument has been helped by the presence of a --jobs option in Portage, with its proponents insisting that the future is Portage building packages in parallel, so that the whole process takes less time, rather than shortening each single build. I said before that I didn’t feel it was going to help much, and now I definitely have some first-hand experience to tell you that it doesn’t help at all.

The new tinderbox is a 32-way system: it has two 16-core CPUs, and enough RAM for each of them; you can easily build with 64 processes at once, and I’m actually trying to push it further by using the unbound -j option (this is not proper, I know, but still). While this works nicely, we still have too many packages that force serial building due to broken build systems, and a few that break in these conditions but would very rarely break on systems with just four or eight cores, such as lynx.

I then tried, during the first two rebuilds of world (one to set my choices in USE flags and packages, the other to rebuild it hardened), running with five jobs in parallel… between the issue of the huge system set (yes, that article is over four years old) and the fact that it’s much more likely to have many packages depending on one than one depending on many, this still does not saturate the CPUs if the individual packages build serially.

Honestly, seeing such a monstrous system take as long as my laptop, which has a fraction of the cores and the RAM, to build the basic system was a bit… appalling.

The huge trouble seems to be with packages that don’t use make, but that could, under certain circumstances, be able to perform parallel building. The main problem there is that we still don’t have a variable that tells us exactly how many build jobs to start; we instead rely on the MAKEOPTS variable. Some ebuilds actually try to parse it to extract the number of jobs, but that fails with configurations such as mine. I guess I should propose that addition for the next EAPI version… then we might actually be able to make use of it in the Ruby eclasses to run tests in parallel, which would make testing so much faster.
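To see why parsing MAKEOPTS is fragile, here is a minimal sketch of such a parser (a hypothetical helper, not the actual eclass code); it handles the common -jN forms but, as noted, guesses wrong with an unbound -j like the one I’m using:

```shell
#!/bin/sh
# Hypothetical helper (not the actual eclass code): extract the job
# count from a MAKEOPTS-style string. It handles -jN and -j N, but
# cannot tell an unbound -j apart from "no -j at all", which is
# exactly why parsing MAKEOPTS is fragile.
makeopts_jobs() {
  jobs=$(echo " $* " | sed -n -e 's/.* -j[[:space:]]*\([0-9][0-9]*\).*/\1/p')
  echo "${jobs:-1}"
}

makeopts_jobs "-j12"      # prints 12
makeopts_jobs "-l3 -j 4"  # prints 4
makeopts_jobs "-j"        # prints 1: the unbound case is lost
```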

Speaking of parallel testing, the next automake major release (1.13 — 1.12 has been released but it’s not in tree yet, as far as I can tell) will execute tests in parallel by default; this was optional starting with 1.11 and is now going to be the default (you can still opt out, of course). That’s going to be very nice, but we’ll also have to change our src_test defaults, which still use emake -j1, forcing serialisation.
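For packages whose test suites genuinely cannot run in parallel, the opt-out in Makefile.am would look something like this (assuming the serial-tests option name used by the newer automake releases):

```make
# Makefile.am: keep the old serial test harness even when the
# parallel harness becomes the default
AUTOMAKE_OPTIONS = serial-tests

TESTS = check_foo check_bar
```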

Speaking of which, even if your package does not support parallel testing, you should use parallel make, at least with automake, to call make check; the reason is that the check target also builds the tests’ utilities and units, and that build can be sped up a lot by running it in parallel, especially for test frameworks that rely on a number of small units instead of one big executable.

Thankfully, for today there are two more packages fixed to build in parallel: Lynx (which goes down from 110 to 46 seconds to build!) and Avahi (which I fixed so that it installs fine in parallel).

For A Parallel World: ebuild writing tip: faster optional builds

Today, lurking on #gentoo-hardened, I came across a particularly badly written ebuild that took to the extreme one very bad construct for what concerns parallel builds (which are a very good thing with modern multi-core, multi-thread CPUs):

src_compile() {
  if use foo; then
     emake foo || die
  fi

  if use bar; then
    emake bar || die
  fi

  if use doc; then
    emake doc || die
  fi
}

This situation wastes a lot of processing power: the three targets with all their dependencies will be considered serially, not in parallel; if you requested 12 jobs, but foo and bar each only have three object files as dependencies, they should have been built at the same time, not in two different invocations.

I admit I have made this mistake before, even quite recently, mostly related to documentation building. So how does one solve this problem? Well, there are many options, my favourite being something along these lines:

src_compile() {
  emake \
    $(use foo && echo foo) \
    $(use bar && echo bar) \
    $(use doc && echo doc) \
    || die "emake failed"
}

Of course this has one problem: there is no general target, so it should rather be something more like this:

src_compile() {
  local maketargets=""

  if use bar ; then
    maketargets="${maketargets} bar"
  else
    maketargets="${maketargets} foo"
  fi

  emake ${maketargets} \
    $(use doc && echo doc) \
    || die "emake failed"
}

This will make sure that all the targets will be considered at once, and will leave make to take care of dependency resolution.

I tried this approach out in the latest revision of the Drizzle ebuild that I proxy-maintain for Pavel; the result is quite impressive because doxygen, instead of taking its dear time after the build completed, runs for about half of the build process (using up only one slot of the twelve jobs I allocate for builds on Yamato).

Obviously, this won’t make any difference if the package is broken with respect to parallel build (and thus uses emake -j1), and it won’t make a difference when you’re not building in parallel; but why not do it right, while we’re at it?

Tell-tale signs that your Makefile is broken

Last week I sent out a broad last-rites email for a number of gkrellm plugins, after my tinderbox reported warnings showing that they have been broken for a long time. This is particularly critical because the current maintainer of all the gkrellm packages, Jim (lack), seems not to be very active on them.

The plugins I scheduled for removal mostly show warnings related to the gdk_string_width() function being called with a completely different object than it should be called with, which results in unpredictable behaviour at runtime (most likely, a crash). A few more were actual buffer overflows, or packages failing because their dependencies changed. If you care about a plugin that is scheduled for removal, you’re invited to look into it yourself and start proxy-maintaining it.

I originally thought I had caught all of the broken packages; but since then another one appeared with the same gdk_string_width() error, so I decided to run the tinderbox specifically against the gkrellm plugins; that turned up one more I had missed, and then I had actually found all of them. A few more were reported as ignoring LDFLAGS, but nothing especially bad turned up on my tinderbox.

What it did show, though, is that the ignored LDFLAGS are just a symptom of a deeper problem: most of the plugins have broken, very poorly written Makefiles. This shows in a number of small things, but the obvious one is the usual “jobserver unavailable” message that I wrote about last year.

So here’s a good checklist of things that shows that your Makefile is broken:

  • you call the make command directly — while this works perfectly fine on GNU systems, where you almost always use the GNU make implementation, this is not the case on most BSD systems, and the Makefile is almost always good enough only to work with the GNU implementation; the solution is to call $(MAKE), which is replaced with the name of the make implementation you’re actually using;
  • it takes you more than one command to run make in a subdirectory (this can also be true for ebuilds, mind you) — things like cd foo && make or even worse (cd foo; make; cd ..;) are silly to look at and, besides, will cause the usual jobserver unavailable warning; what you might not know is that make is (unfortunately) designed to allow recursive builds, and provides an option to do so without changing the working directory beforehand: make -C foo (which, taking the previous point into consideration, should actually be $(MAKE) -C foo) does just that, and only changes the working directory for the make process and its children, rather than for the current process as well;
  • it doesn’t use the builtin rules — why keep writing the same rules to build object files? make already knows how to compile .c files into relocatable objects; instead of writing your rules to inject parameters, just use the CFLAGS variable like make is designed to do! Bonus points if, for final executables, you also use the built-in linking rule (for shared objects I don’t think there is one);
  • it doesn’t use the “standard” variable names — for years I have seen projects written in C++ insisting on using the CPP and CPPFLAGS variables; well, that’s wrong, as here “cpp” refers to the C Pre-Processor; the correct variables are CXX and CXXFLAGS; inventing your own variable names to express parameters that can be passed by the user tends to be a very bad choice, as you break the expectations of the developers and packagers using your software.

Now, taking this into consideration, can you please clean up your packages? Pretty please with sugar on top?

Bigger, better tinderbox

Well, not in the hardware sense, not yet at least, even though it’d be wicked to have an even faster box here (with some due control of course, but I’ll get back to that later). I’ll probably get some more memory and AMD Istanbul CPUs when I’ll have some cash surplus — which might not be soon.

Thanks to Zac, and his irreplaceable help, Portage gained a few new features that made my life as “the tinderbox runner” much easier: collision detection is now saved in the build/merge log, so I can grep for collisions as well as for failures; the die hook is now smarter, working even for failures coming from the Python side of Portage (like collisions), and it’s accompanied by a success hook. The two hooks are what I’m using for posting the whole progress of the tinderbox to identi.ca (so you can follow that account if you feel like being “spammed” by the proceedings of the tinderbox — the tags make it easy to get a quick look at how it is doing on average).

But it’s not just that; if you remember me looking for a run control for the tinderbox, I’ve implemented one of the features I talked about in that post even without any fancy, complex application: when a merge fails, the die hook masks the failed package (the exact revision), and this has some very useful domino effects. The first is that the same exact package version can only ever fail once in the same tinderbox run (I can’t count the times my tinderbox wasted time rebuilding, and failing, stuff like mplayer, asterisk or boost, as they are dependencies of other packages), and that’s what I was planning for; what I got instead is even more interesting.

While the tinderbox already runs in “keep going” mode (which means that a failed optional build will not cause the whole request to be dropped; this applies mostly to package updates), masking specific failing revisions of some packages also happens to force downgrades, or stop updates, of the involved packages, which means more code gets tested (and sometimes it gets lucky, as older versions build where newer ones don’t). Of course the masking does not happen when the failure is in the tests, as those are quite messed up and warrant a post of their own.

Unfortunately, I’m now wondering how taxing the whole tinderbox process is getting: there are just shy of 14 thousand packages in the tree. Of these, some will merge in about three minutes (back-to-back from the call to emerge to the end of the process; I found nothing going faster than that), and some rare ones, like Berkeley DB 4.8, will take over a day to complete their tests (db-4.8 took 25 hours, no kidding). Assuming an average of half an hour per package, this brings us to 7 thousand hours, or 300 days: almost a year. Given that the tinderbox is currently set to re-merge the same package on a 10-week schedule, this definitely gets problematic. I sincerely hope the average is more like 10 minutes, even though that would still mean a never-ending rebuild. I’ll probably have to find the real average by looking through the total emerge log, and at the same time I’ll probably have to reduce the rebuild frequency.

Again, the main problem is with parallel make: I’d expect the load of the system to be pretty high, while in fact it stays below 3 almost all the time. Quite a few build systems, including Haskell extensions’ and Python’s setuptools, do not seem to support parallel building (in the case of setuptools, it seems to call make directly, thus ignoring Gentoo’s emake wrapper), and quite a few packages force serial make (-j1) anyway.

And a note here: you cannot be sure that calling emake will give you parallel make; besides the already-discussed “jobserver unavailable” problem, there is the .NOTPARALLEL directive that instructs GNU make not to build in parallel even though the user asked for -j14. I guess this is going to be one further thing to look for when I start on the idea of distributed static analysis.
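For reference, the directive looks like this; when given with no prerequisites, GNU make serialises every target in the makefile, no matter what -j value is passed:

```make
# with no prerequisites, this serialises the entire makefile
.NOTPARALLEL:

all: liba libb libc    # built one at a time, even under make -j14
```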

Make mistakes

This post is coming to you on my birthday! Yes, I’m still writing even today; I’m still filing bugs as well. I do have a personal request if somebody wishes to send me something: two £20 PSN cards (from the UK). I bought Fallout 3 for PS3 “preowned” while I was in London earlier this month; two days ago the expansion packs were released in Europe, but the UK versions and those for the rest of the EU, including Italy, are not compatible. And of course, to use the UK store I need a UK credit or debit card, or more easily UK PSN gift cards. So if you want to wish me a happy birthday, feel free to get a PSN card and mail me the redeem code, no shipment required. (Encrypt the mail if you do that, though.)

Today’s post is not related to the tinderbox, since I know it might start sounding a bit boring to read exclusively about that. On the other hand it’s a topic that I did find interesting from the tinderbox itself, since that has shown me how this is a problem indeed. It’s about mistakes, mistakes with make.

Quite a few packages seem to think it’s a good idea to simply call the -config scripts to find the flags and libraries to use in the compile line for each source file, something like this:

CFLAGS = `sdl-config --cflags`
LDLIBS = `sdl-config --libs`

# or

.c.o:
        $(CC) $(CFLAGS) `sdl-config --cflags` -c $< -o $@

Can you see the problem already? The way make works, each command line is invoked through system(); sh is thus executed and takes care of interpreting the command line, and the backquoted calls are then sub-executed and substituted. This, though, causes three commands to be executed per source file: sh, sdl-config and finally gcc to compile the file. While sh and gcc are part of the design, sdl-config is one more call, one more sub-process, executed for each compilation, and not a really useful one.

Similarly, this can become quite bothersome when instead of a single sdl-config call you have three or four pkg-config calls. While the executables, their libraries and the data files they need will be kept hot in cache, this means they’ll be kept in cache at the expense of something else. And you can guess they are not that important.

One of the many solutions to this mistake is to use a GNU make extension (gmake is much better than other make implementations for what concerns extensions): the $(shell ) function.

SDL_CFLAGS := $(shell sdl-config --cflags)
SDL_LIBS := $(shell sdl-config --libs)

CFLAGS = $(SDL_CFLAGS)

The first two lines in this example set the variables by calling, once and at global scope, the sdl-config script (:= assigns the expanded value immediately; the simpler = in make defines a macro that is re-expanded at each use, so it would not avoid the repeated calls). This way you don’t have to call it per translation unit. The side effect of this, besides adding a dependency on GNU make, is that the script will be called even when just cleaning the project; there are more sophisticated ways to deal with that, for instance:

all-recursive:
        $(MAKE) SDL_CFLAGS="`sdl-config --cflags`" SDL_LIBS="`sdl-config --libs`" all

CFLAGS = $(SDL_CFLAGS)

(Of course this is just a quick and dirty example, there are other things to take into consideration there.)

A similar problem happens with packages that allow users to “configure” the compiler and flags by creating files containing them: you end up with `cat somefile` multiple times in a compile line. That also means the files have to be kept in cache; they might be small, but that is still 4KiB of cache wasted on something that is not really useful: you can either set variables, or use a configuration file that is then included by the main makefile.
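The included-file variant is straightforward; assuming a hypothetical config.mk written by the user (or by a configure step), the main makefile reads it once instead of cat-ing it in every compile line:

```make
# config.mk might contain, for instance:
#   CC = gcc
#   CFLAGS = -O2 -pipe

-include config.mk    # read once; the leading - ignores a missing file

myprog: myprog.o
	$(CC) $(LDFLAGS) -o $@ myprog.o
```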

Yes, this might be small stuff, but when you have something like my tinderbox going, you probably want to spare even those process calls during a long build.

For A Parallel World. Theory lesson n.3: directory dependencies

Since this is not extremely common knowledge, I wanted to write down some more notes regarding the problem that Daniel Robbins reported in Ruby 1.9, which involves parallel make install failures.

This is actually a variant of a generic parallel install failure: lots of packages in the past assumed that make install was executed on a live filesystem, and didn’t create the directories they copy files into. This of course fails for all staging-tree (DESTDIR-based) installs, which are used by all distributions to build packages, and by Gentoo to merge from ebuilds. With time, and with distributions taking a major role, most projects updated this so that they do create their directories before installing (although quite a few still fail at this; just look for the dodir calls in ebuilds).

The problem we have here is slightly different: a single install target may depend at the same time on the rules that create the directories and on those that install the files, without specifying any interdependency between the two:

install: install-dirs install-bins

install-dirs:
        mkdir -p /usr/bin

install-bins: mybin
        install mybin /usr/bin/mybin

(Read it as if it used DESTDIR properly.) When using serial make, the order in which the rules appear in the dependency list is respected, and thus the directories are created before the binaries are installed; no problem there. When using parallel make, instead, the two rules are executed in parallel, and the install command may be executed before mkdir. Which makes the build fail.

The “quick” solution that many come to is to depend on the directory:

install: /usr/bin/mybin

/usr/bin:
        mkdir -p /usr/bin

/usr/bin/mybin: mybin /usr/bin
        install mybin /usr/bin/mybin

This is the same solution that Daniel came to; unfortunately it does not work properly. The problem is that this dependency does not just ensure that the directory exists: it also adds a condition on the directory’s modification time (mtime). And since a directory’s mtime is updated whenever the mtime of its contents changes, this can become a problem:

flame@yamato foo % mkdir foo   
flame@yamato foo % stat -c '%Y' foo
1249082013
flame@yamato foo % touch foo/bar
flame@yamato foo % stat -c '%Y' foo
1249082018

This does seem to work in most cases, and indeed a similar patch was already added to Ruby 1.9 in Portage (I’m going to remove it as soon as I have time). Unfortunately, if there are multiple files that get installed this way, it’s possible to induce a loop inside make (installing the later binaries will update the mtime of the directory, which will then have a higher mtime than the first binary installed).

There are two ways to solve this problem; neither looks extremely clean, and neither is perfectly optimal, but they do work. The first is to always call mkdir before installing the file; this might sound overkill, but with mkdir -p it really has a small overhead compared to calling it just once.

install: /usr/bin/mybin

/usr/bin/mybin: mybin /usr/bin
        mkdir -p $(dir $@)
        install mybin /usr/bin/mybin

The second is to depend on a special time-stamped rule that creates the directories:

install: /usr/bin/mybin

usr-bin-ts:
        mkdir -p /usr/bin
        touch $@

/usr/bin/mybin: mybin usr-bin-ts
        install mybin /usr/bin/mybin

Now, for Ruby I’d sincerely go with the former option rather than the latter, because the latter adds a lot more complexity for quite little advantage (it adds a serialisation point, while the mkdir -p calls execute in parallel). Does this help you?
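For completeness, GNU make has a third way out: order-only prerequisites (listed after a |) require the prerequisite to exist before the recipe runs, but ignore its mtime, which sidesteps the loop entirely, at the cost of requiring GNU make. Again, read this sketch as if it used DESTDIR properly:

```make
install: /usr/bin/mybin

/usr/bin:
	mkdir -p /usr/bin

# /usr/bin after the | is order-only: it must exist before the
# recipe runs, but its mtime never makes this target out of date
/usr/bin/mybin: mybin | /usr/bin
	install mybin /usr/bin/mybin
```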

For A Parallel World. Improvements: make call checks

This is a particularly Gentoo-oriented instance of the For A Parallel World series; please don’t look away, even though this time I’m not trying to explain how to improve software in general, at least not directly.

My tinderbox, which I’ve been writing about a lot almost daily, runs on an 8-way system, Yamato, a dual quad-core Opteron (partially funded by users last year); on such a system I can usually cut down package build times thanks to parallel make, but this is not always possible.

There are packages that serialise their build either because they are bugged and would break when building in parallel, or because they simply disable parallel make without understanding why. This is the reason I started this series in the first place. Now, for ebuilds that do use serial make, calling emake -j1, I’ve already asked that bugs be kept open, so that either I or somebody else can take care of reproducing and fixing the various issues, instead of having to rediscover them. While this works just partially, it’s still an improvement over the previous state of “if it works in serial, close the bug”.

But there are a couple of extra problems: before I joined, most ebuilds that wanted to avoid parallel make used make rather than emake -j1; while the latter is usually caught by repoman, which warns about an upstream workaround, the former is not. It also makes it difficult to understand whether the non-parallel make was requested on purpose or just overlooked, since there are rarely comments about it.

Thanks to Kevin Pyle, I’ve been tracking down these rogue ebuilds; he showed me a very simple snippet of code that can track down the callers of direct make in ebuilds:

make() {
        eqawarn "/etc/portage/bashrc QA notice: 'make' called by ${FUNCNAME[1]}"
        emake "$@"
}

Thanks to this snippet I’ve already been able to identify a few packages that called make but build fine in parallel, and a couple that require serial make or fail to build or install (bugs opened and ebuilds worked around). Hopefully, in the long term this check won’t trigger any longer and ebuilds will work properly in parallel. That would really be a good thing: processors these days are gaining cores faster than they gain clock speed, and being able to build stuff in parallel, as well as execute it in parallel, is the key to reducing the time it takes to install Gentoo.

Braindumping

I’m writing this entry while I wait for my pasta to get ready, in the lunch break I’m taking between job tasks. For a series of reasons I’m going slowly at them, because life in the past few days has been difficult. Adding to my problems with sleep, my neighbours have resumed waking me up “early” in the morning (early by my standards, that is). Yes, I know that for most people 11am is not “early”, but given that I have always tended to work during the night (easier not to be disturbed by family, friends, Windows users, …), and that they know that (not to mention that my father works shifts, so he also works nights quite often), I wouldn’t expect such an amount of noise.

By “such an amount of noise” I mean that I get woken up while sleeping with my iPod playing podcasts, the headphones in, and my room’s door closed. And no, I don’t sleep lightly; once I get to sleep, I stay asleep unless I know I have to wake up (to receive a parcel, to work remotely, or whatever). This weekend I ended up sleeping just shy of six hours per night, which is not good for my health either. I have now ordered a pair of noise-isolating in-ear phones; they should arrive tomorrow via UPS, which at least tends to be quite on time. On the other hand, the Italian postal service, which should be delivering three Amazon packages to me, is taking the usual eternity.

As far as most of my readers are concerned, I still have my tinderbox running, after a few tweaks. First of all, Zac provided me with a new revision of the script that generates the list of packages that need to be upgraded (or installed) on the system; it takes proper care of USE dependencies, so that when they are expressed I can see them before the system tries to merge them in. Then there’s the order of merges: since more than a couple of times I had to suspend the tinderbox run in the middle, the lower end of the list of packages tended not to be merged as often as the upper end (where quite a few packages I know still fail to merge). This time I ran the tinderbox from the bottom up (more useful, since the sys-* packages sort lower), but next time I’m considering just sorting the list randomly before feeding it to the merge command, so that all packages have a better chance of being built within a couple of iterations.

Speaking of the tinderbox and packages, I noticed that there are lots of packages that waste time for no good reason, doing useless processing. This includes, for instance, compressing man pages before Portage does so. While one can understand that upstream would like to provide the complete features to users, it’s also a task that distributions usually pick up, and it would make sense to provide an “I’m a packager” option to skip such steps.

Now, you could argue that those are very small tasks, and even for packages installing ten or twenty man pages it doesn’t feel like much time wasted. But please consider two things. First, the compression is often enough done with a for loop in sh rather than with multiple make rules (which would be run in parallel), serialising the build and taking more time on multi-core systems. Second, the man pages will have to be decompressed and compressed again by Portage, so it’s about three times the work strictly needed.
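The difference is easy to see in a sketch (hypothetical page names): with make pattern rules, -jN compresses the pages concurrently, while the sh loop processes them one at a time:

```make
MANPAGES = foo.1 bar.1 baz.8

# one rule per page: these run concurrently under make -jN
%.gz: %
	gzip -9 -c $< > $@

compress-man: $(MANPAGES:%=%.gz)

# versus the serialising shell-loop version:
#   compress-man:
#   	for page in $(MANPAGES); do gzip -9 $$page; done
```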

Another problem is with packages that, not knowing where to find an interpreter, be it Perl, Ruby, Python or whatever else, check for it in configure or by other methods and then replace it in all their scripts. And once again, quite often with for in sh rather than make rules. Instead of doing that, they should just use /usr/bin/env $interpreter to let the system find the right one, and not hardcode it in the files at all (unless you need a specific version, but that’s another matter altogether).

Well, I’ve eaten my pasta now (bigoli with pesto, for those interested), so I’ll get a coffee and be back to work. I’ll try to write a more useful post later on today after I’m done with work. I have lots of things to write about, including something about XBMC (from a user’s standpoint; I don’t have time to approach it with a packager’s eye).

For A Parallel World. Theory lesson n.3: on recursive make

There is one particular topic that was in my TODO list of things to write about in the “For A Parallel World” series, and that topic is recursive make, the most common form of build system in use in this world.

The problems with recursive make have been exposed since at least 1997 by the almost famous Recursive Make Considered Harmful paper. I suggest reading it to everybody interested in the problems of parallelising build systems.

As it turns out, automake supports non-recursive layouts quite well, and indeed I use one on at least one project of mine. Lennart also uses it in PulseAudio, where all the modules are built from the same Makefile.am file even though their sources (even the generated ones) are split among different directories.

Unfortunately, the paper has to be taken with a grain of salt. The reason I say that is that there are at least a couple of places where the author seems to misunderstand make rules and defines a rule with two output files, and one where he ignores temporary file naming problems.

There are, of course, other solutions to this problem; for instance, I remember FreeBSD’s make being able to recurse into directories in parallel just fine, but I sincerely think we have to stop one particular problem first. I don’t have too many problems with recursive make for different binaries or final libraries, or for libraries that are shared among targets. Yes, it tends to put more serialisation into the process, but it’s not tremendously bad, especially not in the non-parallel case, where the problem does not seem to appear for most users at all.

What I think is a problem, and I seriously detest it, is when the only reason to use recursive make is to keep the layout of the built object files the same as the source files, creating sub-directories for logical parts of the same target. With automake this tends to require the creation of convenience noinst libraries that are built and linked against, but never installed. While this works, it tends to increase the complexity of the whole build tremendously, as well as the time required to run it, since sometimes those libraries get compiled not only into archive files (.a static libraries) but also into final ELF shared objects, depending on their use. Since we know that linking is slow, we should try to avoid doing it for no good reason, don’t you think?

In general, the presence of noinst_LTLIBRARIES means that either you’re grouping sources that will be used just once in a recursive make system, or you’re creating internal convenience libraries, which might be even more evil, since they can end up duplicated into huge binaries, as in the case of Inkscape.

Once again, if you want your build system reviewed, feel free to drop me an email, depending on how much time I’ve got I’ll do my best to point out eventual fixes, or actually fix it.