Autotools Mythbuster: On parallel testing

A “For A Parallel World” crossover!

Since now the tinderbox is actually running pretty good and the logs are getting through just fine, I’ve decided to spend some more time expanding the Autotools Mythbuster guide with more content, in particular in areas such as porting for automake 1.12 (and 1.13).

One issue though which I’ll have to discuss in that guide soon, and for which I’m posting already right now, is parallel testing, because it’s something that is not really well known, and is something that, at least for Gentoo, involves the EAPI=5 discussion.

Build systems using automake have a default target for testing purposes called check. This target is designed to build and execute testcases, in a pretty much transparent way. Usually this involves two main variables: check_PROGRAMS and TESTS. The former defines the binaries to build for the testcases, the latter which testcases to run.

This is counter-intuitive and might actually sound silly, but in some cases you want to build test programs as binaries, but call scripts instead to compare them. This is often the case when you test a library, as you want to actually compare the output of a test program with the known-good output.

Now, up to automake 1.12, if you run make -j16 check, what is parallelized is only the building of the binaries and targets; you can for instance make use of this with check_DATA to preprocess some source files (I do that for unpaper which only ships in the repository the original PNG files of the test data), but if your tests take time, and you have little stuff that needs to be built, then running make -j16 check is not going to be a big win. This added with the chance that the tests might just not work in parallel is why the default up to now in Gentoo is to run the tests in series.

But that’s why recent automake introduced the parallel-tests option, which is actually going to be the default starting from 1.13. In this configuration, the tests are executed by a driver script, which launches multiple copies of them at once, and then proceeds with receiving the results. Note that this is just an alternative default test harness, and Automake actually supports custom harnesses as well, which may or may not be run in parallel.

Anyway, this is something that I’ll have to write about in more details in my guide — please be patient. In the mean time you can see unpaper as an example, as I just updated the git tree to make the best use of the parallel tests harness (it actually saved me some code).

For A Parallel World: Again on directories

Parallel build, and parallel install, is not hard, in the sense that it usually doesn’t give you new, undocumented challenges; it actually seems to be repeating the same problems over and over again. Sometimes the exact same problem — as it seems Ruby upstream applied the stupid Funtoo patch instead of mine, and that made it fail again on a parallel install. Luckily I was able to fix it again for good, and now is sent to the ruby-core mailing list.

Another issue came up today, when I noticed a bug for OpenSC which turned out to be a parallel install failure. While Michelangelo’s quick fix is actually a smart way to deal with it quickly, I’ve preferred applying the correct fix, which I also sent to the opensc-devel mailing list.

So this is just a quick post to remember you all: if you see failures such as “file already exists”, remember you’re in front of a parallel install failure, and you can see my previous blog to understand how to fix it properly.

For A Parallel World: Parallel building is not passé

It’s been a while since I last wrote about parallel building. This has only to do with the fact that the tinderbox hasn’t been running for a long time (I’m almost set up with the new one!), and not with the many people who complained to me that spending time in getting parallel build systems to work is a waste of time.

This argument has been helped by the presence of a --jobs option to Portage, with them insisting that the future will have Portage building packages in parallel, so that the whole process will take less time, rather than shortening the single build time. I said before that I didn’t feel like it was going to help much, and now I definitely have some first hand experience to tell you that it doesn’t help at all.

The new tinderbox is a 32-way system; it has two 16-core CPUs, and enough RAM for each of them; you can easily build with 64 process at once, but I’m actually trying to push it further by using the unbound -j option (this is not proper, I know, but still). While this works nicely, we still have too many packages that force serial-building due to broken build systems; and a few that break in these conditions that would very rarely break on systems with just four or eight cores, such as lynx .

I then tried, during the first two rebuilds of world (one to set my choices in USE flags and packages, the other to build it hardened), running with five jobs in parallel… between the issue of the huge system set (yes that’s 4.24 years old article), and the fact that it’s much more likely to have many packages depending on one, rather than one depending on many, this still does not saturate the CPUs, if you’re still building serially.

Honestly seeing such a monstrous system take as much as my laptop, which is 14 in cores and 14 in RAM, to build the basic system was a bit… appalling.

The huge trouble seem to be for packages that don’t use make, but that could, under certain circumstances, be able to perform parallel building. The main problem with that is that we still don’t have a variable that tells us exactly how many build jobs we have to start, instead relying on the MAKEOPTS variable. Some ebuilds actually try to parse it to extract the number of jobs, but that would fail with configurations such as mine. I guess I should propose that addition for the next EAPI version… then we might actually be able to make use of it in the Ruby eclasses to run tests in parallel, which would make testing so much faster.

Speaking about parallel testing, the next automake major release (1.13 — 1.12 was released but it’s not in tree yet, as far as I can tell) will execute tests in parallel by default; this was optional starting 1.11 and now it’s going to be the default (you can still opt-out of course). That’s going to be very nice, but we’ll also have to change our src_test defaults, which still uses emake -j1 which forces serialisation.

Speaking about which, even if your package does not support parallel testing, you should use parallel make, at least with automake, to call make check; the reason is that the check target should also build the tests’ utilities and units, and the build can be sped up a lot by building them in parallel, especially for test frameworks that rely on a number of small units instead of one big executable.

Thankfully, for the day there are two more packages fixed to build in parallel: Lynx (which goes down from 110 to 46 seconds to build!) and Avahi (which I fixed so that it will install in parallel fine).

For A Parallel World: ebuild writing tip: faster optional builds

Today lurking on #gentoo-hardened I came to look at an ebuild written particularly badly, that exasperated one very bad construct for what concerns parallel builds (which are a very good thing with modern multi-core multi-thread CPUs):

src_compile() {
  if use foo; then
     emake foo || die
  fi

  if use bar; then
    emake bar || die
  fi

  if use doc; then
    emake doc || die
  fi
}

This situation wastes a lot of processing power: the three targets with all their dependencies will be taken into consideration serially, not in parallel; if you requested 12 jobs, but each of foo and bar only have three object files as dependencies, they should have been built at the same time, not in two different invocations.

I admit I made this mistake before, and even so recently, mostly related to documentation building, so how does one solve this problem? Well there are many options, my favourite being something along these lines:

src_compile() {
  emake 
    $(use foo && echo foo) 
    $(use bar && echo bar) 
    $(use doc && echo doc) 
    || die "emake failed"
}

Of course this has one problem in the fact that I don’t have a general target so it should rather be something more like this:

src_compile() {
  local maketargets=""

  if use bar ; then
    maketargets="${maketargets} bar"
  else
    maketargets="${maketargets} foo"
  fi

  emake ${maketargets} 
    $(use doc && echo doc) 
    || die "emake failed"
}

This will make sure that all the targets will be considered at once, and will leave make to take care of dependency resolution.

I tried this approach out in the latest revision of the Drizzle ebuild that I proxy-maintain for Pavel; the result is quite impressive because doxygen, instead of taking its dear time after the build completed, runs for about half of the build process (using up only one slot of the twelve jobs I allocate for builds on Yamato).

Obviously, this won’t make any difference if the package is broken with respect to parallel build (using emake -j1) and won’t make a difference when you’re not building in parallel, but why not doing it right, while we’re at it?

For A Parallel World. Theory lesson n.3: directory dependencies

Since this is not extremely common knowledge, I wanted to write down some more notes regarding the problem that Daniel Robbins reported in Ruby 1.9 which involves parallel make install problems.

This is actually a variant of a generic parallel install failure: lots of packages in the past assumed that make install is executed on a live filesystem and didn’t create the directories where to copy the files on. This of course fails for all the staging trees install (DESTDIR-based install), which are used by all distributions to build packages, and by Gentoo to merge from ebuilds. With time, and distributions taking a major role, most of the projects updated this so that they do create their directories before merging (although there are quite a few failing this still, just look for dodir calls in the ebuilds).

The problem we have here instead is slightly different: if you just have a single install target that depends at the same time on the rules that create the directories and on those that install the files, these doesn’t specify interdependencies:

install: install-dirs install-bins

install-dirs:
        mkdir -p /usr/bin

install-bins: mybin
        install mybin /usr/bin/mybin

(Read it like it used DESTDIR properly). When using serial make, the order the rules appear on the dependency list is respected and thus the directories are created before the binaries; with no problem. When using parallel make instead, the two rules are executed in parallel and if the install command may be executed before mkdir. Which makes the build fail.

The “quick” solution that many come to is to depend on the directory:

install: /usr/bin/mybin

/usr/bin:
        mkdir -p /usr/bin

/usr/bin/mybin: mybin /usr/bin
        install mybin /usr/bin/mybin

This is the same solution that Daniel came to; unfortunately this does not work properly; the problem is that this dependency is not just ensuring that the directory exists, but it also adds a condition on the timestamp of modification (mtime) of the directory itself. And since the directory’s mtime is updated whenever the mtime of its content is updated, this can become a problem:

flame@yamato foo % mkdir foo   
flame@yamato foo % stat -c '%Y' foo
1249082013
flame@yamato foo % touch foo/bar
flame@yamato foo % stat -c '%Y' foo
1249082018

This does seem to work for most cases, and indeed a similar patch was added already to Ruby 1.9 in Portage (and I’m going to remove it as soon as I have time). Unfortunately if there are multiple files that gets installed in a similar way, it’s possible to induce a loop inside make (installing the latter binaries will update the mtime of the directory, which will then have an higher mtime than the first binary installed).

There are two ways to solve this problem, neither look extremely clean, and neither are prefectly optimal, but they do work. The first is to always call mkdir before installing the file; this might sound overkill, but using mkdir -p it really has a small overhead compared to just calling it once.

install: /usr/bin/mybin

/usr/bin/mybin: mybin /usr/bin
        mkdir -p $(dir $@)
        install mybin /usr/bin/mybin

The second is to depend on a special time-stamped rule that creates the directories:

install: /usr/bin/mybin

usr-bin-ts:
        mkdir -p /usr/bin
        touch $@

/usr/bin/mybin: mybin usr-bin-ts
        install mybin /usr/bin/mybin

Now for Ruby I’d sincerely go with the former option rather than the latter, because the latter adds a lot more complexity and for quite little advantage (it adds a serialisation point, while mkdir -p execute in parallel). Does this help you?

For A Parallel World. Improvements: make call checks

This is a particularly Gentoo-oriented instance of the For A Parallel World series, please don’t look away too much because I’m not trying to explain how to improve software in general, this time, at least not directly.

My tinderbox, of which I’ve been writing a lot almost daily, is running on an 8-way system, Yamato, a dual quad Opteron (partially funded by users last year); by running on such a system, I can usually cut down the time to build packages thanks to parallel make, but this is not always possible.

There are instances of packages that serialise build either because they are bugged and would break when building in parallel, or because they simply are bugged and disable parallel make without understanding it. This is the reason why I started this series in the first place. Now, for ebuilds that do use serial make, calling emake -j1, I’ve already asked that bugs are kept open, so that either I or somebody else can take care of reproducing and fixing the various issues, instead of going around trying to get it to work. While this works just partially, it’s still an improvement over the previous state of “if it works in serial, close the bug”.

But there are a couple extra problems: before I joined, most ebuilds that wanted to avoid parallel make used make rather than emake -j1; while the latter is usually caught by repoman which warns about an upstream workaround, the former is not. It also makes it difficult to understand whether the non-parallel make is requested on purpose, or it was just overlooked, since rarely there are comments about that.

Thanks to Kevin Pyle, I’ve been tracking down these rogue ebuilds; he showed me a very simple snippet of code that can track down the callers of direct make in ebuilds:

make() {
        eqawarn "/etc/portage/bashrc QA notice: 'make' called by ${FUNCNAME[1]}"
        emake "$@"
}

Thanks to this snippet I’ve been able to identify already a few packages that called make but builds fine in parallel, and a couple that require serial make or fail to build or install (bugs opened and ebuild worked around). Hopefully, on the long term this check won’t hit any longer and ebuilds will work properly in parallel. It would really be a good thing, because processors these days are increasing the number of cores faster than they achieve extra speed, and being able to build stuff in parallel, as well as execute it in parallel, is the key to reduce the time to install of Gentoo.

For A Parallel World. Case Study n.7: single rule, multiple outputs

Today, Markus brought to my attention bug #247219, which I reported myself, against dev-util/bugle. It seemed to me like it was a parallel build failure, skimming at the log when I reported the issues, but it turns out not to be related to that. On the other hand, there is a parallel make issue there, which only works out of sheer luck of not hitting race conditions, so it might be a good idea to look into it nonetheless. Since it’s simpler to document these cases for the public than having to explain the same situation multiple time, I promised him I would blog about it, so here it is a new episode of the For A Parallel World series .

When you look at the build log, you can see there are multiple calls to mkdir and multiple regeneration of sources between it, which calls out to an issue already described with make rules and the way they are written:

/bin/mkdir -p include/budgie
/bin/mkdir -p include/budgie
/bin/mkdir -p include/budgie
/bin/mkdir -p include/budgie
/bin/mkdir -p include/budgie
budgie/budgie -I. -T `test -f src/data/gl.tu || echo './'`src/data/gl.tu ./bc/gl-glx.bc
budgie/budgie -I. -T `test -f src/data/gl.tu || echo './'`src/data/gl.tu ./bc/gl-glx.bc
budgie/budgie -I. -T `test -f src/data/gl.tu || echo './'`src/data/gl.tu ./bc/gl-glx.bc
budgie/budgie -I. -T `test -f src/data/gl.tu || echo './'`src/data/gl.tu ./bc/gl-glx.bc
budgie/budgie -I. -T `test -f src/data/gl.tu || echo './'`src/data/gl.tu ./bc/gl-glx.bc

This is not making anything fail, but it’s certainly making the system do more work than it should since it’s executing the same commands five times rather than once. Since the script does seem to produce the sources one by one but rather as a monolithic request, we cannot apply the same solution we used for the past case (breaking down the rules so that they build a single file each), so we have to serialise the build a bit to make sure it is executed just once.

To do so, the easiest way is to interpose between the final sources and the original files another rule, that would be called just once and that would be able to take care of rebuilding the files just when needed. This is usually done by creating a timestamp file, so to make sure that the rebuild hits as soon as the original files are changed, and that it stops it from running more than once if the files have been generated already.

Let’s see the original rules then:

$(BUDGIE_BUILT_SRCS): budgie/budgie$(EXEEXT) src/data/gl.tu $(budgie_all_bc_files)
        $(mkdir_p) include/budgie
        budgie/budgie$(EXEEXT) -I$(srcdir) -T `test -f src/data/gl.tu || echo '$(srcdir)/'`src/data/gl.tu $(budgie_main_bc_file)

As usual, there are multiple targets to the rule, even if the command is just the same and does not vary depending on the file that is executed. So we change the files to depend on a new timestamp file, and use that to create the files:

$(BUDGIE_BUILT_SRCS): built-sources-ts

built-sources-ts: budgie/budgie$(EXEEXT) src/data/gl.tu $(budgie_all_bc_files)
        $(mkdir_p) include/budgie
        budgie/budgie$(EXEEXT) -I$(srcdir) -T `test -f src/data/gl.tu || echo '$(srcdir)/'`src/data/gl.tu $(budgie_main_bc_file)
        touch $@

Now make will decide that for each of the files in BUDGIE_BUILT_SRCS it needs the built-sources-ts timestamp file, and thus call the actual generation rule once to create it, which incidentally creates the rest of the files.

The only remaining issue is to make sure that when you clean up the sources you also remove the file, so just scroll further down the Makefile.am and you can easily spot where to change it so that there is this line:

CLEANFILES = 
        $(BUDGIE_BUILT_SRCS) built-sources-ts 

The trick here was to add the timestamp file together with the files declared in the variable so that they are cleared too. Now with a new build you’ll see just one mkdir and script call instead of five, less work for the build machine.

For A Parallel World. Case Study n.6: parallel install versus install hooks

Service note: I start to fear for one of my drives, as soon as my local shop restocks the Seagate 7200.10 drives I’ll go get two more to replace the 250GB ones and put them under throughout tests.

I’ve already written in my series about some issues related to parallel install. Today I wish to show a different type of parallel install failure, which I found while looking at the logs of my current tinderbox run.

Before starting, though, I wish to explain one thing that might not be tremendously obvious to most people not used to work with build systems. While the parallel build failures are most of the time related to non-automake based buildsystems, which fail to properly express dependencies, or in which the authors mistook a construct for another, the parallel install failures are almost always related to automake. This is due to the fact that almost all custom-tailored buildsystems don’t allow parallel install in the first place. For most of them, the install target is just one single serial rule, which always works fine even when using multiple parallel jobs, but obviously slows down modern multicore systems. As automake supports parallel install targets, which makes it quite faster to install packages, it also adds the complexity that can cause parallel build failures.

So let’s see what the failure I’m talking about is; the package involved is gmime, with the Mono bindings enabled; Gentoo bug #248657, upstream bug #567549 (thanks to Jeffrey Stedfast, who quickly solved it!). The log of the failure is the following:

Making install in mono
make[1]: Entering directory `/var/tmp/portage/dev-libs/gmime-2.2.23/work/gmime-2.2.23/mono'
make[2]: Entering directory `/var/tmp/portage/dev-libs/gmime-2.2.23/work/gmime-2.2.23/mono'
make[2]: Nothing to be done for `install-exec-am'.
test -z "/usr/share/gapi-2.0" || /bin/mkdir -p "/var/tmp/portage/dev-libs/gmime-2.2.23/image//usr/share/gapi-2.0"
test -z "/usr/lib/pkgconfig" || /bin/mkdir -p "/var/tmp/portage/dev-libs/gmime-2.2.23/image//usr/lib/pkgconfig"
/usr/bin/gacutil /i gmime-sharp.dll /f /package gmime-sharp /root /var/tmp/portage/dev-libs/gmime-2.2.23/image//usr/lib
 /usr/bin/install -c -m 644 'gmime-sharp.pc' '/var/tmp/portage/dev-libs/gmime-2.2.23/image//usr/lib/pkgconfig/gmime-sharp.pc'
 /usr/bin/install -c -m 644 'gmime-api.xml' '/var/tmp/portage/dev-libs/gmime-2.2.23/image//usr/share/gapi-2.0/gmime-api.xml'
Failure adding assembly gmime-sharp.dll to the cache: Strong name cannot be verified for delay-signed assembly
make[2]: *** [install-data-local] Error 1
make[2]: Leaving directory `/var/tmp/portage/dev-libs/gmime-2.2.23/work/gmime-2.2.23/mono'
make[1]: *** [install-am] Error 2
make[1]: Leaving directory `/var/tmp/portage/dev-libs/gmime-2.2.23/work/gmime-2.2.23/mono'

To make it much more readable, the command and the error line in the output are the following:

/usr/bin/gacutil /i gmime-sharp.dll /f /package gmime-sharp /root /var/tmp/portage/dev-libs/gmime-2.2.23/image//usr/lib
Failure adding assembly gmime-sharp.dll to the cache: Strong name cannot be verified for delay-signed assembly

So the problem comes from the gacutil program that in turns comes from mono, which seems to be working on the just-installed file. But was it installed? If you check the complte log above, there is no install(1) call for the gmime-sharp.dll file that gacutil complains about, and indeed that is the problem. Just like I experienced earlier, Mono-related error messages needs to be interpreted to be meaningful. In this case, the actual error should be a “File not found” over /var/tmp/portage/dev-libs/gmime-2.2.23/image//usr/lib/mono/gmime-sharp/gmime-sharp.dll.

The rule that causes this is, as make reports, install-data-local, so let’s check that in the mono/Makefile.am file:

install-data-local:
        @if test -n '$(TARGET)'; then                                                                   
          if test -n '$(DESTDIR)'; then                                                         
            echo "$(GACUTIL) /i $(ASSEMBLY) /f /package $(PACKAGE_SHARP) /root $(DESTDIR)$(prefix)/lib";                
            $(GACUTIL) /i $(ASSEMBLY) /f /package $(PACKAGE_SHARP) /root $(DESTDIR)$(prefix)/lib || exit 1;     
          else                                                                                          
            echo "$(GACUTIL) /i $(ASSEMBLY) /f /package $(PACKAGE_SHARP) /gacdir $(prefix)/lib";                        
            $(GACUTIL) /i $(ASSEMBLY) /f /package $(PACKAGE_SHARP) /gacdir $(prefix)/lib || exit 1;             
          fi;                                                                                           
        fi

So it’s some special code that is executed to register the Mono/.NET assembly with the rest of the code, it does not look broken at a first glance, and indeed this is a very subtle build failure, because it does not look wrong at all, unless you know automake enough already. Although the build logs helps you a lot to find this out.

The gmime-sharp.dll file is created as part of the DATA class of files in automake, but install-data-local is not depending on them directly, its execution order is not guaranteed by automake at all. On the other hand, the install-data-hook rule is called after install-data completed, and after DATA is actually built. So the solution is simply to replace -local with -hook. And there you go.

Next…

For A Parallel World: Programming Pointers n.1: OpenMP

A new sub-series of For A Parallel World, to give some development pointers on how to make software make better use of parallel computing resources available with modern multicore systems. Hope some people can find it interesting!

While I was looking for a (yet unfound) scripting language that could allow me to easily write conversion scripts with FFmpeg that execute in parallel, I’ve decided to finally take a look to OpenMP, which is something I wanted to do for quite a while. Thanks to some Intel documentation I tried it out on the old benchmark I used to compare glibc, glib and libavcodec/FFmpeg for what concerned byte swapping.

The code of the byte swapping routine was adapted to this (note the fact that the index is ssize_t, which means there is opening for breakage when len is bigger than SSIZE_T_MAX):

void test_bswap_file(const uint16_t *input, uint16_t *output, size_t len) {
  ssize_t i;

#ifdef _OPENMP
#pragma omp parallel for
#endif
  for (i = 0; i < len/2; i++) {
    output[i] = GUINT16_FROM_BE(input[i]);
  }
}

and I compared the execution time, through the @rdtsc@ instruction, between the standard code and the OpenMP-powered one, with GCC 4.3. The result actually surprised me. I expected I would have needed to make this much more complex, like breaking it into multiple chunks, to be able to make it fast. Instead it actually worked nice running one cycle on each:

A bar graph showing the difference in speed between standard serial bswap and a simple OpenMP-powered variant

The results are the average of 100 runs for each size, the file is on disk accessed through VFS, it’s not urandom; the values are not important since they are the output of rdtsc() so they only work for comparison.

Now of course this is a very lame way to use OpenMP, there are much better more complex ways to make use of it but all in all, I see quite a few interesting patterns arising. One very interesting note is that I can make use of this in quite a few of the audio output conversion functions in xine-lib, if I was able to resume dedicating time to work on that. Unfortunately lately I’ve been quite limited in time, especially since I don’t have a stable job.

But anyway, it’s interesting to know; if you’re a developer interested in making your code faster on modern multicore systems, consider taking some time to play with OpenMP. With some luck I’m going to write more about it in the future on this blog.

For A Parallel World. Case Study n.5: parallel install

After a longish time, here for you a new chapter of my widely read series For A Parallel World, improving buildsystems to reduce build time on modern multiprocessor, multicore systems.

This time, rather than the usual build failures, I’m going to speak of a parallel install failure. Even though one can think of install as a task that rarely can fall into problems like race conditions and the such, and even though it’s probably the part that gets less boost when using parallel make on a multicore system (since it’s usually I/O bound rather than CPU bound), it’s actually one very fragile part of many packages.

One of the common failures is due to old install-sh script used to simulate the install command on systems too old to have a POSIX-compatible one, and which is also used to create directories recursively if mkdir -p is missing. For a series of reason, this hits pretty often on FreeBSD, but this is beside the point. This can be easily solved by replacing the old faulty script with an updated copy out of automake or libtool, which does not have problems at all.

A few times, the problem is instead due to a broken Makefile.am. Let’s take a practical example from some software I fixed recently after being called in action by nixnut: gramps . Please note that if you look at the bug now you’re going to spoil the post, since it contains the solution straight away, while I’m going to explain it step by step.

Let’s start from the reported build log:

test -z "/usr/share/gramps/docgen" || /bin/mkdir -p
"/var/tmp/portage/app-misc/gramps-3.0.3/image//usr/share/gramps/docgen"
 /usr/bin/install -c -m 644 'gtkprintpreview.glade'
'/var/tmp/portage/app-misc/gramps-3.0.3/image//usr/share/gramps/docgen/gtkprintpreview.glade'
 /usr/bin/install -c -m 644 'gtkprintpreview.glade'
'/var/tmp/portage/app-misc/gramps-3.0.3/image//usr/share/gramps/docgen/gtkprintpreview.glade'
/usr/bin/install: cannot create regular file
`/var/tmp/portage/app-misc/gramps-3.0.3/image//usr/share/gramps/docgen/gtkprintpreview.glade':
File exists
make[3]: *** [install-docgenDATA] Error 1
make[3]: *** Waiting for unfinished jobs....

As usual, the first thing we’re looking for when there is a parallel build (or install) failure are repeated commands. As I’ve shown in Case Study n. 2, when the same command is repeated multiple times it’s often due to mistakes in the Makefiles, thus before thinking of a problem with the dependencies, I check for that. It’s way more common especially on automake-based build systems.

So indeed we can see there are two calls to the install command for the file gtkprintpreview.glade (this also shows us that it’s not a problem of old and faulty install-sh script since the call is directly to the system command). Contrary to what happens when it’s a build rule that is wrongly expressed in the makefile, the double-call during install phase is usually present both using parallel jobs and not. The difference is that when the two calls happen sequentially, the second just overwrites the results of the first; wastes time but it’s successful. On the other hand when parallel jobs are used, the two calls are often enough happening at the same time, and thus we have a race condition.

Okay so next step as usual is to look at the incriminated Makefile.am:

[snip]
docgen_DATA = 
        gtkprintpreview.glade

dist_docgen_DATA = $(docgen_DATA)
[snip]

Here we’re at the core of the problem. The gtkprintpreview.glade file is part of the sources, and it has to be installed as part of the docgen class of files (thus in $docgendir). But the data installed in that path is listed twice, once in the docgen_DATA variable and one in dist_docgen_DATA, causing the file to be installed twice on two independent targets. Since the two targets are independent, when using parallel jobs they both will run at the same time the same command.

Let me try to explain what the mistake has been. By default the sources are packaged up in the final tarball, if they are not generated by rules from the make process; sometimes you wish files that are built by make to still be distributed, and thus you either have to use EXTRA_DIST or prefix dist_ to the class of the installed files, to explicit that the files have to be distributed. Unfortunately the gramps developers didn’t know automake well enough, and thought that dist_docgen_DATA worked quite a lot like EXTRA_DIST (maybe it actually used EXTRA_DIST in the past, for what I know), and thus duplicated the variable content.

The solution? Just replace the use of docgen_DATA with dist_docgen_DATA and remove the second definition, the problem is solved at the source.