After a longish break, here is a new chapter of my widely read series For A Parallel World, about improving build systems to reduce build time on modern multiprocessor, multicore systems.
This time, rather than the usual build failures, I’m going to talk about a parallel install failure. One might think of install as a task that can rarely run into problems like race conditions, and it’s probably the part that benefits least from parallel make on a multicore system (since it’s usually I/O bound rather than CPU bound). Even so, it’s actually a very fragile part of many packages.
One of the common failures is due to an old install-sh script, which is used to simulate the install command on systems too old to have a POSIX-compatible one, and which is also used to create directories recursively if mkdir -p is missing. For a series of reasons, this hits pretty often on FreeBSD, but that is beside the point. It can easily be solved by replacing the old, faulty script with an updated copy from automake or libtool, which does not have this problem.
A few times, the problem is instead due to a broken Makefile.am. Let’s take a practical example from some software I fixed recently after being called into action by nixnut: gramps. Please note that if you look at the bug now you’re going to spoil the post, since it contains the solution straight away, while I’m going to explain it step by step.
Let’s start from the reported build log:
test -z "/usr/share/gramps/docgen" || /bin/mkdir -p "/var/tmp/portage/app-misc/gramps-3.0.3/image//usr/share/gramps/docgen"
/usr/bin/install -c -m 644 'gtkprintpreview.glade' '/var/tmp/portage/app-misc/gramps-3.0.3/image//usr/share/gramps/docgen/gtkprintpreview.glade'
/usr/bin/install -c -m 644 'gtkprintpreview.glade' '/var/tmp/portage/app-misc/gramps-3.0.3/image//usr/share/gramps/docgen/gtkprintpreview.glade'
/usr/bin/install: cannot create regular file `/var/tmp/portage/app-misc/gramps-3.0.3/image//usr/share/gramps/docgen/gtkprintpreview.glade': File exists
make: *** [install-docgenDATA] Error 1
make: *** Waiting for unfinished jobs....
As usual, the first thing to look for when there is a parallel build (or install) failure is repeated commands. As I’ve shown in Case Study n. 2, when the same command is repeated multiple times it’s often due to mistakes in the Makefiles, so I check for that before suspecting a problem with the dependencies. Such mistakes are far more common, especially on automake-based build systems.
So indeed we can see there are two calls to the install command for the file gtkprintpreview.glade (this also shows us that it’s not a problem with an old and faulty install-sh script, since the call goes directly to the system command). Contrary to what happens when a build rule is wrongly expressed in the makefile, the double call during the install phase is usually present whether parallel jobs are used or not. The difference is that when the two calls happen sequentially, the second just overwrites the result of the first; it wastes time, but it succeeds. On the other hand, when parallel jobs are used, the two calls often enough happen at the same time, and thus we have a race condition.
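Spotting the repeated command by eye works for a short log like this one, but on longer logs it can help to let the shell do it. What follows is only a minimal sketch, assuming the build log was saved to a file with one command per line; the function name find_duplicate_installs is made up for illustration and is not part of any tool.

```shell
# find_duplicate_installs LOGFILE
# Hypothetical helper: prints, once each, every install command line
# that appears more than once in LOGFILE.
find_duplicate_installs() {
    # Keep only lines that invoke install: the word "install" at the
    # start of the line or after a space or slash, followed by a space.
    # This skips error messages ("install: cannot create ...") and
    # make target names ("install-docgenDATA").
    grep -E '(^|[ /])install ' "$1" |
        sort |
        uniq -d    # print only lines that occur at least twice
}
```

Calling find_duplicate_installs on the saved log prints the duplicated install line if there is one. Keep in mind that a duplicate in the log is a hint rather than a proof: the same file may legitimately be installed to the same place by different packages' sub-makes, so the Makefile.am still has to be checked.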
Okay, so the next step, as usual, is to look at the offending Makefile.am:
[snip]
docgen_DATA = gtkprintpreview.glade
dist_docgen_DATA = $(docgen_DATA)
[snip]
Here we’re at the core of the problem. The gtkprintpreview.glade file is part of the sources, and it has to be installed as part of the docgen class of files (thus in $(docgendir)). But the data installed in that path is listed twice, once in the docgen_DATA variable and once in dist_docgen_DATA, causing the file to be installed twice by two independent targets. Since the two targets are independent, when using parallel jobs they will both run the same command at the same time.
Let me try to explain what the mistake was. By default the sources are packaged up in the final tarball, as long as they are not generated by rules during the make process. Other files, such as data files listed in a _DATA variable or files built by make that you still wish to ship, are not distributed automatically, so you have to either list them in EXTRA_DIST or prefix dist_ to the class of the installed files, to make explicit that the files have to be distributed. Unfortunately the gramps developers didn’t know automake well enough, and thought that dist_docgen_DATA worked quite a lot like EXTRA_DIST (maybe the file was actually listed in EXTRA_DIST in the past, for all I know), and thus duplicated the variable content.
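The solution? Just replace the use of docgen_DATA with dist_docgen_DATA and remove the second definition, so that the file is listed only once (dist_docgen_DATA = gtkprintpreview.glade). The file is then both distributed and installed, and the problem is solved at the source.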