Buildsystem quirks: now you know why you shouldn’t rely on uname

It is difficult to be angry at Linus for anything, on account of how much good he has done, and is doing, for us all with the whole idea of having Linux – it’s not like waiting for a GNU kernel would have helped – but honestly I feel quite bothered that he ended up treating the bump of the kernel’s version number to 3 as a pure matter of public relations, without considering the technical side of it: software relies on version numbers being, you know, meaningful. Which, honestly, reminded me of something, but let’s not get ahead of ourselves.

Let’s ignore the fact that kernel modules started to fail – they would probably have been failing because of API changes anyway, without the need for anything as dramatic as a version bump to 3. And let me be clear on one thing at least: it’s not the build failures that upset me – as I said last year, I prefer it when packages fail at build time rather than, subtly, at runtime.

At any rate, let’s begin with the first reason why you should not rely on uname results: cross-compilation. Like it or not, cross-compilation is still a major feature of a good build system. Actually, my main sources of revenue in the past five years have all involved at least some kind of cross-compilation, and not just for embedded systems, which is why I stress so often the importance of having a build system that cross-compiles properly.

So what happens with build systems and Linux 3? Well, let’s just say that if you want to tell Linux 2.6 apart from 2.4, you cannot check for major == 2 && minor >= 6 or something along those lines: as soon as the major version becomes 3, the test fails, and a brand-new kernel gets treated as an ancient one. A variation of this is what happened with ISC DHCP, which didn’t consider any major version besides 2.
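In shell terms, the broken pattern looks something like this (an illustrative sketch, not ISC DHCP’s actual check):

# parse `uname -r` and hardcode major == 2; uname -r prints e.g. "3.0.4"
version=$(uname -r)
major=$(echo "${version}" | cut -d. -f1)
minor=$(echo "${version}" | cut -d. -f2)
if [ "${major}" -eq 2 ] && [ "${minor}" -ge 6 ]; then
  echo "enabling Linux 2.6 features"
else
  # Linux 3.0 ends up here and is treated like an unknown, ancient kernel
  echo "unknown kernel, falling back to generic code"
fi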

Now, in the case of DHCP it’s a simple failure: the build system refuses to build at all when it doesn’t understand the uname results. But there are a number of other situations where this is not as clear-cut, because the test is there to enable features, or backends, or special support for Linux 2.6, and hitting an unknown version causes generic (or, even worse, dummy!) code to be built. These situations are almost impossible to identify without actually using the software itself; even tinderbox testing is useless most of the time here, as the test suites are probably also throttled down so as not to hit the Linux-specific codepaths.

And don’t worry, there are enough build systems designed so badly that this is not a random, merely theoretical risk. Take the build system of upower (and devicekit-power before it) as it was until today: it decided which of its few backends to enable by checking for the presence of some header files on the system used for the build – which by itself hinders cross-compilation – and if it found no known combination of files, it built the dummy backend. For the curious, today I sent a patch – which Richard Hughes applied right away, thanks Richard! – for upower to choose which backend to build based on the $host value handed over by autoconf, which finally makes it cross-compilable without passing extra parameters to ./configure (even though the override is still available, of course).
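The pattern of that fix is worth showing, since it applies to many similar packages; a minimal sketch of the configure.ac logic (not the literal patch):

case "$host" in
  *-linux*)   backend=linux ;;
  *-freebsd*) backend=freebsd ;;
  *)          backend=dummy ;;
esac
AC_MSG_NOTICE([selected the ${backend} backend for ${host}])

Since $host comes from the --host parameter (or from config.guess), rather than from whatever system happens to run the build, the right backend gets selected even when cross-compiling.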

How long will it take for all the bugs to be sorted out? I’m afraid that’s impossible for me to tell. We might still be finding new bugs a year from now, just as we might not find any in the next six months… unfortunately, not all projects update at the same pace. I have an example of that in front of my eyes: my laptop’s HSDPA modem, which includes a GPS module, is well supported by the vanilla kernel as far as the network connection is concerned… but at the same time, the GPS support is still lacking. While there is a project to support these cards, the userland GPS driver still relies on HAL rather than the new udev, which makes it quite useless as far as I’m concerned.

So anyway, next time you write a build system, do not even consider uname… and if your build system relies on it, please fix it — and if you write an ebuild that relies on aleatory data such as uname results or the presence of given files (which differs from checking for given headers, be warned!), make sure that there is an override and… use it.
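In ebuild terms, that means always pinning the choice explicitly; for instance (the --with-backend option name is hypothetical, standing in for whatever override the build system provides):

src_configure() {
  # never let configure guess from the build host; state what we want
  econf --with-backend=linux
}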

For A Parallel World: ebuild writing tip: faster optional builds

Today, lurking on #gentoo-hardened, I came across a particularly badly written ebuild that exemplified one very bad construct as far as parallel builds are concerned (and those are a very good thing with modern multi-core, multi-thread CPUs):

src_compile() {
  if use foo; then
     emake foo || die
  fi

  if use bar; then
    emake bar || die
  fi

  if use doc; then
    emake doc || die
  fi
}

This situation wastes a lot of processing power: the three targets, with all their dependencies, are considered serially, not in parallel; if you requested 12 jobs, but foo and bar each only have three object files as dependencies, they should be built at the same time, not in two different invocations.

I admit I have made this mistake before, even quite recently, mostly related to documentation building. So how does one solve this problem? Well, there are many options, my favourite being something along these lines:

src_compile() {
  emake \
    $(use foo && echo foo) \
    $(use bar && echo bar) \
    $(use doc && echo doc) \
    || die "emake failed"
}

Of course this has one problem: there is no general target that is always built, so it should rather be something more like this:

src_compile() {
  local maketargets=""

  if use bar ; then
    maketargets="${maketargets} bar"
  else
    maketargets="${maketargets} foo"
  fi

  emake ${maketargets} \
    $(use doc && echo doc) \
    || die "emake failed"
}

This will make sure that all the targets are considered at once, leaving make to take care of the dependency resolution.

I tried this approach out in the latest revision of the Drizzle ebuild that I proxy-maintain for Pavel; the result is quite impressive because doxygen, instead of taking its dear time after the build completed, runs for about half of the build process (using up only one slot of the twelve jobs I allocate for builds on Yamato).

Obviously, this won’t make any difference for packages that are broken with parallel make (and thus forced to build with emake -j1), and it won’t make a difference when you’re not building in parallel, but why not do it right while we’re at it?

Tell-tale signs that your Makefile is broken

Last week I sent out a broad last-rites email for a number of gkrellm plugins my tinderbox reported warnings about, warnings showing that they have been broken for a long time. This is particularly critical because the current maintainer of all the gkrellm packages, Jim (lack), does not seem to be very active on them.

The plugins I scheduled for removal mostly show warnings related to the gdk_string_width() function being called with a completely different object than it should be called with, which results in unpredictable behaviour at runtime (most likely, a crash). A few more had actual buffer overflows, or failed to build because their dependencies changed. If you care about a plugin that is scheduled for removal, you’re advised to look into it yourself and start proxy-maintaining it.

I originally thought I had caught all of the broken packages; but since then another one appeared with the same gdk_string_width() error, so I decided to run the tinderbox specifically against the gkrellm plugins. That turned up one more I had missed, and at that point I had actually found them all. A few more were reported for ignoring LDFLAGS, but nothing especially bad turned up on my tinderbox.

What it did show, though, is that the ignored LDFLAGS are just a symptom of a deeper problem: most of the plugins have broken, very poorly written Makefiles. This shows in a number of small things, but the most obvious one is the usual “jobserver unavailable” message that I wrote about last year.

So here’s a good checklist of tell-tale signs that your Makefile is broken:

  • you call the make command directly — while this works perfectly fine on GNU systems, where you almost always use the GNU make implementation, this is not the case on most BSD systems, and almost always the Makefile is good enough to work only with the GNU implementation; the solution is to call $(MAKE), which is replaced with the name of the make implementation you’re actually using;
  • it takes you more than one command to run make in a subdirectory (this can also be true for ebuilds, mind you) — things like cd foo && make or, even worse, (cd foo; make; cd ..; ) are mostly silly to look at and, besides, will cause the usual jobserver unavailable warning; what you might not know here is that make is (unfortunately) designed to allow for recursive builds, and provides an option to do so without requiring you to change the working directory beforehand: make -C foo (which, taking the previous point into consideration, should actually be $(MAKE) -C foo) does just that, and changes the working directory only for the make process and its children, rather than for the current process as well;
  • it doesn’t use the builtin rules — why keep writing the same rules to build object files? make already knows how to compile .c files into relocatable objects; instead of writing your own rules just to inject parameters, use the CFLAGS variable like make is designed to do! Bonus points if, for final executables, you also use the builtin linking rule (for shared objects I don’t think there is one);
  • it doesn’t use the “standard” variable names — for years I have seen projects written in C++ insisting on using the CPP and CPPFLAGS variables; well, that’s wrong, as “cpp” here refers to the C Pre-Processor; the correct variables are CXX and CXXFLAGS; inventing your own variable names to express parameters that the user can pass tends to be a very bad choice, as you break the expectations of the developers and packagers using your software (a sketch of a Makefile that gets all of this right follows below).
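To make those points concrete, here is a minimal sketch of a well-behaved plugin Makefile (the names gkrellm-foo.so, foo.o and the po subdirectory are made up for the example):

# the builtin rules already compile .c into .o using $(CC) and $(CFLAGS);
# we only append what the plugin actually needs
CFLAGS += -fPIC `pkg-config --cflags gtk+-2.0`
LIBS = `pkg-config --libs gtk+-2.0`

all: gkrellm-foo.so

# there is no builtin rule to link shared objects, so spell this one out
gkrellm-foo.so: foo.o
	$(CC) $(LDFLAGS) -shared -o $@ $^ $(LIBS)

# recurse with $(MAKE) -C rather than cd po && make
install:
	$(MAKE) -C po install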

Now, taking this into consideration, can you please clean up your packages? Pretty please with sugar on top?

Problems and mitigation strategies for --as-needed

Doug and Luca tonight asked me to comment on the --as-needed by default bug. As you can read there, my assessment is that the time is ripe to bring on a new stage of --as-needed testing, by forcing --as-needed through the specs files. If you wish to help with this testing (and I can tell you I’m working on testing it massively), you can do so by creating your own asneeded specs.

Note, Warning, End of the World Caution!

This is not a procedure for generic users; this is for power users who don’t mind screwing up their system, even beyond repair! But it would help me (and the rest of Gentoo) with the testing.

First of all you have to create your own specs file, so issue these commands:

# export SPECSFILE=$(dirname "$(gcc -print-libgcc-file-name)")/asneeded.specs
# export CURRPROFILE=/etc/env.d/gcc/$(gcc-config -c)
# gcc -dumpspecs | sed -e '/link:/,+1 s:--eh-frame-hdr:--eh-frame-hdr --as-needed:' > "$SPECSFILE"
# sed "${CURRPROFILE}" -e '1iGCC_SPECS='$SPECSFILE > "${CURRPROFILE}-asneeded"
# gcc-config "$(basename "${CURRPROFILE}")-asneeded"
# source /etc/profile

This will create the specs file we’re going to use (it just adds --as-needed to any linking command that is not static), and then create a new configuration file for gcc-config pointing to that file. After this, --as-needed will be forced on. Now go rebuild your system and file bugs about the problems.
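If you want to double-check that the new specs are active before rebuilding anything, you can look at the link command gcc actually generates; this sanity check on a throwaway test file should print --as-needed at least once:

$ echo 'int main(void) { return 0; }' > test.c
$ gcc -v test.c 2>&1 | grep -o -- '--as-needed'
--as-needed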

A package known to break with this is xmlrpc-c, which in turn makes cmake fail; since nobody had the common sense to add a way to avoid the xmlrpc-c dependency (you might not want your cmake to ever submit test results), this can get nasty for KDE users. But maybe, just maybe, someone will look into fixing the package at this point.

But xmlrpc-c does call for some reflection on how to handle --as-needed in cases like this: the error is in one package (xmlrpc-c) but the failure shows up in another (cmake), which makes it difficult to assess whether --as-needed breaks something; you might have a broken package with nothing depending on it, and never notice. And a maintainer might not notice that his package is broken, because other maintainers will get the bugs first (until they get redirected properly). Indeed, it’s sub-optimal.

Interestingly enough, Mandriva actually started working on resolving this problem radically: they inject -Wl,--no-undefined in their build procedures, so that if a library is lacking symbols, the build dies sooner rather than later. This is fine up to a certain point, because there are times when a library legitimately has undefined symbols, for instance when it has a recursive dependency on another library (which is the case for PulseAudio’s libpulse and libpulsecore, which I discussed with Lennart some time ago). Of course you can work around this by adding a further -Wl,--undefined, which tells ld to discard the former option, but it requires more work, and more coordination with upstream.
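The effect is easy to demonstrate on a toy shared object (a sketch; helper() is deliberately left unresolved):

$ cat > foo.c <<'EOF'
extern int helper(void);
int frob(void) { return helper(); }
EOF
$ gcc -shared -fPIC foo.c -o libfoo.so
$ gcc -shared -fPIC foo.c -o libfoo.so -Wl,--no-undefined
# the first link succeeds silently, leaving the problem to be found at
# runtime; the second fails immediately with an undefined reference to
# helper()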

Indeed, coordination with upstream is a crucial point here, since maintaining --as-needed fixes in Gentoo is going to be cumbersome in the future, even more so if we start to follow Mandriva’s steps (thankfully, Mandriva is submitting the issues upstream so that they get fixed). But I admit I haven’t been entirely straight on that either; just today I pushed a series of patches to ALSA packages, one of which, to alsa-tools, was for --as-needed support (the copy we have in Portage also just works around the bug rather than fixing it). Maybe we need people who check the tree for patches that haven’t been pushed upstream and try to push them (with proper credits, of course).

Another thing to consider is that upstreams often provide broken Makefiles, and while sending the fix upstream is still possible, fixing them in the ebuild takes more time than it is worth; this is why I want to refine, and submit for approval, my simple build eclass idea, which at least works as a mitigation strategy.

About buildsystems and upstreams

Donnie correctly commented that my earlier proposal is not really a solution that can be proposed upstream. It’s true, it isn’t really upstreamable at all. But I don’t think that’s a problem on its own.

The problem here is that most of these issues appear in software that is no longer actively developed, for which reaching upstream is nearly impossible. Others are caused by software that simply doesn’t have a build system at all and just ships the .c files (like the piechart tool I found the other day), which is also probably unfixable, since upstream decided not to provide a build system in the first place.

But even if we decided to actually go this route for those packages, it wouldn’t stop us from trying to contact upstream and proposing that they fix their build system, either by using autotools for the more complex stuff or by providing a simple sample Makefile that works by our standards for the small stuff. Patching the Makefile in the distribution, though, is likely a waste of time.

Unfortunately, as I wrote recently, even the best coder can write a stupid build system, which is unfortunately very true. And some of them, as Ragel’s author demonstrated lately, refuse to use automake at all, even when their hand-rolled replacement fails to provide the basic functionality a distribution needs to package their software. The reasoning still baffles me, by the way.

So yeah, I think this is a point that really needs to be faced with an open mind, and a knife between your teeth to use against the most clueless of upstreams!

Fixing CFLAGS/LDFLAGS handling with a single boilerplate Makefile (maybe an eclass, too?)

So, in the last few weeks I’ve been filing bugs for packages that don’t respect CFLAGS (or CXXFLAGS), found using the beacon trick. Besides some possible false positives, the testing is going well.
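I won’t repeat the details of the beacon trick here, but one way to implement something like it (an illustrative variant, not necessarily my exact setup) is to inject a flag that leaves a visible marker in the built objects, and then scan for that marker:

# append a marker flag to the CFLAGS the package is supposed to honour
CFLAGS="${CFLAGS} -frecord-gcc-switches"
# after building, check whether the marker made it into the binary
# (/usr/bin/some-binary is a placeholder); if the section is missing,
# the package ignored our CFLAGS
readelf -p .GNU.command.line /usr/bin/some-binary 2>/dev/null \
  | grep -q -- -frecord-gcc-switches || echo "CFLAGS ignored!"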

The problem is that I found more than a couple of packages that either call gcc manually (I admit I’m the author of a couple of ebuilds doing that) or whose Makefile would take more effort to patch than simply replacing it with a boilerplate makefile.

So what is the boilerplate makefile I talk about? Something like this:

$(TARGET): $(OBJS)
        $(CC) $(LDFLAGS) -o $@ $^ $(LIBS)

Does it work? Yes it does: the objects are compiled by make’s builtin rules, which already respect CFLAGS and CXXFLAGS, while the explicit link rule respects LDFLAGS. The invocation in an ebuild (taking one I modified earlier today) is as easy as:

src_compile() {
    emake CC="$(tc-getCC)" \
        TARGET="xsimpsons" \
        OBJS="xsimpsons.o toon.o" \
        LIBS="-lX11 -lXext -lXpm" || die "emake failed"
}

Now of course this would suck if you had to do it for each and every ebuild, but what if we distilled it into an eclass? Something like having an ebuild just invoke it this way:

ESIMPLE_TARGET="xsimpsons"
ESIMPLE_OBJS="xsimpsons.o toon.o"
ESIMPLE_LIBS="-lX11 -lXext -lXpm"

inherit esimple

For slightly more complicated things you could make it use PKG_CONFIG too…

ESIMPLE_TARGET="xsimpsons"
ESIMPLE_OBJS="xsimpsons.o toon.o"
ESIMPLE_REQUIRED="x11 xext xpm"

inherit esimple

so that it would call pkg-config for those rather than using the libraries directly (this would also allow simplifying, for instance, picoxine’s ebuild, which uses xine-lib).

Even better (or maybe I’m getting over the top here ;)), one could make the eclass accept a static USE flag that would call pkg-config --static instead of standard pkg-config and append -static to LDFLAGS, so that the resulting binary would be, well, static…
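To give an idea of how little code the eclass would actually need, here is a rough sketch (entirely hypothetical: the esimple name, the variables and the logic are my strawman, and it assumes the boilerplate makefile above is in place):

# esimple.eclass (sketch)
inherit toolchain-funcs

esimple_src_compile() {
  local libs="${ESIMPLE_LIBS}" extra_cflags=""
  if [[ -n ${ESIMPLE_REQUIRED} ]]; then
    # resolve dependencies through pkg-config instead of raw -l flags
    extra_cflags=$(pkg-config --cflags ${ESIMPLE_REQUIRED}) || die
    libs=$(pkg-config --libs ${ESIMPLE_REQUIRED}) || die
  fi

  emake CC="$(tc-getCC)" \
    CFLAGS="${CFLAGS} ${extra_cflags}" \
    TARGET="${ESIMPLE_TARGET}" \
    OBJS="${ESIMPLE_OBJS}" \
    LIBS="${libs}" || die "emake failed"
}

EXPORT_FUNCTIONS src_compile

The static variant would then just switch to pkg-config --static --libs and append -static to LDFLAGS inside the same function, guarded by the USE flag.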

If anybody has comments about this, to flesh it out before it can actually be proposed as an eclass, now would be a good time to voice them, so we can start off on the right foot!