GLIBC 2.17: what’s going to be a trouble?

So LWN reported just today on the release of GLIBC 2.17, which solves a security issue and looks like it was released mostly to support the new AArch64 architecture – i.e. arm64 – but the last entry in the reported news is possibly going to be a major headache, and I'd better post about it right away so that we have a reference for it.

I’m referring to this:

The `clock_*' suite of functions (declared in <time.h>) is now available directly in the main C library. Previously it was necessary to link with -lrt to use these functions. This change has the effect that a single-threaded program that uses a function such as `clock_gettime' (and is not linked with -lrt) will no longer implicitly load the pthreads library at runtime and so will not suffer the overheads associated with multi-thread support in other code such as the C++ runtime library.

This is in my opinion the most important change, not only because, as it's pointed out, C++ software gets quite an improvement from not linking to the pthreads library, but also because it's the only change listed there that I can already foresee trouble with. And why is that? Well, that's easy. Most of the software out there will do something along these lines to figure out what library to link to when using clock_gettime (unconditionally passing -lrt was never a good idea, because the library doesn't exist on most other operating systems out there, including FreeBSD and Mac OS X):

AC_SEARCH_LIBS([clock_gettime], [rt])

This is good, because it'll try either librt or no extra library at all ("none required"), which means that it'll work on old GLIBC systems, new GLIBC systems, FreeBSD, and OS X — there is some other library on Solaris if I'm not mistaken, which could be added up there, but I honestly forgot its name. Unfortunately, this can easily end up causing more trouble when software is underlinked.
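As a concrete illustration, here is a minimal sketch (the program and the build commands are only an example) of the kind of single-threaded program affected: with glibc up to 2.16 it needs -lrt, while with glibc 2.17 it links against the C library alone:

/* clock-example.c - hypothetical example
   glibc <= 2.16: gcc clock-example.c -lrt
   glibc >= 2.17: gcc clock-example.c          (no -lrt needed) */
#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    printf("%ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);
    return 0;
}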

With the old GLIBC, it was possible to link software with just librt and have it use the threading functions. Once librt gets dropped automatically by the configure script, the threading libraries will no longer be brought in by it, and that might break quite a few packages. Of course, most of these would already have been failing with gold, but as you might remember I wasn't able to run the whole tree through it, and I haven't set up a tinderbox for it again yet (I should, but it's trouble enough with two!).

What about --as-needed in this picture? A strict implementation would fail on the underlinking, where pthreads should have been linked explicitly, but would also make sure not to link librt when it's not needed, which would make it possible to improve the performance of the code (by skipping over pthreads) even when the configure scripts are not written properly (for instance when they use AC_CHECK_LIB instead of AC_SEARCH_LIBS). But since it's not the linking of librt that causes the performance issue, but rather that of pthreads, it actually works out quite well, even if some packages might keep an extra, unused link to librt.

There is a final note that I need to write about, and it honestly worries me quite a bit more than all those above. The librt library has not been dropped — only the clock functions have been moved over to the main C library; the library keeps the asynchronous and list-based I/O interfaces (AIO and LIO), the POSIX message queue interfaces, the shared memory interfaces, and the timer interfaces. This means that if you're relying on a clock_gettime test to bring in librt, you'll end up with a failing package. Luckily for me, I've already avoided that situation on feng (which uses the message queue interface), but as I said I foresee trouble for at least some packages.
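To make that concrete, here is a hedged sketch (names and queue path invented for illustration) of a program that uses both clock_gettime and the POSIX message queue interface: a configure check based on clock_gettime alone would conclude on glibc 2.17 that no extra library is needed, yet the mq_* calls still require -lrt:

/* mq-example.c - hypothetical example; still needs -lrt on glibc 2.17 */
#include <fcntl.h>
#include <mqueue.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);             /* in libc since 2.17 */

    mqd_t q = mq_open("/example-queue", O_RDONLY);  /* still in librt */
    if (q == (mqd_t)-1)
        perror("mq_open");
    else
        mq_close(q);
    return 0;
}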

Well, I guess I’ll just have to wait for the ebuild for 2.17 to be in the tree, and run a new tinderbox from scratch… we’ll see what gets us there!

Are we done with LDFLAGS?

Quite a few weeks ago, Markos started asking for more thorough testing of ebuilds to check whether they respect LDFLAGS; he then described why that is important, which can be summarised as "if LDFLAGS are ignored, setting --as-needed by default is pointless". He's right on this, of course, but there are a few more tricks to consider.

You might remember that I described an alternative approach to --as-needed that I call "forced --as-needed", implemented by editing the GCC spec files. The idea is that by forcing the flag in through this method, packages ignoring LDFLAGS and packages using an unpatched libtool can't simply drop it. This was one of the things I had in mind when I suggested that approach, but alas, as it happens, I wasn't listened to.

Right now there are 768 bugs blocking the tracker, of which 404 are open and, in total, 635 were reported by me (mostly, but not exclusively, through the tinderbox). And I'm still opening such bugs on a daily basis.

But let's say that these are all the problems there are, and that in two weeks' time no package is reporting ignored LDFLAGS at all. Will that mean that all packages will properly build with --as-needed at that point? Not really.

Unfortunately, the LDFLAGS-ignored warnings have both false positives (prebuilt binary packages, packages linking with non-GNU linkers) and partial false negatives. To understand that, you have to understand how the current warning works. A few binutils versions ago, a new option was introduced, called --hash-style; this option changes the way the symbol table is hashed for faster loading by the runtime linker/loader (ld.so); the classic hashing algorithm (SysV) is not very good for the long, similar symbol names that are way too common when using C++, so there has been some work to replace it with a better GNU-specific algorithm. I followed most of the related development closely at the time, since Michael Meeks (of OpenOffice.org fame) actually came asking Gentoo for some help testing things out; it was that work that got me interested in linkers and the ELF format.

At any rate, while the original hash table lives in the .hash section, the new hash table, being incompatible, goes into .gnu.hash. The original patch simply replaced one with the other, but what actually landed in binutils was slightly different (it allows choosing between the SysV table, the GNU table, or both), and the default has been to emit both hash tables, so that older systems can load .hash and newer ones can load .gnu.hash; win-win. On the other hand, on Gentoo (and many other distributions) where excessively old GLIBC versions are not supported at all, there is enough of a reason to use --hash-style=gnu to disable the generation of the SysV table entirely.

Now, the Portage warning derives from this situation: if you add -Wl,--hash-style=gnu to your LDFLAGS, Portage will check the generated ELF files and warn if it finds the SysV .hash section. Unfortunately this does not work for non-Linux profiles (as far as I know, FreeBSD does not support the GNU hashing style yet; uClibc does), and it will obviously report all the prebuilt binaries coming from proprietary packages. In those cases, you don't want to strip .hash, because it might be the only hash table keeping ld.so from falling back to a linear search.

So, what is the problem with this test? Well, let's note one big difference between the --as-needed and --hash-style flags: the former is positional, the latter is not. That means that --as-needed, to work as intended, needs to be added before the actual libraries are listed, while --hash-style can come at any point on the command line. Unfortunately this means that if a package has a linking line such as

$(CC) $(LIBS) $(OBJECTS) $(LDFLAGS)

it won't trigger the LDFLAGS warning from Portage, yet (basic) --as-needed passed through LDFLAGS would still be ineffective, because it ends up after the libraries it is supposed to act upon — OTOH, my forced --as-needed method would work just fine. So there is definitely enough work to do here for the next… years?

Your worst enemy: undefined symbols

What ties reckless glibc unmasking, GTK+ 2.20 issues, Ruby 1.9 porting, and --as-needed failures all together? Okay, the title is a dead giveaway for the answer: undefined symbols.

Before digging deeper into the topic I first have to tell you about symbols, I guess; and to do so, and to continue further, I'll be using C as the base language for all of my notes. When considering C, then, a symbol is any function or data (constant or variable) that is declared extern; that is, anything that is neither static nor defined in the same translation unit (that is, the source file, most of the time).

Now, what nm shows as undefined (the U code) is not really what we're concerned about; for object files (.o, just intermediate results) it will report undefined symbols for any function or data element that is used but not defined in the same translation unit; most of those get resolved when all the object files are linked together to form a final shared object or executable — actually, it's a lot more complex than this, but since I don't care to describe symbol resolution here, please accept it as if it were true.
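A tiny sketch of the distinction (file names are only illustrative): the symbol shows up as undefined in one object file, but the reference is resolved as soon as the objects are linked together:

/* answer.c - defines the symbol */
int answer(void) {
    return 42;
}

/* main.c - uses it; nm main.o reports "U answer", but the reference is
   resolved when answer.o and main.o are linked into the final executable */
extern int answer(void);

int main(void) {
    return answer();
}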

The remaining symbols will keep the U code in the shared object or executable, but most of them won't concern us: they will come from the linked libraries, when the dynamic loader actually resolves them. So, for instance, the executable built from the following source code will have the printf symbol "undefined" (for nm), but it'll be resolved by the dynamic linker just fine:

#include <stdio.h>

int main() {
  printf("Hello, world!");
}

I have explicitly avoided using the fprintf function, mostly because that would require a further undefined symbol, so…

Why do I say that undefined symbols are our worst enemy? Well, the problem is actually with undefined, unresolved symbols after the loader has had its way. These are symbols for functions and data that either are not really defined anywhere, or are defined in libraries that are not linked in. The former case is what you get with most of the new-version compatibility problems (glibc, gtk, ruby); the latter is what you get with --as-needed.

Now, if you have a bit of practice with development and with writing simple commands, you might be wondering why this is a problem at all; if you were to mistype the function above into priltf – a symbol that does not exist, at least in the basic C library – the compiler would refuse to create an executable at all, even though the implicit declaration is only treated as a warning, because the symbol is, well, not defined. But this rule only applies, by default, to final executables, not to shared objects (shared libraries, dynamic libraries, .so, .dll or .dylib files).

For shared objects, you have to explicitly ask the linker to refuse linking them with undefined references; otherwise they are linked just fine, with no warning, no error, no bothering at all. The way you tell the linker to refuse that kind of linkage is by passing the -Wl,--no-undefined flag; this way, if there is even a single symbol that is not defined in the current library or any of its dependencies, the linker will refuse to complete the link. Unfortunately, using this by default is not going to work that well.
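As a hedged sketch (the names are invented for illustration), this is the kind of shared object that links cleanly by default but gets rejected once -Wl,--no-undefined is added:

/* plugin.c - host_log() is not defined here, nor in any library the plug-in
   links against; it is expected to be provided by the host program.
   gcc -shared -fPIC plugin.c -o plugin.so                      -> links fine
   gcc -shared -fPIC -Wl,--no-undefined plugin.c -o plugin.so   -> error:
                                       undefined reference to `host_log' */
extern void host_log(const char *message);

void plugin_init(void) {
    host_log("plugin loaded");
}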

There are indeed some more or less good reasons to allow shared objects to have undefined symbols, and here are a few:

Multiple ABI-compatible libraries: okay, this is a very far-fetched one, simply because of the difficulty of having ABI-compatible libraries (it's difficult enough to have them API-compatible!), but it happens; for instance on FreeBSD you have – or at least used to have – a few different implementations of the threading libraries, and more or less the same situation exists for multiple OpenGL and mathematical libraries; the idea behind this is actually quite simple: if you have libA1 and libA2 providing the same symbols, with libB linking to libA1 and libC linking to libA2, then an executable foo linking to both libB and libC would get both implementations loaded together, creating nasty symbol collisions.

Nowadays, FreeBSD handles this through a libmap.conf file that allows always linking against the same library, and then switching to a different one at load time; a similar approach is taken by things like libgssglue, which allows switching the GSSAPI implementation (which might be either Kerberos or SPKM) with a configuration file. On Linux, besides this kind of custom implementation, or hacks such as the one used by Gentoo (eselect opengl) to handle the switch between different OpenGL implementations, there seems to be no interest in tackling the problem at the root. Indeed, I complained about that when --as-needed was softened to allow this situation, although I guess it at least removed one common complaint about adopting the option by default.

Plug-ins hosted by a standard executable: plug-ins are, generally speaking, shared objects; and with the exception of the most trivial plug-ins, whose output is only defined in terms of their input, they use functions that are provided by the software they plug into. When they are hosted (loaded and used) by a library, such as libxine, they can be linked back against the library itself, and that makes sure the symbols are known at the time the plug-in object is created. On the other hand, when the plug-ins are hosted by some software that is not a shared object (which is the case of, say, zsh), then you have no way to link them back, and the linker has no way to discern between undefined symbols that will be lifted from the host program and those that are bad and simply undefined.

Plug-ins providing symbols for other plug-ins: here you have a perfect example in the Ruby-GTK2 bindings; when I first introduced --no-undefined in the Gentoo packaging of Ruby (1.9 initially; nowadays all three C-based implementations get the same flag passed on), we got reports of non-Portage users of Ruby-GTK2 having build failures. The reason? Since all the GObject-derived interfaces have to share the same tables and lists, the solution they chose was to export an interface, unrelated to the Ruby-extension interface (which is actually composed of a single function, bless them!), that the other extensions use; since you cannot reliably link modules one against the other, they don't link to them, and you get the usual problem of not being able to distinguish between expected and unexpected undefined symbols.

Note: this particular case is not tremendously common; when loading plug-ins with dlopen() the default is to use the RTLD_LOCAL option, which means that the symbols are only available to the branch of libraries loaded together with that library, or through explicit calls to dlsym(); this is a good thing because it reduces the chances of symbol collisions and unexpected linking consequences. On the other hand, Ruby itself seems to go all the way against the common idea of safety: it requires RTLD_GLOBAL (register all symbols in the global symbol table, so that they are available for resolution at any point in the whole tree), and also requires RTLD_LAZY, which makes it more troublesome if there are missing symbols — I'll get to what lazy bindings are later.

Finally, the last case I can think of where there is at least some sense to all of this trouble is reciprocating libraries, such as those in PulseAudio. In this situation you have two libraries, each using symbols from the other. Since you need the second to fully link the first, but you need the first to link the second, you cannot get out of the deadlock with --no-undefined turned on. This, and the executable-hosted plug-ins case, are the only two reasons I find valid for not using --no-undefined by default — but unfortunately they are not the only two used.
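For the reciprocating-libraries case, a minimal sketch (the library names are made up): each library calls into the other, so whichever one you try to link first cannot have all of its references satisfied, and --no-undefined cannot be used for either:

/* libfoo.c - built as libfoo.so; calls into libbar */
extern void bar_notify(const char *event);

void foo_start(void) {
    bar_notify("foo started");
}

/* libbar.c - built as libbar.so; defines bar_notify() but also calls back
   into libfoo, so it cannot be fully linked before libfoo, and vice versa */
extern void foo_start(void);

void bar_notify(const char *event) {
    (void)event;
}

void bar_restart(void) {
    foo_start();
}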

So, what about that lazy stuff? Well, the dynamic loader has to perform a "binding" of the undefined symbols to their definitions; binding can happen in two main modes: immediate ("now") or lazy, the latter being the default. With lazy binding, the loader will not try to find the definition to bind to a symbol until it's actually needed (that is, until the function is called, or the data is read or written); with immediate binding, the loader will iterate over all the undefined symbols of an object when it is loaded (loading up its dependencies if necessary). As you might guess, if there are undefined, unresolved symbols, the two binding types behave very differently. An immediately-bound executable will fail to start, and an immediately-bound library will fail dlopen(); a lazily-bound executable will start up fine and abort as soon as a symbol is hit that cannot be resolved, and a lazily-bound library will simply make its host program abort in the same way. Guess what's safer?
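To make the two binding modes concrete, here is a hedged sketch of a host program loading a plug-in (file and symbol names are invented): RTLD_NOW forces all symbols to be resolved during dlopen(), so a missing symbol shows up as a dlopen() failure right away instead of an abort later on, while RTLD_LOCAL keeps the plug-in's symbols out of the global scope:

#include <dlfcn.h>
#include <stdio.h>

int main(void) {
    /* immediate binding: unresolved symbols make dlopen() fail right here */
    void *handle = dlopen("./plugin.so", RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    /* look up the plug-in entry point explicitly */
    void (*init)(void) = (void (*)(void))dlsym(handle, "plugin_init");
    if (init != NULL)
        init();

    dlclose(handle);
    return 0;
}

(Building this host may require linking with -ldl on glibc versions that keep the dlopen() family in a separate library.)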

With all these catches and issues, you can see why undefined symbols are a particularly nasty situation to deal with. To the best of my knowledge, there isn't a real way to inspect an object post-mortem to make sure that all its symbols can be resolved. I started writing support for that in Ruby-Elf, but the results weren't really… good. Lacking that, I'm not sure how we can proceed.

It would be possible to simply change the default to --no-undefined, and work around the few packages that require undefined symbols by passing --undefined back (we decided to proceed that way with Ruby); but given the kind of support I've received before for my drastic decisions, I don't expect enough people to help me tackle that anytime soon — and I don't have the actual time to work on it, as you might guess.

Let’s call a spade a spade

Some people thought that my previous blog post about the libpng debacle was meant as an attack on, or a derision of, the work and effort that Samuli put into getting us out of the libpng-1.2 mess we were in. Let me be clear: it wasn't. I'm glad that Samuli is there; without him I would probably have left Gentoo a long time ago, frustrated by nothing happening.

But I think that we shouldn't hide our heads in the sand and keep repeating "it's all good, it's all good". It isn't.

Samuli did the best he could to get us out of the trouble, but the problem is much bigger than what a single person – or two, three, even a dozen – can properly tackle with all the possible bases covered, at least not without consensus among the whole developer body, which isn't there.

But first, I have to say that there is one thing I'm going to maintain was done wrong, in the rush of the moment: stabling libpng-1.4 as part of a security fix. That was simply reckless. The fault does not lie with a single person, though, but rather with the general spirit of avoiding extra work… still, reckless or not, it's done and we have to live with it, and learn from it.

And learning we seem to be; there is finally enough traction for --as-needed to become the default, as I wrote recently, and I have to thank Samuli and Kacper, without whom we wouldn't be able to reach that point at all. I unfortunately still don't see the same traction behind the idea of dropping .la files. Removing them from gtk+, which doesn't install any static library and thus does not need the .la files at all, would have solved most, if not all, of the problems people had with the upgrade…

Oh, and by the way, the update script for libpng is a hack, and it will leave behind .la files once packages start dropping them, since it changes their checksum and timestamp without updating the package database. The same is true for the (generic) lafilefixer run, which is why I'll recommend once again applying the incremental approach, as described in the two posts linked at the beginning.

Finally, libpng-1.5 is going to be released sometime soon… either we make a plan now, or we're going to suffer through another identical pain soon. And libpng is known for having security issues quite often… I already sent Samuli the plan I was thinking about this morning; I'll write more details about it as I find the time.

Stable users’ libpng update

Seems like my previous post didn't make enough of a fuss to get other developers to work on feasible solutions to keep the problem from hitting stable users… and now we're back to square one for stable users.

Since I also stumbled across two problems today while updating my stable chroots and containers – which mirror the local installs of remote vservers, plus a couple of testing environments for my work – I guess it's worth writing up a couple of tricks you might want to know before proceeding.

Supposedly, you should be able to properly complete the update without running the libpng-1.4.x-update.sh hack (and this is important, because that hack will create a number of problems in the long run, so please try to avoid it!). If you have been using --as-needed for a decent amount of time, the update should be almost painless. Almost.

I maintained that revdep-rebuild should be enough to take care of the update for you, but there are a few catches here that make it slightly more complex. First of all, the libpng-1.4 package will try to "preserve" the old library by copying it inside its own image, avoiding dynamic link breakage. This supposedly makes for a better user experience, as you won't hit packages that fail to start up because of missing libraries, but it has two effects: one is that you may be running a program with both libpng objects loaded, which is not safe; the second is that this way revdep-rebuild will not pick up the broken binaries at all.

Additionally, there is a slot of the package that brings in only the library itself, so that binary-only packages linked against the old libpng can still be used; if you have packages such as Opera installed, you might have this slot brought in on your system; this further complicates matters because it will then collide with libpng-1.4… bad thing.

These are my suggested instructions:

  • get a console login and make sure that GNOME, KDE, or any other graphical interface is not running; this is particularly important because you might otherwise experience applications crashing mid-run;
  • emerge -C =libpng-1.2* makes sure that you don't have the old library around; this works both for the old complete package and for the new library-only binary-compatibility package;
  • rm -f /usr/lib/libpng12.so* (replace lib/ with lib64/ on x86-64 hosts) ensures you won't have the old libraries around at all; this should actually be a no-op since you just removed the package, but it makes sure none are left behind if you had already updated;
  • emerge -1 =libpng-1.4* installs libpng-1.4 without preserving the libraries above; if you had already updated, please do this anyway, so that it registers the absence of the preserved libraries;
  • revdep-rebuild -- --keep-going shouldn't stop anywhere now, but since it might, it's still a good idea to let it build as much as it can.

Also make sure you follow my suggestion of running lafilefixer incrementally after every merge, by adding this snippet to your /etc/portage/bashrc; that way you won't risk too much breakage when the .la files get dropped (which I hope we'll start doing systematically soon):

post_src_install() {
    lafilefixer "${D}"
}

Important, if you're using binary packages! Make sure to re-build =libpng-1.4* after you delete the files, if you had updated it before; otherwise the package will have preserved the old libraries and packed them up in the tbz2 file, reinstalling them every time you merge the binary.

This post is brought to you by the guy who has been working for the past four years to make sure this problem is reduced to a manageable size, and who has been attacked, defamed, insulted and so on and so forth for expecting other developers to spend more time testing their packages. If you find this useful, you might want to consider thanking him somehow…

Enabling --as-needed, whose task is it?

A "fellow" Gentoo developer today told me that we shouldn't try to get --as-needed working because it's a task for upstream (he actually used the word "distributors" to mean that neither Gentoo nor any other vendor should do it, silly him)… this developer will go unnamed, because I'll also complain right away that he suggested I give up on Gentoo when I gave one particular reason (which I'll repeat in a moment) for us to do it instead. Just so you know, if it were up to me, that particular developer would have his access revoked right now. Luckily for him, I have no such powers.

Anyway, let me try to put in proper writing why it should fall to Gentoo to enable that by default to protect our users.

The reason I gave above (it was a Twitter exchange, so I couldn't articulate it completely) is that "Fedora does it already". To be clear, both Fedora and SUSE/Novell do it already. I'm quite sure Debian doesn't, and I guess Ubuntu doesn't either, to keep in line with Debian. And the original hints we took to start with --as-needed came from AltLinux. This alone means there is quite a bit of consensus out there that it is a task for distributors to look at. And it should say a lot that the problems solved by --as-needed are marginal for binary distributions like the ones I named here; they all do it simply to reduce the time needed to rebuild their binary packages, rather than to avoid breaking users' systems!

But I'm the first person to say that the phrase "$OtherDistribution does it, why don't you?" is bogus and can actually cause more headaches than it solves — although most of the time it is used when $OtherDistribution is Ubuntu or a relative of theirs. I seriously think that we should take a few more hints from Fedora; not clone them, but they do have very strong system-level developers working on their distribution. But that's a different point altogether by now.

So what other reasons are there for us to provide --as-needed rather than upstream? Well, actually this is the wrong question; the one you should formulate is "why is it not upstream's task to use --as-needed?". While I have pushed --as-needed support into a few of my upstream packages before, I think by now that it's not the correct solution. It all boils down to who knows best whether it's safe to enable --as-needed or not. There are a few things you should assess before enabling it:

  • does the linker support --as-needed? it's easier said than done; the linker might understand the flag, but supporting it is another story; there are older versions of ld still in active use that will crash when using it, and others with a too-greedy --as-needed that will drop libraries that are actually needed — only recently was the softer implementation added; while upstream could check for a particular ld version, what about backports?
  • do the libraries you link to link to all of their own needed dependencies? one of the original problems with --as-needed, when it was introduced to the tree, was that you'd have to rebuild one of the dependencies because it relied on transitive linking, which --as-needed disallows (especially in its original, un-softened form); how can a given package make sure that its dependencies are all fine before enabling --as-needed?
  • does the operating system support --as-needed at all? while Gentoo/FreeBSD uses a modern binutils, and (most of) the libraries are built so that all the dependencies are linked in for --as-needed support, using it is simply not possible (or at least wasn't possible…), because gcc will not allow linking the threading libraries in, for compatibility with pre-libmap.conf FreeBSD versions; this has changed recently for Gentoo, since we only support more recent versions of the loader that don't have that limitation; even so, how can upstream know whether the compiler already has the fix or not?

Of course you can "solve" most of these doubts by running runtime tests; but is that what upstreams should do? Running multiple tests from multiple packages requires sharing the knowledge, and risks the tests getting out of sync with one another; you get redundant work… when instead the distributor can simply decide whether using --as-needed is part of their priorities or not. It definitely is for Fedora, SUSE, AltLinux… it should be for Gentoo as well, especially as a source-based distribution!

Of course, you can find some individual cases where --as-needed will not work properly; PulseAudio is unfortunately one of those, and I haven't had the time to debug binutils to see why the softer rules don't work well there. But after years of working on this idea, I'm very sure that only a very low percentage of packages fail to work properly with it, and we should not be taken hostage by a handful of packages out of over ten thousand!

But when you don't care about the users' experience, when you're just lazy or you think your packages are "special" and deserve to break any possible rule, you can allow yourself to ignore --as-needed. Is that the kind of developer Gentoo should have, though? I don't think so!

Depend on what you use


To this day we still get --as-needed failures for packages in Gentoo, both for new packages and for version bumps. To this day, checking the list of reverse dependencies of libpng is not enough to ensure that all the packages build fine with libpng-1.4 (as Samuli found out the hard way). One common problem in both cases is missing dependencies, which in large part are caused by transitive dependencies.

Transitive dependencies are those caused by indirect linking; since I don't want to bore you all by repeating myself, you can read about them in this post and this one and finally another one — yes, I wrote a lot about the matter.

How do transitive, indirect dependencies cause trouble both with --as-needed and with upgrade verification? Well, it depends on a number of factors actually:

  • the build might work on the developers' systems because the libraries explicitly linked against indirectly bring in the actually-needed libraries, either through DT_NEEDED entries or through libtool archives; but since those explicitly-linked libraries aren't used directly, --as-needed drops them and the link breaks — a misguided link;
  • the build might work properly because some of the used (and linked-to) libraries optionally use (and link to) the otherwise missing libraries; this works only as long as they are not built without that support; for instance you might use OpenSSL and Curl, separately, then link to Curl only, expecting it to bring in OpenSSL… but Curl might be built against GnuTLS or NSS, or neither (see the sketch after this list);
  • the build might work or not depending on the versions of the libraries used, because one of the linked libraries might have replaced one dependency with another, dropping a needed library from the final link.
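A hedged sketch of the second case above (the libraries and the URL are only an example): this program uses both libcurl and OpenSSL directly, so its build system must check for, and link, both of them, instead of hoping that linking curl happens to drag OpenSSL in:

/* direct-deps.c - hypothetical example; link with -lcurl -lcrypto */
#include <curl/curl.h>
#include <openssl/sha.h>
#include <stdio.h>

int main(void) {
    /* direct use of OpenSSL: needs -lcrypto no matter how curl was built */
    unsigned char digest[SHA256_DIGEST_LENGTH];
    SHA256((const unsigned char *)"hello", 5, digest);
    printf("first digest byte: %02x\n", digest[0]);

    /* direct use of libcurl: needs -lcurl */
    CURL *curl = curl_easy_init();
    if (curl != NULL) {
        curl_easy_setopt(curl, CURLOPT_URL, "https://www.example.org/");
        curl_easy_perform(curl);
        curl_easy_cleanup(curl);
    }
    return 0;
}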

The same rules generally apply to the DEPEND and RDEPEND variables; relying on another package to bring in your own dependencies is a bad idea; using GTK+ doesn't mean you can avoid listing libpng as a dependency, if you use it directly. On the other hand, listing libpng merely because it appears in the final link (especially when not using --as-needed), without using it directly, is also a bad idea which you should definitely avoid.

By ignoring transitive dependencies, you invalidate the dependency tree, which means we cannot rely on it when we're trying to avoid huge fuckups after an important package changes API and ABI. This is why I (wrongly) snapped at Samuli for closing the libpng-1.4 tracker before I had the chance to run it through the tinderbox.

Bottom line: please always depend on what you use directly, both in linking and in ebuilds. Please!

Thanks to Odi for letting me know that I used (consistently) the wrong word in the article. This goes to show that I either should stop writing drafts at 3am or I should proofread them before posting.

What the tinderbox is doing for you

Since I've asked for some help with the tinderbox, including some pretty expensive hardware (the registered ECC RAM), I should probably also explain properly what my tinderbox is doing for the final users. Some of it is what any tinderbox would do, but as I wrote before, a single tinderbox is not enough, since it only ties into a specific target and there are more cases to test.

The basic idea of the tinderbox is that it's better if I – as a developer – file bugs for the packages, rather than users: not only do I know how to file the bug, and often how to fix it as well, but there is also a chance that the issue will be fixed before any user hits it (and is thus turned away by the failure).

So let's see what my effort is targeted at. First of all, the original idea this started from was to assess the --as-needed situation, to have some data about the related problems. From that idea things started expanding: while I am still running the --as-needed tests (and this is pretty important, because new stuff keeps being added that fails to build with --as-needed!), I also started seeing failures with new versions of GCC and new versions of autotools, and after a while this turned into a full-blown QA project for me.

While, for instance, Patrick's tinderbox focuses on the stable branch of Gentoo, I'm actually focusing on the unstable and "future" branches. This means that it runs over the full ~arch tree (which is much bigger than the stable one), and sometimes even uses masked packages, such as new GCC or autoconf versions. This way we can assess whether unmasking a toolchain package is viable, or whether there are still too many things failing to work with it.

But checking for these things also became just the preamble for more bugs to come: since I'm already building everything, it's easy to find any other kind of failure: kernel modules not working with the latest kernel version, minimal flag combinations failing, and, probably most importantly, parallel make failures (since the tinderbox builds in parallel). And from there to adding more QA safety checks, the step was very short.

Nowadays, among the other things, I end up filing bugs for:

  • fetch failures: since the tinderbox has to download all the packages, it has found more than a couple of packages that fail to fetch entirely — while the mirroring system should already report failed fetches for mirrored packages, those with mirror restrictions don't go through it;
  • the --as-needed failures: you probably already know why I think this flag is a godsend, and yet not all developers use it to test their packages before adding them to Portage;
  • parallel make failures, and parallel make misses: not only does the tinderbox run with parallel make enabled, and thus hits the parallel make failures (which get worked around, with a bug kept open to track them, so users don't hit them), but thanks to Kevin Pyle it also checks for direct make calls, which is a double win: stuff that would otherwise not use parallel make does so in the tinderbox, and I can file a bug to fix it to use either emake or emake -j1;
  • new (masked) toolchains: GCC 4.4, autoconf 2.65, and so on: building with them while they are still masked helps identify the problems way before users might stumble into them;
  • failing test suites: this was added together with GCC 4.4, since the new strict-aliasing rules caused quite a few packages to fail at runtime rather than at build time; while this does not make the coverage perfect, it's also very important for identifying --as-needed failures in libraries not using --no-undefined;
  • invalid directories: some packages install documentation and man pages out of place; others still refer to the deprecated /usr/X11R6 path; since I'm installing all the packages fully, it's easy to check for such cases; while these are often petty problems, the same check identifies misplaced Perl modules, which is a prerequisite for improving Perl handling in Gentoo;
  • pre-stripped files in packages: some packages, even when compiling from source code, strip files before installing them; this is bad when you need to track down a crash, because just setting -ggdb in your flags is not going to be enough;
  • maintainer-mode rebuilds: when a package triggers a maintainer-mode rebuild, it often executes configure twice, which is pretty bad (sometimes it takes longer to run that script than to build the package); most users won't care to file bugs about these, since they probably wouldn't know what they are either, while I can usually do that in batches;
  • file collisions and unexpected automagic dependencies are also tested for; while Patrick's tinderbox cleans the root at every merge, like binary distributions do on their build servers, and thus finds missing mandatory dependencies, my approach is the same as that of Gentoo users: install everything together; this helps find file collisions between packages, as well as some corner cases where packages fail only when another package is installed.

There are more cases for which I file bugs, and you can even see that, thanks to the tinderbox, quite a few packages were dropped entirely as unfixable. The whole point of it is to make sure that what we package and deliver to users builds, and especially builds as intended, respecting policies and not just pretending to work.

Now that you know what it's used for… please help.

Tinderbox — Correcting the aim

While I haven't been able to do much major Gentoo work lately (though I'm still fixing and bumping my own packages), I'm still working hard on the tinderbox. As I've said a few times, my effort never started with the intention of being a full-blown tinderbox; it just grew from a few tests into what it is now. The work is hard, long, and heavy, but it gets results, I hope.

Starting from this month and stretching into October, quite a few packages will be removed from the tree since they don't build – some haven't built in months, if not years – and this should also mean a faster emerge --sync (or at least one that doesn't get excessively longer over time – the main problem is that the tree keeps growing, so we have to cut out the dead branches).

Also, most of the issues related to the GCC 4.4 unmasking are now pretty well known and should be fixed, or the offending packages booted out, in a short while. The main trackers – for fortified sources, GCC 4.4, glibc 2.10, --as-needed and so on – are, yes, pretty huge, but they are also being emptied little by little. This will hopefully bring us to a much cleaner, sleeker tree.

Last night I went around fetching all the packages in the tree, to make sure the tinderbox wouldn't sit waiting for downloads to complete; this alone was quite a task, because it had to process over 5700 packages (those queued up to be installed), and it turned up quite a few issues by itself, like fetch-restricted packages that didn't use elog to output the fetch error – so you wouldn't have any trace of why they failed to fetch! All the packages that require registration to fetch (or whose download site was unavailable) are now masked in the tinderbox, so they won't be tried (and neither will the packages that depend on them).

There are still quite a few obstacles to total coverage of the tree by the tinderbox; collisions are one of them, but there are also problems with cyclic dependencies, especially among Python and Ruby packages that are used for testing and have circular dependencies on another package. All of these are currently handled by the tinderbox at a huge performance cost. Another problem is dependencies that fail to build; for instance mplayer is currently failing to build because of some assembly mess-up, and that in turn causes a domino effect for the huge number of other packages that try to merge mplayer in.

To reduce the amount of stuff that will fail to merge, I'm also going to mask packages in the tinderbox when I know they will cause other stuff to fail; this is the case for mplayer, for instance, but also for PHP's pear, which fails (without any kind of error message) for each and every package.

Hopefully, one day most of the tree will build fine and will not require me to do this kind of work…

Does --as-needed linking make software faster?

Christian Ruppert (idl0r) asked me today whether --as-needed makes software faster or smaller. I guess this is one of the most confusing points about --as-needed: it is the focus of tons of hearsay, with different assertions all over the place. So I'll do my best to explain this once and for all.

Under perfect conditions – that is, if every link line already listed only the libraries actually needed – --as-needed wouldn't change anything at all. The flag is not magical; it does not optimise the output in any way. You would get the same exact result if libtool weren't pushing all kinds of crap down the line, and if all the build systems only requested the correct dependencies.

Where it does matter is when overlinking is present. To understand what the term overlinking refers to, check my old post that explains a bit of what --as-needed does, and shows the overlinking case, the perfect link case, and what really happens.

Now, of course you'll find reports of users saying that --as-needed makes software faster or smaller. Is this true, or false? It's not easy to give one straight answer, because it depends on what happens with and without the flag. If, without the flag, libraries get loaded – directly or indirectly – that are not used (neither directly nor indirectly), then with --as-needed the process spawned from the executable will have fewer libraries loaded into its address space, and will thus both be faster to load (no need to read, map into memory, and relocate those libraries) and smaller in memory (some libraries are "free" in memory, but most will have relocations, especially if immediate bindings ("now" bindings) are used, as happens for setuid executables).

Indeed, the biggest improvements when comparing the with and without cases show up in a system, or in software, that uses immediate bindings. In that case, all the symbols from the loaded shared objects are bound at load time rather than on first use, so not loading the unneeded libraries cuts the processes' startup time considerably. This does not only involve hardened systems or setuid binaries, but also applications using plug-ins, which may request immediate bindings (to reject a plug-in, rather than abort at runtime, in case of missing symbols).