Redundant symbols

So I’ve decided to dust off my link collision script and see what the situation is nowadays. I’ve made sure that all the suppression files use non-capturing groups in their regular expressions – as that should improve the performance of the regexp matching – made the script more resilient to issues within the files (metasploit ELF files are barely valid), and ran it through.

Well, it turns out that the situation is bleaker than ever. Besides the obvious number of symbols with too-common names, there are still a lot of libraries and programs exporting default bison/flex symbols the same way I found them in 2008:

Symbol yylineno@ (64-bit UNIX - System V AMD x86-64) present 59 times
Symbol yyparse@ (64-bit UNIX - System V AMD x86-64) present 53 times
Symbol yylex@ (64-bit UNIX - System V AMD x86-64) present 49 times
Symbol yy_flush_buffer@ (64-bit UNIX - System V AMD x86-64) present 48 times
Symbol yy_scan_buffer@ (64-bit UNIX - System V AMD x86-64) present 48 times
Symbol yy_scan_bytes@ (64-bit UNIX - System V AMD x86-64) present 48 times
Symbol yy_scan_string@ (64-bit UNIX - System V AMD x86-64) present 48 times
Symbol yy_create_buffer@ (64-bit UNIX - System V AMD x86-64) present 47 times
Symbol yy_delete_buffer@ (64-bit UNIX - System V AMD x86-64) present 47 times

Note that a symbol has to be exported by at least one library to be listed in this output; indeed these symbols are present in quite a long list of libraries. I’m not going to track down each and every one of them, but I guess I’ll keep an eye on that list so that, if problems arise, they can easily be tracked down to this kind of collision.

Action Item: I guess my next post is going to be a quick way to handle building flex/bison sources without exposing these symbols, for both programs and libraries.
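As a preview, both tools already support renaming their generated symbols away from the default yy prefix; something along these lines (the foo_yy prefix is just an example, and older bisons use %name-prefix instead of api.prefix):

```
/* scanner.l — flex: rename the generated yy* symbols to foo_yy* */
%option prefix="foo_yy"

/* parser.y — bison: the same idea for the parser side */
%define api.prefix {foo_yy}
```

Combined with one of the symbol-hiding techniques discussed further down, this keeps the parsers from stepping on each other.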

But this is not the only issue — I’ve already mentioned a long time ago that a single installed system already brings in a huge number of redundant hashing functions; on the tinderbox as it was when I scanned it, there were 57 md5_init functions (and this without counting different function names!). Some of this I’m sure boils down to gnulib making it available, and the fact that, unlike the BSD libraries, GLIBC does not have public hashing functions — using libcrypto is not an option for many people.

Action item: I’m not very big on benchmarks myself; I never understood the proper way to go about getting real data rather than being fooled by the scheduler. Somebody who’s more adept at that might want to gather a bunch of libraries providing MD5/SHA1/SHA256 hashing interfaces, and produce some graphs that can let us know whether it’s time to switch to libgcrypt, or nettle, or whatever else provides us with good performance as well as a widely-compatible license.
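For what it’s worth, the skeleton of such a benchmark is simple; this sketch only exercises Ruby’s bundled Digest implementations – so it says nothing about libgcrypt or nettle – but the structure (fixed input, repeated runs per algorithm) would be the same:

```ruby
require 'digest'
require 'benchmark'

# Fixed input so every algorithm hashes exactly the same data.
data = "x" * (1 << 20)  # 1 MiB
runs = 100

Benchmark.bm(8) do |bm|
  bm.report("MD5")    { runs.times { Digest::MD5.digest(data) } }
  bm.report("SHA1")   { runs.times { Digest::SHA1.digest(data) } }
  bm.report("SHA256") { runs.times { Digest::SHA256.digest(data) } }
end
```

Swapping the Digest calls for bindings to each candidate library would give the comparison I’m asking for.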

The presence of duplicates of memory-management symbols such as malloc and company is not that big of a deal, at first sight. After all, we have a bunch of wrappers that use interposing to account for memory usage, plus another bunch to provide alternative allocation strategies that should be faster depending on the way you use your memory. The whole thing is not bad by itself, but when one of graphviz’s libraries (libgvpr) exposes malloc, something sounds wrong. Indeed, when even after updating my suppression filter to ignore the duplicates coming from gperftools and TBB I still get 40 copies of realloc(), something sounds extremely wrong:

Symbol realloc@ (64-bit UNIX - System V AMD x86-64) present 40 times

Now, it is true that it’s possible, depending on the usage patterns, to achieve a much better allocation strategy than the default coming from GLIBC — on the other hand, I’m also pretty sure that GLIBC’s own allocator has improved a lot in the past few years, so I’d rather use the standard allocator than a custom one that is five or more years old. Again, this could use some working around.

In the list above, Thunderbird and Firefox for sure use (and for whatever reason re-expose) jemalloc; I have no idea if libhoard in OpenFOAM is another memory management library (and whether OpenFOAM is bundling it or not), and Mercury is so messed up that I don’t want to ask myself what it’s doing there. There are though a bunch of standalone programs listed as well.

Action item: go through the standalone programs exposing the memory interfaces — some of them are likely to bundle one of the already-packaged memory libraries, so just make them use the system copy (so that improvements in the library trickle down to the program); for those that use custom strategies, consider making them optional, as I’d expect most not to be very useful to begin with.

There is another set of functions, similar to the memory management functions, which is usually brought in by gnulib; these are convenience wrappers that do error checking over the standard functions — xmalloc and friends. A quick check shows that these are exposed a bit too often:

Symbol xmemdup@ (64-bit UNIX - System V AMD x86-64) present 37 times

In this case they are exposed even by the GCC tools themselves! While this brings me again to complain that gnulib should actually be a dynamically-linked libgnucompat, there is little we can do about these in programs — but the symbols should not creep into system libraries (mandb has the symbols in its private library, which is marginally better).

Action item: check the libraries exposing the gnulib symbols, and make them expose only their proper interface, rather than every single symbol they come up with.

I suppose that this is already quite a bit of data for a single blog post — if you want a copy of the symbols’ index to start working on some of the action items I listed, just contact me and I’ll send it to you; it’s a bit too big to just publish as is.

Symbolism and ELF files (or, what does -Bsymbolic do?)

Julian asked by mail – uh.. last month! okay I have a long backlog – what the -Bsymbolic and -Bsymbolic-functions options to the GNU linker do. The answer is not extremely complicated but it calls for some explanation on how function calls in Unix work. I say in Unix, because there are a few differences in how OS X and Windows behave if I recall correctly, and I’m definitely not an expert on those. I wish I was, then I would be able to work on a book to replace Levine’s Linkers and Loaders.

PLT and GOT tables diagram

Please don’t complain about the very bad drawing above, it’s just going to illustrate what’s going on, I did it on my iPad with a capacitive stylus. I’ll probably try to do a few more of these, since I don’t have my usual Intuos tablet, and I won’t have it until I’ll find my place in London.

You see, the whole issue of linking in Unix is implemented with a long list of tables: the symbol table, the procedure linking table (PLT) and the global offset table (GOT). All objects involved in a dynamic linking chain (executables and shared objects, ET_EXEC and ET_DYN) possess a symbol table, which mixes defined (exported) and undefined (requested) symbols. Objects that are exporting symbols (ET_DYN, and ET_EXEC with plugins, callbacks or simply bad design) possess a PLT, and PIC objects (most ET_DYN, with the exception of some x86 prebuilt objects, and PIE ET_EXEC) possess GOTs.

The GOT and the text section

Let’s start from the bottom — that is, the GOT — or actually from before the GOT, with the executable code itself. As far as ELF is concerned, by default (there are a number of options that change this, but I don’t want to go there for now), data and function sections are completely opaque. Access to functions and data has to be done through start addresses: for non-PIC objects these are absolute addresses, as the objects are assumed to always be loaded at the same position; when using position-independent code (PIC), like the name hints, this position has to be ignored, so the position of a datum or function has to be derived as an offset from the object’s load address. When using non-static load addresses with non-PIC objects, you instead have to patch the code to use the new full address, which is what causes a text relocation (TEXTREL); that requires the ability to write to the executable segments, which is an obvious security issue.

So here’s where the global offset table enters the scene. Whenever you refer to a particular datum or function, you have an offset from the start of the containing section. This makes it possible for that section to be loaded at different addresses, and yet keep the code untouched. (Do note I’m simplifying the whole thing a lot, but I don’t want to go too much into detail, because half the people reading wouldn’t understand what I’m saying otherwise.)

But the GOT is only used when the datum or function is guaranteed to be in the same object. If you’re not using any special option to either compiler or linker, this means only static symbols are addressed directly through the GOT. Everything else is accessed through the object’s PLT, to which all the functions that the object calls are added. The PLT then has code to ask the dynamic loader what address the given symbol is defined at.

Global and local symbol tables

To answer that question, the dynamic loader has to have a global table which resolves symbol names to addresses. This is basically a global PLT, from a certain point of view. Depending on some settings in the objects, in the environment or in the loader itself, this table can be populated right when the application is executed, or only when the symbols are requested. For simplicity, I’ll assume that what happens is the former, as otherwise we’d end up in details that have nothing to do with what we were discussing to begin with. Furthermore, there is a different complication added by the modern GNU loader, which introduced indirect binding… it’s a huge headache.

While the same symbol name might have multiple entries in the various objects’ symbol tables, because more than one object is exporting the same symbol, in the resolved table each symbol name has exactly one address, which is found by reading the objects’ GOTs. This means that the loader has to solve in some way the collisions that happen when multiple objects export the same symbol. And also, it means that there is no guarantee by default that an object that both exports and calls a given symbol is going to call its own copy.

Let me try to underline that point: symbols that are exported are added to the symbol table of an object; symbols that are called are added to the symbol table as undefined (if they are not there already) and they are added to the procedure linking table (which then finds the position via its own offset table). By default, with no special options, as I said, only static functions are called directly from the object’s global offset table, everything else is called through the PLT, and thus through the linker’s table of resolved symbols. This is actually what drives symbol interposing (which is used by LD_PRELOAD libraries), and what caused ancient xine’s problems which steered me to look into ELF itself.

Okay, I’m almost at -Bsymbolic!

As my post about xine shows, there are situations where going through the PLT is not the desired behaviour, as you want to ensure that an object calls its own copy of any given symbol that is defined within itself. You can do that in many ways; the simplest option is not to expose those symbols at all. As I said, with default options only static functions are called straight through the GOT, but this can easily be extended to functions that are not exposed, which can be done either by marking the symbols as hidden (happens at compile time), or by using a linker script to only expose a limited set of symbols (happens at link time).

This is logical: the moment when the symbols are no longer exported by the object, the dynamic loader has no way to answer for the PLT, which means the only option you have is to use the GOT directly.

But sometimes you have to expose the symbols, and at the same time you want to make sure that you call your own copy and not any other interposed copy of those symbols. How do you do that? That’s where -Bsymbolic and -Bsymbolic-functions options come into play. What they do is duplicate the GOT entries for the symbols that are both called and defined in a shared object: the loader points to one, but the object itself points to the other. This way, it’ll always call its own copy. An almost identical solution is applied, just at compile-time rather than link-time, when you use protected visibility (instead of default or hidden).
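To make the mechanics concrete, here’s a toy model of the resolution logic — plain Ruby, not real loader code; the object names and string “addresses” are made up for illustration:

```ruby
# Toy model of dynamic symbol resolution. The loader's global table
# keeps the first definition of each name it sees, so an LD_PRELOADed
# object wins over a library loaded later.
objects = [
  { name: "preload.so", exports: { "memcpy" => "preload.so:memcpy" } },
  { name: "libxine.so", exports: { "memcpy" => "libxine.so:memcpy" },
    symbolic: true },
]

global_table = {}
objects.each do |obj|
  obj[:exports].each { |sym, addr| global_table[sym] ||= addr }
end

# How a call made from within +obj+ resolves +sym+:
def resolve(global_table, obj, sym)
  if obj[:symbolic] && obj[:exports].key?(sym)
    obj[:exports][sym]   # -Bsymbolic: bind to the object's own copy
  else
    global_table[sym]    # default: go through the loader's table
  end
end

puts resolve(global_table, objects[1], "memcpy")  # => libxine.so:memcpy
```

Flip the symbolic flag off and the same call resolves to the preloaded copy instead — which is exactly the interposition behaviour described above.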

Unfortunately this makes a small change in the semantics we’re used to: since the way the symbols are resolved varies depending on whether you’re referring to the symbol from within or outside the object, pointers taken within and outside will no longer match. While for most libraries this is not a problem, there are some cases where it really is. For instance in xine we hit a problem with the special memcpy() function implementation: it was a weak symbol, so simply a pointer to the actual function, which was being set within the libxine object… but would have been left unset for the external objects, including the plugins, for which it would still have been NULL.

Comparing function symbols is a rare corner case, but comparing objects’ addresses is common enough — if you’re trying to see whether a default, global object is being passed to your function instead of a custom one, having the addresses no longer match is a big pain. Which is basically why you have -Bsymbolic-functions: it’s exactly like -Bsymbolic, but limits itself to functions, whereas data objects are still handled as if no option was passed. It’s a compromise that makes it easier to avoid going through the PLT for everything, while not breaking as much code (it would then only break corner cases like xine’s memcpy()).

By the way, if it’s not obvious, the use of symbolic resolution is not only about making sure that the objects know which function they are calling; it’s also a performance improvement, as it avoids a round-trip to the dynamic loader and a lookup of where the symbol is actually defined. This is minor for most functions, but it can be visible if there are many, many functions being called. Of course it shouldn’t make much of a difference if the loader is good enough, but that’s a completely different story. As is the option for the compiler to emit two copies of a given function, to avoid doing the full preamble when it is called from within the object. And again for link-time optimization, which is connected, though indirectly, with what I’ve discussed above.

Oh, and if it wasn’t clear from what I wrote: you should not ever use the -Bsymbolic flag in your LDFLAGS variable in Gentoo. It’s not a flag you should mess with; only upstream developers should care about it.

A special kind of bundling

I know it was a very long time since I last posted about bundled libraries, and a long time since I actually worked on symbol collisions which is the original reason why I started working on Ruby-Elf — even though you probably wouldn’t tell nowadays, given how many more tools I implemented over the same library.

Since the tinderbox was idling due to the recent changes in distfiles storage, I decided to update the database of symbols. This actually uncovered more than a few bugs in my code, for which I should probably write a separate blog post. In the mean time I’m just going to ask here what I already asked on my streams and on Twitter: if you have access to an HP-UX machine, could you please check if there is an elf.h header, with a license permissive enough that I can look at it? I could fetch the entries from GNU binutils, but I’d rather not, since that would mean mixing and matching code licensed under GPL-2 (only) and GPL-3 — although arguably constant names shouldn’t be copyrightable.

The Ruby-Elf code will be pushed tomorrow, as today gitorious wasn’t very collaborative, and I’ll probably follow with a blog post on the topic anyway.

Once I was able to get the output of the harvest and analyse script, I found an interesting, albeit worrisome, surprise. A long list of packages use gnulib’s MD5 code. The problem is that gnulib is not your average utility library: it isn’t installed and linked to, it is imported into the sources of the project using it. The original reason to design it this way was that it would provide replacement functions for the GNU extension interfaces, or even standard interfaces, that aren’t present on some systems, so that you wouldn’t have to stick to a twenty-year-old standard when you could reduce the code logic by using modern interfaces.

What happens now? Well, gnulib carries not only replacement code for functions that are implemented in glibc but not on other systems, but also a long list of interfaces that are not implemented in glibc either! And, as it happens, even an MD5 implementation. That implementation is replicated at least 115 times in the tinderbox system, judging by the visible symbols — there might be a lot more, for when the symbols are hidden, or an executable doesn’t export them, my tools are not going to find them.

This use of gnulib is unlikely to go away anytime soon… and unfortunately, the more packages use gnulib, the more easily a security bug in gnulib would impact the distribution as a whole, for a very long time. People, can we stop using gnulib as if it were glib? Please? Just create a libgnutils or something, and make gnulib look for that before using its own modules, so that there is a chance to use a shared, common library instead… especially since some of the users of gnulib are libraries themselves, which causes the symbol collision problem that is the original reason why I’m looking at this code…

Sigh! Rant out….

Ruby-Elf and collision detection improvements

While the main use of Ruby-Elf for me lately has been quite different – for instance with the advent of elfgrep or helping verifying LFS support – the original reason that brought me to write that parser was finding symbol collisions (that’s almost four years ago… wow!).

And symbol collisions are indeed still a problem, and as I wrote recently they aren’t an easy sell to upstream developers, as they are mostly an indication of possible aleatory problems in the future.

At any rate, the original script ran overnight, generated a huge database, and then required even more time to produce a readable output, all of which happened using an unbearable amount of RAM. Between the ability to run it on a much more powerful box, and the work done to refine it, it can currently scan Yamato’s host system in … 12 minutes.

The latest set of changes, which replaced the “one or two hours” execution time with the current “about ten minutes” (for the harvesting part; two more minutes are required for the analysis), was part of my big rewrite of the script so that it uses the same common class interfaces as the commands that are installed to be used with the gem as well. In this situation, albeit keeping the current single-threaded design (more on that in a moment), each file analysed results in three calls to the PostgreSQL backend, rather than something in the ballpark of five, plus one per symbol, and this makes it quite a bit faster.

To achieve this, I first of all limited the round-trips between Ruby and PostgreSQL when deciding whether a file (or a symbol) had already been added. In the previous iteration I was already optimising this a bit by using prepared statements (which seemed slightly faster than direct queries), but they didn’t allow me to embed the logic into them, so I had a number of select and insert statements depending on the results of those. That was bad not only because each selection required converting data types twice (from the PostgreSQL representation to C, then from that to Ruby), but also because it required a call into the database each time.

So I decided to bite the bullet and, even though I know it makes for a bunch of spaghetti code, I’ve moved part of the logic into PostgreSQL through stored procedures. Long live PL/pgSQL.

Also, to make it more robust with respect to parsing errors on single object files, rather than queuing all the queries and then committing them in one big single transaction, I create a single transaction to commit all the symbols of an object, as well as one when creating the indexes. This allows me to skip over objects altogether if they are broken, without stopping the whole harvesting process.

Even after introducing the transaction on symbol harvesting, I found it much faster to run a single statement through PostgreSQL in a transaction, with all the symbols; since I cannot simply run a single INSERT INTO with multiple values (because I might hit a unique constraint, when the symbols are part of a “multiple implementations” object), I at least call the same stored procedure multiple times within the same statement. This had a tremendous effect, even though the database is accessed through Unix sockets!
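In pseudo-SQL, the batched form looks something like the following — add_symbol here is a stand-in name for the actual stored procedure, not the real interface, and the procedure itself is what handles the insert-unless-present logic:

```sql
BEGIN;
SELECT add_symbol(42, 'yyparse');
SELECT add_symbol(42, 'yylex');
SELECT add_symbol(42, 'yyerror');
COMMIT;
```

The point is that the whole batch travels to the backend in one exec call, instead of one round-trip per symbol.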

Since the harvest process now takes so little time to complete, compared to what it did before, I also dropped the split between harvest and analysis: analyse.rb is gone, merged into the harvest.rb script, for which I have to write a man page sooner or later, and get it installed properly as an available tool rather than an external one.

Now, as I said before, this script is still single-threaded; on the other hand, all the other tools are “properly multithreaded”, in the sense that their code fires up a new Ruby thread for each file to analyse, and the results are synchronised so as not to step on each other’s feet. You might know already that, at least for what concerns Ruby 1.8, threading is not really implemented — green threads are used instead — which means there is no real advantage in using them; that’s definitely true. On the other hand, on Ruby 1.9, even though the pure-Ruby nature of Ruby-Elf makes the GIL a main obstacle, threading would improve the situation by simply allowing threads to analyse more files while the pg backend gem sends the data over to PostgreSQL (which is probably also helped by the “big” transactions sent right now). But what about the other tools that don’t use external extensions at all?

Well, threading elfgrep or cowstats is not really any advantage on the “usual” Ruby versions (MRI 1.8 and 1.9), but it provides a huge advantage when running them with JRuby: since that implementation has real threads, it can scan multiple files at once (both when using asynchronous listing of input files through the standard input stream, and when providing all of them in one single sweep), and then only synchronise to output the results. This of course makes it a bit more tricky to be sure that everything is being executed properly, but in general makes the tools all the sweeter. Too bad that I can’t use JRuby right now for harvest.rb, as the pg gem I’m using is not available for JRuby; I’d have to rewrite the code to use JDBC instead.

Speaking about options passing, I’ve been removing some features I originally implemented. In the original implementation, the argument parsing was asynchronous and incremental, without limits to recursion; this meant that you could provide a list of files, each preceded by the at-symbol, as the standard input of the process, and each of those would be scanned for… the same content. This could have been bad already because of the possible loops, but it also had a few more problems, among which was the lack of a way to add a predefined list of targets if none was passed (which I needed for harvest.rb to behave more or less like before). I’ve since rewritten the targets’ parsing code to only work with a single-depth search, and to rely on asynchronous argument passing only through the standard input, which is only used when no arguments are given, either on the command line or by default of the script. It’s also much faster this way.

For today I guess all these notes about Ruby-Elf would be enough; on the other hand, in the next days I hope to provide some more details about the information the script is providing me.. they aren’t exactly funny, and they aren’t exactly the kind of things you wanted to know about your system. But I guess this is a story for another day.

Hide those symbols!

Last week I wrote in passing about my old link-collision script. Since then I restarted working on it, and I have a few more comments to share.

First of all, you might have seen the 2008 GhostScript bug — this is a funny story; back in 2008 when I started working on finding and killing symbol collisions between libraries and programs, I filed a bug with GhostScript (the AFPL version), since it exported a symbol that was present, with the same name, in libXfont and libt1. I found that particularly critical since they aren’t libraries used in totally different applications, as they are all related to rendering data.

At the time, the upstream GS developer (who happens to be one of the Xiph developers; don’t I just love those guys?) asked me to provide him with a real-world crash. Since any synthetic testcase I could come up with would look contrived, I didn’t really want to spend time trying to come up with one. Instead I argued the semantics of the problem, explaining why, albeit theoretical at that point, the problem should have been solved. To no avail: the bug was closed at the time with a threat that anyone reopening it would have their account removed.

Turns out in 2011 that there is a program that does link together both libgs and libt1: Evince. And it crashes when you try to render a DVI document (through libt1) containing an Encapsulated PostScript (EPS) image (rendered through the GhostScript library). What a surprise! Even though the problem was known, and one upstream developer (Henry Stiles) knows that the proper fix is using unique names for internal functions and data entries, the “solution” was limited to the one colliding symbol, leaving all the others around to cause problems in the future. Oh well.

Interestingly, most packages don’t seem to care about their internal symbols, be them libraries or final binaries. On final binaries this is usually not much of a problem, as two binaries cannot collide with one another, but it doesn’t mean that the symbol couldn’t collide with another library — for this reason, the script now ignores symbols that collide only between executables, but keeps listing those colliding with at least one library.

Before moving on to how to hide those symbols, I’d like to point out that the Ruby-Elf project page has a Flattr button, while the sources are on GitHub for those who are curious.

Update (2017-04-22): as you may know, Gitorious was acquired by GitLab in 2015 and turned down the service. So the project is now on GitHub. I also stopped using Flattr a long time ago.

You can now wonder how to hide the symbols; one way that I often suggest is to use the GCC-provided -fvisibility=hidden support — this is obviously not always an option as you might want to support older versions, or simply don’t want to start adding visibility markers to your library. Thankfully there are two other options you can make use of; one is to directly use the version script support from GNU ld (compatible with Sun’s, Apple’s and gold for what it’s worth); basically you can then declare something like:

  {
    global:
      func1;
      func2;
      func3;
    local: *;
  };

This way only the three named functions would be exported, and everything else will be hidden. While this option works quite nicely, it often sounds too cumbersome, mostly because version scripts are designed to allow setting multiple versions to the symbols as well. But that’s not the only option, at least if you’re using libtool.

In that case there are, once again, two separate options: one is to provide it with a list of exported symbols, similar to the one above but with one symbol per line (-export-symbols SYMBOL-FILE); the other is to provide a regular expression of symbols to export (-export-symbols-regex REGEX), so that you just have to name the symbols correctly to have them exported or not. This loses the advantage of multiple versions for symbols – but even that is a bit hairy, so I won’t go there – and gains the advantage of working when generating Windows libraries as well, where you have to list the symbols to export.
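In an automake-based build, the regex variant can be as simple as this fragment (assuming, hypothetically, that the library’s public interface is consistently prefixed with foo_):

```
# Makefile.am — only export the public foo_* symbols from the library
libfoo_la_LDFLAGS = -export-symbols-regex '^foo_'
```

Everything not matching the pattern stays internal, which is exactly what keeps the yy* and xmalloc-style symbols from leaking out.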

I’d have to add here that hiding symbols for executables should also reduce their startup time, as the runtime loader doesn’t need to look up a long list of symbols when preparing the binary to be executed; the same goes for libraries. So in a utopian world where each library and program only exports its tiny, required list of symbols, the system would also be snappier. Think about it.

More details about symbol collisions

You might remember that I was quite upset by the amount of packages that block each other when they install the same files (or files with the same name). It might not be obvious – and actually it’s totally non-obvious – why I would insist on these to be solved whenever possible, and why I’m still wishing we had a proper way to switch around alternative implementations without using blockers, so I guess I’ll explain it here, while I also discuss again the problem of symbol collisions, which is also one very nebulous problem.

I have to say, first, that the same problem with blockers is present also with conflicting USE requirements; the bottom line is the same: there are packages that you cannot have merged at the same time, and that’s a problem for me.

The problem could probably be solved just as well by changing the script I use for gathering ELF symbol information to check for collisions, but since that requires quite a bit of work just to work around this trouble, I’m fine with accepting a subset of packages, and ranting about the fact that we have no way (yet) to solve the problem of colliding packages and incompatible USE requirements.

The script, basically, roams the filesystem to gather all the symbols that are exported by the various ELF files, both executables and libraries, saves them into a PostgreSQL database, and then a separate script combines that data to generate a list of possibly colliding symbols. The whole point of this script is to find possible hard-to-diagnose bugs due to unwanted symbol interposing: an internal function or datum replaced with another, from another ELF object (shared library or executable), that uses the same name.
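Conceptually, the analysis step boils down to inverting a map from objects to exported symbols; stripped of the database and the suppression files, it looks something like this (the objects and symbols below are made up for illustration):

```ruby
# Sketch of the collision-detection step: which symbols are defined
# by more than one ELF object?
exports = {
  "libfoo.so"  => %w[foo_init foo_run md5_init],
  "libbar.so"  => %w[bar_frob md5_init],
  "frobnicate" => %w[main md5_init],
}

# Invert the mapping: symbol -> list of objects defining it.
collisions = Hash.new { |h, k| h[k] = [] }
exports.each do |object, symbols|
  symbols.each { |sym| collisions[sym] << object }
end

# Only symbols defined by more than one object are interesting.
collisions.select! { |_sym, objs| objs.size > 1 }

collisions.each do |sym, objs|
  puts "Symbol #{sym} present #{objs.size} times (#{objs.join(', ')})"
end
```

The real script adds the suppression rules on top (to drop willing interposers and plugin interfaces) and, as noted above, discards collisions that only involve executables.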

This kind of work is, as you can guess by now, hindered by the inability to keep all the packages merged at once, because then I cannot compare the symbols between them; at the same time, it’s also hindered by those bad packages that install ELF files in uncommon and wrong paths (like /usr/share) — running the script over the whole filesystem would solve that problem, but the amount of time required to run it that way is definitely going to be a problem.

On a different note, I have to point out that not all the cases where two objects export the same symbol are mistakes. Sometimes you’re willingly interposing symbols, like we do with the sandbox, and like google-perftools does with malloc and mmap; some other times, you’re implementing a specific interface, which might be that of a plugin but might also be required by a library, like the libwrap interface for allow/deny symbols. In some cases, plugins will not interfere with one another because they get loaded with the RTLD_LOCAL option, which tells the runtime loader that their symbols are not to be exported.

This makes the whole work pretty cumbersome: some packages will have to be ignored altogether, others will require specific rules, others will have to be fixed upstream, and some are actually issues with bundled libraries. The whole work is long and tedious, and will not bring huge improvements all around, but it’s a nice polishing action that any upstream should really consider, to show that they do care about the quality of their code.

But if I were to expect all of them to start actually making use of hidden symbols and the like, I’d probably be waiting forever. For now, I guess I’ll have to start with making use of these ideas on the projects I contribute to, like lscube. It might end up getting used more often.

What the tinderbox is doing for you

Since I’ve asked for some help with the tinderbox, including some pretty expensive hardware (the registered ECC RAM), I should probably also explain properly what my tinderbox is doing for the final users. Some of the stuff is what any tinderbox would do, but as I wrote before, a tinderbox alone is not enough, since it only ties into a specific target, and there are more cases to test.

The basic idea of the tinderbox is that it’s better if I – as a developer – file bugs for the packages, rather than users: not only do I know how to file the bug, and often how to fix it as well; there is also the chance that the issue will be fixed before any user hits it (and is thus turned away by the failure).

So let’s see what my effort is targeted at. First of all, the original idea from which this started was to assess the --as-needed problem and gather some data about the related failures. From that idea things started expanding: while I am still running the --as-needed tests (and this is pretty important because new stuff keeps getting added that fails to build with --as-needed!), I also started seeing failures with new versions of GCC, and new versions of autotools, and after a while this turned into a full-blown QA project for me.

While, for instance, Patrick’s tinderbox focuses on the stable branch of Gentoo, I’m actually focusing on the unstable and “future” branches. This means that it runs over the full ~arch tree (which is much bigger than the stable one), and sometimes even uses masked packages, such as GCC or autoconf. This way we can assess whether unmasking a toolchain package is viable or whether there are still too many things failing to work with it.

But checking for these things also became just the preamble for more bugs to come: since I’m already building everything, it’s easy to find any other kind of failure: kernel modules not working with the latest version, minimum-flag combinations failing, and probably most importantly parallel make failures (since the tinderbox builds in parallel). And from there to adding more QA safety checks, the step was very short.

Nowadays, among the other things, I end up filing bugs for:

  • fetch failures: since the tinderbox has to download all the packages, it has found more than a couple of packages that failed to fetch entirely — while the mirroring system should already report failed fetches for mirrored packages, those with mirror restrictions don’t run through that;
  • the --as-needed failures: you probably know already why I think this flag is a godsend, and yet not all developers use it to test their packages when they add it to portage;
  • parallel make failures, and parallel make misses: not only does the tinderbox run with parallel make enabled, and thus hit the parallel make failures (which are worked around to avoid users hitting them, keeping a bug open to track them), but thanks to Kevin Pyle it also checks for direct make calls, which is a double win: stuff that would otherwise not use parallel make does in the tinderbox, and I can file a bug to fix it to either use emake throughout or fall back to emake -j1;
  • new (masked) toolchains: GCC 4.4, autoconf 2.65, and so on: building with them when they are still masked helps identify the problems way before users might stumble into them;
  • failing testsuites: this was added together with GCC 4.4, since the new strict-aliasing system caused quite a few packages to fail at runtime rather than at build time; while this does not make the coverage perfect, it’s also very important for identifying --as-needed failures in libraries not using --no-undefined;
  • invalid directories: some packages install documentation and man pages out of place; others still refer to the deprecated /usr/X11R6 path; since I’m installing all the packages fully, it’s easy to check for such cases; while these are often petty problems, the same check identifies misplaced Perl modules which is a prerequisite for improving the Perl handling in Gentoo;
  • pre-stripped files in packages: some packages even when compiling source code tend to strip files before installing them; this is bad when you need to track down a crash because just setting -ggdb in your flags is not going to be enough;
  • maintainer-mode rebuilds: when a package causes a maintainer-mode rebuild, it often executes configure twice, which is pretty bad (sometimes it takes longer to run that script than to build the package); most users won’t care to file bugs about them, since they probably wouldn’t know what they are either, while I can usually do that in batch;
  • file collisions, and unexpected automagic dependencies are also tested; while Patrick’s tinderbox cleans the root at every merge, like binary distributions do on their build servers, and thus finds missing mandatory dependencies, my approach is the same as that of Gentoo users: install everything together; this helps find file collisions between packages as well as some corner cases where a package fails only if another package is installed.
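The direct-make check in the list above is essentially a textual scan of the build phase; a minimal sketch of the idea (hypothetical, not the actual implementation) could look like this:

```ruby
# Hypothetical sketch of the direct-make QA check: flag lines in an
# ebuild's build phase that call `make` directly instead of `emake`,
# since those silently skip parallel-make testing.
def direct_make_lines(phase_text)
  phase_text.each_line.with_index(1).select do |line, _no|
    stripped = line.strip
    stripped.start_with?("make ", "make\t") || stripped == "make"
  end.map { |_line, no| no }
end

uses_emake = "src_compile() {\n\temake || die\n}\n"
uses_make  = "src_compile() {\n\tmake all || die\n}\n"

puts direct_make_lines(uses_emake).inspect  # []
puts direct_make_lines(uses_make).inspect   # [2]
```

The real check of course has to care about continuation lines, variables and the like; this only shows the principle.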

There are more cases for which I file bugs, and you can even see that from the tinderbox quite a few packages were dropped entirely as unfixable. The whole point of it is to make sure that what we package and deliver to users builds, and especially builds as intended, respecting policies and not just feigning to work.

Now that you know what it’s used for… please help.

Bundling libraries: the curse of the ancients

I was very upset by one comment from Ardour’s lead developer Paul Davis in a recently reported “bug” about the un-bundling of libraries from Ardour in Gentoo. I was, to be honest, angry after reading his comment, and I was tempted to answer badly for a while; but then I decided my health was more important and backed away, thought about it, then answered how I answered (which I hope is diplomatic enough). Then I thought it might be useful to address the problem in a less concise way and explain the details.

Ardour is bundling a series of libraries; like I wrote previously, there are problems related to this, and we dealt with them by just unbundling the libraries; now Ardour is threatening to withdraw support from Gentoo as a whole if we don’t back away from that decision. I’ll try to address his comments in multiple parts, so that you can understand why it really upset me.

First problem: the oogie-boogie crashes

It’s a quotation of Adam Savage from MythBusters; watch the show if you want to know the details. I learnt about it from Irregular Webcomic years ago, but I only got to see it about six months ago, since in Italy it only airs on satellite pay TV, and the DVDs are not available (which is why they are in my wishlist).

Let’s see what exactly Paul said:

Many years ago (even before Gentoo existed, I think) we used to distribute Ardour without the various C++ libraries that are now included, and we wasted a ton of time tracking down wierd GUI behaviour, odd stack tracks and many other bizarre bugs that eventually were traced back to incompatibilities between the way the library/libraries had been compiled and the way Ardour was compiled.

I think that I have now coined a term for my own dictionary, and will call this the syndrome of oogie-boogie bugs, for each time I hear (or find myself muttering!) “we know of past bad behaviour”. Sorry, but without documentation, these things are like unprovable myths, just like the one Adam commented upon (the “pyramid power”). I’m not saying that these things didn’t happen; far from it, I’m sure they did. The problem is that they are not documented and thus are unprovable, and impossible to dissect and correct.

Also, I’m not blaming Paul or the Ardour team for being superficial, because, believe it or not, I suffer(ed, hopefully) from that syndrome myself: some time ago, I reported to Mart that I had maintainer mode-induced rebuilds on packages that patched both the autotools sources and the generated files, and that thus the method of patching both was not working; while I still maintain that it’s more consistent to always rebuild autotools (and I know I still have to write about why that is), Mart pushed me into proving it, and together we were able to identify the problem: I was using XFS for my build directory, which has sub-second mtime precision, while he was using ext3, with mtimes precise only to the second, so I was experiencing difficulties he would never have been able to reproduce on his setup.
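The effect is easy to illustrate with two hypothetical timestamps less than a second apart: a whole-second comparison (ext3 at the time) sees them as equal, while a sub-second one (XFS) sees an ordering, so a make-style “is the output older than the source?” check disagrees between the two filesystems:

```ruby
# Two events 0.4s apart, like a patch touching a generated file shortly
# after its source (the timestamps are made up for illustration).
def output_newer?(source_t, output_t, precision)
  case precision
  when :subsecond then output_t.to_f > source_t.to_f  # XFS-like mtime
  when :second    then output_t.to_i > source_t.to_i  # ext3-like mtime
  end
end

source_mtime = Time.at(1_230_000_000.2)
output_mtime = Time.at(1_230_000_000.6)

puts output_newer?(source_mtime, output_mtime, :subsecond) # true: rebuild triggers
puts output_newer?(source_mtime, output_mtime, :second)    # false: no rebuild
```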

Just to show that this goes beyond this kind of problem: since I joined Gentoo, Luca has told me to be wary about suggesting the use of -O0 when debugging, because it can cause stuff to miscompile. I never took his word for it, because that’s just how I am, and he didn’t have any specifics to prove it. Turns out he wasn’t that wrong after all, since if you build FFmpeg with -O0 and Sun’s compiler, it cannot complete the link. The reason is that with older GCC, Sun’s compiler, and others I’m sure, -O0 turns off the DCE (Dead Code Elimination) pass entirely, and causes branches like if (0) to be compiled anyway; FFmpeg relies on the DCE pass always happening. (There is more to say about relying on the DCE pass, but that’s another topic altogether.)

So again, if you want to solve bugs of this kind, you have to do just like the actual Mythbusters: document, reproduce, dissect, fix (or document why you have to do something rather than just saying you have to do it). Not having the specifics of the problem makes it an “oogie-boogie” bug, and it’s impossible to deal with it.

Second problem: once upon a time

Let me repeat one particular piece of the previous quote from Paul Davis (emphasis mine): “Many years ago (even before Gentoo existed, I think)”. How many years ago is that? Well, since I don’t want to track down the data on our own site (I have to admit I find it appalling that we don’t have a “History” page), I’ll go around quoting Wikipedia. If we talk about Gentoo Linux with this very name, the 1.0 version was released on March 31, 2002 (hey, it’s almost seven years ago by now). If we talk about Daniel’s project, Enoch Linux 0.75 was released in December 1999, which is more than nine years ago. I can’t seem to confirm Paul’s memories, since their Subversion repository seems to have discarded the history information from when they were in CVS (it reports the first commit in 2005, which is certainly wrong if we consider that Wikipedia puts their “Initial Release” in 2004).

Is anything the same as it was at that time? Well, most likely there are still pieces of code that are older than that, but I don’t think any of those are in actual use nowadays. There have been, in particular, a lot of transitions since then. Are difficulties found at that time of any relevance nowadays? I sincerely don’t think so. Paul also doesn’t seem to have any documentation of more recent occurrences of this, and just says that they don’t want to spend more time on debugging these problems:

We simply cannot afford the time it takes to get into debugging problems with Gentoo users only to realize that its just another variation on the SYSLIBS=1 problem.

I’ll go over that statement in more detail in the next problem, but for now let’s accept that there has been no documentation of new cases, and that all we’re dealing with here is old history. Let’s try to think about what that bad history was. We’re speaking about libraries, first of all; what does that bring us? If you’re an avid reader of my blog, you might remember what actually brought me to investigate bundled libraries in the first place: symbol collisions! Indeed this is very likely: if you remember, I did find one crash in xine due to the use of system FFmpeg, caused by symbol collisions. So it’s certainly not a far-fetched problem.

The Unix flat namespace for symbols is certainly one big issue that projects depending on many libraries have to deal with, and I admit there aren’t many tools that can deal with it. While my collision analysis work has focused up to now on identifying the areas of the problem, in the big scheme of things it only helps to find possible candidates for collision problems. This actually made me think that I should adapt my technique to identify problems on a much smaller scale, taking one executable as input and identifying duplicated symbols within it. I just added this to my TODO map.

Anyway, thinking about the amount of time passed since Gentoo’s creation (and thus since what Paul thinks is when the problems started to happen), we can see that there is at least one big “event horizon” in GNU/Linux since then (and for once I use this term, because it’s proper to use it here): the libc5 to libc6 migration; the HOWTO I’m linking, from Debian, was last edited in 1997, which puts it well within the timeframe that Paul described.

So it’s well possible that people at the time went to use libraries built for one C library with Ardour built with a different one, which would almost certainly have created subtle issues, difficult to identify (for a person not skilled with linkers at least), and it’s certainly not the only possible cause of similar crashes, or even worse, of unexpected behaviour. If we look again at Paul’s comment, he speaks of “C++ libraries”; I know that Ardour is written in C++, and I think I remember some of the bundled libraries being written in C++ too; I’m not sure he’s right to call all of them “C++ libraries” (C and C++ are two different languages, even if the foreign calling convention glue is embedded in the latter), but if even a single one is, it can open a different Pandora’s box.

See, if you look at GCC’s history, it wasn’t long before the Enoch 0.75 release that a huge paradigm shift started for Free Software compilers. The GNU C Compiler, nowadays the GNU Compiler Collection, was forked into the Experimental/Enhanced GNU Compiler System (EGCS) in 1997, which was merged back into GCC with the historic 2.95 release in April 1999. EGCS contained a huge amount of changes, many of them related to C++. But even that wasn’t near perfection; for many, C++ support was mostly ready for prime time only after release 3 at the earliest, so there were wild changes going on at that time. Libraries built with different versions of the compiler might well have had wildly differently built symbols with the same name, and even worse, they would have been using different STL libraries. Add to the mix the infamous 2.96 release of GCC as shipped by RedHat (I think the worst faux pas in the history of RedHat itself), with so many bugs due to backporting that a project I was working with at the time (NoX-Wizard) officially refused to support it, suggesting to use either 2.95 or 3.1. We even had an explicit #error if the 2.96 release was used!

A smaller-scale paradigm shift happened with the release of GCC 3.4 and the change from the old C++ ABI to the one we use nowadays. Mixing libraries using the two ABIs and the two STL versions caused obvious and non-obvious crashes; we still have software using the older ABI, and that’s why we have libstdc++-v3 around; Mozilla, Sun and Blackdown hackers certainly remember that time, because it was a huge mess for them. It’s one very common (and one of my favourite) arguments against the use of C++ for mission-critical system software.

Also, GCC’s backward compatibility is near non-existent: if you build something with GCC 4.3, without using static libraries, executing it on a system with GCC 4.2 will likely cause a huge amount of problems (forward compatibility is always ensured though). Which adds up to the picture I already painted. And do we want to talk about the visibility problem? (on a different note I should ask Steve for a dump of my old blog to merge here, it’s boring not remembering that one post was written on the old one).

I am thus not doubting Paul’s memories at all regarding problems with system libraries and so on and so forth. I would also stress another piece of his comment: “eventually were traced back to incompatibilities between the way the library/libraries had been compiled and the way Ardour was compiled”. I understand he might not actually just be referring to the compiler (and compiler version) used in the build, so I wish to point out two particular GCC options: -f(no-)exceptions and -f(no-)rtti.

These two options enable or disable two C++ language features: exception handling and run-time type information. I can’t find any reference to it in the current man page, but I remember that it warned that mixing code built with and without them in the same software unit was bad. I wouldn’t expect it to be any different now, sincerely. In general the problem is solved because each piece of software builds its own independent unit, in the form of an executable or shared object, and the boundary between those is subject to the contract that we call the ABI. Shared libraries built with and without those options are supposed to work fine together (though I sincerely am not ready to bet on it), but if the lower-level object files are mixed together, bad things may happen, and since we’re talking about computers, they will, at the moment you least want them to. It’s important to note here, for all the developers not expert with linkers, that static libraries (or more properly, static archives) are just a bunch of object files glued together, so linking something statically still means linking lower-level object files together.

So the relevance of Paul’s memories is, in my opinion, pretty low. Sure, shit happened, and we can’t swear that it’ll never happen again (most likely it will), but we can deal with that, which brings me to the next problem:

Third problem: the knee-jerk reaction

Each time some bug happens that is difficult to pin down, it seems like any developer tries to shift the blame. Upstream. Downstream. Sidestream. Ad infinitum. As a spontaneous reflex.

This happens pretty often with distributions, especially with Gentoo, which gives users “too much” freedom with their software, but most likely in general too, and I think this is the most frequent reason for bundling libraries. By using system libraries, developers lose what they think is “control” over their software, which in my opinion is often just sheer luck. Sometimes developers admit that their reason is just the desire to spend as little time as possible working on issues; some other times they try to explicitly move the blame onto the distributions or other projects, but at the end of the day the problem is just the same.

Free software is a moving target; you might develop software against one version of a library, not touch the code for a few months while it works great, and then a new version is released and your software stops working. And you blame the new release. You might be right (a new bug was introduced), or you might be wrong (you breached the “contract” called the API: some change happened, something that was not guaranteed to work in any particular way changed the way it worked, and you relied on the old behaviour). In either case, the answer “I don’t give a damn, just use the old version” is a sign of something pretty wrong with your approach.

The Free Software spirit should be the spirit of collaboration. If a new release of a given dependency breaks your software, you should probably just contact the author and try to work out between the two projects what the problem is; if it’s a newly introduced bug, make sure there is a testsuite, and that the testsuite includes a testcase for the particular issue you found. Writing testcases for bugs that happened in the past is exactly why testsuites are so useful. If the problem is that you relied on a behaviour that has changed, the author might know how not to rely on it and have the code work as expected, or might take steps to make sure nobody else tries that (either by improving the documentation or by changing the interface so that the behaviour is not exposed). Bundling the dependency, citing multiple problems and giving no option, is usually not the brightest step.

I’m all for giving working software to users by default, so I can understand bundling the library by default; I just think that it should either be documented why that’s the case, or there should be a chance of not using it. Someone somewhere might actually be able to find what the problem is. Just give them a chance. In my previous encounter with Firefox’s SQLite, I received a mail from Benjamin Smedberg:

Mozilla requires a very specific version of sqlite that has specific compiled settings. We know that our builds don’t work with later or earlier versions, based on testing. This is why we don’t build against system libsqlite by design.

They know, based on testing, that they can’t work with anything else. What that testing consists of, I still don’t know. Benjamin admitted he didn’t have the specifics, and referred me to Shawn Wilsher, who supposedly had more details, but he never got back to me with those. Which is quite sad, since I was eager to find what the problem was, because SQLite is one of the most frequent oogie-boogie sources. I even noted that the problem with SQLite seems to lie upstream, and I still maintain that in this case; while I said before that it’s a knee-jerk reaction, I have also witnessed more than a few projects having problems with SQLite, and I had my own share of headaches because of it. But this should really start making us think that maybe, just maybe, SQLite needs help.

But we’re not talking about SQLite here, and trust me that most upstreams will likely help you out to forward-port your code, fix issues, and so on. Even if you, for some reason I don’t want to talk about now, decided to change the upstream library after bundling it, often you can get it back to a vanilla state by pushing your changes upstream. I know it’s feasible even with the most difficult upstreams, because I have done just that with FFmpeg, with respect to xine’s copy.

But just so that we’re clear, it does not stop with libraries; the knee-jerk reaction happens with CFLAGS too: if you have many users reporting that wild CFLAGS break your software, the most common reaction is to just disallow custom CFLAGS, while the reasoned approach would be to add a warning and then start identifying the culprit; it might be your code assuming something that is not always true, or it might be a compiler bug; in either case the solution is to fix the culprit instead of just disallowing anybody from making use of custom flags.

Solution: everybody’s share

So far I have dissected Paul’s comment into three main problems; I could probably write more about each of them, and I might if the points are not clear, but the post is already long enough (and I didn’t want to split it up because it would take too long to become available), and I wanted to reach a conclusion with a solution, which is what I already posted in my reply to the bug.

The solution to this problem is to give everybody something to do. Instead of “blacklisting Gentoo” like Paul proposed, they should just do the right thing and leave us to deal with the problems caused by our choices and our needs. I have already pointed out some of these in my three-part article for LWN (part 1, part 2 and part 3). This means that if a user reports some weird behaviour using the Gentoo ebuild, your answer should not be “Die!” but “You should report that to the Gentoo folks over at their bugzilla”. Yes, I know it is a much longer phrase and requires much more typing, but it’s much more user-friendly and actually provides us all with a way to improve the situation.

Or you could also do the humble thing and ask for help. I have said it before: if you have a problem with anything I have written about, and have good documentation of what the problem is, you can write to me; of course I don’t always have time to fix your issues, and sometimes I don’t even have time to look at them in a timely fashion, I’m afraid, but I have never sent someone away because I didn’t like them. The problem is that most of the time I’m not asked at all.

Even if you might end up asking me some question that would seem very silly if you knew the topic, I’m not offended by those; just like I’d rather not be asked to learn all the theory behind psychoacoustics to find out why libfaad is shrieking my music, I don’t pretend that Paul knows all the ins and outs of linking problems to find out why the system libraries cause problems. I (and others like me) have the expertise to identify a collision problem relatively quickly; I should also be able to provide tools to identify that more quickly. But if I don’t know of the problem, I cannot magically fix it; well, not always at least.

So Paul, this is an official offer; if you can give me the details of even a single crash or misbehaviour due to the use of system libraries, I’d be happy to look into it.

The battles and the losses

In the past years I have picked up more than a couple of “battles” to improve Free Software quality all over. Some of these were controversial, like --as-needed, and some of them have been just lost causes (like trying to get rid of strict C++ requirements on server systems). All of those, though, were fought with the hope of improving the situation all over, and sometimes the few accomplishments were quite a satisfaction by themselves.

I always thought that my battle for --as-needed support was going to be controversial because it does make a lot of software require fixes, but strangely enough, the opposition has been reduced a lot. Most newly released software works out of the box with --as-needed, although there are some interesting exceptions, like GhostScript and libvirt. Among the positive exceptions, there is for instance Luis R. Rodriguez, who made a new release of crda just to apply an --as-needed fix for a failure that was introduced in the previous release. It’s very refreshing to see that nowadays maintainers of core packages like these are concerned with these issues. I’m sure that when I started working on --as-needed nobody would have made a new point release just to address such an issue.

This makes it much more likely for me to work on adding the warning to the new --as-needed, and makes it even more needed for me to find why ld fails to link PulseAudio libraries even though I’d have expected it to.

Another class of changes that I’ve been working on, and that has drawn more interest than I would have expected, is my work on cowstats, which, I’ll note out of self-interest, formed most of the changes in the ALSA 1.0.19 release for what concerns the userland part of the packages (see my previous post on the matter).

In this case, I wish first to thank _notadev_ for sending me Linkers and Loaders, which is going to help me improve Ruby-Elf more and more; thanks! And since I’m speaking of Ruby-Elf, I have finally decided its fate: it’ll stay. My reasoning is that, first of all, I was finally able to get it to work with both Ruby 1.8 and 1.9 by adding a single thin wrapper (which is going to be moved to Ruby-Bombe once I actually finish that); most importantly, the code is there, I don’t want to start from scratch, there is no point in that, and I think that both Ruby 1.9 and JRuby can learn from each other (the first by losing the Global Interpreter Lock and the other by trying to speed up its startup time). And I could even decide to find time to write a C-based extension, as part of Ruby-Bombe, that takes care of byteswapping memory, maybe even using OpenMP.

Also, Ruby-Elf has been serving its time a lot with the collision detection script, which would be hard to move to something different, since it really is a thin wrapper around PostgreSQL queries, and I don’t really like dealing with SQL in C. Speaking of the collision detection script, I stand by my conclusion that software sucks (but proprietary software stinks too).

Unfortunately, while there are good signs on the issue of bundled libraries, like Lennart’s concerns with the internal copies of libltdl in both PulseAudio (now fixed) and libcanberra (also staged for removal), the whole issue is not solved yet; there are still packages in the tree with a huge amount of bundled libraries, like Avidemux and Ardour, and more scream to enter (and thankfully they don’t always make it). -If you’d like to see the current list of collisions, I’ve uploaded the LZMA-compressed output of my script.- If you want, you can clone Ruby-Elf and send me patches to extend the suppression files, to remove further noise from the file.

At any rate, I’m going to continue my tinderboxing efforts while waiting for the new disks, and work on my log analyser again. The problem with that is that I really am slow at writing Python code, so I guess it would be much easier to reimplement, in Ruby, the few extra functions that I’m using out of Portage’s interface, or to find a way to interface with Portage’s Python interface from Ruby. This is probably a good enough reason for me to stick with Ruby: sure, Python can be faster, and sure, I can get better multithreading with C and Vala, but it takes me much less time to write these things in Ruby than it would take me in any of the other languages. I guess it’s a matter of mindset.

And on the other hand, if I have problems with Ruby I should probably just find time to improve the implementation; JRuby is enough evidence to show that my beef with the Ruby 1.9 runtime not supporting multithreading is an implementation issue and not a language issue.

I’m glad I’m not a DBA

Today, even though it’s New Year’s Eve, I spent it working just like any other day, looking through the analysis log of my linking collisions script to find some more crappy software in need of fixes. As it turns out, I found quite a bit of software, but I also confirmed to myself that I have crappy database skills.

The original output of the script, which already took quite a long time to process, didn’t sort the symbols by name, but just by count, so as to show the symbols with the most collisions first and the ones relating to only one or two files later. It also didn’t sort the names of the objects where the symbols could be found, which caused quite an issue, as the ordering changed from time to time, so the lists of objects weren’t easy to compare between symbols.

Yesterday I added sorting on both fields so that I could have a more pleasant log to read, but it caused the script to slow down tremendously. At that point I noticed that maybe, just maybe, PostgreSQL wasn’t optimising my tables, even though I had created views in the hope of it being smart enough to use them as optimisation options. So I created two indexes, one on the names of the objects and one on the names of the symbols, with the default handler (btree).

The harvesting process then slowed down by a good 50%: instead of taking less than 40 minutes, it took about an hour. But when I launched the analysis script, it generated the whole 30MB log file in a matter of minutes rather than requiring hours; I had never been able to let the analysis script complete its work before, and now it did it in minutes.

I have no problem saying that my database skills suck, which is probably why I’m much more of a system developer than a webapp developer.

Now at least I won’t have many more doubts about adding a way to automatically expand “multimplementations”: with the speed it has now, I can well get it to merge in the data from the third table without many issues. But still, seeing how poor my SQL skills are, I’d like to ask for some help on how to deal with this.

Basically, I have a table with paths, each of which refers to a particular object, which I call a “multimplementation” (and which groups together all the symbols related to a particular library, ignoring things like ABI versioning and different sub-versions). For each multimplementation I have to get a descriptive name to report to users. When there is just one path linked to that object, that path should be used; when there are two paths, the name of the object plus the two paths should be used; for more than two paths, the object name and the path of the first object should be used, with ellipses to indicate that there are more.
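In plain code, the rule I’m after looks something like this Ruby sketch (a hypothetical helper; the open question is how to express the same thing efficiently in SQL):

```ruby
# Sketch of the descriptive-name rule for a multimplementation:
# one path -> the path itself; two paths -> the object name plus
# both paths; more -> the object name, the first path and an ellipsis.
def descriptive_name(object, paths)
  case paths.size
  when 1 then paths.first
  when 2 then "#{object} (#{paths[0]}, #{paths[1]})"
  else        "#{object} (#{paths.first}, ...)"
  end
end

puts descriptive_name("libfoo.so", %w[/usr/lib/libfoo.so.1])
puts descriptive_name("libfoo.so", %w[/usr/lib/libfoo.so.1 /opt/lib/libfoo.so.1])
puts descriptive_name("libfoo.so", %w[/usr/lib/libfoo.so.1 /opt/a.so /opt/b.so])
```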

If you want to see the actual schema, you can find it on ruby-elf’s repository in the tools directory.

There are more changes to the database that I should make to make it much more feasible to connect the paths (and thus the objects) to the package names, but at least now, with the speed it gained, it seems feasible to run these checks on a more regular basis on the tinderbox. If only I could find an easy way to do incremental harvesting, I might as well be able to run it on my actual system too.