Gold readiness obstacle #6: more versioning trouble

Let’s keep talking about gold and the issues I’m encountering while trying to build Gentoo with it. While waiting to see whether glibc will ever implement base versioning, or whether Ian would rather implement default versions, I’ve found a third problem with the versioning support.

The problem, which I’ve reported as Sourceware bug #12893 as of today, shows up in the libdebian-installer package: somehow the link editor reports two duplicate symbols… in the same source file. At first I thought it was some nasty bug in the link editor’s internals; what I found instead was a curious set-up in the source code itself.

What happens is this: one source file defines a function (with an additional alias, but let’s ignore that for now); it then uses the already-described .symver directive to provide a further alias that carries a version:

int di_system_prebaseconfig_append (const char *udeb, const char *format, ...)
{ /* the body doesn't matter here */ }

__asm__ (".symver di_system_prebaseconfig_append,di_system_prebaseconfig_append@LIBDI_4.0");

Do note that the original symbol is not marked static, which means it is also exposed at the base version. This by itself would be just fine, if it weren’t that said source file is compiled into a translation unit which is (indirectly, but I’ll simplify) used to produce a shared library. Said shared library uses a link editor version script to set the symbol/version pairs of various symbols, as is often done by properly-designed shared objects.

What becomes a problem is that the symbol’s name is also listed in the version script; which means the link editor will take the unversioned (base) symbol and label it with the designated version (LIBDI_4.0)… causing two symbols with the same name and version to be created. It should appear obvious that something’s wrong with the logic of this whole situation.
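I haven’t looked at the actual script shipped by the package, so take this as a hedged, hypothetical sketch of what the version-script side of the situation looks like (GNU ld version script syntax):

LIBDI_4.0 {
    global:
        di_system_prebaseconfig_append;
        /* ... more exported symbols ... */
};

With the .symver directive above already providing di_system_prebaseconfig_append@LIBDI_4.0, assigning the base symbol to the very same version is what produces the duplicate.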

Why does this not cause a problem with the old bfd link editor? I’ve got no idea, although I can speculate that the two symbols not only have the same name and version, but also the same address, which is likely to give the link editor enough of a clue to simply disregard the duplicate. On the other hand, the fix for this should probably be applied to the libdebian-installer package, whose design I know nothing about; it might have been intended to support both shared and static linking, but even in that case it would look quite strange.

At any rate, I’ll have to wait for Ian to express his opinion, and in the meantime I’ll be catching up with a few more buggy packages. I guess I’m in no hurry, given that libtirpc is not fixed after all, as it still reports missing symbols; it does so only on glibc 2.14, though, which means the main tinderbox won’t be of much help for a little while.

Gold readiness obstacle #2: base versioning

This is the second post in the series analysing the obstacles we face if we want to actually make use of gold as the system link editor at some point in the future. As I said for the previous one, please make your interest in the topic explicit, as it is a draining exercise for me, given the general lack of interest from many other developers.

I already noted in part #1 that I submitted a patch for gold and it wasn’t merged, which ticked me off a bit. In this post I’ll explain what that patch was about. This one is particularly interesting to me because, even though it shows up in a very commonly used package, it wouldn’t be as much of an “obstacle” in my view had I not been doing paid work to look into it.

I have already written yesterday, and a number of times before, about how you use ELF symbol versioning, so I won’t go back over the topic right now. What I’ll repeat here is that there are two main reasons to use symbol versioning: preventing symbol collisions, as it is used by the Berkeley DB slots I wrote about yesterday, or preserving binary compatibility when making incompatible changes to functions’ ABIs while wanting to keep the same library ABI (and thus, soname).

For the former task, you can use the same blanket version information for all the symbols, as I noted, while for the latter you need a more surgical approach. What you usually do, when you stabilise the interface the first time around, is mark all the functions with the same version string. When one of those functions needs to be replaced, you then use source-level symbol versioning to provide a new “default” version of the symbol, together with an explicitly-versioned copy of the symbol that abides by the previously-used ABI. For more details about this, you can see the Binutils documentation that shows the example I’m going to pick up here.

     __asm__(".symver original_foo,foo@");
     __asm__(".symver old_foo,foo@VERS_1.1");
     __asm__(".symver old_foo1,foo@VERS_1.2");
     __asm__(".symver new_foo,foo@@VERS_2.0");

The code above is taken verbatim from the latest version of the GNU ld (bfd) documentation. What it translates to is this:

  • the original, replaced/deprecated interface of the foo() function is implemented with the (hidden) symbol original_foo;
  • two further, replaced/deprecated versions of foo() are implemented as old_foo and old_foo1;
  • finally, new_foo implements the most current version of foo().

How does this work in practice? Well, first of all the headers should only declare the newest interface of foo() – that is, new_foo – so that new programs can only use that. When linking a new binary, the link editor will know to satisfy foo() references lacking a version with that version, not because the version string is “higher” (the version string has no meaning for the link editor and runtime loaders, it’s just a string), but because it is marked as the “default” version (see the double at-sign in the directive). The other interfaces don’t have to be in the headers, and they will be ignored by the link editor, as if they weren’t there. Software built against a previous version of the library, where the default version for foo() was VERS_1.2 or VERS_1.1, would still reference those versions; the runtime loader (ld.so) would then look those up, rather than VERS_2.0.

Lovely, isn’t it? You can improve your interface and solve age-old issues without having to break the ABI, with the sole “little” downside of increasing the size of the library itself… and relying on a feature only available, as far as I know, on GLIBC and maybe FreeBSD (you can achieve the same effect on Windows, but their approach is massively different, so let’s ignore that for now). Before somebody says that you actually double the size of the code, I’d like to point out that most of the time the old function can be expressed as a call to the new function, with properly adapted parameters, unless you’re really changing the function into something entirely different.
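To make that last point concrete, here’s a hedged sketch (names and signatures are made up, and the VERS_1.1/VERS_2.0 nodes still need to be declared in the version script passed at link time) of how the compatibility copy usually just wraps the new code:

/* Current implementation, exported as the default version foo@@VERS_2.0. */
int new_foo(int value, int flags)
{
  return value + flags;
}
__asm__(".symver new_foo,foo@@VERS_2.0");

/* Compatibility copy for binaries linked against the old ABI:
   it adapts the arguments and delegates to the new code. */
int old_foo(int value)
{
  return new_foo(value, 0);
}
__asm__(".symver old_foo,foo@VERS_1.1");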

For those wondering, using this approach with C++ is very complex, and I’d probably say impossible: the ABI of C++ libraries includes the vtables of classes; when you add a new virtual function to a class you change its vtable, increasing its size and causing the ABI to change. It is for this reason that Trolltech used D-pointers in Qt for a long time, and why KDE had many problems introducing new features and fixing old bugs within a major release cycle.

Now let’s go back to our story of gold, fuse, and symbols.

The fuse library is designed to stay as binary compatible as possible with its previous versions, at least when built for GLIBC (it has special rules to not version interfaces when building for uClibc, for instance). This is because it is designed to allow proprietary filesystem providers — for instance the Mac version is used by Parallels to provide their shared folders support. Unfortunately it seems like this wasn’t a requirement in the original implementation, which was built without version information for symbols. This happens quite often, actually.

The Binutils example code above fortunately shows exactly how to deal with that: you declare a symbol with no version information. This is called the “base version”, and it can only be referenced as the sole version in a linker script, or by omitting the version specifier in a .symver directive. This works with the GNU assembler (as) and with the BFD link editor, but when creating a library with base-versioned symbols with gold, you get an error:

libtool: link: i686-pc-linux-gnu-gcc -shared  .libs/fuse.o .libs/fuse_kern_chan.o .libs/fuse_loop.o .libs/fuse_loop_mt.o .libs/fuse_lowlevel.o .libs/fuse_mt.o .libs/fuse_opt.o .libs/fuse_session.o .libs/fuse_signals.o .libs/cuse_lowlevel.o .libs/helper.o .libs/subdir.o .libs/iconv.o .libs/mount.o .libs/mount_util.o   -lrt -ldl -Wl,--as-needed  -pthread -Wl,--version-script -Wl,./fuse_versionscript -Wl,-O1 -Wl,--hash-style=gnu   -pthread -Wl,-soname -Wl,libfuse.so.2 -o .libs/libfuse.so.2.8.5
/usr/lib/gcc/i686-pc-linux-gnu/4.6.0/../../../../i686-pc-linux-gnu/bin/ld: error: symbol __fuse_exited has undefined version 

It might not be obvious at a glance, but this message simply means that gold does not support linking objects containing base-versioned symbols. Is it just a missing feature? Not really. I mean, the feature itself is missing, and indeed it is simple to implement, to the point that I have implemented it; you can find the patch in Sourceware bug #12261, which is still pending.

The problem here is that even though GNU ld (bfd) implements that feature, it is a pointless feature to implement right now. The problem lies not in the link editor but in the runtime loader (ld.so). As you can see from the testcases provided by Ian in the bug linked above, GLIBC does not do what it is expected to.

What you would expect is that, when the loader finds an undefined (requested) symbol without attached version information, it would look for a symbol with the corresponding name at the base version (no version attached to the definition), and failing that, it would fall back to the one at the default version. What actually happens is that the loader simply picks the first symbol it finds with the same name, without caring about the version if it wasn’t specified in the consumer. Whether it finds the one that was intended to be found is just sheer luck.

What’s the moral? Well, we have one advertised feature that never worked, but that a few projects, such as fuse, wanted to rely upon… I don’t disagree with Ian that this should be fixed in GLIBC first, and that for now gold is just exposing code that doesn’t work. Unfortunately Ian’s requests about the feature went unanswered – and due to Drepper just dumping the list of bug numbers without descriptions in the NEWS file I can’t tell whether it was addressed in the new 2.14 version – which means we still have no clue whether this is functionality that will ever be useful or not. I’ll have to ask again whether the fuse project would agree to just drop those symbols for now, since they cannot be useful with current glibc versions.

Again, expressing your interest in the topic helps me judge how much weight to put on it outside of my day job. Thanks in advance.

Linkers and names

I received a question by mail the other day from Randy, about sonames and in particular their versioning, which I have already discussed a couple of times. Still, his concerns are good enough to warrant another blog post on the matter, hoping not to bore my readers too much with it.

First of all, let me repeat that there are a number of different names that a shared library “owns”, on Unix and Unix-like systems based on ELF:

  • the actual name of the file on disk;
  • the name the link editor (or build-time linker if you prefer, ld) uses to link to it;
  • the name the loader (or dynamic linker if you prefer, ld.so) uses to find it.

The obvious way to correlate all of them is to use symbolic links; that’s why your /usr/lib directory is full of those. But how does that happen?

When linking, the link editor (ld) can take a library as a parameter in two different ways: either with the full path to the library (such as /usr/lib/libcrypto.so.0.9.8) or via a link-library parameter (-lcrypto); in the first case, the linker knows exactly which file it has to link against; in the latter, though, it has to go looking for it.

The paths where the linker searches for the library are worth discussing separately, and all in all the library search path list is not important to the aim of this article. What we do care about is the basename the linker is going to look for; well, the answer is easy: it applies the same substitution you’d have with sed and s/^-l(.+)$/lib\1.so/ — it prepends lib and appends .so to the library name, so -lcrypto becomes libcrypto.so. This is why all the libraries you can link to have a .so suffix-only variant, besides the full name and the version-suffixed symlink.

Obviously, when the link editor adds the dependencies on the libraries to the output, it has to use some kind of name; the full path is not a viable option (you might be linking to a non-installed library that will be installed later), so you can either use the library’s base name (PE – the format used by Windows – did this, but Microsoft seems to have changed its mind when .NET came around), or find some other value to link (no pun intended) it to, which is what we’re concerned with here, since that’s what Unix (and Linux) do.

Each library declares its canonical name (the soname, standing for Shared Object Name); the link editor copies that value from the library’s DT_SONAME tag into a DT_NEEDED entry of the output file. Now, these two tags are really what’s important in this discussion; as you saw, the two are respectively consumed and produced by the link editor. The loader, instead, only cares about DT_NEEDED: it uses that tag to know the filename of the libraries to look for in its library search path (which does not necessarily have to be the same as the one used by the link editor).
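As a hedged illustration of the round trip between the two tags (library and file names here are made up):

$ gcc -shared -Wl,-soname,libfoo.so.1 -o libfoo.so.1.2.3 foo.o
$ ln -s libfoo.so.1.2.3 libfoo.so       # used by the link editor (-lfoo)
$ ln -s libfoo.so.1.2.3 libfoo.so.1     # used by the loader (DT_NEEDED)
$ gcc -o frob frob.o -L. -lfoo
$ readelf -d frob | grep NEEDED         # reports libfoo.so.1, copied from DT_SONAME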

Actually, that’s not the whole truth; there are a number of libraries that don’t declare a soname at all. In some cases that’s correct: plugins, for instance, don’t need a soname at all since they are never linked against; but a good many times it’s actually a bug in the build system itself. That’s a topic for another day, though.

The loader, for good or bad, does not check that the value in DT_SONAME of the loaded library corresponds to the DT_NEEDED tag it encountered — this is why symlinking libraries with incompatible ABIs seems to work at first: the loader does not fail to load them. This means that it really doesn’t matter what a DT_NEEDED tag points to, as long as, within the library path, there is a file (or symlink) with that name. This lets you do some cool but braindead things, which I’d rather not write about myself.

To ensure that the loader is able to find the library, then, the symlinks corresponding to the names it’s going to look for need to be in place. This can be done by two pieces of software, at two different times. The first is, well, libtool, which takes care of the task both at build time and install time, and creates both the link used by the loader and the one used by the link editor to find the library. A number of other build systems imitate that behaviour as well.

The second component that does so is ldconfig, but only on GNU systems; by default it recreates all the symlinks to sonames in the library search path (of the loader); with various options, it processes particular libraries or paths. The result is pretty nifty, but not portable. I discovered this myself a long time ago when working on Gentoo/FreeBSD, as the preplib command from Portage failed on that architecture.

With hindsight, I wish I hadn’t asked for preplib to die, but rather had it rewritten to use scanelf for the job instead of ldconfig.

What probably confused Randy is that many documents out there assume the libtool/Linux semantics of sonames and versioning. I call it libtool/Linux because libtool changes its soname semantics depending on the operating system it’s working on, as again I found out while working on Gentoo/FreeBSD. On Linux, it produces a file with a three-component version suffix, and sets DT_SONAME to the name with just the first of those components; on FreeBSD (by default) it creates a file with a two-component suffix, and sets DT_SONAME to the full name.

In Gentoo/FreeBSD, we hack libtool to use the same naming scheme as Linux, since that is what library developers tend to think about when setting the version components. Without this hack, each minor version of glib would have a new soname, requiring all the linked software to be re-built, even if the ABI didn’t change, or changed in a compatible way.

I admit I haven’t even started looking up the reasoning behind either versioning scheme; it’s probably mostly historical. In the case of FreeBSD, having the soname and the full name match is probably needed to avoid requiring symlinks (as ldconfig does not create them, as I said above). In the case of Linux, I’m not really sure why there is so much complexity. Without digressing too much, the components passed to -version-info don’t map directly to the actual components used in either the Linux or FreeBSD semantics; both are related to them by way of not-too-simple arithmetic.
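From memory (so take it as a hedged sketch rather than gospel), the Linux mapping libtool applies is roughly this:

# libtool -version-info current:revision:age, GNU/Linux scheme
#   file name : libfoo.so.(current-age).(age).(revision)
#   DT_SONAME : libfoo.so.(current-age)
# e.g. -version-info 5:2:3 gives libfoo.so.2.3.2 with soname libfoo.so.2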

At any rate, neither the link editor nor the loader has a real concept of “ABI version”, which is why they don’t enforce it at all. So you can have libraries that don’t update their ABI version at all, and others that use the full package version as the ABI version, bumping it even when there are no incompatible changes.

So to answer Randy’s final question: this happens because each project is free to decide what its soname, and thus its soversion, is. In the case of OpenSSL, they use a three-component version that matches both the file name and the version of the released package, as can be seen in the following scanelf -S output from the tinderbox:

tinderbox ~ # scanelf -S /usr/lib/libcrypto.so.*
 TYPE   SONAME FILE 
ET_DYN libcrypto.so.0.9.8 /usr/lib/libcrypto.so.0.9.8 
ET_DYN libcrypto.so.1.0.0 /usr/lib/libcrypto.so.1.0.0 

Any document that implies or explicitly states that the soname has to consist of the library name and its ABI version is simply wrong. That is a practice that definitely should be followed, but there are no technical restrictions there, only practical ones.

Yes, it is a sorry state of the affairs.

Your worst enemy: undefined symbols

What ties reckless glibc unmasking, GTK+ 2.20 issues, Ruby 1.9 porting and --as-needed failures all together? Okay, the title is a dead giveaway for the answer: undefined symbols.

Before going deeper into the topic I first have to tell you about symbols, I guess; to do so, and to continue further, I’ll be using C as the base language for all of my notes. When considering C, then, a symbol is any function or data object (constant or variable) that is declared extern; that is, anything that is not static, and thus not confined to the translation unit (that is, source file, most of the time) where it is defined.
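As a minimal sketch of how that plays out in a single source file (the names here are made up):

/* Which names end up as symbols visible to other objects. */
static int counter;            /* internal linkage: not visible outside this file */
int shared_state;              /* external linkage: defined data symbol */
extern int api_call(void);     /* external linkage: only a reference, undefined here */

int helper(void)               /* external linkage: defined function symbol */
{
  return api_call() + shared_state + counter;
}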

Now, what nm shows as undefined (U code) is not really what we’re concerned about: object files (.o, just intermediates) will show undefined symbols for any function or data element they use that is not defined in the same translation unit; most of those get resolved at the time all the object files get linked together to form a final shared object or executable — actually, it’s a lot more complex than this, but since I don’t want to describe symbol resolution here, please accept it as if it were true.

The remaining symbols will keep the U code in the shared object or executable, but most of them won’t concern us: they will be loaded from the linked libraries when the dynamic loader actually resolves them. So, for instance, the executable built from the following source code will have the printf symbol “undefined” (for nm), but it’ll be resolved by the dynamic loader just fine:

#include <stdio.h>

int main() {
  printf("Hello, world!");
}

I have explicitly avoided using the fprintf function, mostly because that would require a further undefined symbol, so…
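As a quick check of the claim above (the exact addresses, and whether extra symbols show up, depend on the toolchain), nm over the object built from that snippet reports something like:

$ gcc -c hello.c
$ nm hello.o
0000000000000000 T main
                 U printf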

Why do I say that undefined symbols are our worst enemy? Well, the problem is actually with undefined, unresolved symbols after the loader has had its way. These are symbols for functions and data that either are not really defined anywhere, or are defined in libraries that are not linked in. The former case is what you get with most of the new-version compatibility problems (glibc, gtk, ruby); the latter is what you get with --as-needed.

Now, if you have a bit of practice with development and writing simple commands, you might be wondering why this is a problem at all: if you were to mistype the function above into priltf – a symbol that does not exist, at least in the basic C library – the build would refuse to create an executable at all, even though the implicit declaration is only treated as a warning, because the symbol is, well, not defined. But this rule only applies, by default, to final executables, not to shared objects (shared libraries, dynamic libraries, .so, .dll or .dylib files).

For shared objects, you have to explicitly ask for linking to be refused when there are undefined references; otherwise they are linked just fine, with no warning, no error, no bothering at all. The way you tell the linker to refuse that kind of linkage is by passing the -Wl,--no-undefined flag; this way, if there is even a single symbol that is not defined in the current library or any of its dependencies, the linker will refuse to complete the link. Unfortunately, using this by default is not going to work that well.
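As a hedged example (file and function names are made up), consider a shared object that references a function nobody defines:

/* broken.c: calls a function that is defined nowhere */
extern int missing_function(void);

int exported_api(void)
{
  return missing_function();
}

Linking it with gcc -shared -fPIC -o libbroken.so broken.c succeeds silently; add -Wl,--no-undefined to the same command and the link fails with an undefined-reference error instead, which is exactly the behaviour described above.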

There are indeed some more or less good reasons to allow shared objects to have undefined symbols; here are a few:

Multiple ABI-compatible libraries: okay, this is a very far-fetched one, simply because of the difficulty of having ABI-compatible libraries (it’s difficult enough to have them API-compatible!), but it happens; for instance on FreeBSD you have – or at least used to have – a few different implementations of the threading libraries, and more or less the same situation holds for multiple OpenGL and mathematical libraries. The idea behind this is actually quite simple: if libA1 and libA2 both provide the symbols, then with libB linking to libA1 and libC linking to libA2, an executable foo linking to libB and libC would get both libraries loaded together, creating nasty symbol collisions.

Nowadays, FreeBSD handles this through a libmap.conf file that lets you always link against the same library, but then switch to a different one at load time; a similar approach is taken by things like libgssglue, which allows switching the GSSAPI implementation (which might be either Kerberos or SPKM) with a configuration file. On Linux, beside these custom implementations, or hacks such as the one used by Gentoo (eselect opengl) to handle the switch between different OpenGL implementations, there seems to be no interest in tackling the problem at the root. Indeed, I complained about that when --as-needed was softened to allow this situation, although I guess it at least removed one common complaint about adopting the option by default.

Plug-ins hosted by a standard executable: plug-ins are, generally speaking, shared objects; and with the exception of the most trivial plug-ins, whose output is only defined in terms of their input, they use functions that are provided by the software they plug into. When they are hosted (loaded and used) by a library, such as libxine, they are linked back to the library itself, and that makes sure the symbols are known at the time the plugin object is created. On the other hand, when the plug-ins are hosted by software that is not a shared object (which is the case of, say, zsh), you have no way to link them back, and the linker has no way to discern between undefined symbols that will be lifted from the host program and those that are bad and simply undefined.

Plug-ins providing symbols for other plug-ins: here you have a perfect example in the Ruby-GTK2 bindings; when I first introduced --no-undefined in the Gentoo packaging of Ruby (1.9 initially, nowadays all three C-based implementations have the same flag passed on), we got reports of non-Portage users of Ruby-GTK2 having build failures. The reason? Since all the GObject-derived interfaces had to share the same tables and lists, the solution they chose was to export an interface, unrelated to the Ruby-extension interface (which is actually composed of a single function, bless them!), that the other extensions use; since you cannot reliably link extension modules to one another, they don’t link to them, and you get the usual problem of not being able to distinguish between expected and unexpected undefined symbols.

Note: this particular case is not tremendously common; when loading plug-ins with dlopen() the default is to use the RTLD_LOCAL option, which means the symbols are only available to the branch of libraries loaded together with that library, or through explicit calls to dlsym(); this is a good thing because it reduces the chances of symbol collisions and unexpected linking consequences. On the other hand, Ruby itself seems to go all the way against the common idea of safety: it requires RTLD_GLOBAL (register all the symbols in the global scope, so that they are available for resolution at any point in the whole tree), and also requires RTLD_LAZY, which makes it more troublesome if there are missing symbols — I’ll get to what lazy bindings are later.

Finally, the last case I can think of where there is at least some sense to all this trouble is reciprocating libraries, such as those in PulseAudio. In this situation you have two libraries, each using symbols from the other. Since you need the second to fully link the first, but you need the first to link the second, you cannot get out of the deadlock with --no-undefined turned on. This, and the executables-hosting-plug-ins case, are the only two reasons I find valid for not using --no-undefined by default — but unfortunately they are not the only two used.

So, what about that lazy stuff? Well, the dynamic loader has to perform a “binding” of the undefined symbols to their definitions; binding can happen in two main modes: immediate (“now”) or lazy, the latter being the default. With lazy binding, the loader will not try to find the definition to bind to a symbol until it’s actually needed; in practice this applies to function calls, which are resolved the first time the function is called, while data references are resolved at load time regardless. With immediate binding, the loader will iterate over all the undefined symbols of an object when it is loaded (loading up its dependencies as well, if needed). As you might guess, if there are undefined, unresolved symbols, the two binding types behave very differently. An immediately-bound executable will fail to start, and an immediately-bound library will fail dlopen(); a lazily-bound executable will start up fine and abort as soon as a symbol is hit that cannot be resolved, and a lazily-bound library will simply make its host program abort in the same way. Guess what’s safer?
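To tie the plug-in case and the binding modes together, here’s a hedged sketch in C (the plugin path and entry-point name are made up) of loading a plug-in with the safer combination of flags; on GNU/Linux it needs to be linked with -ldl:

#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
  /* RTLD_NOW forces all undefined symbols to be resolved right here,
     so a missing symbol fails dlopen() instead of aborting mid-run;
     RTLD_LOCAL keeps the plug-in's symbols out of the global scope. */
  void *plugin = dlopen("./plugin.so", RTLD_NOW | RTLD_LOCAL);
  if (!plugin) {
    fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return 1;
  }

  /* Look up the entry point explicitly rather than relying on RTLD_GLOBAL. */
  void (*entry)(void) = (void (*)(void)) dlsym(plugin, "plugin_entry");
  if (entry)
    entry();

  dlclose(plugin);
  return 0;
}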

With all these catches and issues, you can see why undefined symbols are a particularly nasty situation to deal with. To the best of my knowledge, there isn’t a real way to inspect an object post-mortem to make sure that all its symbols are defined. I started writing support for that in Ruby-Elf but the results weren’t really… good. Lacking that, I’m not sure how we can proceed.

It would be possible to simply change the default to --no-undefined, and work around the few packages that require undefined symbols with --undefined (we decided to proceed that way with Ruby); but given the kind of support I’ve received before for my drastic decisions, I don’t expect enough people to help me tackle that anytime soon — and I don’t have the material time to work on it myself, as you might guess.

The why and how of RPATH

This post is brought to you by a conversation with Fabio, which reminded me of an older conversation I had with someone else (exactly whom, right now, escapes me) about the ability to inject RPATH values into already-built binaries. I’m sorry to have forgotten who asked me about that; I hope he or she won’t take it badly.

But before I get to the core of the problem, let me try to give a brief introduction to what we’re talking about here, because jumping straight into talking about injection is going to be too much for most of my readers, I’m sure. Even so, this whole topic reconnects with multiple smaller topics I have discussed in the past on my blog, so you’ll see a few links here and there for that.

First of all, what the heck is RPATH? When using dynamic linking (shared libraries), the operating system needs to know where to look for the libraries an executable uses; to do so, each operating system has one or more PATH variables, plus possibly configuration files, that are used to look up the libraries. On Windows, for instance, the same PATH variable that is used to find commands is used to load libraries, and first of all the libraries are looked for in the same directory as the executable. On Unix, the commands and the libraries use distinct paths by design, and the executable’s directory is not searched; this is also because the two directories are quite distinct (/usr/bin vs /usr/lib, as an example). The GNU/Linux loader (and here the name is proper, as the loader comes out of GLIBC — almost identical behaviour is expected from uClibc, FreeBSD and other OSes, but the GLIBC one is the one I know for sure) differs extensively from the Sun loader; I say this because I’ll introduce the Sun system later.

In your average GNU/Linux system, including Gentoo, the paths where libraries are looked up are defined in the /etc/ld.so.conf file; the LD_LIBRARY_PATH variable is prepended to that list. (There is a second variable, LIBRARY_PATH, that tells the gcc frontend to the linker where to look for libraries to link to at build time, rather than to load — update: thanks to Fabio who pointed out that I had named the wrong variable at first; LDPATH is instead used by the env-update script to set the proper search paths in the ld.so.conf file.) All executables will look in these paths both for the libraries they link to directly and for non-absolute dlopen() calls. But what happens with private libraries — libraries that are shared only among a small number of executables coming from the same package?

The obvious choice is to install them in the same path as the general libraries; this does not require playing with the search paths at all, but it causes two problems: the build-time linker will still find them at link time, which might not be what you want; and it increases the number of files present in a single directory (which means accessing its contents slows down, little by little). The common alternative approach is installing them in a sub-directory specific to the package (automake already provides a pkglib installation class for this type of library). I have already discussed and advocated this solution, so that internal libraries are not “shown” to the rest of the software on the system.

Of course, adding the internal library directories to the global search path also means slowing down library loading, as more directories have to be searched when looking for the libraries. To solve this issue, the rpath (and its newer variant, the runpath) is used. DT_RPATH is a .dynamic attribute that provides further search paths for the object it is listed in (it can be used both for executables and for shared objects); the list is inserted in-between the LD_LIBRARY_PATH environment variable and the search paths from the /etc/ld.so.conf file. Within that list there are two special cases: $ORIGIN is expanded to the path of the directory where the object is found, while both empty and . values in the list refer to the current working directory (the so-called insecure rpath).
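As a hedged example of the $ORIGIN case (paths and names here are made up), a relocatable tree with sibling bin/ and lib/ directories can be linked like this, quoting the value so that the shell doesn’t try to expand it:

$ gcc -o bin/frobnicate frobnicate.o \
      -L lib -lfrobnicate -Wl,-rpath,'$ORIGIN/../lib'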

Now, while rpath is not a panacea, and it also slightly slows down the loading of the executable, it can have decent effects, especially by not requiring further symlinks to switch among almost-equivalent libraries whose ABIs aren’t that stable. It also gets very useful when you want to build software you’re developing so that it loads your own libraries rather than the system ones, without relying on wrapper scripts the way libtool does.

To create an RPATH entry, you simply tell it to the linker (not the compiler!); so, for instance, you could add to your ./configure call the string LDFLAGS=-Wl,-rpath,/my/software/base/my/lib/.libs to build and run against a specific version of your library. But what about an already-linked binary? The idea of using RPATHs also makes for nicer handling of binary packages and their dependencies, so there is an obvious advantage in having the RPATH editable after the final link took place… unfortunately this isn’t that easy. While there is a tool called chrpath that allows you to change an already-present RPATH, and especially to delete one (it comes in handy to resolve insecure rpath problems), it has two limitations: the new RPATH cannot be longer than the previous one, and you cannot add one from scratch.
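For reference, this is the kind of manipulation chrpath allows (the binary name is made up); note that both the replace and the delete operations only work on an entry that is already present:

$ chrpath -l ./frobnicate                 # list the current RPATH, if any
$ chrpath -r /opt/frob/lib ./frobnicate   # replace it (no longer than the old one)
$ chrpath -d ./frobnicate                 # delete it altogether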

The reason is that the .dynamic entries are fixed-size; you can remove an RPATH by setting its type to NULL, so that the dynamic loader skips over it, and you can edit an already-present RPATH by changing the string it points to in the string table, but you cannot extend either .dynamic or the string table itself. This reduces the usefulness of RPATH for binary packages to almost nothing. Is there anything we can do to improve on this? Well, yes.

Ali Bahrami at Sun already solved this problem in June 2007, which is to say just over three years ago. They implemented it with a very simple trick that could be handled entirely in the build-time linker, without having to change even a bit of the dynamic loader: they added padding!

The only new definition they had to add was DT_SUNW_STRPAD, a new entry in the .dynamic table that gives you the size of the padding space at the end of the string table; together with that, they added a number of extra DT_NULL entries to the same .dynamic table. Since all the entries in .dynamic are fixed-size, a DT_NULL can become a DT_RPATH without a problem. Even if some broken software might expect a DT_NULL at the end of the table (which would be wrong anyway), you just need to keep the spare ones at the end. All ELF software should ignore the .dynamic entries it doesn’t understand, at least as long as they are within the reserved OS-specific ranges.

Unfortunately, as far as I know, there is no implementation of this idea in the GNU toolchain (GNU binutils is where ld lives). It shouldn’t be hard to implement, as I said; it’s just a matter of emitting a defined number of zeros at the end of the string table, and adding a new .dynamic tag with its size… the same rpath command from OpenSolaris would probably be usable on Linux after that. I considered porting this myself, but I have no direct need for it; if you develop proprietary software for Linux, or need this feature for deployment, though, you can contact me for a quote on implementing it. Or you can do it yourself, or find someone else.