What ties in reckless glibc unmasking GTK+ 2.20 issues Ruby 1.9 porting and --as-needed
failures all together? Okay the title is a dead giveaway for the answer: undefined symbols.
Before deepening within the topic I first have to tell you about symbols I guess; and to do so, and to continue further, I’ll be using C as the base language for everyone of my notes. When considering C, then, a symbol is any function or data (constant or variable) that is declared extern; that is anything that is neither static or defined in the same translation unit (that is, source file, most of the time).
Now, what nm
shows as undefined (U
code) is not really what we’re concerned about; for object files (.o
, just intermediate) will report undefined symbols for any function or data element used that is not in the same translation unit; most of those get resolved at the time all the object files get linked in to form a final shared object or executable — actually, it’s a lot more complex than this, but since I don’t care about describing here symbolic resolution, please accept it like it was true.
The remaining symbols will be keeping the U
code in the shared object or executable, but most of them won’t concern us: they will be loaded from the linked libraries, when the dynamic loader actually resolve them. So for instance, the executable built from the following source code, will have the printf
symbol “undefined” (for nm
), but it’ll be resolved by the dynamic linker just fine:
int main() {
printf("Hello, world!");
}
I have explicitly avoided using the fprintf
function, mostly because that would require a further undefined symbol, so…
Why do I say that undefined symbols are our worst enemy? Well, the problem is actually with undefined, unresolved symbols after the loader had its way. These are either symbols for functions and data that is not really defined, or is defined in libraries that are not linked in. The former case is what you get with most of the new-version compatibility problems (glibc, gtk, ruby); the latter is what you get with --as-needed
.
Now, if you have a bit of practice with development and writing simple commands, you’d be now wondering why is this a kind of problem; if you were to mistype the function above into priltf
– a symbol that does not exist, at least in the basic C library – the compiler will refuse to create an executable at all, even if the implicit declaration was only treated as a warning, because the symbol is, well, not defined. But this rule only applies, by default, to final executables, not to shared objects (shared libraries, dynamic libraries, .so
, .dll
or .dylib
files).
For shared objects, you have to explicitly ask to refuse linking them with undefined reference, otherwise they are linked just fine, with no warning, no error, no bothering at all. The way you can tell the linker to refuse that kind of linkage is passing the -Wl,--no-undefined
flag; this way if there is even a single symbol that is not defined in the current library or any of its dependencies the linker will refuse to complete the link. Unfortunately, using this by default is not going to work that well.
There are indeed some more or less good reasons to allow shared objects to have undefined symbols, and here come a few:
Multiple ABI-compatible libraries: okay this is a very far-fetched one, simply for the difficulty to have ABI-compatible libraries (it’s difficult enough to have them API-compatible!), but it happens; for instance on FreeBSD you – at least used to – have a few different implementations of the threading libraries, and have more or less the same situation for multiple OpenGL and mathematical libraries; the idea behind this is actually quite simply; if you have libA1 and libA2 providing the symbols, then libB linking to libA1, and libC linking to libA2, an executable foo
linking to libB and libC would get both libraries linked together, and creating nasty symbol collisions.
Nowadays, FreeBSD handles this through a libmap.conf
file that allows to link always the same library, but then switch at load-time with a different one; a similar approach is taken by things like libgssglue
that allows to switch the GSSAPI implementation (which might be either of Kerberos or SPKM) with a configuration file. On Linux, beside this custom implementation, or hacks such as that used by Gentoo (eselect opengl
) to handle the switch between different OpenGL implementations, there seem to be no interest in tackling the problem at the root. Indeed, I complained about that when --as-needed
was softened to allow this situation although I guess it at least removed one common complain about adopting the option by default.
Plug-ins hosted by a standard executable: plug-ins are, generally speaking, shared objects; and with the exception of the most trivial plugins, whose output is only defined in terms of their input, they use functions that are provided by the software they plug. When they are hosted (loaded and used from) by a library, such as libxine
, they are linked back to the library itself, and that makes sure that the symbols are known at the time of creating the plugin object. On the other hand, when the plug-ins are hosted by some software that is not a shared object (which is the case of, say, zsh
), then you have no way to link them back, and the linker has no way to discern between undefined symbols that will be lifted from the host program, and those that are bad, and simply undefined.
Plug-ins providing symbols for other plug-ins : here you have a perfect example in the Ruby-GTK2 bindings; when I first introduced --no-undefined
in the Gentoo packaging of Ruby (1.9 initially, nowadays all the three C-based implementations have the same flag passed on), we got reports of non-Portage users of Ruby-GTK2 having build failures. The reason? Since all the GObject-derived interfaces had to share the same tables and lists, the solution they chose was to export an interface, unrelated to the Ruby-extension interface (which is actually composed of a single function, bless them!), that the other extensions use; since you cannot reliably link modules one with the other, they don’t link to them and you get the usual problem of not distinguish between expected and unexpected undefined symbols.
Note: this particular case is not tremendously common; when loading plug-ins with dlopen()
the default is to use the RTLD_LOCAL
option, which means that the symbols are only available to the branch of libraries loaded together with that library or with explicit calls to dlsym()
; this is a good thing because it reduces the chances of symbol collisions, and unexpected linking consequences. On the other hand, Ruby itself seems to go all the way against the common idea of safety: they require RTLD_GLOBAL
(register all symbols in the global procedure linking table, so that they are available to be loaded at any point in the whole tree), and also require RTLD_LAZY
, which makes it more troublesome if there are missing symbols — I’ll get later to what lazy bindings are.
Finally, the last case I can think of where there is at least some sense into all of this trouble, is reciprocating libraries, such as those in PulseAudio. In this situation, you have two libraries, each using symbols from one another. Since you need the other to fully link the one, but you need the one to link the other, you cannot exit the deadlock with --no-undefined
turned on. This, and the executable-plugins-host, are the only two reasons that I find valid for not using --no-undefined
by default — but unfortunately are not the only two used.
So, what about that lazy stuff? Well, the dynamic loader has to perform a “binding” of the undefined symbols to their definition; binding can happen in two modes, mainly: immediate (“now”) or lazy, the latter being the default. With lazy bindings, the loader will not try to find the definition to bind to the symbol until it’s actually needed (so until the function is called, or the data is fetched or written to); with immediate bindings, the loader will iterate over all the undefined symbols of an object when it is loaded (eventually loading up the dependencies). As you might guess, if there are undefined, unresolved symbol, the two binding types have very different behaviours. An immediately-loaded executable will fail to start, and a loaded library would fail dlopen()
; a lazily-loaded executable will start up fine, and abort as soon as a symbol is hit that cannot be resolved; and a library would simply make its host program abort at the same way. Guess what’s safer?
With all these catches and issues, you can see why undefined symbols are a particularly nasty situation to deal with. To the best of my knowledge, there isn’t a real way to post-mortem an object to make sure that all its symbols are defined. I started writing support for that in Ruby-Elf but the results weren’t really… good. Lacking that, I’m not sure how we can proceed.
It would be possible to simply change the default to be --no-undefined
, and work around with --undefined
the few that require the undefined symbols to be there (we decided to proceed that way with Ruby); but given the kind of support I’ve received before in my drastic decisions, I don’t expect enough people to help me tackle that anytime soon — and I don’t have the material time to work on that, as you might guess.
Look how the GOT and the PLT works.
cyp@cyp ~ $ ./helloHello world!cyp@cyp ~ $Please add a “n” to the post.
Don’t miss the point of the whole story. > Cyp 🙂
get a better shell > Cyp
Thank you very much for this, I found it really usefull, was stuck with a software i’ve inherited, that compiled succesfully, but threw a bunch of linker errors while running.Using the –no-undefined flag helped me debug the error.