Symbolism and ELF files (or, what does -Bsymbolic do?)

Julian asked by mail – uh… last month! okay, I have a long backlog – what the -Bsymbolic and -Bsymbolic-functions options to the GNU linker do. The answer is not extremely complicated, but it calls for some explanation of how function calls in Unix work. I say in Unix because there are a few differences in how OS X and Windows behave, if I recall correctly, and I’m definitely not an expert on those. I wish I were; then I would be able to work on a book to replace Levine’s Linkers and Loaders.

PLT and GOT tables diagram

Please don’t complain about the very bad drawing above, it’s just there to illustrate what’s going on; I did it on my iPad with a capacitive stylus. I’ll probably try to do a few more of these, since I don’t have my usual Intuos tablet, and I won’t have it until I find my place in London.

You see, the whole issue of linking in Unix is implemented with a long list of tables: the symbol table, the procedure linkage table (PLT) and the global offset table (GOT). All objects involved in a dynamic linking chain (executables and shared objects, ET_EXEC and ET_DYN) possess a symbol table, which mixes defined (exported) and undefined (requested) symbols. Objects that call functions that may be defined elsewhere (which in practice means almost all ET_EXEC and ET_DYN objects) possess a PLT, and PIC objects (most ET_DYN, with the exception of some x86 prebuilt objects, and PIE ET_EXEC) possess GOTs.

The GOT and the text section

Let’s start from the bottom, that is the GOT, or actually, before the GOT itself, the executable code itself. As far as ELF is concerned, by default (there are a number of options that change this, but I don’t want to go there for now), data and function sections are completely opaque. Access to functions and data has to be done through their start addresses. For non-PIC objects, these are absolute addresses, as the objects are assumed to always be loaded at the same position. When using position-independent code (PIC), as the name hints, this assumption has to be dropped, so the position of a data element or function has to be derived using offsets from the load address of the object. When you combine non-static load addresses with non-PIC objects, you actually have to patch the code to use the new full addresses; this is what causes a text relocation (TEXTREL), which requires the ability to write to the executable segments, an obvious security issue.

So here’s where the global offset table enters the scene. Whenever you’re referring to a particular data element or function, you have an offset from the start of the containing section. This makes it possible for that section to be loaded at different addresses, and yet keep the code untouched. (Do note I’m simplifying the whole thing a lot, but I don’t want to go too much into details, because half the people reading wouldn’t understand what I’m saying otherwise.)

But the GOT is only used when the data or function is guaranteed to be in the same object. If you’re not using any special option to either compiler or linker, this means only static symbols are addressed directly through the GOT. Everything else is accessed through the object’s PLT, to which all the functions that the object calls are added. The PLT then has code to ask the dynamic loader at what address the given symbol is defined.

Global and local symbol tables

To answer that question, the dynamic loader has to have a global table which resolves symbol names to addresses. This is basically a global PLT, from a certain point of view. Depending on some settings in the objects, in the environment or in the loader itself, this table can be populated right when the application is executed, or only when the symbols are requested. For simplicity, I’ll assume that what happens is the former, as otherwise we’d end up in details that have nothing to do with what we were discussing to begin with. Furthermore, there is a different complication added by the modern GNU loader, which introduced indirect binding… it’s a huge headache.

While the same symbol name might have multiple entries in the various objects’ symbol tables, because more than one object exports the same symbol, in the resolved table each symbol name has exactly one address, found by scanning the loaded objects’ symbol tables. This means that the loader has to somehow resolve the collisions that happen when multiple objects export the same symbol. It also means that, by default, there is no guarantee that an object that both exports and calls a given symbol is going to call its own copy.

Let me try to underline that point: symbols that are exported are added to the symbol table of an object; symbols that are called are added to the symbol table as undefined (if they are not there already), and they are added to the procedure linkage table (which then finds the position via its own offset table). By default, with no special options, as I said, only static functions are called directly through the object’s global offset table; everything else is called through the PLT, and thus through the loader’s table of resolved symbols. This is actually what drives symbol interposing (which is used by LD_PRELOAD libraries), and what caused the ancient xine problems that steered me to look into ELF itself.

Okay, I’m almost at -Bsymbolic!

As my post about xine shows, there are situations where going through the PLT is not the desired behaviour, as you want to ensure that an object calls its own copy of any given symbol that is defined within itself. You can do that in many ways; the simplest option is not to expose those symbols at all. As I said, with default options only static functions are called straight through the GOT, but this can easily be extended to functions that are not exposed, which can be done either by marking the symbols as hidden (happens at compile time), or by using a linker script to expose only a limited set of symbols (happens at link time).

This is logical: the moment the symbols are no longer exported by the object, the dynamic loader has no way to answer for the PLT, which means the only option left is to use the GOT directly.

But sometimes you have to expose the symbols, and at the same time you want to make sure that you call your own copy, and not some other interposed copy of those symbols. How do you do that? That’s where the -Bsymbolic and -Bsymbolic-functions options come into play. What they do is duplicate the GOT entries for the symbols that are both called and defined in a shared object: the loader points to one, while the object itself points to the other. This way, it’ll always call its own copy. An almost identical solution is applied, just at compile time rather than link time, when you use protected visibility (instead of default or hidden).

Unfortunately this makes a small change in the semantics we’re used to: since the way the symbols are resolved varies depending on whether you’re referring to them from within or from outside the object, pointers taken inside and outside the object will no longer match. While for most libraries this is not a problem, there are some cases where it really is. For instance, in xine we hit a problem with the special memcpy() function implementation: it was a weak symbol, so simply a pointer to the actual function, which was being set within the libxine object… but would have been left unset for the external objects, including the plugins, for which it would still have been NULL.

Comparing function symbols is a rare corner case, but comparing objects’ addresses is common enough, for instance if you’re trying to see whether a default, global object is being passed to your function instead of a custom one… in that case, having the addresses no longer match is a big pain. Which is basically why you have -Bsymbolic-functions: it’s exactly like -Bsymbolic, but limits itself to functions, whereas data objects are still handled as if no option was passed. It’s a compromise that makes it easier to avoid going through the PLT for everything, while not breaking too much code (it would then only break on corner cases like xine’s memcpy()).

By the way, if it’s not obvious, the use of symbolic resolution is not only about making sure that the objects know which function they are calling; it’s also a performance improvement, as it avoids a virtual round-trip to the dynamic loader and a lookup of where the symbol is actually defined. This is minor for most functions, but it can be visible if there are many, many functions being called. Of course, it shouldn’t make much of a difference if the loader is good enough, but that’s a completely different story. As is the option for the compiler to emit two copies of a given function, to avoid doing the full preamble when it’s called from within the object. And so is link-time optimisation, which is connected, though indirectly, with what I’ve discussed above.

Oh, and if it wasn’t clear from what I wrote: you should not ever use the -Bsymbolic flag in your LDFLAGS variable in Gentoo. It’s not a flag you should mess with; only upstream developers should care about it.

Your worst enemy: undefined symbols

What ties reckless glibc unmasking, GTK+ 2.20 issues, Ruby 1.9 porting and --as-needed failures all together? Okay, the title is a dead giveaway for the answer: undefined symbols.

Before delving into the topic I first have to tell you about symbols, I guess; to do so, and to continue further, I’ll be using C as the base language for all of my notes. When considering C, then, a symbol is any function or data element (constant or variable) that is declared extern; that is, anything that is neither static nor defined in the same translation unit (that is, source file, most of the time).

Now, what nm shows as undefined (U code) is not really what we’re concerned about; object files (.o, just intermediates) will report undefined symbols for any function or data element used that is not in the same translation unit. Most of those get resolved at the time all the object files get linked together to form a final shared object or executable (actually, it’s a lot more complex than this, but since I don’t care about describing symbol resolution here, please accept it as if it were true).

The remaining symbols will keep the U code in the shared object or executable, but most of them won’t concern us: they will be loaded from the linked libraries, when the dynamic loader actually resolves them. So for instance, the executable built from the following source code will have the printf symbol “undefined” (for nm), but it’ll be resolved by the dynamic linker just fine:

#include <stdio.h>

int main() {
  printf("Hello, world!\n");
  return 0;
}

I have explicitly avoided using the fprintf function, mostly because that would require a further undefined symbol, so…

Why do I say that undefined symbols are our worst enemy? Well, the problem is actually with undefined, unresolved symbols after the loader has had its way. These are symbols for functions and data that either are not really defined anywhere, or are defined in libraries that are not linked in. The former case is what you get with most of the new-version compatibility problems (glibc, gtk, ruby); the latter is what you get with --as-needed.

Now, if you have a bit of practice with development and writing simple commands, you’d now be wondering why this is a problem at all: if you were to mistype the function above into priltf – a symbol that does not exist, at least in the basic C library – the compiler would refuse to create an executable at all, even though the implicit declaration is only treated as a warning, because the symbol is, well, not defined. But this rule only applies, by default, to final executables, not to shared objects (shared libraries, dynamic libraries, .so, .dll or .dylib files).

For shared objects, you have to explicitly ask for linking with undefined references to be refused; otherwise they are linked just fine, with no warning, no error, no bother at all. The way you tell the linker to refuse that kind of linkage is by passing the -Wl,--no-undefined flag; this way, if there is even a single symbol that is not defined in the current library or any of its dependencies, the linker will refuse to complete the link. Unfortunately, using this by default is not going to work that well.

There are indeed some more or less good reasons to allow shared objects to have undefined symbols; here are a few:

Multiple ABI-compatible libraries: okay, this is a very far-fetched one, simply for the difficulty of having ABI-compatible libraries (it’s difficult enough to have them API-compatible!), but it happens; for instance on FreeBSD you have – or at least used to have – a few different implementations of the threading libraries, and more or less the same situation holds for multiple OpenGL and mathematical libraries. The idea behind this is actually quite simple: if you have libA1 and libA2 providing the same symbols, then libB linking to libA1, and libC linking to libA2, an executable foo linking to libB and libC would get both libraries linked together, creating nasty symbol collisions.

Nowadays, FreeBSD handles this through a libmap.conf file that allows linking always against the same library, and then switching to a different one at load time; a similar approach is taken by things like libgssglue, which allows switching the GSSAPI implementation (which might be either Kerberos or SPKM) with a configuration file. On Linux, besides these custom implementations, or hacks such as the one used by Gentoo (eselect opengl) to handle the switch between different OpenGL implementations, there seems to be no interest in tackling the problem at the root. Indeed, I complained about that when --as-needed was softened to allow this situation, although I guess it at least removed one common complaint about adopting the option by default.

Plug-ins hosted by a standard executable: plug-ins are, generally speaking, shared objects; and with the exception of the most trivial plug-ins, whose output is only defined in terms of their input, they use functions that are provided by the software they plug into. When they are hosted (loaded and used) by a library, such as libxine, they are linked back to the library itself, and that makes sure the symbols are known at the time of creating the plugin object. On the other hand, when the plug-ins are hosted by some software that is not a shared object (which is the case of, say, zsh), then you have no way to link them back, and the linker has no way to discern between undefined symbols that will be lifted from the host program and those that are bad, and simply undefined.

Plug-ins providing symbols for other plug-ins: here you have a perfect example in the Ruby-GTK2 bindings. When I first introduced --no-undefined in the Gentoo packaging of Ruby (1.9 initially; nowadays all three C-based implementations have the same flag passed on), we got reports of non-Portage users of Ruby-GTK2 having build failures. The reason? Since all the GObject-derived interfaces had to share the same tables and lists, the solution they chose was to export an interface, unrelated to the Ruby-extension interface (which is actually composed of a single function, bless them!), that the other extensions use; since you cannot reliably link modules to one another, they don’t link to them, and you get the usual problem of not being able to distinguish between expected and unexpected undefined symbols.

Note: this particular case is not tremendously common; when loading plug-ins with dlopen() the default is to use the RTLD_LOCAL option, which means that the symbols are only available to the branch of libraries loaded together with that library, or through explicit calls to dlsym(); this is a good thing because it reduces the chances of symbol collisions and unexpected linking consequences. On the other hand, Ruby itself seems to go all the way against the common idea of safety: it requires RTLD_GLOBAL (register all symbols in the loader’s global symbol table, so that they are available to any object loaded at any point in the whole tree), and also requires RTLD_LAZY, which makes it more troublesome if there are missing symbols; I’ll get to what lazy bindings are later.

Finally, the last case I can think of where there is at least some sense to all of this trouble is reciprocating libraries, such as those in PulseAudio. In this situation, you have two libraries, each using symbols from the other. Since you need the other to fully link the one, and you need the one to link the other, you cannot exit the deadlock with --no-undefined turned on. This, and the executable-hosted plug-ins case, are the only two reasons I find valid for not using --no-undefined by default; unfortunately, they are not the only two used.

So, what about that lazy stuff? Well, the dynamic loader has to perform a “binding” of the undefined symbols to their definitions; binding can happen in two modes, mainly: immediate (“now”) or lazy, the latter being the default. With lazy binding, the loader will not try to find the definition to bind to a symbol until it’s actually needed (that is, until the function is called, or the data is read or written); with immediate binding, the loader will iterate over all the undefined symbols of an object when it is loaded (eventually loading up the dependencies). As you might guess, if there are undefined, unresolved symbols, the two binding types behave very differently: an immediately-bound executable will fail to start, and an immediately-bound library will fail dlopen(); a lazily-bound executable will start up fine and abort as soon as a symbol is hit that cannot be resolved, and a lazily-bound library will simply make its host program abort the same way. Guess which is safer?

With all these catches and issues, you can see why undefined symbols are a particularly nasty situation to deal with. To the best of my knowledge, there isn’t a real way to check an object post-mortem to make sure that all its symbols are defined. I started writing support for that in Ruby-Elf, but the results weren’t really… good. Lacking that, I’m not sure how we can proceed.

It would be possible to simply change the default to --no-undefined, and work around, with --undefined, the few packages that require undefined symbols to be there (we decided to proceed that way with Ruby); but given the kind of support I’ve received before for my drastic decisions, I don’t expect enough people to help me tackle that anytime soon; and I don’t have the material time to work on it myself, as you might guess.

Ruby-Elf and Sun extensions

I’ve written in my post about OpenSolaris that I’m interested in extending Ruby-Elf to parse and access Sun-specific extensions, that is, the .SUNW_* sections of ELF files produced under OpenSolaris. Up to now I only knew the format – and not even that properly – of the .SUNW_cap section, which contains hardware and software capabilities for an object file or an executable, but I wasn’t sure how to interpret it.

Thanks to Roman, who sent me the link to the Sun Linker and Libraries Guide (I did know about it, but I lost the link quite a long time ago and then forgot it existed), I now know some more about the Sun-specific sections, and I’ve already started implementing support for them in Ruby-Elf. Unfortunately, I’m still looking for a way to properly test for them; in particular, I’m not yet sure how I can check for the various hardware-specific extensions, and I have no idea how to test the Sparc-specific data, since my Ultra5 runs FreeBSD, not Solaris. Right at the moment I write this, Ruby-Elf can properly parse the capabilities section with its flags, and report them back. Hopefully with no mistakes, since only basic support is in the regression test for now.

One thing I really want to implement in Ruby-Elf is versioning support, with the same API I’m currently using for GNU-style symbol versioning. This way it’ll be possible for Ruby-Elf based tools to access both GNU and Sun versioning information as if they were a single thing. Too bad I haven’t yet looked up how to generate ELF files with Sun-style versioning support. Oh well, it’ll be one more thing I’ll have to learn. Together with a way to set visibility with Sun Studio, to test the extended visibility support they have in their extended ELF format.

In general, I think my decision to go with Ruby for this was very positive, mostly because it makes it much easier to support new stuff by just writing an extra class and hooking it up, without needing “major surgery” every time. It’s easy and quick to implement new features, even if the tools require more time and more power to access the data (though with the recent changes I made to properly support OS-specific sections, I think Ruby-Elf is now much faster than it was before, and uses much less memory, as only the sections actually used are loaded). Maybe one day, once I consider it good enough, I’ll try to port it to some compiled language, using the Ruby version as a flow scheme, but I don’t think it’s worth the hassle.

Anyway, if you’re interested in Ruby-Elf and would like to see it improve even further, so that it can report further optimisations and similar things (like, for instance, something I planned from the start: telling which of the shared objects listed in NEEDED entries are useless, without having to load the file through ld.so and use the LD_* variables), I can ask you one thing and one thing only: a copy of Linkers and Loaders that I can consult. I tried preparing a copy from the original freely available HTML files for the Reader, but it was quite nasty to look at, nastier than O’Reilly’s freely-available eBooks (which are bad already). It’s in my wishlist if you want.

About gold and speed

Okay, yet another post about the gold linker. I finally checked out what I wanted: indeed, gold does not collapse duplicated substrings; my script reports this:

flame@enterprise ruby-elf % ruby -Ilib tools/assess_duplicate_save.rb /usr/lib/libbonobo-2.so
/usr/lib/libbonobo-2.so: current size 36369, full size 44794 difference 8425
flame@enterprise ruby-elf % ruby -Ilib tools/assess_duplicate_save.rb /usr/lib/libbonobo-2.so
/usr/lib/libbonobo-2.so: current size 44860, full size 44860 difference 0

I wonder about the size difference between the original and the new one; I actually didn’t expect a bigger size than the one my script reported. I suppose it could be that my script tends to count fully duplicated strings just once. After all, finding full duplicates is a lot easier than finding substrings, and quite a bit faster. Of course, it’s also true that you cannot have duplicates among the symbols’ names; you can only have duplicates among sonames, NEEDED entries and runpaths.

But for anybody who would like to try gold after Bernard’s post, pay attention! The binaries it generates seem to be prone to TEXTRELs, which is quite bad.

 * QA Notice: The following files contain runtime text relocations
 *  Text relocations force the dynamic linker to perform extra
 *  work at startup, waste system resources, and may pose a security
 *  risk.  On some architectures, the code may not even function
 *  properly, if at all.
 *  For more information, see http://hardened.gentoo.org/pic-fix-guide.xml
 *  Please include this file in your report:
 *  /var/tmp/portage/gnome-base/libbonobo-2.22.0/temp/scanelf-textrel.log
 * TEXTREL usr/bin/activation-client
 * TEXTREL usr/bin/echo-client-2
 * TEXTREL usr/bin/bonobo-activation-run-query
 * TEXTREL usr/libexec/bonobo-activation-server
 * TEXTREL usr/sbin/bonobo-activation-sysconf
 * TEXTREL usr/lib64/bonobo-2.0/samples/bonobo-echo-2

As the warning tells you, the text relocations will make the startup of programs slower, will make the programs waste memory, and will not work properly with hardened kernels.

Also, now that I look at the strings that are in the .dynstr of the gold-linked libbonobo-2 and not in the old version, it seems like gold introduces two extra NEEDED entries (libgcc_s.so.1 and ld-linux, the runtime linker), and I’m unsure where they come from, or why they should be there. There is also an extra end symbol that is not generated by standard ld. I don’t know why “.libs/libbonobo-2.so.0.0.0” is saved; I can’t see it referenced in the readelf -d output, so I’ll have to dig up whether it’s actually used at all.

As you can guess, gold is really not ready for primetime at all. Also, I’m starting to wonder how much of gold’s improved performance is actually due to skipping the collapsing of duplicated sub-strings.

Sincerely, I’m starting to think that a lot of the performance gain is due to skipping the sub-string collapsing; I’ve tried to think of it in multiple ways, and that particular feature is overly complex for modern systems. And the fact that gold has its biggest performance improvement with C++, where that feature is essentially useless (C++ mangled symbols rarely share a common ending sub-string), makes me think that a lot of the improvement lies there, rather than in the other rewriting efforts (C++ usage and an ELF-only linker).

I think I’ll try to cut out some time to hack at binutils and see if I can make the collapsing optional, which would be quite nice.

Some words about global variables

While almost all practical software engineering courses will tell you to entirely avoid the use of global variables in your software, sometimes there are reasons to use them. This usually applies only to programs, where you can easily assume there is no risk of re-entrancy problems; libraries, on the other hand, should really try to avoid global variables, especially global state variables, by their very nature.

I’ll let you guess which software makes happy use of global variables… yeah, you guessed right: xine-ui. Now, it’s actually understandable, as most of that stuff has no re-entrancy requirement, and just passing around a context structure would make the code messier. There are, though, quite a few things to note on the topic, and this is why I decided to write something about it.

There are at least three ways to store the global variables:

  • declaring and defining them one by one, so that they are all a bunch of different symbols at linker level;
  • declaring them inside a structure (optionally anonymous) and then define a single global pointer to that structure to be shared;
  • again declaring them inside a structure (again optionally anonymous) and then define a single instance of that structure to be shared.

Up to Wednesday, xine-ui only used a mix of the last two cases (and I was the one introducing the last one, for your information).

There is no “one size fits all” solution, as is almost obvious with software design problems, so most likely a properly designed piece of software will use a mix of these three cases. In xine-ui I introduced the first case early Wednesday morning, but let me first describe the pros and cons of each.

Starting from the first: the problem with it is easy to guess even after just reading the point; they are a bunch of different symbols. If you don’t properly hide your symbols, each of them gets exported and can thus be interposed, and in position-independent code each access has to go through the GOT (Global Offset Table), which is an expensive operation. Also, they can easily get spread between .bss, .data and .data.rel (for pointers), which makes it more likely that they use multiple in-memory pages.

The difference between the last two, instead – both providing a single symbol and thus being less expensive to access even without hiding the symbols – lies in how and where the memory is allocated. Using a global pointer to the state structure allows you to allocate and deallocate it as needed: for instance, if it’s the state of a dialog window that the user has to explicitly request and then closes, it can be allocated upon request and freed after the dialog is closed. The bulk of the memory area is thus allocated on the heap, but in the worst case, with nothing else ending up in .bss, it will cause a 4KiB page to be allocated just to hold the 4- or 8-byte pointer (so in most cases, if the structure is smaller than 4KiB, it’s still better to use a global instance).

On the other hand, when using a global instance of the state structure, it will be reserved in the .data, .data.rel or .bss section, depending on whether it contains pointers, or whether the structure is initialised as empty. It will thus most likely make better use of memory, as it will just use the page of that section, rather than allocating a page for a single pointer.

Now, of course, one would suppose that the first case is never useful, as the other two seem to have less invasive disadvantages. Still, it’s not so.

Let’s focus on comparing the first and the last cases, as they both use statically-allocated memory (in sections) rather than dynamically-allocated memory (heap). When you have a single huge structure instance that contains pointers as well as parameters with default values, and you’re building with PIC, the instance will fall into .data.rel, which – without prelink – will trigger a copy-on-write right at the start of the program, as the dynamic linker has to relocate it. This creates multiple problems: for instance, the definition of a single long array might fall partly on the original (disk-backed) page and partly on the new private page allocated for the process, resulting in a missed cacheline; or, depending on the implementation – not Linux’s case, as far as I can see, but I can certainly imagine this being used to mitigate the problem I just described – it might cause the copy-on-write of a huge .data.rel section which contains data that needs no relocation and that might even still have its default value. These problems are mitigated when you use multiple variables, because each of them will land in the right section according to its needs.

But the other main difference between the three cases is in the way the code is built to access the data:

  • in the case of a global pointer, the compiler will take the address of the variable containing the pointer, dereference it to get the address of the memory area where the structure reside, sum to that the offset of the variable to access in it, and then dereference the address just obtained to access the data;
  • in the case of a global instance, the compiler will take the address of the instance directly, then sum to that the offset of the variable, and dereference the address just obtained to access the data;
  • in the case of single variables, the compiler will just take the address of the variable and use that to access the data.

While most compilers will see to optimising the second case, so that the difference between the last two is minimal, if any, I find it better not to leave too much for the compiler to guess.

But the differences don’t end here; again, we can compare the global instance method with the single variables method, this time for what concerns ordering. When you declare a structure, the order of the elements is exactly the one you’ve written; if you don’t explicitly pack the structure, padding will be added so that the alignment of the variables is correct for the architecture. This means that this structure:

struct {
  char d;
  void *p;
} a;

will require 16 bytes on a 64-bit architecture (and 8 on a 32-bit architecture), wasting either 7 or 3 bytes to padding (this is why dwarves was created). While the x86 and amd64 architectures can access non-aligned data just as easily, most RISC architectures can’t, and even on x86/amd64, advanced features like SSE require aligned variables.

So what is the relation between this and the two methods I described? Well, as I said, the order you use for the structure’s members will remain unchanged; while this can help you keep variables that are accessed together on the same cacheline, the padding might waste quite a bit of space. The order of variable declarations, instead, is not imperative, and the linker can easily reorder them to fill the holes on its own. It can also make use of more advanced optimisations; for instance, you can use my method for reducing the linking of unused symbols.

If you really, really know that some variables are always accessed together, and thus should stay on the same cacheline without being reordered, then put them in a small structure. Not a huge one, just a small one with the minimum number of variables possible: it will be treated as a single element, and will lose some of the advantages of keeping the variables split (reordering, direct access to data), but as this should be the exception, it shouldn't be much of a problem.

How to avoid unused functions creeping into final binaries

I already wrote about using ld --gc-sections to identify unused code, but I didn't go into much detail on how to use that knowledge to keep unused code from creeping into your binaries.

Of course this is just one possible solution, and I'm sure there are others, but I find this one pretty nice and easy to deal with. It relies a bit on the linker being smart enough to drop unused units; if the linker you're using can't even get to this point, I'd sincerely suggest looking for a better one. GNU ld needs to be told to apply the right optimisation, but works fine afterwards.

Also, this trick relies on hiding the internal symbols, so either you have already updated your package to properly use visibility, or you should look into linker scripts and similar tools.

The trick is actually easy. Take as an example a piece of software like xine-ui (where I'm applying this trick as I write), which has some optional features. One of these is, for instance, the download of skins from the xinehq.de server; it's an optional feature, and it defines its own set of functions that are used only if the feature is enabled.

Up until a few days ago, these functions were always built; only the initialisation function was compiled out with #ifdef. As most of them were static, luckily they would have been optimised out by the compiler anyway. Unfortunately this is not always reliable, so some of them were still built.

What I did was actually simple: I took the functions used to download the data and moved all of them into their own unit. When support for downloadable skins is disabled, the whole file is not built, and the functions are not defined. In the two places where the two entrypoints of that code were called, I used #ifdef.

Up to now there is nothing really important about unused functions. As I said, the functions were mostly static, so most of them would have been removed already, with the exception of the two entrypoints, and one more function that was called by the closing entrypoint.

The unused functions start to appear now. There are two functions inside the “xine toolkit” (don't get me started on that, please) that are used just by the skin downloader dialog. If the skin downloader support is not built, they become unused. How to solve this? Adding an #ifdef inside the xitk code for a feature of xine-ui proper is not good design, so what?

It just requires knowing that xitk is built as an archive (static) library, .a, and that even without any further options, ld does not link objects found in archives if no symbol is needed from them. Move the two functions out into their own file: they will not be requested if the skin downloader is not used, so they won't be linked into the final file. 1KB of code gone.
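This behaviour is easy to verify with a toy archive (all names here are made up): the member whose symbols are never requested is simply not pulled in.

```shell
cat > ar_used.c <<'EOF'
int ar_used(void) { return 42; }
EOF
cat > ar_unused.c <<'EOF'
int ar_unused(void) { return 7; }
EOF
cat > ar_main.c <<'EOF'
int ar_used(void);
int main(void) { return ar_used() == 42 ? 0 : 1; }
EOF
gcc -c ar_used.c ar_unused.c
ar rcs libardemo.a ar_used.o ar_unused.o
# nothing references ar_unused.o, so the linker never extracts it
gcc ar_main.c libardemo.a -o ar_prog
if nm ar_prog | grep -q ar_unused; then echo pulled; else echo "ar_unused left out"; fi
```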

When you’re not linking through an archive file, though, you need to do something more: add --gc-sections. Don’t add other flags like -ffunction-sections and -fdata-sections, as they would create too much overhead; without those, the sections are split per file, and then collapsed together; that will allow you to discard an unused unit’s data even if you’re linking the object file explicitly.

The problem is usually that you find yourself with a huge number of files this way; but this is not necessarily a bad thing, as long as you don’t overdesign things like xine-ui does, where almost all the source units for the X11 frontend re-declare the gGui global pointer explicitly, rather than leaving that in the common include file…

Today: how to identify unused exported functions and variables

And here comes a new post for my technical readers, after a little digression about today’s Gentoo news: a new post about helping software maintenance through analysis of the resulting binaries.

This time the objective is to get rid of functions and variables that are not used in your codepath. The obvious way to check for this is to use the -Wunused flag during compilation: GCC will inform you about unused parameters, local variables and constants, static variables and constants, and static functions. Unfortunately it will not inform you when a non-static function, global variable or constant is not used, as they could be used in a different translation unit.

The linker, though, might know. The linker knows all the translation units that are being linked together, and can analyse what is needed and what is not, to an extent. Unfortunately, it can’t, by default, check for which functions, variables or constants are not used, but we can easily help it out.

First of all, we need the -Wl,--gc-sections option for the linker; be aware that this is not a safe flag to pass globally, so you should never put it in your LDFLAGS in make.conf. What this flag does is ask the linker to get rid of unused sections. This means, for instance, that if a file defines a few constants that are never used by anything in the executable you’re building, the linker can drop the .rodata section from that file.

There are a few problems with this. The first is that the sections (.data, .rodata, .text, and so on) often contain more than one variable, constant or function, so the linker can’t drop them unless all of their content is unused. The other problem is that the symbols exported by shared objects can be accessed from outside the object, and thus can’t be dropped.

The solution to the second problem is to properly use visibility. If functions are marked with default (or protected) visibility, they can be used externally, so the linker has to assume they are used; functions marked with hidden visibility, on the other hand, are checked for actual use, and can be discarded. So if you want to apply this method to a shared object, the prerequisite is to make sure that only the symbols that really are public are exported.
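A minimal sketch of that prerequisite (function names are hypothetical): with per-symbol visibility attributes, only the public symbol ends up in the dynamic symbol table.

```shell
cat > vis.c <<'EOF'
int internal_helper(void);
/* hidden: not exported, so the linker may discard it when unused */
__attribute__((visibility("hidden")))  int internal_helper(void) { return 42; }
/* default: part of the public ABI, always assumed to be used */
__attribute__((visibility("default"))) int api_call(void) { return internal_helper(); }
EOF
gcc -shared -fPIC vis.c -o libvis.so
# only api_call shows up among the exported (dynamic) symbols
if nm -D --defined-only libvis.so | grep -q internal_helper; then echo exported; else echo "internal_helper hidden"; fi
```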

For the first problem, the solution requires no change to the sources, which makes it quite a bit easier to take care of; that is good, because it affects executables (non-shared objects) too. As we said, the linker can only discard an entire section, so the obvious solution is to have every symbol in its own section: every function in a different .text section, every variable in a different .data section and every constant in a different .rodata section. It might sound difficult to do, but the compiler already has two flags for exactly this: -ffunction-sections and -fdata-sections.

Again: these are not flags you want to use globally. This is especially true since they tend to create way bigger files, which will certainly be slower.

The two flags above will tell the compiler to emit a different section for every symbol, which is exactly what we need: this way the linker will be able to discard single symbols (functions, variables, constants) if they are never referenced.
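Put together, the pipeline looks like this (file and function names are mine): compile with per-symbol sections, link with --gc-sections, and the unreferenced function disappears from the final binary.

```shell
cat > gc_lib.c <<'EOF'
int used_fn(void)   { return 1; }
int unused_fn(void) { return 2; }   /* never referenced anywhere */
EOF
cat > gc_main.c <<'EOF'
int used_fn(void);
int main(void) { return used_fn() == 1 ? 0 : 1; }
EOF
# one section per symbol, then let the linker garbage-collect them
gcc -ffunction-sections -fdata-sections -c gc_lib.c gc_main.c
gcc -Wl,--gc-sections gc_main.o gc_lib.o -o gc_prog
if nm gc_prog | grep -q unused_fn; then echo kept; else echo "unused_fn discarded"; fi
```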

Now when the linker links the shared object, or the executable, it will take care internally of discarding the unused sections, which will mean unused functions, variables and constants. This is nice, isn’t it? Unfortunately, while you can tweak your build system to use -Wl,--gc-sections whenever available (once you make sure that the sections that will be discarded are fine to be discarded), you shouldn’t force -ffunction-sections and -fdata-sections on users, as the output will be bigger and slower for no good reason.

So you should tweak your sources instead of relying on this behaviour. But before tweaking the sources, you need to know what to get rid of. The difficult way is to list the symbols with and without -Wl,--gc-sections, and then remove the symbols not listed in the version with it. This is not what I do ;) The easy way is to use another option that the GNU linker provides: -Wl,--print-gc-sections. With this option, all the discarded sections are reported, so you can easily see what the linker found not to be needed.

ccache x86_64-pc-linux-gnu-gcc -shared  .libs/xineplug_decode_gsm610_la-gsm610.o -Wl,--whole-archive ../../contrib/gsm610/.libs/libgsm610.a -Wl,--no-whole-archive  -Wl,--rpath -Wl,/home/flame/devel/repos/xine/xine-lib-1.2/enterprise/src/xine-engine/.libs ../../src/xine-engine/.libs/libxine.so -L/usr/lib64  -march=athlon64 -Wl,-z -Wl,defs -Wl,--gc-sections -Wl,-O1 -Wl,--as-needed -Wl,--hash-style=gnu -Wl,--version-script=/home/flame/devel/repos/xine/xine-lib-1.2/enterprise/linker.map -Wl,--print-gc-sections -Wl,-soname -Wl,xineplug_decode_gsm610.so -o .libs/xineplug_decode_gsm610.so
/usr/lib/gcc/x86_64-pc-linux-gnu/4.2.2/../../../../x86_64-pc-linux-gnu/bin/ld: Removing unused section '.text.gsm_add' in file '../../contrib/gsm610/.libs/libgsm610.a(add.o)'
/usr/lib/gcc/x86_64-pc-linux-gnu/4.2.2/../../../../x86_64-pc-linux-gnu/bin/ld: Removing unused section '.text.gsm_mult' in file '../../contrib/gsm610/.libs/libgsm610.a(add.o)'
/usr/lib/gcc/x86_64-pc-linux-gnu/4.2.2/../../../../x86_64-pc-linux-gnu/bin/ld: Removing unused section '.text.gsm_mult_r' in file '../../contrib/gsm610/.libs/libgsm610.a(add.o)'
/usr/lib/gcc/x86_64-pc-linux-gnu/4.2.2/../../../../x86_64-pc-linux-gnu/bin/ld: Removing unused section '.text.gsm_abs' in file '../../contrib/gsm610/.libs/libgsm610.a(add.o)'
/usr/lib/gcc/x86_64-pc-linux-gnu/4.2.2/../../../../x86_64-pc-linux-gnu/bin/ld: Removing unused section '.text.gsm_L_mult' in file '../../contrib/gsm610/.libs/libgsm610.a(add.o)'
/usr/lib/gcc/x86_64-pc-linux-gnu/4.2.2/../../../../x86_64-pc-linux-gnu/bin/ld: Removing unused section '.text.gsm_L_add' in file '../../contrib/gsm610/.libs/libgsm610.a(add.o)'
/usr/lib/gcc/x86_64-pc-linux-gnu/4.2.2/../../../../x86_64-pc-linux-gnu/bin/ld: Removing unused section '.text.gsm_L_sub' in file '../../contrib/gsm610/.libs/libgsm610.a(add.o)'
[snip]

As you can see, this is actual output from a xine-lib build with these options. I set xine-lib-1.2 to use -Wl,--gc-sections whenever available, for the plugins and for xine-lib itself; this is because when we use the internal copies of libraries there are already entire sections (even without -ffunction-sections or -fdata-sections) that are unused, so --gc-sections alone already improves the situation.

The message from the linker should be clear enough: the unused sections are all listed on stderr, and the name of each section tells you whether it’s a function, a variable or a constant, as the name of the original section is used as a prefix: .text for functions, .data for variables, .rodata for constants. When there’s a number after a variable or constant name, like .5659, it means it’s a static variable or constant local to a function. Which function, I’m afraid I don’t know how to find out offhand.

When I have more time I’ll see if I can get a script to do the work, acting even on non-visibility-enabled shared objects, although that will most likely require a few more performance-crippling GCC flags during the build.

We can consider this part of a phase akin to profiling, run on specially-compiled code which is not meant for serious production use. It can be quite helpful, though, to identify which parts of the code are being maintained and compiled for no reason at all. Or to identify code that should be used and is not.

One thing for which KDEPIM sucks

I’m referring to KDEPIM from KDE 3; I certainly hope this kind of problem is fixed in KDE 4, and if it isn’t, that would likely be a good reason for KDE to reform itself in a new way.

So, last night I started working on a simple script (that for now uses a mixture of Ruby and readelf calls) that loads all the symbols present in the libraries on the system, and then checks for duplicates, to see which symbols would collide if loaded in the same address space.

A bit of history on this. If you ever looked into Michael Meeks’s patches for the infamous -Bdirect option, he talked about its incompatibility with the technique of interposing. Symbol interposing is a way to provide different implementations of the same interface in different libraries, allowing you to switch between one and the other by changing the way they are linked, or by using LD_PRELOAD. A common case of interposing involves the threading libraries: the C library usually has weak symbols for the various pthread_* functions, apart from pthread_create, which is the one actually used to create the threads; this way, a library that can be used both with and without threading can still define its own mutexes and use them, the calls becoming no-ops when no threads are used.

To use interposing, the same symbol is present in more than one library; usually the library that is always linked in has a weak symbol, while the implementations have normal symbols. When used consciously, this is called interposing; when it happens (far more often) without conscious design, it is instead a symbol collision. A symbol collision is bad, because a library expecting a certain function to behave in a certain way might get a totally different behaviour because of the collision.

Even when the symbol is just the same, for the same function, symbol collisions require more work from the linker to resolve; hash values and prelink don’t always help, so collisions should usually be avoided. One way to avoid them without renaming the functions, when they are only used internally, is to use hidden visibility. This solves the issue, but needs to be implemented properly to do any good.

So, my tool checks for symbol collisions by gathering all the symbols from all the libraries installed on my system into a table of an SQLite database, and then counting how many times the same symbol is present. Of course there are false positives: cases in which the presence of different symbols with the same name implies neither interposing nor collisions. That’s the case with most plugin infrastructures: the plugins export one or more specific symbols with a given name, and the loader knows to look for them, binding them in a per-plugin data structure. In the case of xine, this is achieved through the xine_plugin_info structure present in every plugin. To avoid reporting those as collisions (they are not), I also added a suppressions file, where I can declare regular expressions of symbols (for some files) that need not be counted as collisions.

The output of the script showed me this:

Symbol soap_bool2s(soap*, bool) present 3 times
  /usr/kde/3.5/lib64/kde3/kio_groupwise.so
  /usr/kde/3.5/lib64/libkcal_groupwise.so.1.0.0
  /usr/kde/3.5/lib64/libkabc_groupwise.so.1.0.0
Symbol soap_in_int(soap*, char const*, int*, char const*) present 3 times
  /usr/kde/3.5/lib64/kde3/kio_groupwise.so
  /usr/kde/3.5/lib64/libkcal_groupwise.so.1.0.0
  /usr/kde/3.5/lib64/libkabc_groupwise.so.1.0.0
Symbol soap_s2bool(soap*, char const*, bool*) present 3 times
  /usr/kde/3.5/lib64/kde3/kio_groupwise.so
  /usr/kde/3.5/lib64/libkcal_groupwise.so.1.0.0
  /usr/kde/3.5/lib64/libkabc_groupwise.so.1.0.0

and so on for a long time. This means the same internal library is linked into those three shared objects, replicated. This is bad, because it wastes memory (the code is not shared between instances of programs using those three libraries) and on-disk space (the code is replicated in three files when it should be in just one).

I’ve prepared a patch to kdepim-kresources (the package these libraries belong to) which, instead of declaring libgwsoap an internal library, declares it as an installed library, so that the three libraries link to libgwsoap rather than linking it in.

The results?

flame@enterprise ~ % qsize kdepim-kresources
kde-base/kdepim-kresources-3.5.6: 186 files, 25 non-files, 64362.24 KB
flame@enterprise ~ % qsize kdepim-kresources
kde-base/kdepim-kresources-3.5.6: 191 files, 25 non-files, 37207.401 KB

The first is before the change, the second after the change.
And this is far from being the sole part of kdepim having such a stupid problem.

Sorry KDE guys, but KDEPIM 3.x is a failure when it comes to properly building and installing stuff.