Sub-optimal optimisations?

While writing Implications of pure and constant functions I’ve been testing some code that I was expecting to be optimised by GCC. I was surprised to find a lot of my testcases were not optimised at all.

I’m sincerely not sure whether these are due to bugs in GCC, to me expecting the compiler to be smarter than it can feasibly be right now, or to the “optimised” code being more expensive than the code that is actually generated.

Take for instance this code:

int somepurefunction(char *str, int n)
  __attribute__((pure));

#define NUMTYPE1 12
#define NUMTYPE2 15
#define NUMTYPE3 12

int testfunction(char *param, int type) {
  switch(type) {
  case 1:
    return somepurefunction(param, NUMTYPE1);
  case 2:
    return somepurefunction(param, NUMTYPE2);
  case 3:
    return somepurefunction(param, NUMTYPE3);
  }

  return -1;
}

I was expecting the compiler to identify cases 1 and 3 as identical (by coincidence) and merge them into a single branch. This would actually have made debugging quite hard (as you wouldn’t be able to discern the two cases), but it’s a nice reduction in code size, I think. Neither on x86_64 nor on Blackfin, with either GCC 4.2 or 4.3, are the two cases actually merged; the duplicated code is left in.

Another piece of code that wasn’t optimised as I was expecting it to be is this:

unsigned long my_strlen(const char *str)
  __attribute__((pure));
char *strlcpy(char *dst, const char *str, unsigned long len);

char title[20];
#define TITLE_CODE 1
char artist[20];
#define ARTIST_CODE 2

#define MIN(a, b) ( (a) < (b) ? (a) : (b) )

static void set_title(const char *str) {
  strlcpy(title, str, MIN(sizeof(title), my_strlen(str)));
}

static void set_artist(const char *str) {
  strlcpy(artist, str, MIN(sizeof(artist), my_strlen(str)));
}

int set_metadata(const char *str, int code) {
  switch(code) {
  case TITLE_CODE:
    set_title(str);
    break;
  case ARTIST_CODE:
    set_artist(str);
    break;
  default:
    return -1;
  }

  return 0;
}

I was expecting a single call to my_strlen() here, as it’s a pure function, and in both branches it’s the first call. I know it’s probably complex code once expanded, but still, GCC was at least better at this than Intel’s and Sun’s compilers!

Both Intel’s and Sun’s compilers, even at the -O3 level, emit four calls to my_strlen(), as they can’t even optimise the double evaluation hidden in the ternary operation! Actually, Sun’s compiler comes last for optimisation, as it doesn’t even inline set_title() and set_artist().

Now, I haven’t tried IBM’s PowerPC compiler as I don’t have a PowerPC box to develop on anymore (although I would think a bit about the YDL PowerStation, given enough job income in the next months — and given Gentoo being able to run on it), so I can’t say anything about that, but for these smaller cases, I think GCC is beating other proprietary compilers under Linux.

I could check Microsoft’s and Borland’s (CodeGear’s) compilers, but that is a bit out of my particular scope at the moment.

While I did think a bit before about supporting non-GNU compilers for stuff like xine and unieject, I’m starting to think it’s not really worth the time spent on that at all, if this is the result of their compilation…

Some numbers about duplicate strings collapsing in ld

I haven’t had time to test gold yet, but at least Bernard saved me from finding out how to do so :)

I plan on doing it today, as I just finished a script, based again on my ruby-elf, that calculates how much space is saved by collapsing duplicated strings into one.

Interestingly enough, the biggest differences don’t come from glibc as I expected, although it is in third place:

/usr/lib64/libbonobo-2.so: current size 36369, full size 44794 difference 8425
/usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi/auto/SVN/_Core/_Core.so: current size 15464, full size 19296 difference 3832
/lib64/libc-2.7.so: current size 21430, full size 24361 difference 2931
/usr/lib64/libasound.so: current size 37976, full size 40698 difference 2722
/usr/lib64/libMagickWand.so: current size 17056, full size 19740 difference 2684

At least in the case of ALSA, the problem seems to lie in __-prefixed entries. But I see a lot of them having the dlsym name in them, which seems to be a way to mix different ALSA libraries… probably something that hidden visibility could take care of, which would be very nice for a lot of reasons: 38KB of strings seems like a huge amount of memory to me.

For what concerns libMagickWand, it seems to be wrapping functions from libMagick itself, which explains why they are all duplicated with a Magick prefix added to them.

In the case of the SVN Perl bindings, which are present with almost all of their modules, it seems to me like rather naïve handling of symbols on swig’s part: there are lots of _wrap_-prefixed symbols, followed by the name of the original symbol.

As you have guessed by now, most of the libraries with huge savings from collapsing strings are those implementing wrappers around other libraries. This is because they usually just add a prefix to the symbol, and the names of defined and imported symbols are handled in a single pool of strings. If instead of a prefix they added a suffix, there would be no saving at all.

What does all of this tell me? Well, first of all, I think that the extra memory used up by disabling the collapsing (I care about the memory, not the on-disk size, unless we’re talking about embedded systems) may well be worth the faster linking time; eventually, it would be nice to be able to use the collapsing just for the libraries that implement those wrappers.

On a different level, one could ask why all the _wrap_ functions in Subversion’s Perl bindings are exported at all, as one would expect them to be needed only internally. Perl bindings often seem to suffer from this problem: Net-SSLeay has a similar one, even though the prefix is XS_Net__SSLeay_ instead.

And one can easily see a probable mistake made by stuff that wraps symbols: adding a suffix rather than a prefix. When using a prefix, if you need to reduce the size, the collapsing can kick in; when using a suffix, it simply can’t, at least the way ELF files are handled right now. The same goes for changing the name of the original symbol, for instance capitalising its first letter, or dropping a common prefix to replace it with another, which is what the XML-LibXML binding does (and thus doesn’t share a lot of strings that it could share, although again the proper solution would be not to export them at all).

Anyway, the substantial differences I see are mostly in Perl bindings (in a lot of Perl bindings, to be honest), so I suppose one way to make the difference really unimportant for most desktop/server uses would be to fix XS and swig to hide their symbols. Then making it optional to collapse the duplicates would be a really, really good idea.

Some words about global variables

While almost all practical software engineering courses will tell you to avoid the use of global variables entirely, sometimes there are reasons to use them. This usually applies only to programs, where you can safely assume there are no re-entrancy problems; libraries, on the other hand, should really try to avoid global variables, especially global state variables, by their very nature.

I’ll let you guess which software makes happy use of global variables… yeah, you guessed right: xine-ui. Now, it’s actually understandable, as most of that stuff has no re-entrancy requirement, and just passing around a context structure would make the code messier. There are, though, quite a few notes on the topic, and this is why I decided to write something about it.

There are at least three ways to store the global variables:

  • declaring and defining them one by one, so that they are all a bunch of different symbols at the linker level;
  • declaring them inside a structure (optionally anonymous) and then defining a single global pointer to that structure to be shared;
  • again declaring them inside a structure (again optionally anonymous) and then defining a single instance of that structure to be shared.

Up to Wednesday, xine-ui used only a mix of the last two cases (and I was the one who introduced the last one, for your information).

There is no “one size fits all” solution, as is almost obvious with software design problems, so most likely a properly designed piece of software will use a mix of these three cases. In xine-ui I introduced the first case early Wednesday morning, but let me first describe the various pros and cons.

Starting from the first, the problem with it is easy to guess even just from reading the point: they are a bunch of different symbols. If you don’t properly hide your symbols, each of them is accessed indirectly through the GOT (Global Offset Table; the PLT, Procedure Linkage Table, plays the same role for functions), and this is an expensive operation; the symbols also get exported and thus can be interposed, and if you’re using PIE, the same applies to the executable itself. Also, they can easily get scattered between .bss, .data and .data.rel (for pointers), which makes it more likely that they span multiple in-memory pages.

The difference between the last two, both providing a single symbol and thus less expensive to access even without hiding the symbols, lies instead in how and where the memory is allocated. Using a global pointer to the state structure allows you to allocate and deallocate it as needed; so, for instance, if it’s the state of a dialog window that the user has to explicitly request and then close, it can be allocated upon request and freed after the dialog is closed. The bulk of the memory area is thus allocated on the heap, but in the worst case of nothing else ending up in .bss, it will cause a 4KiB page to be allocated just to keep the 4- or 8-byte pointer (so in most cases, if the structure is smaller than 4KiB, it’s still better to use a global instance).

On the other hand, when using a global instance of the state structure, it will be reserved in the .data, .data.rel or .bss section, depending on whether there are pointers in it, or on whether the structure is initialised as empty. It will thus most likely make better use of memory, as it will just use the page for that section, rather than allocating a page just for a single pointer.

Now of course one would suppose that the first case is never useful, as the other two seem to have less invasive disadvantages. Still, it’s not so.

Let’s focus on comparing the first and the last cases, as they both use statically-allocated memory (in sections) rather than dynamically-allocated memory (heap). When you have a single huge structure instance that contains pointers and parameters with default values, and you’re building with PIC, the instance will fall into .data.rel, which – without prelink – will trigger a copy-on-write right at program start, as the dynamic linker has to relocate it. This creates multiple problems: for instance, a single long array might fall partly on the original (disk-backed) page and partly on the new private page allocated for the process, resulting in a missed cacheline; or, depending on the implementation – not Linux’s case, as far as I can see, but I can certainly see uses of this to mitigate the problem I just described – it might cause the copy-on-write of a huge .data.rel section which contains data that need not be relocated and that might even still have its default value. These problems are mitigated when you use multiple variables, because each of them will end up in the right section as needed.

But the other main difference between the three cases is in the way the code is built to access the data:

  • in the case of a global pointer, the compiler will take the address of the variable containing the pointer, dereference it to get the address of the memory area where the structure resides, add to that the offset of the variable to access within it, and then dereference the address just obtained to access the data;
  • in the case of a global instance, the compiler will take the address of the instance directly, then sum to that the offset of the variable, and dereference the address just obtained to access the data;
  • in the case of single variables, the compiler will just take the address of the variable and use that to access the data.

While most compilers will see to optimising the second case, so that the difference between the last two is minimal, if any, I find it better not to leave the compiler too much to guess.

But the difference doesn’t end there; again we can compare the global instance method with the single variables method, this time for what concerns ordering. When you declare a structure, the order of the elements is exactly the one you’ve written; if you don’t explicitly pack the structure, padding will be added so that the alignment of the variables is correct for the architecture. This means that this structure:

struct {
  char d;
  void *p;
} a;

will require 16 bytes on a 64-bit architecture (and 8 on a 32-bit one), wasting either 7 or 3 bytes depending on the alignment (this is why dwarves was created). While the x86 and amd64 architectures can access non-aligned data just as easily, most RISC architectures can’t, and even on x86/amd64, advanced features like SSE and similar require alignment of variables.

So what is the relation between this and the two methods I described? Well, as I said, the order you use in the structure remains unchanged: while this can help you order the variables so that ones accessed together fall in the same cacheline, the padding might waste quite a bit of space. The order of single variable declarations, instead, is not binding, and the linker can easily reorder them to fill the holes on its own. It can also make use of advanced optimisations, for instance my method for reducing unused symbols linking.

If you really, really, really know that some variables are always accessed together, and thus should stay on the same cacheline and not be reordered, then put them in a small structure. Not a huge one, just a small one with the minimum variables possible; it will be treated like a single element, and will lose some of the advantages of having the variables split (reordering, direct access to data), but as this is usually an exception, it shouldn’t be much of a problem.

How to avoid unused functions creeping into final binaries

I already wrote about using ld --gc-sections to identify unused code, but I didn’t go into much detail on how to use that knowledge to avoid unused code creeping into your binaries.

Of course this is just one possible solution; I’m sure there are more, but I find this one pretty nice and easy to deal with. It relies a bit on the linker being smart enough to drop unused units, but if the linker you’re using can’t even get to this point, I’d sincerely suggest looking for a better one. GNU ld needs to be told to apply the right optimisation, but should work fine afterwards.

Also this trick relies on hiding the internal symbols, so either you have updated your package to properly use visibility, or you could look into linker scripts and similar tools.

The trick is actually easy. Take as an example a piece of software like xine-ui (where I’m applying this trick as I write), which has some optional features. One of these is, for instance, the download of skins from the xinehq.de server; it’s an optional feature, and it defines its own set of functions that are used only if the feature is enabled.

Up until a few days ago, these functions were always built; just the initialisation one was removed by #ifdef. As most of them were static, luckily they would have been optimised out by the compiler already. Unfortunately this is not always reliable, so some of them were still built.

What I did was actually simple: I took the functions used to download the data and moved all of them into their own unit. When the support for downloadable skins is disabled, the whole file is not built, and the functions are never defined. In the two places where the two entrypoints of that code were called, I used #ifdef.

Up to now there is nothing really noteworthy about unused functions. As I said, the functions were mostly static, so most of them would have been removed already, with the exception of the two entrypoints, plus one more function that was called by the closing entrypoint.

The unused functions start to appear now. There are two functions inside the “xine toolkit” (don’t get me started with that, please) that are used just by the skin downloader dialog. If the skin downloader support is not built, they become unused. How to solve this? Adding an #ifdef inside the xitk code for a feature of xine-ui proper is not a good design, so what?

You just need to know that xitk is built as an archive library (static library, .a), and that, even without any further option, ld does not link objects found in archives if no symbol is needed from them. Move the two functions out into their own file: they will not be requested if the skin downloader is not used, so they won’t be linked into the final file. 1KB of code gone.

When you’re not linking through an archive file, though, you need to do something more: add --gc-sections. Don’t add other flags like -ffunction-sections and -fdata-sections, as they’ll create too much overhead; without those, the sections are split per file, and then collapsed altogether, which allows you to discard the unused units’ data even if you’re linking the object file explicitly.

The problem is usually that you find yourself with a huge number of files this way; but this is not necessarily a bad thing, as long as you don’t overdesign stuff like xine-ui does, where almost all the source units for the X11 frontend re-declare the gGui global pointer explicitly, rather than leaving that in the common include file…

Parameters handling and COWs.

One common problem while fixing copy-on-write pages is finding an array of structures that hold parameter names and their values, usually loaded either from the network or from configuration files. One of the classical declarations of those structures is:

struct {
    char name[16];
    int value;
} parameters[20] = {
  { "foo", 34 },
  { "bar", 50 }
};

I put a character array in the struct to show that using one does not magically make everything work.

What is wrong with this structure? Well, it is not constant, so it ends up in .data, which is a COW section. You can’t really make it constant either, as the values of the parameters have to change, if you’re using it to set the parameters for the program to work with. As long as the user doesn’t change the default parameters it’s not going to use extra memory, luckily, but as soon as the user changes even just one parameter, COW is triggered.

This is actually quite common, it’s even used by Xorg drivers like radeon, often with extra flags like type, and unions, and so on.

Most people would react to this by using pointers to strings, so that the strings get into .rodata and just their pointers are added to the parameters object. In the best case, those pointers will end up in .data; when using PIC, they’ll instead be added to .data.rel; the difference between the two is, again, that the former is only copied over when changing the values from the default, while the latter is copied over every time the executable or shared library is loaded (unless using prelink, if that works at all).

My proposed solution to that is simply to use two different objects. Somehow one limitation of C shows up here: being unable to write in columns ;)

static const char param_names[20][16] = { "foo", "bar" };
int param_values[20] = { 34, 50 };

If there is more to the names, one might use a structure instead, think for instance of this example, where the parameter can be either an integer, a float or a string:

static const struct {
    char name[16];
    enum { PARAM_INT, PARAM_FLOAT, PARAM_STRING } type;
} parameters[20] = {
  { "foo", PARAM_INT },
  { "bar", PARAM_FLOAT },
  { "baz", PARAM_STRING }
};

union {
    int intval;
    float floatval;
    char *stringval;
} param_values[20] = {
  { .intval = 34 },
  { .floatval = 50.0 },
  { .stringval = "default" }
};

Now, in these last two examples, the metadata about the parameters (name and type) can be written off to .rodata, while the values are saved in .data; at the same time, save for the last one with the string, the parameter values will be shared as long as the defaults are not changed. Nice, uh?

The only thing to remember is to make sure the indices of the two arrays match. This is why I think it’s a pity that there is no way to write in columns in C; it would make maintaining this code quite a bit easier.

When the list of parameters to maintain is quite big, the best way to tackle the problem is to write a script that translates a table of metadata and default values into C code. It should be quite easy to do with tools like Perl, Ruby or AWK. This way there is little to no overhead in maintaining the two arrays separately, and memory use will be optimised.

As for the reason why I used a union in the last example: I have seen, more than once, structures defined the same way as that struct, carrying all three kinds at once. Those structures tend to waste a lot of space. Once you’re certain that only one of the three entries is used for a given parameter, using a union also limits the amount of memory the table uses. This is one of the things unions were designed for, after all.

Many minor issues make a big issue

All of you who have followed my blog for a medium to long time know that I often take a look at minor issues: issues that have no direct impact on users. Summing them up, though, can easily give you big issues that do have a direct impact on users.

Take my crusade about copy-on-write pages. They don’t really make any difference taken one by one, but sum them up across a system, and they might as well make a huge difference.

It’s similar to the issue with wakeups that Intel brought to the table. Each one does not make much difference on its own, for a single piece of software, but add all of them together and you have a huge difference.

I wonder if we’ll ever get to a point where all these issues are well known by all free software developers, so that only newly-started projects need to be profiled to identify them. Unfortunately, I’m afraid this is not going to happen anytime soon. But it would be really interesting if common projects like Samba and Xorg started paying more attention to these issues.

I wonder what the best option to improve the situation is. I actually thought about starting a wiki listing all the problems of the apps, and then attaching patches for possible resolutions. I would hate to load a wiki up on my blog, though. As an option, I was thinking about using Git and Emacs’s Muse, rebuilding the pages statically after every commit. Then it would also be possible to add stuff like graphs to show the reduction in memory usage, or startup time, or whatever, which would make it easier to show what the changes improved.

I think solar was right in suggesting that I produce more visual output of the changes. I installed R yesterday, and I hope to be able to start learning how to use it soon, so that I can propose some graphs.

Anyway, any comment on this is very welcome, as I’m starting to get quite tired of fighting these issues alone, with almost no feedback from upstream.

Should only libraries be concerned about PIC?

Position Independent Code is a technique used to create executable code that, as the name implies, is independent of the starting address where it is loaded (the position). This means that the pointers to data and functions in the code, as well as the default values of pointers, cannot be assumed to be the same as the ones set during the build process in the executable file (or the library).

What this means in practical terms is that, as you can’t be sure how many and which libraries a program might load at runtime, libraries are usually loaded at dynamically-assigned addresses, so the code must not statically assume one particular base address. When a shared library is built to expect a static base address (thus not using PIC), it has to be relocated by the runtime loader, and that causes changes to the .text section, which breaks the assumption that sections should be either writable or executable, but not both at the same time.

When using PIC, instead, access to symbols (data and functions) goes through a global offset table (GOT), so the code does not need to be relocated; only the GOT and the pointers stored in the data sections do. As you can guess, this kind of indirect access takes more time than the direct access that non-PIC code uses, and this is why a lot of people hate the use of PIC on x86 systems (on the other hand, shared libraries not using PIC not only break the security assumption noted above, making it impossible to use mitigation technologies like NX – PaX in Linux, W^X in OpenBSD – they also increase the memory usage of software, as all the .text sections containing code will need to be relocated and thus duplicated by copy-on-write).

Using hidden visibility, it is possible to reduce the performance hit caused by GOT access, by using PC-relative addressing (relative to the position in the file), if the architecture supports it, of course. It does not save much for what concerns pointers in the data sections, as they will still need relocations. This is what causes arrays of strings to be written to .data.rel.ro sections rather than .rodata sections: the former gets relocated, the latter doesn’t, so it is always shared.

So this covers shared libraries, right? Copy-on-write in shared libraries is bad, shared libraries use PIC, and pointers in the data sections of PIC code cause copy-on-write. But does it stop with shared libraries?

One often useful security mitigation is choosing a random base address for executables: instead of loading the code always at the same address, the address is randomised between executions. This is useful because an attacker can’t just guess the address the program will be loaded at. But applying this technique to non-PIC code causes relocations in the .text section, which in turn breaks another security mitigation technique, so it is not really a good idea.

Introducing PIE (Position Independent Executable).

PIE is not really anything new: it only means that executables, too, are built with PIC enabled. This means that while arrays of pointers to characters are often considered fine for executables, and are written to .rodata (if properly declared) in non-PIC code, the problem with them reappears when using PIE.

It’s not much of a concern usually, because the people using PIE tend to be concerned with security more than performance (after all, it is slower than non-PIC code), but I think it’s important for software correctness to actually start considering that an issue too.

In this light, not only is it important to replace pointers to characters with character arrays (and similar), but hiding the symbols of executables becomes even more important, to reduce the hit caused by PIC.

I’m actually tempted to waste some performance on my next box and start using PIE all over, just to find this kind of problem more easily… yeah, I’m a masochist.

How many implementations of MD5 do you have in your system?

Anybody who has ever looked into protocols and checksums, or even just downloaded the ISO file of a distribution a few years ago, knows what an MD5 digest looks like. It’s obvious that the obnoxious presence of MD5 in our lives up to a few years ago (when it was declared insecure, and work started to replace it with SHA1 or SHA256) caused a similarly obnoxious presence of it in code.

Implementing MD5 checksumming is not really an easy task; for this reason, more or less the same code gets reused from one library to another, and so on. This code has an interface that is more or less initialise, update, finish; the names of the functions might change a bit between implementations, but the gist is that.

Now, the most common provider of these functions is certainly OpenSSL (which also implements SHA1 and other checksum algorithms), but it’s not the only one. On FreeBSD, the functions are present in a system library (I forget which one), and the same seems to have been true of the previous Linux C library (libc.so.5, used before glibc). A common problem with OpenSSL, for instance, is its GPL-incompatibility.

Now, this means that a lot of software reimplemented its own MD5 code, using the same usual interface, with slightly different names: MD5Init, MD5_init, md5_init, md5init, md5_update, md5_append, md5_final, md5_finish and so on. All these interfaces are likely slightly different from one another, to the point of not being drop-in replacements, and thus cause problems when they collide with one another.

On every system, then, there are multiple implementations of MD5, which contributes to memory waste; having a single implementation, easily shared between programs, would be nicer.

These packages implement their own copy of MD5 algorithms, and export them (sometimes correctly, sometimes probably by mistake): OpenSSL (obviously), GNUTLS (obviously, as it’s intended as semi-drop-in replacement for OpenSSL), GeoIP (uh?), Python (EH!? Python already links to OpenSSL, why on earth doesn’t it use SSL for MD5 functions really escapes me), python-fchksum (and why does it not use Python’s code?), Wireshark (again, it links to both GNUTLS and OpenSSL, why it does implement its own copy of MD5 escapes me), Kopete (three times, one for Yahoo plugin, one for Oscar – ICQ – plugin, and a different one for Jabber, it goes even better as KDE provides an MD5 implementation!), liblrdf, Samba (duplicated in four libraries), Wine (for advapi32.dll reimplementation, I admit that it might be requested for advapi32 to export it, I don’t know), pwdb, and FFmpeg (with the difference that FFmpeg’s implementation is not triggering my script as it uses its own interface).

I’m sure there are more implementations of MD5 on a system; as I said, they are still obnoxiously present in our lives, for legacy protocols and data, and the range of areas where MD5 checksums are used is quite wide (cryptography, network protocols, backup safety checks, multimedia – for instance the common checksum of decoded data to ensure proper decoding in lossless formats – and authentication). Likely a lot of implementations are hidden inside the source code of software, and it is likely impossible to get rid of them all. But it would certainly be an interesting task if someone wanted to: sharing MD5 implementations means that optimising them for new CPUs would improve performance in all software using them.

If I weren’t sure that most developers would hate me for doing so, I’d quite like to open bugs for all the packages, pointing out possible areas of improvement in upstream code. As it is, contacting all upstreams, and creating a good lot of tracker accounts, is something I wouldn’t like to do in my free time, but I can easily point out improvement areas for a lot of code. I just opened python-fchksum (which is used by Portage, which in turn means that if I can optimise it, I can optimise Portage), and besides the presence of MD5 code, I can see a few more things that I could improve in it. I’ll likely write the author with a patch and a note, but it’s certainly not feasible for me to do so for every package out there, alone and in my free time…

Why would an executable export symbols?

I think this post might be interesting for those people interested in trying to get all the performance power out of a box, without breaking anything in the process.

I’ve blogged before about the problems related to exporting symbols from final executables, but I haven’t really dug deep enough to actually provide useful information to developers and users about what those exported symbols represent, for an executable.

First of all, let’s start by saying that an executable under Linux, and on most modern Unixes, is just the same kind of file as a shared object (a shared library, if you prefer; the kind of file that is called a DLL under Windows, if you come from a Windows background). And exactly like shared libraries, executables can export symbols.
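
You can check this yourself with nm from binutils; /bin/bash is just a convenient example target here (and, as it happens, one of the known offenders):

```shell
# The dynamic symbol table of an executable, read exactly like that of a
# library: -D selects the dynamic symbols, --defined-only filters out the
# symbols imported from libc and friends, leaving only what is exported.
nm -D --defined-only /bin/bash | head
```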

Exported symbols are resolved by the dynamic – runtime – linker through the process of dynamic binding, and they might collide. I’ll return to how the runtime linker works some other time. For now, let’s just say that exported symbols require some extra steps to be taken during the execution of a program, and that this process takes time.

Executables don’t usually need to export symbols, and most of the time they don’t export any at all. There are rare cases where executables are required to export symbols, for instance because some of the libraries they link to use them as a “callback” from the library into the program, or because C++ programs need them for RTTI to work properly; but most of the time symbols are exported just because of libtool.

By default, when you link a program its symbols are not exported: they are hidden, and references to symbols defined in the program’s own source files are thus resolved directly at build time. When you move code into a libtool convenience library, though, something changes: the symbols defined inside that library get exported even when it is linked statically into the final executable.

This causes quite a few drawbacks and, as I said, it’s not usually needed for anything:

  • the symbols are resolved at runtime through dynamic binding, which takes time; each binding is usually very cheap on a normal system, but the time wasted on repeated bindings can add up to a good deal;

  • the symbols might collide with a library loaded afterwards; this is, for instance, why recode breaks PHP;

  • using --gc-sections won’t help much, because exported symbols are considered always used, and this might increase the amount of code in the executable for no good reason;

  • prelink will likely set the wrong addresses for symbols that collide, which in turn throws away the benefit of using prelink entirely, at least for some packages.

The easy solution is for software packages to actually check whether the compiler supports hidden visibility and, if it does, hide all symbols except the ones in the public API of their libraries. In the case of software like cmake, which installs no shared objects, hidden visibility could simply be forced by the ebuild; but to give back to the community, the best thing is to get as much software as possible to use hidden visibility upstream, thus reducing the number of symbols exported by both executables and shared libraries.
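
As a sketch of what that buys you (file and symbol names are hypothetical), compare the dynamic symbol table of the same code built with and without -fvisibility=hidden:

```shell
# A library with one public entry point and one internal helper.
cat > lib.c <<'EOF'
int internal_helper(void);

/* Part of the public API: explicitly marked as exported. */
__attribute__((visibility("default")))
int api_entry(void) { return internal_helper() + 1; }

/* Internal: with -fvisibility=hidden this is no longer exported. */
int internal_helper(void) { return 41; }
EOF
gcc -shared -fPIC lib.c -o libdefault.so
gcc -shared -fPIC -fvisibility=hidden lib.c -o libhidden.so
# By default both symbols end up in the dynamic symbol table...
nm -D --defined-only libdefault.so | grep -E 'api_entry|internal_helper'
# ...with hidden visibility, only the annotated public entry remains:
nm -D --defined-only libhidden.so | grep -E 'api_entry|internal_helper'
```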

I hope these few notes help Gentoo maintainers understand why I keep stressing this point. It would be nice if we could all improve the software we maintain, even one step at a time.

As for what concerns my linking collision scripts, the bad packages’ list got a few more entries today: KViewShell with djvulibre, Karbon with gdk-pixbuf, and gcj with both boehm-gc and libltdl.

And now I can actually start seeing the true collisions, like the gfree symbol, which has two different definitions in libgunicode.so (fontforge) and in poppler/libkpdfpart (xpdf code), or the scan_token symbol in ghostscript, which has a completely different definition in libXfont/libt1.

Talking about libXfont and libt1 (or t1lib): I wonder if there is hope, in the future, for one to use the other, rather than both shipping the same parser code for Type1 fonts. I’ll have to check the FreeDesktop bugzilla tomorrow to see if it was ever discussed. At the moment they duplicate a lot of symbols between them.

I have to say, PostgreSQL brings an important speed improvement, which will allow me to complete my task in less time. Now I’m waiting for Patrick to run my script over the whole set of packages in Gentoo; that might actually be something. If only there were an easy way to make building and testing code faster (for COW reduction) without changing hardware, that would be awesome. Unfortunately, that I need to do locally :(

A few statistics out of cowstats on media-libs libraries

Today is a short work day: my connection is unstable to the point of being unusable, and tonight I’m going out with a couple of friends of mine.

So I simply prepared a (complex) one-liner to gather some statistics about the media libraries I have installed on my system, out of their static archives (so it’s not the worst case).

qlist -I -v media-libs -C |
  while read pkg; do
    pkgname=${pkg##*/}; mkdir -p ${pkgname}; pushd ${pkgname}
    qlist ${pkg} | egrep '\.a$' |
      while read lib; do
        libname=$(basename ${lib}); mkdir -p ${libname}; pushd ${libname}
        ar x ${lib}
        ruby -I ~/devel/repos/ruby-elf ~/devel/repos/ruby-elf/cowstats.rb --statistics --total *.o > ~/mytmpfs/libstats/${pkgname}:${libname}
        popd
      done
    popd
  done

The result of this, grepped around a bit, gave me these statistics:

a52dec-0.7.4-r5:liba52.a:  Total 4593 bytes of variables in copy-on-write sections
alsa-lib-1.0.15:libasound.a:  Total 13164 bytes of variables in copy-on-write sections
alsa-lib-1.0.15:smixer-ac97.a:  Total 192 bytes of variables in copy-on-write sections
alsa-lib-1.0.15:smixer-hda.a:  Total 192 bytes of variables in copy-on-write sections
alsa-lib-1.0.15:smixer-python.a:  Total 2144 bytes of variables in copy-on-write sections
alsa-lib-1.0.15:smixer-sbase.a:  Total 104 bytes of variables in copy-on-write sections
alsa-oss-1.0.15:libalsatoss.a:  Total 28 bytes of variables in copy-on-write sections
alsa-oss-1.0.15:libaoss.a:  Total 128 bytes of variables in copy-on-write sections
alsa-oss-1.0.15:libossredir.a:  Total 112 bytes of variables in copy-on-write sections
audiofile-0.2.6-r3:libaudiofile.a:  Total 6408 bytes of variables in copy-on-write sections
faac-1.26-r1:libfaac.a:  Total 9612 bytes of variables in copy-on-write sections
faad2-2.6.1:libfaad.a:  Total 8138 bytes of variables in copy-on-write sections
flac-1.2.1-r2:libFLAC.a:  Total 1044 bytes of variables in copy-on-write sections
fontconfig-2.5.0-r1:libfontconfig.a:  Total 2196 bytes of variables in copy-on-write sections
gd-2.0.35:libgd.a:  Total 144508 bytes of variables in copy-on-write sections
giflib-4.1.6:libgif.a:  Total 1043 bytes of variables in copy-on-write sections
ilmbase-1.0.1:libHalf.a:  Total 1 bytes of variables in copy-on-write sections
ilmbase-1.0.1:libIex.a:  Total 8 bytes of variables in copy-on-write sections
ilmbase-1.0.1:libIlmThread.a:  Total 16 bytes of variables in copy-on-write sections
ilmbase-1.0.1:libImath.a:  Total 311 bytes of variables in copy-on-write sections
imlib-1.9.15-r2:libImlib.a:  Total 28 bytes of variables in copy-on-write sections
imlib2-1.4.0:id3.a:  Total 8 bytes of variables in copy-on-write sections
imlib2-1.4.0:libImlib2.a:  Total 17468 bytes of variables in copy-on-write sections
imlib2-1.4.0:xpm.a:  Total 8 bytes of variables in copy-on-write sections
jasper-1.900.1-r1:libjasper.a:  Total 13703 bytes of variables in copy-on-write sections
lcms-1.17:liblcms.a:  Total 11002 bytes of variables in copy-on-write sections
libao-0.8.8:libalsa09.a:  Total 96 bytes of variables in copy-on-write sections
libao-0.8.8:libao.a:  Total 590 bytes of variables in copy-on-write sections
libao-0.8.8:liboss.a:  Total 72 bytes of variables in copy-on-write sections
libao-0.8.8:libpulse.a:  Total 80 bytes of variables in copy-on-write sections
libart_lgpl-2.3.19-r1:libart_lgpl_2.a:  Total 16 bytes of variables in copy-on-write sections
libcddb-1.3.0-r1:libcddb.a:  Total 2472 bytes of variables in copy-on-write sections
libdvdcss-1.2.9-r1:libdvdcss.a:  Total 2064 bytes of variables in copy-on-write sections
libdvdread-0.9.7:libdvdread.a:  Total 2108 bytes of variables in copy-on-write sections
libexif-0.6.16-r1:libexif.a:  Total 51096 bytes of variables in copy-on-write sections
libgpod-0.6.0:libgpod.a:  Total 20 bytes of variables in copy-on-write sections
libid3tag-0.15.1b:libid3tag.a:  Total 1 bytes of variables in copy-on-write sections
liblrdf-0.4.0:liblrdf.a:  Total 57376 bytes of variables in copy-on-write sections
libmikmod-3.1.11-r5:libmikmod.a:  Total 7798 bytes of variables in copy-on-write sections
libmp4v2-1.5.0.1:libmp4v2.a:  Total 288 bytes of variables in copy-on-write sections
libsdl-1.2.13:libSDL.a:  Total 49795 bytes of variables in copy-on-write sections
libsndfile-1.0.17-r1:libsndfile.a:  Total 18651 bytes of variables in copy-on-write sections
libsvg-0.1.4:libsvg.a:  Total 552 bytes of variables in copy-on-write sections
libvorbis-1.2.0:libvorbis.a:  Total 4 bytes of variables in copy-on-write sections
netpbm-10.40.0:libnetpbm.a:  Total 10612 bytes of variables in copy-on-write sections
openexr-1.6.1:libIlmImf.a:  Total 1205 bytes of variables in copy-on-write sections
raptor-1.4.16:libraptor.a:  Total 3365 bytes of variables in copy-on-write sections
sdl-gfx-2.0.16:libSDL_gfx.a:  Total 6184 bytes of variables in copy-on-write sections
sdl-image-1.2.6:libSDL_image.a:  Total 67259 bytes of variables in copy-on-write sections
sdl-mixer-1.2.8:libSDL_mixer.a:  Total 495 bytes of variables in copy-on-write sections
sdl-net-1.2.7:libSDL_net.a:  Total 7 bytes of variables in copy-on-write sections
sdl-pango-0.1.2:libSDL_Pango.a:  Total 44 bytes of variables in copy-on-write sections
sdl-ttf-2.0.9:libSDL_ttf.a:  Total 19 bytes of variables in copy-on-write sections
smpeg-0.4.4-r9:libsmpeg.a:  Total 117478 bytes of variables in copy-on-write sections
t1lib-5.1.1:libt1.a:  Total 45191 bytes of variables in copy-on-write sections
tiff-3.8.2-r3:libtiff.a:  Total 7746 bytes of variables in copy-on-write sections
x264-svn-20070924:libx264.a:  Total 16832 bytes of variables in copy-on-write sections
xvid-1.1.3-r2:libxvidcore.a:  Total 209384 bytes of variables in copy-on-write sections

As you can see, there’s quite a lot of room for improvement; and remember that these statistics are taken from non-PIC objects, the PIC objects will probably have even more, within .data.rel sections.

I’ll be writing a few more patches for this, trying to reduce the numbers as much as I can, even if sometimes it won’t bring much improvement to the actual shared library, either because of symbol relocations, or just because one or two static variables are enough to cause the 4KB copy-on-write page to be allocated anyway.

My overlay now features patches to mark tables as constant for: libvorbis, libtheora, libpng, giflib, speex, flac, libdca and libmpcdec. I’ll add a few more either tonight when I come home or tomorrow, or anyway in the next weeks.

All the patches were sent to the respective upstreams, so that they can be applied there and become available to everybody in the future.

Hopefully some of them might just be applied in Gentoo too, soon :)