What to do with shared code

In software engineering we are all told that code reuse is a good thing: it saves us from reinventing the wheel every time we need a function, and it confines each bug to a single place, which in turn makes it easier to propagate the fixes.

The basic way to do this is to write a library that takes care of the common tasks. This is why things like glib exist; there are libraries focused on particular topics and more generic ones. Myself, I try to use libraries as much as possible rather than inventing my own functions, even for things like configuration file parsers, for which I use libconfuse.

Sometimes you have code you just want to share between a few programs that are part of the same software package. The easiest way to do that is to create a “commodity library”, which gets built inside the source tree and then statically linked into the executables or libraries that need it.

There are a few caveats with the use of commodity libraries, though. The first, obvious one is that it increases the size of the executables, since you end up copying the same code into each of them. This is fine for simple replacement functions (think of what gnulib provides), but it’s certainly not a good idea once the amount of code in the commodity library starts getting bigger. Another problem is that the function names in those commodity libraries are often quite generic, which makes it much easier for executables to have symbol collisions with other, independent libraries.

There are thus a few things you might want to do when you use commodity libraries. The first is to create a shared commodity library. Such a library can be called lib${project}(private|core|shared), and its soname can change between releases without making life any more difficult for users or packagers, as nothing outside the software itself should be using it. This solves the problem of duplicating the code, and if you’re careful enough to do that, you’d also add a prefix (like ${project}) to the function names, to avoid symbol collisions.
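
As a rough sketch of the prefixing idea, assume a hypothetical project called foo that keeps its shared helpers in a private library. The function names below are invented for illustration and do not come from any real package; the point is only that each exported name carries the project prefix.

```c
/* Sketch of helpers living in a private shared commodity library of a
 * hypothetical project "foo". Every exported function carries the
 * project prefix, so a generic name such as strip_newline() cannot
 * clash with a symbol of the same name in an unrelated library. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Remove one trailing '\n' from s, in place; returns the new length. */
size_t foo_private_strip_newline(char *s)
{
    assert(s != NULL);
    size_t len = strlen(s);
    if (len > 0 && s[len - 1] == '\n')
        s[--len] = '\0';
    return len;
}

/* Count the fields of a line separated by sep; an empty line still
 * counts as one (empty) field. */
size_t foo_private_count_fields(const char *line, char sep)
{
    assert(line != NULL);
    size_t fields = 1;
    for (; *line != '\0'; line++)
        if (*line == sep)
            fields++;
    return fields;
}
```

Each program in the package would then link against the one shared copy instead of carrying its own.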

Another thing to consider is using -fvisibility=hidden to at least hide the symbols of the internal functions of static commodity libraries, and -Wl,--gc-sections to discard their unused code when they are linked into the executables, which at least reduces the amount of wasted space.
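
To illustrate what -fvisibility=hidden changes, here is a small sketch, assuming GCC or Clang on an ELF platform. The FOO_EXPORT macro and both function names are hypothetical. With the flag on the compiler command line, only functions explicitly marked with default visibility stay exported; the helper becomes hidden without any change to its definition.

```c
#include <assert.h>

/* Hypothetical export macro: GCC and Clang understand the visibility
 * attribute; other compilers get an empty definition. */
#if defined(__GNUC__)
# define FOO_EXPORT __attribute__((visibility("default")))
#else
# define FOO_EXPORT
#endif

/* Internal helper shared between the source files of the library.
 * When built with -fvisibility=hidden it is left out of the dynamic
 * symbol table of the resulting shared object. */
int foo_clamp(int value, int low, int high)
{
    assert(low <= high);
    if (value < low)
        return low;
    if (value > high)
        return high;
    return value;
}

/* Public entry point: stays exported even under -fvisibility=hidden. */
FOO_EXPORT int foo_volume_percent(int raw)
{
    return foo_clamp(raw, 0, 100);
}
```

Building this with something like gcc -fvisibility=hidden -fPIC -shared and then inspecting the result with nm -D should list foo_volume_percent but not foo_clamp.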

I’m not making this up: with the current run of my collision-checking script, which finally also checks all the executable files, I was able to identify a few projects suffering from this problem:

  • samba, which is a very common project, and is often considered one of the best opensource projects, seems to build a huge commodity library and just use it in all its 22 executables and libraries. It turns up in the output of my script a lot of times. Upstream bug #5219;

  • mysql, another “well known” opensource project, suffers from the same issue; even the resolveip command, which seems quite trivial to me, exports a huge number of symbols from its commodity library;

  • cmake, which I have criticised quite a bit already; this time the problem seems to be that cmake also has a commodity library, linked into at least cpack, cmake, ccmake and ctest;

  • cdrkit; I’m not sure whether it comes from the previous Sillyng maintenance, but cdrkit’s files also seem to have this kind of problem;

  • wireshark, which, even though it has a libwireshark.so, seems to have the same symbols duplicated in the main GTK interface and in a few other console tools;

  • ghostscript: even though it has a libgs.so library, the gs binary does not link to it, and duplicates tons of symbols (including those of the internal libraries);

  • graphviz has libgvc.so and libgvc_builtins.so, which duplicate quite a few symbols, although the two seem to be different implementations; it might be worth looking into;

  • rosegarden seems to share quite a bit of code between its main program and the rosegardensequencer program.

If I find the time, I might take care of patching these projects to use shared commodity libraries, which would save space on disk, although my box feels quite old when I’m doing this kind of job. Help with these projects is certainly welcome.

On the other hand, there are a few ways to reduce the impact of these silly things without changing the code. I have already shown the diagnostic usage of --gc-sections, but there is also an actual production use for it, which I hinted at in that post too.

In xine, I use -Wl,--gc-sections in production (well, in the 1.2 branch at least), so that the unused code of the internal copies of libraries (such as gsm’s code, nosefart’s and libdvdnav’s, just to name the three that are still present in xine-lib even in Gentoo) is dropped where possible, without nasty performance hits (which is why I’m not using -fdata-sections and -ffunction-sections).

Using the same trick on the above-mentioned projects is likely to produce similar results. I tried it on samba, and this is the result for the smbpasswd command:

flame@enterprise bin % ls -l `which smbpasswd` smbpasswd
-rwxr-xr-x 1 flame flame 1397016 21 gen 20:18 smbpasswd
-rwxr-xr-x 1 root  root  2029128 11 dic 09:20 /usr/bin/smbpasswd

As you can see, the size dropped from 2029128 to 1397016 bytes, which is almost a third less (about 31%). And this is without using -fvisibility=hidden on the commodity library, so a lot of symbols are still present because they are exported.

In my opinion, in cases like these we should be forcing --gc-sections and/or -fvisibility=hidden inside the ebuild, to improve the quality of the build for users. Users should not set those flags themselves, as they can break software, but we should be good enough to know when to use them and when not to.