Blockers abuse

I’ve already written about some of the differences between my and Patrick’s tinderboxes, one of which is that my tinderbox does not only install one package, but tries to install all of them at once. This is a necessity for me to have enough material to work on with my library collisions script, and still have so many side effects that makes it funny to work with.

The first problem is that sometimes packages don’t get merged together because of file collisions, which most of the time are caused by packages that install commands with name too generic, and some other times because they regenerate files that should not be present in the final image (like iconv’s charset.alias that gets generated on Gentoo/FreeBSD systems with no good reason at all).

The second problem derives from the way the first problem is handled. When two packages install a file with the same name, rather than renaming one of them or both, it’s customary to actually “fix” the problem by adding blockers to each of the two packages so that they cannot get installed together. While it’s certainly better to have it expressed that way rather than having the merge to fail after the compilation and install phases, it’s not really a solution since it still disallows having the two packages on the same system. While this is acceptable for packages like the different GhostScript implementations that apply for the same task, this is not much of a solution when the packages are entirely independent one from the other and have very different tasks.

I also have found one particular package (pnet) which had a very funky solution to the collision between that and boehm-gc, considering that it was installing a private copy of that. Obviously this was not the proper fix by a mile’s look.

If you have two packages that block each other you have a few different ways to deal with that; if they provide the same function, you might as well install them with a prefix and then write an eselect module to choose between one or the other (which is something that ghostscript could very well be doing); if they only install executables with generic name, they might be changed to prefix the command name with the package name. But sometimes these commands are not to be used by the users at all, and are rather internal commands used by the scripts for processing; in those cases, it would be a nice idea to make those get into /usr/libexec/$PN/ so that they are taken out of the users’ path, and won’t collide one with the other.

While dealing with packages that install colliding files is not so easy, there is need for developers to deal with them in a less “works for me” way, and think more of the general picture, as it is, there are enough packages in the tree that blocks each other with no real good reason, and this is upsetting.

Why would an executable export symbols?

I think this post might be interesting for those people interested in trying to get all the performance power out of a box, without breaking anything in the process.

I’ve blogged before about the problems related to exporting symbols from final executables, but I haven’t really dug deep enough to actually provide useful information to developers and users about what those exported symbols represent, for an executable.

First of all, let’s start to say that an executable under Linux, and on most modern Unixes, is just the same kind of file of a shared object (shared library, if you prefer, those which are called DLL under Windows, if you come from a Windows background). And exactly as shared libraries, they can export symbols.

Exported symbols are resolved by the dynamic – runtime – linker, through the process of dynamic bindings, and thy might collide. I’ll return on the way the runtime linker works at a different moment. For now let’s just say that the exported symbols require some extra step to be taken during the execution of a program, and that the process takes time.

Executables don’t usually need to export symbols, and they usually don’t export symbols at all. Although there are rare cases where executables are required to export symbols, for instance because they are used by some of the libraries they link to as a “callback” from the library to the program, or for C++ programs for RTTI to properly work, most of the times the symbols are exported just because of libtool.

By default when you link a program, it doesn’t get its symbols exported, they are hidden and thus resolved directly at buildtime, for those symbols present in the source files themselves. When you add code to a convenience library that is built with libtool, then something changes, and the symbols defined inside that library are exported even when linking it statically inside the final executable.

This causes quite a few drawbacks. and as I said, is not usually used for anything:

  • the symbols are resolved at runtime through dynamic binding, which takes time, even if usually very little for a normal system, repeated time wasted during dynamic binding might actually become a good deal of time;

  • the symbols might collide with a library loaded afterward, this is for instance why recode breaks PHP;

  • using --gc-sections won’t help much because exported symbols are seen as always used, and this might increase the amount of code added to the executable with no good reason;

  • prelink will likely set the wrong addresses for symbols that collide, which in turn will drop off the improvement of using prelink entirely, at least for some packages.

The easy solution for this is for software packages to actually check if the compiler supports hidden visibility, and if it does, hide all the symbols but for the ones in the public API of their libraries. In case of software like cmake that install no shared objects, hidden visibility could be forced by the ebuild, but to give back to the community, the best thing is to get as much software as possible to use hidden visibility, thus reducing the amount of symbols that gets exported on both binaries and shared libraries.

I hope these few notes might actually help Gentoo maintainers to understand why I’m stressing on this point. It would be nice if we all could improve the software we maintain, even one step at a time.

As for what concerns my linking collision scripts, the bad packages’ list got a few more entries today: KViewShell with djvulibre, Karbon with gdk-pixbuf, and gcj, with both boehm-gc and libltdl.

And now I can actually start seeing the true collisions, like gfree symbol, having two different definitions in (fontforge) and poppler/libkpdfpart (xpdf code), or the scan_token symbol in ghostscript with a completely different definition in libXfont/libt1.

Talking about libXfont and libt1 (or t1lib). I wonder if there is hope in the future for one to use the other, rather than both use the same parser code for type1 fonts. I’ll have to check the FreeDesktop bugzilla tomorrow to see if it was ever discussed. At the moment they duplicate a lot of symbols one with the other.

I have to say, PostgreSQL is an important speed improvement, that will allow me to complete my task in shorter time. Now I’m waiting for Patrick to run my script over the whole set of packages in Gentoo, that might actually be something. If only there was an easy way to make building and testing code faster (for COW reduction) without changing hardware, that would be awesome. Unfortunately that I need to do locally :(

Increasing analyser performance through PostgreSQL

As I wrote before, sqlite has big performance issues which makes running my analysis tool a way too long task.

To avoid wasting time running it, and to be able to run it on bigger datasets, I’ve been planning to port it to use PostgreSQL instead. While having the script almost self-containing was a nice thing, the huge amount of time wasted by sqlite, and its way to convert a CPU-bound problem into an I/O-bound one, forced me to take a different approach.

PostgreSQL is not really that common compared to MySQL, but that’s what I do run for my own software, and what I’ll continue working with in the future, so I’ve preferred that. I’m gladly taking patch to split the code handling database out of the harvesting logic, so that one can choose between PostgreSQL and other implementations, even SQLite again.

The time requested to do harvesting and analysis is now down to less than two hours, from an original of a value between 2 and 6 hours. Isn’t it nice?

I’m looking at the output by hand at the moment, it’s actually listing a lot more data than last time because this time I had the idea of running it as root, so that it could access data about suid non-readable programs. I’ll improve the suppression file a lot tonight and run it again when I’m done removing false positives.

Interestingly enough, there seems to be a lot of binaries exporting xmalloc, xrealloc and xstrdup symbols. It’s interesting because it shows that the original malloc, realloc and strdup functions are not considered safe enough by quite a bit of software, and also because I think those should not be exported, but rather used inline, or at least static or hidden. Can we be certain that all those binaries don’t use different meanings for xmalloc and the like?

Symbol xmalloc@@ (64-bit UNIX System V ABI AMD x86-64 architecture) present 12 times

You can see that a few packages using that symbol are part of GNU. If there is anyone here who’s a GNU insider and can try to get people to hide that symbol, or use it inline, that would be quite helpful :)

And recode seems to be again in the list of the bad guys like for flex symbols ; I wonder if it’s still maintained, or if there is need to actually fork it to improve it.

Flex and (linking) conflicts, or a possible reason why PHP and Recode are so crashy

So, this past week I was off teaching a course about programming and Linux at a company I worked for for a while now. Some of the insight about what people need to know and rarely do know are helping me to decide what to focus on in the future in my blog (and not limited to).

Today, though, I want to blog about not something I explained during the course, but something that was explained to me, about Bison and Flex. It’s related to the output of my linking script:

Symbol yytext@@ (64-bit UNIX System V ABI AMD x86-64 architecture) present 7 times

yytext, together with a few other yy symbols are generated by Flex when generating the code for a lexer (for what most of my readers are concerned, these are part of a parser). These symbols are private to a single parser, and should not be exported to other parsers. I wasn’t sure about their privateness so I haven’t reported them up to now, but now I am sure: they should not be shared between two different parsers.

Both librecode and PHP export their parser’s symbols, which would create a situation where the two parsers are sharing buffers and… well, let’s just say you don’t want to share a plate between someone eating Nutella, and someone else eating Pasta, would you?

This might actually cause quite a few problems, and as hoffie said, recode is used by PHP and is often broken when used together other extensions. I can’t be sure the problems are all limited to this, but this is certainly a good point to start if we want to fix those.

Easy way out, would be to make sure that php executables don’t export symbols that extensions don’t need to use; proper way out would be adding to that also proper visibility inside recode itself, but I wonder if it’s still maintained, release 3.6 is quite old, and even patching it is a hard task as it doesn’t even recognise AMD64 by default.

What to do with shared code

In software engineering we all get told that code re-use is a good thing, as it allows us not to reinvent the wheel every time we need a function, and also limits the bugs in one point, which in turn makes it easier to propagate the fixes.

The basic way to do this is to write a library which takes care of common things. This is why stuff like glib exists; there are libraries working on particular topics and more generic libraries. Myself, I try to use stuff from libraries as much as possible, rather than inventing my own functions, even for stuff like configuration file parsers, for which I use libconfuse.

Sometimes you have code you just want to share between a few programs which are part of the same software package. The easiest way to do that is to create a “commodity library”, which gets built inside the source tree, and then statically linked in the executables or libraries needing it.

There are a few caveats with the use of commodity libraries though; the first obvious one is that it increases the size of the executables: you end up copying the code in multiple executables; while this is fine for stuff like simple replacement functions, think of what gnulib, it’s certainly not a good idea if the amount of code you’re writing in the commodity function starts getting bigger. Another problem is that often the function names in those commodity functions are quite generic, making it way more easier for executables to have symbol collisions with other indipendent libraries.

There are thus a few things you might want to do when you use commodity libraries then. The first is to create a shared commodity library. Such a library can be called lib${project}(private|core|shared) and its soname can change between releases without making it anymore difficult for users or packagers, as nothing outside the software itself should be using it. This solves the problem with sharing the code, and if you’re good enough to that, you’d also be using a prefix (like ${project}) to the function names, to avoid symbol collisions.

Another thing you might want to consider, is to use -fvisibility=hidden to at least hide the symbols from the internal functions from static commodity libraries, and -Wl,--gc-sections to discard the unused code out of them when linking them back in the executables, at least to reduce the amount of space wasted.

I’m not posting this out of my fantasy, with the current run of my collision checking script, that finally also checks all the executable file, I was able to identify a few projects suffering from this problem:

* samba, which is a very common project, and often consider one of the best opensource projects, seems to write a huge commodity library and just use in all its 22 executables and libraries. It turns out in the output of my script a lot of times. Upstream bug #5219

  • mysql, another “well known” opensource project suffer from the same issue, even the resolveip command, which seems quite trivial to me, is exporting a huge amount of symbols from its commodity library;

  • cmake, which I criticised quite a bit already; this time the problem seems to be that cmake also has a commodity library linked in, at least, cpack, cmake, ccmake and ctest.

  • cdrkit, not sure if it comes from the previous Sillyng maintainance, but cdrkit’s files also seems to have this kind of problem.

  • wireshark, even though it has a seems to have the same symbols duplicated in the main GTK interface and in a few other console tools.

  • ghostscript, even though it has a library, the gs binary does not link to it, and duplicates tons of symbols (included the internal libraries);

  • graphviz has and which duplicates quite a few symbols; but the two of them seems to be two different implementations; might be worth looking into;

  • rosegarden seem to share quite a bit of code between its main program and the rosegardensequencer program;

Having time, I might take care of patching these projects to use shared commodity libraries, saving space on disk, my box feels quite old when I’m doing this kind of job. Help within these projects is certainly welcome.

On the other hand, there are a few ways to reduce the impact of these silly things without changing the code. I already shown the diagnostic usage of --gc-sections, but there is also an actual production use of this, of which I hinted in that post too.

In xine, I use -Wl,--gc-sections in production (well in 1.2 branch at least), so that the unused code of the internal copies of libraries (such as gsm’s code, nosefart’s, libdvdnav, just to name the three that are still present in xine-lib even in Gentoo) is dropped, if possible, without nasty performance hits (so I’m not using -fdata-sections and -ffunction-sections).

Using the same trick on the above-mentioned projects is likely to produce similar results. I tried it on samba, and this is the result for the smbpasswd command:

flame@enterprise bin % ls -l `which smbpasswd` smbpasswd
-rwxr-xr-x 1 flame flame 1397016 21 gen 20:18 smbpasswd
-rwxr-xr-x 1 root  root  2029128 11 dic 09:20 /usr/bin/smbpasswd

As you can see the size was almost cut in half. And this is without using -fvisibility=hidden on the commodity library, so a lot of symbols are still present because they are exported.

In my opinion, in case like those we should be forcing --gc-sections and/or -fvisibility=hidden inside the ebuild, to improve the quality of the build to the users. Users should not use those flags, as they can break software, but we should be good enough to know when to use them or not.

Why oh why, sqlite…

Today I returned working on my ELF symbols’ collision detection script. Basically a simple script to tell you about shared objects or programs that have symbols with the same name, which will cause symbols’ collisions at runtime.

It’s a nice script not only for that, but also because it allows you to spot internal copies of libraries, as I wrote a long time ago. I resumed working on this after talking a bit with Patrick, to allow executing it on an arbitrary set of directories, rather than executing it only on the currently-running system.

There are already a few improvement in the git repository, notably now the harvest script does not only look for files ending with .so, but checks all the ELF files. Unfortunately to do so, it has to open the files one by one at the moment, and then close them. As soon as i can I want to rewrite this to just do one pass through the files, so that they are not all open twice.

There is one bad performance problem with SQLite (3) though: whenever I do an INSERT (which happens quite a lot as I’m doing one per file, and then one per symbol), sqlite seems to open the directory where its database file is generated (/home/flame/mytmpfs), create a journal file for the transition, and the unlink it.

It would certainly take less time if:

  • the directory was opened once, and left open till needed;
  • the journal file was created once, and truncated every time as needed until the database is closed (and then could be unlinked).

Anyway, I’ll be working on the script today, as I’m taking a day off from work, and then tomorrow I’ll leave to someone else to handle the bugs I’ll be filing ;)

Update: I have one extra reason to hate SQLite now. While running the consumer script (link_collisions.rb) on the SQLite database which I just completed generating on a tmpfs filesystem, there are a few complex queries being ran. To cache those, sqlite tries to be smart, and creates a temporary file:

flame@enterprise openoffice % ls /proc/31582/fd -l
total 0
lrwx------ 1 flame flame 64 21 gen 19:19 0 -> /dev/pts/5
l-wx------ 1 flame flame 64 21 gen 19:19 1 -> /home/flame/mytmpfs/collisions
lrwx------ 1 flame flame 64 21 gen 19:18 2 -> /dev/pts/5
lrwx------ 1 flame flame 64 21 gen 19:19 3 -> /home/flame/mytmpfs/symbols-database.sqlite3
lrwx------ 1 flame flame 64 21 gen 19:19 4 -> /var/log/faillog
lrwx------ 1 flame flame 64 21 gen 19:19 5 -> /var/tmp/etilqs_R4Cy7sH2VuPIYDH (deleted)

The problem is that /var/tmp is on disk, rather than in memory. I had the doubt when I’ve seen that the ruby instance was not CPU bound like I thought (all in memory should cause it to be CPU bound rather than I/O bound).

Note that if it used /tmp, I was also using tmpfs there. I wonder why on earth it uses /var/tmp for that stuff.

(And this reminds me that I need a ew faster box, any help with that would make me happy ;) ).

Later I’ll post a summary of what I gathered from this run. I can say as a preview that I filed two bugs already.