More threads is not always a good solution

You might remember that some time ago I took over unpaper, which I used (and sort of still use) to pre-filter the documents I scan to archive in electronic form. While I’ve been very busy in the past few months, I’m now trying to clear out my backlog, and it turns out that a good deal of it involves unpaper. There are bugs to fix and features to implement, but also patches to evaluate.

One of the most recent patches I received is designed to help with the performance of the tool, which is definitely not what Jens, the original author, had in mind when he came up with the code. Better performance is something that just about everybody would love to have at this point. Unfortunately, it turns out that the patch is not working out as intended.

The patch is two-fold: on one side, it makes (optional) use of OpenMP to parallelize the code in the hope of speeding it up. Given that most of the code is a bunch of loops, it seemed obvious that using multithreading would speed things up, right? Well, the first try after applying it showed very quickly that it slows things down, at least on Excelsior, which is a 32-core system. While it would take less than 10 seconds for the first test to run without OpenMP, it would take over a minute with it, spinning up all the cores for a 3000% CPU usage.

A quick test shows that forcing the number of threads to 8, rather than leaving it unbound, makes it actually faster than the non-OpenMP variant. This means that there are a few different variables in play that need to be tuned for performance to improve. Without going into profiling the code, I can figure out a few things that can go wrong with unchecked multithreading:

  • extensive locking while the worker threads are running, either because they are all accessing the same memory page, or because the loop needs a “reduced” result (e.g. a value has to be calculated as the sum of values calculated within a parallelized loop; see the sketch after this list); in the case of unpaper I’m sure both situations happen fairly often in the looping codepath.
  • cache thrashing: as the worker threads jump around the memory area being processed, memory is no longer fetched linearly;
  • something entirely more complicated.
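
To make the “reduced result” case concrete, here is a minimal sketch (the function and variable names are made up, not taken from unpaper’s code): a sum computed across a parallelized loop, where OpenMP’s reduction clause gives each thread a private partial sum instead of having all of them fight over a shared accumulator, and where the thread count is capped at 8 as in the experiment above.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical example: summing pixel values across a parallel loop.
 * reduction(+:sum) keeps a private partial sum per thread and combines
 * them at the end, avoiding per-iteration locking on a shared
 * accumulator; num_threads(8) caps the thread count rather than
 * leaving it unbound. Compile with -fopenmp. */
uint64_t sum_pixels(const uint8_t *image, size_t count) {
    uint64_t sum = 0;
    #pragma omp parallel for reduction(+:sum) num_threads(8)
    for (size_t i = 0; i < count; i++)
        sum += image[i];
    return sum;
}
```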

Besides making it obvious that doing the “stupid” thing and just making all the major loops parallel is not going to cut it, this situation is bringing me to the point where I will finally make use of the Using OpenMP book that I got a few years back and had barely started reading before figuring out that OpenMP was not yet ready for prime time. Nowadays OpenMP support on Linux has improved to the point that it’s probably worth taking another look at it, and I guess unpaper is going to be the test case for it. You can expect a series of blog posts on the topic at this point.

The first thing I noticed while reading about the way OpenMP handles shared and private variables is that the const indication is much stronger when using OpenMP. The reason is that if you tell the compiler that a given datum is not going to change (it’s a constant, not a variable), it can safely assume that direct access from all the threads will work properly; the variable is shared by default among all of them. This is something that, for non-OpenMP programs, is usually inferred from the SSA form; I assume that, for whatever reason, OpenMP makes SSA weaker.
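
A minimal sketch of that data-sharing behaviour, with made-up names rather than unpaper’s actual code:

```c
#include <stddef.h>

/* Because 'shift' is const, OpenMP can treat it as shared by default:
 * every worker thread reads it directly, without an explicit
 * data-sharing clause. The loop index is private to each thread. */
void darken(unsigned char *image, size_t len) {
    const int shift = 1;
    #pragma omp parallel for
    for (size_t i = 0; i < len; i++)
        image[i] >>= shift;
}
```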

Unfortunately, this also means that there is one nasty change that might be required to make code friendlier to OpenMP, and that is a change in the prototypes of functions. The parameters to a function are, as far as OpenMP is concerned, variables, and that means that unless you declare them const, it won’t be sharing them by default. Within a self-contained program like unpaper, changing the signatures of the functions so that parameters are declared, for instance, const int is painless; for a library, though, it would be API breakage.
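
As an illustration, the kind of change involved would look like this (a hypothetical signature, not one of unpaper’s actual functions):

```c
/* Before: as far as OpenMP is concerned, 'width' and 'height' are
 * plain variables, so they are not shared by default. */
void despeckle(int width, int height, unsigned char *image);

/* After: declaring the parameters const documents that they never
 * change, so OpenMP can share them among threads by default. In a
 * public library header this would be an API-visible change. */
void despeckle(const int width, const int height, unsigned char *image);
```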

Anyway, just remember: adding multithreading is not the silver bullet you might think it is!

P.S.: I’m currently trying to gauge interest on a volume that would collect, re-edit, and organize all I’ve written up to now about ELF. I’ve created a page on leanpub where you can note down your interest, and how much you think such a volume would be worth. If you would like to read such a book, just add your interest through the form; as soon as at least a dozen people are interested, I’ll spend all my free time working on the editing.

Why would an executable export symbols?

I think this post might be interesting for those people trying to squeeze all the performance out of a box without breaking anything in the process.

I’ve blogged before about the problems related to exporting symbols from final executables, but I haven’t really dug deep enough to actually provide useful information to developers and users about what those exported symbols mean for an executable.

First of all, let’s start by saying that an executable under Linux, and on most modern Unixes, is just the same kind of file as a shared object (a shared library, if you prefer; what is called a DLL under Windows, if you come from that background). And exactly like shared libraries, executables can export symbols.

Exported symbols are resolved by the dynamic (runtime) linker through the process of dynamic binding, and they might collide. I’ll return to the way the runtime linker works at a different moment. For now let’s just say that exported symbols require some extra steps to be taken during the execution of a program, and that the process takes time.

Executables don’t usually need to export symbols, and most of them export none at all. Although there are rare cases where executables are required to export symbols, for instance because some of the libraries they link to use them as a “callback” from the library into the program, or, for C++ programs, for RTTI to work properly, most of the time symbols are exported just because of libtool.
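
As a sketch of the “callback” case, here is a hypothetical host program built with -rdynamic (or -Wl,--export-dynamic) so that a dlopen()ed plugin can resolve a function defined in the executable itself; the plugin name and symbols are made up:

```c
#include <stdio.h>
#include <dlfcn.h>

/* Exported from the executable (thanks to -rdynamic) so that the
 * plugin can call back into the program. */
void host_log(const char *msg) {
    fprintf(stderr, "host: %s\n", msg);
}

int main(void) {
    /* plugin.so is hypothetical; it would declare host_log as extern
     * and call it from its plugin_init(). */
    void *handle = dlopen("./plugin.so", RTLD_NOW);
    if (handle == NULL) {
        fprintf(stderr, "%s\n", dlerror());
        return 1;
    }

    void (*init)(void) = (void (*)(void))dlsym(handle, "plugin_init");
    if (init != NULL)
        init();

    dlclose(handle);
    return 0;
}
```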

By default, when you link a program, its symbols are not exported: they are hidden and thus resolved directly at build time, at least for the symbols defined in the program’s own source files. When you add code to a convenience library built with libtool, though, something changes, and the symbols defined inside that library are exported even when it is linked statically into the final executable.

This causes quite a few drawbacks and, as I said, usually serves no purpose:

  • the symbols are resolved at runtime through dynamic binding, which takes time; even though each binding usually costs very little on a normal system, the time wasted across repeated dynamic bindings might actually add up to a good deal;

  • the symbols might collide with those of a library loaded afterwards; this is, for instance, why recode breaks PHP;

  • using --gc-sections won’t help much, because exported symbols are seen as always used, and this might increase the amount of code added to the executable for no good reason;

  • prelink will likely set the wrong addresses for symbols that collide, which in turn throws away the improvement of using prelink entirely, at least for some packages.

The easy solution for this is for software packages to actually check whether the compiler supports hidden visibility and, if it does, hide all the symbols except the ones in the public API of their libraries. In the case of software like cmake, which installs no shared objects, hidden visibility could be forced by the ebuild, but to give back to the community the best thing is to get as much software as possible to use hidden visibility, thus reducing the number of symbols exported by both binaries and shared libraries.
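
A minimal sketch of what that looks like in practice, assuming GCC-style visibility support (a real project would detect it at configure time) and made-up names:

```c
/* Build with -fvisibility=hidden so every symbol defaults to hidden,
 * then mark only the public API for export. */
#define MYLIB_API __attribute__((visibility("default")))

/* Internal, non-static helper: without -fvisibility=hidden this would
 * be exported; with it, the symbol stays hidden and is resolved at
 * build time, never through dynamic binding. */
int mylib_helper(int x) {
    return x * 2;
}

/* Public: the only symbol that ends up in the dynamic symbol table. */
MYLIB_API int mylib_process(int input) {
    return mylib_helper(input) + 1;
}
```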

I hope these few notes might actually help Gentoo maintainers understand why I keep stressing this point. It would be nice if we all could improve the software we maintain, even one step at a time.

As for what concerns my linking collision scripts, the bad packages’ list got a few more entries today: KViewShell with djvulibre, Karbon with gdk-pixbuf, and gcj, with both boehm-gc and libltdl.

And now I can actually start seeing the true collisions, like the gfree symbol, which has two different definitions in libgunicode.so (fontforge) and in poppler/libkpdfpart (xpdf code), or the scan_token symbol in ghostscript, with a completely different definition in libXfont/libt1.

Talking about libXfont and libt1 (or t1lib): I wonder if there is hope in the future for one to use the other, rather than both using the same parser code for Type1 fonts. I’ll have to check the FreeDesktop bugzilla tomorrow to see if it was ever discussed. At the moment they duplicate a lot of symbols between them.

I have to say, PostgreSQL is an important speed improvement that will allow me to complete my task in a shorter time. Now I’m waiting for Patrick to run my script over the whole set of packages in Gentoo, which might actually turn up something. If only there were an easy way to make building and testing code faster (for COW reduction) without changing hardware, that would be awesome. Unfortunately, that I need to do locally :(

Reminding a weakness of Prelink

For extra content about this entry, please refer to the previous one, which talks about arrays of strings and PIC.

As I say in the other post, prelink can reduce the amount of dirty RSS pages due to COW of PIC code. Since prelink assigns every library a predefined load address in memory, which is either truly unique or unique within the scope of the set of programs known to be able to load that library, there is no COW (Copy-on-Write) at load time: the loader doesn’t have to change the addresses loaded from the file, and is thus able to share the same pages with many processes. This is how prelink saves (sometimes a lot of) memory.

Unfortunately, there is one big problem, especially with modern software architectures: many programs now use runtime-loaded plugins for their functionality; the whole KDE architecture is based on this, even in KDE 4, as are xine and many others.

The problem is that prelink can’t really take the plugins into account, as it doesn’t know about them. For instance, it can’t understand that amarok is able to load the xine engine, so amarokapp is not going to be prelinked for libxine.so. Additionally, it can’t understand that libxine.so is able to load xineplug_decode_ff.so, which in turn depends on libavcodec.so. This means that, for instance when using the -m switch, it could assign libqt3-mt.so and libavcodec.so the same address, causing a performance hit rather than an improvement at runtime, when the linker has to relocate all the code of libavcodec.so, triggering a COW.

The same is true for almost all scripting languages that use C-compiled extensions, such as Perl, Ruby, and Python: you can’t tell that the interpreter can load them just by looking at the ELF file, which is all prelink does.

A possible way around this is to define post-facto, that is after compilation, which shared objects the program can load. It could probably be done through a special .note section in the ELF file and a modified prelink, but I’m afraid it would be quite difficult to implement properly, especially in ebuilds. On the other hand, it might give quite some performance improvement; as I said, today’s software architectures are often based on on-demand loading of code through plugins, so it could be quite interesting.
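
Purely as an illustration of the idea, embedding such a list could look like the sketch below; the section name and payload format are invented, and a real implementation would need a proper SHT_NOTE section (typically emitted through the assembler) plus, of course, matching support in prelink:

```c
/* Speculative sketch: embed a NUL-separated list of shared objects
 * that this binary is known to load at runtime, for a modified
 * prelink to read. Everything here is invented for illustration. */
__attribute__((section(".note.plugins"), used))
static const char known_plugins[] =
    "libxine.so\0xineplug_decode_ff.so";
```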