The subtlety of modern CPUs, or the search for the phantom bug

Yesterday I released a new version of unpaper, which is now in Portage, even though its dependencies are not exactly straightforward after making it use libav. But when I packaged it, I realized that the tests were failing — even though I had been careful to run the tests all the time while making changes, to make sure not to break the algorithms which (as you may remember) I did not design or write — I don’t really have enough math to figure out what’s going on with them. I was able to simplify a few things, but I needed Luca’s help for the most part.

It turned out that the problem only happened when building with -O2 -march=native, so I decided to restrict the tests and look into it again in the morning. Indeed, on Excelsior, using -march=native would cause them to fail, but on my laptop (where I had been running the tests after every single commit), they would not. Why? Furthermore, Luca was also reporting test failures on his laptop with OS X and clang, but I had not tested there to begin with.

A quick inspection of one of the failing tests’ outputs with vbindiff showed that the diffs were quite minimal: one bit off at some non-obvious interval. It smelled like a very small change. After complaining on G+, Måns pushed me in the right direction: some instruction set that differs between the two machines.

My laptop uses the core-avx-i arch, while the server uses bdver1. They have different levels of SSE4 support – AMD having their own SSE4a implementation – and different extensions. I should probably have paid more attention here and noticed that Bulldozer has FMA4 instructions, but I did not; it’ll prove important later.

I decided to start disabling extensions in alphabetical order, mostly expecting the problem to be in AMD’s implementation of some instructions, pending some microcode update. When I disabled AVX, the problem went away — AVX is essentially a new encoding of instructions, so enabling AVX causes all the instructions otherwise present in SSE to be re-encoded, and it is a prerequisite for FMA4 instructions to be usable.

The next problem was reducing the code enough to figure out whether the issue was a bug in the code, in the compiler, in the CPU, or just in my assumptions. Given that unpaper is over five thousand lines of code and comments, I needed to reduce it a lot. Luckily, there are ways around it.

The first step is to find out in which part of the code the problem appears. Luckily unpaper is designed as a bunch of functions that run one after the other. I started disabling filters and masks, and I was able to narrow the problem down to the deskewing code — which is where most of the problems have happened before.

But even the deskewing code is a lot — and it depends on at least some part of the general processing having run, including loading the file and converting it to an AVFrame structure. I decided to try to reduce the code to a standalone unit calling into the full deskewing code. But when I copied it over and looked at how much code was involved, between the skew detection and the actual rotation, it was still a lot. I decided to start looking with gdb to figure out which of the two halves was misbehaving.

The interface between the two halves is well-defined: the first returns the detected skew, and the second takes the rotation to apply (the negative of what the first returned) and the image to apply it to. It’s easy. A quick look through gdb at the call to rotate() in both a working and a failing setup told me that the returned value from the first half matched perfectly, which was great because it meant that the surface to inspect was heavily reduced.

Since I did not want to have to carry all the code to load the file from disk and decode it into a raw representation, I looked into the gdb manual and found the dump command, which allows you to dump part of the process’s memory into a file. I dumped the AVFrame::data content and decided to use that as the input. At first I decided to just compile it into the binary (you only need to use xxd -i to generate C code that declares the whole binary file as a byte array), but it turns out that GCC is not designed to efficiently compile a 17MB binary blob passed in as a byte array. I then opted for just opening the raw binary file and fread()ing it into the AVFrame object.
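
For reference, this is roughly what that looks like; the gdb command and the loader below are a reconstruction with made-up names and sizes (and they assume a libav version where AVFrame is declared in libavutil/frame.h), not the actual harness:

/*
 * Rough sketch of the approach described above. The raw dump itself comes
 * from gdb, along the lines of:
 *
 *   (gdb) dump binary memory frame.raw source->data[0] source->data[0] + size
 *
 * and is then read back into a pre-allocated AVFrame at run time.
 */
#include <stdio.h>
#include <libavutil/frame.h>

static int load_raw_plane(AVFrame *frame, const char *path, size_t size) {
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    /* read the dumped bytes straight into the frame's first data plane */
    size_t got = fread(frame->data[0], 1, size, fp);
    fclose(fp);

    return got == size ? 0 : -1;
}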

My original plan involved using creduce to find the minimal set of code needed to trigger the problem, but it was tricky, especially when trying to match a complete file output against its MD5. I decided to proceed with the reduction manually, starting with all the conditionals for pixel formats that were not exercised… and then I realized that I could split the code in two once again. Indeed, while the main interface is only rotate(), there were two logical parts of the code in use: one translating the coordinates before and after the rotation, and the interpolation code that reads the old pixels and writes the new ones. This latter part also depended on all the code that sets a pixel in place starting from its components.

By logging the calls to the interpolation function, I was able to restrict the issue to the coordinate translation code rather than the interpolation one, which made things much better: the reduced test case went down to a handful of lines:

void rotate(const float radians, AVFrame *source, AVFrame *target) {
    const int w = source->width;
    const int h = source->height;

    // create 2D rotation matrix
    const float sinval = sinf(radians);
    const float cosval = cosf(radians);
    const float midX = w / 2.0f;
    const float midY = h / 2.0f;

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            const float srcX = midX + (x - midX) * cosval + (y - midY) * sinval;
            const float srcY = midY + (y - midY) * cosval - (x - midX) * sinval;
            externalCall(srcX, srcY);
        }
    }
}

Here externalCall is a simple function to extrapolate the values; the only thing it does is print them on the standard error stream. In this version there are still references to the input and output AVFrame objects, but as you can see there is no usage of them, which means that the test case is now self-contained and does not require any input or output file.

Much better, but still too much code to go through. The inner loop over x was simple to remove: just hardwire it to zero and the compiler would still reproduce the problem; but if I hardwired y to zero as well, the compiler would trigger constant propagation and just pre-calculate the right value, whether or not AVX was in use.

At this point I was able to run creduce; I only needed to check that the first line of the output matched the “incorrect” version, and no input was required (the radians value was fixed). Unfortunately it turns out that using creduce with loops is not a great idea, because it is entirely possible for it to reduce away the y++ statement or the y < h exit comparison, and then you’re in trouble. Indeed it got stuck in infinite loops on my code multiple times.

But it did help a little bit to simplify the calculation. And, again with a lot of help from Måns on making sure that the sinf()/cosf() functions would not return different values – they don’t; they are actually collapsed by the compiler into a single call to sincosf(), so you don’t have to write ugly code to leverage it! – I brought the code down to

extern void externCall(float);
extern float sinrotation();
extern float cosrotation();

static const float midX = 850.5f;
static const float midY = 1753.5f;

void main() {
    const float srcX = midX * cosrotation() - midY * sinrotation();
    externCall(srcX);
}

No external libraries, not even libm. The external functions are in a separate source file, and besides providing fixed values for sine and cosine, the externCall() function only calls printf() with the provided value. Oh, if you’re curious, the radians parameter became 0.6f because 0, 1 and 0.5 would not trigger the behaviour, but 0.6, which is the truncated version of the actual parameter coming from the test file, would.
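
For completeness, the companion file looks more or less like this; it is a reconstruction, and the exact sine/cosine constants are assumptions derived from the 0.6f parameter above:

/* Reconstruction of the companion source file: fixed sine/cosine values
 * (assumed here to be roughly sinf(0.6f) and cosf(0.6f), hard-coded so no
 * libm call is needed) and an externCall() that just prints whatever it
 * receives. */
#include <stdio.h>

float sinrotation(void) { return 0.564642f; }
float cosrotation(void) { return 0.825336f; }

void externCall(float value) {
    printf("%f\n", value);
}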

Checking the generated assembly code for the function then pointed out the problem, at least to Måns, who actually knows Intel assembly. Here follows a diff of the code above, built with -march=bdver1 and with -march=bdver1 -mno-fma4 — because it turns out the instruction causing the problem is not an AVX one but an FMA4 one; more on that after the diff.

        movq    -8(%rbp), %rax
        xorq    %fs:40, %rax
        jne     .L6
-       vmovss  -20(%rbp), %xmm2
-       vmulss  .LC1(%rip), %xmm0, %xmm0
-       vmulss  .LC0(%rip), %xmm2, %xmm1
+       vmulss  .LC1(%rip), %xmm0, %xmm0
+       vmovss  -20(%rbp), %xmm1
+       vfmsubss        %xmm0, .LC0(%rip), %xmm1, %xmm0
        .cfi_def_cfa 7, 8
-       vsubss  %xmm0, %xmm1, %xmm0
        jmp     externCall@PLT

It’s interesting that it changes the order of the instructions as well as the constants — for this diff I have manually swapped .LC0 and .LC1 on one side, as they would otherwise just end up with different names due to the instruction ordering.

As you can see, the FMA4 version has one instruction less: vfmsubss replaces both one of the vmulss and the one vsubss instruction. vfmsubss is a FMA4 instruction that performs a Fused Multiply and Subtract operation — midX * cosrotation() - midY * sinrotation() indeed has a multiply and subtract!

Originally, since I was disabling the whole AVX instruction set, all the vmulss instructions would end up replaced by mulss which is the SSE version of the same instruction. But when I realized that the missing correspondence was vfmsubss and I googled for it, it was obvious that FMA4 was the culprit, not the whole AVX.

Great, but how does that explain the failure on Luca’s laptop? He’s not so crazy as to use an AMD laptop — nobody would be! Well, it turns out that Intel also has its own fused multiply-add instruction set, just with three operands rather than four, starting from Haswell CPUs, which include… Luca’s laptop. A quick check on my NUC, which also has a Haswell CPU, confirms that the problem exists for the core-avx2 architecture as well, even though the code diff is slightly less obvious:

        movq    -24(%rbp), %rax
        xorq    %fs:40, %rax
        jne     .L6
-       vmulss  .LC1(%rip), %xmm0, %xmm0
-       vmovd   %ebx, %xmm2
-       vmulss  .LC0(%rip), %xmm2, %xmm1
+       vmulss  .LC1(%rip), %xmm0, %xmm0
+       vmovd   %ebx, %xmm1
+       vfmsub132ss     .LC0(%rip), %xmm0, %xmm1
        addq    $24, %rsp
+       vmovaps %xmm1, %xmm0
        popq    %rbx
-       vsubss  %xmm0, %xmm1, %xmm0
        popq    %rbp
        .cfi_def_cfa 7, 8

Once again I swapped .LC0 and .LC1 afterwards for consistency.

The main difference here is that the instruction for fused multiply-subtract is vfmsub132ss and a vmovaps is involved as well. If I read the documentation correctly this is because it stores the result in %xmm1 but needs to move it to %xmm0 to pass it to the external function. I’m not enough of an expert to tell whether gcc is doing extra work here.

So why is this instruction causing problems? Well, Måns knew and pointed out that the result is now more precise, so I should not work around it. Wikipedia, as linked before, also points out why this happens:

A fused multiply–add is a floating-point multiply–add operation performed in one step, with a single rounding. That is, where an unfused multiply–add would compute the product b×c, round it to N significant bits, add the result to a, and round back to N significant bits, a fused multiply–add would compute the entire sum a+b×c to its full precision before rounding the final result down to N significant bits.
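
To see the effect concretely, here is a small standalone illustration of mine (not unpaper code); the value of a is chosen only to make the difference visible in single precision:

/*
 * The unfused version rounds the product a*a before subtracting, losing its
 * lowest bit; fmaf() keeps the full-precision product and rounds only once,
 * so the two results differ in the last bit. Build without FMA code
 * generation (or with -ffp-contract=off) so the first expression really
 * performs two roundings, and link with -lm for fmaf().
 */
#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    const float a = 1.0f + 2.0f * FLT_EPSILON;

    float unfused = a * a - 1.0f;      /* multiply, round, subtract, round */
    float fused   = fmaf(a, a, -1.0f); /* multiply and subtract, one rounding */

    printf("unfused: %.10e\nfused:   %.10e\n", unfused, fused);
    return 0;
}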

Unfortunately this does mean that we can’t have bit-exactness of images between CPUs that implement fused operations and those that don’t. Which means my current test harness is not good, as it compares the MD5 of the output with the golden output from the original test. My probable next move is to use cmp to count how many bytes differ from the “golden” output (the version built without optimisations), and if the number is low, say less than 1‰, accept it as valid. It’s probably not ideal and could lead to further variation in output, but it might be a good start.
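
Something along these lines; the helper below is only a sketch of the idea (hypothetical, not part of unpaper’s test suite), and it assumes the two files have the same size, which holds for the raw image outputs:

/* Count the bytes that differ between the produced output and the golden
 * file, so the harness can accept a small tolerance instead of requiring
 * an exact MD5 match. */
#include <stdio.h>

static long count_differing_bytes(const char *a_path, const char *b_path) {
    FILE *a = fopen(a_path, "rb");
    FILE *b = fopen(b_path, "rb");
    long differing = -1;

    if (a != NULL && b != NULL) {
        int ca, cb;
        differing = 0;
        while ((ca = fgetc(a)) != EOF && (cb = fgetc(b)) != EOF)
            if (ca != cb)
                differing++;
    }

    if (a != NULL) fclose(a);
    if (b != NULL) fclose(b);
    return differing;
}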

Ideally, as I said a long time ago, I would like to use a tool like pdiff to tell whether there are actual changes in the pixels, and identify things like a 1-pixel translation in any direction, which would be harmless… but until I can figure something out, it’ll be an imperfect testsuite anyway.

A huge thanks to Måns for the immense help, without him I wouldn’t have figured it out so quickly.

More threads is not always a good solution

You might remember that some time ago I took over unpaper, as I used it (and sort of still use it) to pre-filter the documents I scan to archive in electronic form. While I’ve been very busy in the past few months, I’m now trying to clear out my backlog, and it turns out that a good deal of it involves unpaper. There are bugs to fix and features to implement, but also patches to evaluate.

One of the most recent patches I received is designed to help with the performance of the tool, which is definitely not what Jens, the original author, had in mind when he came up with the code. Better performance is something that just about everybody would love to have at this point. Unfortunately, it turns out that the patch is not working out as intended.

The patch is two-fold: on one side, it makes (optional) use of OpenMP to parallelize the code in the hope of speeding it up. Given that most of the code is a bunch of loops, it seemed obvious that using multithreading would speed things up, right? Well, the first try after applying it showed very easily that it slows things down, at least on Excelsior, which is a 32-core system. While the first test would take less than 10 seconds to run without OpenMP, it would take over a minute with it, spinning up all the cores for 3000% CPU usage.

A quick test shows that forcing the number of threads to 8, rather than leaving it unbound, makes it actually faster than the non-OpenMP variant. This means that there are a few different variables in play that need to be tuned for performance to improve. Without going into profiling the code, I can figure out a few things that can go wrong with unchecked multithreading:

  • extensive locking while the worker threads are running, either because they are all accessing the same memory page, or because the loop needs a “reduced” result (e.g. a value has to be calculated as the sum of values calculated within a parallelized loop); in the case of unpaper I’m sure both situations happen fairly often in the looping codepaths (see the sketch after this list);
  • cache thrashing: as the worker threads jump around the memory area to process, the access pattern is no longer a linear walk over memory;
  • something entirely more complicated.
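
To make the first point concrete, here is a minimal sketch (not actual unpaper code) of a parallelized loop with a reduced result and a capped thread count, the combination that turned out to beat the serial version above:

/* Sum of a pixel plane: the reduction(+:sum) clause is the "reduced result"
 * case from the list above, and num_threads(8) caps the worker count instead
 * of letting OpenMP spin up all 32 cores. Build with -fopenmp; function and
 * parameter names are made up for illustration. */
#include <stdint.h>

uint64_t plane_sum(const uint8_t *pixels, long count) {
    uint64_t sum = 0;

    #pragma omp parallel for reduction(+:sum) num_threads(8)
    for (long i = 0; i < count; i++)
        sum += pixels[i];

    return sum;
}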

Besides making it obvious that doing the “stupid” thing and just making all the major loops parallel is not going to cut it, this situation is bringing me to the point where I will finally make use of the Using OpenMP book that I got a few years back and only started reading after figuring out that OpenMP was not ready yet for prime time. Nowadays OpenMP support in Linux has improved to the point where it’s probably worth taking another look at it, and I guess unpaper is going to be the test case for it. You can expect a series of blog posts on the topic at this point.

The first thing I noticed while reading about the way OpenMP handles shared and private variables is that the const indication is much stronger when using OpenMP. The reason is that if you tell the compiler that a given datum is not going to change (it’s a constant, not a variable), it can easily assume that direct access from all the threads will work properly; the variable is shared by default among all of them. This is something that, for non-OpenMP programs, is usually inferred from the SSA form — I assume that, for whatever reason, OpenMP makes SSA weaker.

Unfortunately, this also means that there is one nasty change that might be required to make code friendlier to OpenMP, and that is a change in the prototypes of functions. The parameters to a function are, as far as OpenMP is concerned, variables, and that means that unless you declare them const, it won’t be sharing them by default. Within a self-contained program like unpaper, changing the signatures of the functions so that parameters are declared, for instance, const int is painless — but for a library it would be API breakage.
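
This is a minimal sketch of what that signature change looks like in practice, on a hypothetical function rather than unpaper’s actual API:

/* Before: void scale_rows(float *rows, int width, int height, float factor);
 * After: the const qualifiers tell both the compiler and OpenMP that the
 * values never change inside the parallel region, so they can safely be
 * treated as shared by every thread. Build with -fopenmp. */
void scale_rows(float *rows, const int width, const int height,
                const float factor) {
    #pragma omp parallel for
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            rows[y * width + x] *= factor;
}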

Anyway, just remember: adding multithreading is not the silver bullet you might think it is!

P.S.: I’m currently trying to gauge interest in a volume that would collect, re-edit, and organize all I have written up to now about ELF. I’ve created a page on Leanpub where you can note down your interest, and how much you think such a volume would be worth. If you would like to read such a book, just add your interest through the form; as soon as at least a dozen people are interested, I’ll spend all my free time working on the editing.

It’s that time of the year again…

Which time of the year? The time when Google announces the new Summer of Code!

Okay, so you know I’m not always very positive about the outcome of Summer of Code work, even though I’m extremely grateful to Constanze (and Mike, who got it in tree now!) for the work on filesystem-based capabilities — I’m pretty sure at this point that it has also been instrumental for the Hardened team to have their xattr-based PaX marking (I’m tempted to reconsider Hardened for my laptops now that Skype is no longer a no-go, by the way). Other projects (many of which centred around continuous integration, with no results) ended up in much worse shape.

But since being always overly negative is not a good way to proceed in life, I’m going to propose a few possible things that could be useful to have, both for Gentoo Linux and for libav/VLC (whichever is going to be part of GSoC this year). Hopefully, if something comes out of them, it’s going to be good.

First of all, a re-iteration of something I’ve been asking of Gentoo for a while: a real alternatives-like system. Debian has a very well implemented tool for selecting among alternative packages supporting multiple tools. In Gentoo we have eselect — and a bunch of modules. My laptop counts 10 different eselect packages installed, and for most of them, the overhead of having another package installed is bigger than the eselect module itself! This also does not really work that well, as for instance you cannot choose the tar command, and pkg-config vs pkgconf require you to make a single selection by installing one or the other (or disabling the flag from pkgconf, but that defeats the point, doesn’t it?).

Speaking of eselect and similar tools, we still have gcc-config and binutils-config as their own tools, without using the framework that we use for just about everything else. Okay, the last guy who tried bit off more than he could chew, and the result was abysmal, but the reason is likely that the target was set too high: re-doing the whole compiler handling so that it could support non-GCC compilers. On its own this might actually be too small a project for GSoC, but it might work as a qualification task, similar to the ones we’ve had for libav in the past.

Moving on to libav, one thing that I was discussing with Luca, J-B and other VLC developers was the possibility of integrating at least part of the DVD handling that is currently split between libdvdread and libdvdnav into libav itself. VLC already forked the two libraries (and I rewrote the build system) — Luca and I were already looking into merging them back into a single libdvd library… but a rewrite, especially one that can reuse code from libav, or introduce new code that can be shared, would probably be a good thing. I haven’t looked into it, but I wouldn’t be surprised if libdvbpsi could follow the same treatment.

Finally, another project that could sound cool for libav would be to create a library, API- and ABI-compatible with xine, that only uses libav. I’m pretty sure that if most of the internals of xine are dropped (including the configuration file and the plugin system), it would be possible to have a shallow wrapper around libav instead of a full-blown project. It might lose support for some inputs, such as modules and DVDs, but it would probably be a nice proof of concept and would show what we still need… and the moment we can deal with those formats straight in libav, we’ll know we have something better than simply libav.

On a similar note, one of the last things I worked on in xine was the “audio out conversion branch” — see for instance this very old post — which is no more and no less than what is now libavresample, just done much worse. Indeed, libavresample actually has the SIMD-optimized routines I never figured out how to write, which makes it much nicer. Since xine, at the moment I left it, was already quite nicely using libavutil, it might be interesting to see what happens if all the audio conversion code is killed and replaced with libavresample.

So these are my suggestions for this season of GSoC, at least for the projects I’m involved in… maybe I’ll even have time to mentor them this year, as with a bit of luck I’ll have stable employment when the time comes for this to happen (more on this to follow, but not yet).

Gentoo Linux Health Report — October 2012 Edition

I guess it’s time for a new post on what the status of Gentoo Linux is right now. First of all, the tinderbox is munching away as I write. Things are going mostly smoothly, but there are still hiccups due to some developers not accepting its bug reports because of the way logs are linked (as in, not attached).

Like last time that I wrote about it, four months ago, this is targeting GCC 4.7, GLIBC 2.16 (which is coming out of masking next week!) and GnuTLS 3. Unfortunately, there are a few (biggish) problems with this situation, mostly related to the Boost problem I noted back in July.

What happens is this:

  • you can’t use any version of boost older than 1.48 with GCC 4.7 or later;
  • you can’t use any version of boost older than 1.50 with GLIBC 2.16;
  • many packages don’t build properly with boost 1.50 and later;
  • a handful of packages require boost 1.46;
  • boost 1.50-r2 and later (in Gentoo) no longer support eselect boost, making most of the packages using Boost fail to build at all.

This kind of screwup is a major setback, especially since Mike (understandably) won’t wait any more to unmask GLIBC 2.16 (he waited a month, the Boost maintainers had all the time to fix their act, which they didn’t — it’s now time somebody with common sense takes over). So the plan right now is for me and Tomáš to pick up the can of worms, and un-slot Boost, quite soon. This is going to solve enough problems that we’ll all be very happy about it, as most of the automated checks for Boost will then work out of the box. It’s also going to reduce the disk space used by your install, although it might require you to rebuild some C++ packages; I’m sorry about that.

For what concerns GnuTLS, version 3.1.3 is going to hit unstable users at the same time as glibc-2.16, and hopefully the same will be true for stable when that happens. Unfortunately there are still a number of packages not yet fixed to work with GnuTLS 3, so if you see a package you use (with GnuTLS) in the tracker, it’s time to jump on fixing it!

Speaking of GnuTLS, we’ve also had a smallish screwup this morning when libtasn1 version 3 also hit the tree unmasked — it wasn’t supposed to happen, and it’s now masked, as only GnuTLS 3 builds fine with it. Since upstream really doesn’t care about GnuTLS 2 at this point, I’m not interested in trying to get that to work nicely, and since I don’t see any urgency in pushing libtasn1 v3 as is, I’ll keep it masked until GNOME 3.6 (as gnome-keyring also does not build with that version, yet).

Markos has correctly noted that the QA team – i.e., me – is not maintaining the DevManual anymore. We have now made it a separate project, under QA (though I’d rather say it’s shared between QA and Recruiters), and the Git repository is now writable by any developer. Of course, if you play around with it on master without knowing what you’re doing, you’ll be terminated.

There’s also the need to convert the DevManual to something that makes sense. Right now it’s a bunch of files all called text.xml, which makes editing a nightmare. I did start working on that two years ago, but it’s tedious work and I don’t want to do it in my free time. I’d rather not have to do it while being paid for it either, really. If somebody feels like they can handle the conversion, I’d actually consider paying somebody to do that job. How much? I’d say around $50. The desirable format is something that doesn’t make a person feel like taking their eyes out when trying to edit it with Emacs (and vim, if you feel generous): my branch used DocBook 5, which I rather fancy, as I’ve used it for Autotools Mythbuster, but RST or Sphinx would probably be okay as well, as long as no formatting is lost along the way. Update: Ben points out he already volunteered to convert it to RST, I’ll wait for that before saying anything more.

Also, we’re looking for a new maintainer for ICU (and I’m pressing Davide to take the spot), as things like the bump to version 50 should have been handled more carefully. Especially now that it appears to basically break a quarter of its reverse dependencies when using GCC 4.7 — both the API and the ABI of the library change entirely depending on whether you’re using GCC 4.6 or 4.7, as it leverages C++11 support in the latter. I’m afraid this is just going to be the first of a series of libraries making these kinds of changes, and we’re all going to suffer through it.

I guess this is all for now.

GNU software (or, gnulib considered harmful and other stories)

The tinderbox right now seems to be having fun trying to play catch-up with changes in GNU software: GCC, Automake, glibc and of course GnuTLS. Okay, it’s true that compatibility problems are not exclusive to GNU, but there are some interesting problems with this whole game, especially for what concerns inter-dependencies.

So, there is a new C library in town which, if we ignore the whole x32 dilemma, has support for ISO C11 as its major new feature. And what is the major change in this release? The gets() function has finally been removed. This is good, as it was a very nasty thing to use, and nobody in their right mind would use it…

tbamd64 ~ # scanelf -qs -gets -R /bin /opt /sbin /lib /usr
gets  /opt/flexlm/bin/lmutil

Okay, never mind those who actually use it; the rest of the software shouldn’t be involved, should it? You wish. What happens is that gnulib no longer only carries replacements for GNU extensions, but also includes code that is not present in glibc itself, and extra warnings about the use of deprecated features; it now comes with its own re-declaration of gets() to emit a prominent warning if it’s used. And of course, that re-declaration breaks badly against the new glibc.
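
The pattern boils down to something like this (a simplified sketch, not gnulib’s literal macro expansion):

/* The generated stdio.h wrapper re-declares gets() through __typeof__ in
 * order to attach a warning to any use of it. As long as the C library
 * still declares gets() this works; once glibc 2.16 drops the declaration
 * entirely, the __typeof__ has nothing to refer to and the wrapper itself
 * fails to compile. */
#include <stdio.h>

extern __typeof__ (gets) gets
    __attribute__ ((__warning__ ("the gets function is a security hole; use fgets instead")));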

Obviously, this has been fixed in gnulib already, since it was planned for gets() to be removed, but it takes quite a bit of time for a fix in gnulib to trickle down to the packages using it, which is one of my previous main complaints about it. Which means that Gentoo will have to patch the same code over and over again in almost all GNU software, since almost all of it uses gnulib.

Luckily for me, only two packages in herds I’m part of have been hit (by this problem, at least): GnuTLS, and libtasn1 which is a dependency of it. The former is fixed in version 3, which is also masked but which I’m testing as well, while the latter is fixed in the current ~arch (I really don’t care about 2.16 in stable yet!), so there is nothing to patch there. The fact that GCC 4.6 itself fails to build with this version of glibc is obviously a different problem altogether, and so is the fact that we need Boost 1.50 for a number of packages to work with the new glibc/gcc combination, as 1.49 is broken with the new C library and 1.48 is broken with the new compiler.

Now to move on to something different: Automake 1.12 was released a couple of months ago and is now in ~arch, causing trouble although not as bad as it could have been. Interestingly enough, one of the things that they did change in this version was removing $(mkdir_p) as I wrote in my guide — but that seems to have been a mistake.

What should have been removed in 1.12 was the default use of AM_PROG_MKDIR_P, while the mkdir_p define should have been kept around until version 1.13. Stefano said he’s going to revert that change in automake 1.12.2, but I guess it’s better if we deal with it right away instead of waiting for 1.13 to hit us….

Of course there is a different problem with the new automake as well: GNU gettext hasn’t been updated to support the new automake versions, so using it causes deprecation warnings with 1.12 (and will fail with 1.13 if it’s not updated). And of course a number of projects are now using -Werror on automake, as if it wasn’t enough trouble to use it on the source code itself.

And of course the problem with Gentoo, according to somebody, is the fact that my tinderbox bugs (those dozen a day) are filed with linked build logs instead of attached ones. Not the fact that the same somebody commits new versions of critical packages without actually doing anything to test them.

Gentoo Linux Health Report

Maybe I should start doing this monthly, with the tinderbox’s results at hand.

So as I said, the Tinderbox is logging away although it’s still coughing from time to time. What I want to write about, here, is some of the insights the current logs tell me.

First of all, as you might expect considering my previous efforts, I’m testing GCC 4.7. It’s the least I can do, of course, and it’s definitely important to proceed with this if we want to have it unmasked in a decent timeframe, instead of the 4.6-like æons (which were caused in part by the lack of continuous testing). The new GCC version by itself doesn’t seem to be too much of a break-through anyway; there is the usual set of new warnings that cause packages insisting on using -Werror to fail; there is some more header cleanup that causes software using C and POSIX interfaces in C++ to fail because it doesn’t include the system headers declaring the functions it uses; there is also a change in the C++ specification that requires this-> to be prefixed on a few accesses to inherited attributes/methods, though I’m not sure of the exact rule.

The most disruptive issue with the new GCC, though, seems to be another round of stricter rejection of invalid command-line parameters passed to the compiler. So for instance, the -mimpure-text option that is supported on SunOS is no longer usable on Linux — guess which kind of package fails badly due to that? Java native libraries, of course. But it’s not just them; one or two packages failed due to the use of -mno-cygwin, which is also gone from the Linux version of GCC.

So while this goes on, what other tests are being put in place? Well, I’ve been testing GnuTLS 3, simply because this version no longer bundles an antique, unmaintained configuration file parsing library. To be honest, though, I should be working a bit more on GnuTLS, as there is a dependency on readline I’d like to remove… for now I only fixed parallel builds, which means you can now build the ebuild at full speed.

Oh, and how can I forget that I’m testing the automake 1.12 transition, which I’m also trying to keep documented in Autotools Mythbuster — although I’m still missing the “see also” references, I hope I’ll be able to work on them in the next few days. Who knows, maybe I’ll be able to work on them on the plane (given that next time, no matter what, I’m going to get the “premium economy” extra — tired of children screaming; hopefully there are few families ready to pay for the extra).

The most interesting part of this transition is that we get failures for things that really don’t matter for us at all, such as the dist-lzma removal. Unfortunately we still have to deal with those. The other thing is that there are more packages than I expected that relied on what is called “auto-de-ANSI-fication” (conversion from ANSI C to more modern C), which is also gone on this automake version. Please remember that if you don’t want to spend too much time on fixing this you can still restrict WANT_AUTOMAKE=1.11 in your ebuild for the moment — but at some later point you’ll have to get upstream to support the new version.

I still have some trouble with PHP extensions not installing properly. And I have no idea why.

Now, let’s leave aside the number of complaints I could have about developers who commit outright broken packages without taking care of them, or about three (or four) year old bugs still open and still present… I’d like for once to thank those developers who are doing an excellent job by responding to new bugs in a timely fashion.

So, thank you Justin (jlec), Michael (kensington), Alfredo Tupone and Hans (graaff) among others. These are just the guys who received most of the bugs… and fixed them quickly enough for me to notice — “Hey did I just file a bug about that? Ah it’s fixed already.”

Are -g options really safe?

Tonight feels like a night after a very long day. But it was just half a day spent on trying to find the end of a bug saga that started about a month ago for me.

It starts like this: postgresql-server started failing; the link editor – BFD-based ld – reported that one of the static archives installed by postgresql-base didn’t have a proper index, which should have been generated by ranlib. But simply executing ranlib on said file didn’t solve the problem.

I originally blamed the build system of PostgreSQL, but when yesterday I launched an emerge -e world to rebuild everything with GCC 4.6, another package failed in the same way: lvm2, linking to /usr/lib64/libudev.a — since I know the udev build system very well, almost like I wrote it myself, I trusted that the archive was built correctly, so it was time to look at what the real problem was.

After poking around a bit, I found that binutils’s nm, objdump and at that point even ld refused to display information for some relocated objects (ET_REL files). This would have made it very difficult to debug the issue if not for two things: first, eu-nm could see the file just fine, and second, my own home-cooked nm.rb tool that I wrote to test Ruby-Elf reported issues with the file — but without exploding.

flame@yamato mytmpfs % nm dlopen.o
nm: dlopen.o: Bad value
flame@yamato mytmpfs % eu-nm -B dlopen.o 
0000000000000000 n .LC0
                 U _GLOBAL_OFFSET_TABLE_
                 U __stack_chk_fail
                 U dlclose
                 U dlerror
                 U dlopen
00000000000001a0 T dlopen_LTX_get_vtable
                 U dlsym
                 U lt__error_string
                 U lt__set_last_error
                 U lt__zalloc
0000000000000000 t vl_exit
00000000000000a0 t vm_close
0000000000000100 t vm_open
0000000000000040 t vm_sym
0000000000000000 d vtable
0000000000000000 n wt.1af52e75450527ed
0000000000000000 n wt.2e36542536402b38
0000000000000000 n wt.32ec40f73319dfa8
0000000000000000 n wt.442ae951f162d46e
0000000000000000 n wt.90e079bbb773abcb
0000000000000000 n wt.ac43b6ac10ce5688

I don’t have the original output from my tool, as I have since fixed it, but the issues were related, as you can guess from that output, to the various wt. symbols at the end of the list. Where do they come from? What does the ‘n’ code they are marked with mean? And why is BFD failing to deal with them? I set out to find those answers with, well, more than a hunch of what the problem would turn out to be.

So what are those symbols? Google doesn’t help at all here, since searching for “wt”, even enclosed in double quotes, turns up only results for “weight”. Yes, I know it is a way to shorten that word, but what the heck, I’m looking for a specific string! The answer, actually, is simple: they are additional debug symbols added by -gdwarf-4, which is used to emit the latest DWARF format revision. This was implemented in GCC 4.6 and is supposed to reduce the size of the debug information, which is generally positive, while including more of it.

Turns out that libbfd (the library that implements all the low-level access for nm, ld and the other utilities) doesn’t like those symbols; I’m not sure if it’s the sections they are defined in, their type (which is set to STT_NONE), or something else, but it doesn’t like them at all. Interestingly enough, this does not happen with final executables and dynamic libraries, which makes it at least bearable: fewer than 40 packages had to be rebuilt on my system because they had broken static objects; unfortunately one of those was LibreOffice, d’oh!

Now, let’s look back at the nm issue though: when I started writing Ruby-Elf, I decided not to reimplement the whole suite of ELF tools, since there are already quite a few implementations of those out there. But I did write a nm tool to debug my own issues — it also worked quite nicely, because implementing access to the codes used by nm allowed me to use the same output in my elfgrep tool to show the results. This implementation, that was actually never ported to the tools framework I wrote for Ruby-Elf, didn’t get installed, and it was just part of the repository for my own good.

But after noticing that my implementation was more resilient than binutils’s version, and that it produced more interesting output than elfutils’s, I decided to rework it and make it available as rbelf-nm, writing a man page and listing the codes for the various symbol kinds. But before all this, I also rewrote the code-selection function. Before, it relied on binding types and then on section names to produce the right code; now it relies on the symbol type, binding types, and the sections’ flags and type, making it as resilient as elfutils’s and as informative as binutils’s, at least for what I have encountered so far.

I also released a new version (, don’t ask!) that includes the new tool, and it is already on RubyGems and Portage if you wish to use it. Please remember that the project has a Flattr page so if you like the project, your support is definitely welcome.

Compilers’ rant

Be warned that this post is in the form of a rant, because I’ve spent the past twelve hours fighting with multiple compilers, trying to make sense of them while trying to get the best out of my unpaper fork thanks to their different analyses.

Let’s start with a few more notes about the Pathscale compiler, which I already slightly ranted about for my work on Ruby-Elf. I know I didn’t do the right thing when I posted that stuff, as I should have reported the issues upstream directly, but I didn’t have much time, I was already swamped with other tasks, and going through a very bad personal moment, so I quickly wrote up my feelings without doing proper reports. I have to thank the Pathscale people for accepting the critiques anyway, as Måns reported to me that a couple of the issues I noted, in particular the use of --as-needed and the __PIC__ definition, were taken care of (sort of; see in a moment).

First problem with the Pathscale compiler: by mistake I had been using the C++ compiler to compile C code; rather than screaming at me, it went through properly, with one little difference: a static constant gets mis-emitted, and this is not a minor issue at all, even though I was using the wrong compiler! Instead of having the right content, the constant is emitted as an empty, zeroed-out array of characters of the right size. I only noticed because Ruby-Elf’s cowstats reported what should have been a constant as sitting in the .bss section. This is probably the most worrisome bug I have seen with Pathscale yet!

Of course its impact is theoretically limited by the fact that I was using the wrong compiler, but since the code should be written in a way that is valid both as C and as C++, I’m afraid the same bug might exist for some properly-written C++ code… I hope it gets fixed soon.

The killer feature of Pathscale’s compiler is supposedly optimisation, though, and there it looks like it is doing quite a nice job; indeed I can see from the emitted assembler that it is finding more semantics in the code than GCC seems to, even though it requires -O3 -march=barcelona to make something useful out of it — and in that case you give up debugging information, as the debug sections may reference symbols that were dropped, and the linker will be unable to produce a final executable. This is hit and miss of course, as it depends on whether the optimiser drops those symbols, but it makes it difficult to use -ggdb at all in these cases.

Speaking about optimisations, as I said in my other post, GCC’s missed optimisation is still missed by Pathscale even with full optimisation (-O3) turned on, and with the latest sources. The wrong placement of static constants that I ranted about in that post is also still not fixed.

Finally, for what concerns the __PIC__ definition that Måns referred to as being fixed, well, it isn’t really as fixed as one would expect. Yes, using -fPIC now implies defining __PIC__ and __pic__ as GCC does, but there are two more issues:

  • While this does not apply to x86 and amd64 (but only to m68k, PowerPC and SPARC), GCC supports two modes for the emission of position-independent code: one that is limited by the architecture’s maximum global offset table size, and one that overrides that maximum (I never investigated how it does that, probably through some indirect tables). The two options are enabled through -fpic (or -fpie) and -fPIC (-fPIE) and define the macros as 1 and 2, respectively; Path64 only ever defines them to 1.
  • With GCC, using -fPIE – which is used to emit Position Independent Executables – or the alternative -fpie, of course, implies the use of -fPIC, which in turn means that the two macros noted above are defined; at the same time, two more are defined, __pie__ and __PIE__, with the same values as described in the previous point. Path64 defines none of these four macros when building PIE. (A quick check is sketched right after this list.)
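
This is trivially checked with something like the following snippet of mine (not from any project):

/* Print the PIC/PIE macros the compiler defines; under GCC, -fpic/-fpie set
 * them to 1 and -fPIC/-fPIE set them to 2, so comparing the output of the
 * two compilers for the same flags shows the differences described above. */
#include <stdio.h>

int main(void) {
#ifdef __PIC__
    printf("__PIC__ = %d\n", __PIC__);
#else
    printf("__PIC__ undefined\n");
#endif
#ifdef __PIE__
    printf("__PIE__ = %d\n", __PIE__);
#else
    printf("__PIE__ undefined\n");
#endif
    return 0;
}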

But enough ranting about Pathscale, before they feel I’m singling them out (which I’m not). Let’s rant a bit about Clang as well, the only compiler up to now that properly dropped write-only unit-static variables. I had very high expectations for improving unpaper through its suggestions, but… it turns out it cannot really create any executable, at least that’s what autoconf tells me:

configure:2534: clang -O2 -ggdb -Wall -Wextra -pipe -v   conftest.c  >&5
clang version 2.9 (tags/RELEASE_29/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
 "/usr/bin/clang" -cc1 -triple x86_64-pc-linux-gnu -emit-obj -disable-free -disable-llvm-verifier -main-file-name conftest.c -mrelocation-model static -mdisable-fp-elim -masm-verbose -mconstructor-aliases -munwind-tables -target-cpu x86-64 -target-linker-version -momit-leaf-frame-pointer -v -g -resource-dir /usr/bin/../lib/clang/2.9 -O2 -Wall -Wextra -ferror-limit 19 -fmessage-length 0 -fgnu-runtime -fdiagnostics-show-option -o /tmp/cc-N4cHx6.o -x c conftest.c
clang -cc1 version 2.9 based upon llvm 2.9 hosted on x86_64-pc-linux-gnu
#include "..." search starts here:
#include <...> search starts here:
End of search list.
 "/usr/bin/ld" --eh-frame-hdr -m elf_x86_64 -dynamic-linker /lib64/ -o a.out /usr/lib/../lib64/crt1.o /usr/lib/../lib64/crti.o crtbegin.o -L -L/../../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 -L/../../.. /tmp/cc-N4cHx6.o -lgcc --as-needed -lgcc_s --no-as-needed -lc -lgcc --as-needed -lgcc_s --no-as-needed crtend.o /usr/lib/../lib64/crtn.o
/usr/bin/ld: cannot find crtbegin.o: No such file or directory
/usr/bin/ld: cannot find -lgcc
/usr/bin/ld: cannot find -lgcc_s
clang: error: linker command failed with exit code 1 (use -v to see invocation)
configure:2538: $? = 1
configure:2576: result: no

What’s going on? Well, Clang doesn’t provide its own crtbegin.o file for the C runtime prologue (while Path64 does), so it relies on the one provided by GCC, which has to be on the system somewhere. Unfortunately, to identify where this file is… it just tries a list of paths, hit or miss.

% strace -e stat clang test.c -o test |& grep crtbegin.o
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.5.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.5.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.5.1/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.5.1/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.5/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.5/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.4.5/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.4.5/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.4.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.4.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.4.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.4.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.3.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.3.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.3.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.3.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.3.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.3.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.2.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.2.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.2.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.2.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.2.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.2.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.2.1/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.2.1/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/crtbegin.o", 0x7fffc937f170)     = -1 ENOENT (No such file or directory)
stat("/../../../../lib64/crtbegin.o", 0x7fffc937f170) = -1 ENOENT (No such file or directory)
stat("/lib/../lib64/crtbegin.o", 0x7fffc937f170) = -1 ENOENT (No such file or directory)
stat("/usr/lib/../lib64/crtbegin.o", 0x7fffc937f170) = -1 ENOENT (No such file or directory)
stat("/../../../crtbegin.o", 0x7fffc937f170) = -1 ENOENT (No such file or directory)

Yes, you can see that it has a hardcoded list of GCC versions that it looks for, from newest to oldest, until it falls back to some generic paths (which don’t make that much sense to me, to be honest, but never mind). This means that on my system, which only has GCC 4.6.1 installed, you can’t use clang at all. This was reported last week, and while a patch is available, a real solution is still not there: we shouldn’t have to patch and bump clang each time a new micro version of GCC is released that upstream didn’t already list!

Sigh. While GCC sure has its shortcomings, this is not really looking promising either.

x86 and SIMD extensions

The approach I’m taking with the tinderbox is to not use any special optimisation flags in it, even though I could use it to test stuff that can get tricky, such as -ftracer. It is important to note that it doesn’t even set any particular architecture, meaning that the compiler is told to use the default i686 architecture without any SIMD extensions enabled.

Why is this relevant? Well, there seems to be some trouble with this idea. A number of packages, both multimedia and not, seem to rely on the presence of some of those instructions. The problem is not when they actually insert literal ASM code using said extensions, as that passes right through the compiler and is sent directly to the assembler program, which always supports the whole set of extensions, at least for what concerns x86, as far as I can tell (other architectures seem to have more constrained assemblers).

The problem is that a C compiler such as GCC provides a “higher-level” interface to the processor’s instructions: intrinsics. Intrinsics allow you to write C code that calls instructions directly on C objects, without using volatile ASM literals, and reusing similar syntax between MMX/SSE and other extensions. This is not really the most common way to implement SIMD-based optimisations, as most of the developers involved in heavy optimisation, such as in libav, prefer writing their code directly in assembly, tweaking it depending on the processor’s design and not just on the extension used. As you can see in libav’s sources, the most recent optimisations have been written directly in assembly source files compiled with yasm rather than in C files.
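
For those who have not seen them, this is roughly what intrinsics-based code looks like; it’s a generic example of mine, not code from any of the packages in question:

/* Add two float arrays four elements at a time with SSE intrinsics.
 * Building this without SSE enabled (e.g. plain -march=i686) is exactly
 * the kind of failure the tinderbox runs into. */
#include <xmmintrin.h>

void add_floats(float *dst, const float *a, const float *b, int n) {
    for (int i = 0; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
    for (int i = n & ~3; i < n; i++)       /* scalar tail */
        dst[i] = a[i] + b[i];
}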

Unfortunately, this is hindering my tinderboxing efforts, as it stops me from installing some of the packages in the container, which in turn means that whatever depends on them can’t be installed either. Now, it is true that many times this situation is just something the developer didn’t consider, and can be fixed, but in a number of projects there is no intention to remove the blocker, as they refuse to support older CPUs without said SIMD extensions, mostly for performance reasons.

I’m not sure how to proceed here: I could either change the tinderbox’s flags so that SSE is enabled, which would reflect your average modern system, or I could keep filing bugs for those packages until we sort out a way to deal with them. Suggestions are definitely welcome.

Impressions of Path64 compiler

So I noticed today that an ebuild for the Path64 compiler hit Portage; being the ELF nerd that I am, it interested me on a technical level, more than for the optimizations (especially since I’m never happy to hear anything described as “the most sophisticated”; claims like that tend to simply bother me).

Before starting to test the compiler, I have to say that the ebuilds themselves have a bit of trouble: the pre-built binary one (dev-lang/ekopath) changes its path at each update, which breaks Makefiles and other scripts where you could be using a full path to the compiler (which has to be the case if you wish to target the binary toolchain rather than the custom-built one), while the custom-built one (dev-lang/path64) does not check the validity of the dynamic linker name when trying to gather it from GCC, and breaks when using my customized specs for forced --as-needed, as they change the command line used to call collect2. Both problems are now reported in Bugzilla and I hope they’ll be solved soon.

What is my baseline test? Well, let’s start with something simple: Ruby-ELF has a number of tests implemented for multiple compilers, in particular GCC, SunStudio and ICC on Linux/AMD64; adding a new compiler just requires rebuilding some object files and then adding some lines of code to the testsuite to check them out. There are always a few attributes that need to be adapted, such as the ELF entry points, but that’s beside the point now; it is expected for compilers to have small variations in their behaviour, otherwise it wouldn’t make sense to have multiple compilers at all.

This test alone made me feel like I’m playing with an alpha version of a compiler rather than something already targeted at production use, as it seems to be sold to the public. Given that the test files I use are very small and simplistic, I wasn’t expecting any differences at all, besides the most obvious ones. For instance, I already know that ICC appends a .0 suffix to all the local symbols (unit-static ones), and SunCC uses common symbols rather than BSS symbols for external TLS variables. But all in all, they are very similar. It turns out that Path64 has more semantic differences than the others.

First issue: on a very simple, hello-world type executable, where only one symbol – printf() – is used, all the other compilers manage to link only to the C library, which provides that symbol. Path64 instead adds one more dependency, or rather its own variation of it. This in turn adds yet another dependency, which makes it two extra objects to be loaded for simple executables (yes, it might sound like it is impossible not to load the math library, but there are cases where that actually happens). This is extra nasty because linking to that library also means emitting “weak symbols” used for C++ language support.

Not extremely difficult to work around, though: just add -Wl,--as-needed to the command line to make the linker skip over the library, as it is really unused — this is what GCC does in its specs files, by the way: it enables as-needed linking, lists its support library, then disables it again, so that the original semantics are restored.

There is one particularity to the Pathscale compiler: it sets the OS ABI on the ELF file to the code for Linux, on static executables. Neither GCC nor ICC do so (I’m not sure about SunStudio as I was unable to produce a static executable out of it last time). Nothing wrong with this, and I’m actually often wondering why compilers never did that.

Next up, the trouble starts for the compiler: one of the tests is designed to make sure that Ruby-ELF can provide the correct nm-style description code for the symbols in the object files. This is the most compiler-specific test of the whole suite, as both the notes I wrote above about ICC and SunStudio come from this one. Path64 is not so much inconsistent as it seems to be buggy in this area, though.

The first difference is that the other three compilers emit, in the relocatable object file, an absolute symbol with the name of the source translation unit. This is not the case for Path64, but it isn’t much of a problem: the symbol is probably helpful during debugging but not for real usage of the object, so it would just be a matter of rewiring the test. Where the problems arise is when it comes to section placement and copy-on-write, which is one of my pet peeves.

The test source file contains combinations of static, exported, and external variables and constants; since the unit is compiled as PIC, it also contains combinations of constants that do and do not require relocation:

char external_variable[] = "foo";
static char static_variable[] tc_used = "foo";

const char external_constant[] = "foo";
static const char static_constant[] tc_used = "foo";

const char *relocated_external_variable = "foo";
const char *const relocated_external_constant = "foo";

static const char *relocated_static_variable tc_used = "foo";
static const char *const relocated_static_constant tc_used = "foo";

All three of the compilers implemented up to now do the right thing and emit the non-relocated constants in the .rodata section, keeping only the relocated ones (i.e., the pointers) in the sections that are copy-on-write.

Finally, for those keeping score, the missed optimization I noted back in April is missed by Path64 as well as by GCC and ICC. Only clang has so far been able to actually make the best out of that code.

I guess I’ll have some reports to file with PathScale, and I’ll keep an eye on this compiler. On the other hand, please don’t ask for it to be tested in any tinderbox for now. Before I can even consider that, it’ll need to improve a bit further… and I’ll need a more powerful machine to use for tinderboxing.