Birch Books: 8051 and Yak Shaving

I have previously discussed my choice of splitting the actuator board, pointing out that I’ll probably try designing an alternative controller board using something like the Adafruit Feather M4 and writing the firmware in CircuitPython. Part of the reason is that it’s just easier, but part of it is that the 8051 is an annoying platform to work with.

There are a few different compilers for this platform, but as far as I know, the only open-source and maintained one is SDCC, the Small Device C Compiler. I hadn’t used it in forever, but I was very happy to see a new release this year, with C2X work in progress and C11 (mostly) supported, so I was in high spirits when I started working on this.

A Worrying Demonstration

I started from a demo that was supposed to be written explicitly for the STC89. The first thing I noted was that the code does not actually match the documentation on the same page: it references a _sdcc_external_startup() function that is not actually defined. On the other hand, it does not seem to be required. There are other issues with the code, and for something designed to work with the STC89, it seems overly complicated. Let me try to dissect the problems.

First of all, the source code manually declares the “Special Function Registers” (SFRs) for the device. In this case I don’t really understand the point, since all of the declared registers are part of the base 8051 architecture, and would already be declared by any of the model-specific header files that SDCC provides. While the STC89 does have a number of registers that are not found elsewhere, none of those are used here. In my code I ended up importing at89x52.h, which is meant for the Atmel (now Microchip) AT89 series, and is the closest header I found for the STC89. I have since filed a patch with a header written based on other headers and the datasheet.

Side note: the datasheet is impressive in its level of detail. It includes everything you might want to know, including the full ISA description and a number of example cases.

Once you have the proper definitions in the headers, you can also avoid a lot of binary flag logic — the most important registers on 8051 chips are bit-addressable, so you don’t need to remember how many bits to shift around to set the correct flag to enable interrupts. And if you are worried that using the bit-addressed registers would be slower: no, as long as you’re changing fewer than three bits of a register at a time, setting them through the bit-addressed variant is the same speed or faster. In the case of this demo, the original code uses two orl instructions, each taking 2 cycles, to set three bits total; with the setb instruction, it only takes 3 cycles.
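To make the comparison concrete, here is a sketch of the two styles using SDCC’s 8052.h names, where EA and ET0 are the bit-addressed views of IE.7 and IE.1. This is a firmware fragment for an SDCC 8051 target, illustrative only, and not something to run on a PC:

```c
// Firmware fragment for SDCC (8051 target); illustrative only.
#include <mcs51/8052.h>

void enable_timer0_irq_mask(void)
{
    // One orl instruction (2 cycles), with the bit positions spelled
    // out by hand: EA is bit 7, ET0 is bit 1 of IE.
    IE |= (1 << 7) | (1 << 1);
}

void enable_timer0_irq_bits(void)
{
    // Two setb instructions (1 cycle each): same speed here, and no
    // magic shift amounts to remember.
    ET0 = 1;
    EA = 1;
}
```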

Once you use the correct header (either my contributed stc89c51rc.h, the at89x52.h, or even the very generic 8052.h), you have access to other more-than-thirty-year-old features that weren’t part of the original 8051, but were part of the subsequent 8052, from which both the STC89 and AT89 series derive. One of these features, as even Wikipedia knows, is a third 16-bit timer. This is important for the demo, since the demo is effectively just an example of “setting up and using an accurate timer”.

Indeed, the code is fairly complicated, as it configures the timer both in main() and in the interrupt handler clockinc(). The reason is that Timer 0 is configured in “Mode 0”: the timer register is 13-bit (spread across the TH0, TL0 pair), its rollover causes an interrupt, but you need to reload the timer afterwards. That is because you need more than 8 bits to make the timer fire at 1kHz (once every millisecond): at the default 11.0592 MHz clock, the timer counts at SYSCLK/12, about 921.6 kHz, so a millisecond is about 922 counts. And while Timer 0 supports “automatic reload”, it only supports 8-bit reload values, since it uses TH0 to hold the reload value.

8052 derivatives support a third timer (Timer 2), which is 16-bit, rather than 8- or 13-bit, and it supports auto-reload of full 16-bit values through RCAP2H, RCAP2L. The only other complication is that, unlike with Timer 0 and Timer 1, you need to manually “disarm” the interrupt flag (TF2) in the handler, but that’s still a lot less code.

I found the right way to solve this problem on Google Books, in a book that does not appear to have an ebook edition, and does not seem to be in print at all. The end result is the following, modified demo.

// Source code under CC0 1.0
#include <stdbool.h>
#include <mcs51/8052.h>

volatile unsigned long int clocktime;
volatile bool clockupdate;

void clockinc(void) __interrupt(5)
{
	TF2 = 0;  // disarm the interrupt flag.
	clocktime++;
	clockupdate = true;
}

unsigned long int clock(void)
{
	unsigned long int ctmp;

	do
	{
		clockupdate = false;
		ctmp = clocktime;
	} while (clockupdate);

	return ctmp;
}

void main(void)
{
	// Configure the timer for the 11.0592 MHz default SYSCLK,
	// 1000 ticks per second.
	TH2 = (65536 - 922) >> 8;
	TL2 = (65536 - 922) & 0xFF;
	RCAP2H = (65536 - 922) >> 8;
	RCAP2L = (65536 - 922) & 0xFF;

	TF2 = 0;
	ET2 = 1;
	EA = 1;
	TR2 = 1; // Start the timer.

	for(;;)
		P3 = ~(clock() / 1000) & 0x03;
}

I can only assume that this demo was written long enough ago that the author forgot to update it, because… the author is an SDCC developer, and refers to his own papers on it at the bottom of the demo.

A Very Conservative Compiler

Speaking of the compiler itself, I had no idea what a mess I would get myself into by using it. It turns out that despite being the de facto only open-source embedded compiler people can use for the 8051, it is not a very good compiler.

I don’t say that to drag down the development team, who are probably trying to wrestle a very complex problem space (the 8051’s age makes its quirks understandable, but irritating — and the fact that there are more derivatives than there are people working on them is not making it any better), but rather because it is missing so much.

As Philipp describes it, SDCC “has a relative conservative architecture” — I would say that it’s a very conservative architecture, given that even some optimisations that, as far as I can tell, are completely safe are skipped. For example, var % 2 (which I was using to alternate between two test patterns on my LEDs) generated code calling into a function implementing integer modulo, despite being equivalent to var & 1, which can be done with the basic instructions.

Similarly, the compiler does not optimise division by powers of two — which means that for anything that is not a build-time constant you’re better off using bitwise operations rather than divisions. It’s another thing I addressed in the demo above, even though there it does not matter, as the value is constant at build time.

Speaking of build-time constants — it turns out that SDCC does not do constant propagation at all. Even when you define something static const and never take its address, it’s emitted in the data section of the output program, rather than being replaced at build time where it’s used. Together with the lack of optimisation noted above, this meant I gave up on my idea of structuring the firmware in easily-swappable components — those would rely on the compiler being able to do optimisation passes such as constant propagation and inlining, but here we’re talking about the lack of much lower-level optimisations.

Originally, this blog post was also going to touch on the fact that the one library of 8051 interfaces I found hasn’t been touched in six years, still has a few failed merge markers, and doesn’t even parse with modern SDCC — but then again, now that I know SDCC does not optimise even the most basic of operations, I don’t think using a library like that is a good idea: the IO module there is extremely complicated, considering that most ports’ I/O lines can be accessed with bit-addressed registers.

Now, as Andrea (Insomniac) pointed out, Philipp also has a document on using LLVM with SDCC — but the source code it references is more than five years old, and relies on the LLVM C backend, which means it generates C code for SDCC to continue compiling. I do wonder if it would make sense to have a proper LLVM target for 8051 code instead — it’s beyond the amount of work I want to put into this project, but last year they merged AVR support into LLVM, which makes it possible to use (or at least try) Rust on 8-bit controllers already. It would be interesting to see whether 8051 cores could be used with something other than C (or hand-written assembly).

You may wonder why I care this much about a side-project MCU that is quite a bit older than me. The thing is, I don’t, really. I just seem to keep bumping into the 8051/8052 in various places. I nearly wrote a disassembler for it to hack at my laptop’s keyboard layout a few years ago, and I still feel bad I didn’t complete that project. The 8051 is still an extremely common micro in low-power applications, and the STC89 in particular is possibly the cheapest micro you can use to set up prototypes at home: you can get 20 of them for less than 60p each from AliExpress, if you have the time to wait — I know, I just ordered a lot, just to have them around if I decide to do more with them now that I sort-of understand them. The manufacturer appears to still make many variants of them, and I would be extremely surprised if you didn’t have a bunch of these throughout your home, in computers, dishwashers, washing machines, mice, and other devices that just need some cheap and cheerful logic controller without breaking the bank. Heck, I expect them to be used in glucometers, too!

With all these devices tied to closed-source, proprietary compilers, I would feel more comfortable if there were some active work on supporting a modern compiler platform in the open-source world as well. From my point of view, it sounds like the needs of industrial users and those of the hobbyist community have diverged very much on this topic.

Sum It All Up

So for my art project I decided that even SDCC is good enough, but I wanted to make sure I would not end up with broken code (which appears to happen fairly often), so I ended up reading the generated assembly code to make sure it made sense. Despite not being particularly familiar with the 8051 ISA, thanks to the Wikipedia article and the detailed datasheet from STC it wasn’t too hard to read through it.

While I was going through it, I also figured out how to rewrite parts of the C code to force SDCC to emit some decent code. For instance, instead of a branch that adds either 1 or 32 to a counter, I was better off making a temporary variable hold 1, changing it to 32 where needed, and then adding that variable. The fact that SDCC couldn’t optimise that made me sad, but again, it’s understandable given the priorities.

Hopefully I have kept the source code fairly readable. You can check the history to see the various things I kept changing to make it more readable in assembly as well. Part of the changes meant changing some of my plans: in my first notes I wanted to run through 20 “hours” configurations in 60 minutes — but to optimise the code I decided that it’ll run 16 “hours” in just over 68 minutes. That way I could use a lot of powers of two and do away with annoying calculations.

The subtlety of modern CPUs, or the search for the phantom bug

Yesterday I released a new version of unpaper, which is now in Portage, even though its dependencies are not exactly straightforward after making it use libav. But when I packaged it, I realized that the tests were failing — even though I had been running the tests all along while making changes, to make sure not to break the algorithms, which (as you may remember) I have not designed or written — I don’t really have enough math to figure out what’s going on with them. I was able to simplify a few things, but I needed Luca’s help for the most part.

It turned out that the problem only happened when building with -O2 -march=native, so I decided to restrict the tests and look into it again in the morning. Indeed, on Excelsior, using -march=native would cause them to fail, but on my laptop (where I had been running the tests after every single commit), they would not. Why? Furthermore, Luca was also reporting test failures on his laptop with OSX and clang, but I had not tested there to begin with.

A quick inspection of one of the failing tests’ outputs with vbindiff showed that the diffs were quite minimal: one bit off at some non-obvious interval. It smelled like a very subtle change. After complaining on G+, Måns pushed me in the right direction: some instruction set that differs between the two.

My laptop uses the core-avx-i arch, while the server uses bdver1. They have different levels of SSE4 support – AMD having their own SSE4a implementation – and different extensions. I should probably have paid more attention here and noticed that Bulldozer has FMA4 instructions, but I did not; it’ll prove important later.

I decided to start disabling extensions in alphabetical order, mostly expecting the problem to be in AMD’s implementation of some instructions, pending some microcode update. When I disabled AVX, the problem went away — AVX essentially introduces a new encoding of instructions, so enabling AVX causes all the instructions otherwise present in SSE to be re-encoded, and it is a dependency for the FMA4 instructions to be usable.

The problem was then reducing the code enough to figure out whether the issue was a bug in the code, in the compiler, in the CPU, or just in the assumptions. Given that unpaper is over five thousand lines of code and comments, I needed to reduce it a lot. Luckily, there are ways around that.

The first step was to find in which part of the code the problem appears. Luckily, unpaper is designed as a bunch of functions that run one after the other. I started disabling filters and masks, and I was able to narrow the problem down to the deskewing code — which is where most of the problems happened before.

But even the deskewing code is a lot — and it depends on at least part of the general processing, including loading the file and converting it to an AVFrame structure. I decided to try to reduce the code to a standalone unit calling into the full deskewing code. But when I copied it over and looked at how much code was involved, between the skew detection and the actual rotation, it was still a lot. I decided to use gdb to figure out which of the two halves was misbehaving.

The interface between the two halves is well-defined: the first returns the detected skew, and the second takes the rotation to apply (the negative of what the first returned) and the image to apply it to. It’s easy. A quick look through gdb at the call to rotate() in both a working and a failing setup told me that the returned value from the first half matched perfectly; this was great, because it meant that the surface to inspect was heavily reduced.

Since I did not want to have to test all the code that loads the file from disk and decodes it into a raw representation, I looked into the gdb manual and found the dump command, which allows you to dump part of the process’s memory into a file. I dumped the AVFrame::data content and decided to use that as the input. At first I tried to compile it into the binary (you only need xxd -i to generate C code declaring the whole binary file as a byte array), but it turns out that GCC is not designed to efficiently compile a 17MB binary blob passed in as a byte array. I then opted to just open the raw binary file and fread() it into the AVFrame object.

My original plan involved using creduce to find the minimal set of code needed to trigger the problem, but it was tricky, especially when trying to match a complete file output against its md5. I decided to proceed with the reduction manually, starting with all the conditionals for pixel formats that were not exercised… and then I realized that I could once again split the code in two operations. While the main interface is only rotate(), there were two logical parts of the code in use: one translating the coordinates before and after the rotation, and the interpolation code that reads the old pixels and writes the new ones. The latter also depended on all the code to set a pixel in place starting from its components.

By writing out the calls to the interpolation function, I was able to restrict the issue to the coordinate translation code rather than the interpolation, which made things much better: the reduced test case went down to a handful of lines:

void rotate(const float radians, AVFrame *source, AVFrame *target) {
    const int w = source->width;
    const int h = source->height;

    // create 2D rotation matrix
    const float sinval = sinf(radians);
    const float cosval = cosf(radians);
    const float midX = w / 2.0f;
    const float midY = h / 2.0f;

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            const float srcX = midX + (x - midX) * cosval + (y - midY) * sinval;
            const float srcY = midY + (y - midY) * cosval - (x - midX) * sinval;
            externalCall(srcX, srcY);
        }
    }
}

Here externalCall is a simple function to extract the values; the only thing it does is print them on the standard error stream. In this version there are still references to the input and output AVFrame objects, but as you can see there is no usage of them, which means the test case is now self-contained and does not require any input or output file.

Much better, but still too much code to go through. The inner loop over x was simple to remove: just hardwire it to zero, and the compiler was still able to reproduce the problem. But if I hardwired y to zero, the compiler would trigger constant propagation and just pre-calculate the right value, whether or not AVX was in use.

At this point I was able to run creduce; I only needed to check that the first line of the output matched the “incorrect” version, and no input was required (the radians value was fixed). Unfortunately it turns out that using creduce with loops is not a great idea, because it may well reduce away the y++ statement or the y < h exit condition, and then you’re in trouble. Indeed, it got stuck in infinite loops on my code multiple times.

But it did help a little to simplify the calculation. And with, again, a lot of help from Måns in making sure that the sinf()/cosf() functions would not return different values – they don’t; they are actually collapsed by the compiler into a single call to sincosf(), so you don’t have to write ugly code to leverage it! – I brought the code down to

extern void externCall(float);
extern float sinrotation();
extern float cosrotation();

static const float midX = 850.5f;
static const float midY = 1753.5f;

void main() {
    const float srcX = midX * cosrotation() - midY * sinrotation();
    externCall(srcX);
}

No external libraries, not even libm. The external functions are in a separate source file, and besides providing fixed values for the sine and cosine, the externCall() function only calls printf() with the provided value. Oh, if you’re curious: the radians parameter became 0.6f because 0, 1 and 0.5 would not trigger the behaviour, but 0.6, the truncated version of the actual parameter coming from the test file, would.

Checking the generated assembly code for the function then pointed out the problem, at least to Måns, who actually knows Intel assembly. Here follows a diff of the code above, built with -march=bdver1 and with -march=bdver1 -mno-fma4 — because it turns out the instruction causing the problem is not an AVX one but an FMA4 one; more on that after the diff.

        movq    -8(%rbp), %rax
        xorq    %fs:40, %rax
        jne     .L6
-       vmovss  -20(%rbp), %xmm2
-       vmulss  .LC1(%rip), %xmm0, %xmm0
-       vmulss  .LC0(%rip), %xmm2, %xmm1
+       vmulss  .LC1(%rip), %xmm0, %xmm0
+       vmovss  -20(%rbp), %xmm1
+       vfmsubss        %xmm0, .LC0(%rip), %xmm1, %xmm0
        leave
        .cfi_remember_state
        .cfi_def_cfa 7, 8
-       vsubss  %xmm0, %xmm1, %xmm0
        jmp     externCall@PLT
 .L6:
        .cfi_restore_state

It’s interesting that it changes the order of the instructions as well as the constants — for this diff I have manually swapped .LC0 and .LC1 on one side, as they would otherwise just end up with different names due to instruction ordering.

As you can see, the FMA4 version has one instruction fewer: vfmsubss replaces both one of the vmulss instructions and the vsubss instruction. vfmsubss is an FMA4 instruction that performs a Fused Multiply and Subtract operation — midX * cosrotation() - midY * sinrotation() indeed has a multiply and a subtract!

Originally, since I was disabling the whole AVX instruction set, all the vmulss instructions ended up replaced by mulss, which is the SSE version of the same instruction. But when I realized that the instruction without a correspondence was vfmsubss and I googled for it, it was obvious that FMA4 was the culprit, not the whole of AVX.

Great, but how does that explain the failure on Luca’s laptop? He’s not so crazy as to use an AMD laptop — nobody would be! Well, it turns out that Intel also has its own fused multiply-add instruction set, just with three operands rather than four, starting from Haswell CPUs, which include… Luca’s laptop. A quick check on my NUC, which also has a Haswell CPU, confirms that the problem also exists for the core-avx2 architecture, even though the code diff is slightly less obvious:

        movq    -24(%rbp), %rax
        xorq    %fs:40, %rax
        jne     .L6
-       vmulss  .LC1(%rip), %xmm0, %xmm0
-       vmovd   %ebx, %xmm2
-       vmulss  .LC0(%rip), %xmm2, %xmm1
+       vmulss  .LC1(%rip), %xmm0, %xmm0
+       vmovd   %ebx, %xmm1
+       vfmsub132ss     .LC0(%rip), %xmm0, %xmm1
        addq    $24, %rsp
+       vmovaps %xmm1, %xmm0
        popq    %rbx
-       vsubss  %xmm0, %xmm1, %xmm0
        popq    %rbp
        .cfi_remember_state
        .cfi_def_cfa 7, 8

Once again I swapped .LC0 and .LC1 afterwards for consistency.

The main difference here is that the instruction for the fused multiply-subtract is vfmsub132ss, and a vmovaps is involved as well. If I read the documentation correctly, this is because the result is stored in %xmm1 but needs to be moved to %xmm0 to be passed to the external function. I’m not enough of an expert to tell whether gcc is doing extra work here.

So why is this instruction causing problems? Well, Måns knew, and pointed out that the result is now more precise, and thus I should not work around it. Wikipedia, as linked before, also points out why this happens:

A fused multiply–add is a floating-point multiply–add operation performed in one step, with a single rounding. That is, where an unfused multiply–add would compute the product b×c, round it to N significant bits, add the result to a, and round back to N significant bits, a fused multiply–add would compute the entire sum a+b×c to its full precision before rounding the final result down to N significant bits.

Unfortunately this does mean that we can’t have bit-exactness of images across CPUs that implement fused operations. Which means my current test harness is not good, as it compares the MD5 of the output with the golden output from the original test. My probable next move is to use cmp to count how many bytes differ from the “golden” output (the version without optimisations in use) and, if the number is low, say less than 1‰, accept it as valid. It’s probably not ideal and could lead to further variation in output, but it might be a good start.

Optimally, as I said a long time ago, I would like to use a tool like pdiff to tell whether there are actual changes in the pixels, and to identify things like a 1-pixel translation in any direction, which would be harmless… but until I can figure something out, it’ll be an imperfect test suite anyway.

A huge thanks to Måns for the immense help, without him I wouldn’t have figured it out so quickly.

Compilers’ rant

Be warned that this post is written in the form of a rant, because I’ve spent the past twelve hours fighting with multiple compilers, trying to make sense of them while trying to get the best out of my unpaper fork through their different analyses.

Let’s start with a few more notes about the Pathscale compiler, which I already slightly ranted about for my work on Ruby-Elf. I know I didn’t do the right thing when I posted that stuff, as I should have reported the issues upstream directly, but I didn’t have much time, I was already swamped with other tasks, and I was going through a very bad personal moment, so I quickly wrote up my feelings without filing proper reports. I have to thank the Pathscale people for accepting the critiques anyway, as Måns reported to me that a couple of the issues I noted, in particular the use of --as-needed and the __PIC__ definition, were taken care of (sorta, see in a moment).

First problem with the Pathscale compiler: by mistake I had been using the C++ compiler to compile C code; rather than screaming at me, it went through properly, with one little difference: a static constant gets mis-emitted, and this is not a minor issue at all, even though I was using the wrong compiler! Instead of having the right content, the constant is emitted as an empty, zeroed-out array of characters of the right size. I only noticed because Ruby-Elf’s cowstats reported what should have been a constant as being in the .bss section. This is probably the most worrisome bug I have seen with Pathscale yet!

Of course its impact is theoretically limited by the fact that I was using the wrong compiler, but since the code should be written in a way that is both valid C and C++, I’m afraid the same bug might exist for some properly-written C++ code. I hope it gets fixed soon.

The killer feature of Pathscale’s compiler is supposedly optimisation, though, and there it looks like it is doing quite a nice job; indeed, I can see from the emitted assembly that it is finding more semantics in the code than GCC seems to, even though it requires -O3 -march=barcelona to make something useful out of it — and in that case you give up debugging information, as the debug sections may reference symbols that were dropped, and the linker will be unable to produce a final executable. This is hit and miss of course, as it depends on whether the optimiser drops those symbols, but it makes it difficult to use -ggdb at all in these cases.

Speaking of optimisations: as I said in my other post, GCC’s missed optimisation is still missed by Pathscale even with full optimisation (-O3) turned on, and with the latest sources. And the wrong placement of static constants that I ranted about in that post is also still not fixed.

Finally, for what concerns the __PIC__ definition that Måns referred to as fixed: well, it isn’t really as fixed as one would expect. Yes, using -fPIC now implies defining __PIC__ and __pic__ as GCC does, but there are two more issues:

  • While this does not apply to x86 and amd64 (but it does to m68k, PowerPC and Sparc), GCC supports two modes for emission of position-independent code: one that is limited by the architecture’s global offset table maximum size, and one that overrides that maximum size (I never investigated how it does so, probably through some indirect tables). The two options are enabled through -fpic (or -fpie) and -fPIC (-fPIE), and define the macros as 1 and 2, respectively; Path64 only ever defines them to 1.
  • With GCC, using -fPIE – which is used to emit Position Independent Executables – or the alternative -fpie, of course, implies -fPIC, which in turn means that the two macros noted above are defined; at the same time, two more are defined, __pie__ and __PIE__, with the same values as described in the previous point. Path64 defines none of these four macros when building PIE.

But enough ranting about Pathscale, before they feel I’m singling them out (which I’m not). Let’s rant a bit about Clang as well, the only compiler up to now that properly dropped write-only unit-static variables. I had very high expectations for improving unpaper through its suggestions, but… it turns out it cannot really create any executable, at least that’s what autoconf tells me:

configure:2534: clang -O2 -ggdb -Wall -Wextra -pipe -v   conftest.c  >&5
clang version 2.9 (tags/RELEASE_29/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
 "/usr/bin/clang" -cc1 -triple x86_64-pc-linux-gnu -emit-obj -disable-free -disable-llvm-verifier -main-file-name conftest.c -mrelocation-model static -mdisable-fp-elim -masm-verbose -mconstructor-aliases -munwind-tables -target-cpu x86-64 -target-linker-version 2.21.53.0.2.20110804 -momit-leaf-frame-pointer -v -g -resource-dir /usr/bin/../lib/clang/2.9 -O2 -Wall -Wextra -ferror-limit 19 -fmessage-length 0 -fgnu-runtime -fdiagnostics-show-option -o /tmp/cc-N4cHx6.o -x c conftest.c
clang -cc1 version 2.9 based upon llvm 2.9 hosted on x86_64-pc-linux-gnu
#include "..." search starts here:
#include <...> search starts here:
 /usr/local/include
 /usr/bin/../lib/clang/2.9/include
 /usr/include
 /usr/lib/gcc/x86_64-pc-linux-gnu/4.6.1/include
End of search list.
 "/usr/bin/ld" --eh-frame-hdr -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o a.out /usr/lib/../lib64/crt1.o /usr/lib/../lib64/crti.o crtbegin.o -L -L/../../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 -L/../../.. /tmp/cc-N4cHx6.o -lgcc --as-needed -lgcc_s --no-as-needed -lc -lgcc --as-needed -lgcc_s --no-as-needed crtend.o /usr/lib/../lib64/crtn.o
/usr/bin/ld: cannot find crtbegin.o: No such file or directory
/usr/bin/ld: cannot find -lgcc
/usr/bin/ld: cannot find -lgcc_s
clang: error: linker command failed with exit code 1 (use -v to see invocation)
configure:2538: $? = 1
configure:2576: result: no

What’s going on? Well, Clang doesn’t provide its own crtbegin.o file for the C runtime prologue (while Path64 does), so it relies on the one provided by GCC, which has to be on the system somewhere. Unfortunately, to identify where this file is… it just tries one path after another:

% strace -e stat clang test.c -o test |& grep crtbegin.o
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.5.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.5.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.5.1/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.5.1/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.5/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.5/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.4.5/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.4.5/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.4.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.4.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.4.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.4.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.3.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.3.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.3.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.3.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.3.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.3.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.2.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.2.4/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.2.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.2.3/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.2.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.2.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.2.1/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.2.1/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/gcc/x86_64-pc-linux-gnu/4.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/usr/lib64/gcc/x86_64-pc-linux-gnu/4.2/crtbegin.o", 0x7fffc937eff0) = -1 ENOENT (No such file or directory)
stat("/crtbegin.o", 0x7fffc937f170)     = -1 ENOENT (No such file or directory)
stat("/../../../../lib64/crtbegin.o", 0x7fffc937f170) = -1 ENOENT (No such file or directory)
stat("/lib/../lib64/crtbegin.o", 0x7fffc937f170) = -1 ENOENT (No such file or directory)
stat("/usr/lib/../lib64/crtbegin.o", 0x7fffc937f170) = -1 ENOENT (No such file or directory)
stat("/../../../crtbegin.o", 0x7fffc937f170) = -1 ENOENT (No such file or directory)

Yes, you can see that it has a hardcoded list of GCC versions to look for, from highest to lowest, until it falls back to some generic paths (which, to be honest, don’t make much sense to me, but nevermind). This means that on my system, which only has GCC 4.6.1 installed, you can’t use Clang. This was reported last week, and while a patch is available, a real solution is still missing: we shouldn’t have to patch and bump Clang each time a new micro version of GCC is released that upstream didn’t already list!

Sigh. While GCC sure has its shortcomings, this is not really looking promising either.

Impressions of Path64 compiler

So I noticed today that an ebuild for the Path64 compiler hit Portage; being the ELF nerd that I am, it interested me on the technical level more than for the optimisations (especially since I’m never happy to hear anything described as “the most sophisticated”; claims like that tend to simply bother me).

Before starting to test the compiler, I have to say that the ebuilds themselves had a bit of trouble: the pre-built binary one (dev-lang/ekopath) changes its install path at each update, which breaks Makefiles and other scripts where you might be using the full path to the compiler (which has to be the case if you wish to target the binary toolchain rather than the custom-built one), while the custom-built one (dev-lang/path64) does not check the validity of the dynamic linker name when trying to gather it from GCC, and breaks when using my customised specs for forced --as-needed, as they change the command line used to call collect2. Both problems are now reported in Bugzilla and I hope they’ll be solved soon.

What is my baseline test? Well, let’s start with something simple: Ruby-ELF has a number of tests implemented for multiple compilers, in particular GCC, SunStudio and ICC on Linux/AMD64; adding a new compiler just requires rebuilding some object files, and then adding some lines of code to the testsuite to check them out. There are always a few attributes that need to be adapted, such as the ELF entry points, but that’s beside the point now; it is expected for compilers to have small variations in their behaviour, otherwise it wouldn’t make sense to have multiple compilers at all.

This test alone made me feel like I was playing with an alpha version of a compiler rather than something already targeted at production use, as it seems to be sold to the public. Given that the test files I use are very small and simplistic, I wasn’t expecting any difference at all, besides the most obvious ones. For instance, I already know that ICC appends a .0 suffix to all the local symbols (unit-static ones), and SunCC uses common symbols rather than BSS symbols for external TLS variables. But all in all, they are very similar. It turns out that Path64 has more semantic differences than the others.

First issue: on a very simple, hello-world type executable, where only one symbol – printf() – is used, all the compilers manage to link only against libc.so.6, which provides that symbol. Path64 instead adds one more dependency on libgcc.so, or rather its own variation of it. This in turn adds a dependency on libm.so, which makes two extra objects to be loaded for simple executables (yes, it might sound like it is impossible not to load the math library, but there are cases where that actually happens). This is extra nasty because linking to that library also means emitting “weak symbols” used for C++ language support.

Not extremely difficult to work around though: just add -Wl,--as-needed to the command line to make it skip over libgcc.so, as it is really unused — this is what GCC does in its spec files, by the way: it enables as-needed linking, lists its support library, then disables it again, so that the original semantics are restored.

There is one peculiarity to the PathScale compiler: it sets the OS ABI on the ELF file to the code for Linux, on static executables. Neither GCC nor ICC does so (I’m not sure about SunStudio, as I was unable to produce a static executable out of it last time). Nothing wrong with this; I actually often wonder why compilers never did that.
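
You can check the field yourself on any binary; on most Linux systems, executables built by GCC carry the generic System V value rather than the Linux-specific one:

```shell
# the OS/ABI byte sits right in the ELF identification header:
readelf -h /bin/ls | grep 'OS/ABI'
```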

Next up, the trouble starts for the compiler: one of the tests is designed to make sure that Ruby-ELF can provide the correct nm-style description code for the symbols in the object files. This is the most compiler-specific test of the whole suite, as both the notes I wrote above about ICC and SunStudio come from it. Path64 is not so much inconsistent in this area as it seems to be buggy, though.

The first difference is that the other three compilers emit, in the relocatable object file, an absolute symbol with the name of the source translation unit. This is not the case for Path64, but it isn’t much of a problem: the symbol is probably helpful during debugging but not for real usage of the object, so it would just be a matter of rewiring the test. Where the problems arise is when it comes to the .data.rel.ro section and Copy-on-Write, which is one of my pet peeves.

The test source file contains combinations of static, exported, and external variables and constants; since the unit is compiled as PIC, it also contains combinations of constants that do and do not require relocation:



char external_variable[] = "foo";
static char static_variable[] tc_used = "foo";

const char external_constant[] = "foo";
static const char static_constant[] tc_used = "foo";

const char *relocated_external_variable = "foo";
const char *const relocated_external_constant = "foo";

static const char *relocated_static_variable tc_used = "foo";
static const char *const relocated_static_constant tc_used = "foo";

All three of the compilers implemented up to now do the right thing and emit the non-relocated constants in the .rodata section, keeping only the relocated ones (i.e., the pointers) in the .data.rel.ro sections, which are copy-on-write.

Finally, for those keeping score, the missed optimization I noted back in April is missed by Path64 just as it is by GCC and ICC. Only Clang, up to now, was able to actually make the best out of that code.

I guess I’ll have some reports to do to PathScale, and I’ll keep an eye on this compiler. On the other hand, please don’t ask for this to be tested in any tinderbox for now. Before I can even just consider this, it’ll need to improve a bit further… and I’ll need a more powerful machine to use for tinderboxing.

That innocent warning… or maybe not?

Anybody who has ever programmed in C with a half-decent compiler knows that warnings are very important and you should definitely not leave them be. Of course, there are more and less important warnings, and as the compiler’s understanding of the code increases, the more warnings it can give you (which is why using -Werror in released code is a bad idea, and why every new compiler release causes so many headaches to me and the other developers).

But there are times when the warnings, while not highlighting broken code, indicate more trivial issues, such as suboptimal or wasteful code. One of these warnings was introduced in GCC 4.6.0; it relates to variables that are declared, set, but never read, and I dreamed of it back in 2008.

Now, the warning as it is, is pretty useful. Even though a number of times it’s going to be used to mask unused-result warnings, it can show code where a semi-pure function (one without visible side effects, but not marked, or markable, as such because of caching and other extraneous accesses; more about that in my old article if you wish) is invoked just to set a variable that is never used. Especially with very complex functions, enough time might be spent processing for nothing.

Let me clarify this situation. If you have a function that silently reads data from a configuration file or a cache to give you a result (based on its parameters), you have a function that, strictly speaking, is non-pure. But if the end result depends chiefly on the parameters, and not on the external data, you could say that the function’s interface is pure from the caller’s perspective.

Take localtime() as an example: it is not a strictly-pure function because it calls tzset(), which – as the name suggests – is going to set some global variables responsible for identifying the current timezone. While these are most definitely side effects, they are not the kind of side effects that you’ll care about: if the initialization doesn’t happen there, it will happen the next time the function is called.

This is not the most interesting case though: tzset() is not a very expensive function, and quite likely it’ll be called (or would have been called) at some other point in the process. But there are a number of other functions, usually related to encoding or cryptography, which rely on pre-calculated tables that might actually be calculated at the time of use (why that matters is another story).
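
A minimal sketch of that pattern (the function and table here are my own invention, not from any real library): a lookup table computed on first use makes the function technically impure, even though callers can treat its interface as pure.

```c
#include <stdint.h>

/* cache shared by all calls: filled lazily on first use */
static uint8_t popcount_table[256];
static int table_ready;

/* interface-pure: the result depends only on the argument, but the
   first call pays for computing the whole table as a side effect */
static unsigned popcount8(uint8_t v) {
    if (!table_ready) {
        for (int i = 0; i < 256; i++) {
            int c = 0;
            for (int b = i; b != 0; b >>= 1)
                c += b & 1;
            popcount_table[i] = (uint8_t)c;
        }
        table_ready = 1;
    }
    return popcount_table[v];
}
```

Calling a function like popcount8() just to assign a variable that is never read wastes the whole table computation on the first call, which is exactly the kind of code the new warning can surface.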

Now, even considering this, a variable set but not used is usually not going to be troublesome by itself: if it’s used to mask a warning, for instance, you still want the side effect to apply, and you won’t be paying the price of the extra store, since the compiler will not emit the variable at all… as long as said variable is an automatic one, allocated on the stack of the function. Automatic variables undergo the SSA transformation, which among other things allows unused stores to be omitted from the code.

Unfortunately, SSA cannot be applied to static variables, which means that assigning a static variable, even though said static variable is never used, will cause the compiler to include a store of that value in the final code. Which is indeed what happens for instance with the following code (tested with both GCC 4.5 – which does not warn – and 4.6):

int main() {
  static unsigned char done = 0;

  done = 1;
  return 1;
}

The addition of the -Wunused-but-set-variable warning in GCC 4.6 is thus a godsend to identify these, and can actually lead to improvements in the performance of the code itself — although I wonder why GCC is still emitting the static variable in this case, since, at least in 4.6, it knows enough to warn you about it. I guess this is a matter of a missed optimization, nothing world-shattering. What I was much more surprised by is that GCC fails to warn you about a very similar situation:

static unsigned char done = 0;

int main() {

  done = 1;
  return 1;
}

In the second snippet above, the variable has been moved from function scope to unit scope, and this is enough to confuse GCC into not warning you about it. Obviously, to be able to catch this situation, the compiler has to perform more work than in the previous case, since the variable could be accessed by multiple functions; but at least with -funit-at-a-time it is already able to apply similar analysis, since it reports unused static functions and constants/variables. I reported this as bug #48779 upstream.

Why am I bothering to write a whole blog post about a simple missed warning and optimization? Well, while it is true that zeroed static variables don’t cause much trouble, since they are mapped to the zero page and shared by default, you can cause a huge waste of memory if you have a written-only variable that is also relocated, like in the following code:

#include <stddef.h>

static char *messages[] = {
  NULL, /* set at runtime */
  "Foo",
  "Bar",
  "Baz",
  "You're not reading this, are you?"
};

int main(int argc, char *argv[]) {
  messages[0] = argv[0];

  return 1;
}

Note: I made this code an executable just because it was easier to write down; you should think of it as a PIE so that you can feel the issue with relocation.

In this case, the messages variable is going to be emitted even though it is never read — by the way, it is not emitted if you don’t use it at all: when a static variable is reported as unused, the compiler also drops it; not so for the written-only ones, as I said above. Luckily I can usually identify problems like these while running cowstats, part of my Ruby-Elf utilities if you wish to try it, so I can look at the code that uses them; but you can guess it would have been nicer to have this in the compiler already.

I guess we’ll have to wait for 4.7 to have that. Sigh!

Upstream, rice it down!

While Gentoo often gets a bad name because of the so-called ricers, and upstream developers complain that we allow users to shoot themselves in the foot by setting CFLAGS as they please, it has to be said that not all upstream projects are good in that regard. For instance, there are a number of projects that, unless you enable debug support, will force you to optimise (or even over-optimise) the code, which is obviously not the best of ideas (this does not count things like FFmpeg that rely on Dead Code Elimination to link properly — in those cases we should be even more careful, but let’s leave that alone for now).

Now, what is the problem with forcing optimisation for non-debug builds? Well, sometimes you might not want to have debug support (extra verbosity, assertions, …) but you might still want to be able to fetch a proper backtrace; in such cases you have a non-debug build that needs to turn down optimisations. Why should I be forced to optimise? Most of the time, I shouldn’t.

Over-optimisation is even nastier: when upstream forces stuff like -O3, they might not even realise that the code can easily get slower. Why is that? Well, one of the reasons is -funroll-loops: declaring all loops to be slower than unrolled code is an over-generalisation that doesn’t hold up, if you have a minimum of CPU theory in mind. Sure, the loop instructions have a higher overhead than just pushing the instruction pointer further, but unrolled loops (especially when they are pretty complex) become CPU-cache-hungry; where a loop might stay hot within the cache for many iterations, an unrolled version will most likely require more than a couple of fetch operations from memory.
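
The trade-off in miniature (hand-unrolled here to make the size difference visible; -funroll-loops does the same mechanically): both functions compute the same sum, but the unrolled body is several times larger.

```c
/* the plain loop: one add, one increment, one branch per element */
static int sum_rolled(const int *v, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* 4x unrolled: fewer branch/increment pairs per element... */
static int sum_unrolled4(const int *v, int n) {
    int s = 0, i = 0;
    for (; i + 4 <= n; i += 4)
        s += v[i] + v[i + 1] + v[i + 2] + v[i + 3];
    for (; i < n; i++)   /* ...plus a tail loop, at the cost of more
                            instruction bytes to fetch into cache */
        s += v[i];
    return s;
}
```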

Now, to be honest, this was much more of an issue with the first x86-64-capable processors, because of their risible cache size (it was vaguely equivalent to the cache available on the equivalent 32-bit-only CPUs, but with code that almost literally doubled in size). This was the reason why some software, depending on a series of factors, ended up being faster when compiled with -Os rather than -O2 (optimising for size: the code size decreases and it uses less CPU cache).

At any rate, -O3 is not something I’m very comfortable working with; while I agree with Mark that we shouldn’t filter or exclude compiler flags (unless they are deemed experimental, as is the case for graphite) based on compiler bugs – those should be fixed – I would also prefer to avoid hitting those bugs on production systems. And since -O3 is much more likely to hit them, I’d rather stay the hell away from it. Jesting about that, yesterday I produced a simple hack for the GCC spec files:

flame@yamato gcc-specs % diff -u orig.specs frigging.specs
--- orig.specs  2010-04-14 12:54:48.182290183 +0200
+++ frigging.specs  2010-04-14 13:00:48.426540173 +0200
@@ -33,7 +33,7 @@
 %(cc1_cpu) %{profile:-p}

 *cc1_options:
-%{pg:%{fomit-frame-pointer:%e-pg and -fomit-frame-pointer are incompatible}} %1 %{!Q:-quiet} -dumpbase %B %{d*} %{m*} %{a*} %{c|S:%{o*:-auxbase-strip %*}%{!o*:-auxbase %b}}%{!c:%{!S:-auxbase %b}} %{g*} %{O*} %{W*&pedantic*} %{w} %{std*&ansi&trigraphs} %{v:-version} %{pg:-p} %{p} %{f*} %{undef} %{Qn:-fno-ident} %{--help:--help} %{--target-help:--target-help} %{--help=*:--help=%(VALUE)} %{!fsyntax-only:%{S:%W{o*}%{!o*:-o %b.s}}} %{fsyntax-only:-o %j} %{-param*} %{fmudflap|fmudflapth:-fno-builtin -fno-merge-constants} %{coverage:-fprofile-arcs -ftest-coverage}
+%{pg:%{fomit-frame-pointer:%e-pg and -fomit-frame-pointer are incompatible}} %1 %{!Q:-quiet} -dumpbase %B %{d*} %{m*} %{a*} %{c|S:%{o*:-auxbase-strip %*}%{!o*:-auxbase %b}}%{!c:%{!S:-auxbase %b}} %{g*} %{O*} %{W*&pedantic*} %{w} %{std*&ansi&trigraphs} %{v:-version} %{pg:-p} %{p} %{f*} %{undef} %{Qn:-fno-ident} %{--help:--help} %{--target-help:--target-help} %{--help=*:--help=%(VALUE)} %{!fsyntax-only:%{S:%W{o*}%{!o*:-o %b.s}}} %{fsyntax-only:-o %j} %{-param*} %{fmudflap|fmudflapth:-fno-builtin -fno-merge-constants} %{coverage:-fprofile-arcs -ftest-coverage} %{O3:%eYou're frigging kidding me, right?} %{O4:%eIt's a joke, isn't it?} %{O9:%eOh no, you didn't!}

 *cc1plus:

flame@yamato gcc-specs % gcc -O2 hellow.c -o hellow; echo $?   
0
flame@yamato gcc-specs % gcc -O3 hellow.c -o hellow; echo $?
gcc: You're frigging kidding me, right?
1
flame@yamato gcc-specs % gcc -O4 hellow.c -o hellow; echo $?
gcc: It's a joke, isn't it?
1
flame@yamato gcc-specs % gcc -O9 hellow.c -o hellow; echo $?
gcc: Oh no, you didn't!
1
flame@yamato gcc-specs % gcc -O9 -O2 hellow.c -o hellow; echo $?
0

Of course, there is no way I could put this in production as it is. While the spec files allow enough flexibility to match on the latest optimisation level (the one that is actually applied), rather than on any parameter passed, they lack an “emit warning” instruction; the instruction above, as you can see from the value of $?, is “error out”. While I could get it running in the tinderbox, it would probably produce so much noise from failing packages that I’d spend each day just trying to find out why something failed.

But if somebody feels like giving it a try, it would be nice to ask the various upstreams to rice it down themselves, rather than always being labelled as the ricer distribution.

P.S.: building with no optimisation at all may cause problems; in part because of reliance on features such as DCE, as stated above and as used by FFmpeg; in part because headers, including system headers, might change behaviour and cause the packages to fail.

Filtering compiler optimisation flags is not a solution

Strangely enough, this post is not brought on by something that happened recently (I usually write in response to stuff that happens), but it’s a generic point that I’ve had to explain to too many people in the past. The most recent thing that might have prompted my writing this is my note about Pidgin crashing more on Fedora, and a little discussion on the matter with a friend of mine.

So, we all know compiler optimisation flags in Gentoo, and most users trying some exotic ones probably know that a lot of ebuilds tend to filter, strip or otherwise reduce the set of flags actually used at build time. This is, in many cases, a violation of Gentoo policies, and Mark, being both QA and Toolchain master, usually gets upset by them. Since this is often abused, I’d like to explain here what the problem is.

First of all, not all compiler flags are the same: there are flags that change the behaviour of the source code, and others that should not change that behaviour. For instance, the -ffast-math flag enables some looser mathematical rules; this changes the behaviour of mathematical source code, as its results are no longer exact. On the other hand, -ftree-vectorize only changes the output code and not the meaning of the source code, and should therefore count as a safe flag.
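
A tiny illustration of why -ffast-math sits in the behaviour-changing camp: IEEE floating-point addition is not associative, and the flag licenses the compiler to reassociate anyway, so the two groupings below may stop being distinct.

```c
/* without -ffast-math the compiler must preserve the written grouping,
   and the two functions genuinely differ: */
static double sum_left(double a, double b, double c)  { return (a + b) + c; }
static double sum_right(double a, double b, double c) { return a + (b + c); }

/* with a = 1e30, b = -1e30, c = 1.0:
   sum_left  -> 0.0 + 1.0    == 1.0
   sum_right -> 1e30 + -1e30 == 0.0
   (c is absorbed: 1.0 is far below one ULP at 1e30) */
```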

You can already see the gist here: -ftree-vectorize has been blamed for build and runtime errors in the past few years, so it’s often not considered safe at all, and indeed it’s often considered one of the less safe flags. But there are a few catches: the first is that yes, the implementation of the flag might be at fault, and in the past it caused quite a few internal compiler errors, or miscompiled source code into something that fails at runtime. But both these issues have to be reported to the GCC developers to be fixed, because they are bugs in GCC to begin with; if the issue is just ignored by disabling the flag, they won’t be fixed any time soon.

Sometimes, though, the issue is neither a miscompilation nor a bug in GCC, yet the package fails to execute properly or fails to build entirely; the latter happened with mplayer not too long ago. In these cases there’s still a bug, it’s in the software itself, and it needs to be fixed. In the case of mplayer, for instance, it turned out that the inline assembler code was using global labels rather than the local labels it should have used in the first place. Fixing the code wasn’t that hard, compared with filtering the flag.

Now, don’t get me wrong, I know there are at least a few issues with the approach I just outlined: the first is that, as the FFmpeg developers found out, -ftree-vectorize is often not a good idea, and can actually produce slower code on most systems, at least for the common multimedia usage patterns. The second problem is that, with the exception of the mplayer bug, most of the build and runtime failures aren’t straightforward to fix; and when the problem is in GCC, it might take quite a while before the issue is fixed. How should we work those situations out then, if not by filtering?

Well, filtering works fine as a temporary option, a workaround, a band-aid to hide the problem from users. So indeed we should use filtering; on the other hand, this is a problem akin to those related to parallel make or --as-needed: you should not let the user be bitten by the problem, but at the same time you should accept that you haven’t fixed the bug just yet. My indication is thus: keep the bug open if you “solved” it by filtering flags!

I know lots of developers dislike having bugs open at all, but it’s not really fixed if you just applied a workaround. And if you close it, nobody will ever see it again, and this will result in a phantom bug that will take a much longer time to reproduce, verify, and fix properly. This is, for instance, the problem when I hit a package that, without any comment in either the ebuild or the change log, has a strip-flags call, which reduces the set of flags passed to the compiler: finding out whether the call is there because of a reported bug, or just because the Gentoo developer involved couldn’t be bothered to follow policy, takes time.

And finally, users, please understand that flags like -ffast-math or -fvisibility, which do change the meaning of the source code, should not be set by users, but should rather be applied directly by upstream where they are safe!

There are flags and flags…

In my previous post about control I stated that we want to know about problems generated by compiler flags on packages, and that filtering flags is not a fix, but rather a workaround. I’d like to expand on the notion by providing a few more important insights into the matter.

The first thing to note is that there are different kinds of flags you can give the compiler; some are not supposed to be tweaked by users, others can be tweaked just fine. Deciding whether a flag should or should not be touched by the user is a very tricky matter, because different people might have different ideas about them. Myself, I’d like to throw my two eurocents in to show the discretion I use.

The first point is a repeat of what I already expressed about silly flags, which can be summed up as “if you’re just copying the flags from a forum post you’re doing it wrong”. If you really know what you’re doing, it should be pretty easy for you never to have problems with flags; on the other hand, if you just copy what others did, there is a huge chance you’re going to get burned by something one day or the day after.

Compilers are huge, complex beasts, and understanding how they work is not something for the average user. Unfortunately, to correctly assess the impact of a flag on the produced code, you do need to know a lot about the compiler. For this reason you often find some of the flags listed as “safe flags” and briefly explained. Myself, I’m not going to do that; I’m just going to talk abstractly about them.

The first issue comes with understanding that there are “free” and “non-free” optimisations: some optimisations, like almost all of those enabled at -O2, don’t force any particular requirement on the code beyond what the language it is written in already imposes; indeed, sometimes they even loosen things up a bit. An example of this is the dead-code-elimination pass, which allows functions only called in branches that are never executed to remain undefined at the final linking stage (as used by FFmpeg’s libavcodec to deal with optional codecs’ registration).

Before GCC 4.4, at least for x86, the -O2 level also didn’t really enforce some specifications of the C language, like strict aliasing, which reduced the optimisation opportunities but loosened up the type of code that was allowed to compile properly. More than an allowance from GCC, though, this was due to the fact that the compiler didn’t have much to exploit by enforcing aliasing on register-poor architectures like x86. With GCC 4.4, relying on this is no longer possible.

Other flags, though, do restrict the type of code that is accepted and compiled properly, and may cause bugs that are too subtle for the average upstream developer, who then declares custom flags “unsupported”. Unfortunately this is not some extremely rare case; it’s actually the norm for many upstreams we deal with in Gentoo. These flags, with the most prominent example being -ffast-math, break assumptions in the code; for instance, this flag may produce slightly different results in mathematical functions, which can lead to huge domino effects in code resolving complex formulae. On a similar, but not identical, note, the -mfpmath=sse flag generates SSE (instead of i387) code for floating-point operations; it’s considered “safer” because it breaks an assumption that is only valid on the x86 architecture (the non-standard 80-bit size of temporary values), and only exploited by very targeted software rather than pure C code. Indeed, this is what the x86-64 compiler does by default.

There are then a few flags that only work when the code is designed to make use of them; this is the case of the -fvisibility family of flags, which requires the code to properly declare the visibility of its functions to work properly. Similarly, the -fopenmp flag requires the code to be written using OpenMP, otherwise it won’t magically make your software faster through parallel optimisation (there are, though, flags that do that; as far as I know they are quite experimental for now). Enabling these flags should be left to the actual upstream buildsystem, not to users.
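
A sketch of why -fvisibility needs cooperating code (the macro and function names are mine, purely illustrative): a library built with -fvisibility=hidden must annotate its public entry points explicitly, otherwise they all disappear from the dynamic symbol table and nothing can link against it.

```c
/* GCC's visibility attribute marks the symbols meant to stay exported */
#define PUBLIC_API __attribute__((visibility("default")))

PUBLIC_API int lib_public_entry(void) { return 42; } /* exported even with
                                                        -fvisibility=hidden */
static int lib_internal_helper(void)  { return 7; }  /* never visible anyway */

int lib_unmarked(void) {            /* hidden if the flag is on: callers
                                       outside the library lose access */
    return lib_internal_helper();
}
```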

Some flags might interfere with hand-written ASM code; for instance -fno-omit-frame-pointer (needed to get some decent output from kernel-level profilers), which is actually an un-optimisation, can make the ebx x86 register unavailable (when coupled with -fPIC or -fPIE at least). And I experienced problems with -ftree-vectorize myself in a single case (on x86-64 at least; on x86 I know it has created faulty code more than once, though whether that’s a GCC bug or some broken assumption, I have no idea): with mplayer’s mp3lib and a hand-written piece of asm code that didn’t use local labels, the flag duplicated a code path, and the pasted code from the asm() block tried to declare the same (global) label twice.

Finally, some flags, like -fno-exceptions and -fno-rtti for C++, can enable some pretty heavy optimisation, but should never be used by anyone but upstream. Doing so will cause some hard-to-track-down issues, like the ones the Ardour devs complained about, as you’re actually disabling some pretty big parts of the language, in a way that makes the resulting ABI pretty much incompatible between libraries.

And I almost forgot the most important thing to keep in mind: the code most optimised for execution speed is not always the fastest, which is why on the first models of x86-64 CPUs the code produced by -Os sometimes performed better than the code produced by -O2. In that case, the relatively small L2 cache on the CPU could slow down the execution of the most aggressively optimised code, because it was larger and couldn’t fit in the cache. The simplest example to understand this is to think about unrolled loops: a loop is inherently slower than the unrolled code: it needs an iterator that might not be needed otherwise, it requires jumping back up the instruction stream, it might require actually moving a cursor of some sort. On the other hand, especially for big loop bodies (with inlined and static functions included, of course), unrolling the loop might result in code that requires lots of cache fetches; whereas smaller loops that can be kept entirely in cache might not take that much time to jump back, since the code is already there.

So what is the bottom line of this post? One could argue that the solution is to leave it to the compiler; but as Mike points out repeatedly, at least ICC is beating the crap out of GCC on newer chips (at least Intel chips; I also have some concerns about their use of tricks like the unsafe floating-point optimisations I mentioned above, but I don’t want to go there today). So the compiler might not really know much better.

My suggestion is to leave it to the experts; while I don’t like the idea of making it an explicit USE flag to use your own CFLAGS definition (I also want the control), we shouldn’t usually be overriding upstream-provided CFLAGS if they are good. Sometimes, though, they might require a bit of help; for instance, in the case of xine I remember the original CFLAGS definition being pretty much crazy, with all the possible optimisations forced on, even when on average they don’t produce that good a result at all. I guess it’s all a bet.

Gentoo maintainer note and help-call: it seems like either my PCI sound card fried or there is some nasty bug in its ALSA driver that I don't really have the time to deal with (I'll be updating my previous post about it, since after a few more tries it turned out not to be related to any hardware outside of Yamato). This has already been a problem for the past two months or so, since kernel 2.6.29 didn't work properly, and it is starting to be a big deal. My contributions to PulseAudio, especially on the Gentoo side, have been quite hectic because of it, and the package is in serious need of both ordinary and extraordinary work.

I might just go out one day this week and fetch a new USB card, but to be honest I'd like to avoid that for now (I have had enough hardware failures over the past months, a few more hardware bits that I had to replace or buy for other reasons, and a scheduled acquisition of one or two eSATA disks to move around data that I no longer have space for). So I added a USB soundcard (one that Lennart suggested works fine under Linux) to my wishlist (thanks to the fact that Amazon now ships electronics to Italy, whooo!), but I could just use some old Linux-supported card if somebody had one to give me; my only requirement is that it support digital output (iec958, S/PDIF). It really doesn't matter whether it uses a coaxial or optical cable; I admit coaxial might be a bit nicer (so that the receiver can deal with both Yamato and Merrimac, with the latter only providing optical), but really either is fine.

Yes I know this sounds a lot like a shameless plug – it probably is – but I’ve got over 1300 bugs open in Bugzilla, and Yamato is crunching its hard drives to find the issues before they hit users, I guess you can let me have this plug, can’t you? Thanks.

Giving control

One of the issues that I'm trying to tackle with my tinderbox is that we have a varying degree of control among different ebuilds. This is one thing that I think is a major problem in Gentoo: while a lot of users are brought to us by the idea of being able to choose the flags used to build their software, we have lately been slacking on that front. Not only do packages start to feature custom-cflags USE flags (or custom-cxxflags for the Qt packages), but we also strip, filter and randomly mangle flags.

Now, of course there are quite a few compiler flags that we don't want users to enable, but as Mark has been repeating over and over and over, if a flag breaks a package it is not meant to break, then we should tackle the issue at the compiler level and fix it there. And on the other hand, I wouldn't care if users using silly flags got broken software. As for the idea that upstream will not support our users… well, they shouldn't to begin with; problems should filter through us first, if only we had enough people to work on the issues.

But even skipping over the flags, there are other issues: USE flags, debug information, installation paths, slotting, alternative software and so on. As David said in a previous post, there is no way we can test all situations beforehand, even though that would make things much easier for our users. While binary distributions have a limited set of configurations which can be tested somewhat easily, there is an infinite amount of variation in Gentoo systems, which makes it much more difficult to identify all the issues beforehand (and this is even without factoring in the Gentoo/Alt project, with Gentoo/FreeBSD and the prefix support!).

I can repeat in every post that the key to proper software is testing, but this is not going to work when there are so many packages failing tests, with bugs open and nobody looking at them. I am guilty of this too: there are quite a few packages I maintain for which I don't run all the tests properly, and I never finished the uif2iso testsuite I started working on almost six months ago! We should really start rejecting stable requests for packages failing tests, and bumping the priority of test failures for packages that are already stable.

Of course, it might well be that upstream doesn't test enough already, and that something in the environment will break their software; shit happens, we can track it down, and upstream can add further tests to make sure it does not happen again! I'm sure that lots of developers like this idea. And reading Eric's interview, I gather that RedHat and Fedora are working on making more use of automated tests. Why shouldn't we?

Okay, this is one post I have written instead of sleeping, again; at least I have been watching Bill Maher… love that talk show!

Discovering the environment versus knowledge repository

As far as users are concerned, the main problem with build systems based on autotools is the long delay caused by executing ./configure scripts. It is, indeed, a bit of a problem from time to time, and I have already expressed my resentment of superfluous checks. One of the other proposed solutions to mitigate the problem in Gentoo is running the script through a faster, more basic shell rather than the heavy and slow bash, but this has a few other side issues that I'm not going to discuss today.

As a “solution” (but to me, a workaround) to this problem, a lot of build systems prefer using a “knowledge repository” that records how to deal with various compiler and linker flags and various operating systems. While this certainly gives better results for smaller, standards-based software (as I wrote in my other post, one quite easy way out from a long series of tests is just to require C99), it does not work quite as well for larger software, and tends to create fairly problematic situations.

One example of such a build system is qmake, as used by Trolltech, sorry, Qt Software. I had to fight with it quite a bit in the past when I was working on Gentoo/FreeBSD, since the spec files used under FreeBSD assumed that the whole environment was what ports provided, and of course Gentoo/FreeBSD had a different environment. But even without going full-blown toward build systems entirely based on knowledge, similar problems can easily appear with autotools-based software, as well as cmake-based software. Sometimes it's just a matter of not knowing well enough what the future will look like; sometimes these repositories are simply broken. Sometimes, the code is simply wrong.

Let’s take for instance the problem I had to fix today (in a hurry) on PulseAudio: a patch to make PulseAudio work under Solaris looked for the build-time linker (ld) and, if it was the GNU version, used the -version-script option that it provides. On paper it's correct, but it didn't work, and it messed up the test4 release a lot. In this case the cause of the problem was that the macro used had been obsoleted, and thus it never identified the linker as being GNU; but this was, nonetheless, a bad way to deal with the problem.

Instead of knowing that the GNU ld supports that option, and just that, the solution I implemented (which works) is to check whether the linker accepts the flag we need, and if it does, provide a variable that can be used to deal with it. This is actually quite useful, since as soon as I give Yamato a break from the tinderbox I can get the thing to work with the Sun linker too. But it's not just that: nobody can tell me whether a future linker will support the same options as the GNU ld (who knows, maybe gold will).

A similar issue applies to Intel's ICC compiler, which goes to the point of passing itself off as GCC (defining the same internal preprocessor macros) so that software will use the GCC extensions that ICC implements. If everybody used discovery instead of a knowledge repository, this would not have been needed (and you wouldn't have to work around the cases where ICC does not provide the same features as GCC; I had to do that for FFmpeg some time ago).

Sure, a knowledge repository is faster, but is it just as good? I don't think so.