The subtlety of modern CPUs, or the search for the phantom bug

Yesterday I released a new version of unpaper, which is now in Portage, even though its dependencies are not exactly straightforward after making it use libav. But when I packaged it, I realized that the tests were failing — even though I had been sure to run the tests all the time while making changes, precisely so as not to break the algorithms which (as you may remember) I did not design or write — I don’t really have enough math to figure out what’s going on with them. I was able to simplify a few things, but I needed Luca’s help for the most part.

It turned out that the problem only happened when building with -O2 -march=native, so I decided to restrict the tests and look into it again in the morning. Indeed, on Excelsior, using -march=native would cause the tests to fail, but on my laptop (where I had been running the tests after every single commit) they would not fail. Why? Furthermore, Luca was also reporting test failures on his laptop with OS X and clang, but I had not tested there to begin with.

A quick inspection of one of the failing tests’ outputs with vbindiff showed that the diffs were quite minimal: one bit off at some non-obvious interval. It smelled like a very small change. After complaining on G+, Måns pointed me in the right direction: some instruction set extension that differs between the two CPUs.

My laptop uses the core-avx-i arch, while the server uses bdver1. They have different levels of SSE4 support – AMD having their own SSE4a implementation – and different extensions. I should probably have paid more attention here and noticed that Bulldozer has FMA4 instructions, but I did not; it will prove important later.

I decided to start disabling extensions in alphabetical order, mostly expecting the problem to be in AMD’s implementation of some instructions, pending some microcode update. When I disabled AVX, the problem went away — AVX essentially introduces a new encoding of instructions, so enabling it causes all the instructions otherwise present in SSE to be re-encoded, and it is a prerequisite for the FMA4 instructions to be usable.

The next problem was reducing the code enough to be able to figure out whether the issue was a bug in the code, in the compiler, in the CPU, or just in my assumptions. Given that unpaper is over five thousand lines of code and comments, I needed to reduce it a lot. Luckily, there are ways around it.

The first step is to find in which part of the code the problem appears. Luckily unpaper is designed as a series of functions that run one after the other. I started disabling filters and masks and was able to limit the problem to the deskewing code — which is where most of the problems happened before.

But even the deskewing code is a lot — and it depends on at least part of the general processing being run, including loading the file and converting it into an AVFrame structure. I decided to try to reduce the code to a standalone unit calling into the full deskewing code. But when I copied it over and looked at how much code was involved between the skew detection and the actual rotation, it was still a lot. I decided to start looking with gdb to figure out which of the two halves was misbehaving.

The interface between the two halves is well-defined: the first returns the detected skew, and the second takes the rotation to apply (the negative of what the first returned) and the image to apply it to. It’s easy. A quick look through gdb at the call to rotate() in both a working and a failing setup told me that the returned value from the first half matched perfectly; this was great because it meant that the surface to inspect was heavily reduced.

Since I did not want to have to test all the code that loads the file from disk and decodes it into a RAW representation, I looked into the gdb manual and found the dump command, which allows you to dump part of the process’s memory into a file. I dumped the AVFrame::data content and decided to use that as input. At first I decided to just compile it into the binary (you only need to use xxd -i to generate C code that declares the whole binary file as a byte array), but it turns out that GCC is not designed to efficiently compile a 17MB binary blob passed in as a byte array. I then opted for just opening the raw binary file and fread()ing it into the AVFrame object.
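
For reference, the loader ended up being roughly this shape; a minimal sketch, assuming the dump was taken from AVFrame::data[0] with gdb’s dump command and that the frame’s buffer has already been allocated with the right size (the path and size here are hypothetical):

#include <stdio.h>
#include <libavcodec/avcodec.h> /* for the AVFrame definition */

/* Load a raw gdb memory dump straight back into an already-allocated frame. */
int load_raw_frame(AVFrame *frame, const char *path, size_t size) {
    FILE *fp = fopen(path, "rb");
    size_t rd;

    if (!fp)
        return -1;

    /* the dump was taken from AVFrame::data, so it goes straight back there */
    rd = fread(frame->data[0], 1, size, fp);
    fclose(fp);

    return rd == size ? 0 : -1;
}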

My original plan involved using creduce to find the minimal set of code needed to trigger the problem, but it was tricky, especially when trying to match a complete file output against an MD5. I decided to proceed with the reduction manually, starting with all the conditionals for pixel formats that were not exercised… and then I realized that I could split the code again into two operations. Indeed, while the main interface is only rotate(), there were two logical parts of the code in use: one translating the coordinates before and after the rotation, and the interpolation code that reads the old pixels and writes the new ones. This latter part also depended on all the code that sets a pixel in place starting from its components.

By writing out the calls to the interpolation function, I was able to restrict the issue to the coordinate translation code rather than the interpolation one, which made things much better: the reduced test case went down to a handful of lines:

void rotate(const float radians, AVFrame *source, AVFrame *target) {
    const int w = source->width;
    const int h = source->height;

    // create 2D rotation matrix
    const float sinval = sinf(radians);
    const float cosval = cosf(radians);
    const float midX = w / 2.0f;
    const float midY = h / 2.0f;

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            const float srcX = midX + (x - midX) * cosval + (y - midY) * sinval;
            const float srcY = midY + (y - midY) * cosval - (x - midX) * sinval;
            externalCall(srcX, srcY);
        }
    }
}

Here externalCall is a simple function to extract the values; the only thing it does is print them on the standard error stream. In this version there is still a reference to the input and output AVFrame objects, but as you can notice they are never used, which means that the test case is now self-contained and does not require any input or output file.

Much better, but still too much code to go through. The inner loop over x was simple to remove: hardwiring it to zero still reproduced the problem, but if I hardwired y to zero as well, the compiler would trigger constant propagation and just pre-calculate the right value, whether or not AVX was in use.

At this point I was able to run creduce; I only needed to check that the first line of the output matched the “incorrect” version, and no input was requested (the radians value was fixed). Unfortunately it turns out that using creduce with loops is not a great idea, because it is entirely possible for it to reduce away the y++ statement or the y < h exit comparison, and then you’re in trouble. Indeed it got stuck multiple times in infinite loops on my code.

But it did help a little bit to simplify the calculation. And with, again, a lot of help from Måns in making sure that the sinf()/cosf() functions would not return different values – they don’t; they are actually collapsed by the compiler into a single call to sincosf(), so you don’t have to write ugly code to leverage it! – I brought the code down to:

extern void externCall(float);
extern float sinrotation();
extern float cosrotation();

static const float midX = 850.5f;
static const float midY = 1753.5f;

void main() {
    const float srcX = midX * cosrotation() - midY * sinrotation();
    externCall(srcX);
}

No external libraries, not even libm. The external functions are in a separate source file, and besides providing fixed values for sine and cosine, the externCall() function only calls printf() with the provided value. Oh, if you’re curious, the radians parameter became 0.6f, because 0, 1 and 0.5 would not trigger the behaviour, but 0.6, which is the truncated version of the actual parameter coming from the test file, would.

Checking the generated assembly code for the function then pointed out the problem, at least to Måns, who actually knows Intel assembly. Here follows a diff of the code above, built with -march=bdver1 and with -march=bdver1 -mno-fma4 — because it turns out the instruction causing the problem is not an AVX one but an FMA4 one; more on that after the diff.

        movq    -8(%rbp), %rax
        xorq    %fs:40, %rax
        jne     .L6
-       vmovss  -20(%rbp), %xmm2
-       vmulss  .LC1(%rip), %xmm0, %xmm0
-       vmulss  .LC0(%rip), %xmm2, %xmm1
+       vmulss  .LC1(%rip), %xmm0, %xmm0
+       vmovss  -20(%rbp), %xmm1
+       vfmsubss        %xmm0, .LC0(%rip), %xmm1, %xmm0
        leave
        .cfi_remember_state
        .cfi_def_cfa 7, 8
-       vsubss  %xmm0, %xmm1, %xmm0
        jmp     externCall@PLT
 .L6:
        .cfi_restore_state

It’s interesting that the order of the instructions changes as well, along with the constants — for this diff I have manually swapped .LC0 and .LC1 on one side, as they would otherwise just end up with different names due to the instruction ordering.

As you can see, the FMA4 version has one instruction less: vfmsubss replaces both one of the vmulss instructions and the vsubss instruction. vfmsubss is an FMA4 instruction that performs a Fused Multiply and Subtract operation — and midX * cosrotation() - midY * sinrotation() indeed has a multiply and a subtract!

Originally, since I was disabling the whole AVX instruction set, all the vmulss instructions would end up replaced by mulss, which is the SSE version of the same instruction. But once I realized that the missing correspondence was vfmsubss and I googled for it, it was obvious that FMA4 was the culprit, not the whole of AVX.

Great, but how does that explain the failure on Luca’s laptop? He’s not so crazy as to use an AMD laptop — nobody would be! Well, it turns out that Intel also has its own Fused Multiply-Add instruction set, just with three operands rather than four, starting with Haswell CPUs, which include… Luca’s laptop. A quick check on my NUC, which also has a Haswell CPU, confirms that the problem exists for the core-avx2 architecture as well, even though the code diff is slightly less obvious:

        movq    -24(%rbp), %rax
        xorq    %fs:40, %rax
        jne     .L6
-       vmulss  .LC1(%rip), %xmm0, %xmm0
-       vmovd   %ebx, %xmm2
-       vmulss  .LC0(%rip), %xmm2, %xmm1
+       vmulss  .LC1(%rip), %xmm0, %xmm0
+       vmovd   %ebx, %xmm1
+       vfmsub132ss     .LC0(%rip), %xmm0, %xmm1
        addq    $24, %rsp
+       vmovaps %xmm1, %xmm0
        popq    %rbx
-       vsubss  %xmm0, %xmm1, %xmm0
        popq    %rbp
        .cfi_remember_state
        .cfi_def_cfa 7, 8

Once again I swapped .LC0 and .LC1 afterwards for consistency.

The main difference here is that the instruction for fused multiply-subtract is vfmsub132ss and a vmovaps is involved as well. If I read the documentation correctly this is because it stores the result in %xmm1 but needs to move it to %xmm0 to pass it to the external function. I’m not enough of an expert to tell whether gcc is doing extra work here.

So why is this instruction causing problems? Well, Måns knew and pointed out that the result is now more precise, so I should not work around it. Wikipedia, as linked before, also points out why this happens:

A fused multiply–add is a floating-point multiply–add operation performed in one step, with a single rounding. That is, where an unfused multiply–add would compute the product b×c, round it to N significant bits, add the result to a, and round back to N significant bits, a fused multiply–add would compute the entire sum a+b×c to its full precision before rounding the final result down to N significant bits.
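
The effect is easy to reproduce from plain C as well. This is a minimal sketch (not the unpaper code) that computes the same kind of expression both ways, using fmaf() from math.h to force the single-rounding behaviour; whether the two results actually differ depends on the inputs and the hardware, but the point is that they are allowed to:

#include <math.h>
#include <stdio.h>

int main(void) {
    const float midX = 850.5f, midY = 1753.5f;
    const float c = cosf(0.6f), s = sinf(0.6f);

    /* two roundings: midX*c is rounded, then the subtraction is rounded
     * (build with -ffp-contract=off to keep the compiler from fusing it) */
    const float unfused = midX * c - midY * s;

    /* single rounding for the multiply-subtract, as the FMA units do it */
    const float fused = fmaf(midX, c, -(midY * s));

    printf("unfused: %.10f\nfused:   %.10f\n", unfused, fused);
    return 0; /* link with -lm */
}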

Unfortunately this does mean that we can’t have bit-exactness of images for CPUs that implement fused operations. Which means my current test harness is not good, as it compares the MD5 of the output with the MD5 of the golden output from the original test. My probable next move is to use cmp to count how many bytes differ from the “golden” output (the version built without these optimisations), and if the number is low, say less than 1‰, accept it as valid. It’s probably not ideal and could lead to further variation in output, but it might be a good start.
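
As a rough illustration of the idea (this is not what unpaper ships, and the threshold is an arbitrary parameter), such a check could look like this:

#include <stdio.h>

/* Count differing bytes between two equally-sized files and accept the
 * result if fewer than max_permille out of every thousand bytes differ. */
int images_close_enough(const char *a, const char *b, long max_permille) {
    FILE *fa = fopen(a, "rb");
    FILE *fb = fopen(b, "rb");
    long total = 0, diff = 0;
    int ca, cb;

    if (!fa || !fb) {
        if (fa) fclose(fa);
        if (fb) fclose(fb);
        return 0;
    }

    while ((ca = fgetc(fa)) != EOF && (cb = fgetc(fb)) != EOF) {
        total++;
        if (ca != cb)
            diff++;
    }

    fclose(fa);
    fclose(fb);

    return total > 0 && diff * 1000 < total * max_permille;
}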

Ideally, as I said a long time ago, I would like to use a tool like pdiff to tell whether there are actual changes in the pixels, and identify things like a 1-pixel translation in any direction, which would be harmless… but until I can figure something out, it’ll be an imperfect test suite anyway.

A huge thanks to Måns for the immense help, without him I wouldn’t have figured it out so quickly.

A new XBMC box

A couple of months ago I was at LinuxTag in Berlin with the friends from VideoLAN, and we shared a booth with the XBMC project. It was interesting to see the newest version of XBMC running, and I decided that it was time for me to get a new XBMC box — the last time I used XBMC was on my AppleTV, and while it was not strictly disappointing, it was not terrific either after a while.

At any rate, we spoke about what options are available nowadays to put together a good XBMC set-up, and while the RaspberryPi is all the rage, my previous experience with the platform made it a no-go. It also requires you to find a place to store your data (the USB support on the Pi is not good for many things), and you will most likely have to re-encode anime to the Right Format™ so that the RPi VideoCore can properly decode it: anything that can’t be hardware-accelerated will not play on such limited hardware.

The alternative has been the Intel NUC (Next Unit of Computing), which Intel sells in pre-configured “barebone” kits, some of which include wifi antennas, 2.5” disk bays, and a CIR (Consumer Infrared Receiver) that allows you to use a remote such as the one for the XBox 360 to control the unit. I decided to look into the options and I settled on the D54250WYKH, which has a Core i5 CPU, space for both a wireless card (I got the Intel 7260 802.11ac, which is dual-radio and supports the new 11ac protocol, even though my router is not 11ac yet) and an mSATA SSD (I got a Transcend 128GB one), as well as the 2.5” bay that allows me to use a good old spinning-rust harddrive to store the bulk of the data.

Be careful and don’t repeat my mistake! I originally ordered a very cool Western Digital Caviar Green 2TB HDD but while it is a 2.5” HDD, it does not fit properly in the provided cradle; the same problem used to happen with the first series of 1TB HDDs on PlayStation 3s. I decided to keep the HDD and bring it with me to Ireland, as I don’t otherwise have a 2TB HDD, instead I opted for a HGST 1.5TB HDD (no link for this one as I bought it at Fry’s the same day I picked up the rest, if nothing else because I had no will to wait, and also because I forgot I needed a keyboard).

While I could have just put OpenELEC on the device, I decided instead to install my trusted Gentoo — a Core i5 with 16GB of RAM and a good SSD is well within its ability to run it. And since I was finally setting up something that needs (for me) to turn on very quickly, I decided to give systemd a go (especially as Robbins is now considered a co-maintainer for OpenRC, which drains all my will to keep using it). The effect has been stunning, but there are a few issues that need to be ironed out; for instance, as far as I can tell, there is no unit for rngd, which means that both my laptop (now converted to systemd) and the device have no entropy, even though they both have the rdrand instruction; I’ll try to fix this myself.

Another huge problem for me has been getting the audio to work; while I’ve been told by the XBMC people that the NUCs are perfectly well supported, I couldn’t for the life of me get the audio to work for days. In the end it was Alexander Patrakov who pointed me to intel_iommu=on,igfx_off as a kernel option to get it to work (kernel bug #67321, still unfixed). So if you have no HDMI output on your NUC, that’s what you have to do!

Speaking about XBMC and Gentoo, the latest version as of last week (which was not the latest upstream version, as a new one got released exactly while I was installing the box) seems to force you to install FFmpeg over libav – I honestly felt a bit sorry for the developers of XBMC at LinuxTag while they were trying to tell me how great the multi-threaded h264 decoder from FFmpeg is… Anton, who wrote it, is a libav developer! – but even after you do that, it seems like it does not link it in, preferring a bundled copy of it instead. Which also doesn’t seem to build support for multithreading (huh?). This is something that I’ll have to look into once I’m back in Dublin.

Other than that, there isn’t much to say; the one remaining big issue is to figure out how to properly have XBMC start up at boot without nasty autologin hacks on systemd. And of course finding a better way than using a transmission user to start the Transmission daemon, or at least finding a better way to share the downloads with XBMC itself. Probably separating the XBMC and Transmission users is a good idea.

Expect more posts on what’s going on with my XBMC box in the future, and take this one as a reference about the NUC audio issue.

Finding IDs to submit

I have written a lot about the hardware IDs, but I haven’t said much about submitting new entries to the upstream databases. Indeed, the package just mirrors the data that is collected by the USB and PCI databases, which are managed by Stephen, Martin and Michal.

As an example, I’ll show you how I’ve been submitting the so-called Subsystem IDs for PCI devices from computers I either own, or fix up for customers and friends.

First off, you have to find a system or device whose subsystem IDs have not been submitted yet. Unfortunately I don’t have any computer at hand that I haven’t already submitted to the database. But fear not — it so happens I had an interesting opening. I rented a server from OVH recently, as I’ve had some trouble with one of my production hosts lately, and I’m entertaining the idea of moving everything to a new server and service altogether. But the whole thing is a topic for a completely different time. In any case, let’s see what we can do about these IDs now that I have an interesting system at hand.

First of all, while I don’t have the server at hand to know what’s in it, OVH does tell me what hardware is on it — in particular they tell me it’s an Intel D425KT board (yes, I got a Kimsufi Atom; I got the three-month lease for now, and I’ll see if it can perform decently enough), so that’s a start. Alternatively, I could have asked dmidecode — but I just don’t have it installed on that server right now.

The first step is to look at what lspci -v says:

00:00.0 Host bridge: Intel Corporation Atom Processor D4xx/D5xx/N4xx/N5xx DMI Bridge
        Subsystem: Intel Corporation Device 544b
        Flags: bus master, fast devsel, latency 0
        Capabilities: [e0] Vendor Specific Information: Len=08 <?>

This is of course only the first entry in the list, but it’s still something. You can see on the second line that it says “Subsystem: Intel Corporation Device 544b” — that means that it knows the subsystem vendor (ID 8086, I can tell you by heart — they have been funny at that), but it doesn’t know the subsystem device. So it’s what we’re looking for: an unknown system! Time to compare with the output of lspci -vn — that one does not resolve the IDs, and we’ll need the raw numbers to submit to the PCI database; so if you’re not registered there already, do register, so that you can submit them to begin with.

00:00.0 0600: 8086:a000
        Subsystem: 8086:544b
        Flags: bus master, fast devsel, latency 0
        Capabilities: [e0] Vendor Specific Information: Len=08 <?>

Okay, so now we know that our first device is Intel’s (VID 8086) and has a000 as its device ID — this brings us to https://pci-ids.ucw.cz/read/PC/8086/a000 — easy, isn’t it? At the end of the page there’s a list of the known subsystem IDs; pending submissions do not show their name, but they show up in the table with a darker gray background. All PCI ID entries are moderated by hand by the database’s maintainers. By the time you read this, the entry for my board will already be in, but right now it isn’t — if it wasn’t obvious, I’m looking for an entry that reads 8086 544b (which is under “Subsystem” above).

Now the form requires just a few words: the ID itself – which is 8086 544b with a space, not a colon – and a name. The Note field is for something that needs to end up in the pci.ids file itself, so in most cases it should be empty. The Discussion field is for when you want to comment on the certainty of your submission; for my laptop, for instance, we had some trouble with “Intel Corporation Device 0153” — which is now officially “3rd Gen Core Processor Thermal Subsystem”.

The name I’m going to submit is “Desktop Board D425KT” as that’s what the other entry in the database for that device uses as a format — okay it actually uses DeskTop but I’d rather not capitalize another T and see a kitten cry.

Now it’s time to go through all the other entries in the system — yes, there are many of them, and most of the time the IDs are not set in the order of the PCI connections, so be careful. More interestingly, not all the subsystems are going to be listed on the same line. Indeed, the third entry that I have is this:

00:1c.0 0604: 8086:27d0 (rev 01) (prog-if 00 [Normal decode])
        Flags: bus master, fast devsel, latency 0
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        I/O behind bridge: 00001000-00001fff
        Memory behind bridge: e0f00000-e12fffff
        Prefetchable memory behind bridge: 00000000e0000000-00000000e00fffff
        Capabilities: [40] Express Root Port (Slot+), MSI 00
        Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
        Capabilities: [90] Subsystem: 8086:544b
        Capabilities: [a0] Power Management version 2
        Capabilities: [100] Virtual Channel
        Capabilities: [180] Root Complex Link
        Kernel driver in use: pcieport

The subsystem ID is listed under “Capabilities” instead — but it’s always the same. This is actually critical: if the subsystem does not match, it means that it’s coming from a different component — for instance, if you’re building your own computer, the subsystem of the internal CPU devices and that of the motherboard will not match, as they come from different vendors. The same goes for add-on cards (PCI, PCI-E, AGP, …).

Sometimes, a different subsystem is also available on internal components that get different names from the motherboard itself — in this case, the Realtek network card on this motherboard reports a completely different ID and I really don’t know how to submit it:

01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8101E/RTL8102E PCI Express Fast Ethernet controller (rev 05)
        Subsystem: Intel Corporation Device d626
        Flags: bus master, fast devsel, latency 0, IRQ 44
        I/O ports at 1000 [size=256]
        Memory at e0004000 (64-bit, prefetchable) [size=4K]
        Memory at e0000000 (64-bit, prefetchable) [size=16K]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [70] Express Endpoint, MSI 01
        Capabilities: [b0] MSI-X: Enable- Count=4 Masked-
        Capabilities: [d0] Vital Product Data
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Virtual Channel
        Capabilities: [160] Device Serial Number 01-00-00-00-36-4c-e0-00
        Kernel driver in use: r8169

If for whatever reason you make a mistake, you can click on the “Discuss” link on the submitted content and edit the name that you want to submit. I did make such a mistake while submitting the IDs for this board.

So these are the tricks… happy submitting!

Last few notes about x32

So my previous posts were picked up by none other than LWN.net — it was quite impressive to see their tweet picking up my blog post; it’s the first time that has happened, although I have authored articles for them before.

Now, in the comments of the articles and in LWN’s own coverage of them, you can find a lot of discussion about the merits of x32, and a little of it tries to paint me as uninformed. I would like to just say a few words about that right now so that I don’t have to go through this later on. I’ve been toying around with ELF, x86-64, PIC and structure optimisation for a very long time. I’ll come back in a moment on why I didn’t do a more thorough analysis and my own benchmarks of the architecture, but if you really think I’m just an amateur because I work on Gentoo Linux and not Fedora or Ubuntu, please think again. I might not be one of the “greats”, but I don’t think I’d be boasting if I said that I know what I’m doing — most of the time at least.

So why did I not go into doing my own benchmark to show the numbers of (non-)improvement on x32? Because for me it would be time wasted. I’m not Phoronix, I don’t benchmark stuff for a living, and I’m neither proposing the ABI or going to do work on it myself. I looked into the new ABI because from one side, it’s always cool to learn about new techniques and technology, even when they sound a little over the top (I did look a lot into FatELF as well and I was very negative about it — I hope Ryan doesn’t hold a grudge against me, I was quite unlikeable from his point of view I’m sure), and from the other because my colleague Luca suggested it could be useful to get some more performance out of a device we’re working on.

Now, said device is embedded, runs Gentoo Linux, and needs libav and x264 – I’m not going to give you any more specifics about it – which is why my first test of the new ABI was building libav (and finding that it required way more work than would make sense for us). Looking into it also showed me that some of the assumptions I had made about how the new ABI would be designed were off; for instance, the fact that long is still 32-bit surprised me.

I’ve been told my arguments are “strawmen” because I singled out some specific topics instead of doing a top-down analysis — as the title of my post, and the reference to my old ccache article, should have suggested, I was looking into some of the things I’ve been discussing, or have been told. The only exception to that has been my answer to “x32 is going to be compatible with x86, if not now then in the future” — I have talked with nobody about this, but I’ve seen this kind of misconception floating around, especially at the time of the FatELF proposal: the idea of a 64-bit ABI which would be binary compatible with good old 32-bit x86.

The purported reason for having such an ABI would be the ability to load 32-bit closed-source libraries into the address space of 64-bit programs, or vice versa. The idea is that this way the copy of Skype I’m running wouldn’t be loading into my memory a copy of the 32-bit libc.so.6 library, which is used by no other process.

If it feels like my posts have been aimed squarely at the Gentoo folks, that might very well be right, although it was not the intention. Most people who look into new ABIs as they come out are probably on the same page as most Gentoo users, with their bleeding-edge feeling — if you have only production Fedora installs, you really won’t care much about an ABI Fedora has not been released for yet! And given that Mike made us the first distribution to release something for the ABI, it feels right to discuss Gentoo issues first.

Now, I have also been told that I didn’t talk enough about the reduction in size of data structures, which improves the use of the data cache (not the instruction cache, as Francesco said in the comments of the first article), and that people got the impression I don’t know how much of a difference that makes… that would be wrong, given that I have actually discussed methods to minimize data usage and have spent time writing a tool to reduce copy-on-write, even when that means making changes for ludicrously small improvements.

I have also been working closely with codiff and pahole from Arnaldo’s dwarves package to make sure that the software I manage has properly designed structures, not only reducing the size of the single object, but making sure that attributes that are used together are grouped nearby — this is pretty important for data cache handling, and it might go against what most people are told in school (here at least): that attributes in classes have to be ordered semantically, not by use.
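
To give an idea of what that means in practice, here is a made-up example of the kind of rework pahole tends to suggest: grouping the members that are accessed together and avoiding padding holes, rather than keeping the “semantic” order.

/* "Semantic" order: on x86-64 this leaves padding holes after priority
 * and active, and the two hot members end up far apart. */
struct track_bad {
    char      *title;     /* hot: read in the playback loop */
    int        priority;
    char      *artist;
    short      active;    /* hot: read in the playback loop */
    long long  duration;
};

/* Reordered by use: the hot members sit next to each other at the top,
 * and the explicit padding documents the only hole left (32 bytes vs 40). */
struct track_good {
    char      *title;     /* hot */
    short      active;    /* hot */
    short      _pad;
    int        priority;
    char      *artist;
    long long  duration;
};

pahole prints the holes of the first layout explicitly, which is what makes it handy for this kind of audit.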

On a different note, it would be nice if it were possible to tell the compiler that a given structure never leaves the object file, so that it could reorder it as needed to get the best performance — but that would also require each unit to reorder it consistently. Never mind.

There are some interesting things to be considered as well — if you need fast access to objects in an array, you might be interested in using a little more memory and making sure the object’s size is a power of two, so that instead of using expensive multiplications you can use left shifts to calculate the offset from the base pointer for a given index.
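
A quick, hypothetical illustration of that trade-off: pad the element to a power-of-two size and the offset computation becomes a shift instead of a multiplication.

#include <stdint.h>

struct sample {
    float    values[6];
    uint32_t flags;
    uint8_t  _pad[4];  /* pad from 28 to 32 bytes, a power of two */
};

/* With sizeof(struct sample) == 32 the element address is computed as
 * base + (i << 5) instead of base + i * 28. */
struct sample *nth(struct sample *base, unsigned i) {
    return &base[i];
}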

I know that reducing the size of pointers and of long will reduce the pressure on the data cache, which in turn means you can have faster pointer chasing and better access to things like linked lists and so on — on the other hand I don’t think that this improvement is worth all the compatibility and porting headaches that a new ABI involves, especially considering that, as we move along, more and more software will make better use of the 64-bit address space, as developers start to understand they have to drop the old designs and paradigms of decades ago and replace them with modern design; Poul-Henning Kamp of FreeBSD and Varnish fame said it very well in the linked ACM article.

So to sum it up: I still don’t think x32 is worth my time, whether it is for porting, bug-filing or benchmarking. Of course if somebody gets libav to work on x32 I’ll be the first person to set up a FATE instance for it, and if Gentoo decides to make it a first-class citizen I’ll set up a tinderbox instance for it, but … I sure hope I won’t have to spend more time on it.

What I think I’ll spend some time on in the next few days, something I started thinking about after all the comments, is a few posts describing things such as what an ABI actually is in this context, and how to see whether your structures are simply inadequate for what you’re trying to do. It might get interesting.

And to finish this off, I know I use “Now,” to start paragraphs way too often — I guess this is the reason why O’Reilly wouldn’t consider me as an author.

Debunking x32 myths

There have been many comments on my previous post about the new x32 ABI; some are interesting, others are more “out there” — the feeling I get is that there is quite a bit of cargo culting, with people thinking “there has to be a reason why it is being developed, so it’ll be good for me!” without actually having the technical background to judge the usefulness of all this.

So in the same spirit with which I commented on ccache almost exactly four years ago (wow, I have been keeping a blog for a very long time, haven’t I?), I’ll try to debunk a few of the myths and misconceptions around this new ABI.

The new x32 ABI has proven to be faster. Not really; what we have right now are a few benchmarks, published by those who actually created the ABI. Of course you’d expect that those who spent time setting it up found it interesting and actually faster, but I honestly have doubts about the results, for reasons that will be clearer by reading the next few entries.

It’s also interesting to note that while the overall benchmarks seem to be positive, the numbers are quite close in general… and even Intel’s presentation gives you actual “big” numbers only when comparing with the original x86 ABI — which nobody is claiming is better than x86-64!

The data is also coming from synthetic tests, not from actual overall system usage, and if you have any clue about benchmarks you know that such numbers can easily lie through their teeth!

The new ABI generates smaller code, which means more instructions will fit in cache, and you’ll have smaller files as well. This is absolutely false. The code generated is generally the same as x86-64: you’re not changing the instruction set at all, you’re just changing the so-called “data model”, which means you change the size of long (and related types) and of pointers (and thus of the address space).
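
A tiny sketch of what the “data model” change actually means: same source, same instruction set, different type sizes (the figures in the comments are what you’d expect for x86-64 and x32 respectively).

#include <stdio.h>

struct node {
    long         value;
    struct node *next;
};

int main(void) {
    /* x86-64 (LP64):  long = 8, pointer = 8, struct node = 16
     * x32   (ILP32):  long = 4, pointer = 4, struct node = 8  */
    printf("long: %zu, pointer: %zu, node: %zu\n",
           sizeof(long), sizeof(void *), sizeof(struct node));
    return 0;
}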

From one side it is theoretically correct that you’re going to have smaller data structures, which means you can make better use of the data cache (not of the instruction cache, be sure!) — but is this the correct approach? In my informed opinion, it would be a better idea to look into actually writing code that considers the cachelines, if your code is cache-hungry! You can use dev-util/dwarves, which is a set of utilities by Arnaldo (acme) — pahole will tell you how your data structures will be split in memory.

Also remember that for compatibility the syscalls are kept the same as x86-64, which means that all the kernel code executed, and all the data structures that are shared with the kernel, are the same as on x86-64 (which means that a number of data structures won’t even change their size with the new ABI).

Actually, referring again to the same slides, you can see on slide 24 that the x32 code can be longer than x86’s original code — it would have been nice if they had included the same code for x86-64, especially since I don’t speak VCISC, but I think it’s just the same code.

It might be of interest to compare the size of the libc.so.6 file itself; this is the output of rbelf-size from my Ruby Elf suite:

        exec         data       rodata        relro          bss     overhead    allocated   filename
     1239436         7456       341974        13056        17784        94924      1714630   /lib/libc.so.6
     1259721         4560       316187         6896        12884        87782      1688030   x32/libc.so.6

The executable code is actually bigger in the x32 variant — the big change is of course in the data sections (data, rodata, relro and bss), as the pointers have been halved — I honestly wonder how it’s possible for the C library to have so many pointers in its own structures, but that’s a question beside the point. Even if these numbers are halved, the difference is not that big: in total you have something along the lines of 30KB less data allocated, which is unlikely to even change the memory map.

The data size reduction is useful. Okay, this seems to be a common issue. Sure, it is the case that the data structures are smaller with x32; that’s its design, after all. The main question would probably be “is this significant?” — I don’t think it is. Even in the example above with the C library, the difference, while still “big enough”, is just under 20% of the allocated space… of the C library! A library that is supposed to implement the very minimal interface.

Now if you add up all the possible libraries, you probably can shave off a few megabytes of data, of course, but… you’ll have to add in all the porting issues that I’m going to discuss soon. Yes, it is true that C++ and most VM languages will have less pressure, especially when copying objects, thanks to the reduced pointer size, but this is still quite a stretch. Especially since for the most part you’ll have to keep data buffers aligned to at least 8 bytes (64-bit) to make use of the new instructions — and you already have to align them to 16 bytes (128-bit) to make use of some SIMD sets.
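
For reference, that alignment is something you ask for explicitly anyway, along these lines (a generic sketch, not code from any particular project); the requirement is exactly the same whether pointers are 32 or 64 bits wide:

#include <stdlib.h>

/* A 16-byte aligned buffer suitable for 128-bit SIMD loads and stores. */
float *alloc_simd_buffer(size_t count) {
    void *buf = NULL;
    if (posix_memalign(&buf, 16, count * sizeof(float)) != 0)
        return NULL;
    return buf;
}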

And for those who think that x32 is reducing the size of files on disk — remember that as it is you can’t run a pure-x32 install; what you get is usually going to be a mix of three ABIs: x86, amd64 and x32!

But there is no reason for $application to deal with more than 4GiB of memory. Yes, of course that might be true, but really, do you care about the pointer size? If you really want to make sure that an application doesn’t use more than a given amount of memory, use system limits! They are definitely less intrusive than building a new ABI altogether.
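
Just to make that concrete, capping an application’s address space is a couple of lines with the standard resource-limit API; the 2GiB figure here is an arbitrary example, and the same thing can be done from the shell with ulimit -v without touching the program at all:

#include <sys/resource.h>

/* Cap the process's address space at 2GiB, whatever the pointer size. */
int cap_memory(void) {
    struct rlimit rl = {
        .rlim_cur = 2UL * 1024 * 1024 * 1024,
        .rlim_max = 2UL * 1024 * 1024 * 1024,
    };
    return setrlimit(RLIMIT_AS, &rl);
}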

Interestingly, there are two widely different, contrasting applications of a full 64-bit address space on systems with less than 4GiB of RAM: ASLR (Address Space Layout Randomization — which can load the various objects an application requires at widely different addresses), and prelink (which can instead make sure that every unique object on the system is always loaded at the same address — yes, that’s really the opposite of what ASLR does!).

Applications use long but they don’t need the full 64-bit space. And of course the solution is to create a new ABI for it, according to some people.

I’m not going to deny that there are many applications that still use long without a clue as to why they do so; they probably have a very small range of values they want to store, and yet they use “big” types such as long, as they probably learnt programming on systems that use it as a synonym for int — or, even better, they learnt programming on systems where long is 32-bit but int is 16-bit (hello, MS-DOS!).

The solution to this is simply to use the standard integers provided by stdint.h such as uint32_t and int16_t — so that you always use the data size you’re expecting and needing! This also has the side-effect of working on many more systems than you expect, and works with FFI and other techniques.
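
In other words, something like this hypothetical fragment is all it takes, and the types have the same size on every platform, 64-bit or not:

#include <stdint.h>

/* A sample counter that never needs more than 32 bits and a small signed
 * delta: explicit sizes instead of long and int. */
struct stream_state {
    uint32_t samples_seen;
    int16_t  drift;
};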

Hand-coded assembly is rare. This is one thing a few people told me after my previous post as I complained about the fact that with the new ABI as it is we’re losing most of the hand-coded assembly. This might strictly be true, but it might be less rare than you think. Even excluding all the multimedia software, crypto software usually makes good use of SIMD as well, and that’s done through hand-coded assembly, not through the compiler’s intrinsics.

There is also another issue with hand-coded assembly in software such as Ruby — while Ruby 1.9 fails to build on x32, it gets much more interesting with Ruby 1.8, because while it builds just fine, it segfaults at runtime. Remind you of something?

Furthermore, it’s the C library itself that comes with most of the hand-coded assembly — the only reason why you don’t feel the porting pressure is simply that H.J. Lu, who takes care of most of it, is one of the authors of the new ABI, which means that code has already been ported.

x32 is going to be compatible with x86, if not now then in the future. Okay, this one I didn’t have a comment about before, but it’s one misconception I’ve noticed being thrown around. Luckily, the presentation comes to the rescue: slide 22 makes it very clear that the ABIs are not compatible. Among other things, you have to consider that the x32 ABI at least corrects some of the actual mistakes in x86, including the use of 32-bit data types for off_t and similar. Again, something I talked about two years ago.

This is the future of 64-bit processors. No; again, refer to the slides, in particular slide 10. This has been explicitly designed for closed systems rather than as a replacement for x86-64! How does that feel now?

The porting effort is going to be trivial; you just have to change a few lines of assembler and adjust the size of pointer arithmetic. This is not the case. The porting requires a number of other issues to be tackled, and hand-crafted assembly is just the tip of the iceberg. Breaking the assumption that x86-64 has 64-bit pointers is, by itself, quite a big deal, though not as big as one might assume at first (it’s the same way on Windows); what I think is going to be a big issue is the implementation of FFI-style C bindings — remember when I said it wasn’t an easy answer?

CPUs perform better on 32-bit operands than 64-bit. Interestingly, the only CPU that Intel admits performs better on 32-bit operands, in the presentation I have already linked a few times, is the Atom — the quote is actually “64bit imul latency is twice of 32bit imul on Atom”.

Now, what the heck is imul? That’s a signed multiply operation. Do you multiply pointers? It doesn’t make sense. Besides, pointers are not signed. Are you telling me that your main concern is a platform (Atom) that has extra latency on an operation when people use 64-bit data types where they should instead use 32-bit ones? And your solution for that concern is to create a new ABI where it’s harder to use 64-bit data types, instead of going and fixing whatever program is causing the problem?

I guess I should end it here, because this last note about the Atom and imul is probably going to make the day of most people who have half a clue.

Microupdates for microcodes

Here comes a post that is half an announcement and half a request for help to improve a situation, so please read on.

Yesterday I was finally putting the almost-finishing touches onto the new frontend system for my office (after the Italian Post screwup I was able to get the system from Alternate in a single week); one of these touches was setting up the microcode update support, which for Intel processors involves installing the sys-apps/microcode-ctl and sys-apps/microcode-data packages and adding a service to the boot runlevel.

At that point my thoughts went to Yamato and the fact that it sounded impossible that AMD had no way to update the microcode of their CPUs on Linux, especially since I know for a fact that Microsoft users get, via Windows Update, an AMD-provided CPU support update for their systems — I still do a lot of support on Windows, and a number of friends and customers run AMD boxes.

Lo and behold, AMD publishes microcode updates for some of their CPUs (Family 10h and later, so starting from Barcelona, which is just what Yamato has), so I went to look into that; the results are now in sys-kernel/amd-ucode (I wanted to use sys-apps like the Intel microcode, but I found out, late, that there was already an ebuild for it in Sunrise, and I didn’t want to have to deal with pkgmoves or blockers for out-of-tree packages). This package only installs the microcode, though, so the question was how to load it.

The documentation provided by AMD suggests building the microcode driver as a module in the Linux kernel; when the module is loaded into the kernel, it fetches the microcode via the kernel’s standard firmware-loading interface, like is done for wireless cards and Radeon video cards. This is pretty nifty for many reasons. Interestingly enough, it also works fine if you build the driver statically and build the firmware blob into the kernel. Unfortunately I wasn’t able to trigger a firmware reload from the filesystem via the /sys interface that is supposed to allow that.

And again this comes back to Intel; if the Linux kernel nowadays has a way to request the microcode file itself, why do Intel CPUs still require us to install a binary (and a script) to load it? A quick check shows that while we do install it in /lib/firmware, the microcode.dat file is not used by the kernel at all; the reason is also easy to find if you call less on that file: it is a text file! The microcode-ctl tool parses it and converts it to binary form each time the machine boots up — why? Wouldn’t it be easier if the tool compiled it into binary form once, and the init script, shipped with the data, would then just write it out to the device?

More interestingly, the kernel does have support for requesting the microcode via the usual firmware-loading interface; but instead of looking for the generic microcode, like the AMD variant does, it looks for the specific firmware for a given CPU signature (combined family, model and stepping); the driver also has the ability to parse the generic microcode compiled from microcode.dat, and then find the right version for the right processor.

But this means that you have to jump through a number of hoops at each boot, rather than doing it once at install time. Am I missing some obvious tool that does the Intel microcode processing? Ideally, microcode-data would just install the already-split firmware files, and the kernel would request the single file it needs. No need for userspace programs to process the firmware further.

Ruby-Elf and multiple compilers

I’ve written about supporting multiple compilers; I’ve written about testing stuff on OpenSolaris and I have written about wanting to support Sun extensions in Ruby-Elf.

Today I wish to write about the way I’m currently losing my head trying to get the Ruby-Elf test suite to work with compilers other than GCC. Since I’ve been implementing some new features for missingstatic upon request (by Emanuele “exg”), I decided to add some more tests, in particular covering the Sun and Intel compilers, which I decided to support for FFmpeg at least.

The new tests not only apply the already-present generic ELF tests (rewritten and improved, so that I can extend them much more quickly) to files built with ICC and Sun Studio under Linux/AMD64, but also add tests to check the nm(1)-like code against a catalogue of different symbols (a sketch of such a catalogue follows the list below).

The results are interesting in my view:

  • Sun Studio does not generate .data.rel sections, it only generates a single .picdata section, which is not divided between read-only and read-write (which might have bad results with prelinking);
  • Sun Studio also emits uninitialised non-static TLS variables as common symbols rather than in .tbss (this sounds like a mistake to me sincerely!);
  • the Intel C Compiler enables optimisation by default;
  • it also optimises out unused static symbols with -O0;
  • and even with __attribute__((used)) it optimises out static uninitialised variables (both TLS and non-TLS);
  • oh, and it appends a “.0” suffix to the names of unit-static data symbols (I guess to tell them apart from function-static symbols, which usually have a numeric code after them);
  • and last but not least: ICC does not emit a .data.rel section, nor a .picdata section: everything is emitted in the .data section. This means that if you’re building something with ICC and expect cowstats to work on it, you’re out of luck; but it’s not just that, it also means that prelinking will not help you at all to reduce memory usage, just a bit to reduce startup time.
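
For the curious, the catalogue boils down to a translation unit along these lines (a simplified sketch, not the actual test source), exercising the symbol classes mentioned in the list:

/* unit-static data, the kind ICC renames with a ".0" suffix */
static int unit_static_init = 42;

/* uninitialised static data, TLS and not, marked used to try to keep it */
static int unit_static_uninit __attribute__((used));
static __thread int unit_static_tls __attribute__((used));

/* uninitialised non-static TLS variable (Sun Studio emits it as a common
 * symbol rather than in .tbss) */
__thread int exported_tls;

/* an exported function referencing the initialised static, so it is used */
int exported_function(void) {
    return unit_static_init + exported_tls;
}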

Fixing up some stuff for Sun Studio was easy, and now cowstats works fine even on source code compiled with Sun Studio; taking care of ICC’s quirks, not so much — and it also meant wasting quite some time.

On the other hand, there is one new feature in missingstatic: it now shows the nm(1)-like symbol type next to the symbols that are identified as missing the static modifier; this way you can tell whether each is a function, a constant, or a variable.

And of course, there are two manpages: missingstatic(1) and cowstats(1) (DocBook 5 rulez!) that describe the options and some of the workings of the two tools; hopefully I’ll write more documentation in the next weeks, and that’ll help Ruby-Elf get accepted and used. Once I have enough documentation about it I might actually decide to release something. I’m also considering the idea of routing --help to man, like git commands do.

Variables assigned and never used

I have written in the past that I sometimes miss the Borland C compiler when it comes to warnings, because there was at least one that was really interesting and that GCC lacks: warning about variables whose value is set but never used.

The problem is for instance in this example C code:

int foo(int n) {
  int t = 123;

  t = bar(n);
  return t+n;
}

As you can guess, the t = 123 part is totally useless since t is overwritten right afterwards. Obviously, without optimisation, GCC will emit the assignment anyway:

foo:
.LFB2:
        pushq   %rbp
.LCFI0:
        movq    %rsp, %rbp
.LCFI1:
        subq    $32, %rsp
.LCFI2:
        movl    %edi, -20(%rbp)
        movl    $123, -4(%rbp)
        movl    -20(%rbp), %edi
        movl    $0, %eax
        call    bar
        movl    %eax, -4(%rbp)
        movl    -20(%rbp), %edx
        movl    -4(%rbp), %eax
        addl    %edx, %eax
        leave
        ret

But the assignment will rightfully disappear once optimisations are turned on, even at just the first level:

foo:
.LFB2:
        pushq   %rbx
.LCFI0:
        movl    %edi, %ebx
        movl    $0, %eax
        call    bar
        leal    (%rbx,%rax), %eax
        popq    %rbx
        ret

This is all nice and fine, but the problem is that GCC does not warn even though it does remove the assignment. And it’s not the only one: even Sun Studio Express and the Intel C Compiler don’t warn about such a case. Interestingly enough, Sun Studio’s lint tool in “enhanced” mode does report the issue:

assigned value never used
    t defined at test-ssa.c(2)  :: set at test-ssa.c(2) :: reset at test-ssa.c(4)

too bad that autotools don’t integrate lint-like tools too easily (but maybe I can tie it into the FFmpeg build system at least…).

Why am I so upset about this particular warning missing, you ask? Because it should be quite easy to implement considering that each compiler is most likely using SSA form for optimisation scans. And in SSA form, it’s trivial to see the code from before this way:

int foo(int n) {
  int t1 = 123;

  int t2 = bar(n);
  return t2+n;
}

and notice that t1 is unused in the whole function. Indeed, that’s exactly what the compiler already does to optimise out the assignment; the problem is that it just does not warn you that’s what it’s doing. Honestly, how difficult could it be for a GCC hacker to add that warning? I’m sure I wouldn’t be able to do it myself, since I don’t know GCC well enough and it’s likely a mess, but it doesn’t sound like a difficult warning to implement, to me.

Supporting more than one compiler

As I’ve written before, I’ve been working on FFmpeg to make it build with the Sun Studio Express compiler, under Linux and then under Solaris. Quite honestly, while supporting multiple (free) operating systems, even niche Unixes (as Lennart likes to call them), is one of the things I spend a lot of time on, I have little reason to support multiple compilers. FFmpeg, on the other hand, tends to support compilers like the Intel C Compiler (probably because it sometimes produces better code than the GNU compiler, especially when it comes to MMX/SSE code — on the other hand it lacks some basic optimisations), so I decided to make sure I don’t create regressions when I do my magic.

Right now I have five different compile trees for FFmpeg: three for Linux (GCC 4.3, ICC, Sun Studio Express), two for Solaris (GCC 4.2 and Sun Studio Express). Unfortunately the only two trees to build entirely correctly are GCC and ICC under Linux. GCC under Solaris still needs fixes that are not available upstream yet, while Sun Studio Express has some problem with libdl under Linux (but I think the same applies to Solaris), and explodes entirely under Solaris.

While ICC still gives me some problems, Sun Studio is giving me the worst headache since I started this task.

While Sun seems to strive for GCC compatibility, there are quite a few bugs in their compiler, like -shared not really being the same as -G (although the help output states so). Up to now the funniest bug (or at least absurdly idiotic behaviour) has been the way the compiler handles libdl under Linux. If a program uses the dlopen() function, sunc99 decides it’s better to silently link it to libdl, so that the build succeeds (while both icc and gcc fail since there is an undefined symbol), but if you’re building a shared object (a library) that also uses the function, that is not linked against libdl. It reminded me of FreeBSD’s handling of -pthread (it links the threading library into executables but not into shared objects), and I guess it is done for the same reason (multiple implementations, maybe, in the past). Unfortunately, since it’s done this way, configure will detect dlopen() as not requiring any library, but then later on libavformat will fail to build (if vhook or any of the external-library-loading codecs are enabled).

I thus reported those two problems to Sun, although there are a few more that, touching some grey areas (in particular C99 inline functions), I’m not sure whether to treat as Sun bugs or not. This includes for instance the fact that static (C99) inline functions are emitted in object files even if not used (with their undefined symbols following them, causing quite a bit of a problem for linking).

The only thing for which I find non-GCC compilers useful is taking a look at their warnings. While GCC is getting better at them, there are quite a few that are missing; both Sun Studio and ICC are much stricter about what they accept, and raise lots of warnings for things that GCC simply ignores (at least by default). For instance, ICC throws a lot of warnings about mixing enumerated types (enums) with other types (enumerated or integer), which gets quite interesting in some cases — in theory, I think the compiler should be able to optimise variables if it knows they can only assume a reduced range of values. Also, the Sun Studio, ICC, Borland and Microsoft compilers all warn when there is unreachable code in the sources; recently I discovered that GCC, while supporting that warning, disables it by default both with -Wall and -Wextra, to avoid false positives with debug code.
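
As a hypothetical example of the enum mixing, ICC complains about conversions like the one in the first line of this fragment, while GCC by default stays silent:

enum pixel_format { FMT_NONE, FMT_RGB24, FMT_YUV420 };

int describe(int requested) {
    /* a plain int silently converted to an enumerated type */
    enum pixel_format fmt = requested;

    /* the enumerated value then mixed with plain integers */
    if (fmt == 2)
        return 1;
    return fmt + 1;
}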

Unfortunately, not even with the three of them combined do I get the warning I was used to from Borland’s compiler. It would be very nice if CodeGear decided to release a Unix-style compiler for Linux (their command-line bcc for Windows has a syntax that autotools don’t accept; one would have to write a wrapper to get them to work together). They have already released free-as-in-soda compilers for Windows; it would be a nice addition to have a compiler based on Borland’s experience under Linux, even if it were proprietary.

On the other hand, I wonder if Sun will ever open the sources of Sun Studio; they have been opening so many things that it wouldn’t be so impossible for them to open their compiler too. Even if they decided to go with the CDDL (which would make it incompatible with GCC’s license), it could be a good way to learn more about the way they build their code (and it might be especially useful for UltraSPARC). I guess we’ll have to wait and see about that.

It’s also quite sad that there isn’t any alternative open source compiler focusing, for instance, on issuing warnings rather than on optimising stuff away (although it’s true that most warnings do come out of optimisation scans).

Sub-optimal optimisations?

While writing Implications of pure and constant functions I’ve been testing some code that I was expecting to be optimised by GCC. I was surprised to find a lot of my testcases were not optimised at all.

I’m honestly not sure whether these are due to errors in GCC, to me expecting the compiler to be smarter than it can feasibly be right now, or to the “optimised” code being more expensive than the code that is actually being generated.

Take for instance this code:

int somepurefunction(char *str, int n)
  __attribute__((pure));

#define NUMTYPE1 12
#define NUMTYPE2 15
#define NUMTYPE3 12

int testfunction(char *param, int type) {
  switch(type) {
  case 1:
    return somepurefunction(param, NUMTYPE1);
  case 2:
    return somepurefunction(param, NUMTYPE2);
  case 3:
    return somepurefunction(param, NUMTYPE3);
  }

  return -1;
}

In this case I was expecting the compiler to identify cases 1 and 3 as identical (by coincidence) and then merge them into a single branch. This would actually have made debugging quite hard (as you wouldn’t be able to tell the two cases apart), but it’s a nice reduction in code, I think. Neither on x86_64 nor on Blackfin, and with neither GCC 4.2 nor 4.3, are the two cases actually merged; the duplicated code is left in there.

Another piece of code that wasn’t optimised as I was expecting it to be is this:

unsigned long my_strlen(const char *str)
  __attribute__((pure));
char *strlcpy(char *dst, const char *str, unsigned long len);

char title[20];
#define TITLE_CODE 1
char artist[20];
#define ARTIST_CODE 2

#define MIN(a, b) ( a < b ? a : b )

static void set_title(const char *str) {
  strlcpy(title, str, MIN(sizeof(title), my_strlen(str)));
}

static void set_artist(const char *str) {
  strlcpy(artist, str, MIN(sizeof(artist), my_strlen(str)));
}

int set_metadata(const char *str, int code) {
  switch(code) {
  case TITLE_CODE:
    set_title(str);
    break;
  case ARTIST_CODE:
    set_artist(str);
    break;
  default:
    return -1;
  }

  return 0;
}

Here I was expecting a single call to my_strlen(), as it’s a pure function and in both branches it’s the first call made. I know it’s probably complex code once unrolled, but still, GCC at least was better at this than Intel’s and Sun’s compilers!

Both Intel’s and Sun’s compilers, even at the -O3 level, emit four calls to my_strlen(), as they can’t even optimise the ternary operation! Actually, Sun’s compiler comes last in terms of optimisation, as it doesn’t even inline set_title() and set_artist().

Now, I haven’t tried IBM’s PowerPC compiler as I don’t have a PowerPC box to develop on anymore (although I would think a bit about the YDL PowerStation, given enough job income in the next months — and given Gentoo being able to run on it), so I can’t say anything about that, but for these smaller cases, I think GCC is beating other proprietary compilers under Linux.

I could check Microsoft’s and CodeGear’s (formerly Borland’s) compilers, but that was a bit out of scope for me at the moment.

While I did think a bit in the past about supporting non-GNU compilers for stuff like xine and unieject, I’m starting to think it’s not really worth the time spent on it at all, if this is the result of their compilation…