Debunking x32 myths

There have been many comments on my previous post about the new x32 ABI; some are interesting, others are more “out there”. The feeling I get is that there is quite a bit of cargo culting, with people thinking “there has to be a reason why it is being developed, so it’ll be good for me!” without actually having the technical background to judge the usefulness of all this.

So in the same spirit with which I commented on ccache almost exactly four years ago (wow, I have been keeping a blog for a very long time, haven’t I?), I’ll try to debunk a few of the myths and misconceptions around this new ABI.

The new x32 ABI has proven to be faster. Not really; what we have right now are a few benchmarks, published by those who actually created the ABI. Of course you’d expect that those who spent time setting it up found it interesting and actually faster, but I honestly have doubts about the results, for reasons that will become clearer in the next few entries.

It’s also interesting to note that while the overall benchmarks seem to be positive, the numbers are quite close in general, and even Intel’s presentation gives you actual “big” numbers only when comparing with the original x86 ABI, which nobody is saying is better than x86-64!

The data also comes from a synthetic test, not from actual overall system usage, and if you have any clue about benchmarks you know that the numbers can easily lie through their teeth!

The new ABI generates smaller code, which means more instructions will fit in cache, and you’ll have smaller files as well. This is absolutely false. The code generated is generally the same as for x86-64: you’re not changing the instruction set at all, you’re just changing the so-called “data model”, which means you change the size of long (and related types) and of the pointers (and thus the address space).
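To make the “data model” point concrete, here is a minimal sketch that classifies the build purely by the sizes of int, long and pointers. The helper names data_model_name and built_for_x32 are made up for this illustration; the __x86_64__ and __ILP32__ macros are the ones GCC predefines, and I am assuming a compiler that follows that convention.

```c
#include <stddef.h>

/* Hypothetical helper: names the data model the compiler is targeting,
 * judged only from the sizes of int, long and pointers. */
const char *data_model_name(void)
{
    if (sizeof(int) == 4 && sizeof(long) == 4 && sizeof(void *) == 4)
        return "ILP32";  /* x86 and x32: int, long and pointers are 32-bit */
    if (sizeof(int) == 4 && sizeof(long) == 8 && sizeof(void *) == 8)
        return "LP64";   /* x86-64 on Linux: long and pointers are 64-bit */
    if (sizeof(int) == 4 && sizeof(long) == 4 && sizeof(void *) == 8)
        return "LLP64";  /* 64-bit Windows: only pointers are 64-bit */
    return "other";
}

/* Hypothetical helper: returns 1 when the build targets x32, that is the
 * x86-64 instruction set combined with the ILP32 data model.
 * Assumes GCC-style predefined macros. */
int built_for_x32(void)
{
#if defined(__x86_64__) && defined(__ILP32__)
    return 1;
#else
    return 0;
#endif
}
```

Building the same file with GCC’s -m32, -m64 and -mx32 switches and printing data_model_name() should show that only the type sizes move between x86-64 and x32; the instruction set stays the same.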

On the one hand, it is theoretically correct that you’re going to have smaller data structures, which means you can make better use of the data cache (not of the instruction cache, be sure!), but is this the correct approach? In my informed opinion, it would be a better idea to actually write code that takes cache lines into account, if your code is cache-hungry! You can use dev-util/dwarves, a set of utilities by Arnaldo (acme): pahole will tell you how your data structures are laid out in memory.
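As a rough illustration of the kind of waste pahole points out, consider the sketch below. Both structures are hypothetical and contain no pointers at all, so the saving comes only from member ordering and is available under any ABI. On a typical x86-64 Linux build I would expect the first to take 24 bytes and the second 16, but the exact figures depend on the target’s alignment rules.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record, with members in "as they came to mind" order. */
struct padded {
    uint8_t  flag;   /* 1 byte, followed by padding up to value's alignment */
    uint64_t value;
    uint8_t  kind;   /* 1 byte, followed by tail padding */
};

/* Same members, largest first, so the two small fields share one slot. */
struct reordered {
    uint64_t value;
    uint8_t  flag;
    uint8_t  kind;
};

/* Bytes saved per record just by reordering members. */
size_t bytes_saved_per_record(void)
{
    return sizeof(struct padded) - sizeof(struct reordered);
}
```

Running pahole on an object file built with debug information shows the same holes member by member, which is usually a better starting point than switching the whole system to a different pointer size.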

Also remember that for compatibility the syscalls are kept the same as on x86-64, which means that all the kernel code executed, and all the data structures that are shared with the kernel, are the same as on x86-64 (so a number of data structures won’t even change their size with the new ABI).

Actually, referring again to the same slides, you can see on slide 24 that the x32 code can be longer than x86’s original code. It would have been nice if they had included the same code in x86-64, especially since I don’t speak VCISC, but I think it’s just the same code.

It might be of interest to compare the size of the file itself; this is the output of rbelf-size from my Ruby Elf suite:

        exec         data       rodata        relro          bss     overhead    allocated   filename
     1239436         7456       341974        13056        17784        94924      1714630   /lib/
     1259721         4560       316187         6896        12884        87782      1688030   x32/

The executable code is actually bigger in the x32 variant; the big change is of course in the data sections (data, rodata, relro and bss) as the pointers have been halved. I honestly wonder how it’s possible for the C library to have so many pointers in its own structures, but that question is beside the point. Even if these numbers are halved, the difference is not that big: in total you have something along the lines of 30KB less data allocated, which is unlikely to even change the memory map.

The data size reduction is useful. Okay, this seems to be a common issue. Sure, the data structures are smaller with x32; that’s its design after all. The main question would probably be “is this significant?” and I don’t think it is. Even in the example above with the C library, the difference in the table works out to about 26KB, under 2% of the allocated space … of the C library! A library that is supposed to implement the very minimal interface.

Now if you add up all the possible libraries, you can probably shave off a few megabytes of data, of course, but … you’ll have to add in all the porting issues that I’m going to discuss soon. Yes, it is true that C++ and most VM languages will have less pressure, especially when copying objects, thanks to the reduced pointer size, but this is still quite a stretch. Especially since for the most part you’ll have to keep data buffers aligned to at least 8 bytes (64-bit) to make use of the new instructions; you already have to align them to 16 bytes (128-bit) to make use of some SIMD sets.

And for those who think that x32 is reducing the size of files on disk — remember that as it is you can’t run a pure-x32 install; what you get is usually going to be a mix of three ABIs: x86, amd64 and x32!

But there is no reason for $application to deal with more than 4GiB of memory. Yes, of course that might be true, but really, do you care about the pointer size? If you really want to make sure that the application doesn’t use more than a given amount of memory, use system limits! They are definitely less intrusive than building a new ABI altogether.
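For reference, this is roughly what “use system limits” looks like from inside a process, as a minimal sketch using the POSIX getrlimit and setrlimit calls with RLIMIT_AS. The helper name cap_address_space is made up for this example, and a real program would want proper error reporting.

```c
#include <sys/resource.h>

/* Hypothetical helper: sets the soft address-space limit of the calling
 * process to max_bytes, leaving the hard limit untouched.
 * Returns 0 on success, -1 on failure (errno is set by the failing call). */
int cap_address_space(rlim_t max_bytes)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_AS, &lim) != 0)
        return -1;

    /* The soft limit may never exceed the hard limit, so clamp it. */
    if (lim.rlim_max != RLIM_INFINITY && max_bytes > lim.rlim_max)
        max_bytes = lim.rlim_max;

    lim.rlim_cur = max_bytes;
    return setrlimit(RLIMIT_AS, &lim);
}
```

Called early in main(), something like cap_address_space((rlim_t)1 << 30) would make allocations beyond roughly 1GiB of address space fail. From a shell, ulimit -v achieves the same thing without touching the code at all.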

Interestingly, there are two very different, contrasting applications of a full 64-bit address space on systems with less than 4GiB of RAM: ASLR (Address Space Layout Randomization, which can load the various objects an application requires at widely different addresses), and Prelink (which can then make sure that every unique object on the system is always loaded at the same address; yes, that’s really the opposite of what ASLR does!).

Applications use long but they don’t need the full 64-bit space. And of course the solution is to create a new ABI for it, according to some people.

I’m not going to deny that there are many applications that still use long without a clue as to why they do that; they probably have some very small range of values they want to use and yet they use “big values” such as long, as their authors probably learnt programming on systems that use it as a synonym for int, or even better, on systems where long is 32-bit but int is 16-bit (hello MS-DOS!).

The solution to this is simply to use the standard integers provided by stdint.h such as uint32_t and int16_t — so that you always use the data size you’re expecting and needing! This also has the side-effect of working on many more systems than you expect, and works with FFI and other techniques.
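A minimal sketch of what that looks like in practice follows. The record layout and the function are invented for illustration; the point is only that the sizes are spelled out, so the structure should stay at 8 bytes on x86, x86-64 and x32 alike, whereas a version written with long would change size between them.

```c
#include <stdint.h>

/* Hypothetical record shared with an FFI caller or written to disk.
 * Every field states exactly how wide it is. */
struct sample_record {
    uint32_t id;
    int16_t  delta;
    uint16_t flags;
};

/* Fails the build if a target ever lays the record out differently. */
_Static_assert(sizeof(struct sample_record) == 8,
               "sample_record is expected to be 8 bytes");

/* Applies the signed delta to the id using 32-bit wrap-around arithmetic. */
uint32_t apply_delta(const struct sample_record *rec)
{
    return rec->id + (uint32_t)(int32_t)rec->delta;
}
```

The static assertion costs nothing at runtime and turns a silent layout change into a build failure.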

Hand-coded assembly is rare. This is one thing a few people told me after my previous post as I complained about the fact that with the new ABI as it is we’re losing most of the hand-coded assembly. This might strictly be true, but it might be less rare than you think. Even excluding all the multimedia software, crypto software usually makes good use of SIMD as well, and that’s done through hand-coded assembly, not through the compiler’s intrinsics.

There is also another issue with hand-coded assembly in software such as Ruby: while Ruby 1.9 fails to build on x32, it gets much more interesting on Ruby 1.8, because while it builds just fine, it segfaults at runtime. Reminds you of something?

Furthermore, it’s the C library itself that comes with most of the hand-coded assembly. The only reason why you don’t feel the porting pressure is simply that H.J. Lu, who takes care of most of it, is one of the authors of the new ABI, which means the code is already ported there.

x32 is going to be compatible with x86, if not now in the future. Okay, this I didn’t comment on before, but it’s one misconception I’ve noticed being thrown around. Luckily, the presentation comes to help: slide 22 makes it very clear that the ABIs are not compatible. Among other things you have to consider that the x32 ABI at least corrects some of the actual mistakes in x86, including the use of 32-bit data types for off_t and similar. Again, something I talked about two years ago.

This is the future of 64-bit processors. No; again refer to the slides, in particular slide 10. This has been explicitly designed for closed systems rather than as a replacement for x86-64! How does that feel now?

The porting effort is going to be trivial, you just have to change the few lines of assembler and change the size of pointer arithmetic. This is not the case. The porting requires a number of other issues to be tackled, and handcrafted assembly is just the tip of the iceberg. Breaking the assumption that x86-64 has 64-bit pointers is, by itself, quite a big deal, though not as big as one might assume at first (it’s the same way on Windows). What I think is going to be a big issue is the implementation of FFI-style C bindings; remember I said it wasn’t an easy answer?
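To give an idea of the kind of assumption that breaks, picture a binding that picks a 64-bit integer for its handles whenever it sees __x86_64__ defined; on x32 that macro is still defined, but pointers are 4 bytes. The sketch below shows the portable alternative, with invented names (handle_t and the two helpers are not taken from any real binding): let uintptr_t follow the pointer size instead of inferring it from the architecture.

```c
#include <stdint.h>

/* Hypothetical opaque handle handed to the foreign side of a binding.
 * uintptr_t follows the pointer size of the ABI being built, so nothing
 * here depends on "x86-64 means 8-byte pointers". */
typedef uintptr_t handle_t;

_Static_assert(sizeof(handle_t) >= sizeof(void *),
               "handle_t must be able to hold a pointer");

/* Converts a pointer into a handle. */
handle_t handle_from_pointer(void *p)
{
    return (uintptr_t)p;
}

/* Converts a handle back into the pointer it was created from. */
void *pointer_from_handle(handle_t h)
{
    return (void *)h;
}
```

This only covers the C side; the language runtime on the other end of the FFI has to agree on the same width, which is where I expect most of the porting pain to be.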

CPUs perform better on 32-bit operands than on 64-bit ones. Interestingly, the only CPU that Intel admits does perform better on 32-bit operands, in the presentation I already linked a few times, is the Atom; the quote is actually “64bit imul latency is twice of 32bit imul on Atom”.

Now, what the heck is imul? It’s a signed multiply operation. Do you multiply pointers? It doesn’t make sense. Besides, pointers are not signed. Are you telling me that your main concern is for a platform (Atom) that has extra latency on one operation when people use 64-bit data types where they should instead use 32-bit ones? And that your solution for that concern is to create a new ABI where it’s harder to use 64-bit data types, instead of fixing whatever program is causing the problem?
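As an example of fixing the program rather than the ABI, here is a sketch of a multiply-heavy routine written with explicit 32-bit operands. I picked the 32-bit FNV-1a hash purely as an illustration; it does not come from the presentation. With uint32_t operands the multiplication is 32 bits wide by definition, whatever the size of long or of the pointers. Which instruction the compiler actually emits is its own business, so check the generated assembly before drawing conclusions.

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1a hash over a byte buffer. All arithmetic is done on
 * uint32_t, so the multiply stays 32-bit under x86, x86-64 and x32. */
uint32_t fnv1a_32(const unsigned char *data, size_t len)
{
    uint32_t hash = UINT32_C(2166136261);  /* FNV offset basis */

    for (size_t i = 0; i < len; i++) {
        hash ^= data[i];
        hash *= UINT32_C(16777619);        /* FNV prime, 32-bit multiply */
    }
    return hash;
}
```

The same code written with unsigned long would do a 64-bit multiply on x86-64 and a 32-bit one on x32, which is exactly the kind of accidental difference that choosing the type on purpose avoids.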

I guess I should end it here, because this last note about the Atom and imul is probably going to make the day of most people who have half a clue.