There have been many comments on my previous post about the new x32 ABI; some are interesting, others are more “out there” — the feeling I get is that there is quite a bit of cargo culting, with people thinking “there has to be a reason why it is developed, so it’ll be good for me!” without actually having the technical background to judge the usefulness of all this.
So in the same spirit with which I commented on ccache almost exactly four years ago (wow, I have been keeping a blog for a very long time, haven’t I?), I’ll try to debunk a few of the myths and misconceptions around this new ABI.
The new x32 ABI has proven to be faster. Not really; what we have right now are a few benchmarks, published by those who actually created the ABI. Of course you’d expect that those who spent time to set it up found it interesting and actually faster, but I honestly have doubts about the results, for reasons that will be clearer by reading the next few entries.
It’s also interesting to note that while the overall benchmarks seem to be positive, the numbers are quite close in general… and even Intel’s presentation gives you actual “big” numbers only when comparing with the original x86 ABI — which nobody is saying is better than what x86-64 is!
The data is also coming from a synthetic test, not from actual overall system usage, and if you have any clue about benchmarks you know that the numbers can easily lie through their teeth!
The new ABI generates smaller code, which means more instructions will fit in cache, and you’ll have smaller files as well. This is absolutely false. The code generated is generally the same as x86-64: you’re not changing the instruction set at all, you’re just changing the so-called “data model”, which means you change the size of @long@ (and related types) and of the pointers (and thus the address space).
From one side it is theoretically correct that you’re going to have smaller data structures, which means you can make better use of the data cache (not of the instruction cache, mind you!) — but is this the correct approach? In my informed opinion, it would be a better idea to look into actually writing code that considers the cachelines, if your code is cache-hungry! You can use @dev-util/dwarves@, which is a set of utilities by Arnaldo (acme) — @pahole@ will tell you how your data structures will be split in memory.
Also remember that for compatibility the syscalls are kept the same as x86-64, which means that all the kernel code executed, and all the data structures shared with the kernel, are the same as x86-64 (so a number of data structures won’t even change their size with the new ABI).
Actually, referring again to the same slides, you can see on slide 24 that the x32 code can be longer than x86’s original code — it would have been nice if they had included the same code in x86-64, especially since I don’t speak VCISC, but I think it’s just the same code.
It might be of interest to compare the size of the @libc.so.6@ file itself; this is the output of @rbelf-size@ from my Ruby Elf suite:
exec     data  rodata  relro  bss    overhead  allocated  filename
1239436  7456  341974  13056  17784  94924     1714630    /lib/libc.so.6
1259721  4560  316187   6896  12884  87782     1688030    x32/libc.so.6
The executable code is actually bigger in the x32 variant — the big change is of course in the data sections (data, rodata, relro and bss), as the pointers have been halved — I honestly wonder how it’s possible for the C library to have so many pointers in its own structures, but that’s a question beside the point. Even if these numbers are halved, the difference is not that big: in total you have something along the lines of 30KB less data allocated, which is unlikely to even change the memory map.
The data size reduction is useful. Okay, this seems to be a common issue. Sure, it is the case that the data structures are smaller with x32; that’s its design, after all. The main question would probably be “is this significant?” — I don’t think it is. Even in the example above with the C library, the difference, while still “big enough”, is just under 2% of the allocated space … of the C library! A library that is supposed to implement the very minimal interface.
Now, if you add up all the possible libraries, you can probably shave off a few megabytes of data, of course, but … you’ll have to add in all the porting issues that I’m going to discuss soon. Yes, it is true that C++ and most VM languages will have less pressure, especially when copying objects, thanks to the reduced pointers’ size, but this is still quite a stretch. Especially since for the most part you’ll have to keep data buffers aligned to at least 8 bytes (64-bit) to make use of the new instructions — and you already have to align them to 16 bytes (128-bit) to make use of some SIMD sets.
And for those who think that x32 is reducing the size of files on disk — remember that as it is you can’t run a pure-x32 install; what you get is usually going to be a mix of three ABIs: x86, amd64 and x32!
But there is no reason for $application to deal with more than 4GiB of memory. Yes, of course that might be true, but really, do you care about the pointer size? If you really want to make sure that the application doesn’t use more than a given amount of memory, use system limits! They are definitely less intrusive than building a new ABI altogether.
Interestingly, there are two very different, contrasting applications of a full 64-bit address space on systems with less than 4GiB of RAM: ASLR (Address Space Layout Randomization — which can load the various objects an application requires at widely different addresses), and prelink (which can then make sure that every unique object on the system is always loaded at the same address; yes, that’s really the opposite of what ASLR does!).
Applications use @long@ but they don’t need the full 64-bit space. And of course the solution is to create a new ABI for it, according to some people.
I’m not going to deny that there are many applications that still use @long@ without a clue on why they do so; they probably have some very small range of values they want to use, and yet they use “big values” such as @long@, as they probably learnt programming on systems that use it as a synonym for @int@ — or even better, they learnt programming on systems where @long@ is 32-bit but @int@ is 16-bit (hello, MS-DOS!).
The solution to this is simply to use the standard integers provided by @stdint.h@, such as @int16_t@ — so that you always use the data size you’re expecting and needing! This also has the side effect of working on many more systems than you expect, and works with FFI and other techniques.
Hand-coded assembly is rare. This is one thing a few people told me after my previous post, where I complained about the fact that with the new ABI as it is we’re losing most of the hand-coded assembly. This might strictly be true, but it might be less rare than you think. Even excluding all the multimedia software, crypto software usually makes good use of SIMD as well, and that’s done through hand-coded assembly, not through the compiler’s intrinsics.
There is also another issue with hand-coded assembly in software such as Ruby — while Ruby 1.9 fails to build on x32, it gets much more interesting with Ruby 1.8, because while it builds just fine, it segfaults at runtime. Reminds you of something?
Furthermore, it’s the C library itself that comes with most of the hand-coded assembly — the only reason why you don’t feel the porting pressure is simply that H.J. Lu, who takes care of most of it, is one of the authors of the new ABI, which means the code is already ported there.
x32 is going to be compatible with x86, if not now then in the future. Okay, this I didn’t have a comment about before, but it’s one misconception I’ve noticed being thrown around. Luckily, the presentation comes to help: slide 22 makes it very clear that the ABIs are not compatible. Among other things, you have to consider that the x32 ABI at least corrects some of the actual mistakes in x86, including the use of 32-bit data types for @off_t@ and similar. Again, something I talked about two years ago.
This is the future of 64-bit processors. No; again, refer to the slides, in particular slide 10. This has been explicitly designed for closed systems rather than as a replacement for x86-64! How does that feel now?
The porting effort is going to be trivial: you just have to change the few lines of assembler and adjust the pointer arithmetic. This is not the case. The porting requires a number of other issues to be tackled, and handcrafted assembly is just the tip of the iceberg. Breaking the assumption that x86-64 has 64-bit pointers is, by itself, quite a big deal, but not as big as one might assume at first (it’s the same way on Windows); what I think is going to be a big issue is the implementation of FFI-style C bindings — remember I said it wasn’t an easy answer?
CPUs perform better on 32-bit operands than 64-bit. Interestingly, the only CPU that Intel admits performs better on 32-bit, in the presentation I already linked a few times, is the Atom — the quote is actually “64bit imul latency is twice of 32bit imul on Atom”.
Now, what the heck is @imul@? That’s a signed multiply operation. Do you multiply pointers? It doesn’t make sense. Besides, pointers are not signed. Are you telling me that your main concern is for a platform (Atom) that has extra latency on an operation when people use 64-bit data types where they should instead use 32-bit? And your solution for that concern is to create a new ABI where it’s harder to use 64-bit data types, instead of going to fix whatever program is causing the problem?
I guess I should end it here, because this last note about the Atom and @imul@ is probably going to make the day of most people who have half a clue.
Not that you’re wrong about the rest, but imul is used in sneaky ways for arrays when there isn’t a sufficiently fancy mov form for the data being used, so that claim isn’t entirely spurious.
Ciaran, this is interesting. Can you point to some examples of hand-written assembly that makes use of imul on pointers? Or to a C code example that results in such when compiled with a specific compiler and its optimization options?
Ciaran, @imul@ on a pointer looks seriously f*cked up; are you not confusing yourself with using @imul@ to calculate an offset (and then adding the pointer)? That could indeed cause the delay on Atom, but it also seems seriously screwed up from the compiler side, given you can use @mul@ (unsigned multiply) to achieve the same.
In C, @a[5]@ is the same as @5[a]@, and the 5 can be assumed to be no larger than a pointer. The 5 might be negative, though, since @a@ could be somewhere in the middle of an array. Unfortunately, SableCC seems to know all this. Fortunately, no-one seems to use SableCC.
It’s not that I don’t know about that syntax… but honestly if _that_ syntax is TFU on Atom… there’s a bigger problem than the size of pointers 🙂
Ciaran, what does that have to do with @imul@? If @a@ is a pointer to (or array of) @int@, loading @a[5]@ into a register on x86_64 will look something like @mov eax, [rbx + 20]@. The displacement is signed, so loading @a[-5]@ gives @mov eax, [rbx - 20]@. If the index is in a register, loading @a[i]@ becomes @mov eax, [rbx + 4*rcx]@. This, too, works with negative indexes. So where is the @imul@?

The commutative property of the array subscript operator is completely irrelevant to this discussion. You probably mentioned it only to show off.
The problem is that @mul@ only has the @eax:edx = eax * op@ form (or @rax:rdx = rax * op@, or the 16-bit @ax:dx = ax * op@, and the same for 8 bit). @imul@, if you know the result cannot overflow into negative land, has more forms, so the compiler often uses @imul@ to better suit the register allocation. x86, always a pain in the …
The fancy x86 scaled addressing only works with 1, 2, 4 and 8. If your array element is larger, or not a power of two, you need a mul.
If the scale is a larger power of two, you can use a shift. However, this is irrelevant. Address calculation looks exactly the same regardless of the pointer size.
It does not matter if the operands are signed or unsigned for the 2- and 3-operand forms of @imul@, since the result is only the lower half of the product. Consequently, there is no corresponding @mul@ instruction, and Intel could have named the instruction @mul@ when they introduced it with the 386.

On the other hand, there is no reason to use the 1-operand forms, which actually do calculate the complete product, for address calculations. And only for those forms do @mul@ and @imul@ both exist.
What’s more, none of this matters whatsoever for the x32 vs x86_64 discussion. If an application works at all in x32, any array indexes are by necessity 32-bit (or smaller). Given this, the same 32-bit @imul@ instruction can be used to calculate the offset (which is then added to a pointer to obtain the address of the desired array element) regardless of the pointer size. If pointers are 64-bit, the 32-bit offset needs to be sign-extended in some cases, but nobody ever claimed that to be slow.
There are certainly use cases where the data size saving is nontrivial. For instance, gitk on the Linux kernel takes ~700MB more RSS on x86-64 than x86 for me (I’m lazily making the assumption that x86 is a good model for x32 in this particular instance). If that 700MB takes you over the RAM you have, then that’s the difference between usable and unusable.

What I don’t know is what use cases measurably benefit from the 64-bit ISA but are also sufficiently pointer-heavy that they’ll see an appreciable benefit from the data size reduction too.
If those 700MB take you over the edge, your computer has far too little RAM to begin with. I would not even consider building a machine with less than 8GB these days.
Just an answer to your question: “Do /you/ care about the pointer size?”

There is a simple reason I still use a 32-bit kernel + userspace: memory usage. Example? Just use Firefox for a day and look at its RSS: on 32-bit it’s 200MB, on 64-bit it’s a little less than double. Would be interesting to benchmark, since these numbers are just out of personal experience.

So if x32 becomes mainstream, I’d be happy to move to a 64-bit kernel, leaving back all the PAE and highmem hacks, but I find absolutely no reason to move my laptop/desktop userspace to 64-bit.
It’s interesting that anon would say this. Anon can run a regular 64-bit kernel with a regular 32-bit userspace just by installing a 64-bit kernel, and changing nothing else. (I do this on one machine just because I can’t be arsed to touch its userspace.) Curiously, 32-bit Firefox crashes about weekly on that machine. While 32-bit FF on a 32-bit kernel works OK, and 64-bit FF on a 64-bit kernel works OK too, 32-bit FF on a 64-bit kernel gets out-of-address-space failures on a regular basis, typically when nobody is using it, most usually in the fool GC cycle tracer. Would x32 FF be any less crashy? My crystal ball says “Ask again later”, but my entrails suggest not.

In the meantime, don’t run a 32-bit Firefox on a 64-bit kernel unless you keep fewer than 150 tabs open at a time. But who would ever do that?
How about the Glasgow Haskell Compiler? When I complained about its memory usage, I was told to use i386 instead of amd64. The reason for this is that ghc creates large amounts of very small and short-lived objects. Most of the space is occupied by pointers. This not only consumes a lot more memory on amd64, it also takes more time in the (generational) garbage collector (copying the live objects). So for a number of cases an i386 ghc is actually faster than an amd64 ghc. In this case it seems that x32 nicely fits the gap. I have no benchmarks for x32 at hand, because the porting work is not done.

I’d expect noticeable improvements by x32 for applications that eat lots of memory and occupy that memory with pointers. This is a very small subset, and as you laid out, it is not clear whether this is worth the effort.
The smaller code size claim isn’t entirely bogus, though. An instruction that works on 64-bit data needs a REX prefix. Now, you need that REX prefix anyway if your instruction wants to use one of the registers not available in i386 mode, but in the cases where it doesn’t, you save one byte per instruction compared to x86_64.
A question (already posted at LWN http://lwn.net/Articles/503… in case you’ve read it there):

Do we have performance numbers showing MIPS n32 being faster than MIPS n64? If so, that would hint that it’s worth investigating x32 further.

For background, IRIX had three ABIs. The original 32-bit IRIX ABI was called o32; when 64-bit IRIX came on the scene, it acquired two new ABIs: n64 (a full-blown 64-bit ABI), and n32 (n64 but with 32-bit pointers instead of 64-bit).

If the MIPS history showed n32 being a performance advantage on some workloads, it would be worth investigating x32’s speed on those same workloads. Note, however, that I recall n32 being touted as a porting benefit — most MIPS code needed a simple recompile to move from o32 to n32, but would need auditing for sizeof(int)==sizeof(void*) bugs when moving to n64.
But why would we need @imul@ in the first place when dealing with pointers? Compilers like icc tend to use instructions like @movl (%ebp, %edx, 4), %eax@ to load integers (or similar to load floats).
camper is right on @imul@. mans, on 64-bit an array with 32-bit indices can span more than 4GB, so the compiler must use a 64-bit instruction to compute the offset from the base. Here we’re talking mostly of arrays of structs.

@imul@ is also used when subtracting pointers, with fancy tricks such as multiplying by the inverse rather than dividing by the size of the pointed-to type.

Simon, the decrease in REX prefixes is balanced by the extra addr32 prefixes used by x32.
Paulo,

I don’t understand how “the decrease in REX prefixes is balanced by the extra addr32 prefixes used by x32” relates to the performance difference on MIPS between the n32 and n64 ABIs.

I’m suggesting that, as the MIPS n32 and n64 ABIs relate to each other in the same way as x86_64 and x32, it’s worth looking for workloads that historically (i.e. on SGI IRIX) ran faster when built for n32 than they did when built for n64, when run on a 64-bit IRIX system. If such a workload can be found, it would be interesting to compare its performance on x32 as compared to x86_64.
Paolo, if the indexes are 32-bit, only a 32-bit multiplication is needed. The result then needs to be sign-extended (if the index is signed), but this is a very cheap operation not likely to cause any measurable performance difference.
anon,

I think this is better to be addressed by the compiler dev team.
Great analysis!

There are only a few applications that may benefit from this model, and the complexity introduced is too costly. One example is redis, which can use pointer compression if this problem is significant enough, before switching the whole system to a new ABI.
@Simon Farnsworth: For “the decrease in REX prefixes is balanced by the extra addr32 prefixes used by x32”, I think Paulo meant holger.

For the n32 vs n64: RISC targets (or more correctly, targets with fixed-size instruction encoding) do not count. For them the larger constants needed are a major pain. It’s a major FAIL in all 64-bit RISC. The x86 ISA with its variable-length instructions is much better prepared for that. The downside is having to build a fast instruction decoder, but look under your desktop — ohh, there is one! Does anyone know if Motorola^wFreescale is holding any patent on the 68k instruction encoding scheme? It was IMHO the sane cross-breed of variable and fixed length.

@Patrik Osgnach: not every index operation is scaled with 1, 2, 4 or 8. Those are the only scales you can natively express in the x86 ISA. Powers of two can be expressed by shifts, but not everything is a clean power of two.
mans, a 32-bit array index can address more than 4GB of memory and thus needs a 64-bit multiply
@Kaffeemonster

I beg to differ; if there are no real-world workloads out there where n32 is noticeably faster than n64, we have some evidence that shrinking pointer size is not, in and of itself, helpful outside benchmarks.

If there are workloads which gained significantly from n32 versus n64, you have a workload where it’s worth retesting on x32 to see if that’s faster as well, once you have taken the architecture differences into account.

Bear in mind that the difference between n32 and n64 is simply pointer size; all else is kept equal (so the only time the RISC penalty for loading constants applies is when you’re loading pointers — all other code is identical between the two ABIs). The same applies to x32 as compared to x86_64; you would therefore expect that if n32 was significantly faster on some workloads, x32 would also be significantly faster on those workloads. If no-one can find such workloads, it suggests that x32, which is also simply a shrunk-pointer version of x86_64, is unlikely to be faster in the real world.
@Simon

I’m not saying that n32 cannot be faster for some workloads than n64. But the conditions (architectural differences) compared to x86 are so different that:

a) n32/n64 is apples-to-oranges against x32/x86_64, even if the general direction of where to look for advantages is right (both are fruit; pointer size change is a major part of it).

b) I believe that x32 is not the slam-dunk win Intel is trying to sell to me. You will surely find a workload where x32 is the winner (one post here said the Glasgow Haskell thingy has a good chance to win immensely from it); still, I think those exact workloads where the wins are not “in the noise” are rare compared to general use. (Yes, you will probably find someone whose main workload is exactly the winning one. Do I get some new SIMD instructions because I, and others, could really need them?)

Creating a whole new ABI with all the bells and whistles and problems for those few workloads looks like a major pain, draining maintainer power in the core toolchain (binutils, gcc, glibc). There is already enough to do for x86 and x86_64. And sneaking it in the back of the __x86_64__ preprocessor define is simply stupid, creating work for all general developers.
Paolo, two points:

1. If the application is to work on x32, it can’t be using more than 4GB of memory, whence 32-bit offsets are large enough. It would be trivial to add a compiler flag (if one does not already exist) to inform it of this restriction, thus allowing it to use 32-bit offsets.

2. Even with 64-bit offsets, computing the offset from a 32-bit index requires only a widening 32-bit multiply (with 64-bit result), not a full 64-bit multiply.
@Kaffeemonster

I think we’re violently agreeing — my claim is that o32 versus n32 versus n64 is a similar ABI change in the MIPS world, and if the x32 effort were likely to produce performance gains, we would have historic evidence from the n32/n64 ABIs on IRIX/MIPS, showing workloads where n32 was considerably faster than n64.

If such workloads don’t exist (I can’t find any, but I’ve not looked hard), the chances are high that x32 is only a performance benefit in very specialised situations set up to benchmark x32 in a good light, and it’s up to the people pushing x32 to show that there are real workloads that benefit.
“And for those who think that x32 is reducing the size of files on disk — remember that as it is you can’t run a pure-x32 install; what you get is usually going to be a mix of three ABIs: x86, amd64 and x32!”

This isn’t necessarily true: I have a “native” hardened x32 install running under kvm at present, with NO amd64 or x86 binaries, with just MULTILIB_ABIS=”x32″ set in make.conf. Only a few large packages fail to compile for me, notably firefox and libreoffice; other than that, and having to disable asm on a number of multimedia packages, the only other real issue I have is that x32 can’t use alsa yet.

I have 3 machines in use, all over 2 years old, and two are maxed out with 4 GB of RAM; not all of us are in a position to just dump older hardware and buy shiny new machines with at least 8 GB…
mans, re your bullet 1: you would be reinventing the x32 changes to gcc while keeping longs 64-bit. For bullet 2, a widening multiply is useless because it puts the two halves in two separate 32-bit registers. Widening multiplies are mostly useful to realize divides by constants.
No, it would not be “reinventing x32.” All it would do is tell the compiler that no object is ever larger than 2GB, allowing it to use faster 32-bit arithmetic to calculate offsets. Nothing else would change, so no porting effort would be required.

Regarding multiplies, while getting the result of a widening multiply in different registers is annoying, merging them requires only two more cycles (on Atom).
Ah OK, you mentioned 4GB earlier. I think -mcmodel exists already.
IIUC, -mcmodel only covers code and static data, not dynamic allocations.
If you have a 4GB data model, all your pointer arithmetic will also be in 32-bit. This means you can evaluate all array and record indexing etc. in 32-bit, and have all offsets etc. in 32-bit. The array indexing might actually use @imul@ on systems with signed base types (like C and Pascal).

So yes, there is a common case where x32 changes a 64-bit mul to a 32-bit one. Whether it really happens (it only makes sense if the whole calculation can be done in 32-bit), and whether it is noticeable, is something else though.
You probably know very little about real-world binary code, pointer usage, processor resource utilization and many other related issues needed to actually analyze the utility of the x32 ABI.

1 – Pointers are used a LOT by almost every kind of app. Especially by all virtual machines / byte code interpreters / generic libraries; that includes php, perl, python, JAVA, flash, GUI toolkits, high-level C/C++ libraries. You are certainly underestimating just how much pointers are used. I also happened to have studied the C source of a closed source app larger than one million lines of C code (can’t say which), and I’ll tell you, the larger the app, the more pointers are pervasively used (along with dynamic memory allocation, another place where the x32 ABI is much faster than the 64-bit one).

2 – Smaller pointers save data AND instruction cache space. Storing a 64-bit constant into a 64-bit register uses a much larger instruction than storing a 32-bit constant into a 64- or 32-bit register. So x32 saves not only data size but also code size: pure 64-bit code ends up with 64-bit constants for any static address, while x32 ABI code ends up with 32-bit constants for any static address.

3 – The only overhead from the full 64-bit kernel API to the x32 ABI involves system calls that must expand a 32-bit address to 64 bits, nothing else. The kernel API interface uses only a tiny percentage of total CPU time.

4 – Benchmarking is the only objective way of comparing one model to the other, but anyone that fully understands the concept outlined and understands enough of processor performance will tell you the x32 ABI will on average be much faster than the corner cases where it will be slower (both faster than the full 64-bit model and than the pure 32-bit mode).

5 – Doubling the register file is no small advantage (over the pure 32-bit mode).

6 – One of the x32 ABI proponents is nobody else than one of the biggest names in computer science (Donald Knuth); who are you? Do you at least have a computer science/engineering degree? I don’t, but I know more assembly, C programming and performance tuning/analysis than many computer science teachers (I earn a living out of those skills).

7 – Even systems with 16GB of RAM might have no need for the full 64-bit API (the vast majority of apps don’t need the full 64-bit mode).

8 – I guess that’s enough troll feeding. My impression is either you just have no idea how little you know, or you’re a full-blown troll. Sorry for my lack of politeness; your site pops up on a Google search for the x32 ABI, and it just smears something very good! The only really bad thing about the x32 ABI is fragmentation (if you actually have to run x32 ABI libraries along with other libraries), but in many cases you could go purely x32 ABI (in lieu of the full 64-bit mode).
So it seems to me you’re calling me out for not having experience (without actually noticing that this blog is nothing but six years of experience in a quite wide range of software), making this up out of (in your opinion wrong) impressions, without appealing to any actual specifics. The one who has no experience, to me, is you.

So let’s start from the specific question. Who am I? I’m one of the developers of the biggest and most widely used Free Software multimedia library out there, “libav”:http://www.libav.org/ — as well as the QA lead of the most performance-driven Linux distribution in the history of the world, Gentoo Linux. I think I know my shit. So who are you, who comes here from Google to comment without actually stating who you are?

I could even not address your points given your double standard, but it’s common knowledge among the readers of this blog that I _don’t_ have a degree — but since that doesn’t stop you from thinking you know more than most teachers, it doesn’t stop me from thinking I have much more real-world experience than you. And even if not that, I’m pretty sure that other people who commented above, both agreeing and disagreeing, have much more experience than _me_, so what? Experience is of course subjective.

I’ll only spend one more moment to note that there is a Godwin law in computer science nowadays, and that is focused on the name of Knuth — not to minimize his importance, but people who still hold him to be completely in line with the general landscape seem to have a very thin grasp of reality. Ignoring the whole “closer to perfection” issue of his versioning scheme, he’s the person who basically dissed parallel computing for a desktop “for not making TeX any faster”:http://www.informit.com/art… (without actually considering that it might; see the footnote at the end of the same article).

As for the (few) technical points you have, let’s just say this:

* static addresses in real-world applications don’t exist, at all; TEXTRELs are bad;
* nobody spoke about system call overhead, so I don’t really see why you wanted to bring that up;
* nobody’s saying that x32 has no performance improvement over x86 — we’re all saying that for the common case its improvement over the (already stable and widespread) x86-64 is basically noise;
* you also only talk about registers — there’s actually another feature where x32 and x86-64 beat x86 quite easily, and that’s PIC; the fact you ignored that shows how little you know of real-world use cases (and before somebody says that’s just an Unix thing, x32 is just a Linux ABI);
* blanket statements like “the vast majority of apps don’t need the full 64 bit mode” are utter bullshit; first, it all depends on what you define “an app”, and second, it depends on what that app does. The problem is not whether any of a thousand applications require addressing more than 256GB of RAM in a single process — it’s whether it’ll ever require addressing more than 4GB! Which is a more and more common situation, if nothing else because applications map files in memory and those files keep growing;
* and finally — benchmarks don’t tell you anything about real-world situations; profiling does. And up to now nobody has provided _profiling_ for a whole system to back up Intel’s heavily partial benchmarks.
I have 30 years of experience with computers in general… Started back in the 8 bit times (Z80 CPU)My first assembly code was written about those 30 years ago, C code 25 years ago.Back then if you wrote slow code, it would be really slow. Today we waste hardware nilly willy because we can.1 – What matters is the resulting binary code, all function calls, load address of any non auto symbol, and some other instances require a load time FIXUP of that PIC address to that real address or even worse a base address + offset calculation everytime you need to load that static object address, in both scenarios, x32 wins. Also notice that any string literal is a static object (as well as some corner cases).2 – System calls are perhaps the only case there x86_64 has the potential to be faster, because no address translations need to be done (I was actually conceding the main way that x86_64 can be faster than X32)3 – You fail to disprove my point about how much pointers are used all over the place, x32 wins big time due to smaller pointers leading to less data cache misses4 – If most apps needed the full 64 bit mode, they wouldn’t be useable in plain 32 bit mode ! Requiring mapping an entire file that will keep growing to memory is in itself ineficient and bad coding practice, typical of the last generation of lazy programmers. You can map as little as a single 4KB page (not very efficient), you can also map a few dozen MBs at a time, its a pretty bad thing mapping a file into memory that keeps growing, are you going to map the current file size plus a few GB just in case ?5 – So you manage a library written in C. Are you going to disprove the fact that JAVA, Perl, PHP, Python, … will be SUBSTANTIALLY faster in X32 ABI versus X86_64 due to the smaller pointer size ? 
I don’t need a benchmark or profiling to know it will be faster; all virtual machines running a higher-level language use pointers like crazy.

6 – In order to do some real-world, sophisticated profiling of x32, we need a serious distro ready to go with x32. Do we have that, beyond a simple release candidate of a recent Gentoo version?

I rest my case. Still, it doesn’t look like you really understand everything that happens under the hood at the CPU level. Even though I was never a full-time C programmer, I was always able to understand the performance characteristics of binary code better than many C programmers who didn’t quite understand all the finer points of CPU architectures.

Parallel computing is an entirely separate issue; concurrency issues can kill the performance of many apps. That’s another issue that doesn’t need a whole lot of benchmarking or profiling, if you know the under-the-hood behavior of modern processors.

While you seem to do some very important work for the open source community, that doesn’t mean you know everything there is to know about CPU performance in order to make a complete analysis. I stand by my basic point: the x32 ABI is a good thing.

Do you also advocate abandoning CPU architectures that don’t have a 64-bit mode? It doesn’t look like the ARM community will be a fan of your library if your code can’t run on the Raspberry Pi or similar CPUs, which don’t have a 64-bit mode.
http://indico.cern.ch/getFi…

This is CERN’s evaluation, for some very real-world programs: they measured a reduction of memory usage by a factor of 1.6 for e.g. the LHCb application. The savings for their other main applications and for crafty (a high-performance chess program) are at least 20%, usually more, in memory reduction, plus a slight but significant increase in CPU performance except in one case.

Your use of “debunk” is unfortunately wrong: you debunk myths with facts, not with speculation.
I have a degree and I am a fucking chump. Some of the brightest programmers I have ever met have been simultaneously degreeless and bottomlessly insightful into both C++ and the giblets of Linux/Linux development. (I worked at Trolltech.)

Bizarre to see someone seemingly possessed of insight pull that card; it does not belong in civilized discourse.

In any case, I am excited by x32 but welcome well-structured, well-formulated discourse which takes a contrary stance. Very cool blog entries.
It’s actually easy to demonstrate a case where a 2D array indexing calculation uses a 64-bit imul for x86-64, but a 32-bit imul for x32. This is especially true if you use `size_t` for sizes, indices, and loop counters when looping over memory, because x32 makes `size_t` a 32-bit type.

    return arr[y*rowstride + x]; // see x86-64 and x32 asm for this and another example at https://godbolt.org/g/xmEu9z

Atom is not the only CPU with slower 64-bit `imul`. AMD K8/K10 and the Bulldozer family have worse 64-bit imul, according to Agner Fog’s instruction tables (http://agner.org/optimize/). Silvermont (the Atom successor) also has somewhat slower imul r64,r64 than r32,r32. This applies even on Knight’s Landing (Xeon Phi), although you probably don’t want to use x32 on Atom/Silvermont anyway, because the extra address-size override prefixes will more easily exceed the 3-prefix (including escape bytes) limit and cause decode bottlenecks. It might be useful on KNL, though, for cases where you don’t need large data. (Obviously you’d want x86-64 available for other programs where you do want 64 bits of virtual address space.)

It’s unfortunate that gcc ends up using address-size prefixes all over the place for x32, because pointer math has to wrap at 4GB (to make negative offsets work). But current gcc is just dumb, and doesn’t even look for the opportunity to avoid it when it knows a 32-bit pointer is zero-extended into a 64-bit register. I’m not surprised that you end up losing as much or more code size to `67` address-size prefixes as you save from sometimes avoiding REX prefixes on pointer math in instructions that don’t use r8-r15. (gcc doesn’t seem to try very hard to do register allocation in a way that minimizes total REX prefixes, e.g. by favouring eax-ebp for 32-bit values.)

I think the main benefit of x32 is reducing the cache footprint of pointer-heavy data structures. Memory space is cheap (most of the time), but cache space and memory bandwidth are precious.
The gitk example Richard posted is a great use case: 700MB of saved memory from halving the size of pointers is huge.

This article’s “debunking” of that is pretty crap. Lots of structs have *multiple* pointer members, or a pointer and some size_t members. Shrinking those members can drop the size of a struct down below the previous multiple of 16B, so even if the whole struct had 16B alignment (e.g. if you’re using malloc()ed memory), you could save 16B per object by going from 20B to 16B (shrinking one pointer), assuming malloc() can’t do anything with the partially-used 16B chunk.

Relaxing the alignment requirement of many structs from 8B to 4B might also matter, IDK.

I know this is just hand-waving on my part, too. But understanding x86 performance is what I do. See some of my StackOverflow answers, like https://stackoverflow.com/q… or https://stackoverflow.com/q…, or [How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent](https://stackoverflow.com/q…), where I published some micro-optimization details that AFAIK were not known until I investigated.

I’m not sure if x32’s benefits are worth maintaining another ABI in binary distros and in source code, because I don’t have a good feel for exactly how bad that is. Most projects with inline or stand-alone asm shouldn’t need to do much differently with x32 to make their asm work.

The world seems to have decided that x32 wasn’t worth the benefit, but I worry that many people’s decisions were based on articles like this one, which made poor arguments about how much (or how little) benefit there was.