So my previous posts were picked up by none others than LWN.net — it was quite impressive to see the tweet of them picking up my blog post, it’s the first time, although I did author articles for them before.
Now in the comments of the articles and LWN’s own signalling of it, you can find a lot of discussion about the merits of x32, and a little of it tries to paint me as uninformed. I would like to just say a few words about that right now so that I don’t have to go through this later on. I’ve been toying around ELF, x86-64, PIC and structure optimisation for a very long time. I’ll come back in a moment on why I didn’t do a more thorough analysis and my own benchmarks of the architecture, but if you really think I’m just an amateur because I work on Gentoo Linux and not Fedora or Ubuntu, please think again. I might not be one of the “greats”, but I don’t think I’d be boasting if I say that I know what I’m doing — most of the times at least.
So why did I not go into doing my own benchmark to show the numbers of (non-)improvement on x32? Because for me it would be time wasted. I’m not Phoronix, I don’t benchmark stuff for a living, and I’m neither proposing the ABI or going to do work on it myself. I looked into the new ABI because from one side, it’s always cool to learn about new techniques and technology, even when they sound a little over the top (I did look a lot into FatELF as well and I was very negative about it — I hope Ryan doesn’t hold a grudge against me, I was quite unlikeable from his point of view I’m sure), and from the other because my colleague Luca suggested it could be useful to get some more performance out of a device we’re working on.
Now, said device is embedded, runs Gentoo Linux, and needs libav and x264 – I’m not going to give you any more specifics about it – which is why my first test on the new ABI has been testing libav (and finding it requiring way too much work than it would make sense for us). Looking into it has also told me that some of the assumption I made about how the new ABI would have been designed, for instance the fact that
long was still 32-bit surprised me.
I’ve been told my arguments are “strawmen” because I singled out some specific topic instead of doing a top-down analysis — as the title of my post, and the reference to my old ccache article should have suggested, I was looking into some of the things I’ve been discussing, or have been told. The only exception to that has been my answer to “x32 is going to be compatible with x86, if not now in the future.” — I have talked with nobody about this but I’ve seen this kind of misconception floating around, especially at the time of the FatELF proposal, about a 64-bit ABI which would be binary compatible with good old 32-bit x86.
The purported reason for having such an ABI would be being able to load 32-bit closed-source libraries into the address space of 64-bit programs or vice-versa. The idea is that this way the copy of Skype I’m running wouldn’t be loading into my memory a copy of the 32-bit
libc.so.6 library, which is used by no other process.
If it feels like my posts have been aimed squarely at the Gentoo folks, it might very well be right, although that was not the intention. Most people who look into new ABIs as they come out are probably on the same page as most Gentoo users with their bleeding edge feeling — if you have only production Fedora installs, you really won’t give much about an ABI fedora is not released for yet! And given Mike made us the first distribution releasing something for the ABI, it feels right to discuss Gentoo issues first.
Now I also been told that I didn’t talk enough about the reduction in size of data structures, which improves the use of the data cache (not the instruction cache as Francesco said in the comments of the first article), and for that people got the impression I don’t know how much of a difference that makes … that would be wrong given that I’ve actually discussed methods to minimize data usage and have spent time writing a tool to reduce copy-on-write even when that means making changes for ludicrously small improvements.
I have also been working closely with
pahole from Arnaldo’s dwarves package to make sure that the software I manage has properly-designed structures, not only reducing the size of the single object, but making sure that attributes that are to be used together are grouped nearby — this is pretty important for data cache handling, and might goes against what most people are told in school, here at least, that attributes in classes have to be ordered semantically, not by use.
On a different note it would be nice if it was possible to tell the compiler that a given structure never leaves the object, and thus it can reorder it as needed to get the best performance — but that would also require that each unit reorders it properly. Nevermind.
There are some interesting things to be considered as well — if you need fast access to objects in an array, you might be interested in using a little more memory and make sure the object’s size is a power of two, so that instead of using expensive multiplications you can use left shifts to calculate the offset from base pointer of a given index.
I know that reducing the size of pointers and long will reduce the pressure on the data cache, which in turn means you can have faster pointer chasing and better access to thinks like linked lists and so on — on the other hand I don’t think that this improvement is worth all the compatibility and porting headaches that a new ABI involves, especially considering that, as we move along, more and more software will make a better use of the 64-bit address space, as developers start to understand they have to drop the old design and paradigm of scores of years ago and replace it with modern design; Paul-Henning Kamp of FreeBSD and Varnish fame said it very well in the linked ACM article.
So to sum it up: I still don’t think x32 is worth my time, whether it is for porting, bug-filing or benchmarking. Of course if somebody gets libav to work on x32 I’ll be the first person to set up a FATE instance for it, and if Gentoo decides to make it a first-class citizen I’ll set up a tinderbox instance for it, but … I sure hope I won’t have to spend more time on it.
What I think I’ll spend some time on in the next few days, that I started thinking about after all the comments, is some posts describing things such as what an ABI actually is in this content, and how to see whether your structures are simply inadequate for what you’re trying to do. It might get interesting.
And to finish this off, I know I use “Now,” to start paragraphs way too often — I guess this is the reason why O’Reilly wouldn’t consider me as an author.
You’re blog posts read better than any Phoronix article
I think a maybe more obvious way to ask the question is: Are you really sure those few % gained by x32 in some corner-cases will be able to compensate for the massive speed loss due to the code only being optimized for amd64?I am rather confident that for an all-x32 general-purpose system the answer is no.And a all-new ABI for a few corner-cases to be honest seems like a rather crazy thing to do. Particularly when there’s a good chance that for these corner-cases you could have achieved the same performance with a few compiler hacks (like reintroducing “near” pointers) and special memory allocations for some critical arrays that ensures the pointers fit in 32 bit.
On the LWN’s article there are people trying to sell the idea “32bit ARM CPU with 64bit memory addresses is the same idea as x32”, hence “x32 is tailored for Android”; I see dumb people.I perfectly remember a very old speech about x32 (it was just a prototype idea at that time) dating back to 2005 / 2006 (or even older, I don’t remember the period with precision) in which was discussed also the main purpose of the idea: high-density clusters. In *that scenario* all the concepts behind the x32 ABI have a sense, outside that scenario is a mere academic exercise without a real-world use case. x32 could be useful for cluster environments used for crunching big data, like for example the one used into the research field, where the reduced size of _data cache_ can lead to faster simulations, and short simulations means less time wasted on crunching your data (I am speaking in the order of days / weeks, not seconds or minutes), ergo: you have saved a lot of money that can be re-used more efficiently for the research itself, greatly boosting the research’s results. Obviously, here we are speaking of environments with just the bare essentials: kernel, the toolchain and nothing else than *your highly optimized application used for the simulations*.
I am running heaps of web servers written in Python.In 64-bit Python, each of my process consumes almost 2x the amount of memory as they do on 32-bit Python. Having an x32 option would let me put twice as many clients on the same server thus saving a lot on my server costs.