Multi-architecture, theory versus practise

You probably remember the whole thing about FatELF and my assertion that FatELF does nothing to solve what the users supporting it want to see solved: multiple architecture support by vendors. Since I don’t want to be taken for one of those people who throw an assertion and pretend everybody falls in line with it, I’d like to explain somewhat further what the problem is in my opinion.

As I said before, even if FatELF could simplify deployment (at the expense of increasing exponentially the complexity of any other part of the operating system that deals with executables and libraries), it does nothing to solve a much more important problem, that has to be solved before you can even think of achieve multi-architecture support from vendors: development.

Now, in theory it’s pretty easy to write multi-architecture code: you make no use of any machine-dependent feature, no inline assembly, no function call outside the scope of a standard. But is it possible for sophisticated code to keep this way? It certainly often is not for open source software, even when it already supports multiple architecture and multiple software platforms as well. You can find that even OpenOffice require a not-so-trivial porting to support Linux/HPPA and that’s a piece of software that, while deriving from a proprietary suite (and having been handled by Sun which is quite well known for messy build systems), has been heavily hacked at by a large community of developers, and includes already stable support for 64-bit architectures.

Now try to be a bit imaginative, and find yourself working at a piece of proprietary code: you’ve already allocated money to support Linux, which is for many points of view, a fringe operating system. Sure it starts to increase in popularity, but then again a lot of those using it won’t run proprietary application nonetheless… or wouldn’t pay for them. (And let’s not even start with the argument but Chrome OS will bring a lot more users to Linux since that’s already been shown as a moot point). Most likely at this point you are looking at supporting a relatively small subset of Linux users; it’s not just a matter of difference between distributions, it’s just a way to cut down testing time; if it works on unsupported distributions is fine, but you won’t go out of your way for them; the common “enterprisey” distributions are fine for that.

Now, at the end of the nineties or at the beginning of the current decade, you wouldn’t have to think much in term of architectures either: using Linux on anything but x86 mostly required lots of effort (and lead to instability). In all cases you had to “eradicate” the official operating system of the platform, which meant Windows for x86, Solaris for SPARC and Mac OS for PPC; but while the former was quite obvious, the other still required more work since they were developed by the developer of the hardware in the first place.

Nowadays it is true that things changed, but how did they change exactly? The major change is definitely the introduction of the AMD64 (or x86-64 if you prefer) architecture: an extension to the “good” old x86 that supports 64-bit addresses. This alone created quite a few problems: from one side, since it allows for compatibility with the old x86 software, proprietary, commercial software didn’t flock to support it so fast: after all their software could still be used, even though it required further support from the distributions (multilib support that is), on the other side, multilib was before something that only a few niche architectures like mips looked out for, so support for it wasn’t as ready for most distributions.

And, to put the cherry on top, users started insisting for some software to be available natively for x86-64 systems, so that it would be more compatible, or at least shinier in their eyes; Java, Flash Player, and stuff like that had to be ported over. But here we’re reaching the point where theory (or, if you’re definitely cynical – like me – fantasy) clashes against practise: making Flash work on AMD64 systems didn’t just involve calling another compiler, as many people think, partly because the technologies weren’t all available to Adobe to rebuild, and partly because the code made assumption about the architecture it was working on.

Let be honest: it’s hypocrite to say that Free Software developers don’t make such assumption; it’s more like porters and distributions fixed their code long time ago; proprietary software does not have this kind of peer review, and they are, generally, not interested on it. It takes time, it takes effort and thus it takes money. And that money does not generally come out of architectures like Alpha, or MIPS. And I’m not calling out for the two of them without reason here: they are the two architectures that have actually allowed for some porting to be done for AMD64 before its time. The former was probably the previously most available 64-bit system Linux worked decently on (SPARC64 is a long story), and had code requirements very similar to x86-64 in term of both pointer size and PIC libraries. The latter had the first implementations of multilib around.

But again, handling endianness correctly (and did you know that MIPS, ARM and PowerPC exist in multiple endian variations?), making sure that the pointers are not assumed to be of any particular size, and never use ASM-only routines is simply not enough to ensure your software will work on any particular architecture. There are many problems, some of which are solvable by changing engineering procedures, and some others which are simply not solvable without spending extra time debugging that architecture.

For instance, if you’ve got an hand-optimised x86-only assembly routine, and a replacement C code for the other architectures, that code is unlikely to get tested as much as the x86 code, if your development focuses simply on x86. And I’m not kidding you when I say that it’s not such a rare thing to happen, also with Free Software projects. Bugs in that piece of code will be tricky to identify unless you add to your development process the test and the support for that particular architecture; which, trust me, is not simple.

Similarly, you can think of the strict aliasing problem: GCC 4.4 introduced further optimisations that can make use of strict aliasing assumption on x86 as well; before, this feature was mostly used by other architectures. Interestingly enough, the amount of strict-aliasing bug is definitely not trivial, and will cause some spurious bugs at runtime. Again, this is something that you can only fix by properly testing, and debugging, on different architectures. Even though some failures now happen on x86 too, this does not mean that the same problems happen, no more no less, on anything else. And you need to add your compiler’s bug to the mix, which is also not so simple.

And all of this is only covering the problems with the code itself, and comes nowhere near the problems of cross-compilation, it does not talk about the problems and bugs that can be in your dependencies’ code for the other architectures, or the availability of stable-interface distributions for those architectures (how many architectures is RHEL available for?).

After all this, do you still think that the only problem keeping vendors from supporting multiple architectures is the lack of a “Universal Binary” feature? Really? If so, I have some fresh air to sell you.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s