C++ name demangling

I’ve been having some off time (well, mostly time I needed to do something that kept me away from facebooker craziness, since that was driving me crazy quite literally), and I decided to put some more work into Ruby-Elf (which now has its own page on my site, and a Flattr button as well, like this blog and Autotools Mythbuster, thanks to Sebastian).

What I’m working on right now is support for C++ name demangling, in pure Ruby; the reason for that is that I wanted to try something “easier” before moving on to implementing the full DWARF specification in Ruby-Elf. And my reason for wishing for a DWARF parser in there is that Måns confirmed my suspicion that it is possible to statically analyse the size of the stack used by a function (well, with some limitations, but let’s not dig into that right now). At any rate, I decided to take a stab at the demangler because it would come in useful for the other tools in Ruby-Elf.

Now, while I could assume that most of my readers already know what a demangler is (and thus what mangling is), I’ll try to introduce it for all the others, who would otherwise end up bored by my writing. C++ poses a much harder challenge to linkers and loaders than C does, because symbols are no longer identified just by their “simple” name; C++ symbols have to encode namespaces and class levels, a number of special functions (the operators), and, in the case of functions and operators, the list of parameters (because two functions with the same name and different sets of parameters are valid in C++). To do so, all compilers implement some kind of scheme to translate the symbols into identifiers that use a very limited subset of ASCII characters.

Different compilers use different schemes to do so, sometimes even differing depending on the operating system they are building on (mostly for compatibility with other “native” compilers on that platform). The scheme that you commonly find in Linux binaries is the so-called GCC3 mangling, which is used, as the name suggests, by GCC 3 and later and by ICC on Linux (and, iirc, OS X). The complete definition of the mangling scheme is available as part of the Itanium ABI definition (thanks to Luca who found me the link to that document); you can recognise symbols mangled under this algorithm by their starting with _Z. Other mangling schemes are described in other documentation; that link was provided by Dark Shikari.

While standardising on a single mangling algorithm has been proposed many times, compilers use the differences in mangling schemes to prevent the risk of cross-ABI linking. And don’t expect the compilers to standardise their ABI anytime soon.

At any rate, the algorithm itself looks easy at the first glance; most of the names are encoded by having the length of the name encoded in ASCII and then the name itself; so an object bar in a namespace foo would be encoded as _ZN3foo3barE (N-E delimit the global object name). Unfortunately when you add more details in, the complexity increases, a lot. To avoid repeating the namespace specification time after time, when dealing with objects in the same namespace, or repeating the full type name when accepting objects of its own class, or multiple parameters with the same type, for instance, the algorithm supports “backreferences” (registers); name fragments and typenames (but not function names) are saved into these registers and then recovered with specific character sequences.

Luckily, Ragel helps a lot in parsing this kind of specification; unfortunately, I have reached a point where proceeding further definitely requires a lot more work than I’d have expected. The problem is that you have recursion within the state machines: a parameter list might contain… another parameter list (for function pointers); a type might contain a type list, as part of a template… and doing this in Ragel is far from easy.

There are also shorthands used to replace common name fragments, such as std:: or std::allocator, which would otherwise be used so many times that the symbol names would be exceedingly huge. All in all, it’s quite a complex operation. And not a perfect one either: different symbols can demangle to the same string, and not even the c++filt shipping with binutils takes care of validating the symbol names it finds, so for instance _Zdl is translated to “operator delete” even though there is nothing like that in C++: you need a parameter for that operator to work as intended. I wonder if this can be used to obscure the interfaces of proprietary libraries…
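As a hedged example of those shorthands (the print function below is made up, and the mangled name assumes the classic libstdc++ ABI of the time):

#include <string>

// print(std::string const&) mangles to _Z5printRKSs: Ss is the built-in
// shorthand for std::string, R marks a reference and K marks const; without
// the shorthand, the whole basic_string<char, char_traits<char>,
// allocator<char> > expansion would end up in the symbol name.
void print(const std::string &s);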

Now, since Luca also asked about this, I have to add, as Måns already confirmed, that having too-long symbol names can slow down the startup of a program; in particular, it increases the time needed for binding. Even though, generally speaking, the loader will not compare the symbol name itself but rather its hash in the hash table, the longer the symbol the more work the hash function has to do. And to give an idea of how long a name we’re talking about, take for instance the following real (and exported, no less) symbol coming from gnash:

_ZN5gnash13iterator_findERN5boost11multi_index21multi_index_containerINS_8PropertyENS1_10indexed_byINS1_14ordered_uniqueINS1_13const_mem_funIS3_RKNS_9ObjectURIEXadL_ZNKS3_3uriEvEEEEN4mpl_2naESC_EENS5_INS1_3tagINS_12PropertyList8OrderTagESC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_EENS6_IS3_iXadL_ZNKS3_8getOrderEvEEEESC_EESC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_EESaIS3_EEEi

gnash::iterator_find(boost::multi_index::multi_index_container, mpl_::na, mpl_::na>, boost::multi_index::ordered_unique, boost::multi_index::const_mem_fun, mpl_::na>, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, std::allocator >&, int)

The second line is the symbol filtered through the binutils demangler, giving the human-readable name of the function. If you cannot guess, one of the problems is that the mangling algorithm contains no shorthand for Boost templates, contrary to what it has for the standard library templates. I know I cannot pretend that this relates directly to the problems I see with C, but it shows two things: the first is that C++ has a hidden amount of complexity that even long-time developers fail to grasp properly; the second is that ELF itself doesn’t seem well designed to handle the way C++ is compiled into code; and this is all without even looking at the recent undocumented changes introduced in GLIBC to make sure that C++ is implemented correctly.
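Going back to the hash-based lookup mentioned above, this is the classic SysV ELF hash function as given in the ELF specification (reproduced here as a sketch); the dynamic loader runs something like this over every symbol name it has to look up, one byte at a time, so a four-hundred-character Boost-flavoured name costs proportionally more than a short C-style one:

/* historical SysV ELF hash used for the .hash section; the cost of the
   loop grows linearly with the length of the symbol name */
unsigned long elf_hash(const unsigned char *name)
{
    unsigned long h = 0, g;

    while (*name) {
        h = (h << 4) + *name++;
        if ((g = h & 0xf0000000))
            h ^= g >> 24;
        h &= ~g;
    }

    return h;
}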

I wonder if I’ll ever be able to complete this demangler, and then start with the other schemes supported in ELF files…

For A Parallel World. Theory lesson n.1: avoiding pointless autoconf checks

In my “For A Parallel World” series I have, up to now, only dealt with practical problems related to actual parallel builds. Today I wish to try something different, writing about some good practices to follow when working with autotools.

Once again I want to repeat that no, I don’t think autotools are completely flawed, or that they are way too complex for normal people to use; I certainly don’t agree that CMake is “just superior”, like some good programmer said recently (although I admit that after some of the stuff I’ve seen I’d gladly take CMake over some custom-built makefiles). I do think that there are too many bad examples of autotools usage, I do think that autotools could very well use better documentation and better tutorials, and I do think that a lot of people have been misusing autotools to the point that the tool gets blamed for how it was used.

You certainly know what problem I’m referring to right now, about parallel builds and autoconf: the lengthy and serialised execution of autoconf checks. It’s a very boring part of a build, but it’s unfortunately necessary, usually. What is the problem there, though? The problem is that a lot of packages execute more tests than are actually needed, which is going to bore you to death, since you’re waiting for checks whose results are never going to be used.

The causes of this are quite varied: legacy systems being supported, overzealousness of the developers with respect to missing headers or functions, autoscan-like tools, boilerplate checks coming from a bastardised build system (like the one forced upon each KDE 3 package), and of course poor knowledge of autoconf. In addition to this, libtool 1.5 was very bad and checked for C++ and Fortran compilers, features, linkers and so on even though they were not going to be used; luckily 2.2 is now fixed, and upstream projects are slowly migrating to the new version, which takes much less time to run.

I’m currently looking for a way to scan a source tree to identify possibly overzealous checks in configure, so that I can help reduce the pointless checks myself, but in the meantime I want to give some indications that might at least help people write better autotools-based build systems for their projects.

My first suggestion is to require a standard base: I’m not referring to stuff like the Linux Standard Base, I’m referring to requiring a standard base for the language. In the case of C, which is mostly what autotools are used for (okay, there’s also C++, but that’s not a topic I want to deal with right now), you can ensure that you have some stuff present by requiring a C99-compliant compiler. You’re going to cut out some compilers, but C99 nowadays is pretty well supported under any operating system I’ve ever dealt with, and even under Linux you can choose between three C99 compilers: GCC, Sun Studio Express and the Intel C Compiler. As you can guess, as long as you use C99 features, compatibility between these three compilers is also almost perfect (there are some implementation-dependent things that vary, but if you avoid those it’s quite good).

But the important part here is that, by requiring C99 explicitly, you’re also requiring the standard headers that come with it, which you then don’t need to check for: stuff like locale.h, stdio.h, stdbool.h, stdlib.h, …; they have to be present for C99 to be supported. And even better, you can require POSIX.1-2001 and _XOPEN_SOURCE 600, so that you have a wide enough feature set to rely upon. It’s very easy: -D_POSIX_C_SOURCE=200112 and -D_XOPEN_SOURCE=600, together with a C99-compatible compiler (like gcc -std=c99 or sunc99), and you can rely upon the presence of functions like nanosleep() or gethostname(); you won’t have to check for them.
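A minimal configure.ac sketch of this approach (the exact macro calls are my own suggestion, not a canonical recipe) could look like this:

dnl require a C99 compiler outright instead of checking for individual
dnl features; AC_PROG_CC_C99 is available from autoconf 2.60 onwards
AC_PROG_CC_C99
AS_IF([test "x$ac_cv_prog_cc_c99" = "xno"],
      [AC_MSG_ERROR([a C99-capable compiler is required])])

dnl ask for POSIX.1-2001 plus XSI extensions; after this, functions like
dnl nanosleep() or gethostname() can simply be assumed to be present
AC_DEFINE([_POSIX_C_SOURCE], [200112L], [Request POSIX.1-2001 interfaces])
AC_DEFINE([_XOPEN_SOURCE], [600], [Request SUSv3/XSI interfaces])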

Now, of course, to support legacy systems you cannot rely on these standards, which are pretty new and not well supported, if at all, by the older versions of compilers and operating systems that you might be interested in supporting. Well, guess what? A good way to deal with this, rather than checking everything with autotools and then dealing with all the issues one by one, is to assume things are available and give legacy operating systems a chance to run the software by having a library supply the missing parts. Such a library can implement replacements or stubs for the missing functions and headers; users of legacy operating systems can then just provide the library alongside the project itself.
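One possible, entirely hypothetical, way to wire this up in configure.ac without checking for every single function is a single switch that legacy platforms can turn on to build the bundled replacements (the option name and the conditional are made up for illustration):

dnl no per-function checks, just one knob for legacy systems
AC_ARG_ENABLE([compat-library],
  [AS_HELP_STRING([--enable-compat-library],
    [build bundled replacements for functions missing on legacy systems])])

AM_CONDITIONAL([BUILD_COMPAT], [test "x$enable_compat_library" = "xyes"])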

If you don’t like this approach, which in my opinion is quite nice and clear, you can instead rely fully on an external library, such as glib, to provide you with some platform-independent support (like named integer types, byteswap macros, string functions). Again, this comes down to requiring things to be available, rather than adapting (maybe too much) the software to the platforms it supports.

Sometimes these approaches can be a bit of an overkill, though, since you might not need full C99 but can accept C89 just fine, with a few touches; in that case you might just assume that your functions are present, but they might not use the exact name you expect (for instance, there are a few functions that change name between POSIX and Windows), or you might want to check the function name against a known replacement function present in an extension library (which might as well be glib itself!).

In these cases you can certainly rely on the checks that come with autotools, but even then I’d suggest being careful with what you write. One common problem, for instance, is overchecking headers. Let’s say you have to find at least one header that declares standard integer types (int64_t and similar); in general you can expect that one of inttypes.h, stdint.h or sys/types.h will be present and will have the definitions you need. The simplest test to find them is to use something like this:

dnl in configure.ac
AC_CHECK_HEADERS([inttypes.h stdint.h sys/types.h])

/* in a common header file */
#if defined(HAVE_INTTYPES_H)
# include <inttypes.h>
#elif defined(HAVE_STDINT_H)
# include <stdint.h>
#elif defined(HAVE_SYS_TYPES_H)
# include <sys/types.h>
#else
# error "Missing standard integer types"
#endif

While the code in the header is quite good, since only the first header found is used, the example code in configure.ac is overchecking, since it checks all three of them even though just the first hit is used. Indeed, if you check the output of a configure script using that, you’ll get this:

..snip..
checking for inttypes.h... (cached) yes
checking for stdint.h... (cached) yes
checking for sys/types.h... (cached) yes
..snip..

(the tests show up as cached because autoconf itself already checks for those headers; that’s overchecking in autoconf itself, and should probably be fixed).

Instead of the former code, a slightly different variant can be used:

AC_CHECK_HEADERS([inttypes.h stdint.h sys/types.h], [break;])

With this, the checks will stop at the first header that is found. It might not sound like much of a difference, but if you pile them up, well, it’s the usual point: little drops make a sea. This gets particularly useful when packages rename their include files, or decide to move from just the basename to the packagename/basename approach (think FFmpeg); you can test for the new one and, if that doesn’t hit, check the old one, but avoid checking twice if the new one already hit.
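As a concrete sketch of the FFmpeg case, using the old and new locations of avcodec.h, the check would look like this:

dnl check the new packagename/basename location first, fall back to the old
dnl flat name, and stop as soon as one of the two is found
AC_CHECK_HEADERS([libavcodec/avcodec.h ffmpeg/avcodec.h], [break;])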

The same approach can be used with AC_CHECK_FUNCS so that you only check for the first function of a series of possible replacements, and go with that one.
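A hedged example, assuming getline() and the BSD fgetln() are both acceptable for the task at hand:

dnl pick whichever of the two line-reading functions is found first; the
dnl resulting HAVE_GETLINE / HAVE_FGETLN macro then selects the code path
AC_CHECK_FUNCS([getline fgetln], [break;])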

But the most important thing is to make sure that all the checks you run during configure are actually useful and used. It’s not uncommon for software to check for a series of headers or functions to define the HAVE macros, but then never use the macros themselves. It’s tremendously unfortunate and should be avoided at all costs; you should always check your software for that, especially when you make changes that might render checks pointless.

Do you maintain an autotools-based package? Then please take a look at your configure.ac and make sure you’re not running pointless checks; all your users will be happy if you can reduce their build time!

Shortcomings of our multilib/FHS filesystem layout

Today I’m taking a day off from Gentoo to finish my slides. I really love LaTeX-beamer for this kind of work, though I don’t like doing the work itself. I’m not really the kind of person who creates colourful, graphics-heavy presentations; I’m more of an experience person. Anyway, luckily I’m almost done, and I can hopefully go to the course prepared.

For this reason, today I have nothing to write about ongoing changes and similar stuff, but I do have something I’ve been thinking about a lot in the past months and that I should probably share with the readers of Planet Gentoo and of my blog.

For those who don’t know what multilib is: it’s a technique that allows running two (or more) different and incompatible kinds of userland under the same kernel, provided the kernel is able to run both of those userlands. The classic example is Gentoo/AMD64, which comes by default with a multilib setup that provides both 64-bit and 32-bit libraries, allowing you to run 32-bit binaries, such as OpenOffice or VMware Player, even if the main system is running 64-bit.

This is of course an advantage, but it comes with some costs, mainly the requirement of installing things twice, once for 64-bit and once for 32-bit. Right now Portage does not support so-called true multilib, so we have to use pre-built binary packages containing the 32-bit libraries (the app-emulation/emul-linux-x86-* packages), while glibc and gcc both have a multilib-aware build system, so they are built from source correctly.

But why do I talk of shortcomings? Because we have had a few problems over time; in particular, right now we have problems with Wine and the current eselect-compiler, problems that I actually found and reported quite some time ago and that unfortunately haven’t been solved yet. The cause is that the current support for 32-bit building in Portage is largely a hack that works only when correctly adjusted for the platform; it’s not generic enough, in my opinion.

The reason for this, as I suggested in the title, also comes from the FHS, the Filesystem Hierarchy Standard, which we partially support. Following that standard, we have a single binary directory, one lib directory per ABI/userland (“lib” being the main ABI), and share and include directories that contain platform-independent data files and headers.

Yes, because in a perfect world the includes are just the same on all platforms, so one doesn’t have to think about them too much. Unfortunately, we all know that’s not true. Start from glibc, which comes with headers that are completely unreadable by humans, with all their preprocessor conditionals, but are still different (quite different) from arch to arch; then there are the linux-headers, which of course change between architectures as far as the asm directory is concerned. And this is not limited to those two system packages: there are also packages that install autoconf-generated config.h files, which are of course dependent on the userland.

The result is that we have to use a dirty trick, keeping three copies of every such header for glibc on Gentoo/AMD64: one for the x86 version, one for the x86_64 version, and one file that includes one or the other based on what the compiler defines. This is a mess of its own, and I think two include directories would somehow have been simpler.
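Roughly, the dispatching file is a wrapper along these lines (a sketch, assuming the usual compiler-provided architecture macro; the actual file names used by the Gentoo patches differ):

/* hypothetical /usr/include wrapper: pick the per-ABI copy of the header
   based on what the compiler defines for the target */
#if defined(__x86_64__)
# include "some-header-64.h"
#else
# include "some-header-32.h"
#endif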

But it does not end here. On UNIX systems there is classically a /usr/libexec directory that contains the commands called by other software but not required to be visible to users, be they normal users or root (commands not visible to normal users but visible to root go in sbin). That directory is usually replaced by /usr/$libdir/misc in Gentoo, which means that on a multilib setup we have two distinct directories. This works in most cases, but there can be one big fault in the logic.

Take CUPS, which usually installs all its executable files for filters, backends and the like into /usr/lib/cups. CUPS opens them as pipes; it does not care what they are: they can be compiled binaries, interpreted scripts or Java programs; as long as the system can run them, CUPS does not care. If we were to follow the /usr/$libdir/misc reasoning, as we did for a while, one wouldn’t be able, for instance, to have a 64-bit CUPS working with a Canon iPixma backend (which needs to be compiled 32-bit, as Canon does not release some of their libraries as 64-bit), although it would supposedly work correctly; for this reason, in the 1.2 series I was able to push for using /usr/libexec for it, even if it “broke” what the base-system team usually does.

We need it to share the executables between different ABIs, and this is right: the programs stored in /usr/$libdir/misc are executed, not loaded, so they can be used by different ABIs just the same way. The reasoning given to me for the misc directory being different was that “it’s a step forward to having two ABIs completely installed on a system”, but that seems like a stupid thing to do to me. We don’t have, nor probably want, two distinct sets of bin directories, because we can exec binaries for either ABI just fine. If we really want them differentiated, well, then we have to come up with something better than /usr/lib/bin and /usr/lib64/bin, or /usr/bin and /usr/bin64.

As far as libexec is concerned, its content should always be available in the best ABI possible, so either the native one or the one with the least overhead. Think of a Gentoo/FreeBSD AMD64 box: /usr/libexec should have 64-bit FreeBSD binaries, but if for one binary (be it a generic command or a CUPS backend) those are not available, it can fall back to i386 FreeBSD binaries; if neither is available, it can install i386 Linux binaries. Of course, being able to use three different ABIs requires having three different library sets and three different header sets (as a lot of headers change between FreeBSD and Linux), but then we can have a single command set, as we can mix and match different ABIs as long as the loader is not invoked and a pipe is used instead.

A final alternative would be to have entirely different trees for the different ABIs, like /32bit/{bin,lib} and /64bit/{bin,lib}, with /usr/{32,64}bit stuff and all, but leaving /usr/libexec as a shared point. But I don’t think this is a good idea in the long run.

Anyway, I’d really like Portage to have full true multilib support, but that will probably require quite a lot of work the way it is. I proposed an alternative method that might be simpler to handle, but it requires rewriting almost all of the support written so far, so I doubt it makes much sense after all :)

Oh well, back to work, now that gdb is finally installed on the iBook (I need it for the course) and I can test that gdbserver works as intended. Too bad they (the people I work for) can’t use it, as they work with an embedded ARM whose environment was designed around a very old version of GDB that doesn’t support TCP connections.