I’ve been having some off time (well, mostly time I needed to do something that kept me away from facebooker craziness since that was driving me crazy quite literally), and I decided to get more work to do in Ruby-Elf (which now has its own page on my site — and a Flattr button as well, like this blog and Autotools Mythbuster thanks to Sebastian.
What I’m working on right now is supporting C++ name demangling, in pure Ruby; the reason for that is that I wanted to try something “easier” before moving on to trying to implement the full DWARF specification in Ruby-Elf. And my reason to wishing for a DWARF parser in there is that Måns confirmed my suspicion that it is possible to statically-analyze the size of the stack used by a function (well, with some limitations, but let’s not dig into that right now). At any rate I decided to take a stab at that because it would come useful for the other tools in Ruby-Elf.
Now, while I could assume that most of my readers already know what a demangler is (and thus what is mangling), I’ll see to introduce it for all the others who would otherwise end up bored by my writing. C++ provides a much harder challenge for linkers and loaders, because the symbols are no longer identified just by their “simple” name; C++ symbols have to encode namespaces and class levels, a number of special functions (the operators), and in the case of functions and operators, the list of parameters (because two functions with the same name and different set of parameters, in C++, are valid). To do so, all compilers implement some kind of scheme to translate the symbols into identifiers that use a very limited subset of ASCII characters.
Different compilers use different schemes to do so; sometimes even differing depending on the operating system they are building on (mostly that’s for compatibility with other “native” compilers on that platform). The scheme that you can commonly find on Linux binaries is the so-called GCC3-mangling that is used, as the name leaves to intend, by GCC 3 and later and by ICC on Linux (and iirc OSX). The complete definition of the mangling scheme is available as part of the Itanium ABI definition (thanks to Luca who found me the link to that document); you can recognize the symbols mangled under this algorithm by their starting with
_Z. Other mangling schemes are described among other documentation — link provided by Dark Shikari.
While standardising over the same mangling algorithm has been proposed many times, compilers make use of the difference in mangling scheme to prevent risk of cross-ABI linking. And don’t expect that the compilers will standardise their ABI anytime soon.
At any rate, the algorithm itself looks easy at the first glance; most of the names are encoded by having the length of the name encoded in ASCII and then the name itself; so an object
bar in a namespace
foo would be encoded as
_ZN3foo3barE (N-E delimit the global object name). Unfortunately when you add more details in, the complexity increases, a lot. To avoid repeating the namespace specification time after time, when dealing with objects in the same namespace, or repeating the full type name when accepting objects of its own class, or multiple parameters with the same type, for instance, the algorithm supports “backreferences” (registers); name fragments and typenames (but not function names) are saved into these registers and then recovered with specific character sequences.
Luckily, Ragel helps a lot to parse this kind of specifications; unfortunately I have reached a point where proceeding further require definitely a lot more work than I’d have expected. The problem is, you have recursion within the state machines: a parameter list might contain.. another parameter list (for function pointers); a type might contain a typelist, as part of a template… and doing this in Ragel is far from easy.
There are also shorthands used to replace common name fragments, such as
std::allocator that would be used so many times that the symbol names would be exceedingly huge. All in all, it’s a quite complex operation. And not a perfect one either; different symbols can demangle to the same string; and not even the
c++filt shipping with binutils take care of validating the symbol names it finds, so for instance
_Zdl is translated to “operator delete” even though there is nothing like that in C++: you need a parameter for that operator to work as intended. I wonder if this can be used to obscure proprietary libraries interfaces…
Now, since Luca also asked about this, I’d have to add, as Måns already confirmed, that having too-long symbol names can slow down the startup of a program; in particular it increases the time needed for bindings; even though generally-speaking the loader will not compare the symbol name itself, but rather its hash on the hash table, the longer the symbol the more work the hash function has to do. And to give an idea of how long a name we’re talking about, take for instance the following real symbol; exported symbol, nonetheless, coming from gnash:
_ZN5gnash13iterator_findERN5boost11multi_index21multi_index_containerINS_8PropertyENS1_10indexed_byINS1_14ordered_uniqueINS1_13const_mem_funIS3_RKNS_9ObjectURIEXadL_ZNKS3_3uriEvEEEEN4mpl_2naESC_EENS5_INS1_3tagINS_12PropertyList8OrderTagESC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_EENS6_IS3_iXadL_ZNKS3_8getOrderEvEEEESC_EESC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_EESaIS3_EEEi gnash::iterator_find(boost::multi_index::multi_index_container, mpl_::na, mpl_::na>, boost::multi_index::ordered_unique, boost::multi_index::const_mem_fun, mpl_::na>, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, std::allocator >&, int)
The second line is filtered through the binutils demangler to provide the decorated name of the function for C+. If you cannot guess, one of the problems is that the mangling algorithm contains no shorthand for boost templates, contrarily to what it has for the standard library templates. I know I cannot pretend that this relates directly with the problems I see with C, but it shows two things: the first is that C+ has an unseen amount of complexity that even long-time developers fail to catch properly; the second is that ELF itself doesn’t seem well-designed to handle the way C++ is compiled into code; and this is all without even looking at the recent undocumented changes introduced in GLIBC to make sure that C++ is implemented correctly.
I wonder if I’ll ever be able to complete this demangler, and then start with the other schemes supported in ELF files…