C++ name demangling

I’ve been having some off time (well, mostly time I needed to do something that kept me away from facebooker craziness since that was driving me crazy quite literally), and I decided to get more work to do in Ruby-Elf (which now has its own page on my site — and a Flattr button as well, like this blog and Autotools Mythbuster thanks to Sebastian.)

What I’m working on right now is supporting C++ name demangling, in pure Ruby; the reason for that is that I wanted to try something “easier” before moving on to trying to implement the full DWARF specification in Ruby-Elf. And my reason to wishing for a DWARF parser in there is that Måns confirmed my suspicion that it is possible to statically-analyze the size of the stack used by a function (well, with some limitations, but let’s not dig into that right now). At any rate I decided to take a stab at that because it would come useful for the other tools in Ruby-Elf.

Now, while I could assume that most of my readers already know what a demangler is (and thus what is mangling), I’ll see to introduce it for all the others who would otherwise end up bored by my writing. C++ provides a much harder challenge for linkers and loaders, because the symbols are no longer identified just by their “simple” name; C++ symbols have to encode namespaces and class levels, a number of special functions (the operators), and in the case of functions and operators, the list of parameters (because two functions with the same name and different set of parameters, in C++, are valid). To do so, all compilers implement some kind of scheme to translate the symbols into identifiers that use a very limited subset of ASCII characters.

Different compilers use different schemes to do so; sometimes even differing depending on the operating system they are building on (mostly that’s for compatibility with other “native” compilers on that platform). The scheme that you can commonly find on Linux binaries is the so-called GCC3-mangling that is used, as the name leaves to intend, by GCC 3 and later and by ICC on Linux (and iirc OSX). The complete definition of the mangling scheme is available as part of the Itanium ABI definition (thanks to Luca who found me the link to that document); you can recognize the symbols mangled under this algorithm by their starting with _Z. Other mangling schemes are described among other documentation — link provided by Dark Shikari.

While standardising over the same mangling algorithm has been proposed many times, compilers make use of the difference in mangling scheme to prevent risk of cross-ABI linking. And don’t expect that the compilers will standardise their ABI anytime soon.

At any rate, the algorithm itself looks easy at the first glance; most of the names are encoded by having the length of the name encoded in ASCII and then the name itself; so an object bar in a namespace foo would be encoded as _ZN3foo3barE (N-E delimit the global object name). Unfortunately when you add more details in, the complexity increases, a lot. To avoid repeating the namespace specification time after time, when dealing with objects in the same namespace, or repeating the full type name when accepting objects of its own class, or multiple parameters with the same type, for instance, the algorithm supports “backreferences” (registers); name fragments and typenames (but not function names) are saved into these registers and then recovered with specific character sequences.

Luckily, Ragel helps a lot to parse this kind of specifications; unfortunately I have reached a point where proceeding further require definitely a lot more work than I’d have expected. The problem is, you have recursion within the state machines: a parameter list might contain.. another parameter list (for function pointers); a type might contain a typelist, as part of a template… and doing this in Ragel is far from easy.

There are also shorthands used to replace common name fragments, such as std:: or std::allocator that would be used so many times that the symbol names would be exceedingly huge. All in all, it’s a quite complex operation. And not a perfect one either; different symbols can demangle to the same string; and not even the c++filt shipping with binutils take care of validating the symbol names it finds, so for instance _Zdl is translated to “operator delete” even though there is nothing like that in C++: you need a parameter for that operator to work as intended. I wonder if this can be used to obscure proprietary libraries interfaces…

Now, since Luca also asked about this, I’d have to add, as Måns already confirmed, that having too-long symbol names can slow down the startup of a program; in particular it increases the time needed for bindings; even though generally-speaking the loader will not compare the symbol name itself, but rather its hash on the hash table, the longer the symbol the more work the hash function has to do. And to give an idea of how long a name we’re talking about, take for instance the following real symbol; exported symbol, nonetheless, coming from gnash:

_ZN5gnash13iterator_findERN5boost11multi_index21multi_index_containerINS_8PropertyENS1_10indexed_byINS1_14ordered_uniqueINS1_13const_mem_funIS3_RKNS_9ObjectURIEXadL_ZNKS3_3uriEvEEEEN4mpl_2naESC_EENS5_INS1_3tagINS_12PropertyList8OrderTagESC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_EENS6_IS3_iXadL_ZNKS3_8getOrderEvEEEESC_EESC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_SC_EESaIS3_EEEi

gnash::iterator_find(boost::multi_index::multi_index_container, mpl_::na, mpl_::na>, boost::multi_index::ordered_unique, boost::multi_index::const_mem_fun, mpl_::na>, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, std::allocator >&, int)

The second line is filtered through the binutils demangler to provide the decorated name of the function for C++. If you cannot guess, one of the problems is that the mangling algorithm contains no shorthand for boost templates, contrarily to what it has for the standard library templates. I know I cannot pretend that this relates directly with the problems I see with C, but it shows two things: the first is that C++ has an unseen amount of complexity that even long-time developers fail to catch properly; the second is that ELF itself doesn’t seem well-designed to handle the way C++ is compiled into code; and this is all without even looking at the recent undocumented changes introduced in GLIBC to make sure that C++ is implemented correctly.

I wonder if I’ll ever be able to complete this demangler, and then start with the other schemes supported in ELF files…

How _not_ to fix GCC 4.4 bugs

So with GCC 4.4 and glibc 2.10, C++ support went, once again, stricter. Now, leaving aside all my possible comments on a language for which there is still an absolute vacuum of actual implementations years after publishing, let just look at what the problem is this time around.

The main issue, in which glibc 2.10 is also related, is that the C-style string functions now return pointers with the same constant modifier as they are given; so if you look for a characters in a constant string, it’ll return a constant string pointer, and vice-versa if you give it a variable string, it’ll return you a variable string pointer (more here if you’re bored).

This means that the following code will build fine with either GCC 4.3 or glibc 2.10 but will fail when both of those (or more recent) versions are used:

#include <cstring>

int main() {
  char *foo = strchr("ciao", 'a');
}

% g++-4.3.3 test-const.cc -o test-const    
% g++-4.4.0 test-const.cc -o test-const 
test-const.cc: In function ‘int main()':
test-const.cc:4: error: invalid conversion from ‘const char*' to ‘char*'

The error from the compiler is quite real: you’re mixing up different type of variables, although in this particular instance you’re not doing anything wrong with it, but for instance take the following code rather than the one shown earlier:

#include <cstring>

int main() {
  char *foo = strchr("ciao", 'a');
  *foo = 'e';
}

This code is trying to change something that it shouldn’t; in particular, given the pointer is now pointing inside a literal, which is then inside the .rodata section of the ELF and in a shared, non-writeable area of memory, when executed this will cause a segmentation fault (a crash, for those not used to this terminology). But it can get less obvious and more sneaky, since instead of a literal, you could have a parameter, declared constant.

Of course, whenever you have a variable or a parameter that is declared constant, but is not actually residing in read-only memory areas (like .rodata), you’re just a cast away from having it non-constant. But then you’d be seeing the cast, and that would be like a yellow light sign. On the other hand, with the old method of just having the function cast away the constant modifier, it was less obvious at first sight.

Okay so we know what the problem is, why the error was introduced, let’s go down to business with what the problem is. I have seen more than a few patches out there that, to make software build on GCC 4.4/glibc 2.10 simply cast away the constant modifier, C-style:

#include <cstring>

int main() {
  char *foo = (char*)strchr("ciao", 'a');
}

This is wrong. You should not do that. Why? Because you’re hiding a problem; in more than half the cases, the solution is simply to change the declaration of the variable:

#include <cstring>

int main() {
  const char *foo = strchr("ciao", 'a');
}

Of course, this does not cover all the cases; there are a few when the pointer is actually used to change memory areas. In those cases, since fixing the issue might be overkill, I’d highly suggest a different syntax:

#include <cstring>

int main() {
  char *foo = const_cast(strchr("ciao", 'a'));
  *foo = 'e';
}

This uses the explicit const_cast keyword from C++, and the very fact that it’s an eyesore in the code should be enough to scream “Workaround!”, which is what it is in truth.

So please, don’t just cast it away C-style, give it a bit more thoughts, please!