Ruby-Elf and multiple compilers

I’ve written about supporting multiple compilers; I’ve written about testing stuff on OpenSolaris and I have written about wanting to support Sun extensions in Ruby-Elf.

Today I wish to write about the way I’m currently losing my head getting the Ruby-Elf testsuite to apply to other compilers besides GCC. Since I’ve been implementing some new features for missingstatic upon request (by Emanuele “exg”), I decided to add some more tests, in particular covering the Sun and Intel compilers, which I decided to support for FFmpeg at least.

The new tests not only apply the already-present generic ELF tests (rewritten and improved, so that I can extend them much more quickly) to files built with ICC and Sun Studio under Linux/AMD64, but also check the nm(1)-like code against a catalogue of different symbols.

The results are interesting in my view:

  • Sun Studio does not generate .data.rel sections, it only generates a single .picdata section, which is not divided between read-only and read-write (which might have bad results with prelinking);
  • Sun Studio also emits uninitialised non-static TLS variables as common symbols rather than in .tbss (honestly, this sounds like a mistake to me!);
  • the Intel C Compiler enables optimisation by default;
  • it also optimises out unused static symbols with -O0;
  • and even with __attribute__((used)) it optimises out static uninitialised variables (both TLS and non-TLS);
  • oh, and it appends a “.0” suffix to the name of unit-static data symbols (I guess to tell them apart from function-static symbols, which usually have a code appended to them);
  • and last but not least: ICC emits neither a .data.rel section nor a .picdata section: everything is emitted in the .data section. This means that if you’re building something with ICC and expect cowstats to work on it, you’re out of luck; but it’s not just that, it also means that prelinking will not help you at all to reduce memory usage, just a bit to reduce startup time.

Fixing up some things for Sun Studio was easy, and now cowstats works fine even on Sun Studio-compiled code; taking care of ICC’s quirks was not so easy, and it also meant wasting quite some time.

On the other hand, there is one new feature in missingstatic: it now shows the nm(1)-like symbol code next to the symbols identified as missing the static modifier, so you can tell whether each one is a function, a constant, or a variable.

And of course, there are two manpages: missingstatic(1) and cowstats(1) (DocBook 5 rulez!) that describe the options and some of the workings of the two tools; hopefully I’ll write more documentation in the next weeks, which will help Ruby-Elf be accepted and used. Once I have enough documentation about it I might actually decide to release something. I’m also considering the idea of routing --help to man, like git commands do.

Reminding a weakness of Prelink

For extra content about this entry, please refer to the previous one which talks about array of strings and PIC.

As I said in the other post, prelink can reduce the amount of dirty RSS pages due to COW of PIC code. Since prelink assigns every library a predefined load address in memory, either truly unique or unique within the scope of the set of programs known to be able to load that library, there is no COW (Copy-on-Write) at load time: the loader doesn’t have to change the addresses loaded from the file, and is thus able to share the same pages with many processes. This is how prelink saves (sometimes a lot of) memory.

Unfortunately there is one big problem, especially with modern software architectures: many programs now use runtime-loaded plugins for their functionality; the whole KDE architecture is based on this, even for KDE 4, as is xine’s, and many others.

The problem is that prelink can’t really take the plugins into account, as it doesn’t know about them. For instance, it can’t understand that amarok is able to load the xine engine, so amarokapp is not going to be prelinked against libxine.so. Likewise, it can’t understand that libxine.so is able to load xineplug_decode_ff.so, which in turn depends on libavcodec.so. This means, for instance, that when using the -m switch prelink could assign libqt3-mt.so and libavcodec.so the same address, causing a performance hit rather than an improvement at runtime, when the runtime loader will have to relocate all the code of libavcodec.so, triggering a COW.

The same is true for almost all scripting languages that use C-compiled extensions: Perl, Ruby, Python; you can’t tell that the interpreter can load them just by looking at the ELF file, which is all prelink does.

A possible way around this is to define post-facto, that is after compilation, which shared objects a program can load. It could probably be done through a special .note section in the ELF file and a modified prelink, but I’m afraid it would be quite difficult to implement properly, especially in ebuilds. On the other hand, it might give quite a performance improvement; as I said, today’s software architectures are often based on on-demand loading of code through plugins, so it could be quite interesting.

Some more about arrays of strings

If you want to read this entry, make sure you read my previous entry about array of strings too, possibly with comments.

Mart (leio) commented about an alternative to using const char* const for arrays of strings. While it took me a while to get it, he has a point. First let me quote his comment:

If you have larger strings, or especially if you have strings with wildly differing length, then you can also use a long constant string that contains all the strings and a separate array that stores offsets into that array.

For example:

static const char foo[] =
    "Foo\0"
    "Longer string\0"
    "Bar";

static const int foo_index[3] = { 0, 4, 18 };

and then you can just do

#include <stdio.h>
#define N_ELEMENTS(array) (sizeof((array)) / sizeof ((array)[0])) 

int main(void)
{
    int i;
    for(i = 0; i < N_ELEMENTS(foo_index); ++i)
    {
        printf("%s\n", foo + foo_index[i]);
    }
}

There are perl and other scripts to auto-generate such an array pair from a more readable form. Some low-level GNOME libraries have one, and Ulrich Drepper’s DSO Howto does too (they differ, having different pros and cons).

Thought it might also be useful to someone if dealing with largely differently sized strings :)

Of course if they are similarly sized, then it’s better to use the method described here. Sometimes it even makes sense to split the array up into two parts – one for the small strings that could use this method, and one for the larger variable-sized ones using this method, or just a different size that fits them all without wasting much space.

For a simple example like this, the code he published is actually quite pointless, as the method of using const char* const works just the same way, putting everything in .rodata.

Where this makes sense is in shared objects built as PIC. In those cases, even const char* const doesn’t get into .rodata, but goes into .data.relro instead. This is due to the way that type of array is implemented: as we’ve seen for the const char* case, the strings are saved in the proper .rodata section, but the pointers in the array are saved to .data as non-constant data.

When we’re using PIC, the address at which a section is loaded is not known at compile time; it’s the ELF loader (or whatever loader is used for the current executable format) that has to fill in the addresses of the variables. This means that while, for the purposes of the C language, the pointers are constant, they are filled in at runtime, and thus cannot reside on read-only shared pages. They also can’t be shared because two processes might load the same library at different addresses; this is why PIE is useful together with address randomisation in hardened setups.

When using PIC, the code emitted would be:

        .section        .rodata
.LC0:
        .string "bar"
.LC1:
        .string "foobar"
        .section        .data.rel.ro.local,"aw",@progbits
        .align 16
        .type   foo, @object
        .size   foo, 16
foo:
        .quad   .LC0
        .quad   .LC1

What this method tries to address is the constant 4KiB dirty RSS page that every process loading a given library built with PIC would have, just to keep the .data.rel.ro information. So Mart’s method uses a bit more processing time (an extra load) compared to the array of arrays of characters, coming more or less to the same performance as const char *const, but allowing for variable-sized strings without padding and without wasting a 4KiB memory page, trading that for readability.

It’s not entirely a bad idea, actually, and I should consider it more for xine-lib, although I doubt I can spare the 4KiB page in all cases, as having a structure to pass information like ID, description and other stuff like that ends up propping up more data in .data.relro anyway.

On the other hand, .data.relro pages are subject to COW, but the COW can be avoided by using prelink: prelinking gives a suitable default address to each shared object, which in turn should fill the .data.relro sections with the right values already. Of course, prelink does not work with PIE and randomised addresses.

I think we should try to make use of array of arrays of characters whenever possible anyway ;) Faster and easier.