Some more about arrays of strings

Before reading this entry, make sure you have read my previous entry about arrays of strings too, ideally together with its comments.

Mart (leio) commented on an alternative to using const char* const for arrays of strings. While it took me a while to get it, he has a point. First let me quote his comment:

If you have larger strings, or especially if you have strings with wildly differing length, then you can also use a long constant string that contains all the strings and a separate array that stores offsets into that array.

For example:

static const char foo[] =
    "Foo\0"
    "Longer string\0"
    "Bar";

static const int foo_index[3] = { 0, 4, 18 };

and then you can just do

#include <stdio.h>
#define N_ELEMENTS(array) (sizeof((array)) / sizeof ((array)[0]))

int main(void)
{
    size_t i;
    /* Each offset points at the start of one NUL-terminated string in foo. */
    for(i = 0; i < N_ELEMENTS(foo_index); ++i)
    {
        printf("%s\n", foo + foo_index[i]);
    }
    return 0;
}

There are Perl and other scripts to auto-generate such an array pair from a more readable form. Some low-level GNOME libraries have one, and Ulrich Drepper’s DSO Howto does too (they differ, having different pros and cons).

Thought it might also be useful to someone dealing with strings of widely differing sizes :)

Of course, if they are similarly sized, then it’s better to use the method described here. Sometimes it even makes sense to split the array up into two parts: one for the small strings, which could use the fixed-size method, and one for the larger variable-sized ones, using the offset method or just a different fixed size that fits them all without wasting much.
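The comment mentions scripts that generate the string/offset pair. As a hedged alternative that is not taken from any of those tools, the C preprocessor can compute the offsets itself with an X-macro and offsetof. Every name here (FOO_STRINGS, struct foo_pool, foo_strings, foo_offsets, foo_get) is made up for illustration, and the sketch assumes the compiler adds no padding between char array members, which holds in practice since they have alignment 1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical string list: one X(member_name, text) entry per string. */
#define FOO_STRINGS(X) \
    X(foo,    "Foo") \
    X(longer, "Longer string") \
    X(bar,    "Bar")

/* One char array member per string, sized to include the terminating NUL. */
struct foo_pool {
#define X(name, text) char name[sizeof(text)];
    FOO_STRINGS(X)
#undef X
};

/* The pool itself: all strings back to back, no pointers involved. */
static const struct foo_pool foo_strings = {
#define X(name, text) text,
    FOO_STRINGS(X)
#undef X
};

/* Offsets computed by the compiler instead of being counted by hand. */
static const unsigned short foo_offsets[] = {
#define X(name, text) offsetof(struct foo_pool, name),
    FOO_STRINGS(X)
#undef X
};

/* Return the i-th string by adding its offset to the start of the pool. */
static const char *foo_get(size_t i)
{
    assert(i < sizeof(foo_offsets) / sizeof(foo_offsets[0]));
    return (const char *)&foo_strings + foo_offsets[i];
}
```

Calling foo_get(1) yields "Longer string", and the generated offsets are the same 0, 4, 18 as in the hand-written table quoted above.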

In a simple example like this one, the code he published is actually quite pointless, since a const char* const array works just the same way, putting everything in .rodata.

Where this makes sense is in shared objects built as PIC. In those cases, even const char* const doesn’t get to .rodata, but goes into .data.rel.ro. This is due to the way that type of array is implemented. As we’ve seen for the const char* case, the strings are saved in the proper .rodata section, but the pointers in the array were saved to .data as being non-constant.

When we’re using PIC, the address at which the sections are loaded is not known at compile time; it’s the ELF loader (or whatever loader is used for the current executable format) that has to fill in the addresses of the variables. This means that while the pointers are constant as far as the C language is concerned, they are filled in at runtime, and thus cannot reside on the read-only shared pages. They also can’t be shared because two processes might load the same page at different addresses; this is what makes PIE useful together with address randomisation in hardened setups.

When using PIC, the code emitted would be:

        .section        .rodata
.LC0:
        .string "bar"
.LC1:
        .string "foobar"
        .section        .data.rel.ro.local,"aw",@progbits
        .align 16
        .type   foo, @object
        .size   foo, 16
foo:
        .quad   .LC0
        .quad   .LC1
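The listing above has two .quad entries pointing at "bar" and "foobar", so the source behind it is presumably a two-element pointer array. Here is a minimal sketch of that array next to an offset-based layout for the same two strings; the names foo_ptrs, foo_pool, foo_offsets and the two accessors are assumptions made for illustration, not code from the original entry.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define N_ELEMENTS(array) (sizeof(array) / sizeof((array)[0]))

/* Pointer table: every element holds an address, so in PIC code each
 * one needs a relocation filled in by the loader. */
static const char *const foo_ptrs[] = { "bar", "foobar" };

/* Offset layout: one pool of NUL-separated strings plus plain integers.
 * Neither array stores an address, so nothing has to be relocated. */
static const char foo_pool[] =
    "bar\0"
    "foobar";
static const unsigned char foo_offsets[] = { 0, 4 };

/* Pointer table access: load the pointer and return it. */
static const char *foo_from_ptrs(size_t i)
{
    assert(i < N_ELEMENTS(foo_ptrs));
    return foo_ptrs[i];
}

/* Offset access: load the offset, then add it to the pool's address. */
static const char *foo_from_offsets(size_t i)
{
    assert(i < N_ELEMENTS(foo_offsets));
    return foo_pool + foo_offsets[i];
}
```

Both accessors return the same text for the same index; the difference is only in where the tables can live once the object is built as PIC.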

What this method tries to address is the fixed 4KiB dirty RSS page that every process loading a given library built with PIC would have, just to keep the .data.rel.ro information. So Mart’s method uses up a bit more processing time (an extra load) compared to the array of arrays of characters, performing more or less the same as const char *const, while allowing for variable-sized strings without padding and without wasting a 4KiB memory page, trading that for readability.

It’s not a bad idea at all, actually, and I should consider it more for xine-lib, although I doubt I can save the 4KiB page in all cases, as having a structure to pass information like ID, description and other stuff like that ends up putting more stuff in .data.rel.ro anyway.

On the other hand, .data.rel.ro is a COW page, but the copy could be avoided by using prelink: prelinking gives a shared object a suitable default address, which in turn should fill the .data.rel.ro sections with the right values already. Of course, prelink does not work with PIE and randomised addresses.

I think we should try to make use of array of arrays of characters whenever possible anyway ;) Faster and easier.
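For reference, a minimal sketch of what an array of arrays of characters looks like for two short strings, with a helper that counts how many bytes go to padding. The names FOO_WIDTH, foo_fixed, foo_fixed_get and foo_fixed_padding are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fixed row width: strlen("foobar") + 1 for the terminating NUL. */
#define FOO_WIDTH 7

/* Every row is FOO_WIDTH bytes; shorter strings are padded with NULs.
 * The array contains only characters, no pointers. */
static const char foo_fixed[][FOO_WIDTH] = { "bar", "foobar" };

#define FOO_COUNT (sizeof(foo_fixed) / sizeof(foo_fixed[0]))

/* Access is a plain address computation: base + i * FOO_WIDTH. */
static const char *foo_fixed_get(size_t i)
{
    assert(i < FOO_COUNT);
    return foo_fixed[i];
}

/* Bytes in the table that are padding rather than string data. */
static size_t foo_fixed_padding(void)
{
    size_t used = 0;
    for (size_t i = 0; i < FOO_COUNT; ++i)
        used += strlen(foo_fixed[i]) + 1;
    return sizeof(foo_fixed) - used;
}
```

Here the table takes 14 bytes, 3 of which are padding after "bar". That cost stays small while the strings are similarly sized and grows quickly when one of them is much longer than the rest.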
