2007-12-19

Array of pointers and array of arrays

While doing microoptimisations on xine, I started considering one particular optimisation that might actually make some difference on the grand plan of things, is to change some array of pointers to array of arrays.

Talking on #gentoo-it, I realised that it’s not clear to everybody that there is a substantial difference between these two forms:

static const char *foo[] = { "foo1", "foo2", "foo3" };

static const char foo[][8] = { "foo1", "foo2", "foo3" };Code language: JavaScript (javascript)

The first is an array of pointers containing three pointers to the three literals, that will be anonymous entries in the .rodata section of a file.

The second is an array of arrays of characters, a single object containing four literals.

It’s easier to understand this when you actually see the assembler representation of the two above.

The first, in AMD64 assembler is compiled this way:

        .section        .rodata
.LC0:
        .string "foo1"
.LC1:
        .string "foo2"
.LC2:
        .string "foo3"
        .data
        .align 16
        .type   foo, @object
        .size   foo, 24
foo:
        .quad   .LC0
        .quad   .LC1
        .quad   .LC2
Code language: CSS (css)

While the second is compiled this way:

        .section        .rodata
        .align 16
        .type   foo, @object
        .size   foo, 24
foo:
        .string "foo1"
        .zero   3
        .string "foo2"
        .zero   3
        .string "foo3"
        .zero   3
        .text
        .type   print_foo, @function
Code language: CSS (css)

As you can see while the size (in this case) is the same (if I avoided the padding to [8] the strings, setting them to [5] which is the minimum required, the second would be smaller. As we’ll see, the first is actually using more memory, but that’s up for later.

In the first case, three strings are defined, they get an automatic label (LC0, 1 and 2) assigned, and then at the foo object, the addresses of the three strings are listed as quad (quad-words, 64-bit sized words, as are addresses on 64-bit architectures).

I’m not going to show the assembler code to access a member of the two objects, but I’ll try to describe it in words, as that’s also quite interesting:

foo[2];Code language: CSS (css)

Accessing the second member of foo returns a string (a pointer to characters, or an array of characters). But the way this is achieved with the two objects is different. In both cases there are actions that are actually calculated at buildtime, and others that need to be done at runtime. I won’t try to separate them, as they will probably change depending on optimisations and on the architecture used.

In the first object, the offset from the address of foo is calculated, this is 2 (the index to read) multiplied per the size of a pointer (8 bytes in 64-bit achitectures). Then the 8-bytes forming the pointer that is returned are read at the address (foo+offset).

In the second object, the offset from the address of foo is calculated, once again, but this time it’s 2 (the index to read) multiplied per the size of the sub arrays (again in this case, 8-bytes). Then the pointer that is returned is (foo+offset).

It might seem the same, but if you look closely, you can see that in the first case the pointer is read at (foo+offset), while in the latter the pointer is (foo+offset). This actually saves you from an operation.

As I said, the size of foo in both cases is the same, but the first case takes up more memory. This because in the first case foo contains the pointers to the strings, not the strings themselves, so you have to sum the size of the array itself with the size of the strings to actually have the occupied space. The result is then (8 (size of a pointer) * 3) + (5 (size of a string of 4 characters) * 3), which turns out to be 39; for the second case instead we just have 8 (size of an array of characters) * 3, which is 24.

Now let me explain why in the example I’m using [8] as size rather than using [5] which is the actual space the three strings in the example require. As you’ve read, the offset is calculated by multiplying the size of the string by the index we’re accessing. If the size of the array is 5, it would be 5*i, if it’s 8, it would be 8*i. Multiplication is an operation that even on modern CPUs take a bit of time, but even on the oldest 8086, 8*i can be simplified.

The value 8 is a power of two, it’s exactly 2 to the third power. As computers use a binary representation of data, multiplying for a power of two is just a matter of adding a zero to the right of a number (like we do in usual math when multiplying by ten). In programming that operation is called a (binary) left shift, and it’s usually implemented through a very fast CPU instruction.

This means that the operation 8*i can be replaced by i<<3 (left shift by three bits), which is faster than the 5*i in almost every possible CPU. So by padding the elements to a power of two boundary, access is made faster. And as we see in this case, we’re still using less memory than we would have with the “usual” method.

Now, there is another thing that you might have noticed in the two assembler listings: in the first one, the section is first set to .rodata to write down the three strings, and then it’s moved to .data to define the foo array. This is due to the way we declared foo:

static const char *foo[] = { "foo1", "foo2", "foo3" };Code language: JavaScript (javascript)

declares a static array of pointers to constant characters. So the characters are constant, and thus go to .rodata, but the pointers are not, so they need to go in .data, as you may modify them (and thus trigger a COW – Copy On Write – of the section’s memory page). This is actually a classic mistake and can often generate dirty RSS pages due to the possible COW.

In the case of the array, we declared it as

static const char foo[][8] = { "foo1", "foo2", "foo3" };Code language: JavaScript (javascript)

which declares a static constant array of arrays. In this case, the constant applies to the array and its members down to the characters, so the whole object is written into .rodata, making it shareable, and avoiding a possible COW trigger. Quite nicer.

As not always you can use the array of arrays (because the strings might have too different sizes, and then padding them to a power of 2 can waste way too much memory), you can change the array of pointers declaration in this way:

static const char *const foo[] = { "foo1", "foo2", "foo3" };Code language: JavaScript (javascript)

which declares a static array of constant pointers to constant characters. As you can guess, now that even the pointers are constant, the result is the following code:

        .section        .rodata
.LC0:
        .string "foo1"
.LC1:
        .string "foo2"
.LC2:
        .string "foo3"
        .align 16
        .type   foo, @object
        .size   foo, 24
foo:
        .quad   .LC0
        .quad   .LC1
        .quad   .LC2Code language: CSS (css)

As you can see now everything is once again in .rodata section, no COW triggers around.

Now I hope this entry will be useful for those of you who care about these microscopic optimisations, and for those of you who are interested in what the compilers produce at the end of the day.

Flameeyes 2845 posts

Comments 12

Andrew says:

2007-12-19 at 20:55

Very nice write up, I was told that the [] and the * were the same when being taught C++ (guess teachers aren’t infallible, eh?). That line with “static const char *const foo[]” seems dead useful, I might start using it in my own projects.

Reply
moesasji says:

2007-12-19 at 23:46

Your piece reminds me again why the book “inside the c++ object model” by stanley B. Lipman always gives me a head-ache when I attempt to read it.However this post really leaves me wondering how important these kind of micro-optimizations are for the xine-project. I can imagine that it becomes relevant in a crucial loop or for embedded systems where memory is tight, but is this really that important for xine as a media-player??

Reply
Flameeyes says:

2007-12-19 at 23:57

It’s important for every application, as sometimes even just with these micro optimisations you can get rid of whole pages of dity RSS from your application.The idea behind this is to make sure that the compiler won’t create dirty RSS (copy-on-write pages) when you don’t really need it, just because you forgot a const.And there are many, many, many lookup tables, especially for a multimedia software.

Reply
Kevin Bowling says:

2007-12-20 at 11:39

People like Flameeyes and this kind of optimization are what keep free software relevant and superior. I’m not a zealot, but it just isn’t something you see in commercial software. Less eyes and peer review, deadlines, and closed doors prevent it.Case in point:Thinking like this doesn’t go into Microsoft products. As a result each release balloons in system requirements and effectively counteracts Moore’s Law.The result? Millions of obsolete computers are tossed in the landfill while they are perfectly good. This further leads to the low quality of modern PCs. Compare an early IBM PC to a modern clone. The difference in build quality is staggering. Manufacturing technology has advanced greatly but reliability has not.E-waste is a real problem. Shitty programming (and languages that encourage) are part of it.

Reply
Mart Raudsepp says:

2007-12-24 at 07:32

If you have larger strings, or especially if you have strings with wildly differing length, then you can also use a long constant string that contains all the strings and a separate array that stores offsets into that array.For example:

static const char foo[] = "Foo" "Longer string" "Bar";static const int foo_index[3] = { 0, 4, 18 };and then you can just do#include <stdio.h> #define N_ELEMENTS(array) (sizeof((array)) / sizeof ((array)[0])) int main(void){ int i; for(i = 0; i < N_ELEMENTS(foo_index); ++i) { printf("%sn", foo + foo_index[i]); }}

There’s perl and other scripts to auto-generate such an array pair from a more readable form. Some low-level GNOME libraries have one, and Ulrich Dreppers DSO Howto does too (they differ, having different pros and cons).Thought it might also be useful to someone if dealing with largely differently sized strings :)Of course if they are similarly sized, then it’s better to use the method described here. Sometimes it even makes sense to split the array up into two parts – one for the small strings that could use this method, and one for the larger variable sized ones using this method or just a different size that fits them all without wasting much

Reply
Mart Raudsepp says:

2007-12-24 at 07:34

I mean the method described in the original post of course, when referring to similarly sized (length) strings

Reply
Mart Raudsepp says:

2007-12-24 at 07:48

duh, and last string in foo doesn’t need a in the end of course, so it would be

static const char foo[] = "Foo" "Longer string" "Bar";

Sorry for the comment spam 😛

Reply
Flameeyes says:

2007-12-24 at 07:56

Mart, the long string method does not really change much compared to the use of an array of constant pointers to constant character (const char *const), as you need to get first the offset from the offsets array, then sum it to the address of the other, then dereference.. you still have two dereferences to get to the actual data, and splitting up the strings might actually make it simpler for the compiler to split them in different sections if you have more than 4KB of .rodata.On the other hand, if you read the strings in sequence, that might actually be faster; this is probably why ELF format does use that format for .strtab and .dynstr (string table and dynamic string table); it also uses another trick too, if you have two strings, say “bar” and “foobar”, it saves on the string table just “foobar” and uses the offset of that plus three to just get “bar”.While such complexity is worth the hassle for something like the loader, I find it a bit too much for a standard program, myself.

Reply
Mart Raudsepp says:

2007-12-25 at 08:08

My described methods main point is to move the string data from .data.ro or .data to .rodata so it would be shared. So you are right, it’s not so much about performance than making sure the library can use up less number of 4KB non-shared pages

Reply
Mart Raudsepp says:

2007-12-25 at 08:11

as for the tricks that the loader does if you have “bar” and “foobar”, I think a good compiler would do that to at least string literals (that aren’t in an array) on its own. It’s also another reason why declaring things as “const” as possible is important, if they aren’t, it can’t do such memory optimizations for constant strings under the hood so well

Reply
Flameeyes says:

2007-12-25 at 08:19

You probably refer to .data.relro rather than .data.ro.But the form “const char *const” put the array and the strings to .rodata, not .data.relro, so I fail to see what the gain is between the two…

Reply
mr.t says:

2008-01-08 at 08:53

i want you to know how much i love u

Reply

Popular tags

The Latest
View All

Identity Crisis In The Age of AI

Bloke On A Trike

This Blog, Brought To You Through AI! (Well, Kinda)

Did I Finally Solve My Audiobook Woes? Well, Maybe.

Array of pointers and array of arrays

Comments 12

Leave a ReplyCancel reply

Array of pointers and array of arrays

Share this:

Comments 12

Leave a ReplyCancel reply

Related Posts