While doing microoptimisations on xine, I started considering one particular optimisation that might actually make some difference on the grand plan of things, is to change some array of pointers to array of arrays.
Talking on #gentoo-it, I realised that it’s not clear to everybody that there is a substantial difference between these two forms:
The first is an array of pointers containing three pointers to the three literals, that will be anonymous entries in the .rodata section of a file.
The second is an array of arrays of characters, a single object containing four literals.
It’s easier to understand this when you actually see the assembler representation of the two above.
The first, in AMD64 assembler is compiled this way:
.section .rodata .LC0: .string "foo1" .LC1: .string "foo2" .LC2: .string "foo3" .data .align 16 .type foo, @object .size foo, 24 foo: .quad .LC0 .quad .LC1 .quad .LC2Code language: CSS (css)
While the second is compiled this way:
.section .rodata .align 16 .type foo, @object .size foo, 24 foo: .string "foo1" .zero 3 .string "foo2" .zero 3 .string "foo3" .zero 3 .text .type print_foo, @functionCode language: CSS (css)
As you can see while the size (in this case) is the same (if I avoided the padding to
 the strings, setting them to
 which is the minimum required, the second would be smaller. As we’ll see, the first is actually using more memory, but that’s up for later.
In the first case, three strings are defined, they get an automatic label (LC0, 1 and 2) assigned, and then at the
foo object, the addresses of the three strings are listed as quad (quad-words, 64-bit sized words, as are addresses on 64-bit architectures).
I’m not going to show the assembler code to access a member of the two objects, but I’ll try to describe it in words, as that’s also quite interesting:
foo;Code language: CSS (css)
Accessing the second member of
foo returns a string (a pointer to characters, or an array of characters). But the way this is achieved with the two objects is different. In both cases there are actions that are actually calculated at buildtime, and others that need to be done at runtime. I won’t try to separate them, as they will probably change depending on optimisations and on the architecture used.
In the first object, the offset from the address of foo is calculated, this is 2 (the index to read) multiplied per the size of a pointer (8 bytes in 64-bit achitectures). Then the 8-bytes forming the pointer that is returned are read at the address (
In the second object, the offset from the address of foo is calculated, once again, but this time it’s 2 (the index to read) multiplied per the size of the sub arrays (again in this case, 8-bytes). Then the pointer that is returned is (
It might seem the same, but if you look closely, you can see that in the first case the pointer is read at (
foo+offset), while in the latter the pointer is (
foo+offset). This actually saves you from an operation.
As I said, the size of
foo in both cases is the same, but the first case takes up more memory. This because in the first case
foo contains the pointers to the strings, not the strings themselves, so you have to sum the size of the array itself with the size of the strings to actually have the occupied space. The result is then (8 (size of a pointer) * 3) + (5 (size of a string of 4 characters) * 3), which turns out to be 39; for the second case instead we just have 8 (size of an array of characters) * 3, which is 24.
Now let me explain why in the example I’m using
 as size rather than using
 which is the actual space the three strings in the example require. As you’ve read, the offset is calculated by multiplying the size of the string by the index we’re accessing. If the size of the array is 5, it would be 5*i, if it’s 8, it would be 8*i. Multiplication is an operation that even on modern CPUs take a bit of time, but even on the oldest 8086, 8*i can be simplified.
The value 8 is a power of two, it’s exactly 2 to the third power. As computers use a binary representation of data, multiplying for a power of two is just a matter of adding a zero to the right of a number (like we do in usual math when multiplying by ten). In programming that operation is called a (binary) left shift, and it’s usually implemented through a very fast CPU instruction.
This means that the operation
8*i can be replaced by
i<<3 (left shift by three bits), which is faster than the
5*i in almost every possible CPU. So by padding the elements to a power of two boundary, access is made faster. And as we see in this case, we’re still using less memory than we would have with the “usual” method.
Now, there is another thing that you might have noticed in the two assembler listings: in the first one, the section is first set to
.rodata to write down the three strings, and then it’s moved to
.data to define the
foo array. This is due to the way we declared
declares a static array of pointers to constant characters. So the characters are constant, and thus go to
.rodata, but the pointers are not, so they need to go in .data, as you may modify them (and thus trigger a COW – Copy On Write – of the section’s memory page). This is actually a classic mistake and can often generate dirty RSS pages due to the possible COW.
In the case of the array, we declared it as
which declares a static constant array of arrays. In this case, the constant applies to the array and its members down to the characters, so the whole object is written into
.rodata, making it shareable, and avoiding a possible COW trigger. Quite nicer.
As not always you can use the array of arrays (because the strings might have too different sizes, and then padding them to a power of 2 can waste way too much memory), you can change the array of pointers declaration in this way:
which declares a static array of constant pointers to constant characters. As you can guess, now that even the pointers are constant, the result is the following code:
.section .rodata .LC0: .string "foo1" .LC1: .string "foo2" .LC2: .string "foo3" .align 16 .type foo, @object .size foo, 24 foo: .quad .LC0 .quad .LC1 .quad .LC2Code language: CSS (css)
As you can see now everything is once again in
.rodata section, no COW triggers around.
Now I hope this entry will be useful for those of you who care about these microscopic optimisations, and for those of you who are interested in what the compilers produce at the end of the day.