For A Parallel World. Home Exercise n.1: a drop-in dynamic replacement for memcpy()

Since I’ve written about OpenMP I’ve been trying to find time to test it on real usage scenarios; unfortunately between health being far from optimal the past few days with general aches, and work piling up, I haven’t been able to get to work on it at all. It would be much nicer if I could get a job that would allow me to spend time on these things, but I don’t want to rant about that since I’m happy to have jobs from time to time, as it is.

Yesterday I toyed a bit around with OpenMP and xine-lib, I wanted to try implementing a memcpy() replacement that could use OpenMP to work in parallel, it’s especially useful for multimedia applications. Beside some issues with autotools and OpenMP which I’m going to address in a future post, I ended up with a few more things in my mind (the usual problem with trying out new things: you want to achieve one result, and you get material for other three tests; now I know why Mythbusters starts with one idea and then end up doing four or five similar tests).

My parallel memcpy() replacement was just as lame as my byteswapping attempt from before, a single for parallelised with the proper pragma code. Just to make it not too lame, I used 64-bit copies (unaligned, but I would expect that not to matter on x86-64 at least, it was just a test). The reason why I didn’t go for a less lame method is that from a second test on byteswapping, which I didn’t have time to write about yet, I found that using more complex tricks does not really help. While splitting the memory area to swap in X equally-sized areas, with X being the maximum number of threads OpenMP is managing, identified dynamically, caused a slight improvement on the huge areas (half a gigabyte and a full gigabyte), it didn’t make any acceptable difference (considering the cost of the more complex code) on smaller blocks, and I really doubt that such huge memory areas would ever go swapped all at once. Splitting the area in page-sized (4KiB) blocks actually made the code slower, I guess, since I didn’t go deeper to check, that the problem there is that the threads are usually all executed on either the same core or on anyway on the same CPU, which means that the pages are all mapped on the memory directly connected to that CPU; splitting it up in pages might make it swap between the two different CPUs and thus make it slower. I’ll look more deeply into that when I have time.

Unfortunately, using this particular memcpy() implementation didn’t let me start xine properly, I probably made a mistake, maybe unaligned 64-bit copies on x86-64 don’t work, just like on any other architecture, but I didn’t go around trying to fix that for the very same reason why I’m writing this post.

It turns out that xine, just like MPlayer, and other multimedia application, have their own implementation of “fast memcpy()”, using SIMD instructions (MMX, MMXEXT, SSE, AltiVec, …). They benchmark at runtime which one has the best result (on my system it’s either the Linux kernel implementation, not sure which version, or the MMX version), and then they use that. This has some problems that are obvious, and some that are much less obvious. The first problem is that the various implementations do have to take care of some similar issues which cause code duplication (handling of small memory area copies, handling of unaligned copies and so on). The second is much more subtle and it’s what I think is a main issue to be solved.

When a programmer in a C program uses functions like memcpy(), strlen() and others, the compilation process (with optimisations) will hit some particular code called “builtins”. Basically the compiler knows how to deal with it, and emits different machine code depending on the way the function is called. This usually happens when the parameters to the call are known at build time, because they are either constant or can be derived (for static functions) from the way the function is called. How this affects mathematical functions and functions like strlen() can be better understood reading an old article of mine; for what concerns memcpy(), I’ll try to be brief but explain it here.

Let’s take a very easy function that copies an opaque type that is, in all truth, a 64-bit data field:

#include 

void copy_opaque(void *to, void *from) {
  memcpy(to, from, 8);
}

Building this code on x86-64 with GCC 4.3 and no optimisation enabled will produce this code:

copy_opaque:
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $16, %rsp
        movq    %rdi, -8(%rbp)
        movq    %rsi, -16(%rbp)
        movq    -16(%rbp), %rsi
        movq    -8(%rbp), %rdi
        movl    $8, %edx
        call    memcpy
        leave
        ret

As you can see there is a call to memcpy() after setting up the parameters, just like one would expect. But turn on the optimisation with -O2 and the resulting code is quite different:

copy_opaque:
        movq    (%rsi), %rax
        movq    %rax, (%rdi)
        ret

The function has been reduced to two instructions, plus the return, with no stack usage. This because the compiler knows that for 64-bit copies, it can just emit straight memory access and simplify the code quite a bit. The memcpy() function is not a static inline, but the compiler knows its interface and can produce optimised code just fine when using builtin. Similarly, when using -O2 -fno-builtin to ask the compiler not to use builtins knowledge, for instance because you’re using special access functions, you can see that the resulting code is still composed of two instructions, but of a different type:

copy_opaque:
        movl    $8, %edx
        jmp     memcpy

Let’s go back to the builtin though, since that’s what it’s important to know before I can explain why the dynamically-chosen implementation in xine and company is quite suboptimal.

When you change the size of the memory area to copy in copy_opaque() from 8 to a different constant, you can see that the code changes accordingly. If you use a number that is not a multiple of 8 (that is the biggest size that x86-64 can deal with without SIMD), you can see that the “tail” of the area is copied using smaller move operations, but it’s still expanded. If you compare the output with multiple power-of-two values, you can see that up to 128 it inlines multiple movq instructions, while starting with 256, it uses rep movsq. With very big values, like (1 << 20), the compiler emits a straight memcpy() call. This is because the compiler can assess the overhead of the call and decide when it’s big enough to use a function rather than inlining code.

It can also decide this based on what type of optimisation is requested, for instance I said above that rep movsq starts to get used after the value 256 (1 << 8), but that was intended with the -O2 level; with -Os, it’s already when you have more than two 64-bit words.

Since the library functions like memcpy() and similar are very widely used, the fact that the compiler can emit much simpler code is very useful. But this works just as long as the compiler knows about them. As I said, turning off the builtin replacement will cause the code to be compiled “literally” with a call to the function, which might have a much higher overhead than a straight copy. Now it might be quite easier to grasp what the problem is with the dynamic memcpy() replacement used by xine and other software.

Let’s change the code above to something like this:

#include 

extern void *fast_memcpy(void *to, void *from, size_t foo);

void copy_opaque(void *to, void *from, size_t foo) {
  fast_memcpy(to, from, 8);
}

Now, even turning on the optimisations, won’t make any difference, the compiler will always emit a call to memcpy():

copy_opaque:
        movl    $8, %edx
        jmp     fast_memcpy

As you might guess, this is certainly slower than the straight copy that we had before, even if the memcpy() replacement is blazing fast. The jump will also require a symbol resolution since fast_memcpy() is not statically defined, so it’ll have to pass through the PLT (Procedure Linking Table) which is an expensive operation. Even if the symbol were defined internally to the same library, this would still most likely cause a call to the GOT (Global Offset Table) for shared objects.

By redefining the memcpy() function, xine and others are actually slowing the code down, at least when the size of the copy is known, a constant at build time. GCC extensions actually allow to define a macro, or even better a static inline function, that can discern whether a compile-time constant is used, and then fall back to the original memcpy() call, which the compiler will mangle as it prefers, but this is quite complex, and in my opinion not worth the bother.

Why do I say this? Well the first issue is that sometimes even if a value is not properly a constant at build-time, the compiler can find some particular code path where the function can be replaced, and thus emit adaptive code. The second is that you might just as well always use properly optimised memcpy() functions when needed, and if the C library does not provide anything as fast, you just need to use the Force ELF.

When the C library does not provide functions optimised for your architecture, for compatibility or any other reason, you can try to replace them through the ELF feature called symbol interposing, which basically work in the same as symbol collisions (I have some “slides” I’ve been working on for a while on the subject, but I’ll talk more extensively about this in a future post), and allows to intercept or replace calls to C library functions. It’s the same method used to implement the sandbox used by Portage, or the OSS wrappers for ALSA, PulseAudio, ESounD and so on.

What I’d like to see is a configurable library that would allow to choose between different memcpy() implementations, maybe on a per-size basis too, parallel and non-parallel, at runtime, through a file in /etc. This is quite possible, and similar features, with replacement for many common library functions, are available with freevec (which unfortunately only implements AltiVec on 32-bit PPC). But a more arch-neutral way to handle this would most likely help out.

Anyway, if somebody is up to take the challenge, I’d be glad to test, and to add to the tinderbox to test on the packages’ testsuites too. Let’s see what happens.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s