Postmortem of a patch, or how do you find what changed?

Two days ago, Luca asked me to help him figure out what’s going on with a patch for libav which he knew to be the right thing, but was acting up in a fashion he didn’t understand: on his computer, it increased the size of the final shared object by 80KiB — while this number is certainly not outlandish for a library such as libavcodec, it does seem odd at a first glance that a patch removing source code is increasing the final size of the executable code.

My first wild guess which (spoiler alert) turned out to be right, was that removing branches out of the functions let GCC optimize the function further and decide to inline it. But how to actually be sure? It’s time to get the right tools for the job: dev-ruby/ruby-elf, dev-util/dwarves and sys-devel/binutils enter the battlefield.

We’ve built libav with and without the patch on my server, and then rbelf-size told us more or less the same story:

% rbelf-size --diff libav-{pre,post}/avconv
        exec         data       rodata        relro          bss     overhead    allocated   filename
     6286266       170112      2093445       138872      5741920       105740     14536355   libav-pre/avconv
      +19456           +0         -592           +0           +0           +0       +18864 

Yes there’s a bug in the command, I noticed. So there is a total increase of around 20KiB, where is it split? Given this is a build that includes debug info, it’s easy to find it through codiff:

% codiff -f libav-{pre,post}/avconv
[snip]

libavcodec/dsputil.c:
  avg_no_rnd_pixels8_9_c    | -163
  avg_no_rnd_pixels8_10_c   | -163
  avg_no_rnd_pixels8_8_c    | -158
  avg_h264_qpel16_mc03_10_c | +4338
  avg_h264_qpel16_mc01_10_c | +4336
  avg_h264_qpel16_mc11_10_c | +4330
  avg_h264_qpel16_mc31_10_c | +4330
  ff_dsputil_init           | +4390
 8 functions changed, 21724 bytes added, 484 bytes removed, diff: +21240

[snip]

If you wonder why it’s adding more code than we expected, it’s because there are other places where functions have been deleted by the patch, causing some reductions in other places. Now we know that the three functions that the patch deleted did remove some code, but five other functions added 4KiB each. It’s time to find out why.

A common way to do this is to generate the assembly file (which GCC usually does not represent explicitly) to compare the two — due to the size of the dsputil translation unit, this turned out to be completely pointless — just the changes in the jump labels cause the whole file to be rewritten. So we rely instead on objdump, which allows us to get a full disassembly of the executable section of the object file:

% objdump -d libav-pre/libavcodec/dsputil.o > dsputil-pre.s
% objdump -d libav-post/libavcodec/dsputil.o > dsputil-post.s
% diff -u dsputil-{pre,post}.s | diffstat
 unknown |245013 ++++++++++++++++++++++++++++++++--------------------------------
 1 file changed, 125163 insertions(+), 119850 deletions(-)

As you can see, trying a diff between these two files is going to be pointless, first of all because of the size of the disassembled files, and secondarily because each instruction has its address-offset prefixed, which means that every single line will be different. So what to do? Well, first of all it’s useful to just isolate one of the functions so that we reduce the scope of the changes to check — I found out that there is a nice way to do so, and it involves relying on the way the function is declared in the file:

% fgrep -A3 avg_h264_qpel16_mc03_10_c dsputil-pre.s
00000000000430f0 :
   430f0:       41 54                   push   %r12
   430f2:       49 89 fc                mov    %rdi,%r12
   430f5:       55                      push   %rbp
--
[snip]

While it takes a while to come up with the correct syntax, it’s a simple sed command that can get you the data you need:

% sed -n -e '/ dsputil-func-pre.s
% sed -n -e '/ dsputil-func-post.s
% diff -u dsputil-func-{pre,post}.s | diffstat
 dsputil-func-post.s | 1430 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 1376 insertions(+), 54 deletions(-)

Okay that’s much better — but it’s still a lot of code to sift through, can’t we reduce it further? Well, actually… yes. My original guess was that some function was inlined; so let’s check for that. If a function is not inlined, it has to be called, the instruction for which, in this context, is callq. So let’s check if there are changes in the calls that happen:

% diff -u =(fgrep callq dsputil-func-pre.s) =(fgrep callq dsputil-func-post.s)
--- /tmp/zsh-flamehIkyD2        2013-01-24 05:53:33.880785706 -0800
+++ /tmp/zsh-flamebZp6ts        2013-01-24 05:53:33.883785509 -0800
@@ -1,7 +1,6 @@
-       e8 fc 71 fc ff          callq  a390 
-       e8 e5 71 fc ff          callq  a390 
-       e8 c6 71 fc ff          callq  a390 
-       e8 a7 71 fc ff          callq  a390 
-       e8 cd 40 fc ff          callq  72e0 
-       e8 a3 40 fc ff          callq  72e0 
-       e8 00 00 00 00          callq  43261 
+       e8 00 00 00 00          callq  8e670 
+       e8 71 bc f7 ff          callq  a390 
+       e8 52 bc f7 ff          callq  a390 
+       e8 33 bc f7 ff          callq  a390 
+       e8 14 bc f7 ff          callq  a390 
+       e8 00 00 00 00          callq  8f8d3 

Yes, I do use zsh — on the other hand, now that I look at the code above I note that there’s a bug: it does not respect $TMPDIR as it should have used /tmp/.private/flame as base path, dang!

So the quick check shows that avg_pixels8_l2_10 is no longer called — but does that account for the whole size? Let’s see if it changed:

% nm -S libav-{pre,post}/libavcodec/dsputil.o | fgrep avg_pixels8_l2_10
00000000000072e0 0000000000000112 t avg_pixels8_l2_10
00000000000072e0 0000000000000112 t avg_pixels8_l2_10

The size is the same and it’s 274 bytes. The increase is 4330 bytes, which is around 15 times more than the size of the single function — what does that mean then? Well, a quick look around shows this piece of code:

        41 b9 20 00 00 00       mov    $0x20,%r9d
        41 b8 20 00 00 00       mov    $0x20,%r8d
        89 d9                   mov    %ebx,%ecx
        4c 89 e7                mov    %r12,%rdi
        c7 04 24 10 00 00 00    movl   $0x10,(%rsp)
        e8 cd 40 fc ff          callq  72e0 
        48 8d b4 24 80 00 00    lea    0x80(%rsp),%rsi
        00 
        49 8d 7c 24 10          lea    0x10(%r12),%rdi
        41 b9 20 00 00 00       mov    $0x20,%r9d
        41 b8 20 00 00 00       mov    $0x20,%r8d
        89 d9                   mov    %ebx,%ecx
        48 89 ea                mov    %rbp,%rdx
        c7 04 24 10 00 00 00    movl   $0x10,(%rsp)
        e8 a3 40 fc ff          callq  72e0 
        48 8b 84 24 b8 04 00    mov    0x4b8(%rsp),%rax
        00 
        64 48 33 04 25 28 00    xor    %fs:0x28,%rax
        00 00 
        75 0c                   jne    4325c 

This is just a fragment but you can see that there are two calls to the function, followed by a pair of xor and jne instructions — which is the basic harness of a loop. Which means the function gets called multiple times. Knowing that this function is involved in 10-bit processing, it becomes likely that the function gets called twice per bit, or something along those lines — remove the call overhead (as the function is inlined) and you can see how twenty copies of that small function per caller account for the 4KiB.

So my guess was right, but incomplete: GCC not only inlined the function, but it also unrolled the loop, probably doing constant propagation in the process.

Is this it? Almost — the next step was to get some benchmark data when using the code, which was mostly Luca’s work (and I have next to no info on how he did that, to be entirely honest); the results on my server has been inconclusive, as the 2% loss that he originally registered was gone in further testing and would, anyway, be vastly within margin of error of a non-dedicated system — no we weren’t using full-blown profiling tools for that.

While we don’t have any sound numbers about it, what we’re worried about is for cache-starved architectures, such as Intel Atom, where the unrolling and inlining can easily cause performance loss, rather than gain — which is why all us developers facepalm in front of people using -funroll-all-loops and similar. I guess we’ll have to find an Atom system to do this kind of runs on…

Exit mobile version