Cleaning up after yourself

I noted in my post about debug information that in feng I’m using a debug codepath to help me reduce false positives in valgrind. When I wrote that, I looked up an older post that promised an explanation but never delivered one. A year later, I guess it’s time for me to explain, and later possibly document this in the lscube documentation that I’ve been trying to maintain to describe the whole architecture.

The problem: valgrind is an important piece of equipment in the toolbox of a software developer; but like any other tool, it blindly reports the facts, without caring about the intentions of the person who wrote the code. Forgetting this leads to situations like Debian SA 1571 (the OpenSSL debacle), where an “uninitialised value” warning was taken as a bug, when the behaviour was actually pretty much intended. At any rate, the problem here is slightly different of course.

One of the most important reasons to use valgrind is to find memory leaks: memory areas that are allocated but never freed properly. This kind of error can make software either unreliable or unusable in production, so testing for memory leaks is among the primary concerns of most seasoned developers. Unfortunately, as I said, valgrind doesn’t understand the intentions of the developer, and in this context it cannot discern between memory that leaks (or rather, that is “still reachable” when the program terminates) and memory that is intentionally used until the program stops. Indeed, since the kernel will free all the memory allocated to a process when it ends, it’s common practice to simply leave it to the kernel to deallocate those structures that are needed until the end of the program, such as configuration structures.

But the problem with leaving these structures around is that you either have to ignore the “still reachable” values (which might actually include some real leaks), or you have to cope with a number of false positives introduced by this practice. To remove the false positives, it is not too uncommon to free the remaining memory areas before exiting, something like this:

extern myconf *conf;

int main() {
  conf = loadmyconf();

  /* ... the program proper ... */

  freemyconf(conf);
  return 0;
}

The problem with having code written this way is that even just the calls to free up the resources cause some overhead, and especially for small fire-and-forget programs, those simple calls can become a nuisance. Depending on the kind of data structures to free, an orderly unwind can actually take quite a bit of time. A common alternative solution is to guard the resource-free calls with a debug conditional, of the kind I have written about in the other post. Such a solution usually ends up being #ifndef NDEBUG, so that the same macro can get rid of both the assertions and the resource-free calls.

This works out quite decently when you have a simple, straightforward top-down program, but it doesn’t work so well when you have a complex (or chaotic, as you prefer) architecture like feng does. In feng, we have a number of structures that are only used by a range of functions, which are themselves constrained within a translation unit. They are, naturally, variables that you’d consider static to the unit (or even static to the function, from time to time, but that’s just a matter of visibility to the compiler; function or unit static does not change a thing). Unfortunately, to free them from main() while keeping them static to the unit, you need an externally-visible function to properly free them up. While that is not excessively bad, it’s still going to require quite a bit of work jumping between the units, just to get some cleaner debug information.

My solution in feng is something I find much cleaner, even though I know some people might well disagree with me. To perform the orderly cleanup of the remaining data structures, rather than having uninit or finalize functions called at the end of main() (which would then require me to properly handle errors in sub-procedures so that they end up reaching the finalisation code in main()!), I rely on the compiler supporting the destructor attribute. Actually, I check with autoconf whether the compiler supports this not-too-uncommon feature, and if it does, and the user requested a debug build, I enable the “cleanup destructors”.

Cleanup destructors are simple unit-static functions declared with the destructor attribute; the compiler will set them up to be called as part of the _fini code, when the process is torn down, and that includes both orderly return from main() and calls to exit(), which is just what I was looking for. Since the function is already within the translation unit, the variables don’t even need to be exported (and that helps the compiler, especially in the case when they are only used within a single function, or at least I sure hope so).

In one case the #ifdef conditional actually switches a variable from being stack-based to being static to the unit (which changes the final code of the project quite a bit), since the reference to the head of the array of listening sockets is only needed when iterating through them to set them up, or when freeing them; if we don’t free them (non-debug build) we don’t even need to save it.

Anyway, where is the code? Here it is:

dnl configure.ac: check for the destructor attribute
CC_ATTRIBUTE_DESTRUCTOR

then, when the attribute is available and this is a debug build:

           # define CLEANUP_DESTRUCTOR __attribute__((__destructor__))

(the CC_ATTRIBUTE_DESTRUCTOR macro is part of my personal series of additional macros to check compiler features, including attributes and flags).

And one example of code:

static void CLEANUP_DESTRUCTOR accesslog_uninit()
{
    size_t i;

    if ( feng_srv.config_storage )
        for(i = 0; i < feng_srv.config_context->used; i++)
            if ( !feng_srv.config_storage[i].access_log_syslog &&
                 feng_srv.config_storage[i].access_log_fp != NULL )
                fclose(feng_srv.config_storage[i].access_log_fp);
}

You can find the rest of the code over at the LScube GitHub repository — have fun!

Cleaning up

Following the debugging of yesterday’s problem, I decided to spend some time analysing the memory behaviour of feng with valgrind, and I noticed a few interesting things. The main one is that there are quite a few places where memory is allocated but never freed, because it’s used from a given point till the end of the execution.

Since memory gets deallocated when the process exits, adding explicit code to free that type of resource is not necessary; on the other hand, it’s often suggested as one of the best practices in development, since leaks are never a good thing, code may be reused in different situations where freeing the resource is actually needed, and most importantly: not freeing resources will sidetrack software like valgrind that is supposed to tell you where the leaks are, which either forces you to reduce the level of detail (like hiding all the still-reachable memory) or to cope with false positives.

With valgrind, the straightforward solution is to use a suppression file to hide what is known to be a false positive; on the other hand, this solution is not extremely practical: it is valgrind-specific, yet valgrind is not the only software doing this kind of analysis; it does not tell you if you were, for any reason, fiddling with pointers during the course of the program; and it separates the knowledge about the program from the program itself, which is almost never a good thing.

Now of course one could just always free all the resources; the overhead of doing that is likely minimal in complex software, because a few calls to free() and friends are probably just a small percentage of the whole code; but still, if they are unnecessary, there is no real reason to have them compiled into the final code. My approach is perhaps a bit more hacky, but I think it’s a nice enough one: final memory cleanup is done under a preprocessor conditional, checking for the NDEBUG preprocessor macro, the same macro that is used to disable assertions.

In a few places, though, I’ve been using the GOnce construct (more or less equivalent to pthread_once) that allows me to execute initialisation when actually needed, instead of having initialisation and cleanup functions to call from the main function. While this works pretty well, it makes it tricky to know when to actually free resources. Luckily most modern compilers support the destructor attribute, which basically is a way to say that a given function, which can very well be static, should be called when the process shuts down. Relying on these destructors for actual program logic is not a very good idea because they are not really well supported everywhere, but for leak testing and debugging they can be quite useful instrumentation.

I’ll try to provide more detailed data once I make feng not report leaks any longer.

Fighting with the glib slice allocator

I’ve written the other day about how I think glib is good for server software; Lennart did point out that there is one problem with memory management because of the way g_malloc() works, and that glib thus shouldn’t be used for lower-level system software like udev and similar, because you cannot make sure you stay inside your resource limits.

Today I noticed one more problem that I’ve got to work on to make feng a nice piece of software while using glib. This problem is also related to memory management, but this time it’s due to the slice allocator, a slab-like allocator for userspace that is used both by all the newer glib structure allocations, and by most of feng’s structures when we’re dealing with fixed-size objects (feng also has lots of non-fixed-size objects, but those cannot be easily allocated with slices).

It actually worked quite well when implemented for most of the objects; where it turned out to be a problem was in my producer/consumer code (now fine-tuned and pretty much working fine). It was actually quite curious, because to reproduce the problem I had to use one particular file: the 1080p AVI version of Big Buck Bunny. Using the same 1080p version in H.264 and QuickTime format, the problem does not happen at all.

So what is the problem? When playing back that file one or more times, with one or multiple clients, the feng process starts to allocate more and more memory; Luca stopped when it reached 1GiB of resident memory, I stopped at 2GiB. Now, the reason this happens is that my producer/consumer code (which is where the problem starts to appear) does not limit the number of packets that are read (even though it does not read the whole file into memory at once, but waits to be asked to read more data, once requests start to arrive it can be a problem), and thus it will create a very high number of objects in a very short time; since the slice allocator runs out of room quite quickly, it starts allocating new memory chunks to store the objects in. As an added note, the objects I’m talking about in this article are just the internal representation of the actual objects produced and consumed, of the same type as glib’s own GList objects. They are not the actual data that gets read (which is variable-sized); they are just small objects that keep a pointer to the actual data.

Now, the problem is that these chunks are, from what I can see, never freed until, at least, the thread that allocated them is killed; this would be okay if we were handling everything in separate threads, but there are a number of variables in play here. The allocation and deallocation happen on different threads: the former in the producer object, which is executed by the resource-reading thread, and the latter usually in the consumer (unless you abort the playback), which is either the main thread now or the worker thread once we switch. Resource reading happens either on a separate thread (called “mediathread”), or, in the newer code, in one of a series of threads in a pool where the read event is handled asynchronously; neither of these threads will be deleted by the time the connection is closed. There is also a concern about creating multiple threads per client, since the server is supposed to be able to handle hundreds of clients just fine.

Just to make sure the problem lay with the slice allocator code, I ran valgrind with the slice allocator disabled by setting G_SLICE=always-malloc. This was actually quite interesting, since it showed me that there are other problems in feng that were hidden earlier by the slice code. It also showed that there is really no leak, at least in my producer/consumer code. So the problem is not that objects don’t get deallocated, but rather that they are deallocated much more slowly than they are allocated, which, for slice-allocated objects, means that the chunks get expanded. Unfortunately the idea of just making the producer/consumer code use normal heap-allocated objects is not appealing: with the speed they are allocated at, the allocate/free overhead would realistically be bad.

I’m now considering a few possible solutions to this problem, but seriously, it looks like my best choice is to limit the rate of buffer allocations. But before I do that, I wish to flesh out, for me and for whoever is reading or is going to have this problem, what is causing the issue and why on earth it only happens with the AVI file and not with the QuickTime format file. The reason is actually a mix of different factors, both on account of the file format itself and of the way feng works; I can do my best to reduce the issues related to the latter, but I can’t do much about the former.

Let’s start with the file; the AVI name stands for Audio Video Interleave, which is what the file actually is: the video and audio data are interleaved in chunks. By design it does not really try to balance the audio and video parts for streaming, as much as for decoding on old computers with very little horsepower. For this reason, to get enough data to stream for all the tracks, the demuxer has to read many more chunks than it would for more streaming-oriented formats like QuickTime.

This ties in with what feng is doing; every time an RTP packet is sent to the client, a request is triggered for the demuxer to read a packet, both for audio and video tracks; unfortunately, this usually leads to many more chunks of data being read together, since a single read by the demuxer produces more data than can be sent through a single RTP packet. It’s almost literally eating more than it can swallow.

I’m already planning out in my mind a few possible ways to avoid this situation, but I also have to consider that this wasn’t happening before with the old bufferpool library; the reason is that, quite simply, bufferpool didn’t have a seemingly infinite memory area to play with, unlike the slice allocator. The concept of bufferpool’s slots allowed resources to be limited by force, and when more data was produced than could be consumed, it either stopped and waited for a slot to free up, or started dropping buffers. While it worked, I still don’t like those choices.

The one option I prefer among those I’ve thought about is to add a maximum length to the queue, at least for unseen packets. While simply having a maximum number of elements in the queue could work, as could having a circular array like bufferpool used to have, I’m not too keen on that idea, because it also means that a single slow client in a multicast stream could stop all the others, or at least slow them down as well; limiting the tide of the queue, the added buffers that nobody has seen yet, seems to me to have better potential.

There is also the problem of read requests happening rather too soon; since we know that there is no symmetry between the number of RTP packets sent and the number of times the resource is going to be read, we cannot just request the demuxer to read data as soon as we send an RTP packet. What we have to do is keep a count of how many buffers haven’t been sent to the client yet, and if we see that going too low, we request a new read.

Funnily enough, both of these problems seem to have already been considered by the original developers of feng, and you can see indications of this in the way both bufferpool and feng already had names calling back to these ideas. Indeed, the original read request was made through an event_buffer_low() function; unfortunately both its use and its implementation had, by the time I started working on it, diverged from the original idea.

Okay, I know this post has almost surely been uninteresting to those of my readers who don’t care about lscube development; on the other hand, I needed to write this down to clear my mind anyway, and someone might actually find it interesting, after all.

A few statistics out of cowstats on media-libs libraries

Today is a short work day: my connection is unstable and unusable, and tonight I’m out with a couple of friends of mine.

So I simply prepared a (complex) one-liner to gather some statistics about the media libraries I have installed on my system, out of the static archives (so it’s not the worst case):

qlist -I -v media-libs -C | \
  while read pkg; do
     pkgname=${pkg##*/}; mkdir -p ${pkgname}; pushd ${pkgname}
     qlist $pkg | egrep '\.a$' | \
        while read lib; do
            libname=$(basename $lib); mkdir -p ${libname}; pushd ${libname}
            ar x ${lib}
            ruby -I ~/devel/repos/ruby-elf ~/devel/repos/ruby-elf/cowstats.rb --statistics --total *.o > ~/mytmpfs/libstats/${pkgname}:${libname}
            popd
        done
     popd
  done

The result of this, grepped around a bit, gave me these statistics:

a52dec-0.7.4-r5:liba52.a:  Total 4593 bytes of variables in copy-on-write sections
alsa-lib-1.0.15:libasound.a:  Total 13164 bytes of variables in copy-on-write sections
alsa-lib-1.0.15:smixer-ac97.a:  Total 192 bytes of variables in copy-on-write sections
alsa-lib-1.0.15:smixer-hda.a:  Total 192 bytes of variables in copy-on-write sections
alsa-lib-1.0.15:smixer-python.a:  Total 2144 bytes of variables in copy-on-write sections
alsa-lib-1.0.15:smixer-sbase.a:  Total 104 bytes of variables in copy-on-write sections
alsa-oss-1.0.15:libalsatoss.a:  Total 28 bytes of variables in copy-on-write sections
alsa-oss-1.0.15:libaoss.a:  Total 128 bytes of variables in copy-on-write sections
alsa-oss-1.0.15:libossredir.a:  Total 112 bytes of variables in copy-on-write sections
audiofile-0.2.6-r3:libaudiofile.a:  Total 6408 bytes of variables in copy-on-write sections
faac-1.26-r1:libfaac.a:  Total 9612 bytes of variables in copy-on-write sections
faad2-2.6.1:libfaad.a:  Total 8138 bytes of variables in copy-on-write sections
flac-1.2.1-r2:libFLAC.a:  Total 1044 bytes of variables in copy-on-write sections
fontconfig-2.5.0-r1:libfontconfig.a:  Total 2196 bytes of variables in copy-on-write sections
gd-2.0.35:libgd.a:  Total 144508 bytes of variables in copy-on-write sections
giflib-4.1.6:libgif.a:  Total 1043 bytes of variables in copy-on-write sections
ilmbase-1.0.1:libHalf.a:  Total 1 bytes of variables in copy-on-write sections
ilmbase-1.0.1:libIex.a:  Total 8 bytes of variables in copy-on-write sections
ilmbase-1.0.1:libIlmThread.a:  Total 16 bytes of variables in copy-on-write sections
ilmbase-1.0.1:libImath.a:  Total 311 bytes of variables in copy-on-write sections
imlib-1.9.15-r2:libImlib.a:  Total 28 bytes of variables in copy-on-write sections
imlib2-1.4.0:id3.a:  Total 8 bytes of variables in copy-on-write sections
imlib2-1.4.0:libImlib2.a:  Total 17468 bytes of variables in copy-on-write sections
imlib2-1.4.0:xpm.a:  Total 8 bytes of variables in copy-on-write sections
jasper-1.900.1-r1:libjasper.a:  Total 13703 bytes of variables in copy-on-write sections
lcms-1.17:liblcms.a:  Total 11002 bytes of variables in copy-on-write sections
libao-0.8.8:libalsa09.a:  Total 96 bytes of variables in copy-on-write sections
libao-0.8.8:libao.a:  Total 590 bytes of variables in copy-on-write sections
libao-0.8.8:liboss.a:  Total 72 bytes of variables in copy-on-write sections
libao-0.8.8:libpulse.a:  Total 80 bytes of variables in copy-on-write sections
libart_lgpl-2.3.19-r1:libart_lgpl_2.a:  Total 16 bytes of variables in copy-on-write sections
libcddb-1.3.0-r1:libcddb.a:  Total 2472 bytes of variables in copy-on-write sections
libdvdcss-1.2.9-r1:libdvdcss.a:  Total 2064 bytes of variables in copy-on-write sections
libdvdread-0.9.7:libdvdread.a:  Total 2108 bytes of variables in copy-on-write sections
libexif-0.6.16-r1:libexif.a:  Total 51096 bytes of variables in copy-on-write sections
libgpod-0.6.0:libgpod.a:  Total 20 bytes of variables in copy-on-write sections
libid3tag-0.15.1b:libid3tag.a:  Total 1 bytes of variables in copy-on-write sections
liblrdf-0.4.0:liblrdf.a:  Total 57376 bytes of variables in copy-on-write sections
libmikmod-3.1.11-r5:libmikmod.a:  Total 7798 bytes of variables in copy-on-write sections
libmp4v2-  Total 288 bytes of variables in copy-on-write sections
libsdl-1.2.13:libSDL.a:  Total 49795 bytes of variables in copy-on-write sections
libsndfile-1.0.17-r1:libsndfile.a:  Total 18651 bytes of variables in copy-on-write sections
libsvg-0.1.4:libsvg.a:  Total 552 bytes of variables in copy-on-write sections
libvorbis-1.2.0:libvorbis.a:  Total 4 bytes of variables in copy-on-write sections
netpbm-10.40.0:libnetpbm.a:  Total 10612 bytes of variables in copy-on-write sections
openexr-1.6.1:libIlmImf.a:  Total 1205 bytes of variables in copy-on-write sections
raptor-1.4.16:libraptor.a:  Total 3365 bytes of variables in copy-on-write sections
sdl-gfx-2.0.16:libSDL_gfx.a:  Total 6184 bytes of variables in copy-on-write sections
sdl-image-1.2.6:libSDL_image.a:  Total 67259 bytes of variables in copy-on-write sections
sdl-mixer-1.2.8:libSDL_mixer.a:  Total 495 bytes of variables in copy-on-write sections
sdl-net-1.2.7:libSDL_net.a:  Total 7 bytes of variables in copy-on-write sections
sdl-pango-0.1.2:libSDL_Pango.a:  Total 44 bytes of variables in copy-on-write sections
sdl-ttf-2.0.9:libSDL_ttf.a:  Total 19 bytes of variables in copy-on-write sections
smpeg-0.4.4-r9:libsmpeg.a:  Total 117478 bytes of variables in copy-on-write sections
t1lib-5.1.1:libt1.a:  Total 45191 bytes of variables in copy-on-write sections
tiff-3.8.2-r3:libtiff.a:  Total 7746 bytes of variables in copy-on-write sections
x264-svn-20070924:libx264.a:  Total 16832 bytes of variables in copy-on-write sections
xvid-1.1.3-r2:libxvidcore.a:  Total 209384 bytes of variables in copy-on-write sections

As you can see there’s quite some room for improvement; and remember that these statistics are from non-PIC objects: the PIC objects will probably have more, within .data.rel sections.

I’ll be writing a few more patches for this, trying to reduce the numbers as much as I can, even if sometimes it won’t cause much improvement in the actual shared library, either because of relocation of symbols, or just because one or two static variables are enough to cause the 4KB copy-on-write page to be allocated.

My overlay now features patches to mark tables as constant for: libvorbis, libtheora, libpng, giflib, speex, flac, libdca and libmpcdec. I’ll add a few more either tonight when I come home or tomorrow, or anyway in the next weeks.

All the patches were sent to their respective upstreams, so that they can be applied there and become available to everybody in the future.

Hopefully some of them might just be applied in Gentoo too, soon :)

Introducing cowstats

No, it’s not a script to find statistics on Larry; it’s a tool to get statistics on copy-on-write pages.

I’ve been writing about memory usage, RSS memory and related topics on my blog for quite a while, so if you want some more in-depth information, please just look around. If I started linking here all the posts I’ve made on the topic (okay, the last one is not a blog post ;) ), I would probably spend the best part of the night digging them up (I only linked the most recent ones on the topic).

Trying to summarise for those who haven’t read my blog all this time: let’s start by saying that a lot of software, even free software, nowadays wastes memory. When I say waste, I mean it uses memory without a good reason to. I’m not saying that software that uses lots of memory to cache or precalculate stuff, and is thus faster, is wasting memory: that’s just using memory. I’m not even referring to memory leaks, which are usually just bugs in the code. I’m saying that a lot of software wastes memory when it could save memory without losing performance.

The memory I call wasted is memory that could be shared between processes, but isn’t. That’s a waste because you end up using twice as much memory, or more, for the same goal, which is way sub-optimal. Ben Maurer (a GNOME contributor) wrote a nice script (which is in my overlay if you want it; I should finish fixing a couple of things in the ebuild and commit it to the main tree already, the deps are already in the main tree) that tells you, for a given process, how much memory is not shared between processes: the so-called “dirty RSS” (RSS stands for Resident Set Size; it’s the resident memory, the memory that the process is actually using from your RAM).

Dirty RSS is caused by “copy-on-write” pages. What is a page, and what is a copy-on-write page? Well, memory pages are the unit used to allocate memory to processes (and to threads, and kernel systems, but let’s not go too deep there); when a process is given a page, it usually also gets some permissions on it: it might be readable, writable or executable. Trying not to go too deep on this either (I could easily write a book on this, maybe I should, actually), the important thing is that read-only pages can easily be shared between processes, and can be mapped directly from a file on disk. This means that two processes can both use the same 4KB read-only page, using just 4KB of memory, while if the same content were in a writable page, the two processes would each have their own copy of it, and would require 8KB of memory. Maybe more importantly, if a page is mapped directly from a file on disk, when the kernel needs to make space for newly allocated memory, it can just get rid of the page and later re-load it from the original file, rather than writing it out to the swap file and then loading it from there.

To make it easier to load the data from files on disk, and to reduce memory usage, modern operating systems use copy-on-write. The pages are shared as long as they are not changed from the original; when a process tries to change the content of a page, the content is copied into a new, writable page, and the process gets exclusive access to it, “eating” the memory. This is the reason why using PIC shared objects usually saves memory, but that’s another story entirely.

So we should reduce the number of copy-on-write pages, favouring read-only, sharable pages. Great, but how? Well, the common way to do so is to make sure that you mark (in C) all the constants as constant, rather than defining them as variables whose value you never change. Even better, mark them static and constant.

But it’s not so easy to check the whole codebase of a long-developed piece of software to mark everything constant, so there’s a need to analyse the software after the fact and identify what should be worked on. To do so I have used objdump (from binutils) up to now; it’s a nice tool for getting raw information out of ELF files. It’s not easy, but I grew used to it, so I can easily grok its output.

Focusing on ELF files, which are the executable and library files on Linux, FreeBSD and Solaris (plus other Unixes), the copy-on-write pages are mostly those belonging to these sections: .data, .data.rel and .bss (actually, there are more sections, like .data.local, but let’s just consider those prefixes for now).

The .data section keeps the non-stack variables (which means anything declared static, but non-constant, in C source) which were initialised in the source. This is probably the cause of the biggest waste of memory: you define a static array in C, you don’t mark it properly constant (see this for string arrays), but you never touch it after definition.

The .data.rel section keeps the non-stack variables that need to be relocated at runtime. For instance it might be a static structure with a string, or a pointer to another structure or an array. Often you can’t get rid of relocations, but they have a cost in terms of CPU time, and also a cost in memory usage, as the relocation will trigger the copy-on-write for sure… unless you use prelink, but as you’ll read at that link, it’s not always a complete solution. You can usually live with these, but if you can get rid of instances here, it’s a good thing.

The .bss section keeps the uninitialised non-stack variables; for instance, if you declare and define a static array but don’t fill it at once, it will be added to the .bss section. That section is mapped onto the zero page (a page entirely initialised to zero, as the name suggests) with copy-on-write: as soon as you write to the variable, a new page is allocated, and thus memory is used. Usually, runtime-initialised tables fall into this section. It’s often possible to replace them (maybe optionally) with precalculated tables, saving memory at runtime.

My cowstats script analyses a series of object files (tomorrow I’ll work on an ar parser so that it can be run on static libraries; unfortunately it’s not possible to run it on executables or shared libraries, as they tend to hide the static symbols, which are the main cause of wasted memory), looks for the symbols present in those sections, and lists them for you; alternatively, it shows you some statistics (a simple table that tells you how many bytes are used in the three sections by the various object files it was called with). This way you can easily see which variables are causing copy-on-write pages to be requested, so that you can try to change them (or the code) to avoid wasting memory.

I wrote this script because Mike asked me if I had an automated way to identify which variables to work on, after a long series of patches (many of which I have to fix and re-submit) for FFmpeg to reduce its memory usage. It’s now available, as it’s simply a Ruby script using the ELF parser for Ruby I started last May. It’s nice to see that something I did some time ago for a completely different reason now comes in useful again ;)

I mailed the results on my current partly-patched libavcodec; they are quite scary: over 1MB of copy-on-write pages. I’ll continue working so that the numbers come closer to zero. Tomorrow I’ll also try to run the script on xine-lib’s objects, as well as xine-ui. It should be interesting.

Just as a test, I also tried running the script on libvorbis.a (extracting the files manually, as for now I have no way to access those archives through Ruby), and here are the results:

cowstats.rb: lookup.o: no .symtab section found
File name  | .data size | .bss size  | .data.rel.* size
psy.o             22848            0            0
window.o          32640            0            0
floor1.o              0            8            0
analysis.o            4            0            0
registry.o           48            0            0
    55540 bytes of writable variables.
    8 bytes of non-initialised variables.
    0 bytes of variables needing runtime relocation.
  Total 55548 bytes of variables in copy-on-write sections

(The warning tells me that the lookup.o file has no symbols defined at all; the reason is that the file’s content is under one big #ifdef; the binutils tools might be improved to avoid packing such files at all, as they can’t be used for anything, bearing no symbols… although it might be that they can still carry .init sections; I admit my ignorance here.)

Now, considering the focus of libvorbis (only Vorbis decoding), it’s scary to see that there are almost 55KB of memory in writable pages; especially since, digging into it, I found that they are due to a few tables which are never modified but are not marked as constant.

The encoding library libvorbisenc is even worse:

File name     .data size   .bss size   .data.rel.* size
vorbisenc.o      1720896           0                  0
    1720896 bytes of writable variables.
    0 bytes of non-initialised variables.
    0 bytes of variables needing runtime relocation.
  Total 1720896 bytes of variables in copy-on-write sections

Yes, that's about 1.7MB of writable pages brought in by libvorbisenc for every process which uses it. And I'm unfortunately going to tell you that any xine frontend (Amarok included) might load libvorbisenc, as libavcodec has a Vorbis encoder which uses it. Not nice at all!

Tomorrow I'll try to prepare a patch for libvorbis (at least) and see whether Xiph will not ignore me this time. Once the script is able to act on static libraries, I might just run it on all the ones on my system and identify the ones that really need work. Of course this must not hinder my current jobs (I'm considering this in-depth look at memory usage part of my job, as I'm probably going to need it for a course I have to teach next month), as I really need money, especially to get a newer box before the end of the year; Enterprise is getting slow.

Mike, I hope you’re reading this blog, I tried to explain the thing I’ve been doing in the best way possible :)

Precalculating tables

I'm continuing my attempt to reduce the dirty RSS (resident memory) used by FFmpeg's libraries. After a few more patches for libpostproc, I've moved on to libavutil and libavcodec, trying to reduce the memory they use.

The basic problem with libavcodec is the need for huge tables to speed up decoding. These tables are used to look up particular bytes (or shorts) so that their conversion to another given format need not be calculated at runtime.

There are two kinds of tables usually used: prefilled (or precalculated) tables, which are constant and go to the .rodata section of executable files, and runtime-initialised tables. You usually have to choose one or the other, although for smaller tables precalculated is the win-win choice, as the code used to generate the table might be bigger than the table itself (this was the case for two tables in I-forgot-which xine plugin, which were runtime-initialised before I replaced them).

FFmpeg, since a few days ago, has an option to choose between runtime-initialised and hardcoded tables, although that's just for the CRC code. Myself, I chose hardcoded, and you'll now see why.

Runtime-initialised and hardcoded tables have different pros and cons, so most of the time (which means for all but the smallest tables) it's nice to be able to choose between the two.

The first difference between the two is the section they are mapped to in the resulting ELF file: pre-initialised tables are saved in .rodata, while runtime-initialised ones are either saved in .bss (which means they are mapped to a zero'd page, and COW'd as soon as the initialisation code is run), or allocated dynamically at initialisation.

This means that runtime-initialised tables are private to the process: they end up as dirty RSS or heap memory. On the other hand, hardcoded tables are mapped from the .rodata section of the on-disk ELF file, and shared between processes.

The problem with hardcoded tables is that they need to be actually saved inside the ELF file, so if (this is a real example) you have two 16KiB tables, for which the initialisation code is about 500 bytes, with hardcoded tables you save the 500 bytes but add 32KiB to the size of the executable. As you can see, the executable will be smaller if you don't use hardcoded tables.

The size of the executable does not really change any cache-related parameter, as the tables are the same size either way, and it does not mean the processes will use more memory; on the contrary, they'll use less memory because they can share the pages. It might change the time needed to load the file from disk, but that's a one-time cost versus a per-execution cost for the initialisation.

On modern systems with big hard drives, it’s more than likely you wouldn’t see any difference in load time, and you most likely will not have trouble with the increased size. You might want to see the memory usage reduced though, especially for processes that might use xine somehow (think KDE4).

I've been writing a few patches to hardcode tables in FFmpeg today; hopefully I'll be able to get them integrated soon. In my overlay I have a live SVN ebuild for FFmpeg, and it has a USE flag for the hardcoded tables. Probably the size of the libraries will increase a lot, since I've seen quite a few tables in .bss that could be hardcoded.

My target at the moment is to reduce the 40KiB of dirty RSS I see in my Amarok process. Hopefully this will make FFmpeg's shared library support more viable (considering that the runtime relocations due to PIC make FFmpeg quite a bit slower, not counting the fact that it's missing one register compared to the non-PIC version).

KDE4 users would be happy for that :)

As for my personal life, today I have a very sore throat, but I received the smoking pipe I ordered, which is actually a nice way to relieve stress while thinking… no, I don't smoke, and I have no intention to; I just bite on it and keep doing what I'm doing. With the complications I had during hospitalisation, I'll never try smoking at all.

Reminding a weakness of Prelink

For extra content about this entry, please refer to the previous one which talks about array of strings and PIC.

As I said in the other post, prelink can reduce the amount of dirty RSS pages due to COW of PIC code. As prelink assigns every library a predefined load address in memory, which is either truly unique, or unique within the set of programs known to be able to load that library, there is no COW (copy-on-write) during load: the loader doesn't have to change the addresses loaded from the file, and is thus able to share the same pages among many processes. This is how prelink saves (sometimes a lot of) memory.

Unfortunately there is one big problem, especially with modern software architectures: many programs now use runtime-loaded plugins; the whole KDE architecture is based on this, even in KDE 4, as are xine and many others.

The problem is that prelink can't really take plugins into account, as it doesn't know about them. For instance, it can't understand that amarok is able to load the xine engine, thus amarokapp is not going to be prelinked against xine-lib; likewise, it can't understand that xine-lib is in turn able to load its plugins, which depend on further libraries. This means that, for instance when using the -m switch, it could assign xine-lib and one of those plugin dependencies the same address, causing a performance hit rather than an improvement at runtime, when the linker will have to relocate all the code of the colliding library, triggering a COW.

The same is true for almost all scripting languages which use C-compiled extensions: Perl, Ruby, Python, as you can’t tell that the interpreter can load them just by looking at the ELF file, which is what prelink does.

A possible way around this is to define post-facto, that is, after compilation, which shared objects a program can load. It could probably be done through a special .note section in the ELF file and a modified prelink, but I'm afraid it would be quite difficult to implement properly, especially in ebuilds. On the other hand, it might give quite some performance improvement; as I said, today's software architectures are often based on on-demand loading of code through plugins, so it could be quite interesting.

One thing for which KDEPIM sucks

I’m referring to KDEPIM from KDE 3, I certainly hope this kind of problems is fixed for KDE 4; if it isn’t, that would likely be a good reason why KDE should reform itself in a new way.

So, last night I started working on a simple script (that for now uses a mixture of Ruby and readelf calls) that loads all the symbols present in the libraries on the system, and then checks for duplicates, to see which symbols would collide if loaded in the same address space.

A bit of history on this. If you ever looked into Michael Meeks's patches for the infamous -Bdirect option, he talked about the incompatibility of that with the technique of interposing. Symbol interposing is a way to provide different implementations of the same interface in different libraries, allowing you to switch between one and the other by changing the way they are linked or by using LD_PRELOAD. A common case of interposing are the threading libraries: the C library usually has weak symbols for the various pthread_* functions, apart from pthread_create, which is the one used to actually create threads, so that a library that can be used both with and without threading can still define its own mutexes and use them, the calls becoming no-ops when no threads are used.

To use interposing, the same symbol is present in more than one library; usually the library that is always linked in has a weak symbol, while the implementations have normal symbols. When used consciously, this is called interposing; when it happens (way more often) without conscious use, it is instead a symbol collision. A symbol collision is bad, because a library expecting to use a certain function in a certain way might end up using a totally different one because of the collision.

Even when the symbol is just the same, for the same function, symbol collisions require more work from the linker to resolve (hash values and prelink don't always work), so they should usually be avoided. One way to avoid them without renaming the functions, if they are used only internally, is to use hidden visibility. This solves the issue, but needs to be properly implemented to be useful.

So, my tool checks for symbol collisions by gathering all the symbols of all the libraries installed on my system into a table of an SQLite database, and then counting how many times the same symbol is present. Of course there are false positives, cases in which the presence of different symbols with the same name implies neither interposing nor collision; that's the case for most plugin infrastructures: the plugins export one or more specific symbols with a given name, and the loader knows to look for them and binds them into a per-plugin data structure. In the case of xine, this is achieved through the xine_plugin_info structure present in every plugin. To avoid reporting those as collisions (they are not), I also added a suppression file, where I can declare regular expressions for symbols (in some files) that need not be counted as collisions.

The output of the script showed me this:

Symbol soap_bool2s(soap*, bool) present 3 times
Symbol soap_in_int(soap*, char const*, int*, char const*) present 3 times
Symbol soap_s2bool(soap*, char const*, bool*) present 3 times

and so on, for a long while. This means that the same internal library is being linked into those three shared objects and replicated. This is bad, because it wastes memory (the code is not shared between the instances of programs using those three libraries) and on-disk space (the code is replicated in three files when it should be in just one).

I've prepared a patch to kdepim-kresources (the package these libraries belong to) that, instead of declaring libgwsoap an internal library, declares it an installed library, so that the three libraries link against libgwsoap rather than linking it in.

The results?

flame@enterprise ~ % qsize kdepim-kresources
kde-base/kdepim-kresources-3.5.6: 186 files, 25 non-files, 64362.24 KB
flame@enterprise ~ % qsize kdepim-kresources
kde-base/kdepim-kresources-3.5.6: 191 files, 25 non-files, 37207.401 KB

The first is before the change, the second after the change.
And this is far from being the sole part of kdepim having such a stupid problem.

Sorry KDE guys, but KDEPIM 3.x is a failure, when it comes to properly build and install stuff.

Huge structures

Today I was finally able to talk with Vir (Matthias Kretz, the Phonon guy), and I got some partially good news, but also some pretty bad news.

The good thing is that there will be a notification daemon in KDE4 that will use Phonon, so that not every application needing notifications will have to load libxine (and all its damned plugins); libxine might still be loaded on the fly by a file dialog to preview a multimedia file, though, which means it's still going to be used by more than just a couple of applications at a time.

So the result is that we need to branch xine-lib to make more invasive changes to the codebase for a 1.2 series: for instance a shared cache for the plugins, created (if at all possible) during the make install phase, as well as caching of configuration options. A massive amount of work has to go in that direction, so we can make more data in the plugins constant, reducing the chances of dirtying RSS pages and thus reducing xine-lib's overall memory occupation.

Other changes include the ability to choose the demuxer through a shared MIME type, rather than by extension, which would avoid having to load all the demuxers one by one while probing, and a way for a plugin to tell the library whether it supports ways to check what it can demux other than by content (some plugins fall back to by-content detection when they don't recognise an extension). Plus there is the always-improvable support for seeking: right now there is only one choice, an input plugin can either seek or not, simple as that, while there can be different grades of seeking: no seeking at all (sequential), slow seeking (you can seek, but you shouldn't just jump around to find out what the format is; HTTP streaming is an example), usual seeking (which is what most plugins already implement), and time-based seeking (MMS and RTSP support this).

RTSP support is also quite lacking, and it could certainly be improved by using libnemesi (Luca knows I already tried, but as it is, the structure of xine is not good enough). Another external library we might as well use is libavformat from FFmpeg: right now we only support libavcodec and, to some extent, libpostproc from the FFmpeg project, while we could easily use libavformat as an extra demuxer plugin, as it can certainly support a lot of formats without us having to take care of them one by one.

Unfortunately, I don't think we'll be able to provide much of this in a 1.2 series right away, especially since we don't have many developers around at the moment, but one can hope.

Now, to return to the title of this post, I've decided to play a bit with pahole, a nice piece of software by Arnaldo Carvalho de Melo, a kernel developer, which analyses the DWARF data of binaries (be they final executables, libraries or kernels) to identify the structures used internally and their contents, finding their size and any padding holes that would allow reducing it.

Well, a simple run of it showed that there is at least one structure, video_overlay_object_t (also used in video_overlay_event_t), that takes more than 80KB of space. I'm not sure how many of these structures are loaded, but even a single one is a lot of space, and I'm quite sure most of it is wasted. As far as I can see, it is used for the overlaid buttons in DVD menus (and other kinds of menus, I suppose, although I don't think we currently support Matroska's menus; I'm not sure about that). Even so, I rarely see DVD menus with 32 buttons, and as every button takes more than 2KB, it should probably be improved by using a button list rather than a static array, allocating buttons only when actually needed.

As I said, there is a lot of work to do, especially when it comes to memory, but hopefully, as soon as we have a new DVCS, the work will become easier. Of course switching to a distributed model will require some adjustment time, as we all need to learn Mercurial (well, besides Darren, that is), and there will be a lot to do to get others involved; but then, for instance, the VDR patches from Reinhard Nissl would just become a different branch of xine-lib that could easily be merged from time to time (today I merged two more patches from him, and I'm sure more will come in the future).

I suppose right now the main question is: where will we host the sources? The best thing would probably be some hosting provider on this side of the Atlantic, to avoid USA patents, but it's something on which we haven't taken a final decision yet; one idea was to ask on Alioth, but it all remains to be seen.

If the switch goes well, I could also see us moving away from the tracker system we use now, which is pretty much unusable, but there isn't much choice about it: I don't intend to use Mantis again in my life (I used it for NoX-Wizard, do you remember, Fabrizio?); Scarab looks interesting, but I'm not sure where we could find a hosting provider with JSP support for free (or cheap).

Oh well, time to go to sleep soon. If you're interested in pahole, you can find an ebuild for it in my overlay (I know, gitarella is having trouble; I'm afraid the Ruby update broke both gitarella and ServoFlame, so my services are down until I find the time (and will) to fix them).