Introducing cowstats

No it’s not a script to find statistics on Larry, it’s a tool to get statistics for copy-on-write pages.

I’ve been writing for quite a while about memory usage, RSS memory and other stuff like that on my blog so if you want to get some more in-deep information about it, please just look around. If I start linking here all the posts I’ve made on the topic (okay the last one is not a blog post ;) ) I would probably spend the best part of the night to dig them up (I only linked here the most recent ones on the topic).

Trying to summarise for those who didn’t read my blog for all this time, let’s start with saying that a lot of software, even free software, nowadays wastes memory. When I say waste, I mean it uses memory without a good reason to, I’m not saying that software that uses lots of memory to cache or precalculate stuff and thus be faster is wasting memory, that’s just using memory. I’m not even referring to memory leaks, which are usually just bugs in the code. I’m saying that a lot of software wastes memory when it could save memory without losing performances.

The memory I declare wasted is that memory that could be shared between processes, but it’s not. That’s a waste of memory because you end up using twice or more of the memory for the same goal, which is way sub-optimal. Ben Maurer (a GNOME contributor) wrote a nice script (which is in my overlay if you want; I should finish fixing a couple of things up in the ebuild and commit it to main tree already, the deps are already in main tree) that tells you, for a given process, how much memory is not shared between processes, the so-called “dirty RSS” (RSS stands for Resident Set Size, it’s the resident memory, so the memory that the process is actually using from your RAM).

Dirty RSS is caused by “copy-on-write” pages. What is a page, and what is a copy-on-write pages? Well, memory pages are the unit used to allocate memory to processes (and to threads, and kernel systems, but let’s not go too in deep there); when a process is given a page, it usually also get some permission on that, it might be readable, writable or executable. Trying not to get too in deep on this either (I could easily write a book on this, maybe I should, actually), the important thing is that read-only pages can easily be shared between processes, and can be mapped directly from a file on-disk. This means that two process can use both the same 4KB read-only page, using just 4KB of memory, while if the same content was present in a writable page, the two processes would have their own copy of it, and would require 8KB of memory. Maybe more importantly, if the page is mapped directly from a file on-disk, when the kernel need to make space for new allocated memory, it can just get rid of the page, and then re-load it from the original file, rather than writing it down on the swap file, and then load from that.

To make it easier to load the data from the files on disk, and reduce the memory usage, modern operating systems use copy-on-write. The pages are shared as long as they are not changed from the original; when a process tries to change the content of the page, it’s copied in a new empty, writable page, and the process gets exclusive access to the page, “eating” the memory. This is the reason why using PIC shared objects usually save memory, but that’s another story entirely.

So we should reduce the amount of copy-on-write pages, trying to favour read-only sharable pages. Great, but how? Well, the common way to do so is to make sure that you mark (in C) all the constants as constant, rather than defining them as variables even if you never change their value. Even better, mark them static and constant.

But it’s not so easy to check the whole codebase of a long-developed software to mark everything constant, so there’s the need to analyse the software post-facto and identify what should be worked on. To do so I used objdump (from binutils) up to now, it’s a nice tool to have raw information about ELF files, it’s not easy, but I grew used to it so I can easily grok its output.

Focusing on ELF files, which are the executable and library files in Linux, FreeBSD and Solaris (plus other Unixes), the copy-on-write pages are those belonging, mostly, to these sections: .data, .data.rel and .bss (actually, there are more sections, like .data.local and .data.rel.ro, but let’s just consider those prefixes for now).

.data section keeps the non-stack variables (which means anything declared as static but non-constant in C source) which were initialised in the source. This is probably the cause of most waste of memory: you define a static array in C, you don’t mark it constant properly (see this for string arrays), but you never touch it after definition.

.data.rel section keeps the non-stack variables that need to be relocated at runtime. For instance it might be a static structure with a string, or a pointer to another structure or an array. Often you can’t get rid of relocations, but they have a cost in term of CPU time used, and also a cost in memory usage, as the relocation will trigger for sure the copy-on-write… unless you use prelink, but as you’ll read on that link, it’s not always a complete solution. You usually can live with these, but if you can get rid of instances here, it’s a good thing.

.bss section keeps the uninitalised non-stack variables, for instance if you declare and define a static array, but don’t fill it at once, it will be added to the .bss section. That section is mapped on the zero page (a page entirely initialised to zero, as the name suggests), with a copy-on-write: as soon as you write to the variable, a new page is allocated, and thus memory is used. Usually, runtime-initialised tables falls into this section. It’s often possible to replace them (maybe optionally) with precalculated tables, saving memory at runtime.

My cowstats script analyse a series of object files (tomorrow I’ll work on an ar parser so that it can be ran on static libraries; unfortunately it’s not possible to run it on executables or shared libraries as they tend to hide the static symbols, which are the main cause of wasted memory), looks for the symbols present in those sections, and lists them to you, or in alternative it shows you some statistics (a simple table that tells you how many bytes are used in the three sections for the various object files it was called with). This way you can easily see what variables are causing copy-on-write pages to be requested, so that you can try to change them (or the code) to avoid wasting memory.

I wrote this script because Mike asked me if I had an automated way to identify which variables to work on, after a long series of patches (many of which I have to fix and re-submit) for FFmpeg to reduce the memory usage. It’s now available at https://www.flameeyes.eu/p/ruby-elf as it’s simply a Ruby script using my ELF parser for ruby started last May. It’s nice to see that something I did some time ago for a completely different reason now comes useful again ;)

I mailed the results on my current partly-patched libavcodec, they are quite scary, it’s over 1MB of copy-on-write pages. I’ll continue working so that the numbers will come near to zero. Tomorrow I’ll also try to run the script on xine-lib’s objects, as well as xine-ui. It should be interesting.

Just as a test, I also tried running the script over libvorbis.a (extracting the files manually, as for now I have no way to access those archives through Ruby), and here are the results:

cowstats.rb: lookup.o: no .symtab section found
File name  | .data size | .bss size  | .data.rel.* size
psy.o             22848            0            0
window.o          32640            0            0
floor1.o              0            8            0
analysis.o            4            0            0
registry.o           48            0            0
Totals:
    55540 bytes of writable variables.
    8 bytes of non-initialised variables.
    0 bytes of variables needing runtime relocation.
  Total 55548 bytes of variables in copy-on-write sections

(The warning tells me that the lookup.o file has no symbols defined at all; the reason for this is that the file is under one big #ifdef; the binutils tools might be improved to avoid packing those files at all, as they can’t be used for anything, bearing no symbol… although it might be that they still can carry .init sections, I admit my ignorance here).

Now, considering the focus of libvorbis (only Vorbis decoding), it’s scary to see that there are almost 55KB of memory in writable pages; especially since, looking down to it, I found that they are due to a few tables which are never modified but are not marked as constant.

The encoding library libvorbisenc is even worse:

File name   | .data size | .bss size  | .data.rel.* size
vorbisenc.o      1720896            0            0
Totals:
    1720896 bytes of writable variables.
    0 bytes of non-initialised variables.
    0 bytes of variables needing runtime relocation.
  Total 1720896 bytes of variables in copy-on-write sections

Yes that’s about 1.7 MB of writable pages brought in by libvorbisenc per every process which uses it. And I’m unfortunately going to tell you that any xine frontend (Amarok included) might load libvorbisenc, as libavcodec has a vorbis encoder which uses libvorbisenc. Yes it’s not nice at all!

Tomorrow I’ll see to prepare a patch for libvorbis (at least) and see if Xiph will not ignore me at least this time. Once the script will be able to act on static libraries, I might just run it on all the ones I have on my system and identify the ones that really need to be worked on. This of course will have not to hinder my current jobs (I’m considering this in-deep look at memory usage part of my job as I’m probably going to need it in a course I have to teach next month), as I really need money, especially to get a newer box before end of the year, Enterprise is getting slow.

Mike, I hope you’re reading this blog, I tried to explain the thing I’ve been doing in the best way possible :)

8 thoughts on “Introducing cowstats

  1. Thanks, it’s always a pleasure to read yours technical posts. I always learn something interesting.

    Like

  2. It does not hurt to be extra vigilant but if the process does not touch a COW page, then it is still shared and not copied for the process. Right ? In that case, a library can generate several COW pages which could have been otherwise but if those pages are not used, no extra memory is needed. And if the memory is used, then you probably have no chance of changing the variables to const anyway.

    Like

  3. The problem is that copy-on-write acts on the _page_ rather than on the single variable.So for instance you got two static variables that are actually changed, and between them there’s a 6KB table that is never touched. They’ll fall in two different pages.When you change the two variables, copy-on-write will trigger on _both_ pages, causing 8KB of memory to be used.If instead the table is on a read-only page, the two variables would fall into a single 4KB page, and just one cow will be triggered, not wasting 4KB of memory.

    Like

  4. Thanks for making my head hurt again. I personally find it a pleasure to see something with real technical content appearing on planet Gentoo.Keep those informative posts coming…

    Like

  5. Just for everybody’s information, I’ve fixed libvorbis and libvorbisenc, now there are just 4 bytes (a single variable) on cow pages, and the result is noticeable on amarok :)

    Like

  6. I have noticed your patch in our Trac. I’ll bring it to attention of the big kahuna. It’s the best I can promise you.

    Like

  7. I’m wondering: A table, which is not marked “const” but never modified after initialization, and is statically initialized (i.e. in the .data section), would never be copied anyway. Why is this a problem? The whole idea behind cow is to make pages shareable as long as they’re not written to. So yes, you can add “const” to a large table, but why would this change anything?

    Like

  8. See above, I already commented on it: if you got one variable, then a table bigger than the rest of the page, and another variable, changing both of the variables will trigger cow for two pages. If the table is not there in the middle, you’ll trigger cow of just one page.Sure you can _hope_ that the compiler reorders the variables so that the table is at the end of them, but the best thing is to make sure that what should be const goes into .rodata.Besides, I’m not sure if .data cow pages are disk-backed, that is, I don’t know if that .data source would just be discarded and read from disk if memory is needed, or if it goes on the swap; on a pessimistic view, I take the second option, and prefer to make explicit to the operating system that it can be discarded and read from disk.

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s