Small differences don’t matter (to unpaper)

After my challenge with the fused multiply-add instructions, I managed to find some time to write a new test utility. It's written ad hoc for unpaper, but it can probably be used for other things too. It's trivial and stupid, but it got the job done.

What it does is simple: it loads both a golden and a result image file, compares their size and format, and then goes through all the bytes to count the differences between them. If less than 0.1% of the image surface changed, it considers the test a pass.

It’s not a particularly nice system, especially as it requires me to bundle some 180 MB of golden files (they compress to just about 10 MB, so it’s not a big deal), but it’s a strict improvement compared to what I had before, which is good.

This change actually allowed me to explore a change that I had abandoned before because it resulted in non-pixel-perfect results. In particular, unpaper now uses single-precision floating point all over, rather than doubles. The slight imperfections caused by this change are not relevant enough to warrant the ever-so-slight loss in performance due to the bigger variables.

But even up to this point, there is very little gain in performance. Sure, some calculations can be faster this way, but we’re still using the same set of AVX/FMA instructions. This is unfortunate: unless you start rewriting the algorithms used to search for edges or rotations, there is no gain to be made just by shrinking the variables. When I converted unpaper to use libavcodec, I decided to make the code as simple and stupid as I could, so that I would have a baseline to improve from, but I’m not sure what the best way to improve it is now.

I still have a branch that uses OpenMP for the processing, but since most of the filters applied depend on each other, it does not work very well. Per-row processing gets slightly better results, but they are really minimal as well. I think the most interesting low-hanging fruit for parallel processing would be to process the two pages in parallel after splitting them from a single sheet of paper. Unfortunately, the loops used to do that processing right now are so complicated that I’m not looking forward to touching them for a long while.

I tried some basic profile-guided optimisation, just to figure out what needs to be improved, and compared with codiff a proper release build against a PGO build trained on the tests. Unfortunately the results are a bit vague, which means I’ll probably have to profile it properly if I want to get data out of it. If you’re curious, here is the output of rbelf-size -D on the unpaper binary when built normally, with profile-guided optimisation, with link-time optimisation, and with both:

% rbelf-size -D ../release/unpaper ../release-pgo/unpaper ../release-lto/unpaper ../release-lto-pgo/unpaper
    exec         data       rodata        relro          bss     overhead    allocated   filename
   34951         1396        22284            0        11072         3196        72899   ../release/unpaper
   +5648         +312         -192           +0         +160           -6        +5922   ../release-pgo/unpaper
    -272           +0        -1364           +0         +144          -55        -1547   ../release-lto/unpaper
   +7424         +448        -1596           +0         +304          -61        +6519   ../release-lto-pgo/unpaper

It’s unfortunate that GCC does not give you any diagnostics on what it’s trying to achieve when doing LTO; it would be interesting to see whether you could steer the compiler to produce better code without it as well.

Anyway, enough with the micro-optimisations for now. If you want to make unpaper faster, feel free to send me pull requests; I’ll be glad to take a look at them!

Some more details about tables

I have a cold, I’ve been lying in bed for four days because of it, and I’m bored. I sent a few more patches to ffmpeg-devel today, although the fever did increase my mistake ratio.

Anyway, continuing the discussion about tables, I’ve also mailed ffmpeg-devel a simple command to check the amount of .bss tables present in libavcodec:

{ objdump -t libavcodec/*.o | fgrep .bss | \
    awk '{ print $5 }' | sed -e 's:^:0x:' -e 's:$: + :'; echo 0; } | irb

The result was surprising: 961 KiB (plus the tables that I hardcoded locally and sent upstream). This means that a fully initialised libavcodec will use up to that amount of memory in COW’d tables, private to every process using it.

Replacing all the tables with hardcoded ones will increase the size of the libavcodec library by the same amount, but in turn there will be no COW. This means faster startup for the library, as no COW is triggered for initialisation, and less memory used.

Tomorrow, if I feel better, I’ll be adding code for generating the tables to FFmpeg so that it would be less annoying to maintain the tables. I sincerely hope it can be integrated, so that Amarok will use less memory ;)

Precalculating tables

I’m continuing my attempt to reduce the dirty RSS (resident memory) in FFmpeg’s libraries. After a few more patches for libpostproc, I’ve moved on to libavutil and libavcodec, trying to reduce the memory they use.

The basic problem with libavcodec is its need for huge tables to speed up decoding. These tables are used to look up particular bytes (or shorts) so that their conversion to another given format need not be calculated on the fly.

There are two kinds of tables usually used: prefilled (or precalculated) tables, which are constant and go into the .rodata section of executable files, and runtime-initialised tables. You usually have to choose one or the other, although for smaller tables precalculated is the win-win choice, as the code used to generate the table might be bigger than the table itself (this was the case for two tables in I-forgot-which xine plugin, which were runtime-initialised before I replaced them).

FFmpeg, since a few days ago, has an option to choose between runtime-initialised and hardcoded tables, although for now just for the CRC code. I chose hardcoded, and you’ll now see why.

Runtime-initialised and hardcoded tables have different pros and cons, so it’s nice, most of the time (which means for all but the smallest tables), to be able to choose between the two.

The first difference between the two is the section they are mapped to in the resulting ELF file: pre-initialised tables are saved in .rodata, while runtime-initialised tables are either saved in .bss (which means they are mapped to a zeroed page and COW’d as soon as the initialisation code runs) or allocated dynamically at initialisation.

This means that runtime-initialised tables are private to the process, counting as dirty RSS or heap memory used up. Hardcoded tables, on the other hand, are mapped out of .rodata in the on-disk ELF file, and shared between processes.

The problem with hardcoded tables is that they need to actually be saved inside the ELF file, so if (and this is a real example) you have two 16 KiB tables whose initialisation code is about 500 bytes, hardcoding them saves the 500 bytes but adds 32 KiB to the size of the executable. As you can see, the executable will be smaller if you don’t use hardcoded tables.

The size of the executable does not really change any cache-related parameter, as the tables are the same size, and does not mean that the processes will use more memory. On the other hand, they’ll use less memory because they can share the pages. It might change the time needed to load the file from disk, but that’s a one-time cost versus a per-execution cost of the initialisation.

On modern systems with big hard drives, it’s more than likely you wouldn’t see any difference in load time, and you most likely will not have trouble with the increased size. You might want to see the memory usage reduced though, especially for processes that might use xine somehow (think KDE4).

I’ve been writing a few patches to hardcode tables in FFmpeg today; hopefully I’ll be able to get them integrated soon. In my overlay I have a live SVN ebuild for FFmpeg, and it has a USE flag for the hardcoded tables. The size of the libraries will probably increase quite a bit, since I’ve seen quite a few tables in .bss that could be hardcoded.

My target at the moment is to reduce the 40 KiB of dirty RSS I see in my Amarok process. Hopefully this will make FFmpeg’s shared library support more viable (considering that the runtime relocations due to PIC make FFmpeg quite a bit slower, not counting the fact that it’s missing one register compared to the non-PIC version).

KDE4 users would be happy for that :)

As for my personal life, today I have a very bad sore throat, but I received the smoking pipe I ordered, which is actually a nice way to relieve stress while thinking… no, I don’t smoke, and I have no intention to; I just bite it and keep doing what I’m doing. With the complications I had during hospitalisation, I’ll never try smoking at all.