Small differences don’t matter (to unpaper)

After my challenge with the fused multiply-add instructions I managed to find some time to write a new test utility. It’s written ad hoc for unpaper but it can probably be used for other things too. It’s trivial and stupid but it got the job done.

What it does is simple: it loads a golden and a result image file, compares their size and format, and then goes through all the bytes to count how many differences there are between them. If less than 0.1% of the image surface changed, it considers the test a pass.
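The pass/fail rule is simple enough to sketch; this is my own minimal reconstruction of the idea, with names of my choosing rather than the utility’s actual code:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: count differing bytes between the golden image and
 * the result, and pass when fewer than 0.1% of the bytes changed. The real
 * utility also checks size and format first. */
static int images_roughly_equal(const uint8_t *golden, const uint8_t *result,
                                size_t len) {
  size_t diff = 0;
  for (size_t i = 0; i < len; i++)
    if (golden[i] != result[i])
      diff++;
  return diff * 1000 < len; /* diff/len < 0.1% */
}
```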

It’s not a particularly nice system, especially as it requires me to bundle some 180MB of golden files (they compress to just about 10 MB so it’s not a big deal), but it’s a strict improvement compared to what I had before, which is good.

This change actually allowed me to revisit a change that I had abandoned before because it resulted in non-pixel-perfect results. In particular, unpaper now uses single-precision floating point all over, rather than doubles. The slight imperfections caused by this change are not relevant enough to warrant the ever-so-slight loss in performance due to the bigger variables.

But even up to here, there is very little gain in performance. Sure, some calculations can be faster this way, but we’re still using the same set of AVX/FMA instructions. This is unfortunate: unless you start rewriting the algorithms used for searching for edges or rotations, there is no gain to be had by changing the size of the variables. When I converted unpaper to use libavcodec, I decided to make the code as simple and as stupid as I could, as that meant I had a baseline to improve from, but I’m not sure what the best way to improve it is, now.

I still have a branch that uses OpenMP for the processing, but since most of the filters applied are dependent on each other it does not work very well. Per-row processing gets slightly better results, but the gains are really minimal as well. I think the most interesting parallel-processing low-hanging fruit would be to execute processing in parallel on the two pages after splitting them from a single sheet of paper. Unfortunately, the loops used to do that processing right now are so complicated that I’m not looking forward to touching them for a long while.
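If the two halves were handled by independent per-page functions, that page-level parallelism could look like this sketch — entirely my own, with process_page standing in for unpaper’s actual filter chain:

```c
#include <stddef.h>

/* Placeholder for the real per-page filter chain. */
static void process_page(int *page, size_t len) {
  for (size_t i = 0; i < len; i++)
    page[i] = 255 - page[i]; /* stand-in filter: invert */
}

/* The two pages split from one sheet are independent, so they can run as
 * two OpenMP sections; without -fopenmp the pragmas are ignored and the
 * pages are simply processed one after the other. */
static void process_sheet(int *left, int *right, size_t len) {
#pragma omp parallel sections
  {
#pragma omp section
    process_page(left, len);
#pragma omp section
    process_page(right, len);
  }
}
```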

I tried some basic profile-guided optimisation, just to figure out what needs to be improved, and compared, with codiff, a proper release build and a PGO build trained on the tests. Unfortunately the results are a bit vague, which means I’ll probably have to profile it properly if I want to get data out of it. If you’re curious, here is the output of rbelf-size -D on the unpaper binary when built normally, with profile-guided optimisation, with link-time optimisation, and with both profile-guided and link-time optimisation:

% rbelf-size -D ../release/unpaper ../release-pgo/unpaper ../release-lto/unpaper ../release-lto-pgo/unpaper
    exec         data       rodata        relro          bss     overhead    allocated   filename
   34951         1396        22284            0        11072         3196        72899   ../release/unpaper
   +5648         +312         -192           +0         +160           -6        +5922   ../release-pgo/unpaper
    -272           +0        -1364           +0         +144          -55        -1547   ../release-lto/unpaper
   +7424         +448        -1596           +0         +304          -61        +6519   ../release-lto-pgo/unpaper

It’s unfortunate that GCC does not give you any diagnostics on what it’s trying to achieve when doing LTO; it would be interesting to see whether you could steer the compiler to produce better code without it as well.

Anyway, enough with the micro-optimisations for now. If you want to make unpaper faster, feel free to send me pull requests for it, I’ll be glad to take a look at them!

More threads is not always a good solution

You might remember that some time ago I took over unpaper, which I used (and sort of still use) to pre-filter the documents I scan to archive in electronic form. While I’ve been very busy in the past few months, I’m now trying to clear out my backlog, and it turns out that a good deal of it involves unpaper. There are bugs to fix and features to implement, but also patches to evaluate.

One of the most recent patches I received is designed to help with the performance of the tool, which is definitely not what Jens, the original author, had in mind when he came up with the code. Better performance is something that just about everybody would love to have at this point. Unfortunately, it turns out that the patch is not working out as intended.

The patch is two-fold: on one side, it makes (optional) use of OpenMP to parallelize the code in the hope of speeding it up. Given that most of the code is a bunch of loops, it seemed obvious that using multithreading would speed things up, right? Well, the first try after applying it showed very easily that it slows things down, at least on Excelsior, which is a 32-core system. While the first test would take less than 10 seconds to run without OpenMP, it would take over a minute with it, spinning up all the cores for 3000% CPU usage.

A quick test shows that forcing the number of threads to 8, rather than leaving it unbound, actually makes it faster than the non-OpenMP variant. This means that there are a few different variables in play that need to be tuned for performance to improve. Without going into profiling the code, I can figure out a few things that can go wrong with unchecked multithreading:

  • extensive locking when each worker thread is running, either because they are all accessing the same memory page, or because the loop needs a “reduced” result (e.g. a value has to be calculated as the sum of values calculated within a parallelized loop); in the case of unpaper I’m sure both situations happen fairly often in the looping codepath.
  • cache thrashing: as the worker threads jump around the memory area to process, memory is no longer fetched linearly;
  • something entirely more complicated.
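The “reduced result” case in the first point maps directly onto OpenMP’s reduction clause; a minimal sketch of the pattern (my example, not unpaper’s code):

```c
#include <stddef.h>
#include <stdint.h>

/* Each thread accumulates into a private copy of sum; OpenMP combines the
 * copies at the end of the loop, avoiding a lock or atomic operation on
 * every iteration. Without -fopenmp the pragma is ignored and this is a
 * plain serial loop. */
static long sum_pixels(const uint8_t *pixels, size_t len) {
  long sum = 0;
#pragma omp parallel for reduction(+ : sum)
  for (long i = 0; i < (long)len; i++)
    sum += pixels[i];
  return sum;
}
```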

Besides making it obvious that doing the “stupid” thing and just making all the major loops parallel does not work, this situation is bringing me to the point where I will finally make use of the Using OpenMP book that I got a few years back, and only started reading after figuring out that OpenMP was not ready yet for prime time. Nowadays OpenMP support in Linux has improved to the point that it’s probably worth taking another look at it, and I guess unpaper is going to be the test case for it. You can expect a series of blog posts on the topic at this point.

The first thing I noticed while reading about the way OpenMP handles shared and private variables is that the const indication is much stronger when using OpenMP. The reason is that if you tell the compiler that a given datum is not going to change (it’s a constant, not a variable), it can easily assume that direct access from all the threads will work properly; the variable is shared by default among all of them. This is something that, for non-OpenMP programs, is usually inferred from the SSA form — I assume that, for whatever reason, OpenMP makes SSA weaker.

Unfortunately, this also means that there is one nasty change that might be required to make code friendlier to OpenMP, and that is a change in the prototypes of functions. The parameters to a function are, as far as OpenMP is concerned, variables, and that means that unless you declare them const, it won’t share them by default. Within a self-contained program like unpaper, changing the signatures of the functions so that parameters are declared, for instance, const int is painless — but for a library it would be API breakage.
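A sketch of what such a signature change looks like — the function and parameter names here are illustrative, not unpaper’s actual API:

```c
#include <stddef.h>

/* With width and height declared const, they can safely be treated as
 * shared across all worker threads; only the loop indices need to be
 * private, which OpenMP handles automatically for loop variables declared
 * inside the construct. */
static void halve_brightness(unsigned char *image, const int width,
                             const int height) {
#pragma omp parallel for
  for (int y = 0; y < height; y++)
    for (int x = 0; x < width; x++)
      image[(size_t)y * (size_t)width + (size_t)x] /= 2;
}
```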

Anyway, just remember: adding multithreading is not the silver bullet you might think it is!

P.S.: I’m currently trying to gauge interest on a volume that would collect, re-edit, and organize all I’ve written up to now on ELF. I’ve created a page on leanpub where you can note down your interest, and how much you think such a volume would be worth. If you would like to read such a book, just add your interest through the form; as soon as at least a dozen people are interested, I’ll spend all my free time working on the editing.

Gold readiness obstacle #5: libtool (part 2, OpenMP)

After some digression on the issues with the latest glibc version, it’s time for the fifth instalment of gold readiness for the Gentoo tree, completing the libtool issues that I noted a couple of days ago in part 1.

As I said, this relates to a known issue with libtool and OpenMP, to the point that there is a Gentoo bug open and the upstream libtool package is already fixed to deal with it; it just hasn’t trickled down to a release and from there into Gentoo. Although I guess it might be a good idea to just apply this with elibtoolize, as I’ve done for the configure fix for gold.

What is the problem? Well, the first problem is with the design of the OpenMP language extensions, and with some other flags that implicitly enable those extensions. While these are flags that relate to the GCC frontend (the gcc command), they not only change the semantics of the compiled code, they also change the semantics of the linking step, by adding a link to the OpenMP implementation library. This means that the frontend needs to know about this both at compile time and at link time (where the frontend converts it to the proper linking flags for the link editor to pick up OpenMP).
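A trivial example of why the flag matters at both steps: when built with -fopenmp, the parallel region below compiles into calls to the OpenMP runtime (GOMP_parallel and friends with GCC’s libgomp), so the flag must reach the link step too or those symbols stay undefined. Without the flag, this sketch degrades gracefully to a single thread:

```c
/* Returns the size of the thread team inside a parallel region: the number
 * of OpenMP threads when built with -fopenmp, or 1 in a serial build. */
#ifdef _OPENMP
#include <omp.h>
#endif

static int team_size(void) {
  int n = 1;
#ifdef _OPENMP
#pragma omp parallel
  {
#pragma omp single
    n = omp_get_num_threads();
  }
#endif
  return n;
}
```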

Unfortunately, when using libtool, its script parses and mangles the options first (it’s for this reason that libtool had to be patched to get --as-needed working as intended). Up to now, it didn’t know that -fopenmp should be passed to the linking frontend of gcc just the same.

Okay, in truth this is not much of an issue for gold only; it is just the same issue when using ld/bfd. But the switch to a link editor with stricter underlinking rejection makes the issue much more apparent, in particular because, while libtool is usually involved in building the libraries, there is no reason (besides the slight chance of using static linking through libtool archives) to force its use for building final executables, which means that a single -fopenmp at the final linking point would be quite enough.

After some time with Snow Leopard

You probably know that, as much as I am a Linux user, I’m an OS X user as well. I don’t usually develop for OS X, but I do use it quite a bit; after my laptop broke last March, I bought an iMac to replace it (and now I also have my MacBook Pro back, although with the optical unit still not working; I’m now tempted to get a second hard drive instead of an optical unit, as I can use the iMac’s DVD sharing instead).

And since I’m both a developer and a user, when the new release of OS X, Snow Leopard, was finally published, I ordered it right away. Two weeks into using the new version of OS X, I have some comments to share. And I’m going to share them here because this is something that various Free Software projects could probably learn from too.

The first point is nothing new; Apple already said that Snow Leopard is nothing totally new, but rather a polished version of Leopard… with 64-bit under the hood. The 64-bit idea is not new to Linux: a lot of distributions already support it, and when it’s available, almost all system software uses it; there are still a few proprietary pieces that are not ported to 64-bit, especially for what concerns games and software like Skype, but most of the stuff is good for us, so we really have nothing new to learn from OS X in that field.

Sincerely, I was expecting the Grand Central Dispatch thing to be a rebranded OpenMP (by the way, thanks to whoever sent me the book; it hasn’t arrived yet, but I’ll make sure to put it to good use in Free Software once I have it!); instead it seems to be a totally different implementation, which Apple partly made available as free software (see this LWN article); I’m not sure if I’m happy about it or not, given it’s yet another implementation of an already-present idea. On the other hand, it’s certainly an area where Free Software could really learn; I don’t think OpenMP is much used outside of CPU-intensive tasks, but as the Apple guys showed in their WWDC presentation, it’s something that even a mail client could make good use of.

I still have no idea what technique QuickTime X is using for the HTTP-based streaming; I’ll find out one day, though. For now I’m still working on the new implementation of the lscube RTSP parser, which should also support the already-present HTTP proxy passthrough; if it uses the same technique, that’s even better!

In the list of less-advertised changes there are also things like better Google support in iCal and Address Book: now, for instance, you can edit Google calendars from inside the iCal application, which is kinda cool (all the changes are automatically available both locally and on Google Calendar itself), and you can sync your Address Book with Google Contacts. The former is something that supposedly works with Evolution as well, although I think they really have a long way to go before it works as well — and that’s not to say that iCal integration works perfectly… at all!

The latter instead is a bit strange. I already had the impression that Google Contacts is some very bad shit (it doesn’t store all the information, the web interface is nearly unusable, and so on), but when I decided to enable the “Sync with Google” option in Address Book I probably made a big mistake: first, the thing created lots of duplicates in my book, since I had uploaded a copy of all the contacts with my cellphone some time ago, and some entries were seen as duplicates rather than as the same thing (mostly for people with an associated honorific, like “Dott.” for my doctors). This is quite strange, because the vCard files should have a Unique ID just for that reason: to make sure they are not duplicated when moved between different services. In addition, the phone numbers got messed up as they were merged (in Apple’s Address Book I keep them well formatted – +39 041 09 80 841 – the Nokia removes the spaces, and it seems like Google Contacts sometimes drops the country code for no good reason at all).

Interestingly enough, though, while Leopard was known for the MobileMe support, Snow Leopard adds quite a few more options for syncing data, probably because MobileMe itself wasn’t really that much of a good deal for most people. It still doesn’t support my Nokia E75 natively (but “my” plugin worked — a copy of the E71 plugin by Nokia with the phone name edited), and it doesn’t seem to support a generic SyncML provider (like Nokia’s Ovi service), but there is, for instance, a new “CardDAV” entry in the Address Book; I wonder if it’s compatible with Evolution’s CalDAV-based address book — if so, I might want to use that, I guess.

While the Apple showcase of Snow Leopard was aimed at criticising Microsoft’s release of Windows Vista, with all the related changes in the interface, I wouldn’t be surprised if, when deciding how to proceed with the new version, they also factored in the critiques against KDE 4’s release. I hope that Gnome 3 won’t be anything like that, and will rather follow Apple’s approach of subtle, gentle changes, although I won’t count on it.

At any rate, the experience up to now has been quite nice: nothing broke heavily, and even Parallels Desktop worked fine after the update, which actually surprised me, since I expected the kernel-level stuff to break apart with the update. I wish Linux were as stable sometimes. But bottom line, although with a few problems, I still love Free Software better.

For A Parallel World: Programming Pointers n.1: OpenMP

A new sub-series of For A Parallel World, giving some development pointers on how to make software better use the parallel computing resources available in modern multicore systems. Hope some people find it interesting!

While I was looking for a (yet unfound) scripting language that would allow me to easily write conversion scripts with FFmpeg that execute in parallel, I decided to finally take a look at OpenMP, which is something I had wanted to do for quite a while. Thanks to some Intel documentation, I tried it out on the old benchmark I used to compare glibc, glib and libavcodec/FFmpeg for what concerned byte swapping.

The code of the byte swapping routine was adapted to this (note that the index is ssize_t, which means there is an opening for breakage when len is bigger than SSIZE_MAX):

void test_bswap_file(const uint16_t *input, uint16_t *output, size_t len) {
  ssize_t i;

#ifdef _OPENMP
#pragma omp parallel for
#endif
  for (i = 0; i < (ssize_t)(len / 2); i++)
    output[i] = GUINT16_FROM_BE(input[i]);
}
and I compared the execution time, through the rdtsc instruction, between the standard code and the OpenMP-powered one, with GCC 4.3. The result actually surprised me: I expected I would have needed to make this much more complex, like breaking it into multiple chunks, to be able to make it fast. Instead, it actually worked nicely with one chunk of the loop running on each core:

[Bar graph showing the difference in speed between standard serial bswap and a simple OpenMP-powered variant]

The results are the average of 100 runs for each size; the file is on disk, accessed through VFS, and it’s not urandom. The absolute values are not important, since they are the output of rdtsc, so they only work for comparison.

Now of course this is a very lame way to use OpenMP; there are much better, more complex ways to make use of it, but all in all I see quite a few interesting patterns arising. One very interesting note is that I could make use of this in quite a few of the audio output conversion functions in xine-lib, if I were able to resume dedicating time to working on that. Unfortunately lately I’ve been quite limited in time, especially since I don’t have a stable job.

But anyway, it’s interesting to know; if you’re a developer interested in making your code faster on modern multicore systems, consider taking some time to play with OpenMP. With some luck I’m going to write more about it in the future on this blog.