More threads is not always a good solution

You might remember that some time ago I took over unpaper, which I used (and still sort of use) to pre-filter the documents I scan to archive in electronic form. While I’ve been very busy in the past few months, I’m now trying to clear out my backlog, and it turns out that a good deal of it involves unpaper. There are bugs to fix and features to implement, but also patches to evaluate.

One of the most recent patches I received is designed to help with the performance of the tool, which is definitely not what Jens, the original author, had in mind when he came up with the code. Better performance is something that just about everybody would love to have at this point. Unfortunately, it turns out that the patch is not working out as intended.

The patch is two-fold: on one side, it makes (optional) use of OpenMP to parallelize the code in the hope of speeding it up. Given that most of the code is a bunch of loops, it seemed obvious that using multithreading would speed things up, right? Well, the first try after applying it showed very easily that it slows things down, at least on Excelsior, which is a 32-core system. While the first test would take less than 10 seconds to run without OpenMP, it would take over a minute with it, spinning up all the cores at 3000% CPU usage.

A quick test shows that forcing the number of threads to 8, rather than leaving it unbound, makes it actually faster than the non-OpenMP variant. This means that there are a few different variables in play that need to be tuned for performance to improve. Without going into profiling the code, I can figure out a few things that can go wrong with unchecked multithreading:

  • extensive locking while the worker threads are running, either because they are all accessing the same memory page, or because the loop needs a “reduced” result (e.g. a value has to be calculated as the sum of values calculated within a parallelized loop); in the case of unpaper I’m sure both situations happen fairly often in the looping codepath;
  • cache thrashing: as the worker threads jump around the memory area being processed, memory is no longer fetched linearly;
  • something entirely more complicated.

Besides making it obvious that doing the “stupid” thing and just marking all the major loops parallel is not enough, this situation is bringing me to the point where I will finally make use of the Using OpenMP book that I got a few years back, and that I only started reading after figuring out that OpenMP was not ready yet for prime time. Nowadays OpenMP support on Linux has improved to the point that it’s probably worth taking another look at it, and I guess unpaper is going to be the test case for it. You can expect a series of blog posts on the topic at this point.

The first thing I noticed while reading about the way OpenMP handles shared and private variables is that the const indication is much stronger when using OpenMP. The reason is that if you tell the compiler that a given datum is not going to change (it’s a constant, not a variable), it can easily assume that direct access from all the threads will work properly; the variable is shared by default among all of them. This is something that, for non-OpenMP programs, is usually inferred from the SSA form; I assume that, for whatever reason, OpenMP makes SSA weaker.

Unfortunately, this also means that there is one nasty change that might be required to make code friendlier to OpenMP, and that is a change in the prototypes of functions. The parameters to a function are, as far as OpenMP is concerned, variables, and that means that unless you declare them const, it won’t be sharing them by default. Within a self-contained program like unpaper, changing the signatures of the functions so that parameters are declared, for instance, const int is painless; for a library, though, it would be API breakage.

Anyway, just remember: adding multithreading is not the silver bullet you might think it is!

P.S.: I’m currently trying to gauge interest in a volume that would collect, re-edit, and organize all I have written up to now on ELF. I’ve created a page on leanpub where you can note down your interest, and how much you think such a volume would be worth. If you would like to read such a book, just add your interest through the form; as soon as at least a dozen people are interested, I’ll spend all my free time working on the editing.

The destiny of Ruby-Elf

After my last quick post, I decided to look more into what I could do to save the year of work I put into Ruby-Elf. After fighting a bit to understand how the new String class from Ruby 1.9 works, I could get the testsuite to pass on Ruby 1.9, and cowstats to also report the correct data.

Unfortunately, this broke Ruby 1.8 support, as far as I can see because IO#readpartial does not work that well on 1.8, even for on-disk files; the same happens with JRuby.

After getting Ruby 1.9 to work, the obvious next task was to make cowstats work with multiple threads. The final idea is to have a -j parameter akin to make’s, but for now I only wanted to create one thread per file to scan. In theory, given native threading, all the threads would execute at once, scheduled by the system scheduler, making it possible to saturate the 8 cores, with 800% CPU time as a theoretical maximum.

Unfortunately reality soon kicked in, and the ruby19 process limited itself to 100%, which means a single core out of eight, which also means no parallel scan. A quick glance through the sources shows that while YARV (the Ruby 1.9 VM) lists three possible methods to achieve multithreading, only the second one is currently implemented:

  • The first method is the old one, green threading, which basically means simulated threads: the code never executes in parallel but uses an event-loop-like construct to switch execution between different “threads”.
  • The second method makes use of a giant lock, in this case called the Giant VM Lock (GVL), known in Python as the GIL (Global Interpreter Lock). The threads are scheduled by the operating system, which allows for fairer scheduling among execution threads, but still allows just one thread per VM to execute at a time.
  • The third method is the one I was hoping for, and allows multiple threads to execute at the same time on different cores within the same virtual machine; instead of a single lock on the whole VM, finer-grained locks are scattered around the code to protect just the resources each thread needs.

I also checked this out on JRuby, to compare; unfortunately the JRuby in Portage cannot handle the code as I changed it to work with Ruby 1.9, so I have been unable to actually benchmark a working run of cowstats with it. I could see, though, that the CPU usage of JRuby spiked at 250%, which means it is at least able to execute the threads quite independently, which proves that Ruby can be parallelised up to that point just fine.

So what is the fuss about Ruby 1.9’s new native threading support, if multiple threads cannot be executed in parallel? Well, it still allows a single process to spawn multiple VMs and execute parallel threads on them, isolated from one another. Which happens to be useful for Ruby on Rails web applications. If you think about it, the extra complexity added to deal with binary files also addresses some interesting problems that come up in environments where multiple encodings are often used, which is: web applications. Similarly, the JRuby approach, which is very fast once the JVM is loaded, works fine for applications where you start up once and then proceed to process for a long time, which again fits web applications and little else.

I’m afraid to say that what we’re going to see in the near and not-so-near future is Ruby losing its general-purpose support and focusing more and more on the web-application side of the fence. Which is sad, since I really cannot think of anything else I would like to rewrite my tools in, beside, maybe, C# (if it could be compiled to ELF; I should try Vala for that). I feel like my favourite general-purpose language is slipping away, and that I should stop worrying about it and working on it.

Multi-core systems and modern desktop usage

Both Ted Tso on Kernel Planet and LWN refer to an interview with Knuth, the author of TeX, in which he seems to criticise the choice of multi-core processors.

One quote which I find very short-sighted is:

Let me put it this way: During the past 50 years, I’ve written well over a thousand programs, many of which have substantial size. I can’t think of even five of those programs that would have been enhanced noticeably by parallelism or multithreading. Surely, for example, multiple processors are no help to TeX….

While I agree that it doesn’t help TeX, parallelism is implicit in almost all modern desktop uses (and as far as servers are concerned, I think there is little doubt about its usefulness). Find me a modern desktop where someone is running just one thread that requires CPU time.

Most people (that I know of at least) tend to use their desktop for multiple tasks: reading mail, surfing the web, listening to music, watching videos. Some of these tasks are inherently well-suited for multithreading, but there is one other thing to consider: you usually do all these things at once.

Well, maybe not exactly all at once, although that is true for multimedia, as you rarely do that exclusively; you leave mail loading while you surf. And you often leave multiple pages loading at once while you surf, thanks to tabbed browsing. Even when you’re writing something with (La)TeX, you usually have at least the editor and the TeX environment running at once.

Then start to consider other things. You’re connected to the network, almost constantly in some cases; you’ve got disks that need to be synced properly; you have processes running in the background, keeping things in cache, working for you without you seeing them. And you’ve got a mouse that you use, which needs to move a cursor. And guess what? Xorg is trying to give that its own thread.

All these things lead me to one conclusion: even if your main applications are not designed to work in parallel, most of the time adding cores will make things run more smoothly. And I’m sure that in the future we’ll see more multithreaded applications anyway, as multithreading, while not suitable for text processing (by design), works pretty well for a lot of other fields.

Myself, I’m waiting for the new Ruby with actual multithreading to port my ruby-elf tools: most of that kind of processing can work in parallel. I’d just need a decent new box. I’m looking for an EU supplier carrying Opterons, but it seems difficult to find one when buying as a person rather than a company.