I’m told Rust is great, where are the graphics libraries?

While I’m still a bit sour that Mozilla decided to use the same name for their language as an old project of mine (which is not a new thing for Mozilla anyway, if someone remembers the days of Phoenix and Firebird), I have been looking from the sideline as the Rust language as a way forward to replace so many applications of embedded C, with a significantly safer alternative.

I have indeed been happy to see so much UEFI work happening in Rust, because it seems to me like we came far enough that we can sacrifice some of the extreme performance of C for some safety.

But one thing that I still have not seen is a good selection of graphics libraries, and that is something that I’m fairly disappointed by. Indeed, I have been told that there are Rust bindings for the classic C graphics libraries — which is pointless, as then the part that needs safety (the parsing) is still performed in C!

The reason why I’m angry about this is that I still have one project, unpaper, which I inherited as a big chunk of C and could definitely be rewritten into a safer language. But I would rather not do so in a higher level language like Python due to the already slow floating point calculations and huge memory usage.

Right now, unpaper is using libav, or ffmpeg, or something with their interface, depending on how much they fought this year. This is painful, but given that each graphic library implements interfaces in different ways, I couldn’t find a better and safe way to implement graphics processing. I was hoping that with all the focus on Rust out there, particularly from Mozilla, implementing graphics parsing libraries would be high in the list of priorities.

I think it’s librsvg that was ported to Rust — which was probably a great idea to prioritize, given it is exactly the type of format where C performs very poorly: string parsing. But I’m surprised nobody tried to make an API-compatible libpng or libtiff. It sounds to me like Rust is the perfect language for this type of work.

At any rate, if anyone finally decides to implement a generic graphic file input/output library, with at least support for TIFF, PNM and PNG, I’d love to know. And after that I would be happy to port unpaper to it — or if someone wants to take unpaper code as the basis to reimplement it as a proof of concept, that’d be awesome.

The problem for a lot of these libraries is that you have to maintain support for a long list of quirks and extensions that over time piled up on the formats. And while you can easily write tests to maintain bit-wise compatibility with the “original” C language based libraries for what concerns bitmap rendering (even for non-bitmap graphics such as JPEG and WebP), there are more things that are not obvious to implement, such as colour profiles, and metadata in general.

Actually, I think that there is a lot of space here to build up a standard set of libraries for graphics libraries and metadata, since there’s at least some overlapping between these, and having a bigger group of people working on separate, but API-similar libraries for various graphic formats would be a significant advantage for Rust over other languages.

Small differences don’t matter (to unpaper)

After my challenge with the fused multiply-add instructions I managed to find some time to write a new test utility. It’s written ad hoc for unpaper but it can probably be used for other things too. It’s trivial and stupid but it got the job done.

What it does is simple: it loads both a golden and a result image files, compares the size and format, and then goes through all the bytes to identify how many differences are there between them. If less than 0.1% of the image surface changed, it consider the test a pass.

It’s not a particularly nice system, especially as it requires me to bundle some 180MB of golden files (they compress to just about 10 MB so it’s not a big deal), but it’s a strict improvement compared to what I had before, which is good.

This change actually allowed me to explore one change that I abandoned before because it resulted in non-pixel-perfect results. In particular, unpaper now uses single-precision floating points all over, rather than doubles. This is because the slight imperfection caused by this change are not relevant enough to warrant the ever-so-slight loss in performance due to the bigger variables.

But even up to here, there is very little gain in performance. Sure some calculation can be faster this way, but we’re still using the same set of AVX/FMA instructions. This is unfortunate, unless you start rewriting the algorithms used for searching for edges or rotations, there is no gain to be made by changing the size of the code. When I converted unpaper to use libavcodec, I decided to make the code simple and as stupid as I could make it, as that meant I could have a baseline to improve from, but I’m not sure what the best way to improve it is, now.

I still have a branch that uses OpenMP for the processing, but since most of the filters applied are dependent on each other it does not work very well. Per-row processing gets slightly better results but they are really minimal as well. I think the most interesting parallel processing low-hanging fruit would be to execute processing in parallel on the two pages after splitting them from a single sheet of paper. Unfortunately, the loops used to do that processing right now are so complicated that I’m not looking forward to touch them for a long while.

I tried some basic profile-guided optimization execution, just to figure out what needs to be improved, and compared with codiff a proper release and a PGO version trained after the tests. Unfortunately the results are a bit vague and it means I’ll probably have to profile it properly if I want to get data out of it. If you’re curious here is the output when using rbelf-size -D on the unpaper binary when built normally, with profile-guided optimisation, with link-time optimisation, and with both profile-guided and link-time optimisation:

% rbelf-size -D ../release/unpaper ../release-pgo/unpaper ../release-lto/unpaper ../release-lto-pgo/unpaper
    exec         data       rodata        relro          bss     overhead    allocated   filename
   34951         1396        22284            0        11072         3196        72899   ../release/unpaper
   +5648         +312         -192           +0         +160           -6        +5922   ../release-pgo/unpaper
    -272           +0        -1364           +0         +144          -55        -1547   ../release-lto/unpaper
   +7424         +448        -1596           +0         +304          -61        +6519   ../release-lto-pgo/unpaper

It’s unfortunate that GCC does not give you any diagnostic on what it’s trying to do achieve when doing LTO, it would be interesting to see if you could steer the compiler to produce better code without it as well.

Anyway, enough with the microptimisations for now. If you want to make unpaper faster, feel free to send me pull requests for it, I’ll be glad to take a look at them!

The subtlety of modern CPUs, or the search for the phantom bug

Yesterday I have released a new version of unpaper which is now in Portage, even though is dependencies are not exactly straightforward after making it use libav. But when I packaged it, I realized that the tests were failing — but I have been sure to run the tests all the time while making changes to make sure not to break the algorithms which (as you may remember) I have not designed or written — I don’t really have enough math to figure out what’s going on with them. I was able to simplify a few things but I needed Luca’s help for the most part.

Turned out that the problem only happened when building with -O2 -march=native so I decided to restrict tests and look into it in the morning again. Indeed, on Excelsior, using -march=native would cause it to fail, but on my laptop (where I have been running the test after every single commit), it would not fail. Why? Furthermore, Luca was also reporting test failures on his laptop with OSX and clang, but I had not tested there to begin with.

A quick inspection of one of the failing tests’ outputs with vbindiff showed that the diffs would be quite minimal, one bit off at some non-obvious interval. It smelled like a very minimal change. After complaining on G+, Måns pushed me to the right direction: some instruction set that differs between the two.

My laptop uses the core-avx-i arch, while the server uses bdver1. They have different levels of SSE4 support – AMD having their own SSE4a implementation – and different extensions. I should probably have paid more attention here and noticed how the Bulldozer has FMA4 instructions, but I did not, it’ll show important later.

I decided to start disabling extensions in alphabetical order, mostly expecting the problem to be in AMD’s implementation of some instructions pending some microcode update. When I disabled AVX, the problem went away — AVX has essentially a new encoding of instructions, so enabling AVX causes all the instructions otherwise present in SSE to be re-encoded, and is a dependency for FMA4 instructions to be usable.

The problem was reducing the code enough to be able to figure out if the problem was a bug in the code, in the compiler, in the CPU or just in the assumptions. Given that unpaper is over five thousands lines of code and comments, I needed to reduce it a lot. Luckily, there are ways around it.

The first step is to look in which part of the code the problem appears. Luckily unpaper is designed with a bunch of functions that run one after the other. I started disabling filters and masks and I was able to limit the problem to the deskewing code — which is when most of the problems happened before.

But even the deskewing code is a lot — and it depends on at least some part of the general processing to be run, including loading the file and converting it to an AVFrame structure. I decided to try to reduce the code to a standalone unit calling into the full deskewing code. But when I copied over and looked at how much code was involved, between the skew detection and the actual rotation, it was still a lot. I decided to start looking with gdb to figure out which of the two halves was misbehaving.

The interface between the two halves is well-defined: the first return the detected skew, and the latter takes the rotation to apply (the negative value to what the first returned) and the image to apply it to. It’s easy. A quick look through gdb on the call to rotate() in both a working and failing setup told me that the returned value from the first half matched perfectly, this is great because it meant that the surface to inspect was heavily reduced.

Since I did not want to have to test all the code to load the file from disk and decode it into a RAW representation, I looked into the gdb manual and found the dump commands that allows you to dump part of the process’s memory into a file. I dumped the AVFrame::data content, and decided to use that as an input. At first I decided to just compile it into the binary (you only need to use xxd -i to generate C code that declares the whole binary file as a byte array) but it turns out that GCC is not designed to compile efficiently a 17MB binary blob passed in as a byte array. I then opted in for just opening the raw binary file and fread() it into the AVFrame object.

My original plan involved using creduce to find the minimal set of code needed to trigger the problem, but it was tricky, especially when trying to match a complete file output to the md5. I decided to proceed with the reduction manually, starting from all the conditional for pixel formats that were not exercised… and then I realized that I could split again the code in two operations. Indeed while the main interface is only rotate(), there were two logical parts of the code in use, one translating the coordinates before-and-after the rotation, and the interpolation code that would read the old pixels and write the new ones. This latter part also depended on all the code to set the pixel in place starting from its components.

By writing as output the calls to the interpolation function, I was able to restrict the issue to the coordinate translation code, rather than the interpolation one, which made it much better: the reduced test case went down to a handful of lines:

void rotate(const float radians, AVFrame *source, AVFrame *target) {
    const int w = source->width;
    const int h = source->height;

    // create 2D rotation matrix
    const float sinval = sinf(radians);
    const float cosval = cosf(radians);
    const float midX = w / 2.0f;
    const float midY = h / 2.0f;

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            const float srcX = midX + (x - midX) * cosval + (y - midY) * sinval;
            const float srcY = midY + (y - midY) * cosval - (x - midX) * sinval;
            externalCall(srcX, srcY);

Here externalCall being a simple function to extrapolate the values, the only thing it does is printing them on the standard error stream. In this version there is still reference to the input and output AVFrame objects, but as you can notice there is no usage of them, which means that now the testcase is self-contained and does not require any input or output file.

Much better but still too much code to go through. The inner loop over x was simple to remove, just hardwire it to zero and the compiler still was able to reproduce the problem, but if I hardwired y to zero, then the compiler would trigger constant propagation and just pre-calculate the right value, whether or not AVX was in use.

At this point I was able to execute creduce; I only needed to check for the first line of the output to match the “incorrect” version, and no input was requested (the radians value was fixed). Unfortunately it turns out that using creduce with loops is not a great idea, because it is well possible for it to reduce away the y++ statement or the y < h comparison for exit, and then you’re in trouble. Indeed it got stuck multiple times in infinite loops on my code.

But it did help a little bit to simplify the calculation. And with again a lot of help by Måns on making sure that the sinf()/cosf() functions would not return different values – they don’t, also they are actually collapsed by the compiler to a single call to sincosf(), so you don’t have to write ugly code to leverage it! – I brought down the code to

extern void externCall(float);
extern float sinrotation();
extern float cosrotation();

static const float midX = 850.5f;
static const float midY = 1753.5f;

void main() {
    const float srcX = midX * cosrotation() - midY * sinrotation();

No external libraries, not even libm. The external functions are in a separate source file, and beside providing fixed values for sine and cosine, the externCall() function only calls printf() with the provided value. Oh if you’re curious, the radians parameter became 0.6f, because 0, 1 and 0.5 would not trigger the behaviour, but 0.6 which is the truncated version of the actual parameter coming from the test file, would.

Checking the generated assembly code for the function then pointed out the problem, at least to Måns who actually knows Intel assembly. Here follows a diff of the code above, built with -march=bdver1 and with -march=bdver1 -mno-fma4 — because turns out the instruction causing the problem is not an AVX one but an FMA4, more on that after the diff.

        movq    -8(%rbp), %rax
        xorq    %fs:40, %rax
        jne     .L6
-       vmovss  -20(%rbp), %xmm2
-       vmulss  .LC1(%rip), %xmm0, %xmm0
-       vmulss  .LC0(%rip), %xmm2, %xmm1
+       vmulss  .LC1(%rip), %xmm0, %xmm0
+       vmovss  -20(%rbp), %xmm1
+       vfmsubss        %xmm0, .LC0(%rip), %xmm1, %xmm0
        .cfi_def_cfa 7, 8
-       vsubss  %xmm0, %xmm1, %xmm0
        jmp     externCall@PLT

It’s interesting that it’s changing the order of the instructions as well, as well as the constants — for this diff I have manually swapped .LC0 and .LC1 on one side of the diff, as they would just end up with different names due to instruction ordering.

As you can see, the FMA4 version has one instruction less: vfmsubss replaces both one of the vmulss and the one vsubss instruction. vfmsubss is a FMA4 instruction that performs a Fused Multiply and Subtract operation — midX * cosrotation() - midY * sinrotation() indeed has a multiply and subtract!

Originally, since I was disabling the whole AVX instruction set, all the vmulss instructions would end up replaced by mulss which is the SSE version of the same instruction. But when I realized that the missing correspondence was vfmsubss and I googled for it, it was obvious that FMA4 was the culprit, not the whole AVX.

Great, but how does that explain the failure on Luca’s laptop? He’s not so crazy so use an AMD laptop — nobody would be! Well, turns out that Intel also have their Fused Multiply-Add instruction set, just only with three operands rather than four, starting from Haswell CPUs, which include… Luca’s laptop. A quick check on my NUC which also has a Haswell CPU confirms that the problem exists also for the core-avx2 architecture, even though the code diff is slightly less obvious:

        movq    -24(%rbp), %rax
        xorq    %fs:40, %rax
        jne     .L6
-       vmulss  .LC1(%rip), %xmm0, %xmm0
-       vmovd   %ebx, %xmm2
-       vmulss  .LC0(%rip), %xmm2, %xmm1
+       vmulss  .LC1(%rip), %xmm0, %xmm0
+       vmovd   %ebx, %xmm1
+       vfmsub132ss     .LC0(%rip), %xmm0, %xmm1
        addq    $24, %rsp
+       vmovaps %xmm1, %xmm0
        popq    %rbx
-       vsubss  %xmm0, %xmm1, %xmm0
        popq    %rbp
        .cfi_def_cfa 7, 8

Once again I swapped .LC0 and .LC1 afterwards for consistency.

The main difference here is that the instruction for fused multiply-subtract is vfmsub132ss and a vmovaps is involved as well. If I read the documentation correctly this is because it stores the result in %xmm1 but needs to move it to %xmm0 to pass it to the external function. I’m not enough of an expert to tell whether gcc is doing extra work here.

So why is this instruction causing problems? Well, Måns knew and pointed out that the result is now more precise, thus I should not work it around. Wikipedia, as linked before, points also out why this happens:

A fused multiply–add is a floating-point multiply–add operation performed in one step, with a single rounding. That is, where an unfused multiply–add would compute the product b×c, round it to N significant bits, add the result to a, and round back to N significant bits, a fused multiply–add would compute the entire sum a+b×c to its full precision before rounding the final result down to N significant bits.

Unfortunately this does mean that we can’t have bitexactness of images for CPUs that implement fused operations. Which means my current test harness is not good, as it compares the MD5 of the output with the golden output from the original test. My probable next move is to use cmp to count how many bytes differ from the “golden” output (the version without optimisations in use), and if the number is low, like less than 1‰, accept it as valid. It’s probably not ideal and could lead to further variation in output, but it might be a good start.

Optimally, as I said a long time ago I would like to use a tool like pdiff to tell whether there is actual changes in the pixels, and identify things like 1-pixel translation to any direction, which would be harmless… but until I can figure something out, it’ll be an imperfect testsuite anyway.

A huge thanks to Måns for the immense help, without him I wouldn’t have figured it out so quickly.

unpaper and libav status update

The other day I wrote about unpaper and the fact that I was working on making it use libav for file input. I have now finished converting unpaper (in a branch) so that it does not use its own image structure, but rather the same AVFrame structure that libav uses internally and externally. This meant not only supporting stripes, but using the libav allocation functions and pixel formats.

This also enabled me to use libav for file output as well as input. While for the input I decided to add support for formats that unpaper did not read before, for output at the moment I’m sticking with the same formats as before. Mostly because the one type of output file I’d like to support is not currently supported by libav properly, so it’ll take me quite a bit longer to be able to use it. For the curious, the format I’m referring to is multipage TIFF. Right now libav only supports single-page TIFF and it does not support JPEG-compressed TIFF images, so there.

Originally, I planned to drop compatibility with previous unpaper version, mostly because to drop the internal structure I was going to lose the input format information for 1-bit black and white images. At the end I was actually able to reimplement the same feature in a different way, and so I restored that support. The only compatibility issue right now is that the -depth parameter is no longer present, mostly because it and -type constrained the same value (the output format).

To reintroduce the -depth parameter, I want to support 16-bit gray. Unfortunately to do so I need to make more fundamental changes to the code, as right now it expects to be able to get the full value at most at 24 bit — and I’m not sure how to scale a 16-bit grayscale to 24-bit RGB and maintain proper values.

While I had to add almost as much code to support the libav formats and their conversion as there was there to load the files, I think this is still a net win. The first point is that there is no format parsing code in unpaper, which means that as long as the pixel format is something that I can process, any file that libav supports now or will support in the future will do. Then there is the fact that I ended up making the code “less smart” by removing codepath optimizations such as “input and output sizes match, so I won’t be touching it, instead I’ll copy one structure on top of the other”, which means that yes, I probably lost some performance, but I also gained some sanity. The code was horribly complicated before.

Unfortunately, as I said in the previous post, there are a couple of features that I would have preferred if they were implemented in libav, as that would mean they’d be kept optimized without me having to bother with assembly or intrinsics. Namely pixel format conversion (which should be part of the proposed libavscale, still not reified), and drawing primitives, including bitblitting. I think part of this is actually implemented within libavfilter but as far as I know it’s not exposed for other software to use. Having optimized blitting, especially “copy this area of the image over to that other image” would be definitely useful, but it’s not a necessary condition for me to release the current state of the code.

So current work in progress is to support grayscale TIFF files (PAL8 pixel format), and then I’ll probably turn to libav and try to implement JPEG-encoded TIFF files, if I can find the time and motivation to do so. What I’m afraid of is having to write conversion functions between YUV and RGB, I really don’t look forward to that. In the mean time, I’ll keep playing Tales of Graces f because I love those kind of games.

Also, for those who’re curious, the development of this version of unpaper is done fully on my ZenBook — I note this because it’s the first time I use a low-power device to work on a project that actually requires some processing power to build, but the results are not bad at all. I only had to make sure I had swap enabled: 4GB of RAM are no longer enough to have Chrome open with a dozen tabs, and a compiler in the background.

unpaper and libav

I’ve resumed working on unpaper since I have been using it more than a couple of times lately and there has been a few things that I wanted to fix.

What I’ve been working on now is a way to read input files in more formats; I was really aggravated by the fact that unpaper implemented its own loading of a single set of file formats (the PPM “rawbits”); I went on to look into libraries that abstract access to image formats, but I couldn’t find one that would work for me. At the end I settled for libav even though it’s not exactly known for being an image processing library.

My reasons to choose libav was mostly found in the fact that, while it does not support all the formats I’d like to have supported in unpaper (PS and PDF come to mind), it does support the formats that it supports now (PNM and company), and I know the developers well enough that I can get bugs and features fixed or implemented as needed.

I have now a branch can read files by using libav. It’s a very naïve implementation of it though: it reads the image into an AVFrame structure and then convert that into unpaper’s own image structure. It does not even free up the AVFrame, mostly because I’d actually like to be able to use AVFrame instead of unpaper’s structure. Not only to avoid copying memory when it’s not required (libav has functions to do shallow-copy of frames and mark them as readable when needed), but also because the frames themselves already contain all the needed information. Furthermore, libav 12 is likely going to include libavscale (or so Luca promised!) so that the on-load conversion can also be offloaded to the library.

Even with the naïve implementation that I implemented in half an afternoon, unpaper not only supports the same input file as before, but also PNG (24-bit non-alpha colour files are loaded the same way as PPM, 1-bit black and white is inverted compared to PBM, while 8-bit grayscale is actually 16-bit with half of it defining the alpha channel) and very limited TIFF support (1-bit is the same as PNG; 8-bit is paletted so I have not implemented it yet, and as for colour, I found out that libav does not currently support JPEG-compressed TIFF – I’ll work on that if I can – but otherwise it is supported as it’s simply 24bpp RGB).

What also needs to be done is to write out the file using libav too. While I don’t plan to allow writing files in any random format with unpaper, I wouldn’t mind being able to output through libav. Right now the way this is implemented, the code does explicit conversion back or forth between black/white, grayscale and colour at save time, and this is nothing different than the same conversion that happens at load time, and should rather be part of libavscale when that exists.

Anyway, if you feel like helping with this project, the code is on GitHub and I’ll try to keep it updated soon.

More threads is not always a good solution

You might remember that some time ago I took over unpaper as I used (and sort of use still) to pre-filter the documents I scan to archive in electronic form. While I’ve been very busy in the past few months, I’m now trying to clear out my backlog, and it turns out that a good deal of it involves unpaper. There are bugs to fix and features to implement, but also patches to evaluate.

One of the most recent patches I received is designed to help with the performance of the tool, which are definitely not what Jens, the original author, had in mind when he came up with the code. Better performance is something that about everybody would love to have at this point. Unfortunately, it turns out that the patch is not working out as intended.

The patch is two-fold: from one side, it makes (optional) use of OpenMP to parallelize the code in hope to speed it up. Given that most of the code is a bunch of loops, it seemed obvious that using multithreading would speed things out, right? Well, first try after applying it showed very easily that it slows down, at least on Excelsior, which is a 32-cores system. While it would take less than 10 seconds for the first test to run without OpenMP, it would take over a minute with it, spinning up all the cores for a 3000% CPU usage.

A quick test shows that forcing the number of threads to 8, rather than leaving it unbound, makes it actually faster than the non-OpenMP variant. This means that there are a few different variables in play that needs to be tuned for performance to improve. Without going into profiling the code, I can figure out a few things that can go wrong with unchecked multithreading:

  • extensive locking when each worker thread is running, either because they are all accessing the same memory page, or because the loop needs a “reduced” result (e.g. a value has to be calculated as the sum of values calculated within a parallelized loop); in the case of unpaper I’m sure both situations happen fairly often in the looping codepath.
  • cache trashing: as the worker threads jump around the memory area to process, it no longer is fetching a linear amount of memory;
  • something entirely more complicated.

Beside being obvious that doing the “stupid” thing and just making all the major loops parallel, this situation is bringing me to the point where I finally will make use of the Using OpenMP book that I got a few years back and I only started reading after figuring out that OpenMP was not ready yet for prime time. Nowadays OpenMP support in Linux improved to the point it’s probably worth taking another look at it, and I guess unpaper is going to be the test case for it. You can expect a series of blog posts on the topic at this point.

The first thing I noticed while reading the way OpenMP handles shared and private variables, is that the const indication is much stronger when using OpenMP. The reason is that if you tell the code that a given datum is not going to change (it’s a constant, not a variable), it can easily assume direct access from all the threads will work properly; the variable is shared by default among all of them. This is something that, for non-OpenMP programs, is usually inferred from the SSA form — I assume that for whatever reason, OpenMP makes SSA weaker.

Unfortunately, this also means that there is one nasty change tha tmight be required to make code friendlier to OpenMP, and that is a change in prototypes of functions. The parameters to a function are, for what OpenMP is concerned, variables, and that means that unless you declare them const, it won’t be sharing them by default. Within a self-contained program like unpaper, changing the signatures of the functions so that parameters are declared, for instance, const int is painless — but for a library it would be API breakage.

Anyway, just remember: adding multithreading is not the silver bullet you might think it is!

P.S.: I’m currently trying to gauge interest on a volume that would collect, re-edit, and organize all I written up to now on ELF. I’ve created a page on leanpub where you can note down your interest, and how much you think such a volume would be worth. If you would like to read such a book, just add your interest through the form; as soon as at least a dozen people are interested, I’ll spend all my free time working on the editing.

Unpaper (un)planning

So after just shy of an year of me forking unpaper to save it from the possible shutdown of BerliOS, and to update it to fit better in a 2012 distribution, I guess it’s time to have some ideas on how to move forward.

Now, while I haven’t done as much work on it as I’d have liked to begin with, we have a new series of packages, version 0.4.x, which has been packaged in Gentoo since day one (obviously), and is now available in Debian as well; these versions have a proper, Autotools-based build system, source code split in multiple files and cleaned up a little bit, and have a rewritten option parsing, although an imperfect one still.

The main issue with the option parsing is that the original parameters are not compatible with your average Unix long options — the end result is that I had to come up with enough hacks to have them parsed properly, and I can’t get the command to behave like a Unix command unless I also break compatibility with its original parameters. Which is something that might very well happen.

But there is another issue in all of this: the code still uses an handmade parser of netpbm-style image files, which is quite nasty to be honest, and likely prone to issues (I haven’t tried to see if I could overflow it, but I wouldn’t be surprised if it was possible; error handling also needs lots of work). It’s obvious that what I would like to do is replace the whole image loading and saving to use an external library of some kind, which also means that it would be possible to support more input and output formats at the same time — which is an especially good thing because right now during the testing phase I have to convert some PNGs to PNM to process.

My first try has been using gdk-pixbuf: the reason is that it’s already a commonly-used library, it has been split off Gtk lately, and using it and the rest of Glib it means that unpaper then has very little of its own utility functions, as Glib provides most of it, file access, option parsing and most importantly threadpools, which I wanted to use to be able to implement --jobs inside unpaper itself. Unfortunately the gdk-pixbuf library doesn’t seem to be extremely well suited for image processing as much as it is for decoding; saving files is clumsy and even the in-memory representation doesn’t feel extremely easy to play with.

The other solution I had in mind was to use libav for loading and saving — for sure its decoders and encoders are as optimised as it makes sense and will keep so, and it’s not so difficult to foresee that if the code is rewritten to be compatible with libav itself it could just become a piece of libavfilter, making unpaper just a frontend. The problem is that this still requires a lot of work, I’m pretty sure.

And there is one more issue here to say something about: for a while I won’t have as much motivation to work on unpaper for the simple reason that I won’t have a scanner! Since I couldn’t get an US Visa at this time, I won’t be able to actually move to the United States anytime soon, which put a hold to all my plans to ship or sell my stuff here and then buy what remains there. For this reason I won’t have as much use for unpaper as I had up to now. I don’t doubt at some point I’ll have again a scanner and the need to archive data (actually I’m quite sure this will happen sooner rather than later), so I will keep looking into improving unpaper in the mean time still, but it might take a backseat to other things.

The future of Unpaper

You might have read it already, or maybe you haven’t, but it looks like Berlios is going to close down at the end of the year. I wouldn’t go as far as calling it the end of an era, but we’re pretty close. Even I have a few projects that are (still) hosted on Berlios and I need to migrate.

Interestingly, Berlios is also the original hosting for unpaper which makes me forking it a couple of months ago a very good move. Especially since if I waited too long, the repository wouldn’t have been available.. even if it’s true that the repository didn’t really contain anything useful, as it was just an import of the 0.3 sources.

At any rate, since Jens didn’t reply to my inquiries, I’ve decided to start working on a more proper takeover of the project. I have created an unpaper project page on my website, to which I switched the live ebuild’s HOMEPAGE and GitHub’s website, as well as created an ohloh project to track the development.

Oh, and I released version 0.4. Yeah I guess this was the first thing to write about, but I wanted to make it less obvious.

The new release is basically simply the first cleanup I worked on, so new build system, no changes in parameters, man page and so on. Right after releasing 0.4 I merged the changes from the new contributor, Felix Janda, who not only took the time to break up the code in multiple source files, but also improved the blur filter.

Now, the next release is likely going to be 2.0; why skipping the 1.0 release? Well, it’s not really skipping. Before Berlios shuts down I wanted to copy down the previous list of downloads so I can mirror those as well, and what I found is that… the original version number series started with 1.0, so it’s mucked up badly; in Gentoo we have no problem since 0.3 was the only one we had, but for the sake of restoring consistence, the next version is going to be 2.0.

What is going to happen with that release? Well, for sure I want to rewrite the command-line options parsing. Right now it’s very custom, as long options are sometimes prefixed with one, sometimes with two dashes, and in general it doesn’t fit the usual unix command lines; not counting the fact that the parsing is done as you go with a series of strcmp() calls, which is not what one expects, usually. I intend to rewite this with getopt_long() but the problem with that is that it will break the command-line compatibility with unpaper 0.3, which is not something I’m happy about. But we’ve got to do that sooner or later if we want a more well-blent tool.

I hope to also be able to hook into the code a different way to load and save images, using some already present image decoding and encoding library, so that it can digest images in different formats than simple PNM. In particular, I’d like to be able to execute as a single pass the conversion from multiple files to a multi-page TIFF document, which requires quite a bit of work indeed. But one can dream, can’t I?

In the mean time, I hope to find some time this week to find a way to generate man pages on my server so that I can publish more complete documentation for both Ruby-Elf and unpaper itself. This is likely going to be difficult, since I’m starting some new tasks next week but.. you never know.

Unpaper fork, part 2

Last month I posted a call to action hoping for help with cleaning up the unpaper code, as the original author has not been updating it since 2007, and it had a few issues. While I have seen some interest in said fork and cleanup, nobody stepped up with help, so it is proceeding, albeit slowly.

What is available now in my GitHub repository is mostly cleaned up, although still not extremely more optimised than the original — I actually removed one of the “optimisations” I added since the fork: the usage of sincosf() function. As Freddie pointed out in the other post’s comments, the compiler has a better chance of optimising this itself; indeed both GCC and PathScale’s compiler optimise two sin and cos calls with the same argument into a single sincos call, which is good. And using two separate calls allows declaring the temporary used to store the results as constant.

And indeed today I started rewriting the functions so that temporaries are declared as constant as possible, and with the most limited scope as it’s applicable to theme. This was important to me for one reason: I want to try making use of OpenMP to improve its performance on modern multicore systems. Since most of the processing is applied independently to each pixel, it should be possible for many iterative cycles to be executed in parallel.

It would also be a major win in my book if the processing of input pages was feasible in parallel as well: my current scan script has to process the scanned sheets in parallel itself, calling many copies of unpaper, just to process the sheets faster (sometimes I scan tens of sheets, such as bank contracts and similar). I just wonder if it makes sense to simply start as many threads as possible, each one handling one sheet, or if that would risk to hog the scheduler.

Finally there is the problem of testing. Freddie also pointed me at the software I remembered to check the differences between two image files: pdiff — which is used by the ChromiumOS build process, by the way. Unfortunately I then remembered why I didn’t like it: it uses the FreeImage library, which bundles a number of other image format libraries, and upstream refuses to apply sane development to it.

What would be nice for this would be to either modify pdiff to use a different library – such as libav! – to access the image data, or to find or write something similar that does not require such stupidly-designed libraries.

Speaking about image formats, it would be interesting to get unpaper to support other image format beside PNM; this way you wouldn’t have to keep converting from and to the other formats when processing. One idea that Luca gave me was to make use of libav itself to handle that part: it already supports PNM, PNG, JPEG and TIFF, so it would provide most of the features it’d be needing.

In the mean time, please let me know if you like how this is doing — and remember that this blog, the post and me are Flattr enabled!

Forking unpaper, call for action

You might or might not know the tool by the name of unpaper that has been in Gentoo’s main tree for a while. If you don’t know it and you scan a lot, please go look at it now, it is sweet.

But sweet or not, the tool itself had quite a few shortcomings; one of these was recently brought to my attention as unsafe use of sprintf() that was fixed by upstream after 0.3 release, but which never arrived to a release.

When looking at fixing that one issue, I ended up deciding for a slightly more drastic approach: I forked the code, imported it to github and started hacking at it. This both because the package lacked a build system, and because the tarball provided didn’t correspond with the sources on CVS (nor with those on SVN for what it’s worth).

For those who wonder why I got involved in this while this is obviously outside my usual interest area, I’m using unpaper almost daily on my paperless quest that is actually paying its fruits (my accountant is pleasantly surprised by how little time it takes to me to find the paperwork he needs). And if I can shave even fractions of seconds from a single unpaper process it can improve my workflow considerably.

What I have now in my repository is an almost identical version that has passed through some improvements: the build system is autotools (properly written), that works quite fine even for a single-source package, as it can find a couple of features that would otherwise be ignored. The code does not have the allocation waste that it did before, as I’ve removed a number of pointers to characters with preprocessor macros, and I started looking at a few strange things in the code.

For instance, it now no longer opens the file, seek to the end, then rewind to the start to find the file’s size, which was especially unhelpful since the variable where the file’s size was saved was never read from but the stdio calls have side effects, so the compiler couldn’t drop them by itself.

And when it is present, it will use sincosf() rather than calling sin() and cos() separately.

I also stopped the code from copying a string from a prebuilt table, and parse it at runtime to get the represented float value.. multiple times. This was mostly tied with the page size parsing, which I have basically rewritten, also avoiding looping twice over the two sizes with two separate loops. Duh!

I also originally overlooked the fact that the repository had some pre-defined self-tests that were never packaged and thus couldn’t be used for custom builds before; this is also fixed now, and make check runs the tests just fine. Unfortunately what this does not do is comparing the output with some known-good output, I need an image compare tool to do so; for now it only ensures that unpaper behaves as expected with the commandline it is provided, better than nothing.

At any rate this is obviously only the beginning: there are bugs open on the berlios project page that I should probably look into fixing, and I have already started writing a list of TODO tasks that should be taken care of at some point or another. If you’re interested in helping out, please clone the repository and see what you can do. Testing is also very much appreciated.

I haven’t decided when to do a release, for now I’m hoping that Jens will join the fork and publish the releases on berlios based on the autotools build system. There’s a live ebuild in main tree for testing (app-text/unpaper-9999), so I’d be grateful if you could try it on a few different systems. Please enable FEATURES=test for it so that if something breaks we’ll know son enough. If you’re a maintainer packaging unpaper on other distributions, feel free to get in touch with me and tell me if you’ve other patches to provide (I should mail the Debian maintainer at this point I guess).