You might or might not know the tool by the name of unpaper that has been in Gentoo’s main tree for a while. If you don’t know it and you scan a lot, please go look at it now, it is sweet.
But sweet or not, the tool itself had quite a few shortcomings; one of these was recently brought to my attention as unsafe use of
sprintf() that was fixed by upstream after 0.3 release, but which never arrived to a release.
When looking at fixing that one issue, I ended up deciding for a slightly more drastic approach: I forked the code, imported it to GitHub and started hacking at it. This both because the package lacked a build system, and because the tarball provided didn’t correspond with the sources on CVS (nor with those on SVN for what it’s worth).
For those who wonder why I got involved in this while this is obviously outside my usual interest area, I’m using
unpaper almost daily on my paperless quest that is actually paying its fruits (my accountant is pleasantly surprised by how little time it takes to me to find the paperwork he needs). And if I can shave even fractions of seconds from a single unpaper process it can improve my workflow considerably.
What I have now in my repository is an almost identical version that has passed through some improvements: the build system is autotools (properly written), that works quite fine even for a single-source package, as it can find a couple of features that would otherwise be ignored. The code does not have the allocation waste that it did before, as I’ve removed a number of pointers to characters with preprocessor macros, and I started looking at a few strange things in the code.
For instance, it now no longer opens the file, seek to the end, then rewind to the start to find the file’s size, which was especially unhelpful since the variable where the file’s size was saved was never read from but the stdio calls have side effects, so the compiler couldn’t drop them by itself.
And when it is present, it will use
sincosf() rather than calling
I also stopped the code from copying a string from a prebuilt table, and parse it at runtime to get the represented float value.. multiple times. This was mostly tied with the page size parsing, which I have basically rewritten, also avoiding looping twice over the two sizes with two separate loops. Duh!
I also originally overlooked the fact that the repository had some pre-defined self-tests that were never packaged and thus couldn’t be used for custom builds before; this is also fixed now, and
make check runs the tests just fine. Unfortunately what this does not do is comparing the output with some known-good output, I need an image compare tool to do so; for now it only ensures that
unpaper behaves as expected with the commandline it is provided, better than nothing.
At any rate this is obviously only the beginning: there are bugs open on the berlios project page that I should probably look into fixing, and I have already started writing a list of TODO tasks that should be taken care of at some point or another. If you’re interested in helping out, please clone the repository and see what you can do. Testing is also very much appreciated.
I haven’t decided when to do a release, for now I’m hoping that Jens will join the fork and publish the releases on berlios based on the autotools build system. There’s a live ebuild in main tree for testing (
app-text/unpaper-9999), so I’d be grateful if you could try it on a few different systems. Please enable
FEATURES=test for it so that if something breaks we’ll know son enough. If you’re a maintainer packaging
unpaper on other distributions, feel free to get in touch with me and tell me if you’ve other patches to provide (I should mail the Debian maintainer at this point I guess).