You may or may not know unpaper, a tool that has been in Gentoo’s main tree for a while. If you don’t know it and you scan a lot, please go look at it now; it is sweet.
But sweet or not, the tool itself had quite a few shortcomings; one of these was recently brought to my attention: an unsafe use of sprintf() that upstream fixed after the 0.3 release, but the fix never made it into a release.
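To illustrate the class of problem (this is not the actual upstream fix), here is a minimal sketch of the usual remedy: replace sprintf() with snprintf() and check the return value for truncation. The helper name format_page_name and the file name pattern are made up for the example and are not taken from unpaper's code.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper, not from unpaper: writes "<base>-<page>.pbm" into buf.
 * snprintf() never writes more than buflen bytes (including the NUL),
 * whereas sprintf() would happily overrun a buffer that is too small.
 * Returns 0 on success, -1 on an output error or if the name did not fit. */
int format_page_name(char *buf, size_t buflen, const char *base, int page)
{
    int n = snprintf(buf, buflen, "%s-%03d.pbm", base, page);

    /* n is the length the full string would have had; n >= buflen
     * means the output was truncated. */
    if (n < 0 || (size_t)n >= buflen)
        return -1;

    return 0;
}
```

A caller would pass sizeof on its buffer as buflen and treat -1 as an error instead of using the truncated name.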
When looking at fixing that one issue, I ended up deciding on a slightly more drastic approach: I forked the code, imported it into GitHub and started hacking at it. I did this both because the package lacked a build system, and because the tarball provided didn’t match the sources in CVS (nor those in SVN, for what it’s worth).
For those who wonder why I got involved in this when it is obviously outside my usual area of interest: I’m using unpaper almost daily on my paperless quest, which is actually bearing fruit (my accountant is pleasantly surprised by how little time it takes me to find the paperwork he needs). And if I can shave even fractions of a second off a single unpaper process, it can improve my workflow considerably.
What I have now in my repository is an almost identical version that has gone through some improvements. The build system is now autotools (properly written), which works quite well even for a single-source package, as it can detect a couple of features that would otherwise be ignored. The code no longer has the allocation waste it had before, as I’ve replaced a number of pointers to characters with preprocessor macros, and I started looking at a few strange things in the code.
For instance, it no longer opens the file, seeks to the end, then rewinds to the start to find the file’s size. This was especially unhelpful since the variable where the file’s size was saved was never read, but because the stdio calls have side effects, the compiler couldn’t drop them by itself.
And when it is present, it will use sincosf() rather than calling sin() and cos() separately.
I also stopped the code from copying a string out of a prebuilt table and parsing it at runtime to get the float value it represents... multiple times. This was mostly tied to the page size parsing, which I have basically rewritten, also avoiding looping twice over the two sizes with two separate loops. Duh!
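As a rough sketch of the idea (not the actual rewritten code), the table can store the dimensions as floats directly, so nothing is copied or parsed at runtime, and a single loop fills in both width and height. The struct, the function name lookup_paper_size, the table entries and the choice of centimetres are all assumptions made for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry: the dimensions are stored as floats
 * (assumed here to be centimetres) instead of strings that would
 * need to be parsed every time they are used. */
struct paper_size {
    const char *name;
    float width;
    float height;
};

static const struct paper_size paper_sizes[] = {
    { "a4",     21.0f,  29.7f  },
    { "a5",     14.8f,  21.0f  },
    { "letter", 21.59f, 27.94f },
};

/* Looks up a named size in one pass and fills in both dimensions at once.
 * Returns 0 on success, -1 if the name is not in the table. */
int lookup_paper_size(const char *name, float *width, float *height)
{
    size_t count = sizeof(paper_sizes) / sizeof(paper_sizes[0]);

    for (size_t i = 0; i < count; i++) {
        if (strcmp(paper_sizes[i].name, name) == 0) {
            *width = paper_sizes[i].width;
            *height = paper_sizes[i].height;
            return 0;
        }
    }

    return -1;
}
```

The point of the design is that the compiler places the float constants in read-only data, so the lookup costs a handful of string comparisons and no allocation or parsing.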
I also originally overlooked the fact that the repository had some predefined self-tests that were never packaged and thus couldn’t be used for custom builds before; this is also fixed now, and make check runs the tests just fine. Unfortunately, what this does not do is compare the output with some known-good output; I need an image comparison tool to do so. For now it only ensures that unpaper behaves as expected with the command line it is given, which is better than nothing.
At any rate this is obviously only the beginning: there are bugs open on the berlios project page that I should probably look into fixing, and I have already started writing a list of TODO tasks that should be taken care of at some point or another. If you’re interested in helping out, please clone the repository and see what you can do. Testing is also very much appreciated.
I haven’t decided when to do a release; for now I’m hoping that Jens will join the fork and publish the releases on berlios based on the autotools build system. There’s a live ebuild in the main tree for testing (app-text/unpaper-9999), so I’d be grateful if you could try it on a few different systems. Please enable FEATURES=test for it so that if something breaks we’ll know soon enough. If you’re a maintainer packaging unpaper on other distributions, feel free to get in touch with me and tell me if you have other patches to provide (I should mail the Debian maintainer at this point, I guess).
I actually was messing around with unpaper and a bunch of other tools this weekend. I’ll certainly give it a try once I get my rig working better. (I don’t scan stuff often, and was trying to improvise with a DSLR. However, I’m thinking that to do it right I might as well just buy a scanner and keep it in a closet most of the time or something…)
Hi Diego, so, is scanning under Linux getting now less of a PITA, or are still only (very) few devices fully supported?
Recent EPSON scanners seem to have almost all their features supported with the AVASYS-provided drivers; although they still use binary blobs, most of their backend is open source anyway. But it’s all hit and miss, I’m afraid.
HP OfficeJet also works fine in my experience…
The MFP? Yeah they tend to work nicely thanks to the hpaio plugin. But how well they work mechanically depends on the pricetag I’m afraid.
> And when it is present, it will use sincosf() rather than calling sin() and cos() separately.
This is something best left to the compiler. For quite some time GCC has been performing this optimisation automatically. sincos() is usually uglier than its separated counterparts, if only because it requires variables to be defined to hold the results.
> Unfortunately what this does not do is comparing the output with some known-good output, I need an image compare tool to do so;
pdiff (http://pdiff.sourceforge.net/) is quite possibly what you’re after, although there are currently no Gentoo ebuilds.
Thanks Freddie! I’ll check pdiff and see about making an ebuild for it if it’s what I need. As for sincosf, yeah, I noticed that GCC can optimise that itself; I didn’t know that. I’ll probably back that change out, as it should make the build system nicer as well. Although it happens that the check for it actually helped me find a bug in PathScale’s compiler 😉