unpaper: re-designing an interface

This is going to be part of a series of posts that will appear over the next few months, with plans, but likely no progress, to move unpaper forward. I picked up unpaper many years ago, and I’ve run with it for a while, but besides a general “code health” pass over it, and some back and forth on how to parse images, I have not managed to move the project forward significantly at all. And in the spirit of what I wrote earlier, I would like to see someone else pick up the project. It’s the reason why I created an organization on GitHub to keep the repository in.

For those who are not aware, unpaper is a project I did not start — it was originally written by Jens Gulden, who, as I understand it, worked on its algorithms as part of his university work. It’s a tool that processes scanned images of text pages to make them easier to OCR, and it’s often used as a “processing step” by document processing tools, including my own.

While the tool works… acceptably well… it has a number of issues that have always made me fairly uneasy. For instance, the command line flags are far from standard and can’t be handled by an off-the-shelf parser library, relying instead on a lot of custom parsing, and they require a very long and complicated man page.

There’s also been a few requests of moving the implementation to a shared library that could be used directly, but I don’t feel like it’s worth the hassle, because the current implementation is not really thread-safe, and that would be a significant rework to make it so.

So I have been giving this some thought. The first problem is that re-designing the command line interface would mean breaking all of the programmatic users, so it’s not an easy decision to take. Then I learnt about something else that made me think I know how to solve this, although it’s not going to be easy.

If you’ve been working exclusively on Linux and Unix-like systems, and still shy away from learning about what Microsoft is doing (which, to me, is a mistake), you might have missed PowerShell and its structured objects. To over-simplify, PowerShell piping doesn’t just pipe text from one command to another, but structured objects that are kept structured in and out.

While PowerShell is available for Linux nowadays, I do not think that tying unpaper to it is a sensible option, so I’m not even suggesting that. But I also found out that the ip command (from iproute2) has recently received a -J option which, instead of printing the usual complex mix of parsable and barely human-readable output, generates a JSON document with the same information. This makes it much easier to extract the information you need, particularly with a tool like jq available, which allows “massaging” the data on the command line easily. I have actually used this “trick” at work recently. It’s a very similar idea to RPC, but with a discrete binary.
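To show the idea without depending on jq itself, here is the same kind of extraction done in Python; the field names follow iproute2’s JSON output, but the sample data is made up:

```python
import json

# A trimmed sample of the kind of document `ip -J addr` emits
# (field names follow iproute2's JSON output; the data is invented).
sample = '''
[
  {"ifname": "lo",   "addr_info": [{"family": "inet", "local": "127.0.0.1"}]},
  {"ifname": "eth0", "addr_info": [{"family": "inet", "local": "192.0.2.10"}]}
]
'''

links = json.loads(sample)

# The jq equivalent would be: ip -J addr | jq -r '.[].addr_info[].local'
addresses = [info["local"]
             for link in links
             for info in link["addr_info"]]
print(addresses)
```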

So with this knowledge in my head, I have a fairly clear idea of what I would like to have as an interface for a future unpaper.

First of all, it should become two separate command line tools — they may both be written in C, or the first might be written in Python or any other language. The job of this language-flexible tool is to be the new unpaper command line executable: it should accept exactly the same command line interface as the current binary, but implement none of the algorithm or transformation logic.
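A minimal sketch of what this wrapper could look like, handling a tiny subset of the existing flags; the job-document schema here is entirely made up for illustration, it is not anything unpaper currently defines:

```python
import argparse
import json

# Hypothetical "interface" tool: parse a small subset of the existing
# unpaper flags and emit a JSON job document for the processing tool.
# The document's field names are invented for this sketch.
def build_job(argv):
    parser = argparse.ArgumentParser(prog="unpaper")
    parser.add_argument("--layout", choices=["single", "double"],
                        default="single")
    parser.add_argument("--no-deskew", action="store_true")
    parser.add_argument("input")
    parser.add_argument("output")
    args = parser.parse_args(argv)
    return json.dumps({
        "input": args.input,
        "output": args.output,
        "layout": args.layout,
        "filters": {"deskew": not args.no_deskew},
    })

print(build_job(["--layout", "double", "in.pnm", "out.pnm"]))
```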

The other tool should be written in C, because it would contain all the current processing code. But instead of doing complex parsing of a command line interface, it should read from standard input a JSON document providing all of the parameters for the “job”.
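The shape of that input handling can be sketched as follows (in Python for brevity; the real tool would be C, and the schema is again invented):

```python
import io
import json

# Sketch of how the processing tool would consume its job document.
# Only two made-up fields are validated here.
def read_job(stream):
    job = json.load(stream)
    # Reject incomplete jobs early instead of guessing at defaults.
    for field in ("input", "output"):
        if field not in job:
            raise ValueError(f"job document is missing {field!r}")
    return job

# In the real tool this would be read_job(sys.stdin); a StringIO
# stands in for the pipe here.
job = read_job(io.StringIO('{"input": "in.pnm", "output": "out.pnm"}'))
print(job["output"])
```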

Similarly, some changes are needed to the output of the programs. Some of the information, particularly debugging, that is currently printed on the stderr stream should stay exactly where it is; but for the standard output, I think it makes significantly more sense to have the processing tool emit another JSON document, and have the interface convert it to human-readable form.

Now, with proper documentation of the JSON schema, software using unpaper as a processing step could just build its job document and skip the “human” interface. It would even make it much easier to write extensions in Python, Ruby, and any other language, as it would allow exposing a job configuration generator following the language’s own style.
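For instance, a Python binding could expose a fluent builder in the language’s own style; everything below (class name, field names) is hypothetical, not a real unpaper API:

```python
import json

# A hypothetical job-document builder in the style a Python binding
# could expose; the schema is invented for illustration.
class JobBuilder:
    def __init__(self, input_path, output_path):
        self._job = {"input": input_path,
                     "output": output_path,
                     "filters": {}}

    def deskew(self, enabled=True):
        self._job["filters"]["deskew"] = enabled
        return self  # return self to allow method chaining

    def to_json(self):
        return json.dumps(self._job)

doc = JobBuilder("in.pnm", "out.pnm").deskew(False).to_json()
print(doc)
```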

Someone might wonder why I’m talking about JSON in particular — there are dozens of structured data formats that could be used, including protocol buffers. As I said a number of months ago, the important part of a data format is its schema, so the actual format is not much of a choice by itself. On the other hand, JSON is a very flexible format with good implementations in most languages, including C (which is important, since the unpaper algorithms are implemented in C, and – short of moving to Rust – I don’t want to change language).

But there’s something even more important than the language support, which I already noted above: jq. This is an astounding work of engineering, making it so much easier to handle JSON documents, particularly inline between programs. And that is the killer reason to use JSON for me. Because that gives even more flexibility to an interface that, for the longest time, felt too rigid to me.

So if you’re interested to contribute to an open source project, with no particular timeline pressure, and you’re comfortable with writing C — please reach out, whether it is to ask questions for clarification, or with a pull request to implement this idea altogether.

And don’t forget, there’s still the Meson conversion project which I also discussed previously. For that one, some of the tasks are not even C projects! It needs someone to take the time to rewrite the man page in Sphinx, and also someone to rewrite the testing setup to be driven by Python, rather than the current mess of Automake and custom shell scripts.


I’m still not sure why I work with Rails, if it’s really for the quality of some projects I found, such as Typo, if it is because I hate most of the alternatives even more (PHP, Python/Turbogears/ToscaWidgets), or because I’m masochist.

After we all seemed to settle on Rails 2.3.5 as the final iteration of the Rails 2 series, and were ready to face a huge absurd mess with Rails 3 once released, the project decided to drop a further bombshell on us in the form of Rails 2.3.6, then .7, and finally .8, which seemed to work more or less correctly. These updates weren’t simple bugfixes; they went much further: they changed the supported version of Rack from 1.0.1 to 1.1.0 (which changes basically the whole internal engine of the framework!), as well as the versions of tmail and i18n. They also changed the tzinfo version, but that’s almost pointless when considered in Gentoo, since we actually use it unbundled and keep it up-to-date when new releases are made.

But most likely the biggest trouble with the new Rails version is that it implements an anti-XSS interface compatible with Rails 3; this caused quite a stir, because almost all Rails applications needed to be adapted for it to work. Typo, for instance, is still not compatible with 2.3.8, as far as I know! Rails extensions and other Rails-tied libraries also had to be updated, and when we’ve been lucky, upstream kept them compatible with both the old and the new interfaces.

At any rate, my current job requires me to work with Rails and with some Rails extensions; and since I’m the kind of person who steps up to do more even when it’s not paid for, I made sure I had ebuilds, with tests I could run, for all of them. This has turned out to be useful more than once, and as it happens, today was another of those days when I’m glad I’m developing Gentoo.

The first problem appeared when it came time to update to the new (minor) version of oauth2 (required to implement proper Facebook-connected login after their changes last April); for ease of use with their JavaScript framework, the new interface uses the JSON format almost exclusively, and in Ruby there is no shortage of JSON interpreters. Indeed, besides the original implementation in ActiveSupport (part of Rails), there is the JSON gem, which provides both a “pure Ruby” implementation and a compiled, C-based implementation (which, to be honest, I left broken in Gentoo for a while; I’m sorry about that, and as soon as I noticed I corrected it); then there is the one I already discussed briefly together with the problems related to Rails 2.3.8: yajl-ruby. Three is better than one, no? I beg to differ!

To make it feasible to choose between the different implementations as wanted, the oauth2 developers created a new gem, multi_json, which allows switching between them. No, it doesn’t even try to provide a single compatible exception interface, so don’t ask, please. It was time to package the new gem then, but that caused a bit of trouble by itself. Besides the (unfortunately usual) trouble with the Rakefile requiring RSpec just to declare the targets (and thus even to build documentation), and the spec target depending on the Jeweler-provided check_dependencies, the testsuite failed quite soon on both Ruby 1.9 and JRuby. The problem? It forced testing with all the supported JSON engines; but Ruby 1.9 lacks ActiveSupport (no, Rails 2.3.8 does not work with Ruby 1.9, stop asking), and JRuby lacks yajl-ruby, since that’s a C-based extension. A few changes later, the testsuite reports pending tests when the tested engine is not found, as it should have from the start. But a further problem appeared in the form of oauth2 test failures: the JSON C-based extension gets identified and loaded, but the wrong constant is used to load the engine, which results in multi_json crapping on itself. D’oh!
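The engine-selection pattern multi_json implements — try each implementation in a fixed preference order, falling through when the load fails — can be sketched like this (in Python for illustration; the module names are real Python packages, but the preference order is arbitrary):

```python
# Try each JSON implementation in preference order, falling back to
# the next when the import fails; the stdlib "json" module always
# exists, so the last candidate acts as a guaranteed fallback.
def pick_engine(candidates=("ujson", "simplejson", "json")):
    for name in candidates:
        try:
            return __import__(name)
        except ImportError:
            continue
    raise RuntimeError("no JSON engine available")

engine = pick_engine()
print(engine.__name__)
```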

This was already reported on the GitHub page; on the other hand, I resolved to fix it in a different, possibly more complete, way. Unfortunately I didn’t get it entirely right, because the order in which the engines are tested is definitely important to upstream (and yes, it seems to be more of a popularity contest than an actual technical comparison, but never mind that for now). With that fixed, another problem with oauth2 appeared, and it turned out to be caused by the ActiveSupport JSON parser: while the other parsers seem to validate the content they are given, ActiveSupport follows the “garbage in, garbage out” idea and does not raise any exception if the content is not JSON. This wouldn’t be so bad if the oauth2 code didn’t actually rely on it to choose between JSON and form-encoded parameters. D’oh! One more fix.
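The content-sniffing trick oauth2 relied on looks roughly like this (sketched in Python, not oauth2’s actual Ruby code): try the JSON parser first, and only fall back to form-encoded parsing when it raises. It only works when the JSON parser actually rejects garbage, which is exactly what ActiveSupport’s didn’t do.

```python
import json
from urllib.parse import parse_qs

# Sniff the body format by attempting a strict JSON parse first; a
# parser that silently accepts garbage breaks this scheme entirely.
def parse_response(body):
    try:
        return json.loads(body)
    except ValueError:
        # Not JSON: treat it as form-encoded key=value pairs.
        return {k: v[0] for k, v in parse_qs(body).items()}

print(parse_response('{"access_token": "abc"}'))
print(parse_response('access_token=abc&expires=3600'))
```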

Speaking of which, Headius, if you’re reading me, you should consider adding a pure-Java JSON parser to multi_json, just for the sake of pissing off the MRI/REE guys… and to provide a counter-test for an unavailable engine.

At least the problems between multi_json and oauth2 had the decency to appear only when non-standard setups were used (Gentoo is slightly non-standard because of auto_gem), so it’s not entirely upstream’s fault for not having noticed them. Besides, kudos to Michael (upstream), who already released the new gems while I was writing this!

Another gem that I should be using, and which was in need of a bump, was facebooker (it’s in the ruby-overlay rather than in the main tree); this one was already bumped recently for compatibility with Rails 2.3.8, while luckily staying compatible with the older release. Fixing the testsuite for the 1.0.70 version was easy: it was forcing Rails 2.3.8 in the absence of multi_rails, but I wanted it to work with 2.3.5 as well, so I dropped that forcing. With the new release (1.0.71) the problem got worse, because the tests started to fail. I was afraid the cause was dropped 2.3.5 compatibility, but that wasn’t the case.

First of all, I properly worked around the forcing of the 2.3.8 version of Rails by making the testsuite abide by RAILS_VERSION; this way I don’t have to edit the ebuild for each new version, as I just have to set the environment variables right. Then I proceeded with a (not too short) debugging session, which finally led me to a change in the latest version. Since one function call now fails with Rack 1.1.0 for non-POST requests, the code was changed to ignore the other requests; the testsuite, on the other hand, tested with both POST and GET requests (which makes me assume it should work in both cases). The funny part was that the (now failing) function was only used to provide a parameter to another method (which then checked some further parameters)… but that particular argument was no longer used. So my fix was to remove the introduced exclusion for non-POST requests, remove the function call, and finally remove the argument. Voilà, the testsuite now passes all green.

All the fixes are on my GitHub page, and the various upstream developers have been notified; hopefully new releases of all three will soon provide the same fixes to those of you who don’t use Gentoo for installing the gems (say, if you’re using Heroku to host your application). On the other hand, if you’re a Gentoo user and want to gloat, you can tell Rails developers working with OS X that if the Ruby libraries aren’t totally broken, it’s also because we test them thoroughly, more than upstream does!