This is going to be part of a series of post that will appear over the next few months with plans, but likely no progress, to move unpaper forward. I have picked up unpaper many years ago, and I’ve ran with it for a while, but beside a general “code health” pass over it, and back and forth on how to parse images, I’ve not managed to move the project forward significantly at all. And in the spirit of what I wrote earlier, I would like to see someone else pick up the project. It’s the reason why I create an organization on GitHub to keep the repository in.
For those who are not aware, unpaper is a project I did not start — it was originally written by Jens Gulden, who I understand worked on its algorithms as part of university. It’s a tool that processes scanned images of text pages, to make them easier to OCR them, and it’s often used as a “processing step” for document processing tools, including my own.
While the tool works… acceptably well… it does have a number of issues that always made me feel fairly uneasy. For instance, the command line flags are far from standard, and can’t be implemented with a parser library, relying instead on a lot of custom parsing, and including a very long and complicated man page.
There’s also been a few requests of moving the implementation to a shared library that could be used directly, but I don’t feel like it’s worth the hassle, because the current implementation is not really thread-safe, and that would be a significant rework to make it so.
So I have been having a bit of a thought out about it. The first problem is that re-designing the command line interface would mean breaking all of the programmatic users, so it’s not an easy decision to take Then there’s been something else that I learnt about that made me realize I think I know how to solve this, although it’s not going to be easy.
If you’ve been working exclusively on Linux and Unix-like systems, and still shy away from learning about what Microsoft is doing (which, to me, is a mistake), you might have missed PowerShell and its structured objects. To over-simplify, PowerShell piping doesn’t just pipe text from one command to another, but structured objects that are kept structured in and out.
While PowerShell is available for Linux nowadays, I do not think that tying unpaper to it is a sensible option, so I’m not even suggesting that. But I also found out that the ip command (from iproute2) has recently received a -J option, which, instead of printing the usual complex mix of parsable and barely human readable output, generates a JSON document with the same information. This makes it much easier to extract the information you need, particularly with a tool like jq available, that allows “massaging” the data on the command line easily. I have actually used this “trick” at work recently. It’s a very similar idea to RPC, but with a discrete binary.
So with this knowledge in my head, I have a fairly clear idea of what I would like to have as an interface for a future unpaper.
First of all, it should be two separate command line tools — they may both be written in C, or the first one might be written in Python or any other language. The job of this language-flexible tool is to be the new unpaper command line executable. It should accept exactly the same command line interface of the current binary, but implement none of the algorithm or transformation logic.
The other tool should be written in C, because it should just contain all the current processing code. But instead of having to do complex parsing of the command line interface, it should instead read on the standard input a JSON document providing all of the parameters for the “job”.
Similarly, there’s some change needed to the output of the programs. Some of the information, particularly debugging, that is currently printed on the stderr stream should stay exactly where it is, but all of the standard output, well, I think it makes significantly more sense to have another JSON document from the processing tool, and convert that to human-readable form in the interface.
Now, with a proper documentation of the JSON schema, it means that the software using unpaper as a processing step can just build their job document, and skip the “human” interface. It would even make it much easier to write extensions in Python, Ruby, and any other language, as it would allow exposing a job configuration generator following the language’s own style.
Someone might wonder why I’m talking about JSON in particular — there’s dozens of different structured data formats that could be used, including protocol buffers. As I said a number of months ago, the important part in a data format is its schema, so the actual format wouldn’t be much of a choice. But on the other hand, JSON is a very flexible format that has good implementations in most languages, including C (which is important, since the unpaper algorithms are implemented in C, and – short of moving to Rust – I don’t want to change language).
But there’s something even more important than the language support, which I already noted above: jq. This is an astounding work of engineering, making it so much easier to handle JSON documents, particularly inline between programs. And that is the killer reason to use JSON for me. Because that gives even more flexibility to an interface that, for the longest time, felt too rigid to me.
So if you’re interested to contribute to an open source project, with no particular timeline pressure, and you’re comfortable with writing C — please reach out, whether it is to ask questions for clarification, or with a pull request to implement this idea altogether.
And don’t forget, there’s still the Meson conversion project which I also discussed previously. For that one, some of the tasks are not even C projects! It needs someone to take the time to rewrite the man page in Sphinx, and also someone to rewrite the testing setup to be driven by Python, rather than the current mess of Automake and custom shell scripts.
Ooooh, JSON from iproute2! Once that has been aroudn for long enough that it can be relied on, the nasty NASTY parsing one thing I wrote can finally be moved to “declare a suitable data structure, tag it for JSON-marshalling, extract from there” instead of the somewhat gnarly hand-coded “parse the output of ‘ip route list'” that is currently embedded in it.
And, thankfully, the interesting bit is not “parse the machine’s routing table”, it is “cross-correlate routing tables from N machines, and find any outliers, so an alert can be raised”.