unpaper: re-designing an interface

This is going to be part of a series of posts that will appear over the next few months with plans, but likely no progress, to move unpaper forward. I picked up unpaper many years ago, and I’ve run with it for a while, but besides a general “code health” pass over it, and some back and forth on how to parse images, I’ve not managed to move the project forward significantly at all. And in the spirit of what I wrote earlier, I would like to see someone else pick up the project. It’s the reason why I created an organization on GitHub to keep the repository in.

For those who are not aware, unpaper is a project I did not start — it was originally written by Jens Gulden, who I understand worked on its algorithms as part of his university work. It’s a tool that processes scanned images of text pages to make them easier to OCR, and it’s often used as a “processing step” for document processing tools, including my own.

While the tool works… acceptably well… it does have a number of issues that have always made me feel fairly uneasy. For instance, the command line flags are far from standard and can’t be implemented with a parser library, relying instead on a lot of custom parsing, and they come with a very long and complicated man page.

There have also been a few requests to move the implementation into a shared library that could be used directly, but I don’t feel it’s worth the hassle: the current implementation is not really thread-safe, and making it so would be a significant rework.

So I have been giving this a bit of thought. The first problem is that re-designing the command line interface would mean breaking all of the programmatic users, so it’s not an easy decision to take. Then there’s something else I learnt about that made me realize I think I know how to solve this, although it’s not going to be easy.

If you’ve been working exclusively on Linux and Unix-like systems, and still shy away from learning about what Microsoft is doing (which, to me, is a mistake), you might have missed PowerShell and its structured objects. To over-simplify, PowerShell piping doesn’t just pipe text from one command to another, but structured objects that are kept structured in and out.

While PowerShell is available for Linux nowadays, I do not think that tying unpaper to it is a sensible option, so I’m not even suggesting that. But I also found out that the ip command (from iproute2) has recently received a -json option which, instead of printing the usual complex mix of parsable and barely human-readable output, generates a JSON document with the same information. This makes it much easier to extract the information you need, particularly with a tool like jq available, which allows “massaging” the data on the command line easily. I have actually used this “trick” at work recently. It’s a very similar idea to RPC, but with a discrete binary.
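To give an idea of what this enables, here is the kind of one-liner I mean (the field names are those emitted by current iproute2, and the jq filter is just one possible query):

```sh
# List each interface together with its IPv4 addresses, straight from the
# structured output; no column counting or ad-hoc text parsing involved.
ip -json addr show | \
  jq -r '.[] | .ifname + ": " + ([.addr_info[] | select(.family == "inet") | .local] | join(", "))'
```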

So with this knowledge in my head, I have a fairly clear idea of what I would like to have as an interface for a future unpaper.

First of all, it should be two separate command line tools — they may both be written in C, or the first one might be written in Python or any other language. The job of this language-flexible tool is to be the new unpaper command line executable. It should accept exactly the same command line interface as the current binary, but implement none of the algorithm or transformation logic.

The other tool should be written in C, because it should just contain all the current processing code. But instead of having to do complex parsing of the command line interface, it should read from standard input a JSON document providing all of the parameters for the “job”.
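To make this concrete, here is a sketch of what such a job document could look like. None of this exists today: the field names are hypothetical, loosely mirroring the current command line options.

```json
{
  "version": 1,
  "input": ["scan-0001.pnm", "scan-0002.pnm"],
  "output": ["page-%04d.pnm"],
  "layout": "double",
  "filters": {
    "blackfilter": true,
    "noisefilter": true,
    "deskew": { "enabled": true, "scan-range": 5.0 }
  }
}
```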

Similarly, there’s some change needed to the output of the programs. Some of the information, particularly debugging, that is currently printed on the stderr stream should stay exactly where it is, but for the standard output I think it makes significantly more sense to have the processing tool emit another JSON document, and have the interface convert that to human-readable form.

Now, with proper documentation of the JSON schema, software using unpaper as a processing step can just build its own job document and skip the “human” interface. It would even make it much easier to write extensions in Python, Ruby, and any other language, as it would allow exposing a job configuration generator following the language’s own style.
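For instance, a minimal Python wrapper could look something like the sketch below; the unpaper-process binary name and the job fields follow the hypothetical schema above, not anything that exists yet.

```python
# A minimal sketch of a Python job generator for the hypothetical processing
# tool described above; the binary name and the job fields are assumptions.
import json
import subprocess

def run_unpaper_job(inputs, outputs, layout="single", **filters):
    job = {
        "version": 1,
        "input": list(inputs),
        "output": list(outputs),
        "layout": layout,
        "filters": filters,
    }
    # Hand the job document to the processing tool on stdin, and read its
    # JSON report back from stdout.
    result = subprocess.run(
        ["unpaper-process"],
        input=json.dumps(job).encode(),
        stdout=subprocess.PIPE,
        check=True,
    )
    return json.loads(result.stdout)

report = run_unpaper_job(["scan-0001.pnm"], ["page-0001.pnm"], deskew=True)
```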

Someone might wonder why I’m talking about JSON in particular — there are dozens of different structured data formats that could be used, including protocol buffers. As I said a number of months ago, the important part in a data format is its schema, so the actual format doesn’t matter all that much. But on the other hand, JSON is a very flexible format that has good implementations in most languages, including C (which is important, since the unpaper algorithms are implemented in C, and – short of moving to Rust – I don’t want to change language).

But there’s something even more important than the language support, which I already noted above: jq. This is an astounding work of engineering, making it so much easier to handle JSON documents, particularly inline between programs. And that is the killer reason to use JSON for me. Because that gives even more flexibility to an interface that, for the longest time, felt too rigid to me.

So if you’re interested in contributing to an open source project, with no particular timeline pressure, and you’re comfortable with writing C — please reach out, whether it is to ask questions for clarification, or with a pull request to implement this idea altogether.

And don’t forget, there’s still the Meson conversion project which I also discussed previously. For that one, some of the tasks are not even C projects! It needs someone to take the time to rewrite the man page in Sphinx, and also someone to rewrite the testing setup to be driven by Python, rather than the current mess of Automake and custom shell scripts.

Free Idea: structured access logs for Apache HTTPD

This post is part of a series of free ideas that I’m posting on my blog in the hope that someone with more time can implement them. It’s effectively a very sketched-out proposal that comes with no design attached, but if you have time you would like to spend learning something new and no idea what to work on, it may be a good fit for you.

I have been commenting on Twitter a bit about the lack of decent tooling to deal with Apache HTTPD’s Combined Logging Format (inherited from NCSA). For those who do not know about it, this is the format used by standard access_log files, which include information about requests: the source IP, the time, the requested path, the status code and the User-Agent used.

These logs are useful for debugging, but they are also consumed by tools such as AWStats to produce useful statistics about the request patterns of a website. I used these extensively when writing my ModSecurity rulesets, and I still keep an eye on them, for instance to report wasteful feed readers.

The files are simple text files, and that makes it easy to act on them: you can use tail and grep, and logrotate needs no special code beside moving the file and reloading Apache to have it re-open the paths. On the other hand, it is hard to query for particular entries or fields, such as getting the list of User-Agent strings present in a log. Some of the suggestions I got over Twitter were to use awk, but as it happens, these logs are not actually parseable with a straightforward field separation.
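The reason is that the quoted fields (the request line, the Referer and the User-Agent) can contain spaces themselves, so you need at least a quote-aware parser rather than a plain split. A rough Python sketch of what extracting the User-Agent list actually takes:

```python
# A rough sketch of parsing the Combined Log Format with a quote-aware regular
# expression; real logs have enough corner cases (escaped quotes, IPv6,
# missing fields) that this should be treated as illustrative only.
import re

COMBINED = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

user_agents = set()
with open("access_log") as log:
    for line in log:
        match = COMBINED.match(line)
        if match:
            user_agents.add(match.group("user_agent"))

for agent in sorted(user_agents):
    print(agent)
```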

Not having found a good set of tools to handle these formats directly, I have been complaining that we should probably start moving away from simple text files to more structured log formats. Indeed, I know that there used to be at least some support for logging directly to MySQL and other relational databases, and that there is more complicated machinery, often used by companies and startups, that processes these access logs into analysis software and so on. But all of these tend to be high overhead, much more than what I or someone else with a small personal blog would care about implementing.

Instead I think it’s time to start using structured log files. A few people, including thresh from VideoLAN, suggested using JSON to write the log files. This is not a terrible idea, as the format is at least well understood and easy to interface with most other software, but honestly I would prefer something with an actual structure, a schema that can be followed. Of course I don’t mean XML; I would rather suggest having a standardized schema for proto3. Part of that, I guess, is because I’m used to using it at work, but also because I like the idea of being able to just define my schema and have it generate the code to parse the messages.
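Just to make the idea concrete, a proto3 schema for one log entry could look something like the following; nothing here is standardized, it is only a sketch of the kind of structure I have in mind.

```proto
// A sketch of a structured access-log entry in proto3; the field names and
// layout are illustrative only, not a proposal for a standard.
syntax = "proto3";

package accesslog;

import "google/protobuf/timestamp.proto";

message Header {
  string name = 1;
  string value = 2;
}

message AccessLogEntry {
  string remote_address = 1;
  google.protobuf.Timestamp time = 2;
  string method = 3;
  string path = 4;
  string protocol = 5;
  uint32 status = 6;
  uint64 response_bytes = 7;
  repeated Header request_headers = 8;  // full set, not just Referer/User-Agent
}
```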

Unfortunately, there is currently no support or library to access a sequence of protocol buffer messages. Using a single message with repeated sub-messages would work, but it is not append-friendly, so there is no way to just keep writing it to a file, and to be able to truncate and resume writing to it, which is a property needed for a proper structured log format to actually fit in the space previously occupied by text formats. This is something I don’t usually have to deal with at work, but I would assume that a simple LV (Length-Value) or LVC (Length-Value-Checksum) encoding would be okay to solve this problem.
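A sketch of what such an LV framing could look like, independent of what gets serialized inside each record (a real implementation would write encoded protocol buffer messages, and an LVC variant would add a checksum per record):

```python
# A minimal sketch of a length-prefixed (LV) record stream: each record is a
# 4-byte big-endian length followed by the serialized message, so appending is
# just writing one more record at the end of the file.
import struct

def append_record(path, payload):
    with open(path, "ab") as fh:
        fh.write(struct.pack(">I", len(payload)))
        fh.write(payload)

def read_records(path):
    with open(path, "rb") as fh:
        while True:
            header = fh.read(4)
            if len(header) < 4:
                return  # clean end of file
            (length,) = struct.unpack(">I", header)
            payload = fh.read(length)
            if len(payload) < length:
                return  # truncated final record, stop here
            yield payload
```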

But what about other properties of the current format? Well, the obvious answer is that, assuming your structured log contains at least as much information as the current log (and possibly more), you can always have tools that convert on the fly to the old format. This would for instance allow having a special tail-like command and a grep-like command that provide compatibility with the way the files are currently looked at manually by your friendly sysadmin.

Having more structured information would also allow easier, or deeper, analysis of the logs. For instance you could log the full set of headers (like ModSecurity does) instead of just the referrer and User-Agent, and customize the output on the conversion side rather than lose the details when writing.

Of course this is just one possible way to solve this problem, and just because I would prefer working with technologies that I’m already friendly with, it does not mean I wouldn’t take another format that is similarly low-dependency and easy to deal with. I’m just thinking that the change-averse solution of not changing anything and keeping logs in text format may be counterproductive in this situation.

Free Idea: a filtering HTTP proxy for securing web applications

This post is part of a series of free ideas that I’m posting on my blog in the hope that someone with more time can implement them. It’s effectively a very sketched-out proposal that comes with no design attached, but if you have time you would like to spend learning something new and no idea what to work on, it may be a good fit for you.

Going back to a previous topic I wrote about, and the fact that I’m trying to set up a secure WordPress instance, I would like to throw out another idea I won’t have time to implement myself any time soon.

When running complex web applications, such as WordPress, defense-in-depth is a good security practice. This means that in addition to locking down what the code itself can do to the state of the local machine, it also makes sense to limit what it can do to the external state and the Internet at large. Indeed, even if you cannot drop a shell on a remote server, there is value (negative for the world, positive for the attacker) in at least being able to use it for DDoS (e.g. through an amplification attack).

With that in mind, if your app does not require network access at all, or the network dependency can be sacrificed (like I did for Typo), just blocking the user from making outgoing connections with iptables would be enough. The --uid-owner option makes it very easy to figure out who’s trying to open new connections, and thus stop a single user from transmitting unwanted traffic. Unfortunately, this does not always work, because sometimes the application really needs network support. In the case of WordPress, there is a definite need to contact the WordPress servers, both to install plugins and to check if it should self-update.
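As a sketch of the simple case, assuming the webapp runs as its own dedicated user (called wordpress here only for illustration), a couple of rules are all it takes:

```sh
# Let the webapp talk over loopback (local database, local proxy), but reject
# any new outgoing connection it tries to open towards the outside world.
iptables -A OUTPUT -o lo -j ACCEPT
iptables -A OUTPUT -m owner --uid-owner wordpress -m conntrack --ctstate NEW -j REJECT
```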

You could try to limit which hosts the user can access. But that’s not easy to implement right either. Take WordPress as an example still: if you wanted to limit access to the WordPress infrastructure, you would effectively have to allow it to access *.wordpress.org, and this can’t really be done in iptables, as far as I know, since those connections go to literal IP addresses. You could rely on FCrDNS to verify the connections, but that can be slow, and if you happen to have access to poison the DNS cache of the server, you’re effectively in control of this kind of ACL. I ignored the option of just using “standard” reverse DNS resolution, because in that case you don’t even need to poison DNS: you can just decide what your IP will reverse-resolve to.

So what you need to do is actually filter at the connection-request level, which is what proxies are designed for. I’ll be assuming we want a non-terminating proxy (because terminating proxies are hard), but even then you know which (forward) host the client wants to connect to, so *.wordpress.org becomes a valid ACL to use. And this is something you can actually do relatively easily with Squid, for instance. Indeed, this is the whole point of tools such as ufdbguard (which I used to maintain for Gentoo), and the ICP protocol. But Squid is primarily designed as a caching proxy; it’s not lightweight at all, and it can easily become a liability to have it in your server stack.
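For reference, the kind of ACL I’m describing is only a handful of lines in squid.conf; the networks and domains here are obviously placeholders for your own setup.

```
# Allow local webapps to reach the WordPress infrastructure, deny everything else.
acl local_webapps src 127.0.0.1/32
acl wordpress_infra dstdomain .wordpress.org

http_access allow local_webapps wordpress_infra
http_access deny all
```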

Up to now, what I have used to reduce the attack surface of my webapps is to set them behind a tinyproxy, which does not really allow for per-connection ACLs. This only provides isolation against random non-proxied connections, but it’s a starting point. And here is where I want to provide a free idea for anyone who has the time and would like to provide better security tools for server-side defense-in-depth.

A server-side proxy for this kind of security usage would have to be able to provide ACLs, with both positive and negative lists. You may want to allow all access to *.wordpress.org, but at the same time block all non-TLS-encrypted traffic, to avoid the possibility of a downgrade (given that WordPress has a silent downgrade for requests to api.wordpress.org, which I talked about before).

Even better, such a proxy should have the ability to distinguish the ACLs based on which user (i.e. which webapp) is making the request. The obvious way would be to provide separate usernames to authenticate to the proxy — which, again, Squid can do, but it’s designed for clients for which the validation of username and password is actually important. Indeed, for this target usage, I would ignore the password altogether and just take the username at face value, since the connection should always only be local. I would be even happier if, instead of pseudo-authenticating to the proxy, the proxy could figure out which (local) user the connection came from by inspecting the TCP socket, kind of like how querying the ident protocol used to work for IRC.
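On Linux the proxy would not even need the ident protocol for that: for a connection arriving over loopback, it can look up which UID owns the peer socket in /proc/net/tcp. A rough sketch of that lookup (IPv4 only, and assuming the little-endian address encoding you see on x86):

```python
# A rough sketch of the ident-like lookup described above: given the address
# and port of the local client that connected to the proxy, find the UID that
# owns that socket by scanning /proc/net/tcp.
import socket
import struct

def uid_of_local_peer(peer_ip, peer_port):
    # /proc/net/tcp encodes 127.0.0.1:8080 as "0100007F:1F90" on x86.
    want = "%08X:%04X" % (
        struct.unpack("<I", socket.inet_aton(peer_ip))[0],
        peer_port,
    )
    with open("/proc/net/tcp") as fh:
        next(fh)  # skip the header line
        for line in fh:
            fields = line.split()
            if fields[1] == want:      # the peer's socket, seen as local_address
                return int(fields[7])  # uid column
    return None

# Inside the proxy's accept loop, something like:
#   client_ip, client_port = connection.getpeername()
#   uid = uid_of_local_peer(client_ip, client_port)
```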

So to summarise, what I would like to have is an HTTP(S) proxy that focuses on securing server-side web applications. It does not have to support TLS transport (because it should only accept local connections), nor does it need to be a terminating proxy. It should support ACLs that allow/deny access to a subset of hosts, possibly per-user, without needing a user database of any sort, and even better if it can tell by itself which user the connection came from. I’m more than happy if someone tells me this already exists, or if not, if someone starts writing it… thank you!

Free Idea: Free Software stack for audiobooks

This post is part of a series of free ideas that I’m posting on my blog in the hope that someone with more time can implement them. It’s effectively a very sketched-out proposal that comes with no design attached, but if you have time you would like to spend learning something new and no idea what to work on, it may be a good fit for you.

This is clearly not a new idea, as I posted about something very similar over eight years ago. At the time I was looking for a way of encoding audiobooks coming from audio CDs in a format that was compatible with the iPod Classic. Since then, Apple appears to have done their best to make the audiobook experience on iOS the worst possible, to the point that I don’t really use my iPod Touch as my primary audiobook player any more.

As an aside to the free idea, and to give a bit more context, let me describe the problems I have with Apple’s current approach to audiobooks. A few iOS major versions ago, they decided to move the audiobook handling from the Music app to the iBooks app; this would be reasonable, given that they are books, and it was always a bit strange to have them in the music application, but it also meant losing the ability to build playlists with them.

Playlists with audiobooks are great, because they allow you to “stitch” together multiple books of the same series, so that you can play them for hours on end, for instance if you use them to fall asleep. I used to have a playlist for the Hitchhiker’s Guide to the Galaxy radio series and one for the books, one for the Dresden Files, and one for the News Quiz, including both the collected editions on CD by the BBC, my own “audiobooks” built out of the podcasts, and the more recent podcast episodes that I have not collected into audiobook files yet.

So what is the idea? There are two components that, as far as I can see, are currently heavily lacking in the FLOSS world. The first is a way to generate audiobook files, which is what I complained about eight years ago. Indeed, if you look even at a random sample on Project Gutenberg, the audiobook is actually a ton of files (47!), each with a chapter in it. A proper audiobook file would be a single file, with chapter markers and per-chapter metadata (chapter title and, in that case, the performer).
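For what it’s worth, the low-level pieces already exist: ffmpeg can mux chapter markers from its FFMETADATA text format, so assembling a pile of per-chapter files into one chaptered file is roughly the following (file names, times and titles are placeholders); what is missing is the friendly tooling around it.

```sh
# Chapter metadata in ffmpeg's FFMETADATA format (times in milliseconds here).
cat > chapters.txt <<'EOF'
;FFMETADATA1
title=Example Audiobook
[CHAPTER]
TIMEBASE=1/1000
START=0
END=1530000
title=Chapter 1
[CHAPTER]
TIMEBASE=1/1000
START=1530000
END=2980000
title=Chapter 2
EOF

# Concatenate the per-chapter audio (list.txt holds one "file '...'" line per
# chapter), then mux the chapter metadata in without re-encoding.
ffmpeg -f concat -safe 0 -i list.txt -c copy book.m4a
ffmpeg -i book.m4a -i chapters.txt -map_metadata 1 -codec copy book.m4b
```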

It’s more than just a matter of having a single file to move around. While of course hardware improvements have made a number of these points moot, the original reason to have a single big file over multiple small files was to avoid having to seek to a different point on the disk in between chapters. It also allows the decoder to keep going between chapters, as there is no “end of stream”, just a marker that different metadata applies from a given point in time. Again, as I said, this is no longer as relevant as it used to be, but it’s also not entirely gone.

The other component that is currently lacking is a good playback solution. While VLC can obviously play those files right now, and if I’m not mistaken it also extracts the per-chapter metadata correctly, it lacks two features that make enjoying audiobooks possible. The first is possibly complicated, and relates to the ability to store bookmarks and the current playing time. While VLC supposedly supports resuming from the last playback position, I have heard it’s still sometimes unreliable (I have no idea how it’s implemented), plus it does not support just bookmarking a given time in a file/book. Bookmarking is particularly important when listening to non-novel audiobooks, as you may want to go back afterwards to re-listen to advice or chase a reference to further details.

The other feature is mostly UI-heavy, and it involves mainly the mobile UI (at least the Android one): the ability to skip backward and forward in the file. You have probably seen this in other players, including Netflix’s own app, which allows you to skip back 30 seconds — in audiobooks it’s also useful to skip forward 30 seconds, particularly when combined with the bookmarks above.

As usual for Free Ideas, I have no time to work on this myself. I can give out more details of the idea, and depending on things I may be able to contribute to a bounty on it, but otherwise there is no code I can share about this yet.

Free Idea: a QEMU Facedancer fuzzer

This post is part of a series of free ideas that I’m posting on my blog in the hope that someone with more time can implement them. It’s effectively a very sketched-out proposal that comes with no design attached, but if you have time you would like to spend learning something new and no idea what to work on, it may be a good fit for you.

Update (2017-06-19): see last paragraph.

You may already be familiar with the Facedancer, the USB fuzzer originally designed and developed by Travis Goodspeed of PoC||GTFO fame. If you’re not, in a few words, it’s a board (and framework) that allows you to simulate the behaviour of any USB device. It works thanks to a Python framework which supports a few other boards and could – theoretically – be expanded to support any device with gadgetfs support (such as the BeagleBone Black I have at home, but I digress).

I found out about this through Micah’s videos, and I have been thinking for a while about spending some time on the gadgetfs extension so I can use it to simulate glucometers with their original software — in particular, it should allow me to figure out which value represents what, by changing what is reported to the software and seeing how it behaves. While I have had no time to do that yet, this is anyway a topic for a different post.

The free idea I want to give instead is to integrate, somehow, the Facedancer framework with QEMU, so that you can run the code behind a Facedancer device as if it were connected to a QEMU guest, without having to use any hardware at all. A “Virtualdancer” would not only obviate the need for hardware in the development phase (if the proof of concept of a facedancer were to require off-site usage), but would also integrate more easily into fuzzing projects such as Bochspwn or TriforceAFL.

In particular, I have an interest not only in writing simulated glucometers for debugging purposes (although a testsuite that requires qemu and a simulated device may be a bit overkill), but also in simulating HID devices. You may remember that recently I had to fix my ELECOM trackball, and this is not the first time I have had to deal with broken HID descriptors. I have spent some more time looking into the Linux HID subsystem and I’m trying to figure out if I can make some simplifications here and there (again, a topic for another time), so having a way to simulate an HID device with strange behaviour and see whether my changes fix it would be extremely beneficial.

Speaking of HID, and report descriptors in particular, Alex Ionescu (of ReactOS fame) pointed out at REcon that there appear to be very few reported security issues with HID report descriptor parsing, in particular for Windows, which seems strange given how hard parsing those descriptors is, and how seriously broken some of the descriptors out there are. This would be another very interesting surface for a QEMU-based dancer software: running through a number of broken HID report descriptors and sending data to see how the system behaves. I would be very surprised if there were no bugs, in particular in the many small and random drivers that apply workarounds such as the one I did for ELECOM.

Anyway, as I said, I haven’t even had time to make the (probably minor) modifications to the framework to support the BBB (which I already have access to), so you can imagine I’m not going to be working on this any time soon. But if you feel like working on some USB code, why not?

Update (2017-07-19): I pointed Travis at this post over Twitter, and he showed me vUSBf. While this does not have the same interface as facedancer, it proves that there is a chance to provide a virtual USB device implemented in Python.

As a follow up, Binyamin Sharet linked to Umap2 which supports fuzzing on top of facedancer, but does not support qemu as it is.

While it’s not quite (yet) what I had in mind, it proves that it is a feasible goal, and that there is already some code out there getting very close!