Unpaper fork, part 2

Last month I posted a call to action hoping for help with cleaning up the unpaper code, as the original author has not updated it since 2007 and it has accumulated a few issues. While I have seen some interest in said fork and cleanup, nobody has stepped up to help, so it is proceeding, albeit slowly.

What is available now in my GitHub repository is mostly cleaned up, though still not much more optimised than the original — I actually removed one of the “optimisations” I added since the fork: the use of the sincosf() function. As Freddie pointed out in the comments of the other post, the compiler has a better chance of optimising this itself; indeed both GCC and PathScale’s compiler optimise two sin and cos calls with the same argument into a single sincos call, which is good. And using two separate calls allows declaring the temporaries used to store the results as constant.

And indeed today I started rewriting the functions so that temporaries are declared as constant wherever possible, and with the most limited scope applicable to them. This was important to me for one reason: I want to try making use of OpenMP to improve unpaper’s performance on modern multicore systems. Since most of the processing is applied independently to each pixel, it should be possible for many of the loops to be executed in parallel.
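The kind of loop I have in mind looks roughly like this — a made-up example, not unpaper’s actual code: since every pixel is transformed independently, a single pragma is enough (compile with -fopenmp; without it the pragma is simply ignored and the code stays serial).

```c
#include <stdint.h>

/* Hypothetical example, not unpaper's actual code: each pixel is
 * processed independently of the others, so the loop parallelises
 * with one OpenMP pragma and no locking. */
void invert_image(uint8_t *pixels, long width, long height) {
    const long count = width * height;
#pragma omp parallel for
    for (long i = 0; i < count; i++)
        pixels[i] = 255 - pixels[i];
}
```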

It would also be a major win in my book if the input pages could be processed in parallel as well: my current scan script has to process the scanned sheets in parallel itself, calling many copies of unpaper, just to get through the sheets faster (sometimes I scan tens of sheets, such as bank contracts and the like). I just wonder whether it makes sense to simply start as many threads as possible, each one handling one sheet, or whether that would risk hogging the scheduler.

Finally there is the problem of testing. Freddie also pointed me at the software I was trying to remember for checking the differences between two image files: pdiff — which is used by the ChromiumOS build process, by the way. Unfortunately I then remembered why I didn’t like it: it uses the FreeImage library, which bundles a number of other image format libraries, and whose upstream refuses to apply sane development practices to it.

What would be nice would be to either modify pdiff to use a different library – such as libav! – to access the image data, or to find or write something similar that does not require such stupidly-designed libraries.

Speaking of image formats, it would be interesting to get unpaper to support other image formats besides PNM; this way you wouldn’t have to keep converting from and to other formats when processing. One idea that Luca gave me was to make use of libav itself to handle that part: it already supports PNM, PNG, JPEG and TIFF, so it would provide most of the features unpaper would need.

In the meantime, please let me know if you like how this is coming along — and remember that this blog, the post and I are all Flattr-enabled!

On releasing Ruby software

You probably know that I’ve been working hard on my Ruby-Elf software and its tools, which include my pride and joy elfgrep, and which are now available in the main Portage tree, so that it’s just an emerge ruby-elf away. To make the package easier to install, manage and use, I wanted to bring it as much in line with Ruby packaging best practices as possible, taking into consideration both those installing it as a gem and those installing it with package managers such as Portage. This gave me a few more insights on packaging that had previously escaped me.

First of all, thankfully, RubyGems packaging is starting to be feasible without needing a bunch of third-party software; whereas a lot of software used to require Hoe or Echoe even to run tests, some of it is now going back to simply using the standard Gem-provided Rake task for packaging; this is also the road I decided to take with Ruby-Elf. Unfortunately Gentoo is once again late to the RubyGems game, as we still ship 1.3.7 rather than 1.5.0; this is partly because we’ve been hitting our own roadblocks with the upgrade to Ruby 1.9, which is really proving a pain in our collective … backside — you’d expect that in early 2011 all the main Ruby packages would work just fine with the 1.9.2 release, but that’s still not the case.

Integrating the Rubyforge upload, though, is quite difficult, because the Rubyforge extension itself is quite broken and no longer works out of the box — the main problem being that it tries to use the “Any” specification for CPU, which no longer exists, having been replaced by “Other”; you can trick it into using that by changing the automated configuration, but it’s not a completely foolproof system. The whole extension seems pretty outdated and hastily written (if there is a problem when creating the release slots or uploading the file, the state of the release is left halfway through).

As for mediating between keeping RubyGems packaging simple and still providing all the details needed for distribution packaging, while not requiring all users to install the required development packages, I’ve decided to release two very different packages. The RubyGem only installs the code, the tools, and the man pages; it lacks the tests, because there is a lot of test data that would otherwise be installed without any need for it. The tarball, on the other hand, contains all the data from the git repository, plus the gemspec file (which is needed, for instance, in Gentoo to have fakegems install properly). In both cases, there are two types of files that are included in the two distributions but are not part of the git repository: the man pages and the Ragel-generated demanglers (which I’m afraid I’ll soon have to drop and replace with manually-written ones, as Ragel is unsuitable for fully recursive grammars like the C++ mangling format used by GCC 3 and specified by the IA64 ABI); by distributing these directly, users are not required to have either Ragel or libxslt installed to make full use of Ruby-Elf!

Speaking of the man pages: I love the many tricks I can pull off with DocBook and XSLT. I don’t have to reproduce the same text over and over when the options, or bugs, are the same for all the tools – I have a common library implementing them – I just need to include the common file, and use XPointer to tell it which part of the file to pick up. Also, it’s quite important to me to keep the man pages updated, since I took a page out of the git book: rather than implementing the --help option with a custom description of the options, the --help option calls up the man page of the tool. This works out pretty well, mostly because this particular gem is designed to work on Unix systems, so the man tool is always going to be present. Unfortunately in the first release I made it didn’t work out all too well, as I didn’t consider the proper installation layout of the gem; this is now fixed and works perfectly even if you use gem install ruby-elf.
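For illustration, the include trick looks roughly like this — note that the file name and the id below are invented for this example, not taken from Ruby-Elf’s actual sources:

```xml
<!-- Hypothetical fragment: href and xpointer values are made up
     for illustration, not Ruby-Elf's actual file names or ids. -->
<refsect1>
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
              href="common-manpages.xmli"
              xpointer="common.bugs"/>
</refsect1>
```

The XSLT stylesheets then see the shared section as if it had been written inline in each tool’s DocBook source.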

The one problem I still have is that I have not yet signed the packages themselves; the reason is actually quite simple: while it’s trivial with OpenSSH to proxy the ssh-agent connection, so that I can access private hosts when jumping from my frontend system to Yamato, I currently can’t find any way to proxy the GnuPG agent, which I need to sign the packages. Sure, I could simply connect another smartcard reader to Yamato and move the card there to do the signing, but I’m not tremendously happy with such a solution. I think I’ll be writing some kind of script to do that; it shouldn’t be very difficult with ssh and nc6.
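For what it’s worth, a rough, untested sketch of what such a script might do — here I’m substituting socat for nc6, since it can speak Unix sockets directly; the paths and port numbers are made up for the example:

```shell
# On the frontend machine: expose the local gpg-agent socket on a
# loopback TCP port, and forward that port to the remote host.
socat TCP-LISTEN:4711,bind=127.0.0.1,fork \
      UNIX-CONNECT:"$HOME/.gnupg/S.gpg-agent" &
ssh -R 4711:127.0.0.1:4711 yamato

# On yamato: turn the forwarded port back into the Unix socket
# that gpg expects to find.
socat UNIX-LISTEN:"$HOME/.gnupg/S.gpg-agent",fork \
      TCP:127.0.0.1:4711 &
```

This is just a sketch of the plumbing involved, not a tested recipe.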

Now that I have released my first proper Ruby package, and my first gem, I hope to be able to do a better job at packaging, and at fixing others’ packages, in Gentoo.

This is crazy, man!

And here, the term “man” refers to the manual page software.

I already noted that I was working on fixing man-db to work with heirloom-doctools, but I didn’t go into much detail about the problems I faced. So let me try to explain them in a bit more detail, if you’re interested.

The first problem is that we have two different man implementations in Portage: sys-apps/man and sys-apps/man-db. The former is the “classical” one used by default, while the latter is a newer implementation, which is supposedly active and maintained. I say supposedly because it seems to me that it’s not really actively maintained, nor tested, at all.

The “classic” implementation is often despised because, among other things, it’s not designed with UTF-8 in mind at all, and even its own output, for locales where ASCII is not enough, is broken. For instance, in Italian, it outputs a latin1 bytestream even when the locale is set to use UTF-8. Confusion ensues. The new implementation should also improve the caching by not using flat files for the already-processed man pages, but rather db files (Berkeley or GDBM).

So what’s the problem with man-db, the new implementation? It supports UTF-8 natively, which is good, but it is only ever tested coupled with groff (the GNU implementation of the (n)roff utility), and with nothing else. Why does this matter? Well, there are a few features in groff that are not found anywhere else; this includes, for instance, the HTML output (eh? no, I don’t get it either; I can tell that the nroff format isn’t exactly nice, and I guess that having the man pages available as HTML makes them more readable for users, but why add that to groff, and especially to the man command? No clue), as well as some X-specific output. Both those features are missing from heirloom, but as far as I can see they are so rarely used that it really makes little sense to depend on groff just for them; and the way man-db is designed, when an nroff implementation different from groff is found, the corresponding options are not even parsed or considered. That’s good for reducing the size of the code, but it also requires at least compile-testing the software with a non-groff implementation, something that upstream is not currently doing, it seems.

More to the point, not only does the code use broken (as in, incomplete) groff conditionals, but the build system depends on groff as well: the code to build the manual called groff directly, instead of using whatever the configure script found, and the testsuite relied on the “warnings” feature that is only enabled with groff (there was one more point to that: previously it returned an error when the --warnings option was passed, so I had to fix it so that the option is, instead, ignored).

But it doesn’t get much better from here on: not only does man-db try to use two output terminals that are not defined by the heirloom nroff (I hacked around that temporarily in heirloom itself; the proper fix would be a configuration option for man-db), but it also has problems handling line and page lengths. This is quite interesting, actually, since it took me a long time (and some understanding of how nroff works, which is nowhere near nice, or human) to figure out.

Both nroff and groff are (obviously) designed to work with paged documents (although troff is the one usually used to print, since it produces PostScript documents); for this reason, by default, their output is restricted width-wise, and has page breaks, with footers and headers, every given number of lines. Since the man page details (command, man section, source, …) are present in the header and footer, the page breaks show up as spurious man page data in between paragraphs of documentation. This is what you get on most operating systems, Solaris included, where the man program and the nroff program are not well integrated. To avoid this problem, both man implementations set line and page lengths depending on the terminal: the line length is set to something less than the width of the terminal, to fill it with content; the page length is set as high as possible to allow smooth scrolling of the text.

Unfortunately, we’ve got another “groff versus the world” situation here: while the classic nroff implementation sets the two values with the .ll and .pl roff commands (no, you really don’t want to learn about roff commands unless you’re a definite masochist!), groff uses the LL register (again… roff… no thanks), which can be set with the command-line parameter -rLL=$width; this syntax is not compatible with heirloom’s nroff, but I’m pretty sure you had already guessed that.

The classic man implementation, then, uses both the command and the register approach, but without option switches; instead it prepends the roff commands to the man page sources (.ll, and .nr LL which sets the register from within the roff language), and then has the whole output processed; this makes it nicely compatible with the heirloom tools. On the other hand, man-db uses the command-line approach, which makes it less compatible; indeed, when using man-db with heirloom-doctools, you’re left with a very different man output, which looks like an heirloom in itself: badly formatted and not filling the terminal.
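To make the difference concrete, this is roughly what the two approaches boil down to (the width of 78 is just an example, and the actual pipelines the man programs build are more involved):

```
# man-db / groff style: set the LL register via a command-line
# option -- not understood by heirloom's nroff:
nroff -man -rLL=78n page.1

# classic man style: prepend the roff commands to the source
# instead, which both implementations understand:
( printf '.ll 78n\n.nr LL 78n\n'; cat page.1 ) | nroff -man
```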

Now we’re in quite a strange situation: on one side, the classic man implementation sucks at UTF-8 (in its own right), but works mostly fine with heirloom-doctools (which supports UTF-8 natively); on the other side we’ve got a newer, cleaner (in theory) implementation, which should support UTF-8 natively (but requires a Debian-patched groff), yet is locked into groff (which definitely sucks at UTF-8). The obvious solution here would be to cross-pollinate the two implementations to get a single pair of programs that work well together and handle UTF-8 natively; but it really is not easy.

And at the end of it all… I really wish the roff language for man pages could die in flames; it really shows that it was designed for a long-forgotten technology. I have no difficulty understanding why GNU wanted to find a new approach to documentation with info pages (okay, the info program sucks; the documentation itself doesn’t suck as much — you can use either pinfo or emacs to browse it decently). Myself, though, I would have gone with plain and simple HTML: using DocBook or some other format, it’s easy to generate HTML documentation (indeed, DocBook has features that allow the same documentation page to become both a roff-written man page and an HTML page), and with lynx, links, and graphical viewers, the documentation would be quite accessible.

Another C++ piece hits the dust

You might remember my problem with C++, especially on servers; yes, that’s a post from almost two years ago. Well, today I was able to go one step further, and kill another piece of C++ from at least my main system (it’s going to take a little longer for it to apply to the critical systems). And that piece is nothing less than groff.

Indeed, last night we were talking in #gentoo-it about FreeBSD’s base system and the fact that, for them as for us, the only piece of C++ code in the base system is groff; a user pointed out that they were considering switching to something else, so a Google search later I came upon the heirloom project website.

The heirloom project contains some tools ported from the OpenSolaris code base that work fine on Linux and other OSes; indeed, they work quite well in Gentoo, after I created an ebuild for them, removed groff from the profiles, and fixed the dependencies of man and zsh.

A few notes though:

  • the work is not complete yet, so please don’t start complaining already if something doesn’t look right;
  • man is not configured out of the box; I’m currently wondering what’s the best way to do this;
  • after configuring (manually) man, you should be able to read most man pages without glitches;
  • for some reason, we currently install man pages in different encodings (for instance, man’s own man page in Italian is written in latin1); heirloom-doctools uses UTF-8 by default, which is a good thing, I guess; groff does seem to have a lot of problems with UTF-8 (and man as well, since the localised Italian output often has broken encoding!);
  • groff (and man) both have special handling for Japanese for some reason; I don’t know whether the heirloom utilities are better or worse for Japanese — somebody should look into it.