This is crazy, man!

And here, the term “man” refers to the manual page software.

I already noted that I was working on fixing man-db to work with heirloom-doctools, but I didn’t go into much detail about the problems I faced. So let me try to explain them in a bit more detail, if you’re interested.

The first problem is that we have two different man implementations in portage: sys-apps/man and sys-apps/man-db. The former is the “classical” one used by default, while the latter is a newer implementation, which is supposedly actively maintained. I say supposedly because it seems to me like it’s not really actively maintained, nor tested, at all.

The “classic” implementation is often despised because, among other things, it’s not designed with UTF-8 in mind at all, and even its own output is broken for locales where ASCII is not enough. For instance, in Italian it outputs a latin1 bytestream even when the locale is set to use UTF-8. Confusion ensues. The new implementation should also improve caching by not using flat files for the already-processed man pages, but rather db files (Berkeley or GDBM).

So what’s the problem with man-db, the new implementation? It supports UTF-8 natively, which is good; the real problem is that it’s only ever tested coupled with groff (the GNU implementation of the (n)roff utility), and with nothing else. Why should this matter? Well, there are a few features in groff that are not found anywhere else. This includes, for instance, the HTML output (eh? no, I don’t get it either; I can tell that the nroff format isn’t exactly nice, and I guess that having the man pages available as HTML makes them more readable for users, but why add that to groff, and especially to the man command? no clue), as well as some X-specific output. Neither feature is available in heirloom, but as far as I can see they are so rarely used that it makes little sense to depend on groff just for them. And the way man-db is designed, when an nroff implementation different from groff is found, the related options are not even parsed or considered; which is good to reduce the size of the code, but it also requires at least compile-testing the software with a non-groff implementation, something that upstream is not currently doing, it seems.

More to the point, not only does the code use broken (as in, incomplete) groff conditionals, but the build system depends on groff as well: the code to build the manual called groff directly, instead of using whatever the configure script found, and the testsuite relied on the “warnings” feature that is only enabled with groff (there was one more point to that: the --warnings option previously caused an error, so I had to fix it so that the option is, instead, ignored).

But it doesn’t get much better from here on: not only does man-db try to use two output terminal types that are not defined by the heirloom nroff (I hacked around that temporarily in heirloom itself; the proper fix would be a configuration option in man-db), but it also has problems handling line and page lengths. This one is quite interesting, actually, since it took me a long time to figure out (and required some understanding of how nroff works, which is nowhere near nice, or human).

Both nroff and groff are (obviously) designed to work with paged documents (although troff is the one usually used for printing, since it produces PostScript documents); for this reason, by default, their output is restricted width-wise and has page breaks, with footers and headers, every given number of lines. Since the man page details (command, man section, source, …) are present in the header and footer, the page breaks show up as spurious man page data in between paragraphs of documentation. This is what you read on most operating systems, including Solaris, where the man program and the nroff program are not well integrated. To avoid this problem, both man implementations set the line and page length depending on the terminal: the line length is set to something slightly less than the width of the terminal, so the content fills it; the page length is set as high as possible, to allow smooth scrolling through the text.

Unfortunately, we’ve got another “groff versus the world” situation here: while the classic nroff implementation sets the two values with the .ll and .pl roff commands (no, really, you don’t want to learn about roff commands unless you’re a definite masochist!), groff uses the LL register (again… roff… no thanks), which can be set with the command-line parameter -rLL=$width; this syntax is not compatible with heirloom’s nroff, but I’m pretty sure you already guessed that.

The classic man implementation, then, uses both the command and the register approach, but without option switches; instead it prepends the roff commands to the man page source (.ll, and .nr LL which sets the register from within the roff language), and then has the whole thing processed; this makes it nicely compatible with the heirloom tools. On the other hand, man-db uses the command-line approach, which makes it less compatible; indeed, when using man-db with the heirloom-doctools, you’re left with a very different man output, which looks like an heirloom in itself: badly formatted and not filling the terminal.
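To make the difference concrete, here’s a sketch of the two approaches; the width (76n) and the man page file (ls.1) are just illustrative examples, not the actual code of either implementation:

```shell
# groff-style: pass the line length as a register assignment on the
# command line. Heirloom's nroff does not understand this syntax at all.
groff -Tutf8 -man -rLL=76n ls.1 | less

# classic-man-style: prepend the roff commands to the page source instead,
# which works with both groff and heirloom's nroff, since .ll is a plain
# roff request and .nr LL sets the register from within the language.
{ printf '.ll 76n\n.nr LL 76n\n'; cat ls.1; } | nroff -man | less
```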

Now we’re in quite a strange situation: on one side, the classic man implementation sucks at UTF-8 (in its own right), but it works mostly fine with heirloom-doctools (which supports UTF-8 natively); on the other side we’ve got a newer, cleaner (in theory) implementation that should support UTF-8 natively (but requires a Debian-patched groff), yet is locked into groff (which definitely sucks with UTF-8). The obvious solution here would be to cross-pollinate the two implementations to get a single pair of programs that work well together with native UTF-8 support; but it really isn’t easy.

And in the end… I really wish the roff language for man pages could die in flames; it really shows that it was designed for a long-forgotten technology. I have no difficulty understanding why GNU wanted to find a new approach to documentation with info pages (okay, the info program sucks; the documentation itself doesn’t suck as much, and you can use either pinfo or emacs to browse it decently). Myself, though, I would have gone with plain and simple HTML: using DocBook or some other format, it’s easy to generate HTML documentation (actually, DocBook supports features that allow the same documentation page to become both a roff-written man page and an HTML page), and with lynx, links, and graphical viewers, the documentation would be quite accessible.

Another C++ piece hits the dust

You might remember my problem with C++ especially on servers; yes, that’s a post from almost two years ago. Well, today I was able to go one step further and kill another piece of C++ from at least my main system (it’s going to take a little more for it to apply to the critical systems). And that piece is nothing less than groff.

Indeed, last night we were talking in #gentoo-it about FreeBSD’s base system and the fact that, for them as for us, the only piece of C++ code in the base system is groff; a user pointed out that they were considering switching to something else, so one Google run later I came up with the heirloom project website.

The heirloom project contains some tools ported from the OpenSolaris code base, but working fine on Linux and other OSes; indeed, they work quite well on Gentoo: I created an ebuild for them, removed groff from the profiles, and fixed the dependencies of man and zsh.

A few notes though:

  • the work is not complete yet, so please don’t start complaining already if something doesn’t look right;
  • man is not configured out of the box; I’m currently wondering what’s the best way to do this;
  • after configuring (manually) man, you should be able to read most man pages without glitches;
  • for some reason, we currently install man pages in different encodings (for instance, man’s own man page in Italian is written in latin-1); heirloom-doctools uses UTF-8 by default, which is a good thing, I guess; groff does seem to have a lot of problems with UTF-8 (and man as well, since the localised Italian output often has broken encoding!);
  • groff (and man) both have special handling of Japanese for some reason, I don’t know whether the heirloom utilities are better or worse for Japanese, somebody should look into it.

My personal crusade against C++ abuse

Okay, after my post about C++ bindings libraries I decided to go a bit deeper. What stops me from removing C++ support from GCC entirely?

Yes, of course the nocxx USE flag on gcc is certainly not supported, as we don’t usually put built_with_use checks on gcc to make sure that we’ve got a C++ compiler; we just assume we’ve got one. But this is not the point.

For a server, having gcc installed, while sometimes useful, might be a waste of space. Having the C++ standard library installed when you know for sure that you don’t need it, is a sure waste of space.

So what did I do? I checked what needed libstdc++ in my vserver’s chroot. Only three packages hit it: libpcre (which I’ll talk more about later), fcgi, and groff. The last is a problem: groff is used by man and other similar tools, and it is written in C++. Yet, it doesn’t use the standard library.

So I decided to try a simple trick that other packages using C++ code but no C++ library also use: I added to my overlay an experimental ebuild that, instead of linking the C++ code as usual, uses gcc to link, and adds -lsupc++ to the libraries to link to. libsupc++ is just the minimal subset of C++ runtime symbols that GCC has to provide; the result is that groff can be built without depending on libstdc++, and runs just as fine where gcc was built with nocxx. Isn’t that great?

I also found out that tvtime has the same problem: it links libstdc++ because one of its deinterlacers is written in C++, but it does not need libstdc++… well, again, the fix is to link -lsupc++. tvtime-1.0.2-r2 is in the tree for whoever wants to try it.

Now I’m starting to wonder how much C++ gets abused every day by adding libstdc++ dependencies to packages that wouldn’t need them at all, so I started looking around, and using the nocxx USE flag more often when I know I won’t have use for it, as with tiff or Berkeley DB.

As for libpcre, the build system does have a way to disable the C++ bindings, and I have in my overlay a modified ebuild that adds a cxx USE flag (enabled by default through EAPI=1) that I’ll soon commit to the main tree. Before doing that, though, I need to check all the packages depending on libpcre: while most developers would say that nocxx/-cxx USE flags are not supported, I’d like to actually make sure that the user is well informed rather than finding out during a make failure. Maybe it’s being too good to users who fiddle around when they shouldn’t, but there are two reasons why I think it’s better to do it this way.
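For the curious, the change boils down to something like this hypothetical ebuild fragment (assuming pcre’s configure script names the bindings “cpp”, i.e. exposes an --enable-cpp/--disable-cpp switch):

```shell
# hypothetical fragment of the modified libpcre ebuild
EAPI=1

IUSE="+cxx"    # EAPI=1 allows USE defaults, so cxx stays on by default

src_compile() {
    # map our cxx USE flag onto pcre's --enable-cpp/--disable-cpp switch
    econf $(use_enable cxx cpp)
    emake || die "emake failed"
}
```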

The first is of course that I’m the first person to actually disable C++ support on packages which I’m not going to use with C++, or at least not through their C++ bindings; an example of this is kdelibs, which, while written in C++, uses the C interface of both tiff and pcre rather than the C++ bindings.

The second is that I find it better to waste 20 minutes once to make sure that the ebuilds are luser-proof than to have to deal with 20 duplicate bugs because someone disabled an option in a dependency. I see many colleagues being grumpy and blaming users for shooting themselves in the foot. Sure, they are right, but that attitude often leads to more stress because of the open bugs, and the more bugs are opened the grumpier you become; it’s basically a no-way-out cycle. My idea is that if I can do something more to make sure I don’t get bugs, I’ll do it.

So anyway, the builds are working, and I’m now wondering about updating Typo to the 4.1 version… at least that’s hopefully updated more often, and it shouldn’t have the bugs that the 4.0 SVN has. And maybe a decent spam protection.