Documentation: remake versus download

One of the things that I like a lot about Gentoo is that you can easily have installed the whole set of documentation for almost every library out there, being API, tutorials or all the stuff like that.

This, unfortunately, comes with a price: you need the time and the tools to build this documentation most of the times. And sometimes the tools you’re needed to install are almost overkill against the library they are used by. While most of the software out there with generated man pages ships with them already prebuilt in the tarball (thanks to automake, the whole thing can be done quite neatly), there are packages that don’t ship with them, either because they don’t have a clean way to tar them up at release or because they are not released (ruby-elf is culprit of this too, since it’s only available on the repository for now).

For those, the solution usually is to bring in some extra packages like, for the ruby-elf case above, the docbook-ns stylesheets that are used to produce the final man page from the DocBook 5 sources. But it might not use this: there are quite a lot of different ways to build man pages: perl scripts, compiled tools, custom XML formats, you name it.

And this is just for man pages, which are usually updated explicitly by their authors: API documentation, which is usually extrapolated from the source code directly, is rarely generated when creating the final release distribution. This goes for C/C++ libraries that use doxygen or gtk-doc, to Java packages that use JavaDoc, to Ruby extensions that use RDoc (indeed, the original idea for this post came to me when I was working on Ruby-ng eclass and noticed that almost all the Ruby extensions I packaged required me to rebuild the API documentation at build time).

Now, when it comes to API documentation, it’s obvious we don’t really want to “waste” time generating it for non-developers: they would never care about reading it in the first place. This is why we have USE flags after all. But sometimes, even this does not seem to be enough control. The first problem is: which format do we use for the documentation? For those of you that don’t know it, Doxygen can generate documentation in many forms, included but not limited to HTML, PDF (through LaTeX) and Microsoft Compressed HTML (CHM). There are packages that do build all formats available; some autodiscover the available tools, other try to use the tools even when they are not installed in the system.

We should probably do some kind of selection, but it has to be said it’s not obvious, especially when upstream, while adding proper targets to rebuild documentation, only design them for their own usage: to generate and publish, on their site or something, the resulting documentation. We install the documentation for the system user, we should probably focus on what can be displayed on screen, which would probably steer us toward installing HTML files because they are browsable and easy to look at on the screen. But I’m sure there are people who are interested in having the PDFs at hand instead, so if we were to focus on just those people will complain. Not like at this point I’m caring about a 100% experience but rather having a good experience for a 90% of people, maybe 95%.

I do remember that there are quite a few packages that do try to use LaTeX to rebuild documentation, this because there have been quite a few sandbox problems with the font cache that was regenerated during portage build. Unfortunately, I don’t have any number at hand, because – silly me – the tinderbox strips documentation away to save space (maybe I should remove that quirk, the raid1 volumes have quite a bit of free space by now). I can speak, recently, for Ragel, which I’ve move away from rebuilding the documentation, inspired first by the FreeBSD ports which downloaded the pre-built PDF version from Ragel’s site (I did the same for version 6.4, under doc USE flag), and then sidestepping the issue altogether since upstream now ships with the PDF in the source tarball.

But this is also buggering me as upstream for a few projects: what is the best for my users? The online API documentation is useful when you don’t want to rebuild the documentation locally, and can be searched by search engines much more easily, but is that enough? Offline users? Users with restricted bandwidth? Servers with restricted bandwidth? Of course offline users can regenerate the documentation, but is that the best option? Should the API documentation be shipped within the source tarball? That could make the tarball much much bigger than just the sources; it can even double in size.

Downloadable documentation, Python-style, looks to me like one of the best options. You get the source tarball, and the documentation tarball; you install the latter if the doc USE flag is enabled. But how to generate them? I guess that adding one extra target to the Makefiles (or equivalent for your build system) may very well be an option, I’ll probably work on that for lscube with a ready recipe showing how to make the tarball during make dist (and of course documenting it where it’s easier to reach than my blog).

The only problem with this is that it doe not take advantages of improved generation by newer version of the software; for instance if one day Doxygen, JavaDoc, RDoc and the like decide finally to agree on a single, compatible XML/XHTML format for documentation to be accessed with an application integrating a browser and an index system (I’d like to say that both Apple and Microsoft provide applications that seem to be doing that; I haven’t used them quite long enough to actually tell how well they work, but they are designed to do that).

But at least let this be a start for a discussion: should we really rebuild PDF documentation when installing packages for Gentoo, even under doc USE flag, or should we stick with more display-oriented formats?