Video: unpaper with Meson — From DocBook to ReStructured Text

I’m breaking the post-on-Tuesday routine to share the YouTube-uploaded copy of the stream I had yesterday on Twitch. It’s the second part of the Unpaper conversion to Meson, which is basically me spending two hours faffing around Meson and Sphinx to update how the manual page for Unpaper is generated.

I’m considering trying to keep up with having a bit of streaming every weekend just to make sure I get myself some time to work on Free Software. If you think this is interesting do let me know, as it definitely helps with motivation to know I’m not just spending time that would otherwise be spent playing Fallout 76.

Future planning for Ruby-Elf

My work on Ruby-Elf tends to happen in “sprees” whenever I actually need something from it that wasn’t supported before – I guess this is true for many projects out there, but it seems to happen pretty regularly for me with my projects. The other day I prepared a new release after fixing the bug I found while doing the postmortem of a libav patch – and then I proceeded to give another run to my usual collisions check after noting that I could improve the performance of the regular expressions …

But where is it directed, as it is? Well, I hope I’ll be able to have version 2.0 out before end of 2013 — in this version, I want to make sure I get full support for archives, so that I can actually analyze static archives without having to extract them beforehand. I’ve got a branch with the code to get access to the archives themselves, but it can only extract the file before actually being able to read it. The key in supporting archives would probably be supporting in-memory IO objects, as well as offset-in-file objects.
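To make the “offset-in-file object” idea concrete, here is a minimal sketch of a read-only window over part of another file object. It is written in Python rather than Ruby purely for illustration; the class name `OffsetReader` and its parameters are hypothetical and do not come from Ruby-Elf.

```python
import io


class OffsetReader(io.RawIOBase):
    """Read-only window onto bytes [start, start + length) of another file object."""

    def __init__(self, base, start, length):
        super().__init__()
        self._base = base      # any seekable binary file object (real file or BytesIO)
        self._start = start    # offset of the member inside the base object
        self._length = length  # size of the member in bytes
        self._pos = 0          # current position, relative to the window start

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            target = offset
        elif whence == io.SEEK_CUR:
            target = self._pos + offset
        elif whence == io.SEEK_END:
            target = self._length + offset
        else:
            raise ValueError("invalid whence")
        if target < 0:
            raise ValueError("negative seek position")
        self._pos = target
        return self._pos

    def readinto(self, buffer):
        # Never read past the end of the window, even if the base has more data.
        remaining = max(0, self._length - self._pos)
        count = min(len(buffer), remaining)
        if count == 0:
            return 0
        self._base.seek(self._start + self._pos)
        data = self._base.read(count)
        buffer[:len(data)] = data
        self._pos += len(data)
        return len(data)


# An in-memory "archive": a 6-byte header followed by a 12-byte member.
archive = io.BytesIO(b"HEADERfirst-membersecond")
member = OffsetReader(archive, 6, 12)
```

Because the base can be an in-memory `BytesIO` just as well as a file on disk, the same window works for both cases, which is the property the archive support would need.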

I’ve also found an interesting gem called bindata which seems to provide a decent way to decode binary data in Ruby without having to actually fully pre-decode it. This would probably be a killer for Ruby-Elf, as a lot of the time I’m forcibly decoding everything because it was extremely difficult to access it on the spot — so the first big change for Ruby-Elf 2 is going to be to drop down the task of decoding to bindata (or, otherwise, another similar gem).

Another change that I plan is to drop the current version of the man pages. While DocBook is a decent way to deal with man pages, and standard enough to be around in most distributions, it’s one “strange” dependency for a Ruby package – and honestly the XML is a bit too verbose sometimes. For the beefiest man pages, the generated roff page is half as big as the source, which is the other way around from what anybody would expect.

So I’m quite decided that the next version of Ruby-Elf will use Markdown for the man pages — while it does not have the same amount of semantic tagging, and thus I might have to handle some styling in the synopsis manually, using something like md2man should be easy (I’m not going to use ronn because of the old issue with JRuby and rdiscount) and at the same time, it gives me a public HTML version for free, thanks to GitHub conversion.

Finally, I really hope that by Ruby-Elf 2 I’ll be able to get at least the symbol demangler for the Itanium C++ ABI – that is the one used by modern GCC; yes, it was originally specified for the Itanic. Working toward supporting the full DWARF specification is something that is on the back of my mind but I’m not very convinced right now, because it’s huge. Also, if I were to implement it I would then have to rename the library to Dungeon.

The issue with the split HTML/XHTML serialization

Not everybody knows that HTML 5 has been released in two flavours: HTML 5 proper, which uses the old serialization, similarly to HTML 4, and what is often incorrectly called XHTML 5 which uses XML serialization, like XHTML and XHTML 1.1 did. The two serializations have different grades of strictness, and the browsers deal with them that way.

It so happens that the default output on DocBook for XHTML 1 is compatible with the HTML serialization, which means that even if the files have a .html extension, locally, they will load correctly in Chrome, for instance. The same can’t be said of XHTML 1.1 or XHTML5 output; one particularly nasty problem is that the generated code will output XML-style tags such as <a id="foo" /> which throw off the browsers entirely, unless properly loaded as XHTML … and on the other hand, IE still has trouble when served properly-typed XHTML (i.e. you have to serve it as application/xml rather than application/xhtml+xml).

So I have two choices: redirect all the .html requests to .xhtml, make it use XHTML 5 and work around the IE8 (and earlier) limitations, or I can forget about XHTML 5 at all. This starts to get tricky! So for the moment I decided to not go with XHTML 5, and at the same time I’m going to keep building ePub 2 books, and publish them as they are, instead of using ePub 3 (even though, as I said, O’Reilly got it working for their workflow).

Unfortunately even if I went through that on the server side to fix it, that wouldn’t even be enough alone! I would have to also change the CSS, since many things that were always <div> before are now using proper semantic types, including <section> (with the exception of the table of contents on the first landing page, obviously (damn)). This actually makes it easier in one way as it lets me drop the stupid nth-child CSS3 trick I used to set the style of the main div, compared to the header and footer. Hopefully this should let me fix the nasty IE 3 style beveled border that Chrome put around the Flattr button when using XHTML 5.

In the mean time I have a few general fixes to the style, now I just need to wait for the cover image to come from my designer friend, and then I can update both the website and the eBook versions all around the stores.

To close the post: David, you deserve a public apology. While you were listed as <editor> on the DocBook sources before, and the XSL was supposed to emit it on the homepage, for whatever reason, it fails to. I’ve upgraded you to <author> until I can find out why the XSL is misbehaving, so I can fix it properly.

In the mean time, tomorrow I’ll write a few more words about automake and then

The future of Autotools Mythbuster

You might have noticed after yesterday’s post that I have done a lot of visual changes to Autotools Mythbuster over the weekend. The new style is just a bunch of changes over the previous one (even though I also made use of sass to make the stylesheet smaller), and for the most part is to give it something recognizable.

I need to spend another day or two working on the content itself at the very least, as the automake 1.13 porting notes are still not correct, due to further changes done on Automake side (more on this in a future post, as it’s a topic of its own). I’m also thinking about taking a few days off Gentoo Linux maintenance, Munin development, and other tasks, and just work on the content on all the non-work time, as it could use some documentation of install and uninstall procedures for instance.

But leaving the content side alone, let me address a different point first. More and more people lately have been asking for a way to have the guide available offline, either as ebook (ePub or PDF) or packaged. Indeed I was asked by somebody if I could drop the NonCommercial part of the license so that it can be packaged in Debian (at some point I was actually asked why I’m not contributing this to the main manuals; the reason is that I really don’t like the GFDL, and furthermore I’m not contributing to automake proper because copyright assignment is becoming a burden in my view).

There’s an important note here: while you can easily see that I’m not pouring into it the amount of time needed to bring this to book quality, it does take a lot of time to work on it. It’s not just a matter of gluing together the posts that talk about autotools from my blog, it’s a whole lot of editing, which is indeed a whole lot of work. While I do hope that the guide is helpful, as I wrote before, it’s much more work, for the most part, than I can pour into it in my free time, especially in-between jobs like now (and no, I don’t need to find a job – I’m waiting to hear from one, and got a few others lined up if it falls through). While Flattr helps, it seems to be drying up, at least for what concerns my content; even Socialvest is giving me some grief, probably because I’m no longer connecting from the US. Beside that, the only “monetization” (I hate that word) strategy I got for the guide is AdSense – which, I remind you, kicked my blog out for naming an adult website on a post – and making the content available offline would defeat even the very small returns of that.

At this point, I’m really not sure what to do; on one side I’m happy to receive more coverage just because it makes my life easier to have fewer broken build systems around. On the other hand, while not expecting to get rich off it, I would like to know that the time I spend on it is at least partly compensated – token gestures are better than nothing as well – and that precludes a simple availability of the content offline, which is what people at this point are clamoring for.

So let’s look into the issues more deeply: why the NC clause on the guide? Mostly I want to have a way to stop somebody else exploiting my work for gain. If I drop the NC clause, nothing can stop an asshole from picking up the guide, making it available on Amazon, and getting the money for it. Is it likely? Maybe not, but it’s something that can happen. Given the kind of sharks that infest Amazon’s self-publishing business, I wouldn’t be surprised. On the other hand, it would probably make it easier for me to accept non-minor contributions and still be able to publish it at some point, maybe even in real paper, so it is not something I’m excluding altogether at this point.

Getting the guide packaged by distributions is also not entirely impossible right now: Gentoo has generally not the same kind of issues as Debian regarding the NC clauses, and since I’m already using Gentoo to build and publish it, making an ebuild for it is tremendously simple. Since the content is also available on Git – right now on Gitorious, but read on – it would be trivial to do. But again, this would be cannibalizing the only compensation I got for the time spent on the guide. Which makes me very doubtful on what to do.

About the sources, there is another issue: while at the time I started all this, Gitorious was handier than GitHub, over time Gitorious interface didn’t improve, while the latter improved a lot, to the point that right now it would be my choice to host the guide: easier pull requests, and easier coverage. On the other hand, I’m not sure if the extra coverage is a good thing, as stated above. Yes, it is already available offline through Gitorious, but GitHub would make it effectively easier to get offline than to consult online. Is that what I want to do? Again, I don’t know.

You probably also remember an older post of mine from one and a half years ago where I discussed the reasons why I haven’t published Autotools Mythbuster at least through Amazon; the main reason was that, at the time, Amazon had no easy way to update the book for the buyers without having them buy a new copy. Luckily, this has changed recently, so the obstacle has actually fallen. With this in mind, I’m considering making it available as a Kindle book for those of you who are interested. To do so I first have to create it as an ePub though, so it would solve the question that I’ve been asked about the eBook availability… but at the same time we’re back to the compensation issue.

Indeed, if I decide to set up ePub generation and start selling it on the Kindle store, I’d be publishing the same routines on the Git repository, making it available to everybody else as well. Are people going to buy the eBook, even if I priced it at $0.99? I’d suppose not. Which leaves me unsure what the target would be on the Kindle store: price it down so that the convenience of just buying it from Amazon outweighs the work of rolling your own ePub, or googling for a copy – considering that just one person rolling the ePub can easily make it available to everybody else – or price it at a higher point, say $5, hoping that a few interested users would fund the improvements? Either bet sounds bad to me honestly, even considering that Calcote’s book is priced at $27 at Amazon (hardcopy) and $35 at O’Reilly (eBook); obviously, his book is more complete, although it is not a “living” edition like Autotools Mythbuster is.

Basically, I’m not sure what to do at all. And I’m pretty sure that some people (who will comment) will feel disgusted that I’m trying to make money out of this. On the whole, I guess one way to solve the issue is to drop the NC clause, stick it into a Git repository somewhere, maybe keep it running on my website, maybe not, but not waste energy into it anymore… the fact that, with the much more focused topic, it has just 65 flattrs, is probably indication that there is no need for it — which explains why I couldn’t find any publisher interested in making me write a book on the topic before. Too bad.

Texinfo to Kindle, an odyssey

This should be my last week in Los Angeles for the moment. Tomorrow Excelsior will be connected directly to the Internet, with its own IP (v4) and an IPv6 tunnel ready. I’ll catch a plane next week to go back to Italy to take care of a few things, while it crunches numbers.

Since I’m expecting long plane rides in my future, I hope to be able to read much more. In particular, I want to finally find the time to learn enough Elisp to write my own Emacs modes. I really miss a decent ActionScript mode while I’m working on Flash code (don’t ask).

So I set myself out to find a way to produce a standard ePub file — from that, converting to a Kindle-compatible Mobi file, is just a matter of using Calibre.

I found this post from one and a half years ago, which describes the situation pretty nicely. While I’m currently ignoring the issue with the TOC that the author is describing (probably simply because I haven’t been able to load this on my Kindle and judge it yet), I found a different one: makeinfo will generate invalid XML.

The problem lies in the id= attribute of XML, which is tightly specified by the language to have a given format (has to start with only certain characters, and then only a few more are allowed; it can’t start with a number for instance, nor can it contain a slash character). While makeinfo already had a function to (partially) escape an XML id, it wasn’t using it for the docbook output. The function itself, then, wasn’t considering all the escapes, and thus even when calling it, the output would still be invalid, if the texi sources contained non-alphanumeric characters.
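As an illustration of the kind of escaping involved, here is a small Python sketch. It is not the code from makeinfo or from my patch: the function name `escape_xml_id`, the `_XXXX` hexadecimal escape and the `id-` prefix are all assumptions made for the example, and it only keeps a conservative ASCII subset of the characters XML allows in an id.

```python
import re

# Conservative subset of what XML allows inside an id: ASCII letters, digits,
# '.' and '-'. The underscore is escaped as well, because this sketch uses it
# as the escape marker.
_NOT_ALLOWED = re.compile(r"[^A-Za-z0-9.-]")


def escape_xml_id(text):
    """Turn an arbitrary node name into something usable as an XML id (sketch)."""
    # Replace every disallowed character with '_' plus its code point in hex.
    escaped = _NOT_ALLOWED.sub(lambda m: "_%04x" % ord(m.group(0)), text)
    # An id must start with a letter or an underscore, never a digit, '.' or '-'.
    if not re.match(r"[A-Za-z_]", escaped):
        escaped = "id-" + escaped  # hypothetical prefix, chosen for illustration
    return escaped
```

With this scheme a node called `Invoking/makeinfo` becomes `Invoking_002fmakeinfo`, and a title starting with a digit gets the prefix. The prefix can in principle collide with a node that is literally named `id-…`, so a real implementation would want something more careful.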

So now I have a patch for texinfo which should work; too bad I also have to get a copyright assignment for this as well, and I don’t know if I’ll have to wait till I get home to sign and send it back or not. The important part is having the patch though. I also fixed the issue with setfilename being added to the output when creating docbook.

Then there is another issue: the dbtoepub script. In Gentoo this script is installed by docbook-xsl-stylesheets and docbook-xsl-ns-stylesheets within /usr/share; the problem is that it was never made easy to execute, and its dependencies weren’t considered. I took the chance of a bump of the stylesheets to add a USE flag for Ruby to the package (the script is written in Ruby) that will either remove the script or also install an executable wrapper so that it can be executed.

Actually, while I was at it, I made sure that the two ebuilds, which install two variants of the same basic content, will be almost identical just changing the directory where the content is installed, and making the remaining changes happen depending on $PN (an exception being the keywords as the namespaced version is not used so much, it’s just me liking them most of the time).

After I got the epub file, it was time to make sure it was complying with the specs; I’ve been burnt before by invalid or simply non-standard epub files. Luckily, Adobe of all people released an open source (BSD-licensed) tool to audit the files; epubcheck version 1.1 is now in tree as app-text/epubcheck. I’m hoping somebody who knows more Java than me can get a new version of jing in tree so I can bring epubcheck 3 into the tree; it needs a considerably newer one than is available right now, and that’s bad. The new version is designed to support the new version of the epub standard (which is supported by the 1.77.0 release of the stylesheets as well, and should be relatively easy to use even without Ruby), so I’m fine with version 1.1 right now.

Anyway all the tools I’ve been using should now be in tree (I’m testing the texinfo patch as we speak), and soon enough I should be able to start reading that manual on my Kindle.. expect some Emacs modes from me, afterwards.

Unpaper fork, part 2

Last month I posted a call to action hoping for help with cleaning up the unpaper code, as the original author has not been updating it since 2007, and it had a few issues. While I have seen some interest in said fork and cleanup, nobody stepped up with help, so it is proceeding, albeit slowly.

What is available now in my GitHub repository is mostly cleaned up, although still not extremely more optimised than the original — I actually removed one of the “optimisations” I added since the fork: the usage of sincosf() function. As Freddie pointed out in the other post’s comments, the compiler has a better chance of optimising this itself; indeed both GCC and PathScale’s compiler optimise two sin and cos calls with the same argument into a single sincos call, which is good. And using two separate calls allows declaring the temporary used to store the results as constant.

And indeed today I started rewriting the functions so that temporaries are declared as constant as possible, and with the most limited scope applicable to them. This was important to me for one reason: I want to try making use of OpenMP to improve its performance on modern multicore systems. Since most of the processing is applied independently to each pixel, it should be possible for many iterative cycles to be executed in parallel.

It would also be a major win in my book if the processing of input pages was feasible in parallel as well: my current scan script has to process the scanned sheets in parallel itself, calling many copies of unpaper, just to process the sheets faster (sometimes I scan tens of sheets, such as bank contracts and similar). I just wonder if it makes sense to simply start as many threads as possible, each one handling one sheet, or if that would risk hogging the scheduler.
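One middle ground between “one thread per sheet” and fully serial processing is a bounded worker pool. The following Python sketch shows roughly what a scan script could do; it is not my actual script, the function name `process_sheets` and its parameters are made up for the example, and it assumes the external tool takes an input file followed by an output file on its command line.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def process_sheets(pairs, command=("unpaper",), max_workers=None):
    """Run one external process per (input, output) pair, a few at a time.

    Returns the exit codes in the same order as the pairs.
    """
    if max_workers is None:
        # Cap the number of concurrent processes at the number of CPUs instead
        # of starting one per sheet, so tens of sheets don't swamp the scheduler.
        max_workers = os.cpu_count() or 1

    def run(pair):
        source, target = pair
        # The threads only wait on the child processes, so a thread pool is enough.
        return subprocess.run([*command, source, target]).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, pairs))
```

Calling `process_sheets([("page1.pnm", "clean1.pnm"), ("page2.pnm", "clean2.pnm")])` would then run at most one copy of the tool per CPU at any given time.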

Finally there is the problem of testing. Freddie also pointed me at the software I remembered to check the differences between two image files: pdiff — which is used by the ChromiumOS build process, by the way. Unfortunately I then remembered why I didn’t like it: it uses the FreeImage library, which bundles a number of other image format libraries, and upstream refuses to apply sane development to it.

What would be nice for this would be to either modify pdiff to use a different library – such as libav! – to access the image data, or to find or write something similar that does not require such stupidly-designed libraries.

Speaking about image formats, it would be interesting to get unpaper to support other image formats besides PNM; this way you wouldn’t have to keep converting from and to the other formats when processing. One idea that Luca gave me was to make use of libav itself to handle that part: it already supports PNM, PNG, JPEG and TIFF, so it would provide most of the features it’d be needing.

In the mean time, please let me know if you like how this is going, and remember that this blog, the post and I are Flattr enabled!

On releasing Ruby software

You probably know that I’ve been working hard on my Ruby-Elf software and its tools, which include my pride, elfgrep, and are now available in the main Portage tree so that it’s just an emerge ruby-elf away. To make it easier to install, manage and use, I wanted to make the package as much in line with Ruby packaging best practices as possible, taking into consideration both those installing it as a gem and those installing it with package managers such as Portage. This gave me a few more insights on packaging that had escaped me before.

First of all, thankfully, RubyGems packaging starts to be feasible without needing a bunch of third party software; whereas a lot of software used to require Hoe or Echoe to even run tests, some of it is reeling back, and using simply the standard Gem-provided Rake task to run packaging; this is also the road I decided to take with Ruby-Elf. Unfortunately Gentoo is once again late on the Rubygems game, as we still have 1.3.7 and not 1.5.0 used; this is partly because we’ve been hitting our own roadblocks with the upgrade of Ruby 1.9, which is really proving a pain in our collective … backside — you’d expect that in early 2011 all the main Ruby packages would work with the 1.9.2 release just fine, but that’s still not the case.

Integrating Rubyforge upload, though, is quite difficult because the Rubyforge extension itself is quite broken and no longer works out of the box; the main problem is that it tries to use the “Any” specification for CPU, but that exists no more, replaced by “Other”; you can trick it into using that by changing the automated configuration, but it’s not a completely foolproof system. The whole extension seems pretty much outdated and written hastily (if there is a problem when creating the release slots or uploading the file, the state of the release is left halfway through).

For what concerns mediating between keeping a simple RubyGems packaging and still providing all the details needed for distributions’ packaging, while not requiring all the users to install the required development packages, I’ve decided to release two very different packages. The RubyGem only installs the code, the tools, and the man pages; it lacks the tests, because there is a lot of test data that would otherwise be installed without any need for it. The tarball on the other hand contains all the data from the git repository, also including the gemspec file (that is needed for instance in Gentoo to have fakegems install properly). In both cases, there are two types of files that are included in the two distributions but are not part of the git repository: the man pages and the Ragel-generated demanglers (which I’m afraid I’ll soon have to drop and replace with manually-written ones, as Ragel is unsuitable for totally recursive patterns like the C++ mangling format used by GCC3 and specified by the IA64 ABI); by distributing these directly, users are not required to have either Ragel or libxslt installed to make full use of Ruby-Elf!

Speaking about the man pages: I love the many tricks I can make use of with DocBook and XSLT. I don’t have to reproduce the same text over and over when the options, or bugs, are the same for all the tools – I have a common library to implement them – I just need to include the common file, and use XPointer to tell it which part of the file to pick up. Also, it’s quite important to me to keep the man pages updated, since I took a page out of the git book: rather than implementing the --help option with a custom description of the options, the --help option calls up the manpage of the tool. This works out pretty well, mostly because this particular gem is designed to work on Unix systems, so that the man tool is always going to be present. Unfortunately in the first release I made it didn’t work out all too well, as I didn’t consider the proper installation layout of the gem; this is now fixed and works perfectly even if you use gem install ruby-elf.
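The --help trick is easy to reproduce elsewhere. As a rough sketch of the idea, here it is in Python with argparse rather than in Ruby; the names `man_command`, `ManPageHelp` and `build_parser`, and the `man_dir` parameter, are invented for the example. The `man_dir` variant assumes a man implementation that, like man-db, treats an argument containing a slash as the path to a page file.

```python
import argparse
import os


def man_command(tool_name, man_dir=None):
    """Build the argv used to display the manual page for tool_name."""
    if man_dir is not None:
        # Assumption: man treats an argument containing a slash as a file path,
        # which lets a package point at pages installed outside of MANPATH.
        return ["man", os.path.join(man_dir, tool_name + ".1")]
    return ["man", tool_name]


class ManPageHelp(argparse.Action):
    """argparse action that replaces the process with man for this tool."""

    def __init__(self, option_strings, dest, man_dir=None, **kwargs):
        super().__init__(option_strings, dest, nargs=0, **kwargs)
        self.man_dir = man_dir

    def __call__(self, parser, namespace, values, option_string=None):
        argv = man_command(parser.prog, self.man_dir)
        os.execvp(argv[0], argv)  # does not return if man is found


def build_parser(prog, man_dir=None):
    # Disable the built-in --help so that the man page is the only help text.
    parser = argparse.ArgumentParser(prog=prog, add_help=False)
    parser.add_argument("--help", action=ManPageHelp, man_dir=man_dir,
                        default=argparse.SUPPRESS)
    return parser
```

The obvious trade-off is the one mentioned above: this only makes sense where man is guaranteed to be present, and the page has to be installed somewhere the tool can find it.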

The one problem I still have is that I have not yet signed the packages themselves; the reason is actually quite simple: while it’s trivial with OpenSSH to proxy the ssh-agent connection, so that I can access private hosts when jumping from my frontend system to Yamato, I currently can’t find any way to proxy the GnuPG agent, which is needed for me to sign the packages; sure I could simply connect another smartcard reader to Yamato and move the card there to do the signing, but I’m not tremendously happy with such a solution. I think I’ll be writing some kind of script to do that; it shouldn’t be very difficult to do with ssh and nc6.

Having now released my very first Ruby package, and my first Gem, I hope to be able to do a better job at packaging, and fixing others’ packages, in Gentoo.

The status of some deep roots

While there are quite a few packages that are known to be rotting in the tree, and thus are now being pruned away step by step, there are some more interesting facets in the status of Gentoo as a distribution nowadays.

While the more interesting and “experimental” areas seem to have enough people working on them (Ruby to a point, Python more or less, KDE 4, …), there are quite some deeper areas that are just left to rot as well, but cannot really be pruned away. This includes for instance Perl (for which we’re lagging behind a lot, mostly due to the fact that tove is left alone maintaining that huge piece of software), and SGML, which in turn includes all the DocBook support.

I’d like to focus a second on that latter part because I am partly involved in it: I like using DocBook, and I actually use the stylesheets from the packages available in Portage to produce the online version of Autotools Mythbuster. Now, when I wanted to make use of DocBook 5, the stylesheets for the namespaced version (very useful to write with emacs and nxml) weren’t available, so I added them, adding support for them to the build-docbook-catalog script. With time, I ended up maintaining the ebuilds for both versions of the stylesheets, and that hasn’t always been the cleanest thing given that upstream dropped the tests entirely in the newer versions (well, technically they are still there, but they don’t work; it seems they lack some extra stuff that is documented nowhere).

Now, I was quite good as I was with this; I just requested stable for the new ebuilds of the stylesheets (both variants) and I could have kept just doing that, but … yesterday I noticed that the list of examples in my guide had broken links, and after mistakenly opening a bug on the upstream tracker, I noticed that the bug is fixed already in the latest version. Which made me smell something: why did nobody complain that the old stylesheets were broken? Looking at the list of bugs for the SGML team, you can see that lots of stuff was actually ignored for way too long a time. I tried cleaning up some stuff, duping bugs that were obviously the same, and fixing one in the b-d-c script, but this is one of the internal roots that is rotting, and we need help to save it.

For those interested in helping out, I have taken note of a few things that should probably be done with medium urgency:

  • make sure that all the DTDs are available in the latest release, and that they are still available upstream; I had to seed an old distfile today because upstream dropped it;
  • try to find a way to install the DocBook 5 schemas properly; right now the nxml-docbook5-schemas package installs its own copy of the Relax-NG Compact file; on Fedora 11, there is a package that installs more data about DocBook 5, we should probably use the same original sources; the nxml-docbook5-schemas package could then either be merged in with that package or simply use the already-installed copy;
  • replace b-d-c, making it both more generic and using a framework that exists already (like eselect) instead of reinventing the wheel; the XML/DTD catalog can easily be used for more than just DocBook, while I know the Gentoo documentation team does not want for the Gentoo DTD to just be available as a package to install in the system (which would make it much easier to keep updated for the nxml schemas, but sigh), I would love to be able to make fsws available that way (once I’ll finish building the official schema for it and publish it, again more on that in the future);
  • find out how one should be testing the DocBook XSL stylesheets, so that we can run tests for them; it would have probably avoided the problem I had with Autotools Mythbuster in the past months;
  • package the stylesheets for Xalan and Saxon, which are different from the standard ones; b-d-c already has support for them to a point (although not having to spell out this kind of thing in the b-d-c replacement is desirable), but I didn’t have reason to add them.

I don’t think I’ll have much time to work on them in the future, so user contributions are certainly welcome; if you do open any bug for these issues, please do CC me directly, since I don’t intend (yet) to add myself to the sgml alias.

Autotools Mythbuster: updated guide

While preparing for my first vacation ever next week, I’ve been trying to write up more content on my guide so that at least my fellow developers in LScube have a reference of what I’ve been doing, and Gentoo developers as well, as lately I’ve been asked quite a few interesting questions (and not just by them).

So, first of all, thanks to David (user99) who cleaned up the introduction chapter, and to Gilles (eva) who gave me the new stylesheet (so that it doesn’t look as rough as it did before; it also narrows the lines so that it reads better). It might not be the final style, but it really is an improvement now.

As for my changes, I’ve been trying to change slightly the whole take of the guide, trying to write up complete working examples for the readers to use, that are listed in the main page. At the same time, I’m trying to cover the most important, or less known, topics, with particular attention to what people asked me, or what I’ve been using on projects which is not very well known. The list of topics added include:

  • using AC_CHECK_HEADERS to get one out of a prioritised list of headers;
  • using AC_ARG_* macros (enable, with and environment variables);
  • using AS_HELP_STRING to document the arguments;
  • using AC_ARG_WITH to set up automatic but not automagic dependencies;
  • using automake with non-recursive makefiles (including some of the catches);
  • improved automake silent-rules documentation;
  • using external macro files with autoconf (work in progress, for now only autoconf with no extras is documented).

I’m considering the idea of merging in some of the For A Parallel World articles, at least those dealing with automake, to complete the documentation. The alternative would be to document all these kinds of problems and write something along the lines of “A Distributor’s Bible”… the problem with that idea is that almost surely somebody will complain if I use the name “Bible” (warning: I’m not a Catholic, I’m an atheist!)… and if I am to call it “The Sacred Book of Distributors” I’ll just be having to dig up all the possible mocks and puns over various religions, ‘cause I’ll be almost surely writing the ten commandments for upstream projects (“Thou shall not ignore flags”, “Thou shall version your packages”), and that also will run into a political correctness problem.

Oh well, remember that I do accept gifts (and I tried not putting there the stuff that I’ll be buying next week… I already told my friends not to let me enter too many shops, but I’ll be following them and that’s not going to be a totally safe tour anyway…).

Documentation: remake versus download

One of the things that I like a lot about Gentoo is that you can easily have installed the whole set of documentation for almost every library out there, being API, tutorials or all the stuff like that.

This, unfortunately, comes with a price: most of the time you need the time and the tools to build this documentation. And sometimes the tools you need to install are almost overkill compared to the library that uses them. While most of the software out there with generated man pages ships them already prebuilt in the tarball (thanks to automake, the whole thing can be done quite neatly), there are packages that don’t ship them, either because they don’t have a clean way to tar them up at release or because they are not released at all (ruby-elf is guilty of this too, since it’s only available in the repository for now).

For those, the solution usually is to bring in some extra packages like, for the ruby-elf case above, the docbook-ns stylesheets that are used to produce the final man page from the DocBook 5 sources. But other packages may not use this at all, since there are quite a lot of different ways to build man pages: Perl scripts, compiled tools, custom XML formats, you name it.

And this is just for man pages, which are usually updated explicitly by their authors: API documentation, which is usually extracted directly from the source code, is rarely generated when creating the final release distribution. This goes from C/C++ libraries that use doxygen or gtk-doc, to Java packages that use JavaDoc, to Ruby extensions that use RDoc (indeed, the original idea for this post came to me when I was working on the Ruby-ng eclass and noticed that almost all the Ruby extensions I packaged required me to rebuild the API documentation at build time).

Now, when it comes to API documentation, it’s obvious we don’t really want to “waste” time generating it for non-developers: they would never care about reading it in the first place. This is why we have USE flags after all. But sometimes, even this does not seem to be enough control. The first problem is: which format do we use for the documentation? For those of you that don’t know it, Doxygen can generate documentation in many forms, including but not limited to HTML, PDF (through LaTeX) and Microsoft Compressed HTML (CHM). There are packages that do build all formats available; some autodiscover the available tools, others try to use the tools even when they are not installed on the system.

We should probably do some kind of selection, but it has to be said it’s not obvious, especially when upstream, while adding proper targets to rebuild documentation, only designs them for their own usage: to generate and publish the resulting documentation on their site or something. Since we install the documentation for the system user, we should probably focus on what can be displayed on screen, which would probably steer us toward installing HTML files because they are browsable and easy to look at on the screen. But I’m sure there are people who are interested in having the PDFs at hand instead, so if we were to focus on HTML alone, those people will complain. Not that at this point I care about a 100% experience; I’d rather have a good experience for 90% of people, maybe 95%.

I do remember that there are quite a few packages that try to use LaTeX to rebuild documentation, because there have been quite a few sandbox problems with the font cache being regenerated during the Portage build. Unfortunately, I don’t have any numbers at hand, because – silly me – the tinderbox strips documentation away to save space (maybe I should remove that quirk, the RAID1 volumes have quite a bit of free space by now). I can speak, recently, for Ragel, which I’ve moved away from rebuilding the documentation for, inspired first by the FreeBSD ports which downloaded the pre-built PDF version from Ragel’s site (I did the same for version 6.4, under the doc USE flag), and then sidestepping the issue altogether since upstream now ships the PDF in the source tarball.

But this is also bugging me as upstream for a few projects: what is best for my users? The online API documentation is useful when you don’t want to rebuild the documentation locally, and can be searched by search engines much more easily, but is that enough? What about offline users? Users with restricted bandwidth? Servers with restricted bandwidth? Of course offline users can regenerate the documentation, but is that the best option? Should the API documentation be shipped within the source tarball? That could make the tarball much, much bigger than just the sources; it can even double in size.

Downloadable documentation, Python-style, looks to me like one of the best options. You get the source tarball and the documentation tarball; you install the latter if the doc USE flag is enabled. But how to generate them? I guess that adding one extra target to the Makefiles (or the equivalent for your build system) may very well be an option; I’ll probably work on that for lscube, with a ready recipe showing how to make the tarball during make dist (and of course documenting it somewhere easier to reach than my blog).
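
A minimal sketch of what such a recipe could boil down to, written in Python rather than as a Makefile target. The function name make_doc_tarball, the “-doc” suffix in the tarball name and the example package name are all hypothetical, and it assumes the documentation has already been generated into a directory:

```python
import tarfile
from pathlib import Path


def make_doc_tarball(doc_dir, package, version, out_dir="."):
    """Pack pre-built documentation into <package>-<version>-doc.tar.gz.

    The naming scheme is an assumption made for this sketch, not a
    convention taken from any particular project.
    """
    doc_dir = Path(doc_dir)
    if not doc_dir.is_dir():
        raise FileNotFoundError(f"no generated documentation in {doc_dir}")

    prefix = f"{package}-{version}-doc"
    tarball = Path(out_dir) / f"{prefix}.tar.gz"

    with tarfile.open(str(tarball), "w:gz") as archive:
        # Store everything under a single top-level directory, the way
        # source tarballs conventionally do.
        archive.add(str(doc_dir), arcname=prefix)

    return tarball
```

Calling make_doc_tarball("doc/html", "example", "1.0") would write example-1.0-doc.tar.gz to the current directory, next to the source tarball; the package manager can then fetch and unpack it only when the doc USE flag is enabled.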

The only problem with this is that it does not take advantage of improved generation by newer versions of the software; for instance, pre-built documentation would miss out if one day Doxygen, JavaDoc, RDoc and the like finally decided to agree on a single, compatible XML/XHTML format for documentation, to be accessed with an application integrating a browser and an index system (I should note that both Apple and Microsoft provide applications that seem to do that; I haven’t used them long enough to actually tell how well they work, but they are designed to do that).

But at least let this be a start for a discussion: should we really rebuild PDF documentation when installing packages for Gentoo, even under the doc USE flag, or should we stick with more display-oriented formats?