It’s all in the schema

When it comes to file formats, I’m old school, and I probably prefer XML over something like YAML. Over the years I have also had multiple discussions that could be summarised as “If only we used $format, we wouldn’t have these problems” — despite the problems being fairly clearly a matter of schema rather than format.

These discussions didn’t even stop in the bubble: while protocol buffers are the de-facto file format there, for the whole time I worked there, there were multiple options for extending the format, with templates, additional languages built on top of them, and templates for those additional languages.

Schema is metadata: data describing data. In particular, it describes what the data looks like in general terms: which fields it has, what the format of those fields is, what their valid values are, and so on. Don’t assume that by schema I refer to XML Schemas only! There’s a significant amount of information that is not usually captured by schema description languages (and XML Schemas is only one of them) — things like: does that epoch time represent a date in UTC or in the local timezone?
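
To make that epoch example concrete, here is a tiny Python illustration of the kind of information a schema usually leaves out: the integer itself carries no hint of which timezone it is meant to be displayed in.

```python
from datetime import datetime, timezone

# The same epoch value, interpreted two ways. A schema that only says
# "this field is an integer timestamp" cannot tell you which is intended.
epoch = 1234567890

as_utc = datetime.fromtimestamp(epoch, tz=timezone.utc)
as_naive = datetime.fromtimestamp(epoch)  # implicitly local time

print(as_utc.isoformat())    # carries its offset, unambiguous
print(as_naive.isoformat())  # result depends on the machine's timezone
```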

The reason why I don’t have any particularly strong opinion on data formats, as much as I do on data schemas, is that once you have a working abstracted interface for them, you don’t need to care what the format is. This is clearly easier said than done, of course. DOM and SAX are complicated enough, and the latter is so specific to XML that there is practically zero hope of reusing a design depending on it for anything but XML. And you may have a preference for one format over another for other reasons.

For example, if your configuration is stored in XML, the SAX API allows you to parse the configuration file and fill in a structure in a single pass, which may be more memory-efficient than parsing the file into key/value pairs and requesting them by string. I did something like that with other file types through Ragel, but let’s be honest: in most cases, configuration file parsing speed is not your bottleneck (except when it is, in which case you probably know how to handle that already).
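
As a sketch of that single-pass approach (the configuration format and field names here are invented for illustration), Python’s stdlib SAX interface would look something like this:

```python
import io
import xml.sax

# Hypothetical configuration file; element and attribute names are made up.
CONFIG = b"""<config>
  <listen port="8080"/>
  <logfile>/var/log/app.log</logfile>
</config>"""

class ConfigHandler(xml.sax.ContentHandler):
    """Fills a plain dict in a single pass, with no intermediate tree."""
    def __init__(self):
        super().__init__()
        self.settings = {}
        self._current = None

    def startElement(self, name, attrs):
        self._current = name
        if name == "listen":
            self.settings["port"] = int(attrs["port"])

    def characters(self, content):
        if self._current == "logfile" and content.strip():
            self.settings["logfile"] = content.strip()

    def endElement(self, name):
        self._current = None

handler = ConfigHandler()
xml.sax.parse(io.BytesIO(CONFIG), handler)
print(handler.settings)  # {'port': 8080, 'logfile': '/var/log/app.log'}
```

The whole document is never held in memory at once; only the resulting structure is.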

The big problem for me with choosing a schema is that unless you have an easy way to extend it, you’ll find yourself stuck at some point. Just look at the number of specifications around for complex file formats such as pcapng. Or think of the various revisions of RFCs just for HTTP/1.1 (without even considering the whole upgrade to HTTP/2 and later). Committing to a schema is scary, because if you get it wrong, you’re likely going to be stuck with it for a long while, or you end up with the compatibility issues of changing the format every other release of whatever tool uses it.

This is not far from what happens with office formats as well. If you look at the various formats used by Microsoft Word, they seem to change for each of the early releases, but then kind-of settled down by the time Word 97 came along, before standardizing on the OOXML format. And even in the open source world, OpenDocument took quite a while before being stable enough to be usable, but is now fairly stable.

I wish I had an answer to give everyone about how to handle schemas and updates to them. Unfortunately, I don’t. In the bubble, the answer is not to worry too much about the schema as long as your protocol buffer definitions are valid, because the monorepo will enforce their usage. That’s a bit of a misconception as well, since even with a single format and a nearly-globally enforced schema there can be changes that need to be propagated first, but it does solve a lot of problems.

I had thought about schemas before, and I’m thinking of them again, in the context of glucometerutils because I would really like to have an easy way to export the data dumped from various meters into a format that can be processed with different software. This way, I only need to care about building the protocol tooling, and leave it to someone else who has better ideas about visualisation and analysis to build tools for that part.

My best guess right now about that is to just keep a tool that can upgrade the downloaded format from one version to the next — and make sure that there’s a library in-between the exporter and the consumer, so that as long as they both have the same library version, there’s no need to keep the two tools in sync. But I have not really written any code for that.
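
The upgrade-chain idea can be sketched roughly like this; the versions and field names are entirely made up, and this is not glucometerutils code:

```python
# Sketch of a version-upgrade chain for a dumped-data format.
# Each function knows only how to go one step forward.

def upgrade_1_to_2(doc):
    # Hypothetical v2: rename "glucose" to "value", add an explicit unit.
    doc = dict(doc, version=2)
    doc["value"] = doc.pop("glucose")
    doc["unit"] = "mg/dL"
    return doc

def upgrade_2_to_3(doc):
    # Hypothetical v3: wrap the reading in a list, allowing many per dump.
    reading = {k: v for k, v in doc.items() if k != "version"}
    return {"version": 3, "readings": [reading]}

UPGRADES = {1: upgrade_1_to_2, 2: upgrade_2_to_3}
LATEST = 3

def upgrade(doc):
    """Apply one-step upgrades until the document reaches the latest version."""
    while doc["version"] < LATEST:
        doc = UPGRADES[doc["version"]](doc)
    return doc

print(upgrade({"version": 1, "glucose": 105}))
```

The library sitting between exporter and consumer would just call `upgrade()` on anything it loads, so old dumps keep working.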

Again on libvirt’s XML configuration files

So, Daniel Veillard – among other things the maintainer of libvirt – dared me to give further voice to my concerns about libvirt’s configuration files being “aXML, almost XML”, since, and I quote him here:

I’m also on the XML standard group at W3C and the main author of libxml2, I would have no troubles debunking your argument very quickly

I guess I’ll then have to cross-post this entry to the libvir-list to make sure that I’m not, to paraphrase him, “hiding” on my blog (do I hide on a blog with at least 15K visitors a month, with no comment moderation, and syndicated on the homepage of Gentoo’s website?).

First of all, I’m not questioning Daniel’s technical expertise; I’m pretty sure he generally knows what he’s doing. But his being the author of libxml2, or being part of W3C’s standard group, does not mean that he’s perfect. Nor does it mean that somebody should refrain from commenting on his ideas. This really reminds me of what people said about Ryan when I criticised his idea and just wanted me to shut up because he did a good job at porting games before. My original title for this post was “Being part of groups or having written libraries does not make you infallible”, but it wasn’t catchy enough.

So, this is the preamble for my blog’s readers only, since it’s definitely not relevant to the libvirt project. The rest is for the list as well.


In a recent post on my blog I ranted about libvirt, and in particular I complained that the configuration files look like what I call “almost XML”. The reasons why I say that are multiple; let me try to explain some.

In the configuration files, at least those created by virt-manager, there is no specification of what the file should be (no document type, no namespace, and, IMHO, a too-generic root element name); given that some kind of distinction is needed for software like Emacs’s nxml-mode to know how to deal with the file, I think that’s pretty bad for interaction between different applications. While libvirt knows perfectly well what it’s dealing with, other packages might not. It might not sound like a major issue, but it starts tickling my senses when this happens.
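
To make the point concrete, here’s a tiny Python sketch of what a generic XML consumer can and cannot tell from the root element. The namespace URI is invented purely for illustration; libvirt declares none, which is exactly the complaint:

```python
import xml.etree.ElementTree as ET

# A libvirt-style document with no namespace, and a hypothetical namespaced
# variant; only the second identifies the format on its own.
ANONYMOUS = '<domain type="kvm"><name>test</name></domain>'
NAMESPACED = ('<domain xmlns="http://example.org/libvirt-ns" type="kvm">'
              '<name>test</name></domain>')

def root_identity(doc):
    # ElementTree reports namespaced tags in Clark notation: {uri}localname
    return ET.fromstring(doc).tag

print(root_identity(ANONYMOUS))   # just "domain" -- could be anything
print(root_identity(NAMESPACED))  # the tag carries its namespace
```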

The configuration seems somewhat contrived in places, like the disk configuration: if the disk is file-backed it requires the file attribute on the <source> element, while it needs the dev attribute if it’s a block device; given that it’s a path in both cases, it would have been easier on the user if a single path attribute were used. But this is debatable.
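
As a sketch of what a consumer ends up doing with that design, here is how one might normalize the two cases in Python (element shapes modeled on the description above, paths invented):

```python
import xml.etree.ElementTree as ET

# Two libvirt-style disk descriptions: file-backed and block-device-backed.
FILE_DISK = '<disk type="file"><source file="/var/lib/images/vm.img"/></disk>'
BLOCK_DISK = '<disk type="block"><source dev="/dev/vg0/vm"/></disk>'

def source_path(disk_xml):
    """Return the backing path regardless of which attribute carries it."""
    source = ET.fromstring(disk_xml).find("source")
    return source.get("file") or source.get("dev")

print(source_path(FILE_DISK))   # /var/lib/images/vm.img
print(source_path(BLOCK_DISK))  # /dev/vg0/vm
```

A single path attribute would make this helper unnecessary.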

The third problem I called out in the blog post is the lack of a schema for the files; Daniel corrected me, pointing out that the schemas are distributed with the sources and installed. Sure thing, I was wrong. On the other hand, I maintain that there are problems with those schemas. The first is that both the version distributed with 0.7.4 and the git version as of today suffer from bug #546254 (secret.rng being not well formed), which means nobody has even tested them lately; then there is the fact that they are never referenced by the human-readable documentation, which is why I didn’t find them the first time around; add to that some contrived syntax in those schemas that causes trang to produce a non-valid rnc file out of them (nxml-mode uses rnc rather than rng).

But I guess the one big problem with the schemas is that they don’t seem to properly encode what the human-readable documentation says, or what virt-manager does. For instance (please follow me with selector-like syntax), virt-manager creates /domain/os/type[@machine='pc-0.11'] in the generated XML; the same attribute seems to be documented: “There are also two optional attributes, arch specifying the CPU architecture to virtualization, and machine referring to the machine type”. The schema does not seem to accept that attribute though (“element type: Relax-NG validity error : Invalid attribute machine for element type” with xmllint, just to make sure that it’s not a bug in any other piece of software, this is Daniel’s libxml2).

Now, after voicing my opinions here, as Daniel dared me to do, I’d like to explain for a second why I didn’t post this on the list in the first place: of what I wrote here, my beefs for calling this aXML, the only thing that can be solved easily is the schemas; schemas that, at the time I wrote the blog post, I was unable to find. The syntax, and the lack of a “safe” identification of the files as libvirt’s, are the kind of legacy problems one has to deal with to avoid wasting users’ time with migrations and corrections, so I don’t really think they should be addressed unless a redesign of the configuration is intended.

Just my two cents; you’re free to take them as you wish. I cannot boast a curriculum like Daniel’s, but I don’t think I’m stepping out of place by pointing out these things.

The status of some deep roots

While there are quite a few packages that are known to be rotting in the tree, and thus are now being pruned away step by step, there are some more interesting facets to the status of Gentoo as a distribution nowadays.

While the more interesting and “experimental” areas seem to have enough people working on them (Ruby to a point, Python more or less, KDE 4, …), there are quite a few deeper areas that are just left to rot as well, but cannot really be pruned away. This includes for instance Perl (for which we’re lagging behind a lot, mostly due to the fact that tove is left alone maintaining that huge piece of software), and SGML, which in turn includes all the DocBook support.

I’d like to focus for a second on that latter part, because I am partly involved in it; I like using DocBook, and I actually use the stylesheets to produce the online version of Autotools Mythbuster using the packages that are available in Portage. Now, when I wanted to make use of DocBook 5, the stylesheets for the namespaced version (very useful to write with Emacs and nxml) weren’t available, so I added them, adding support for them to the build-docbook-catalog script. With time, I ended up maintaining the ebuilds for both versions of the stylesheets, and that hasn’t always been the cleanest thing, given that upstream dropped the tests entirely in the newer versions (well, technically they are still there, but they don’t work; it seems like they lack some extra stuff that is nowhere documented).

Now, I was doing quite well with this; I just requested stable for the new ebuilds of the stylesheets (both variants) and I could have kept just doing that, but… yesterday I noticed that the list of examples in my guide had broken links, and after mistakenly opening a bug on the upstream tracker, I noticed that the bug is fixed already in the latest version. Which made me smell something: why did nobody complain that the old stylesheets were broken? Looking at the list of bugs for the SGML team, you can see that lots of stuff was actually ignored for way too long a time. I tried cleaning up some stuff, duping bugs that were obviously the same, and fixing one in the b-d-c script, but this is one of the internal roots that is rotting, and we need help to save it.

For those interested in helping out, I have taken note of a few things that should probably be done with medium urgency:

  • make sure that all the DTDs are available in the latest release, and that they are still available upstream; I had to seed an old distfile today because upstream dropped it;
  • try to find a way to install the DocBook 5 schemas properly; right now the nxml-docbook5-schemas package installs its own copy of the Relax-NG Compact file; on Fedora 11, there is a package that installs more data about DocBook 5, and we should probably use the same original sources; the nxml-docbook5-schemas package could then either be merged into that package or simply use the already-installed copy;
  • replace b-d-c, making it both more generic and based on a framework that already exists (like eselect) instead of reinventing the wheel; the XML/DTD catalog can easily be used for more than just DocBook. While I know the Gentoo documentation team does not want the Gentoo DTD to be available just as a package to install on the system (which would make it much easier to keep updated for the nxml schemas, but sigh), I would love to be able to make fsws available that way (once I finish building the official schema for it and publish it; more on that in the future);
  • find out how one should be testing the DocBook XSL stylesheets, so that we can run tests for them; it would have probably avoided the problem I had with Autotools Mythbuster in the past months;
  • package the stylesheets for Xalan and Saxon, which are different from the standard ones; b-d-c already has support for them to a point (although not having to spell out this kind of thing in the b-d-c replacement is desirable), but I didn’t have a reason to add them.

I don’t think I’ll have much time to work on these in the future, so user contributions are certainly welcome; if you do open any bug for these issues, please CC me directly, since I don’t intend (yet) to add myself to the sgml alias.

Yes, again more static websites

You might remember I like static websites and that I’ve been working on a static website framework based on XML and XSLT.

Out of necessity, I’ve added support to that framework for multi-language websites; this is both because people asked for my website to be translated into Italian (since my assistance customers don’t usually know English, not that well at least), and because I’ll soon be working on the website for a metal band that is to be available in both languages too.

Now, making this work in the framework wasn’t an easy job: as it is now, there is a single actual XML document that the stylesheet, with all its helper templates, gets applied to. It already applied a two-pass translation, so that custom elements (like the ones that I use for the projects’ page of my site – yes, I know it gets stuck when loading) are processed properly and translated into fsws proper elements.

To make this work I then applied a similar method (although now I start to feel like I did it in the wrong order): I create a temporary document, keeping only the elements that have no xml:lang attribute or have the proper language in it, once for each language the website is written in. Then I apply the rest of the processing to this data.
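
The filtering pass can be approximated in Python’s stdlib (my actual implementation is XSLT; this is just to show the idea, with a made-up page document):

```python
import copy
import xml.etree.ElementTree as ET

# xml:lang lives in the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SOURCE = """<page>
  <title xml:lang="en">Home</title>
  <title xml:lang="it">Pagina iniziale</title>
  <body>shared content</body>
</page>"""

def filter_language(root, lang):
    """Keep elements with no xml:lang, or with the requested language."""
    root = copy.deepcopy(root)
    for parent in root.iter():
        for child in list(parent):
            child_lang = child.get(XML_LANG)
            if child_lang is not None and child_lang != lang:
                parent.remove(child)
    return root

tree = ET.fromstring(SOURCE)
italian = filter_language(tree, "it")
print([el.text for el in italian.iter("title")])  # ['Pagina iniziale']
```

Running it once per language gives one intermediate document per language, which the rest of the pipeline can then process identically.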

Since all the actual XHTML translation happens in the final pass, this pass becomes almost transparent to the rest of the processing; at the same time, pages like the articles index can share the whole list of articles between the two versions, since I just change the section element of the intro instead of creating two separate page descriptions.

Now, I’ll be opening up fsws one day, once this is all sorted out, described, and so on; for now I’m afraid it’s still too much in flux to be useful (I haven’t written a schema of any kind just yet, and I want to do that soon so I can even validate my own websites). For now, though, I can share the code I’m currently using to handle the translation of the site. As usual, I don’t rely on any kind of dynamic web application to serve the content (which the framework generates in static form), but rather on Apache’s mod_negotiation and mod_rewrite (which ship with the standard distribution).

This is the actual configuration that vanguard is using to do the serving:

AddLanguage en .en
AddLanguage it .it

DefaultLanguage en
LanguagePriority en it
ForceLanguagePriority Fallback

RewriteEngine On

RewriteRule ^(/[a-z]{2})?/$     $1/home [R=permanent]

RewriteRule ^/([a-z]{2})/(.+)$ /$2.$1

(I actually have a few more rules in that configuration file but that’s beside the point now).
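
To make clear what those two rules do, here is a rough Python re-enactment of the mapping they produce (Apache’s actual processing is more involved, and this ignores the permanent-redirect status):

```python
import re

def rewrite(path):
    # ^(/[a-z]{2})?/$  ->  $1/home   (send the root, with or without a
    # language prefix, to the homepage)
    m = re.match(r"^(/[a-z]{2})?/$", path)
    if m:
        return (m.group(1) or "") + "/home"
    # ^/([a-z]{2})/(.+)$  ->  /$2.$1  (turn the language prefix into the
    # file suffix that MultiViews resolves)
    m = re.match(r"^/([a-z]{2})/(.+)$", path)
    if m:
        return "/%s.%s" % (m.group(2), m.group(1))
    return path

print(rewrite("/"))          # /home
print(rewrite("/it/"))       # /it/home
print(rewrite("/it/about"))  # /about.it
```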

Of course this also requires that the MultiViews option is enabled, since that’s what makes Apache pick up the correct file without having map files around. Since the files are all named like home.en.xhtml, requesting the explicit language as a suffix allows Apache to just pick up the correct file, without having to mess with extra configuration.

Right now there are a few more things that I have to work on; for instance, the language selection at the top should really bring you to the other language’s version of the same page, rather than to the homepage. Also, it only works fine on a single-language site if you never use xml:lang; I should special-case that. For this to work I have to add a little more code to the framework, but it should be feasible in the next weeks. Then there are some extra features I haven’t even started implementing, just planned: an overlay-based photo gallery, and some calendar management for ICS and other things like that.

Okay, this should be it for the teasing about fsws; I really have to find time to set up a repository for, and release, my antispam rules, but that will have to wait for next week, I guess.

More XSL translated websites

I have written before that, over CMS- or wiki-based websites, I prefer static websites, and that with a bit of magic with XSL and XML you can get results that look damn cool. I also worked on the new xine site, which is entirely static and generated from XML sources with libxslt.

When I wrote the xine website, I also reused some of the knowledge from my own website, even though the two of them are pretty different in many aspects: my website used one XML file per page, with an index page and a series of extra stylesheets that convert some even higher-level structures into the mid-level blocks that then translate to XHTML; the xine website used a single XML file with XInclude to merge in many fragments, with one single document for everything, similarly to what DocBook does.

Using the same framework, but made a bit more generic, I wrote the XSL framework (that I called locally “Flameeyes’s Static Website”, or fsws for short) that is generating the website of a friend of mine, an independent movie director (which is hosted on vanguard too). I have chosen to go down this road because he needed something cheap, and he didn’t care much about interaction (there’s Facebook for that, mostly). In this framework I implemented some transformation code that implements part of the flickr REST API, and also a shorthand to blend in YouTube videos.

Now, I’m extending the same framework, keeping it abstract from the actual site usage and allowing different options for setting up the pages, to rewrite my own website with a cleaner structure. Unfortunately it’s not as easy as I thought: while my original framework is extensible enough, and I was able to add enough of my previous stylesheets’ fragments into it without changing it all over, there are a few things that I could probably share again between different sites without needing to recreate them each time, but they require me to make extensive changes.

I hope that once I’m done with the extension, I’ll be able to publish fsws as a standard framework for the creation of static websites; for now I’m going to extend it just locally, and for a selected number of friends, until I can easily say “Yes, it works” – the first thing I’ll be doing then will be the xine website. But I’m sure that at least this kind of work is going to give me a better understanding of XSLT that I can use for other purposes too.

Oh, and in the meantime I’d like to give credit to Arcsin, whose templates I’ve been using both for my own and others’ sites… I guess I know who I’ll be contacting if I need some specific layout.

The documentation problem

I have always been a fierce supporter of documented code: code without documentation is not much help, because “self-explanatory” code does not really exist; even if you can tell what a single piece of code does, you cannot know for sure how it interacts with the rest of a wider system.

With my recent work on feng I’ve been noticing how important it is to write very good documentation of the side effects of functions, and even more important, to document which mutexes may be acquired during the execution of a function, and similar details. The feng codebase was barely documented at all when I started working on the project; now it would probably be considered quite good: Ohloh reports a 26% comment-to-code ratio. I still think it’s not high enough.

Unfortunately, I have also noticed that Doxygen seems to have quite a few limitations, or I haven’t studied the recent versions well enough. The first problem is that feng is not a standalone project; it uses liberis and netembryo (and used to use bufferpool before its replacement), and their documentation should probably be tightly coupled. Indeed, this is what really bugged me about bufferpool: part of the logic that feng used to work was embedded in bufferpool, but the documentation of one wasn’t linked automatically to the other.

Also, while there is a way to create new commands, it’s not exactly easy or, last I remember, very flexible. This is quite a problem, because it doesn’t help when I have to write, for multiple functions, things like “Unused parameter for compatibility with GFunc interface”, or “Internal function to be called by g_slist_foreach()”, or finally “This function will acquire the Foo::lock mutex”. Consistency needs to be checked manually, and I’m pretty sure the documentation is not consistent right now.

I guess these are some of the reasons why glib and other GNOME-based projects don’t use Doxygen but rather gtk-doc. On the other hand, I sincerely don’t like gtk-doc at all, just like I don’t like the extended formatting allowed by Doxygen: XML-style blocks inside the comments look out of place.

I’m sincerely not sure how to solve any of these problems, to be honest. Maybe some extension to Doxygen, or maybe letting Doxygen produce some special XML files instead of HTML files directly, and then hooking up some XSLT to that. Having a doxygen-style comment like this:

 * @brief Do something
 * @custom acquire-lock Foo::lock

produce a new special element with name acquire-lock and content Foo::lock, and then processing that with XSLT to produce a consistent “This function will acquire the lock Foo::lock”, would probably be a very welcome addition to me.
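
The kind of post-processing I have in mind could be sketched like this (in Python rather than XSLT, with a made-up template table, just to show the idea):

```python
import re

# Hypothetical post-processing pass: turn "@custom <name> <argument>" lines
# in a doc comment into a consistent human-readable sentence, so the wording
# lives in one place instead of being retyped per function.
TEMPLATES = {
    "acquire-lock": "This function will acquire the {0} mutex.",
}

def expand_custom(comment):
    def replace(match):
        name, argument = match.group(1), match.group(2)
        return TEMPLATES[name].format(argument)
    return re.sub(r"@custom\s+(\S+)\s+(\S+)", replace, comment)

comment = "@brief Do something\n@custom acquire-lock Foo::lock"
print(expand_custom(comment))
```

The consistency problem goes away because the sentence is generated, not hand-written, for every function that carries the tag.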

I really have to look into this one day; I admit I haven’t studied the internals and advanced features of Doxygen, especially not in the latest releases. I know it has some kind of XML output, but I’m not sure how flexible that is. But not for now, since I already have enough tasks to take care of and stuff to write. One day, maybe.

Wondering about feeds

You might have noticed in the past months a series of issues with my presence on Planet Gentoo. Sometimes posts didn’t appear for a few days, then there were issues with entries figuratively posted in the future, and a couple of Planet spam incidents really made my posts quite obnoxious to many. I didn’t like it either; it seems like I had some problems with Typo when I moved from lighttpd to Apache, and then there were issues with Planet and its handling of Atom feeds and similar. Now these problems should be solved: Planet has moved to the Venus software, and it now uses the Atom feeds again, which are much more easily updated.

But this is not my topic today; today I wish to write about how you can really mess things up with XML technologies. Yesterday I wanted to prepare a feed for the news on xine’s website so that it could be shown on Ohloh too. Since the idea is to use static content, I wanted to generate the feed, with XSLT, starting from the same data used to generate the news page. Not too difficult actually; I do something similar for my website as well.

But, since my website only needs to sort-of work, while the xine site needs to actually be usable, I decided to validate the generated content using the W3C validator; the results were quite bad. Indeed, the content in an RSS feed needs to be escaped or just plain text; no raw XHTML is allowed.

So I turned to check Atom, which is supposedly better at these things, and is already being used for a lot of other stuff as well. That really looks like XML technology for once, using the things that actually make it work nicely: namespaces. But if I look at my blog’s feed I do see a very complex XML file. I tried giving up on it for a while and went back to RSS, but while the feed is simple around the entries, the entries themselves are quite a bit to deal with, especially since they require the RFC 822 date format, which is not really the nicest thing to deal with (for one, it expects day and month names in English, and it’s far from easily parsed by a machine to translate into a generic date that can be rendered in the feed user’s locale).
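
For comparison, here is what the two date formats look like from Python’s stdlib; the RFC 822 form is what RSS requires, while Atom uses RFC 3339 / ISO 8601 dates:

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

when = datetime(2009, 11, 22, 18, 30, tzinfo=timezone.utc)

# RSS wants the RFC 822 form, with English day and month names baked in.
rss_date = format_datetime(when, usegmt=True)
print(rss_date)   # Sun, 22 Nov 2009 18:30:00 GMT

# Atom uses RFC 3339 / ISO 8601, trivially machine-readable and sortable.
atom_date = when.isoformat()
print(atom_date)  # 2009-11-22T18:30:00+00:00

# Round-tripping the RSS date needs a dedicated parser.
print(parsedate_to_datetime(rss_date) == when)  # True
```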

I reverted to Atom, created a new ebuild for the Atom schema for nxml (which, by the way, fails at allowing auto-completion in XSL files; I need to contact someone about that), and started looking at what is strictly needed. The result is a very clean feed which should work just fine for everybody. The code, as usual, is available in the repository.

As soon as I have time I’ll look into switching my website to also provide an Atom feed rather than an RSS feed. I’m also considering the idea of redirecting the requests for the RSS feed on my blog to Atom, if nobody gives me a good reason to keep RSS. I have already hidden them from the syndication links on the right, which now only present Atom feeds, and those are already the most requested compared to the RSS versions. For the ones who can’t see why I’d like to standardise on a single format: I don’t like redundancy where it’s not needed, and in particular, if there is no practical need to keep both, I can reduce the amount of work done by Typo by hiding the RSS feeds and redirecting them from within Apache rather than letting them hit the application. Considering that Typo creates feeds for each one of the tags, categories and posts (the latter I already hide and redirect to the main feed, since they make no sense to me), it’s a huge number of requests that would be merged.

So if somebody has reasons for which the RSS feeds should be kept around, please speak now. Thanks!

Inconsistent Scalable Vector Graphics

The one job I’m taking care of at the moment involves drawing some stuff using SVG from C code, without using any support libraries. Without going into much detail, since I cannot because of an NDA, I can say the generated file has to be as small as possible since, as you might guess by now, it has to be produced on an embedded system.

The task itself is not too difficult, but today I started the actual reduction of the code so that it fits in the software I have to develop, and here the problems start. The first issue was that I was tired of looking up the correct attributes for each SVG element, so I ended up doing the same thing I did for DocBook 5 and added a new ebuild to Portage: app-emacs/nxml-svg-schemas:1.1, which installs the SVG 1.1 schemas so that Emacs’s nxml-mode can tab-complete the elements and attributes. I positively love Emacs and nXML, since it allows me to get support for specific XML variants by just adding their schemas to the system!

A little note about nXML though: I’ll have to contact upstream, because I found one nasty limitation: I cannot make it locate the correct schemas on a version basis, which means I won’t be able to provide SVG 1.2 schemas alongside 1.1 with the code as it is; if I can get a new locating-rules schema that can also detect the correct schema to use through the version, that’s going to solve not only SVG 1.2 but also future DocBook versions. So this enters my TODO list. Also, am I the only one using nXML in Gentoo? I’m maintaining all three schema ebuilds; it’s not like it’s a big hassle, but I wonder what would happen if I were to leave Gentoo — or, more likely at this point, if I were to end up in the hospital again; I hope I’m fine now, but one is never sure, and my mindset is pretty pessimistic nowadays.

At any rate, I’ve been studying the SVG specifications to find a way to reduce the useless data in the generated file, without burdening the software with manual calculations. The easy way out is to use path and polyline elements to draw most of the lines in the file, which would be fine if it weren’t that they only accept coordinates in “pixels” (which are not actual pixels, but just the basic unit of the SVG file itself). This is not too bad, since you can define a new viewport which can have an arbitrary size in “pixels” and is stretched over the area. The problem is with support for the extra viewports.
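
The nested-viewport trick looks roughly like this (sizes and coordinates invented; generated and re-parsed with Python’s stdlib just to show the structure):

```python
import xml.etree.ElementTree as ET

# A nested <svg> establishes a new viewport; its viewBox remaps the unit
# space, so the polyline coordinates inside it are arbitrary units that
# get stretched over the declared 8x8cm area.
svg = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="10cm" height="10cm">'
    '<svg x="1cm" y="1cm" width="8cm" height="8cm" viewBox="0 0 100 100">'
    '<polyline points="0,100 25,40 50,70 75,10 100,30"'
    ' fill="none" stroke="black"/>'
    '</svg></svg>'
)

root = ET.fromstring(svg)
inner = root.find("{http://www.w3.org/2000/svg}svg")
print(inner.get("viewBox"))  # 0 0 100 100
```

Inside the inner viewport the C code can emit small integer coordinates directly, without converting them to real-world units itself; it is exactly this construct that the implementations below disagree on.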

The target is for the generated file to work on as many systems as possible, but it’s a requirement that it works on Windows with Internet Explorer, as well as Firefox. For SVG files under Internet Explorer there is the old, unmaintained and deprecated Adobe SVG plugin (which is still the default Internet Explorer will try to install) and the Examotion Renesis Player, which is still maintained. So I take out my test file and try it.

I wrote the file testing it with eog (which I’m not sure which SVG library it uses for the rendering) and with rsvg, which obviously uses librsvg; with those, my test file was perfect. The problem was with other software, since I got the following results:

  • Inkscape wouldn’t load it properly at all and just drew crazy stuff;
  • batik 1.6 worked;
  • Firefox, Safari and Opera showed me grey and red rectangles rather than the actual lines I wrote in the SVG;
  • Renesis Player showed me lines, but way too thick for what I wanted;
  • OpenOffice showed it with the right dimensions but didn’t translate it 2×2 cm down from the upper left corner like I instructed the SVG to.

After reporting the issue on Examotion’s tracker, since that is the most important failure in that list for my current requirements, I got a suggestion to switch the definition of font-size to a direct attribute rather than through style, so as to change the actual SVG measure unit. This made no difference for the three implementations that worked before, nor for Examotion, but it actually got me one step closer to the wished-for result:

  • Inkscape still has problems: the white rectangle I draw to get a solid white background is positioned over the rest of the elements, rather than under them like I’d expect, since it’s the first element in the file; it also does not extend the grid like it should, so the viewBox attribute is not properly handled;
  • OpenOffice still has the problem with translation, but seems fine for the rest;
  • Safari still has the same problems;
  • Opera 9.6 on Windows finally renders it perfectly, but fails under Ubuntu (?!);
  • Firefox official builds for Windows and OSX, as well as under Ubuntu, work fine; under Gentoo, it does not, and still shows the rectangles;
  • the Adobe SVG plugin works fine.

At this point I should have enough working implementations to proceed with my task, but this actually made me think about the whole thing with SVG, and it reminded me tremendously of the OASIS OpenDocument smoke I had a fight with more than three years ago. I very much like XML-based technologies for the sake of interoperation, but it’d be nice if the implementations actually had a way to produce a proper result.

Like in OpenDocument, where the specifications allow two different styles for lists and different software implements just one of them, making themselves incompatible with each other, SVG defines some particular features that are not really understood or used by some implementations, or that can create compatibility issues between implementations.

In this case, it seems like my problem is the way I use SVG subdocuments to establish new viewports, and then use the viewBox feature to change their unit space. This is perfectly acceptable and well described by the specification, but it seems to cause further issues down the line with the measure units inside and outside these subdocuments. The problem is not just one-way, though: from this other bug report on Inkscape you can also see that Inkscape does not generate SVG as pure as it should.
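To make it concrete, this is roughly the construct I mean: a nested svg element that establishes a new viewport, with viewBox remapping its unit space (the dimensions are made up for the example):

```xml
<svg xmlns="http://www.w3.org/2000/svg" width="10cm" height="8cm">
  <!-- nested subdocument: a new viewport whose local coordinates
       are remapped by viewBox to 100×80 user units -->
  <svg x="1cm" y="1cm" width="8cm" height="6cm" viewBox="0 0 100 80">
    <!-- inside here, a width of "100" spans the full 8cm -->
    <rect x="0" y="0" width="100" height="80" fill="white"/>
  </svg>
</svg>
```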

While XML has been properly designed to be extensible, thanks to features like namespaces, one would expect the features provided by a given format to be used before creating your own extensions. In this case, from that bug report you can see (and I indeed double-checked that it is still the case) that Inkscape does not use SVG’s own features to establish a correspondence between “SVG pixels” and the real-world size of the image; instead, it adds two new attributes to the main document, inkscape:export-xdpi and inkscape:export-ydpi, while SVG expects you to use the viewBox to provide that information.
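As a sketch of the difference (the numbers are illustrative; 90 dpi was Inkscape’s traditional default resolution):

```xml
<!-- the SVG way: physical size on the root, mapped to user units -->
<svg xmlns="http://www.w3.org/2000/svg"
     width="4in" height="3in" viewBox="0 0 360 270">
  <!-- ... -->
</svg>

<!-- Inkscape's extension: resolution stored in namespaced attributes -->
<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
     inkscape:export-xdpi="90" inkscape:export-ydpi="90">
  <!-- ... -->
</svg>
```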

Sigh, I just wished to get my graph working.

DocBook 5 and Gentoo

After my post about DocBook 5, which I wrote from the hospital with no connection to my Yamato to test most of what it was based on, today I started looking at working with DocBook 5 here, too.

This is mostly so I can resume working on the article I had to put on hold for surgery, but it’ll also work out fine as a way to start looking into what I was thinking of.

The first problem was to get good old Emacs to work with DocBook 5 files. Even the latest CVS version will still only support the old 4.x series “schemas”, and thus won’t support the new namespace-aware DocBook 5. Luckily, I use Aquamacs on the laptop too, so I had already looked up how to support it. To make it more streamlined, for everybody, me included, I decided to take it a step further.

If you’re an Emacs user on Gentoo and want to work with DocBook 5, just install app-emacs/nxml-docbook5-schemas: it will install the RELAX NG compact schema for DocBook 5 with support for XInclude (hurray for XInclude), and set up nxml so that it’s loaded properly. Just restart Emacs (or find a way to make it reload the nxml cache) and you’re set. Please note that nxml-mode decides which schema to use when you open the file in the buffer, so if you’re converting an old DocBook file to the new version, you’ll have to close the buffer and reopen the file.
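For a quick sanity check that the right schema kicks in, a minimal namespace-aware DocBook 5 document looks something like this (the content is of course just a placeholder):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<article xmlns="http://docbook.org/ns/docbook"
         xmlns:xi="http://www.w3.org/2001/XInclude"
         version="5.0">
  <title>A minimal DocBook 5 test</title>
  <para>Namespace-aware, no DTD required.</para>
</article>
```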

Unfortunately, xmlto does not seem to support DocBook 5 just yet, and it insists on looking for a DTD to validate the content. Even the latest version, which I bumped in the tree, fails to generate content out of DocBook 5. So I settled on xsltproc as good enough for me, and set out to make it possible to install the stylesheets as needed.

I’ve bumped app-text/docbook-xsl-stylesheets to the latest version and then created a new package, app-text/docbook-xsl-ns-stylesheets, which installs the new namespace-aware stylesheets. Unfortunately, for this to work I also had to update the build-docbook-catalog script that agriffis used to maintain… last touched in 2004. I suppose this is one of the things that should be much better documented as to “Why? How? Who?”: there are instructions on how to release it, but no documentation on why it is needed, how it was implemented, and so on.

The script is very much abandoned and had quite a few hacks in it, one of which I removed because it no longer made any sense: it redirected some old versions of the stylesheets to the local copy, but not the most recent ones (such old versions don’t even exist for xsl-ns, for instance, so duplicating the redirections makes no sense). Part of the reason why it was executed at every install/removal is not even relevant nowadays: stylesheets used to be installed with their version number, as is done for DTDs, but this is no longer the case, as they are not slotted. It should well be possible to reduce the needed code to a simple function in the ebuild itself, rather than relying on such a bit-rotting script.

At any rate, I future-proofed it a notch by making it possible to support xsl-xalan and xsl-saxon just by creating the proper ebuilds in the tree (I’m not interested in those yet, so I didn’t create them).

I suppose one interesting thing would be to install Gentoo’s DTDs and Gentoo’s stylesheets through ebuilds too, but it’s unlikely to be easy, or very useful as things stand, so let’s not go there yet.

So anyway, I hope what I did today is going to be helpful to others using DocBook 5 with Gentoo; if it was, appreciation tokens are quite appreciated, especially at convalescence time ;)