It’s all in the schema

When it comes to file formats, I’m old school, and I probably prefer XML over something like YAML. I also had multiple discussions with people over the years, that could be summarised in “If only we used $format, we wouldn’t have these problems” — despite the problems being fairly clearly a matter of schema rather than format.

These discussions didn’t even stop in the bubble, since while protocol buffers are the de-facto file format, for the whole time I worked there, there had been multiple option for expanding the format, with templates, additional languages built on top of them, and templates for those additional languages.

Schema is metadata: data describing data. And in particular, it describe how the data looks like in general terms: which fields it has, what’s the format of the fields, and what are their valid values, and so on. Don’t assume that with schema I refer to XML Schemas only! There’s a significant amount of information that is not usually captured by a schema description languages (and XML Schemas is only one of them) — things like does that epoch time represent a date in UTC or local timezone?

The reason why I don’t have have any particularly strong opinion on data formats as much as I do data schemas is that once you have a working abstracted interface for them, you don’t need to care what the format is. This is clearly easier said than done, of course. DOM and SAX are complicated enough, and the latter is so specific to XML that there is practical zero hope to reuse a design depending on it for anything but XML. And you may have a preference of one format over another for other reasons.

For example, if your configuration was stored in XML, the SAX API allows you to parse the configuration file and fill in a structure in a single pass, which may be more memory-efficient than parsing the files into key/value pairs and requesting them by string. I did something like that with other file types through Ragel, but let’s be honest, in most cases, configuration file parsing speed is not your bottleneck (except if it is and in that case you probably know how to handle that already).

The big problem for me with choosing a schema is that unless you have an easy way to expand it, you’ll find yourself stuck at some point. Just look at the amount of specifications around for complex file formats such as pcapng. Or think of the various revisions of RFCs just for HTTP/1.1 (without considering the whole upgrade to HTTP/2 and later). Committing to a schema is scary, because if you get it wrong, you’re likely going to be stuck for a long while, or you end up with the compatibility issue of changing the format every other release of whatever tool uses the format.

This is not far from what happens with office formats as well. If you look at the various formats used by Microsoft Word, they seems to change for each of the early releases, but then kind-of settled down by the time Word 97 came along, before standardizing on the OOXML format. And even in the open source world, OpenDocument took quite a while before being stable enough to be usable, but is now fairly stable.

I wish I now had an answer to give everyone about how to handle schemas and updates to them. Unfortunately, I don’t. In the bubble, the answer is not to worry too hard about the schema as long as your protocol buffer definitions are valid, because the monorepo will enforce their usage. It’s a bit of a misconception as well, since even with a single format and a nearly-globally enforced schema there can be changes that need to be propagated first, but it solves a lot of problems.

I had thought about schemas before, and I’m thinking of them again, in the context of glucometerutils because I would really like to have an easy way to export the data dumped from various meters into a format that can be processed with different software. This way, I only need to care about building the protocol tooling, and leave it to someone else who has better ideas about visualisation and analysis to build tools for that part.

My best guess right now about that is to just keep a tool that can upgrade the downloaded format from one version to the next — and make sure that there’s a library in-between the exporter and the consumer, so that as long as they both have the same library version, there’s no need to keep the two tools in sync. But I have not really written any code for that.

Again on libvirt’s XML configuration files

So, Daniel Velliard – among the other things the maintainer of libvirt – dared me to give further voice to my concerns about libvirt’s configuration files being “aXML, almost XML”, given, and I quote him here:

I’m also on the XML standard group at W3C and the main author of libxml2, I would have no troubles debunking your argument very quickly

I guess I’ll then have to cross-post this entry to the libvir-list to make sure that I’m not, to paraphrase him, “hiding” on my blog (do I hide on a blog with at least 15K visitors a month, with no comment moderation, and syndicated on the homepage of Gentoo’s website ?).

First of all, I’m not really undermining Daniel’s technical preparation; I’m pretty sure he generally knows what he’s doing. But his being the author of libxml2 or being part of W3C’s standard group does not mean that he’s perfect. Nor it means that somebody should refrain from commenting on his ideas. This really reminds me of what people said about Ryan when I criticised his idea and just wanted me to shut up because he did a good job at porting games before. My original title for this post was reading “Being part of groups or having written libraries does not make you infallible” but it wasn’t catchy enough.

So, this is the preamble for my blog’s readers only, since it’s definitely not relevant to the libvirt project. The rest is for the list as well.

Hello,

In a recent post on my blog I ranted on about libvirt and in particular I complained that the configuration files look like what I call “almost XML”. The reasons why I say that are multiple, let me try to explain some.

In the configuration files, at least those created by virt-manager there is no specification of what the file should be (no document type, no namespace, and, IMHO, a too generic root element name); given that some kind of distinction is needed for software like Emacs’s nxml-mode to know how to deal with the file, I think that’s pretty bad for interaction between different applications. While libvirt knows perfectly well what it’s dealing with, other packages might not. Might not sound a major issue but it starts tickling my senses when this happens.

The configuration seem somewhat contrived in places like the disk configuration: if the disk is file-backed it require the file attribute to the <source> element, while it needs the dev attribute if it’s a block device; given that it’s a path in both cases it would have sounded easier on the user if a single path attribute was used. But this is opinable.

The third problem I called out for in the block is a lack of a schema for the files; Daniel corrected me pointing out that the schemas are distributed with the sources and installed. Sure thing, I was wrong. On the other hand I maintain that there are problems with those schemas. The first is that both the version distributed with 0.7.4 and the git version as of today suffer from bug #546254 (secret.rng being not well formed) so it means nobody has even tested them as of lately; then there is the fact that they are never referenced by the human-readable documentation which is why I didn’t find it the first time around; add also to that some contrived syntax in those schema as well that causes trang to produce a non-valid rnc file out of them (nxml-mode uses rnc rather than rng).

But I guess the one big problem with the schemas is that they don’t seem to properly encode what the human-readable documentation says, or what virt-manager does. For instance (please follow me with selector-like syntax), virt-manager creates /domain/os/type[@machine='pc-0.11'] in the created XML; the same attribute seem to be documented: “There are also two optional attributes, arch specifying the CPU architecture to virtualization, and machine referring to the machine type”. The schema does not seem to accept that attribute though (“element type: Relax-NG validity error : Invalid attribute machine for element type” with xmllint, just to make sure that it’s not a bug in any other piece of software, this is Daniel’s libxml2).

Now after voicing my opinions here, as Daniel dared me to do, I’d like to explain a second why I didn’t post this on the list in the first place: of what I wrote here, my beefs for calling this aXML, the only things that can be solved easily are the schemas; schemas that, at the time I wrote the blog, I was unable to find. The syntax, and the lack of a “safe” identification of the files as libvirt’s are the kind of legacy problems one has to deal with to avoid wasting users’ time with migrations and corrections, so I don’t really think they should be addressed unless a redesign of the configuration is intended.

Just my two cents, you’re free to take them as you wish, I cannot boast a curriculum like Daniel’s, but I don’t think I’m stepping out of place to point out these things.