It’s all in the schema

When it comes to file formats, I’m old school, and I probably prefer XML over something like YAML. I’ve also had multiple discussions with people over the years that could be summarised as “If only we used $format, we wouldn’t have these problems”, despite the problems being fairly clearly a matter of schema rather than format.

These discussions didn’t even stop in the bubble: while protocol buffers are the de-facto file format there, for the whole time I worked there, there had been multiple options for expanding the format, with templates, additional languages built on top of them, and templates for those additional languages.

Schema is metadata: data describing data. In particular, it describes what the data looks like in general terms: which fields it has, what the format of the fields is, what their valid values are, and so on. Don’t assume that with schema I refer to XML Schemas only! There’s a significant amount of information that is not usually captured by schema description languages (and XML Schema is only one of them): things like, does that epoch time represent a date in UTC or in the local timezone?
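To make that concrete, here is a minimal Python sketch of a schema that lives outside any particular file format. The `Reading` type, its field names, and the valid range are all hypothetical, made up for illustration. The point is only that the "is this UTC?" question gets answered by the schema itself, in the field name and in the code, rather than left to the reader.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Reading:
    """Hypothetical record type; the names and ranges are assumptions."""

    # Seconds since the Unix epoch. The schema states it is UTC, so nobody
    # has to guess whether a device stored local time instead.
    timestamp_utc: int
    # The unit is part of the schema too, not of the file format.
    value_mgdl: float

    def __post_init__(self):
        # Valid values are schema information as well.
        if self.timestamp_utc < 0:
            raise ValueError("timestamp_utc must not be negative")
        if not 0 < self.value_mgdl < 1000:
            raise ValueError("value_mgdl out of the assumed valid range")

    def when(self) -> datetime:
        # Always returns a timezone-aware datetime in UTC.
        return datetime.fromtimestamp(self.timestamp_utc, tz=timezone.utc)
```

Nothing here says whether a `Reading` ends up in XML, YAML, or a protocol buffer, and that is rather the point.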

The reason why I don’t have any particularly strong opinion on data formats, as much as I do on data schemas, is that once you have a working abstracted interface for them, you don’t need to care what the format is. This is clearly easier said than done, of course. DOM and SAX are complicated enough, and the latter is so specific to XML that there is practically zero hope of reusing a design depending on it for anything but XML. And you may have a preference for one format over another for other reasons.
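As a rough sketch of what such an abstracted interface could look like, assume a hypothetical `Settings` structure with two fields, and one small loader per format. All the names below are made up for the example; only `json` and `xml.etree.ElementTree` are real standard library modules.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Settings:
    """Hypothetical schema: the same two fields whatever the format."""

    name: str
    retries: int


def _from_json(text: str) -> Settings:
    raw = json.loads(text)
    return Settings(name=raw["name"], retries=int(raw["retries"]))


def _from_xml(text: str) -> Settings:
    root = ET.fromstring(text)
    return Settings(
        name=root.findtext("name"),
        retries=int(root.findtext("retries")),
    )


# One loader per format; callers only ever see Settings.
_LOADERS = {"json": _from_json, "xml": _from_xml}


def load_settings(text: str, fmt: str) -> Settings:
    return _LOADERS[fmt](text)
```

Code that consumes `Settings` never finds out which format the data came from, so the format argument mostly disappears and what is left is the question of which fields exist and what they mean.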

For example, if your configuration was stored in XML, the SAX API allows you to parse the configuration file and fill in a structure in a single pass, which may be more memory-efficient than parsing the files into key/value pairs and requesting them by string. I did something like that with other file types through Ragel, but let’s be honest, in most cases, configuration file parsing speed is not your bottleneck (except if it is and in that case you probably know how to handle that already).
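For the curious, a single-pass fill with SAX might look roughly like this in Python, using the standard library `xml.sax` module. The `Config` structure and the element names (`host`, `port`, `verbose`) are assumptions for the sake of the example, not any real tool's configuration.

```python
import xml.sax
from dataclasses import dataclass


@dataclass
class Config:
    """Hypothetical configuration structure."""

    host: str = "localhost"
    port: int = 0
    verbose: bool = False


class ConfigHandler(xml.sax.ContentHandler):
    """Fills a Config as elements stream by; no intermediate tree or dict."""

    _FIELDS = ("host", "port", "verbose")

    def __init__(self):
        super().__init__()
        self.config = Config()
        self._field = None
        self._text = []

    def startElement(self, name, attrs):
        if name in self._FIELDS:
            self._field = name
            self._text = []

    def characters(self, content):
        # The parser may deliver text in several chunks, so accumulate.
        if self._field is not None:
            self._text.append(content)

    def endElement(self, name):
        if name != self._field:
            return
        value = "".join(self._text).strip()
        if name == "host":
            self.config.host = value
        elif name == "port":
            self.config.port = int(value)
        elif name == "verbose":
            self.config.verbose = value == "true"
        self._field = None


def parse_config(xml_text: str) -> Config:
    handler = ConfigHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.config
```

The typed structure is filled directly while the parser walks the file, which is the memory argument above. It is also a fair amount of code for three fields, which is the other half of the argument.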

The big problem for me with choosing a schema is that unless you have an easy way to expand it, you’ll find yourself stuck at some point. Just look at the number of specifications around for complex file formats such as pcapng. Or think of the various revisions of the RFCs just for HTTP/1.1 (without considering the whole upgrade to HTTP/2 and later). Committing to a schema is scary, because if you get it wrong, you’re likely going to be stuck for a long while, or you end up with the compatibility issue of changing the format every other release of whatever tool uses the format.

This is not far from what happens with office formats as well. If you look at the various formats used by Microsoft Word, they seem to change for each of the early releases, but then kind-of settled down by the time Word 97 came along, before standardizing on the OOXML format. And even in the open source world, OpenDocument took quite a while before being stable enough to be usable, but is now fairly stable.

I wish I now had an answer to give everyone about how to handle schemas and updates to them. Unfortunately, I don’t. In the bubble, the answer is not to worry too hard about the schema as long as your protocol buffer definitions are valid, because the monorepo will enforce their usage. It’s a bit of a misconception as well, since even with a single format and a nearly-globally enforced schema there can be changes that need to be propagated first, but it solves a lot of problems.

I had thought about schemas before, and I’m thinking of them again, in the context of glucometerutils because I would really like to have an easy way to export the data dumped from various meters into a format that can be processed with different software. This way, I only need to care about building the protocol tooling, and leave it to someone else who has better ideas about visualisation and analysis to build tools for that part.

My best guess right now about that is to just keep a tool that can upgrade the downloaded format from one version to the next — and make sure that there’s a library in-between the exporter and the consumer, so that as long as they both have the same library version, there’s no need to keep the two tools in sync. But I have not really written any code for that.
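Since no such code exists yet, what follows is only a sketch of the shape that idea could take, in Python. The version numbers, the field renames, and the `upgrade` function are all invented for illustration; none of this is actual glucometerutils code or a real export format.

```python
CURRENT_VERSION = 3


def _v1_to_v2(doc: dict) -> dict:
    # Hypothetical change: v2 renamed "time" to "timestamp_utc".
    doc = dict(doc)
    doc["readings"] = [
        {"timestamp_utc": r["time"], "value": r["value"]}
        for r in doc["readings"]
    ]
    doc["version"] = 2
    return doc


def _v2_to_v3(doc: dict) -> dict:
    # Hypothetical change: v3 made the unit explicit.
    doc = dict(doc)
    doc["unit"] = "mg/dL"
    doc["version"] = 3
    return doc


# Each entry upgrades from the keyed version to the next one.
_UPGRADES = {1: _v1_to_v2, 2: _v2_to_v3}


def upgrade(doc: dict) -> dict:
    """Walk a document up to CURRENT_VERSION, one step at a time."""
    version = doc.get("version", 1)
    while version < CURRENT_VERSION:
        doc = _UPGRADES[version](doc)
        version = doc["version"]
    if version != CURRENT_VERSION:
        raise ValueError(f"unsupported version {version}")
    return doc
```

If both the exporter and the consumer go through `upgrade()` from the same library version, the consumer only ever sees the current schema, and adding a version means adding one more step to the chain rather than touching every tool.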