XML, the eXtensible Markup Language, is probably one of the data description formats most hated among developers, especially open source developers. Lately, it is also one of the most widely used.
A lot of projects nowadays rely on XML: it is the backend for configuration systems (think GConf), the format of the feeds for your blogs, the document format of both OpenOffice and Microsoft Office, and it is used in modern web applications and for RPC between web services.
It is hated a lot because it is often misused. Look at the XML configuration files of fontconfig: they tend to be quite unreadable. Indeed, XML is designed less to be human-readable than to be easily parsable by very different software. This is one good reason to hate it. Then add stuff like SOAP and WSDL and you can see how XML can really get out of hand.
XML is a very good way to make it simpler for different software to share data: you can easily add more data to a given format without having to rewrite the parser and, as long as you follow some design rules, it is also easy to keep backward compatibility with very old versions. It is also good for converting structured data between formats; think DocBook, XHTML and our very own GuideXML. I also use a variant of that for my site, even if I never actually formalised and published it. One day I’ll do that, too.
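To make the extensibility point concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The `package`, `name` and `homepage` elements are made up for illustration and do not come from any real format. A reader that only looks up the elements it knows about keeps working when a newer document adds more of them:

```python
import xml.etree.ElementTree as ET

# Hypothetical documents: the "new" one adds an element that the
# "old" reader below has never heard of.
OLD_DOC = "<package><name>xine-lib</name></package>"
NEW_DOC = (
    "<package>"
    "<name>xine-lib</name>"
    "<homepage>http://example.org/</homepage>"
    "</package>"
)


def read_name(document: str) -> str:
    """Return the text of <name>, ignoring any element we do not know."""
    root = ET.fromstring(document)
    return root.findtext("name", default="")


if __name__ == "__main__":
    print(read_name(OLD_DOC))
    print(read_name(NEW_DOC))
```

The reader never enumerates the children of `package`, so unknown elements are simply skipped. This is one of the "design rules" that keeps old consumers compatible with newer files.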
Binary formats are not easily extensible, although EBML (the format on which the Matroska multimedia container format is based) tries to be, at the cost of a huge amount of complexity. They are much nicer to deal with when you have a lot of data to transmit and little of it has to be understood by humans, which is why I will always find UPnP over-engineered, and its use of XML a poor choice.
Text-based formats like INI are, in my opinion, better suited for configuration files. This is especially true since there are quite a lot of libraries that parse them easily, so there is no need to reinvent the wheel. It should also be trivial to write a simple command to parse them from bash, if there isn’t one already.
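As an example of how little code an INI-style configuration needs, here is a small sketch with Python's standard `configparser` module. The `[player]` section and its keys are invented for this example and are not taken from any real program:

```python
import configparser

# Hypothetical configuration, readable and editable by hand.
SAMPLE = """
[player]
volume = 80
fullscreen = yes
"""


def load_settings(text: str) -> configparser.ConfigParser:
    """Parse INI-formatted text into a ConfigParser object."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


if __name__ == "__main__":
    settings = load_settings(SAMPLE)
    print(settings.getint("player", "volume"))
    print(settings.getboolean("player", "fullscreen"))
```

The typed getters (`getint`, `getboolean`) handle the conversion from text, which covers most of what a configuration file needs.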
But this post was supposed to be about XML, right? So is its misuse as a configuration file enough to make XML the most blamed format out there? Maybe, but it’s certainly not the only reason I can find.
Another reason, to me a more important one, is the “almost XML” formats. What is an “almost XML” format? It’s simply a format that is based on XML but is not quite XML. In this category I’d put the ASX format. Even if Wikipedia defines it as an XML data format, the truth is that most ASX files use a bastardised form of XML:
- closing tags are optional, just like HTML;
- tag and attribute names are case-insensitive, just like HTML.
Parsing an ASX file is considerably more of a problem than parsing a true XML file, and it has caused quite a few problems in xine-lib (and its frontends): it stops us from just reusing a parser like the one in libxml2, and it’s easier to make mistakes while reinventing the wheel.
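To show why a stock XML parser cannot be reused here, the sketch below feeds two ASX-like fragments to Python's standard `xml.etree.ElementTree`. This is not the xine-lib code, and the fragments are made up for illustration. The second fragment mixes tag case and leaves a tag unclosed, in the way the two points above describe:

```python
import xml.etree.ElementTree as ET

# Hypothetical ASX-like playlists. STRICT is well-formed XML; SLOPPY uses
# mixed-case tag names and never closes <REF>, as HTML would allow.
STRICT = (
    '<asx version="3.0"><entry>'
    '<ref href="http://example.org/stream"/>'
    "</entry></asx>"
)
SLOPPY = (
    '<ASX version="3.0"><Entry>'
    '<REF HREF="http://example.org/stream">'
    "</entry></asx>"
)


def is_well_formed(document: str) -> bool:
    """Return True if a conforming XML parser accepts the document."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print(is_well_formed(STRICT))
    print(is_well_formed(SLOPPY))
```

A conforming parser has to reject the second fragment, because XML names are case-sensitive and every element must be closed. Anything that wants to read such files needs its own, more forgiving parser.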
Yesterday (actually, today for me; if you couldn’t tell before, I have started writing blog posts in advance and publishing them the day after, at about noon in my timezone) I ended up working on another “almost XML” format. This time the format itself is declared as XML and even has the .xml extension, but it is not described by a DTD (or an XML schema), it features redundant fields and unused fields, and… it is not parsed as XML.
To be precise, I’m writing software that writes these files, and I can tell you that they are written as XML, and also read as XML, on my side. I actually use libxml2 to do the work. But the consumer of those files does not treat them as XML. Instead, it expects the file to be formatted in a precise way, with the line count always being the same, which means that I have to keep the existing comments, I can’t add more comments, I can’t leave out elements that are ignored, and so on.
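The gap between "the same XML" and "the same lines" can be sketched as follows. The example uses Python's standard library instead of libxml2, and the `config` and `value` elements are hypothetical, since the real format is not shown here. Two documents that are equivalent to an XML parser can look completely different to a consumer that counts lines:

```python
import xml.etree.ElementTree as ET

# Hypothetical files: same data, different layout and comments.
WITH_COMMENT = """<config>
  <!-- a line-based consumer would expect this comment on line 2 -->
  <value>42</value>
</config>
"""
COMPACT = "<config><value>42</value></config>\n"


def same_xml(first: str, second: str) -> bool:
    """Compare two documents as XML, ignoring comments and indentation."""
    return (ET.canonicalize(first, strip_text=True)
            == ET.canonicalize(second, strip_text=True))


def line_count(document: str) -> int:
    """Count lines the way a naive line-based reader would."""
    return len(document.splitlines())


if __name__ == "__main__":
    print(same_xml(WITH_COMMENT, COMPACT))
    print(line_count(WITH_COMMENT), line_count(COMPACT))
```

`ET.canonicalize` needs Python 3.8 or later. It drops comments by default, and `strip_text=True` removes the whitespace used for indentation, which is roughly what an XML-aware consumer would treat as insignificant.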
Now I can tell why there is so much hate for XML around.