Finding differences in XML files

As I wrote in a previous entry (XML misuses), my current job entails working with XML files. Badly designed XML files, but that’s not the main issue here, although it does make the task a bit more complex than it would otherwise be.

I have two XML files, both around 1.5MiB in size, and I have to find what differs between them: one is the original, the other is generated by the software I’m writing.

A simple diff run between the two doesn’t work for me: it shows a HUGE amount of information I don’t care about, because it only tells me that a whole line differs, while I need to know which attributes in these very long lines changed.
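To make concrete what I mean by an attribute-level comparison, here is a minimal Python sketch. It is not one of the tools discussed below, just an illustration of the output I'm after. It assumes both documents contain the same elements in the same order, so that elements can be paired by position; the function name `attribute_diffs` and the sample documents are made up for the example.

```python
import xml.etree.ElementTree as ET


def attribute_diffs(old, new, path=None):
    """Yield (path, attribute, old_value, new_value) for each differing attribute.

    Assumption: both trees have the same elements in the same order.
    Elements are paired by position; extra children on either side are
    silently ignored, and a tag mismatch stops the descent at that point.
    """
    if path is None:
        path = "/" + old.tag
    if old.tag != new.tag:
        # Structure diverged: report the mismatch and do not descend further.
        yield (path, None, old.tag, new.tag)
        return
    # Compare the union of attribute names so additions and removals show up.
    for name in sorted(set(old.attrib) | set(new.attrib)):
        before, after = old.attrib.get(name), new.attrib.get(name)
        if before != after:
            yield (path, name, before, after)
    # The index is the child's position among all children of its parent.
    for index, (old_child, new_child) in enumerate(zip(old, new)):
        child_path = f"{path}/{old_child.tag}[{index}]"
        yield from attribute_diffs(old_child, new_child, child_path)


# Made-up sample documents standing in for the two real files.
original = ET.fromstring(
    '<root><item id="1" size="10"/><item id="2" size="20"/></root>'
)
generated = ET.fromstring(
    '<root><item id="1" size="10"/><item id="2" size="25" unit="px"/></root>'
)

for path, name, before, after in attribute_diffs(original, generated):
    print(f"{path} @{name}: {before!r} -> {after!r}")
# /root/item[1] @size: '20' -> '25'
# /root/item[1] @unit: None -> 'px'
```

Pairing elements by position is the weak point: a single inserted or removed element would throw off every comparison after it, which is exactly the part a real XML diff tool has to solve.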

I thought this was quite a common task when working with XML files, so after asking around a bit for suggestions, I ran a search to find a piece of software that would do what I need.

The first option was to use Microsoft’s XML Notepad (which I’m using already since I’m working under Windows and I didn’t want to look for KXML Editor for Windows), and its Compare option. Actually, that was in theory exactly what I needed. Unfortunately there is a bug, a huge one: whenever an attribute of a variable changes, instead of showing me the attribute name it outputs the element’s name. Quite a simple bug to fix if you had the sources around, but this is Microsoft.

Update 2020-10-14: since it’s now 2020, Microsoft actually did release the sources! You can find XmlNotepad on GitHub and it seems to be actively maintained.

Then someone suggested I try another proprietary commercial tool called oxygen. It offers multiple algorithms for comparing XML files; I first tried the “XML accurate” one, as that seemed the most appropriate: I needed an accurate comparison indeed. It threw an out-of-memory exception, suggesting that I increase its memory limit. So I did, and it still threw the exception, twice. The “fast” algorithm instead just gave me a line-by-line diff with XML syntax highlighting. Pretty much useless. Luckily it was a demo.

Time to Google around, and I found a few tools that seemed to be what I needed. Unfortunately, I found stuff that was originally written for GCC 2.95, then ported to GCC 3, and obviously fails with GCC 4, like XyDiff, which also has a pretty idiotic build system; I found Java code written for Java 1.4 that doesn’t work on 1.6 (with no source available, of course; many thanks for the “Write once run everywhere” idea); and much more sophisticated stuff, like Nokia’s xmlpatch, which is probably something I could use for other things, but not for this. The only option that seemed at least close to what I needed was xmldiff, which is already in Portage. Too bad that, after 20 minutes of working on the two files, it was still crunching numbers at 80% CPU without having output a single line of difference.

Is it this difficult to find a tool doing what I need here? I suppose I could spend some time writing my own tool, either for just this particular case or generic, but I’m not really sure I want to start this yet. Especially since I wouldn’t have time to polish it and I’m tired of starting projects I never complete in a decent way.

And I’m not sure which language I should use either. I could use Ruby, but I’m afraid of the memory usage. I’m not good enough with Java to use that. Standard C could be an option, but it would probably be a bit too complex. Probably the “better” choice for this would be C#, so that I could use it directly on the computer I’m generating the file on, but then I’d like something usable on Linux too, so I would probably be forced to look into Mono for that…

If somebody has a suggestion, it’d be very welcome.