Finding differences in XML files

As I wrote in a previous entry (XML misuses), my current job entails working with XML files. Badly designed XML files, but that’s not the main issue here, although it does make the task a bit more complex than it would otherwise be.

I have two XML files, each around 1.5MiB, and I have to find what differs between them: one is the original, the other is generated by the software I’m writing.

A simple diff run between the two can’t work for me because it shows a HUGE lot of information I don’t care about: it only tells me that a whole line differs, while I need to know which attributes in these very long lines changed.
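To make the requirement concrete, here is a minimal sketch of an attribute-level comparison. It is illustrative only, not a tool used for this job. It assumes both files contain the same elements in the same order, so children can be paired by position; the function name `attribute_diffs` and the path notation are made up for this example.

```python
import xml.etree.ElementTree as ET


def attribute_diffs(old, new, path=None):
    """Yield (path, attribute, old_value, new_value) for each differing attribute.

    Assumption: both trees have the same shape. Children are paired by
    position, and extra children on either side are silently ignored.
    """
    path = path or f"/{old.tag}"
    # Compare the union of attribute names so added and removed ones show up too.
    for name in sorted(set(old.attrib) | set(new.attrib)):
        before, after = old.attrib.get(name), new.attrib.get(name)
        if before != after:
            yield path, name, before, after
    # The index is the 0-based position among all children, not an XPath index.
    for index, (old_child, new_child) in enumerate(zip(old, new)):
        child_path = f"{path}/{old_child.tag}[{index}]"
        yield from attribute_diffs(old_child, new_child, child_path)


def report(old_file, new_file):
    """Print one line per changed attribute for two XML files on disk."""
    old_root = ET.parse(old_file).getroot()
    new_root = ET.parse(new_file).getroot()
    for where, name, before, after in attribute_diffs(old_root, new_root):
        print(f"{where} @{name}: {before!r} -> {after!r}")
```

This loads both trees fully into memory, which should be tolerable for files of a couple of MiB, and it does nothing clever about inserted or reordered elements, which is the hard part the dedicated XML diff tools try to solve.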

I thought this was quite a common task when working with XML files, so after asking around a bit for suggestions, I ran a search to find software that would do what I need.

The first option was to use Microsoft’s XML Notepad (which I’m using already since I’m working under Windows and I didn’t want to look for KXML Editor for Windows), and its Compare option. Actually, that was in theory exactly what I needed. Unfortunately there is a bug, a huge one: whenever an attribute of a variable changes, instead of showing me the attribute name it outputs the element’s name. Quite a simple bug to fix if you had the sources around, but this is Microsoft.

Update 2020-10-14: since it’s now 2020, Microsoft actually did release the sources! You can find XmlNotepad on GitHub and it seems to be actively maintained.

Then someone suggested I try another proprietary commercial tool called oxygen. It offers multiple algorithms for comparing XML files; I first tried the “XML accurate” one, as that seemed the most appropriate, since I needed an accurate comparison indeed. It threw an out-of-memory exception, suggesting that I increase its memory limit. So I did, and it still threw the exception, twice. The “fast” algorithm instead just gave me a line-by-line diff with XML syntax highlighting. Pretty much useless. Luckily it was a demo.

Time to Google around, and I found a few tools that seemed to be what I needed. Unfortunately, some were originally written for GCC 2.95, then ported to GCC 3, and obviously fail to build with GCC 4, like XyDiff, which also had a pretty idiotic build system. I found Java code written for Java 1.4 and not working on 1.6 (with no source available of course) – many thanks for the “Write once run everywhere” idea – and much more sophisticated stuff (like Nokia’s xmlpatch, which is probably something I could use for other stuff, but not for this). The only option that seemed at least near what I needed was xmldiff, which is in portage already. Too bad that, after 20 minutes working on the two files, it was still crunching numbers at 80% CPU without having output a single line of difference.

Is it this difficult to find a tool doing what I need here? I suppose I could spend some time writing my own tool, either for just this particular case or generic, but I’m not really sure I want to start this yet. Especially since I wouldn’t have time to polish it and I’m tired of starting projects I never complete in a decent way.

And I’m not sure which language I should use either. I could use Ruby but I’m afraid of the memory usage. I’m not good enough with Java to use that. Standard C could be an option but would probably be a bit overcomplex. Probably the “better” choice for this would be C#, so that I could use it directly on the computer I’m generating the file on, but then I’d like something usable on Linux too, so I would probably be forced to look into Mono for that…

If somebody has a suggestion, it’d be very welcome.

9 thoughts on “Finding differences in XML files”

  1. Why does the following not work: 1. Reformat the XML files with xmllint --format. 2. Use a diff tool (e.g. kdiff3) to see the differences.

  2. There’s XMLSpy (commercial) but rather good. And then there was this XSL file my friend looked up while doing a similar project resolving diffs in XML files. I’ll try and get the sources.

  3. Johnny, that seems interesting and I’ll probably see to add it to portage, but it seems to suffer from the same problem as the “usual” xmldiff: it takes a huge lot of time. Christoph, the XML is already indented, but the problem is that each line carries about 20 different attributes, so that won’t work. Thanks Rajat, that really interests me :)

  4. If you plan on writing your own simple XML parser, I think Python would be a good choice. I have only worked 2 months with Python on a language processing course and I must admit it is great with strings. The dictionaries/lists make it easy for you to list all the attributes (and check for differences) and their values, and also check for changes there. If you are just doing it for this specific project, you can build a simple parser like that in less than a day.

  5. There are already enough XML parsers, I’m certainly not going to write a new one. As for Python, it’s probably the second worst language you can tell me to use, the only other is Perl. Besides, all the features you named are present in any medium to high level language (C++, Ruby, PHP, C#, …).

  6. But for memory-efficiently comparing two XML files, you’ll probably need a pull parser. Those are not as common.

  7. Have you considered applying a simple XSL transform to both files to generate text files and using plain diff on them? If you’re looking for attribute diffs, it’s trivial to generate an indented list of elements with their attributes, one attribute per line with its value, attributes sorted by name of course, since attribute order does not matter.

  8. Hi Flameeyes, I am currently working on a similar project, where I need to compare two XML files of around 2-5 MB and figure out the differences. So far I have tried the XMLUnit code with no success :( Can you please help me if you have a working piece of code for this job? Thanks in advance. Mail Id : sajithgowda@gmail.com
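
The ideas in comments 6 and 7 above (a streaming parser, and flattening each file to one sorted attribute per line before running a plain diff) can be sketched roughly as follows. This is only an illustration under some assumptions: it uses Python’s standard library in place of the XSL transform the commenter proposed, `flatten` and `attribute_diff` are hypothetical names, and element text is ignored entirely.

```python
import difflib
import xml.etree.ElementTree as ET


def flatten(source):
    """Return text lines: one per element, then one per attribute, sorted by name.

    iterparse reads the file incrementally. Attributes are already available
    on the "start" event, so each element can be emptied as soon as it closes.
    """
    lines = []
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            indent = "  " * depth
            lines.append(f"{indent}<{elem.tag}>")
            for name in sorted(elem.attrib):
                lines.append(f"{indent}  @{name}={elem.attrib[name]}")
            depth += 1
        else:
            depth -= 1
            # Drop children, text and attributes; the emptied element itself
            # stays attached to its parent.
            elem.clear()
    return lines


def attribute_diff(old_source, new_source):
    """Unified diff of the two flattened files, as a list of lines."""
    return list(
        difflib.unified_diff(
            flatten(old_source),
            flatten(new_source),
            fromfile="original",
            tofile="generated",
            lineterm="",
        )
    )
```

Calling `attribute_diff("original.xml", "generated.xml")` (file names made up here) gives a diff where each changed attribute sits on its own line, so attribute order and the 20-attributes-per-line layout no longer matter. Writing the output of `flatten` to text files and feeding them to plain diff or kdiff3 would work just as well.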
