A personal reverse engineering project

While Mike always seems to have time for strange and silly reverse engineering projects, I usually have neither the time nor the skills. I have RE’d a couple of things in the past, but I never did anything really useful. Last year, though, I set my mind on trying to reverse engineer something, even though at the time I failed to find the original CDs it came on; they turned up in front of my eyes the other day.

The target is a set of six CDs, digital copies of most of the relevant Italian Literature written from the 13th century onward. These CDs were attached to an Italian magazine (L’Espresso), and came with a Windows 3.1 interface to access the data. My original reason for being interested in reverse engineering this data is simply that, with the advent of widespread ebook readers in today’s market, the content really ought to be accessible in some other way than through its original Win16 frontend.

Even more importantly, the relevant content of the CDs is public domain Italian literature; the backend software used for storing it, according to the jewel case cover, was developed by CNR with funding from the Italian government (it’s called DBT 3.1 and it’s a textual analysis package; Google doesn’t seem to report any recent relevant information about it, though). This makes my doubts about reverse engineering someone else’s code go away almost entirely.

Interestingly, out of six discs with about 350MB of data each, the files that actually differ between them, which have to be the data files, amount to just over 250MB. This matters to me because it means that most of the other data on each CD is just the frontend software, which in turn means that splitting the collection across six discs was purely marketing on the part of the magazine; it could have fit on one or two discs at most, sparing the environment the garbage produced by four extra CDs per user. Sigh!
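For anyone wanting to repeat this kind of comparison, here is a minimal sketch of one way to do it, assuming each disc has been copied or mounted into its own directory. The function names and the choice of SHA-256 are mine, purely for illustration; this is not necessarily how the numbers above were obtained.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def split_shared_and_unique(disc_roots):
    """Classify file contents as shared by every disc or not.

    Returns (shared_bytes, differing_bytes). Each distinct content is
    counted once: in shared_bytes if it appears on all discs, otherwise
    in differing_bytes.
    """
    seen = {}  # digest -> (size, set of disc indexes holding that content)
    for index, root in enumerate(disc_roots):
        for path in Path(root).rglob("*"):
            if path.is_file():
                digest = file_digest(path)
                size, discs = seen.setdefault(
                    digest, (path.stat().st_size, set())
                )
                discs.add(index)
    total = len(disc_roots)
    shared = sum(size for size, discs in seen.values() if len(discs) == total)
    differing = sum(size for size, discs in seen.values() if len(discs) < total)
    return shared, differing
```

Calling `split_shared_and_unique(["disc1", "disc2", ...])` with the six directories (hypothetical paths) would give the frontend size in the first value and the data size in the second, since comparing by content rather than by name sidesteps the odd naming scheme.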

I haven’t started working on the actual reverse engineering yet. The data files all appear to be in custom formats, with the roles of filename and extension reversed (the name defines the file’s content, the extension the volume it relates to), simply because of the component length limits imposed by MS-DOS compatibility. The old 8.3 naming scheme is a huge hindrance in trying to understand what contains what, but there are a number of data files and a good set of index files, which agrees with DBT being described as a tool for text processing and indexing.
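To get an overview despite the reversed naming, the files can be grouped by base name so that each content type lists the volumes it appears in. This is only a small sketch; the function name is made up, and the only real file name I can vouch for is DBBIBLIO, so anything else passed in is up to the reader.

```python
from collections import defaultdict
from pathlib import PurePath


def group_by_content_type(filenames):
    """Map each base name (content type) to the sorted extensions (volumes)."""
    groups = defaultdict(set)
    for name in filenames:
        # 8.3 names are case-insensitive, so normalise before splitting.
        path = PurePath(name.upper())
        groups[path.stem].add(path.suffix.lstrip("."))
    return {stem: sorted(exts) for stem, exts in groups.items()}
```

A base name that shows up with six different extensions is then most likely a per-volume data or index file.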

The only files whose type file(1) detects consistently across all six discs are these:

DBBIBLIO.LZ1: DBase 3 data file (1276248076 records)
DBBIBLIO.LZ2: DBase 3 data file (1124728851 records)
DBBIBLIO.LZ3: DBase 3 data file (1124597763 records)
DBBIBLIO.LZ4: DBase 3 data file (1226768415 records)
DBBIBLIO.LZ5: DBase 3 data file (1125056517 records)
DBBIBLIO.LZ6: DBase 3 data file (1394081803 records)

But this doesn’t make any sense: four of the six files are smaller than 100KiB and the other two are below 150KiB, so they cannot possibly hold over a billion records each. They are definitely not DBase 3 files, even though their signature does seem to be similar.
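The same sanity check can be written down as code. The sketch below assumes the usual dBase III header layout (version byte, three date bytes, then a little-endian 32-bit record count, 16-bit header length and 16-bit record length) and simply asks whether the size implied by the header matches the size of the file. The function names are mine, and I am guessing that file(1) here matched on little more than the leading byte.

```python
import struct
from collections import namedtuple

DbfHeader = namedtuple("DbfHeader", "version records header_len record_len")


def parse_dbf_header(data):
    """Decode the fixed fields at the start of a dBase III header."""
    if len(data) < 12:
        raise ValueError("need at least 12 bytes")
    # byte 0: version; bytes 1-3: last update date (skipped here);
    # bytes 4-7: record count; 8-9: header length; 10-11: record length.
    version, records, header_len, record_len = struct.unpack_from(
        "<B3xIHH", data
    )
    return DbfHeader(version, records, header_len, record_len)


def is_plausible_dbf(data, file_size):
    """Return True if the header fields are consistent with file_size."""
    header = parse_dbf_header(data)
    # 33 = 32-byte fixed header plus the field-list terminator byte.
    if header.version != 0x03 or header.header_len < 33 or header.record_len < 1:
        return False
    expected = header.header_len + header.records * header.record_len
    # Allow for the optional trailing 0x1A end-of-file marker.
    return expected <= file_size <= expected + 1
```

Feeding it the first bytes of one of the DBBIBLIO files together with its real size should return False, which is just a more formal way of saying what the paragraph above already concludes.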

I think this is an eye-opening example of why government money should be spent on Open Source technologies, or at least Open Standards: the software that generated (and is probably needed to access) this data is no longer usable by me, one of its original users, and yet it is hiding content that is largely in the public domain (Dante’s Divina Commedia is definitely out of copyright, even though footnotes in modern editions are not; but this collection was designed to be devoid of any kind of footnotes!).

At any rate, I have now ripped a set of ISOs of the discs so I can access them more easily, and as soon as I have time I’ll try to understand what compression might be used by the code (I’m sure they did compress at least part of it!). For now I think I already have an interesting bit to follow up on: file(1) reports a strange message about vasprintf() when testing some compressed DLLs.