While Mike always seems to have time for strange and silly reverse engineering projects, I usually have neither the time nor the skills. While I have RE’d a couple of things in the past, I never did anything really useful. But last year I did set my mind on trying to reverse engineer something, even though at the time I failed to find the original CDs it came on; they turned up in front of my eyes the other day.
The target is a set of six CDs, digital copies of most of the relevant Italian Literature written from the 13th century onward. These CDs were attached to an Italian magazine (L’Espresso), and consisted of a Windows 3.1 interface to access the data. My original reason to be interested in reverse engineering this data is simply that, with the advent of widespread ebook readers in today’s market, the content really ought to be accessible in some other way than its original Win16 frontend.
More important still, the relevant content of the CDs is public domain Italian literature; the backend software used for storing it, according to the jewel case cover, was developed by CNR with funding from the Italian government (it’s called DBT 3.1 and it’s a textual analysis tool; Google doesn’t seem to report any recent relevant information about it, though). This makes my doubts about reverse engineering someone else’s code basically go away altogether.
Interestingly, out of six discs with about 350MB of data each, the files that actually differ between them, which have to be the data files, add up to just over 250MB. I find this worth noting because it means that most of the other data on each CD is just the frontend software, which in turn means that splitting the collection across six discs was purely a marketing decision on the part of the magazine; it could have fit on one or two discs at most, sparing the environment the garbage produced by four extra CDs per user. Sigh!
I haven’t started working on the actual reverse engineering yet; the data files are all in what appear to be custom formats, with reversed extension/filename pairs (the name defines the file’s content, the extension the volume it relates to), simply because of the component lengths that MS-DOS compatibility allows. The old 8.3 naming scheme is a huge hindrance in trying to understand what contains what, but there are a number of data files, and a good set of index files, which agrees with DBT being described as a tool for text processing and indexing.
The only files that file(1) identifies consistently across all six discs are these:
DBBIBLIO.LZ1: DBase 3 data file (1276248076 records)
DBBIBLIO.LZ2: DBase 3 data file (1124728851 records)
DBBIBLIO.LZ3: DBase 3 data file (1124597763 records)
DBBIBLIO.LZ4: DBase 3 data file (1226768415 records)
DBBIBLIO.LZ5: DBase 3 data file (1125056517 records)
DBBIBLIO.LZ6: DBase 3 data file (1394081803 records)
But this doesn’t make any sense, as four out of six files are smaller than 100KiB, and the other two are below 150KiB, which means they are definitely not DBase 3 even though their signature does seem to be similar.
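To make the mismatch concrete, here is a minimal sketch of the sanity check, assuming the usual dBase III header layout (version byte, three date bytes, a little-endian 32-bit record count, then 16-bit header and record lengths). The function names and the sample header below are made up for illustration; the bytes are not taken from the CDs.

```python
import struct

def read_dbf_header(data: bytes) -> dict:
    """Decode the first 12 bytes of what claims to be a dBase III header."""
    if len(data) < 12:
        raise ValueError("need at least 12 bytes of header")
    # version, YY, MM, DD, record count (u32), header length (u16), record length (u16)
    version, yy, mm, dd, n_records, header_len, record_len = struct.unpack_from(
        "<4BIHH", data, 0
    )
    return {
        "version": version,
        "date": (yy, mm, dd),
        "n_records": n_records,
        "header_len": header_len,
        "record_len": record_len,
    }

def looks_like_real_dbf(data: bytes, file_size: int) -> bool:
    """True only if the sizes claimed by the header can fit in the file."""
    h = read_dbf_header(data)
    claimed = h["header_len"] + h["n_records"] * h["record_len"]
    return h["version"] == 0x03 and h["record_len"] > 0 and claimed <= file_size

if __name__ == "__main__":
    # Synthetic header (an assumption, not real data) carrying one of the
    # record counts reported above, checked against a 100 KiB file size.
    fake = struct.pack("<4BIHH", 0x03, 95, 1, 1, 1276248076, 33, 10)
    print(looks_like_real_dbf(fake, 100 * 1024))  # False
```

Running the same check against the first bytes of each real file, with its actual size, would show whether anything beyond the signature byte matches.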
I think this is one eye-opening example of why government money should be spent on Open Source technologies, or at least Open Standards: the software that generated (and probably accessed) this data is no longer usable by me, its original user, and yet it is hiding content that is largely in the public domain (Dante’s Divina Commedia is definitely out of copyright, even though the footnotes in modern editions are not; but this collection was designed to be devoid of any kind of footnotes!).
At any rate, I have now ripped a set of ISOs of the discs so I can access them more easily, and as soon as I have time I’ll try to understand what compression might be used by the code (I’m sure they did compress at least part of it!). For now I think I have an interesting bit to follow up on already: file(1) reporting a strange message about vasprintf() when testing some compressed DLLs.
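One cheap first step for guessing whether a given file is compressed at all is to measure its byte entropy: values close to 8 bits per byte usually mean compressed (or encrypted) data, while plain text sits well below that. The sketch below is only a heuristic, and the function name is my own choice for illustration.

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

if __name__ == "__main__":
    import sys
    # Usage: python entropy.py FILE [FILE ...]
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            print(f"{path}: {byte_entropy(fh.read()):.3f} bits/byte")
```

It says nothing about which algorithm was used, only whether looking for one is worth the time.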
Will you be making any data available for download? Have you positively identified those *.LZ* files as the data files? ‘LZ’ would imply a Lempel-Ziv compression variation, and is probably not too complex, given the vintage (Win3.1 era). The information is supposed to decode to textual data and not scanned images depicting text, right? If it’s text, what encoding? Since it’s Italian, I suspect vanilla ASCII won’t do the job.
This reminds me of an occasion many years ago when I wanted to extract some text which, like yours, was stored in proprietary form along with a Windows viewer (obviously) incapable of copying text to the clipboard. As I wasn’t particularly interested in the format, I devised a quick method of getting to the text with minimal effort: I started the viewer in Wine while tracing all calls to XDrawText, then placed a suitable weight on the page down key and waited. Once it reached the end, some trivial log processing was all that stood between me and the text.
Thanks for the idea of using Wine to debug what the heck it is doing, I didn’t think of it. The @.LZ@ shouldn’t relate to the actual format of the content; it is shorthand for the package’s title _Letteratura Italiana Zanichelli_ (Zanichelli’s Italian Literature), which is also used as the ISO label of the discs. As I said, there are a bunch of files with different name parts and the six extensions @.LZ1@ to @.LZ6@. But who knows, it might actually make sense. Also, plain ASCII is unfortunately what is normally used for Italian, even when it shouldn’t be, so my name is _quite_ a problem (for the usual ò). At worst, though, an 8-bit extension of ASCII covers all of the alphabet required for Italian; usually that means CP1252, Latin 1 or IBM850.
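As a rough illustration of why the code page matters when searching the data, the sketch below encodes a phrase in the three candidates named above using Python’s built-in codecs, and decodes raw bytes the same way. The helper names are hypothetical, and nothing here is known about what the CDs actually use.

```python
# Candidate 8-bit encodings mentioned in the thread (Python codec names).
CANDIDATES = ("cp1252", "latin-1", "cp850")

def encode_phrase(phrase: str) -> dict:
    """Byte patterns to search for, one per candidate encoding."""
    return {name: phrase.encode(name) for name in CANDIDATES}

def decode_candidates(raw: bytes) -> dict:
    """Decode the same bytes under each candidate, for eyeballing."""
    return {name: raw.decode(name, errors="replace") for name in CANDIDATES}

if __name__ == "__main__":
    for name, pattern in encode_phrase("però").items():
        print(name, pattern)
    # The accented letter is byte 0xF2 in cp1252 and latin-1 but 0x95 in cp850,
    # so a search for accented text has to try each pattern separately.
```

A phrase without accented letters, like the opening of the Commedia, encodes identically in all three, so a miss on that one points at something other than the code page.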
It may very possibly be a variation of DBF files, which store text in plain form, so piping them through `strings` should tell the story. If that’s the case, dev-lang/harbour is a Clipper implementation for Unix and could help; we program with it here, if you need a hand.
I already tried piping all of them through strings and not everything makes sense: a quick @fgrep@ call looking for “nel mezzo del cammin” (from Dante’s _Divina Commedia_, for the non-Italian speakers) didn’t get me anything either. I also wonder how much compression is applied to the works, since a @tar.xz@ file of the data files brings them down from 256MB to 175MB, which would suggest some kind of compression is also implemented. I guess I should consider skimming through the frontend’s strings to see if I come across the copyright declarations of the underlying libraries; if zlib or something similar is used, it should be listed there somewhere.
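For what it’s worth, the skimming can be scripted. This is a minimal sketch of a strings(1)-like scan that keeps only the printable runs containing a keyword; the function names, the default minimum length of 4 and the keyword list are all assumptions chosen for illustration.

```python
import re

def printable_runs(data: bytes, min_len: int = 4) -> list:
    """Return runs of printable ASCII at least min_len long, like strings(1)."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]

def find_keywords(data: bytes, keywords) -> list:
    """Keep only the runs that contain one of the keywords, ignoring case."""
    wanted = [k.lower() for k in keywords]
    return [
        run for run in printable_runs(data)
        if any(k in run.lower() for k in wanted)
    ]

if __name__ == "__main__":
    import sys
    # Usage: python scan.py FILE [FILE ...]
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            for hit in find_keywords(fh.read(), ["copyright", "(c)"]):
                print(f"{path}: {hit}")
```

It only sees ASCII text stored uncompressed, so an empty result from the frontend binaries would not rule out a statically linked compression library.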
Or, you know, just upload one of the 6 images somewhere so that your obsessive RE buddies can take a crack at it. 🙂 Let me know if you need some space.
Yeah, what he said – you’ve got me curious. Personally, I’d start by seeing if the 7-Zip GUI knows what they are; I’ve found that “file” is very, very good most of the time, but when it falls down, it falls hard. I’ve not tried the Linux version, but 7-Zip does a remarkably good job of handling weird compression formats. Not having any idea what the “signature” is for a DBase file… I wonder if the reason it thinks it’s DBase is because of the “DBT” software – perhaps there’s a “DB” near the beginning of the file that it’s picking up on.