Amazon, Project Gutenberg, and Italian Literature

This post starts in the strangest of places. The other night, my mother was complaining about how few free Italian books are available on the Kindle Store.

Turns out, a friend of the family, who also has a Kindle, has been enjoying reading free older English books on hers. As my mother does not speak or read English, she’d been complaining that the same is not possible in Italian.

The books she’s referring to are older books whose copyright has expired, and which are available on Project Gutenberg. Indeed, the selection of Italian books on that site is fairly limited, and it is something that I have been saddened by before.

What has Project Gutenberg to do with Kindle? Well, Amazon appears to collect books from Project Gutenberg, convert them to Kindle’s native format, and “sell” them on the Kindle Store. I say “sell” because for the most part, these are available at $0.00, and are thus available for free.

While there is no reference to Project Gutenberg on their store pages, there’s usually a note on the book:

This book was converted from its physical edition to the digital format by a community of volunteers. You may find it for free on the web. Purchase of the Kindle edition includes wireless delivery.

Another important point is that (again, for the most part) the original language editions are also available! This is how I started reading Jules Verne’s Le Tour du monde en quatre-vingts jours while trying to brush up on my French to workable levels.

Having these works available on the Kindle Store, free of both direct cost and delivery charge, is in my opinion a great step to distribute knowledge and culture. As my nephews (blood-related and otherwise) start reaching reading age, I’m sure that what I will give them as presents is going to be Kindle readers, because between having access to this wide range of free books, and the embedded touch-on dictionary, they feel like something I’d have thoroughly enjoyed using when I was a kid myself.

Unfortunately, this is not all roses. The Kindle Store still geo-restricts some books, so from my Kindle Store (which is set to the US), I cannot download Ludovico Ariosto’s Orlando Furioso in Italian (though I can download the translation for free, or buy a non-Project Gutenberg version of the original Italian text for $0.99). And of course there is the problem of coverage for the various languages.

Italian, as I said, appears to be a pretty bad one when it comes to coverage. If I look at Luigi Pirandello’s books there are only seven entries, one of which is in English, and another one being a duplicate. Compare this with the actual list of his works and you can see that it’s very lacking. And since Pirandello died in 1936, his works are already in the public domain.

Since I have not actually been active with Project Gutenberg, I only have second-hand knowledge of why this type of problem happens. One of the things I remember having been told about this is that most of the books you buy in Italian stores are either annotated editions, or updated for modern Italian, which causes their copyright to be extended to the death of the editor, annotator or translator.

This lack of access to Italian literature is a big bother, and quite a bit of a showstopper to giving a Kindle to my Italian “nephews”. I really wish I could find a way to fix the problem, whether it is by technical or political means.

On the political side, one could expect that, with the focus on culture of the previous Italian government, and the focus of the current government on the free-as-in-beer options, it would be easy to convince them to release all of the Italian literature that is in the public domain for free. Unfortunately, I wouldn’t even know where to start to ask them to do that.

On the technical side, maybe it is high time that I spend a significant amount of time on my now seven-year-old project of extracting a copy of the data from the data files of Zanichelli’s Italian literature software (likely developed at least in part with public funds).

The software was developed for Windows 3.1 and can’t be run on any modern computer. I should probably send the ISOs of it to the Internet Archive, and they may be able to keep it running there on DOSBox with a real copy of Windows 3.1, since Wine appears not to support the 16-bit OLE interfaces that the software depends on.

If you wonder what would be a neat thing for Microsoft to release as open source, I would suggest that the whole Windows 3.1 source code would be a good starting point. If nothing else, with the right license it would be possible to replace the half-complete 16-bit DLLs of Wine with official, or nearly official, copies.

I guess it’s time to learn more about Windows 3.1 in my “copious spare time” (h/t Charles Stross), and start digging into this. Maybe Ryan’s 2ine might help, as OS/2 and Windows 3.1 are closer than the latter is to modern Windows.

A personal reverse engineering project

While Mike seems to always have time for strange and silly reverse engineering projects, I usually have neither the time nor the skills. While I have RE’d a couple of things in the past, I never did anything really useful. But I did set my mind on trying to reverse engineer something last year, even though I failed to find the original CDs it came on; they turned up in front of my eyes the other day.

The target is a set of six CDs, digital copies of most of the relevant Italian Literature written from the 13th century onward. These CDs were attached to an Italian magazine (L’Espresso), and consisted of a Windows 3.1 interface to access the data. My original reason to be interested in reverse engineering this data is simply that, with the advent of widespread ebook readers in today’s market, the content really ought to be accessible in some other way than its original Win16 frontend.

Even more importantly, the actual content of the CDs is public domain Italian literature; the backend software used for storing it, according to the jewel case cover, was developed by CNR with funding from the Italian government (it’s called DBT 3.1 and it’s textual analysis software; Google doesn’t seem to report any recent relevant information about it, though). This makes my doubts about reverse engineering someone else’s code basically go away altogether.

Interestingly, out of six discs with about 350MB of data each, the files that actually differ (which have to be the data files) add up to just over 250MB. This is interesting to me because it means that most of the other data on each CD is just the frontend software, which in turn means that splitting it across six discs was purely a marketing decision on the part of the magazine; it could have fit on one or two discs at most, sparing the environment the garbage produced by four extra CDs per user. Sigh!
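For the curious, a comparison like this can be sketched in a few lines of Python. This is only a rough sketch, assuming each disc (or ripped image) is mounted under its own directory; the function names and the use of SHA-256 are illustrative choices, not necessarily how the numbers above were obtained. A file counts as "shared" when identical content shows up on every disc, and as "differing" otherwise.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def summarize_discs(disc_roots):
    """Split the files found under disc_roots into shared and differing.

    Returns (differing, shared_bytes): differing is a list of (path, size)
    for files whose content is not present on every disc; shared_bytes is
    the size, counted once, of the content that every disc carries.
    """
    discs_by_digest = defaultdict(set)
    files = []  # (disc index, path, size, digest)
    for index, root in enumerate(disc_roots):
        for path in sorted(Path(root).rglob("*")):
            if path.is_file():
                digest = file_digest(path)
                discs_by_digest[digest].add(index)
                files.append((index, path, path.stat().st_size, digest))

    total = len(disc_roots)
    differing = [(p, s) for _, p, s, d in files
                 if len(discs_by_digest[d]) < total]
    # Count the shared content once, using the first disc as the reference.
    shared_bytes = sum(s for i, _, s, d in files
                       if i == 0 and len(discs_by_digest[d]) == total)
    return differing, shared_bytes
```

Calling summarize_discs() with the six mount points (the paths are whatever your system uses) and summing the sizes in the first return value gives the total of the differing files; the second value is roughly the frontend that is repeated on every disc.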

I haven’t started working on the actual reverse engineering yet; the data files are all in what appear to be custom formats, with reversed extension/filename pairs (the name defines the file’s content, the extension the volume it relates to), simply because of the limits on the components’ lengths imposed by MS-DOS compatibility. The old 8.3 naming scheme is a huge hindrance in trying to understand which file contains what, but there are a number of data files, and a good set of index files, which agrees with DBT being described as a tool for text processing and indexing.
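To get an overview of that naming scheme, the file names can be grouped by content name and volume. The snippet below is a small sketch under the assumption described above (name is the content, extension is the volume); index_by_content is a made-up helper name, and the example names follow the DBBIBLIO.LZ1 pattern found on the discs.

```python
from collections import defaultdict
from pathlib import PurePath


def index_by_content(filenames):
    """Map each content name (the 8-character part) to its sorted volumes.

    Assumes the reversed convention: the name identifies the kind of
    content, the extension identifies the volume it belongs to.
    """
    volumes_by_content = defaultdict(set)
    for name in filenames:
        path = PurePath(name)
        content = path.stem.upper()                # e.g. "DBBIBLIO"
        volume = path.suffix.lstrip(".").upper()   # e.g. "LZ1"
        volumes_by_content[content].add(volume)
    return {content: sorted(volumes)
            for content, volumes in volumes_by_content.items()}
```

Feeding it the directory listings of all six discs should show at a glance which kinds of file exist for every volume and which ones are specific to a single disc.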

The only files whose type file(1) detects consistently across all six discs are these:

DBBIBLIO.LZ1: DBase 3 data file (1276248076 records)
DBBIBLIO.LZ2: DBase 3 data file (1124728851 records)
DBBIBLIO.LZ3: DBase 3 data file (1124597763 records)
DBBIBLIO.LZ4: DBase 3 data file (1226768415 records)
DBBIBLIO.LZ5: DBase 3 data file (1125056517 records)
DBBIBLIO.LZ6: DBase 3 data file (1394081803 records)

But this doesn’t make any sense, as four out of six files are smaller than 100KiB, and the other two are below 150KiB, which means they are definitely not DBase 3 even though their signature does seem to be similar.
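The same sanity check can be done programmatically. A dBASE III file starts with a version byte, a three-byte last-update date, a 32-bit little-endian record count, and 16-bit header and record lengths, so the size the header implies can be compared with the real file size. The sketch below assumes that layout; the helper names and the one-byte tolerance (for the usual end-of-file marker) are illustrative choices, and a real DBF in the wild may deviate slightly.

```python
import struct
from collections import namedtuple

DbfHeader = namedtuple(
    "DbfHeader", "version year month day records header_len record_len")


def parse_dbf_header(data):
    """Decode the first 12 bytes of a (supposed) dBASE III file."""
    if len(data) < 12:
        raise ValueError("need at least 12 bytes of header")
    version, yy, mm, dd, records, header_len, record_len = \
        struct.unpack_from("<4BIHH", data, 0)
    # The year is stored as an offset from 1900.
    return DbfHeader(version, 1900 + yy, mm, dd,
                     records, header_len, record_len)


def plausible_dbf(header, file_size):
    """Check whether the header is consistent with the file size."""
    expected = header.header_len + header.records * header.record_len
    return (header.version == 0x03
            and header.record_len > 0
            and header.header_len >= 33
            and 0 <= file_size - expected <= 1)
```

With over a billion claimed records and a file of roughly 100KiB, plausible_dbf() returns False, which matches the conclusion above that only the leading signature looks like dBASE.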

I think this is an eye-opening example of why government money should be spent on Open Source technologies, or at least Open Standards: the software that generated (and probably accessed) this data is no longer usable by me, one of its original users, and yet it is hiding content that is largely in the public domain (Dante’s Divina Commedia is definitely out of copyright, even though the footnotes in modern editions are not; but this collection was designed to be devoid of any kind of footnotes!).

At any rate, I have now ripped a set of ISOs of the discs so I can access them more easily, and as soon as I have time I’ll try to understand what compression might be used by the code (I’m sure they did compress at least part of it!). For now I think I already have an interesting bit to follow up on: file(1) reporting a strange message about vasprintf() when testing some compressed DLLs.