So I’m in the hospital, with too much time on my hands of course, since the exams take time to prepare. I spent the weekend out of the hospital with my family, which really made me feel better, and I was able to pick up some new books to read, in particular The Dragon Reborn and a Japanese grammar textbook. I was also able to get the Italian edition of Cowboy Bebop, an anime series I really love (and have seen many times on TV in Italian; this time I also have the Japanese track).
But one thing I’m doing for most of my days is filling in crossword puzzles. It’s a nice way to spend time, and it keeps my mind awake rather than numbing it. Reading is nice, but during the day I find this more intriguing.
Unfortunately, a programmer with too much time on his hands is always a problem, because obviously something very nasty might come out of his mind, like, in this case, the idea of digging into solving crossword puzzles by machine.
Just for information, this post is going to have excerpts in Italian, since I haven’t yet tried to complete a crossword puzzle in English (I doubt I’d be able to), and so I’ll be focusing on solving Italian crossword puzzles. I’m sorry if this reduces the scope of the post, but I’m just very bored and this post reflects that.
Also, I haven’t dug into this at all yet, and past the initial 50MB/day of flat rate I’m on an expensive connection, so I’m probably going to say lots of technically wrong things; take this post with a whole packet of salt.
Crosswords vary a lot, especially between authors and over time, because they closely reflect Italian social life. For instance, “La provincia di Cogne” (literally “the province of Cogne”, i.e. Aosta) would probably have been impossible to find, or to know, before the murder in that town monopolised the media’s attention for years. Still, there are a few constant elements that could easily be solved automatically.
This usually applies to the definitions used for short two-letter words, like the following ones, taken from an actual crossword:
- Le ha doppie l’ufficiale (“Ufficiale” has pairs of these): FI, as those are the two letters that are found twice in the word “Ufficiale”;
- In centro e nel sobborgo (In “centro” and in “sobborgo” — In center and suburbs): RO as those are the two ordered letters found in both words;
- Iniziali della Allende (Initials of Allende): IA for Isabel Allende;
- Le consonanti in azione (The consonants in “azione” — The consonants in action): ZN as the two consonants in the word “azione”;
- In mezzo al buio (In the middle of “buio” — In the middle of dark): UI as the two letters in the middle of the word “buio”;
- Ex-sigla di Forlì (Former code of Forlì): FO, as the previous province code for Forlì (nowadays FC for Forlì-Cesena);
- and many more definitions along the same lines.
There is of course a big problem here in splitting the definitions into tokens so that the computer can understand what the subject of the definition is, but it’s not impossible. Once you know which word to look at for a purely letter-wise definition (like the first, second, fourth and fifth above), it’s trivial for the computer to work out the answer, even without having seen it before; just like a human, though, a software solver would need memory for the third and sixth ones.
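To give an idea of how little work the purely letter-wise definitions need, here is a rough Python sketch. It assumes the hard part is already done, that is, that the subject word has been extracted from the definition; the function names are made up for this post and not taken from any existing solver.

```python
from collections import Counter

# Italian vowels, accented forms included (an assumption of this sketch).
VOWELS = set("aeiouàèéìòù")


def _letters(word):
    """Lower-cased alphabetic characters of a word, in order."""
    return [c for c in word.lower() if c.isalpha()]


def doubled_letters(word):
    """Letters occurring exactly twice, in order of first appearance."""
    letters = _letters(word)
    counts = Counter(letters)
    found = []
    for c in letters:
        if counts[c] == 2 and c not in found:
            found.append(c)
    return "".join(found).upper()


def common_letters(first, second):
    """Letters present in both words, in the order of the first word."""
    other = set(_letters(second))
    found = []
    for c in _letters(first):
        if c in other and c not in found:
            found.append(c)
    return "".join(found).upper()


def consonants(word):
    """All non-vowel letters of the word, in order."""
    return "".join(c for c in _letters(word) if c not in VOWELS).upper()


def middle_letters(word, count=2):
    """The `count` letters in the middle of the word."""
    letters = _letters(word)
    start = (len(letters) - count) // 2
    return "".join(letters[start:start + count]).upper()


print(doubled_letters("ufficiale"))          # Le ha doppie l'ufficiale
print(common_letters("centro", "sobborgo"))  # In centro e nel sobborgo
print(consonants("azione"))                  # Le consonanti in azione
print(middle_letters("buio"))                # In mezzo al buio
```

Each function is a handful of lines; deciding which of them a given definition calls for is where the real work would be.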
In the case of initials for a public figure, again just like a human, the software could fill in the surname initial to begin with, even without knowing the other one, and then be helped by a list of names. Sufficiently sophisticated software could eventually learn the definition once the other cell is filled with a sure enough value. As far as province codes are concerned, which are used quite often, it would be very easy, as most of those definitions are just “Venezia” for “VE” (the name of Venice and the Venice province code), and there is just one list of provinces, which rarely changes.
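The two memory-based cases could look roughly like this. Again, it is only a sketch under assumptions: the tables are tiny hand-written stand-ins for real lists of names and provinces, `?` is my own placeholder for a cell that is still unknown, and the function names are invented for illustration.

```python
# Hypothetical, hand-written stand-ins for real reference lists.
KNOWN_PEOPLE = ["Isabel Allende", "Megan Gale"]
PROVINCE_CODES = {"Venezia": "VE", "Aosta": "AO", "Forlì-Cesena": "FC"}
FORMER_PROVINCE_CODES = {"Forlì": "FO"}


def initials(surname, people=KNOWN_PEOPLE):
    """Candidate initials for a surname.

    The surname initial is always known; the first-name initial comes
    from the list of people, or stays '?' when nobody matches.
    """
    surname_initial = surname[0].upper()
    matches = [p for p in people if p.split()[-1].lower() == surname.lower()]
    if not matches:
        return ["?" + surname_initial]
    return sorted({p.split()[0][0].upper() + surname_initial for p in matches})


def province_code(name, former=False):
    """Current (or former) province code, or None when it is not listed."""
    table = FORMER_PROVINCE_CODES if former else PROVINCE_CODES
    return table.get(name)


print(initials("Allende"))                 # Iniziali della Allende
print(initials("Rossi"))                   # unknown person: partial fill
print(province_code("Venezia"))            # Venezia
print(province_code("Forlì", former=True)) # Ex-sigla di Forlì
```

Returning a partial answer such as `?R` mirrors what a human does: write down the letter you are sure of, and wait for the crossing word to supply the other.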
Of course it’s impossible to complete a crossword puzzle with just the mechanical solution to these. The complex definitions are the tricky ones, and they would require something more elaborate like neural networks, such as the one a friend of mine is working on for his short university degree in computer engineering. Most of the definitions are just a matter of looking up words from memory, associated with keywords in the definition. Sometimes the same database of public figures could be used to find the names for definitions like:
- La Gale della TV (The Gale from TV): Megan, as in Megan Gale, a TV personality.
But other times you’re just given a few synonyms to find the answer, so associative memory is the only thing you can use to find the solution.
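The crudest possible form of that associative memory is a keyword index: each known answer is stored with the words it tends to be clued with, and candidates are ranked by how many of those words show up in the definition. The sketch below assumes such an index already exists; the two entries are typed in by hand just for the example, and a real solver would need something far richer.

```python
import re

# Hypothetical memory: answer -> keywords seen in its past definitions.
MEMORY = {
    "MEGAN": {"gale", "tv"},
    "AOSTA": {"provincia", "cogne"},
}


def tokens(text):
    """Lower-cased words of a definition, punctuation dropped."""
    return set(re.findall(r"\w+", text.lower()))


def candidates(definition, length, memory=MEMORY):
    """Answers of the right length, best keyword overlap first."""
    words = tokens(definition)
    scored = []
    for answer, keywords in memory.items():
        if len(answer) != length:
            continue
        score = len(words & keywords)
        if score:
            scored.append((score, answer))
    # Highest score first; ties broken alphabetically for stable output.
    return [a for _, a in sorted(scored, key=lambda p: (-p[0], p[1]))]


print(candidates("La Gale della TV", 5))
print(candidates("La provincia di Cogne", 5))
```

Filtering by length first is cheap and already throws away most of the memory; the letters filled in by crossing words could narrow the list further.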
I’ll think about this a little more, since I have time here in the hospital, and maybe I’ll actually try to come up with something in the distant future, perhaps while I’m convalescing from the surgery.