Europe and USA: my personal market comparison

While I have already announced that I’m moving to London, I don’t want to give the idea that I don’t trust Europe. One of my acquaintances, a eurosceptic, thought it apt to congratulate me for “dropping out of Europe” when I announced my move, but that couldn’t be further from my intention. As I have said repeatedly by now, my decision comes down to my lack of a social circle in Dublin, and to feelings of loneliness that really need to be taken care of.

Indeed, I’m more than a Europeist, I’m a globalist, insofar as I don’t see any reason why we should have borders, or limitations on travel. So my hope is not just for Europe to become a bigger, more unified bloc. Having run a business for a number of years in Italy, where business rules are overly complicated, and where the tax system assumes you’re a cheater by default and fines you if you don’t invoice enough, I would have seriously loved the option of having a “European business” rather than an “Italian business” — since a good chunk of my customers were based outside of Italy anyway.

This concept of a “European business”, unfortunately, does not exist. Even VAT handling in Europe is not unified: even though we have at least a common VAT ID registration, back when I set up my business it required an explicit registration with the Finance Ministry to be able to use the ID outside of Italy. At the time, at least, I think Spain had also opted out of registering their VAT IDs on the European system by default. Indeed, that was the reason why Amazon used to run separate processes for most European business customers on one side, and for Italian and Spanish customers on the other.

Speaking of Amazon, those of you reading me from outside Europe may be surprised to learn that there is no such thing as “Amazon Europe” – heck, we don’t even have an Amazon Ireland! – at least as a consumer website. Each country has its own Amazon website, with similar, but not identical, listings, prices and “rules of engagement” (what can be shipped where, and at which prices). For customers this has quite a few detrimental effects: prices may be lower on a country’s store that they would not usually look at, and they may have to weigh their options based on price, shipping restrictions and shipping costs.

Since, as I said, there is no Amazon Ireland, living in Dublin is also an interesting exercise with Amazon: you may want to order things from Amazon UK, either for language reasons, or simply because an item requires a power plug and Ireland uses the same British plug as the UK. And most of the shipping costs are lower, either by themselves, or because there are re-mailers from Northern Ireland to Dublin, if you are okay with waiting an extra day. But at the same time, you’re forced to pay in GBP rather than Euro (well, okay, not forced, but at least strongly advised to — Amazon’s currency conversion has a significantly worse exchange rate than any of my cards, especially Revolut), and some of the sellers will flatly refuse to ship to Ireland, for no specific reason. Sometimes you can actually buy the same device from Amazon Germany, which will then ship it from a UK-based warehouse anyway, despite the item not being available to ship to Ireland from Amazon UK. And sometimes Amazon Italy may be a good 15% cheaper (on a multiple-hundreds-of-euro item) than Amazon UK.

So why does Amazon not run a single European website? And why does no European-native alternative appear? It looks to me like the European Union and its various bodies and people keep hoping to find European-native alternatives to the big American names, at least on paper, probably in the hope of not being tied to the destiny of America, whatever comes down the line, particularly given how things have gone with current politics on all sides. But in all their words, there does not appear to be any plan for opening up opportunities to create cross-Europe corporations.

The current situation of the countries that make up Europe, in contrast with the States that make up the USA, is that you are just not allowed to do certain types, or levels, of business in all the countries without registering and operating as a company in each of them. That is the case, for instance, for phone operators, which get licensed per country, and so each operates independent units. This sometimes becomes ludicrous: you have Vodafone providing services in about half of Europe, but with units so independent that their level of competence, for instance on security and privacy, is extremely variable. In particular, it looks like Vodafone Italy still has not learnt how to set up HTTPS correctly: despite logging you in over a TLS-encrypted connection, it does not mark the session cookie as secure, so a downgrade to plain HTTP is enough to steal authentication cookies. In 2017.
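
A cookie issued at login over TLS but without the Secure attribute will also be attached to any plain-HTTP request, which is exactly what a downgrade exploits. A minimal sketch of the fix, using Python’s standard library (the cookie name and value are made up for illustration):

```python
from http.cookies import SimpleCookie

cookie = SimpleCookie()
cookie["session"] = "s3cr3t"
cookie["session"]["secure"] = True    # never sent over plain HTTP
cookie["session"]["httponly"] = True  # not readable from page scripts

# Without the Secure flag, forcing the browser onto http:// (a
# "downgrade") makes it attach the cookie to the unencrypted request.
print(cookie.output())
```

The two attribute lines are the whole difference between a cookie an on-path attacker can read and one they cannot.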

If you remember, when I complained about the half-baked results of the roaming directive, I suggested that one of my favourite options would be a “European number”: just give me a special “country code” that can be replaced by any member’s code with the phone number remaining valid, and appearing local. This is important because, while the roaming directive allows me to keep my regular Irish (for now) SIM card in my phone while travelling to the UK or Finland, it does not give me a local phone number. And since signing up for some local services, including sometimes the free WiFi hotspots of various cafes and establishments, relies on being able to receive a local SMS, it is sometimes more of a hindrance than a favour.

Both Revolut and Transferwise, as well as other similar “FinTech” companies, have started providing users with what they call “borderless” accounts: Euro, Sterling and Dollar accounts all in one system. Unfortunately, this is only half of the battle. I particularly welcome Revolut’s option of a single balance that can serve all the currencies on a single card. But this only works up to a point, because these accounts are “special” — in particular, the Revolut Euro account comes with a Lithuanian IBAN but a UK BIC, which makes a few systems that still expect the two to match throw up. And this is not even going into how SEPA Direct Debit just does not work: my Italian services can only debit an Italian bank, my Irish services can only charge an Irish bank, and my one French service can only charge a French bank. Using credit cards via VISA actually has a better success rate for me, even though Vodafone Italy, at least, can only charge one specific one of my credit cards, rather than any of them. Oh, and let’s not forget that in Ireland you just can’t get your salary paid into a non-Irish bank account.
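
The IBAN/BIC mismatch is easy to reproduce: an IBAN carries its country code in the first two characters, while a BIC carries it in characters five and six, and some legacy payment systems reject the combination when the two disagree. A sketch (the account numbers and BICs below are made up for illustration, not real accounts):

```python
def same_country(iban: str, bic: str) -> bool:
    # An IBAN starts with an ISO 3166 country code ("LT12...");
    # a BIC holds its country code in positions 5-6 ("XXXXGB2L").
    return iban[:2].upper() == bic[4:6].upper()

# A "normal" account: German IBAN with a German BIC passes the check.
print(same_country("DE89370400440532013000", "BANKDEFF"))  # True

# A "borderless" account: Lithuanian IBAN with a UK BIC is a perfectly
# valid SEPA combination, yet this naive check rejects it outright.
print(same_country("LT121000011101001000", "XXXXGB2L"))    # False
```

Nothing in the SEPA rulebook requires the two countries to match; the check only survives because it used to be a safe assumption.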

Banks in Europe end up operating as country-wide silos, to the point that even Ulster Bank Republic of Ireland cannot (or at least, can no longer) provide me with an Ulster Bank Northern Ireland bank account — or, to be precise, cannot act on my already-existing foreign bank account that is open in Northern Ireland. And because of all this, the moment I actually move to London I’ll have to figure out how to get a proper account there. I’m already having trouble opening an account there, not because I don’t have the local tax ID, but because they need proof of employment from a UK company, while I’m still employed by the Irish company. Of the same multinational. Oh my.

You could say that banks and telcos are special cases. They are partial monopolies, and there are good reasons why they should be administered on a country-by-country basis. But the reality is that in the United States these things are mostly solved — plenty of telco matters are still pretty much local, but that’s because of network access and antitrust requirements, as well as, to a point, the need to build and service local infrastructure (a solution to this is to split the operation of the telco from the provider of the physical infrastructure, but that comes with its own problems). At the very least, banking in the US is not something people have to deal with when changing State, or when working with companies from other States.

These silos are also visible to consumers in other, less obvious, forms. TV, movie and similar rights are a problem in the same way. Netflix, for instance, will only show a subset of the programming they have access to, depending on the country you’re currently located in. This is because, except for the content they produce themselves, they have to acquire rights from the different companies holding them in different countries, where different TV networks may already have secured the rights and not want to let Netflix broadcast in their place.

I brought up this part last, despite it probably being the one most consumers know or even care about, because it shows the other problem that companies trying to build up support across Europe, or even to start as Europe-native companies, have to deal with. TV networks are significantly more fragmented than in the USA. There is no HBO, despite Sky being present in a number of different countries. There is nothing akin to CNN. There are a number of 24-hour news channels reachable over more-or-less freeview means, but the truth is that if you want to watch TV in Europe, you need a local company to provide it. And the reason is not one that is easy to solve: different countries just speak different languages, sometimes more than one each.

It’s not just a matter of providing a second audio channel in a different language: content needs to be translated, and sometimes adapted. This is very clear in video games, where some countries (cough Germany cough) explicitly require cutting content, to avoid upsetting something or someone. Indeed, video game releases for many platforms (in the past at least including PC, though luckily that appears not to be the case nowadays) end up shipping only a subset of European languages at a time. Which is why I loathed playing Skyrim on the PlayStation 3: the disc only includes Italian, French and German, but no English, which would be my default option (okay, nowadays I would probably play it in French to freshen up my understanding of it).

For American start-ups – but this is true also for open source projects, and for authors of media such as books, TV series or movies – internationalization and localization are problems that can easily be shelved in the “after we’re famous” pile. First make the fame, or the money; then export and care about other languages. In Europe that cannot possibly be the case. Even for English, which in the computer world is still, for now, the lingua franca (pun intended), I wouldn’t expect a majority of users to be happy using non-localized software, particularly when you consider differences in date handling as part of that localization. I mean, I started using “English (UK)” rather than the default American for my Google account years ago just because I wanted a sane date format in Gmail!
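
The date-format point is worth spelling out, because the same digits mean different things on each side of the Atlantic:

```python
from datetime import date

d = date(2017, 4, 3)  # the 3rd of April

print(d.strftime("%m/%d/%Y"))  # 04/03/2017 - US convention
print(d.strftime("%d/%m/%Y"))  # 03/04/2017 - UK/European convention

# Both render as plausible dates, so a reader cannot tell which
# convention was used; only an unambiguous format avoids the problem.
print(d.isoformat())           # 2017-04-03 - ISO 8601
```

This is exactly the class of ambiguity a localized interface has to resolve for you, one locale at a time.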

All of this makes it harder for most projects, companies, and even ideas to move through the fragmented European market as fast as through the American or (I presume, but do not know enough about it) the Chinese market, where a much wider audience can be gained without spending so much effort dealing with cross-border bureaucracy and cross-culture porting. But let me be clear: I do not think the solution is to normalize Europe onto a single language. We can’t even do that within countries, and I don’t think it would be fair to anyone to even consider it. What we need is to remove as many other roadblocks as it is feasible to remove, and then come up with an easier way to fund translation and localization processes, or an easier way to acquire rights at a Union level rather than country by country.

Unfortunately, I do not expect this to happen in my lifetime. I still hope we’ll end up with a United Federation of Planets at the end of the day, though.

I’m just absolutely insane at this point

You might remember that I have had some nasty problems with my surname and internationalisation in general. What I didn’t post at the time, because I was still unsure about it, was something very Internet-related involving my surname.

A few months ago, in a job-related call, I had to leave my email address with someone who either is not very comfortable with English or just didn’t like the “hackers’ way” of having a pseudonym, and who suggested that “flameeyes” is too difficult to write and that I should just use my full name, at least for job-related mail. Beside the obvious concern that if I am to work for somebody who thinks “flameeyes” is no good, I’m probably not doing a job I’d enjoy (I sincerely hope to find something that would let me just use “flameeyes” as my company username, although I admit that might not be feasible unless I’m self-employed — which, I guess, I ironically could do, since things changed after I drafted this post and I am now self-employed…), there are two main drawbacks to doing that. Again, my surname gets in the way: I doubt any mail server would be able to accept “pettenò” as part of the username, not even in punycode, and “diego.elio.pettenò@something” is unlikely to be easy to spell out anyway.

I actually thought about making a point of it. I am disappointed when people seem to think that “Pettenò” and “Petteno” are just the same and equivalent; sorry, they are not. It’s not just a decoration on the “o”. I can understand that an English speaker wouldn’t know how to tell the difference, and I’m sure they also find it difficult to type, but besides the fact that they can copy it, what I get upset about is when it’s not handled in the databases where I register myself. I’m sure that spelling it out over the phone is always going to be a problem, but I get upset when automated systems can’t handle it, because we have the tools. We’re not in the ’80s, when multiple 8-bit codepages were considered cool; it’s 2009, and UTF-8 is available everywhere.

Or almost everywhere. Although IDNs (Internationalised Domain Names) actually exist, allowing Unicode characters to be used in domain names, they are not enabled by default on all TLDs: they are not, for instance, in either the .it ccTLD (which would be my country’s), nor in the .eu TLD (which would be my choice, as you can see from my own blog), nor do all the TLDs that support them enable all symbols. This is, of course, because there have been quite a few security concerns regarding their use: differentiating between variants of the same character, as allowed by Unicode, is far from trivial for the human eye, and that has allowed for some quite elaborate phishing attacks. Indeed, if I were to use a UTF-8 encoded domain name under the .it TLD in Firefox, it would show up in its punycode encoding, because it’s just not considered safe.
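
For the curious, the punycode (“ASCII-compatible”) form of a label is easy to inspect with Python’s built-in IDNA codec; a quick sketch:

```python
# Encode an internationalised label to its ASCII-compatible form.
label = "pettenò"
ace = label.encode("idna")
print(ace)                 # b'xn--petten-8wa'

# The mapping round-trips, so software that only speaks ASCII can
# carry the xn-- form and still recover the original name.
print(ace.decode("idna"))  # pettenò

# The phishing concern: a Cyrillic 'о' (U+043E) looks identical to the
# Latin 'o', yet encodes to a completely different domain.
print("petten\u043e".encode("idna"))
```

The last line is why browsers fall back to showing the raw xn-- form on TLDs without a vetted character policy.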

Interestingly enough, the Spanish ccTLD .es not only allows IDNs including the “ò” character, but also seems to have no limitations on the registration of domains by foreigners (I have to thank Santiago – Coldwind – for letting me know about that!). While my local registrar does not seem to offer .es as an option, the one on the other side of the ocean does. R4L isn’t cheap, but it does a good job for the xine domain, and since I’m likely to do some tricky stuff with it, I might well need some help on the domain-handling side. (TopHost, with whom I registered this domain, is quite cheap, but still does not allow me to drop the “www.” third-level domain for the main site, and doesn’t have quick resolution; R4L, while having given me some trouble when I first registered the domain, has always been quick and friendly.) The choice of R4L was really a good one: thanks to the guys there I could get the IDN domain registered, after it failed at the first automated attempt!

So in the end I decided to bite the bullet and basically waste sixty of our eurobucks on this crazy idea, registering three .es domains: pettenò.es and two others. I really don’t like the idea of keeping the last two myself, as they are not my surname, and I’m probably open to transferring them to someone who actually bears that name; on the other hand, I’m expecting disaster from most applications, so I’m trying to cover my ass and avoid non-IDN-enabled software sending mail to the wrong address (that would be a funny security issue).

I want to set a few things up before starting, but I’m going to start using some new addresses on the main domain (pettenò.es) quite soon; you probably won’t see my mail going out from that domain at first, but I’m going to use it to register on sites and to try accessing mailing lists. It’s going to be fun.

I know there probably are people already using IDN domains out in the wild, but the reason I want to look at this first-hand is, first of all, that I’m quite the nitpicker, especially for Free Software; you are most certainly going to see changes in Free Software I work on if it doesn’t work with my new shiny IDN-enabled domain. Second, I’m going to try them in an environment where they are not expected; Spanish web developers have a better chance of having encountered IDNs before (I remember www.elpaís.es used as an example), and thus of having taken care to support them, than their colleagues in Italy. I’m going to have so much fun with the Italian bureaucracy with this, I’m sure.

But I’m sure I can find enough problems with Free Software, especially in configuration files; I’m also sure that a lot of projects wouldn’t consider those problems, because, you know, supporting UTF-8 everywhere would be like accepting that English is not the only language… and it so happens that it is not! And I get pretty vocal when it comes to supporting my full name properly. Maybe I just have problems and need the help of someone good, or maybe I’m just not ready to settle for mediocre software.

We’ll see.

Update (2016-04-29): These domains are no longer under my control, although I still own http://pettenò.eu.

The UTF-8 security challenge

I make no mystery of the fact that I like my surname to be spelt correctly, even if that’s internationally difficult. I don’t think that’s too much to ask, sincerely; if you want to refer to me and you don’t know how to spell my surname, you have a few other options, starting with my nickname (“Flameeyes”), which I keep using everywhere, including the domain of this blog, because, well, it’s a name as good as my “real” name. While other developers I know, starting with Donnie, prefer to be recognized mainly by their real name, I know my name is difficult to type for most English speakers, so I don’t usually ask that much; Flameeyes was, after all, more unique to me than “Diego Pettenò”, since there are three other people by the latter name just in my city.

But even without going with nicknames, which might not sound “professional”, I’m fine with being called Diego (in Gentoo I’m the only one; as far as multimedia areas are concerned, I’m Diego #2, since “the other Diego”, Biurrun, takes due priority), or, since a few months ago, Diego Elio (I don’t pretend to be unique in the world, but when I chose my new name, besides choosing my grandfather’s name, I also checked I wouldn’t step into the shoes of another developer), or, if you really really need to type my name in full, “Diego Petteno`” (yes, there is an ASCII character to approximate my accent, and it’s not the usual single quotation mark; even the quotation mark, though, works as an attempt, as banks and credit cards do). If you’re in a particularly good mood and want to tease me, you could also use 炎目 (which is probably a too-literal translation of “Flameeyes” into kanji); I think the only person who has ever used that to call me has been Chris (White), and it also does not solve the issue of UTF-8.

Turns out it’s not that easy at all. I probably went a little overboard the other day about one GLSA mistyping my name (it still does), because our security guys are innocent on the matter: glsa-check breaks with UTF-8 in the GLSA XML files, which makes it hard to type my name. (And that is a bug in glsa-check: you should not assume anything about the encoding of an XML file, since each file declares its own encoding!) The reason why I was surprised (and somewhat annoyed) is that I was expecting it to be typed right for once: py handled it, and I’m sure he has the ò character on his keyboard.
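
The point about XML is worth demonstrating: a conformant parser honours the encoding declared in the document’s prologue, so a consumer like glsa-check never needs to guess. A minimal sketch (the element names are made up, not the real GLSA schema):

```python
import xml.etree.ElementTree as ET

# The same name, declared and encoded as ISO-8859-1 rather than UTF-8.
document = ('<?xml version="1.0" encoding="ISO-8859-1"?>'
            '<advisory><submitter>Diego Pettenò</submitter></advisory>')
raw = document.encode("iso-8859-1")

# Feeding the parser bytes lets it read the declaration and decode
# accordingly; the accented character survives regardless of encoding.
root = ET.fromstring(raw)
print(root.find("submitter").text)  # Diego Pettenò
```

Tools that instead read the file as text with a hard-coded encoding are the ones that end up mangling names.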

Curious about this, I also wanted to check how other distributions handle my name. A good chance to do that was provided by CVE-2008-4316 (which I already discussed briefly). The results are funny, disappointing and interesting at the same time.

The oCERT advisory has a broken encoding and shows the “unknown character” symbol (�); on the other hand, Will’s mail at SecurityFocus shows my name properly. Debian cuts my surname short, while Ubuntu simply mistypes it; Red Hat, on the other hand, shows it properly; score one for Red Hat.

One out of four distributions handling my name correctly is not really good (Gentoo has no GLSA on the matter, but I know what would have happened; nor does the CVE link to other distributions, just a few more security-focused sites I’m not interested in at the moment). In particular, I’m surprised that the one distribution getting it right is Red Hat, since the other two are the ones I usually see named when people talk about localising Free Software packages. Gentoo at least does not pretend to be ready for internationalisation in the first place (although we have a GLEP that does).

Okay, I certainly am a nit-picker, but it’s 2009, there are good ways to handle UTF-8, and the only obstacles I see nowadays are very old legacy software, and English speakers who maintain that seven bits are enough to encode the world, which is not true by a long shot.

International problems

I’m probably quite a strange person myself, that much I knew, but I never thought I would have so many problems with internationalisation, especially on Linux, though not limited to it. I have written before about the problems I have with my name (and a similar issue happened last week, when the MacBook I ordered for my mom was addressed by TNT to “Diego Petten?” – which then couldn’t be found properly in their computer system when looking up the package by name), but lately I have been having even worse problems.

One of the first problems happened while mailing patches with git to a mailing list hosted by the servers; my messages were rejected because I used “Diego E. ‘Flameeyes’ Pettenò” as the sender, without double quotes around it. Per the email RFCs, when a period is present in a sender or destination name, the whole name has to be enclosed in double quotes, but git does not seem to know about that, and sends invalid email messages that get rejected. Even adding the escaped quotes in the configuration file didn’t help, so in the end I send my git email with my (new) full name “Diego Elio ‘Flameeyes’ Pettenò”, even if it’s tremendously long and boring to read, and Lennart scolded me because I now figure under three different aliases in PulseAudio (on the other hand, ohloh handles that gracefully).
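
Python’s email module gets this right, which makes the rule easy to see: a display name containing a period (a “special” in the RFC grammar) must be emitted as a quoted string, while a plain name needs no quoting. A quick sketch with a hypothetical address:

```python
from email.utils import formataddr

# The period after "E" forces the display name into a quoted string.
header = formataddr(("Diego E. Petteno", "diego@example.org"))
print(header)  # "Diego E. Petteno" <diego@example.org>

# A name without specials is left bare.
print(formataddr(("Flameeyes", "diego@example.org")))
# Flameeyes <diego@example.org>
```

git at the time emitted the first form without the quotes, which strict mail servers are entitled to reject.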

A little parenthesis, if you’re curious where the “Elio” part comes from: I legally changed my name last fall, adding “Elio” to my first name (it’s not a “middle name” in the strict meaning of the term, because Italy does not have the concept of a middle name; it’s actually part of my first name). The reason for this is that there are four other “Diego Pettenò” in my city, two of them around my age, and the Italian system is known for mistaking identities; while adding to my first name does not make me entirely safe, it should make a mistake less likely. I chose Elio because it was my grandfather’s name.

So this was one of the problems; nothing really major, and it was solved easily. The next problem happened today, when I went to write some notes about extending the guide (for which I still fail to find a publisher; unless I find one, it’ll keep the open-donation approach) and, since the amount of blogging on the subject lately has been massive, I wanted to make sure I used proper typographical quotation marks. It would have been easy to type them on OS X, but on Linux it seems to be quite a bit more difficult.

On OS X, I can reach the quotation marks on the “1” and “2” keys, adding the Option and Shift keys accordingly (single and double, open and closed); on Linux, with the US English Alternate International keyboard I’m using, it is quite a bit harder. The sequence is something like Right Control, followed by AltGr and ' (or "), followed by < or >. Even if I didn’t have to use AltGr to get the proper keys (without AltGr, the two symbols on the Alternate International keyboard are “dead keys”, used for composing, which is quite important since I write both English and Italian with the same keyboard), it’s quite a clumsy way to access the two. And it also wouldn’t work with GNU Emacs on X11.

My first idea was to use xmodmap to change the mappings of “1” and “2”, adding third and shifted-third levels just like on OS X. Unfortunately, adding extra levels with xmodmap seems to only work with the “mode switch” key rather than with the “ISO Level 3” key; the final result is that I had to “sacrifice” the right Command key (I use an Apple keyboard on Linux) to use as “mode switch” (keeping the right Option as Level 3 shift), and then map the “1” and “2” keys the way I wanted. The result is usable, but it also means that all the modifiers on the right side have completely different meanings from what they were designed for, and it is not easy to remember them all.
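
For reference, the sort of mapping involved looks like this as an ~/.Xmodmap fragment; treat the keycodes and the keysym choices as assumptions about my specific setup (keycodes 10 and 11 are usually “1” and “2” under Xorg), not a general recipe:

```
! keysym positions: base, Shift, Mode_switch, Mode_switch+Shift
keycode 10 = 1 exclam leftsinglequotemark leftdoublequotemark
keycode 11 = 2 at rightsinglequotemark rightdoublequotemark
```

Loaded with `xmodmap ~/.Xmodmap`, this makes Mode_switch+1 produce ‘ and Mode_switch+Shift+1 produce “, roughly mirroring the OS X arrangement.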

I thought about using the Keyboard Layout Editor, but it requires antlr3 for Python, which is not available in Gentoo and seems difficult to package, so for now I’m stuck with this solution. Next week, when the iMac arrives, I’ll probably spend some more time on the issue (I already spent almost the whole afternoon, more than I should have); I’d sincerely love to be able to set up the exact same keyboard layout on both systems, so I don’t have to remember which one I’m on to get the combinations right. I already publish my custom OS X layout, which basically implements the Xorg Alternate International layout on OS X (the same layout is already available on Windows as “US International”, so OS X was the only one lacking it), so I’ll probably just start maintaining layouts for both systems in the future.

And I don’t even want to start talking about setting up proper IME for Japanese under this configuration…

A question of names

I have wanted to write about names and their spelling since a post by Michael S. Kaplan, but for one reason or another I postponed it until now. I decided to return to the topic because, for the whole day at the hospital so far, my name has been regularly misspelled, and considering I’m not even hitting problems with a “foreign” name here, it makes me quite upset, sincerely, as it’s all due to the way software has been written.

As you most likely know if you read my blog, wherever you happen to read it, my name is Diego Pettenò (well, this is not going to be technically right in a few months, but that’s another point altogether). You can see there is something “funny” on the final “o” of my surname. If you’re American, you might not know that it’s called an “accent”, and that it indicates the proper way to pronounce the name. I guess one of the reasons English is considered easier than French, Italian and Spanish is that it lacks accents.

What is the problem? As the computers used nowadays all seem to derive from English-based designs, they still base themselves on the ASCII table, and the ASCII table makes it very difficult to handle special characters, which include “ò”. On some systems, like the credit card system as far as I can tell, this is handled by replacing the accent with a quotation mark, making my name Diego Petteno’; not exactly my name, but it comes closer than the “Diego Petteno” that many other systems use. This is especially annoying because “Petteno” (with no accent) is a different surname in this area, so I make it a point to distinguish between the two.
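
What those systems do is, in effect, strip the accent in an ASCII transliteration step; in Unicode terms that is a decomposition followed by dropping the combining mark, which is easy to show (and shows exactly what information gets lost):

```python
import unicodedata

name = "Pettenò"

# NFD splits 'ò' into 'o' + U+0300 COMBINING GRAVE ACCENT;
# encoding to ASCII with errors="ignore" then silently drops the accent.
ascii_name = (unicodedata.normalize("NFD", name)
              .encode("ascii", "ignore")
              .decode("ascii"))
print(ascii_name)  # Petteno - a different surname in my area!
```

The operation is lossy by design: there is no way back from “Petteno” to “Pettenò”, which is why it matters that the database keeps the real spelling.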

It is even worse when you move away from the Venice area, where both surnames are quite common, into the Verona area, where at least Pettenò is not; I’ve been called Petteno all day, and I don’t like it. And this is while staying in the same country, actually in the same region. I don’t even want to know how people whose main alphabet is not the Latin one feel about this, with forms to be filled in with an approximation of their actual name.

I always sign up with my full proper name when I can, but a few times I’ve been asked to remove the “non-letters” from my surname, and the scary thing is that lately this seems to happen more often with Italian sites than American ones. It is not possible, for instance, to issue a wire transfer to “Diego Pettenò”; you have to round it down to “Diego Petteno”, even when using SEPA (the Single Euro Payments Area, which amounts to a global “namespace” for wire transfers in the Euro area; note that European languages are quite full of special characters – English is the only one I can think of without them – many of them much more “complex” than “ò”).

And don’t even try to get me started about katakana passwords :P

Update (2017-04-28): I feel very sad to have found out over a year and a half later that Michael died. The links in this and other posts to his blog are now linked to the archive kindly provided and set up by Jan Kučera. Thank you, Jan. And thank you, Michael.

Locales, NLS, kernels and filesystems

One issue that is yet to be solved easily by most distributions (at least those not featuring extensive graphical configuration utilities, as Fedora and Ubuntu do) is, most likely, localisation.

There is an interesting blog post by Wouter Verhelst on Planet Debian that talks about setting the locale variables. It’s very interesting reading, as it clarifies the different variables pretty well.

One related issue seems to be understanding the meaning of the NLS settings available in the kernel configuration for Linux. Some users seem to think that you have to enable the codepages there to be able to use a certain locale as the system locale.

This is not the case: the NLS settings there are basically only used by filesystems, and in particular only by the VFAT and NTFS filesystems. The reason for this lies in the fact that both filesystems are case-insensitive.

In the usual Unix filesystems, like UFS, EXT2/3/4, XFS, JFS, ReiserFS and so on, file names are case-sensitive, and they end up being just strings of arbitrary characters. On VFAT and NTFS, instead, the filenames are case *in*sensitive.

For case insensitivity, you need equivalence tables, and those are what the different NLS options define. For instance, in Western locales the characters ‘i’ and ‘I’ are equivalent, but in Turkish they are not, as ‘i’ pairs with ‘İ’ and ‘I’ with ‘ı’ (if you wish to get more information about this, I’d refer you to Michael S. Kaplan’s blog on the subject).

So when you need to support VFAT or NTFS, you need to load the right NLS table, or your filesystem will end up corrupted (with a Turkish table, you can have two files called “FAIL” and “fail”, as the two letters are not the same). This is the reason why you find the NLS settings in the filesystems section.
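
The idea behind those tables can be sketched in a few lines; here each table maps the locale-specific case pairs, the way the kernel’s per-codepage NLS tables do (the table contents are illustrative, not the kernel’s actual data):

```python
def fold(name: str, table: dict) -> str:
    # Lowercase each character through a locale-specific equivalence
    # table, falling back to the generic mapping.
    return "".join(table.get(c, c.lower()) for c in name)

western = {}                    # the generic 'I' -> 'i' mapping suffices
turkish = {"I": "ı", "İ": "i"}  # dotted and dotless i are distinct

# Same file on a Western-configured VFAT volume...
print(fold("FAIL", western) == fold("fail", western))  # True
# ...but two different files under the Turkish table.
print(fold("FAIL", turkish) == fold("fail", turkish))  # False
```

Mount a volume with the wrong table and names that should collide no longer do (or vice versa), which is exactly the corruption described above.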

Of course, one could say that HFS+, used by Mac OS, is also case-insensitive, so the NLS settings should apply to it too, no? Well, no. I admit I don’t know much about the historical HFS filesystems, as I only started using Mac OS at version 10.3, but at least since then, filenames are saved as Unicode (in a decomposed form), which comes with well-defined equivalence tables. So there is no need to select an option: the equivalence table is defined as part of the filesystem itself.

Knowing this, why does VFAT not work properly with UTF-8, as the kernel warns when you mount it with iocharset=utf8? The problem is that VFAT works on a per-character equivalence basis, and UTF-8 is a variable-width encoding, which does not suit VFAT well.
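To make it concrete where these options end up being used, here is what a hypothetical /etc/fstab entry for a VFAT volume might look like on a Western-European system (the device and mount point are made up for the example): codepage selects the table used for the on-disk DOS short names, while iocharset is the character set long names are converted to when presented to userspace.

```
# hypothetical /etc/fstab line; /dev/sdb1 and /mnt/usb are examples
/dev/sdb1   /mnt/usb   vfat   codepage=850,iocharset=iso8859-1,noauto,user   0 0
```

Both tables named here have to be enabled in the kernel’s NLS settings (or built as modules) for the mount to succeed.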

Unfortunately, make oldconfig and make xconfig seem to replace, at least on my system, the default charset with UTF-8 every time, maybe because UTF-8 is the system encoding I’m using. I guess I should check whether it’s worth reporting a bug about this, or whether I can fix it myself.

Update (2017-04-28): I feel very sad to have found out over a year and a half later that Michael died. The links in this and other posts to his blog are now linked to the archive kindly provided and set up by Jan Kučera. Thank you, Jan. And thank you, Michael.

Some thoughts about internationalisation and common strings

One big issue with translating software into a language, even if it is your first language, is that you might not be sure which term is used to translate a given English one. This is even worse when the software is targeted at specific areas of interest that might not be common knowledge.

This quite interesting entry on Michael Kaplan’s blog shows that even Microsoft can get localisation wrong (and mind you, Microsoft and Apple seem to pour a lot of money into localisation, and are usually better at it than Free Software, although amateur Free Software has a better chance of being translated than non-Free “gratisware”). In particular, quoting Michael:

This does not speak against linguists as linguists, but it can be looked at as speaking a little bit against linguists as localizers, since the skillsets and what makes people great in these two very different jobs are not entirely overlapping skills.

So this does not really mean that you need to be a linguist, or a professional translator, to be able to properly translate software. Which is nice, especially for Free Software.

But you certainly have to find a way to properly handle terms that are specific to a field. In textbooks translated from English to Italian in the ‘80s, it was common for the word array, in programming, to be translated as “*schiera*”… which is right, linguistically, but horrible to think of. Today, array is left in English and written in italics. Similarly for directory and direttrice (most younger Italian people will probably think I’m lying, but it’s unfortunately true).

One way to get around this is to use a glossary that provides translations for common terms in multiple languages. It’s not my idea: I read about it on Kaplan’s blog, although unfortunately, by the time he wrote about it, it had been removed from Microsoft’s website.

Another way, more common in Free Software, and one that should really be used more, is to centralise some of the user interface handling. KDE and GNOME do this by defining a series of default actions and dialogs that can be reused in different software, presenting the user with the same text (modified when it has to be, for software names or other details like that), and requiring only one translation.

It would be nice to apply this to ebuilds too: present the user with the same interface and similar error messages when a USE flag has to be updated, and try to reduce the amount of new text in every ebuild, init script or other script that users need to interface with.

This would be a nice first step towards having, in the future, internationalised ebuilds. Yes, there are tons of messages to translate, but they usually don’t change much, and if we get a more externally-accessible repository, say with Git, in the future, we could get translators, as well as proxy maintainers, to commit translations on their own for USE flag documentation (metadata.xml) and for ebuilds’ messages.


Call for translators for xine-lib-1.1.9

Next month KDE 4.0 is going to be released; before that can happen, we need to release xine-lib-1.1.9 which contains some fixes that Matthias Kretz (the Phonon maintainer) added to improve the support for the xine engine.

This time, as the release is actually scheduled to happen in a short timeframe, I want to try something different: I’m calling for translators.

xine-lib, like many other projects, uses gettext to internationalize its user-directed messages and information. At the moment there isn’t a real translation team; different contributors tend to update the messages from time to time, but many languages are quite behind on the current set of strings that xine provides.

I usually take care of the Italian translation, so I’ll probably update that tomorrow, and in any case before the release is done. If you’re good at translating from English to your native language, please either look at the gettext manual (if you don’t know how gettext works) or just check out a copy of xine-lib from the repository (you want the xine-lib/xine-lib repository), and start translating.

Either send the patches on the xine-devel mailing list (see on SourceForge, or use GMane), or open a bug on the xine tracker with the translation patch as attachment.

The output of “hg export” of a local commit is preferred, so that your name and email are properly listed in the log.

Thanks in advance from me and the rest of the xine project team.

Tips for localizers, even from Microsoft

Many might not remember this, because I haven’t blogged on that topic in quite a long time, but one of my interests is also localization of software. This mainly springs from the fact that I saw, and continue to see, people who don’t even want to use Linux because they don’t know English, nor are they willing to learn it just for that.

For instance, earlier this year I wrote a proposal for internationalizing ebuilds. I also tried coming up with a feasible way to internationalize init scripts without having to deal with a different script for every possible language.

With my interest in internationalization and localization, I also started following a Microsoft employee’s blog, Sorting It All Out by Michael S. Kaplan. It’s a very interesting blog that deals with internationalization issues like language support, Unicode and keyboards. Even if it’s of course mostly centred on Microsoft products, it’s still an enlightening read for Free Software developers, as it tends to explain the reasons for some of the choices made in Windows to properly support internationalization.

Also, the blog is quite interesting because it casts a critical eye on the problems, showing even what Microsoft did wrong, and those are errors that other developers should really learn from. He is also a Mac user, besides the obvious Windows user, so he sometimes compares Apple and Microsoft products, giving an objective look at the implementations. So it’s really recommended reading for any of my readers interested in internationalization and localization problems.

Anyway, today I read his entry about redundant messages, and it confirmed something for me: Free Software developers should really learn to check out technology blogs even when they come from “the other side”.

That’s really a common mistake in free software too: using way too many strings to convey the same message. This makes translation quite a bit more difficult, sometimes very difficult, and the redundant messages may confuse users pretty badly as well.

The same applies to non-identical messages, and I’m actually seeing this in xine right now. The descriptions of plugins are internationalized, but even similar plugins use very different descriptions. This means that fuzzy translations can’t really help with translating new plugins.

So for 1.2, one of the entries in my personal TODO list (which I should remember to copy to xine’s actual TODO) is to design and document a proper scheme to be followed by plugin descriptions. This way the descriptions would all follow the same pattern, wouldn’t throw off the user with very different messages, and would make translation quite a bit easier.

Kaplan announced that today’s post will be the first of a long series; I will certainly follow it so that I can learn from it and make xine easier to localize. Then I’ll be hoping that more people will join the xine project to update the translations. On this note, I’ll also write a new entry to call for translators.


Translating ebuilds, a proof of concept

There was only one comment on my previous post, but I didn’t really expect even that one: first because I probably lost most of my visibility now that I’m no longer on Planet Gentoo or Gentoo Universe, and second because I know it’s not an issue that concerns most of the people who use Gentoo at the current stage.

Nonetheless, I wanted to try a proof of concept of ebuild translation. If you’re interested in this, you probably want to fetch my overlay and look at the media-sound/pulseaudio ebuild that is there. Right now there is only an Italian translation, as that’s the only language I can translate to, but it works as a proof of concept for me. To try it out, run emerge -1 pulseaudio with LC_ALL="it_IT" set.

The trick is actually simple once you know it:

messages_locale() {
        locale | grep LC_MESSAGES | cut -d '=' -f 2 | tr -d '"' | cut -d '.' -f 1
}

This function is used to extract the locale currently set for LC_MESSAGES. Why is this needed? Well, it’s simple: you might be using LC_ALL rather than LC_MESSAGES to set the locale, or you might be using just LANG rather than setting the LC_* variables, so in the end, using locale is the best way to make sure we get the proper language for messages as set up on the system. In the example I gave above, by overriding LC_ALL we bypass all the other settings.

local msgfile="${FILESDIR}/${P}-postinst"
[[ -f "${msgfile}.$(messages_locale|cut -d '_' -f 1)" ]] && msgfile="${msgfile}.$(messages_locale|cut -d '_' -f 1)"
[[ -f "${msgfile}.$(messages_locale)" ]] && msgfile="${msgfile}.$(messages_locale)"

einfo ""
while IFS= read -r line; do
        elog "$line"
done < "${msgfile}"
einfo ""

This is instead the code that actually handles loading the translated message, which is then printed on screen for the user. It’s very rough code as it is, I already know, so no need to point that out: the pipeline shown above is run at least twice, and up to four times if the current locale is in a country form (like pt_BR) rather than just a language name (it). The code is also prone to errors, as it’s quite long by itself.

But as I said, this is a proof of concept rather than an actual implementation; it just demonstrates that it is possible to translate messages in ebuilds without filling the ebuilds with messages in 20 different languages. Of course, to avoid adding that big boilerplate code, it should go either into Portage itself in some way (but that makes adoption of translation a very long-term idea, maybe EAPI=1 related) or into a more feasible i18n.eclass, which would handle all of it, including caching the value returned by $(messages_locale) so that it’s called only once rather than four times, and converting from UTF-8 (the usual encoding for in-tree files) to the local encoding with iconv, if present.
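To make the idea concrete, here is a minimal sketch of what such an i18n.eclass helper could look like. This is my illustration, not an existing eclass: the i18n_messages_file function name and the _I18N_LOCALE cache variable are invented.

```shell
# Hypothetical i18n.eclass sketch; i18n_messages_file and _I18N_LOCALE
# are invented names for illustration.

messages_locale() {
        # Run `locale` only once and cache the result in _I18N_LOCALE.
        if [ -z "${_I18N_LOCALE+set}" ]; then
                _I18N_LOCALE=$(locale | grep LC_MESSAGES | cut -d '=' -f 2 | tr -d '"' | cut -d '.' -f 1)
        fi
        echo "${_I18N_LOCALE}"
}

i18n_messages_file() {
        # $1 is the base path (e.g. ${FILESDIR}/${P}-postinst); prefer the
        # full ll_CC match, then the bare language, then the default file.
        local base=$1 loc lang
        loc=$(messages_locale)
        lang=${loc%%_*}
        if [ -f "${base}.${loc}" ]; then
                echo "${base}.${loc}"
        elif [ -f "${base}.${lang}" ]; then
                echo "${base}.${lang}"
        else
                echo "${base}"
        fi
}
```

An ebuild would then just feed the result to the elog loop, instead of repeating the file checks by hand.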

This works well for the long log messages added at the postinst phase, for instance, because they rarely change between one version and the next, so there is time to translate them. It doesn’t really fly for the short informative messages we have around, nor does it work well for eclass messages.

For those, what I can think of off the top of my head is to standardise the strings as much as possible (for instance by letting the eclasses do the job), and then use gettext to translate them, with an “app-i18n/portage-i18n” package that the eclasses can get their data from. I’ll try to see if I can get a proof of concept of that too.
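As a rough sketch of what that could look like: the i18n_elog function and the portage-i18n text domain below are hypothetical names of mine, and elog is stubbed out so the snippet can run outside Portage.

```shell
# Hypothetical sketch: route standardised eclass messages through gettext,
# falling back to the English msgid when no gettext or catalogue exists.
# "portage-i18n" is an invented text domain, not a real package.

# elog is provided by Portage inside ebuilds; stub it for standalone runs.
if ! type elog >/dev/null 2>&1; then
        elog() { printf ' * %s\n' "$*"; }
fi

i18n_elog() {
        local msg=$1
        if command -v gettext >/dev/null 2>&1; then
                # With no portage-i18n catalogue installed, gettext simply
                # echoes the untranslated msgid back, so English remains
                # the default.
                elog "$(TEXTDOMAIN=portage-i18n gettext "${msg}")"
        else
                elog "${msg}"
        fi
}

i18n_elog "Please run etc-update to merge configuration changes"
```

Since the strings would come from a fixed, eclass-level set, translators would only have to translate each standardised message once, rather than once per ebuild.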