This Time Self-Hosted

The UTF-8 security challenge

I make no mystery of the fact that I like my surname to be spelt correctly, even if that’s difficult internationally. I sincerely don’t think that’s too much to ask; if you want to refer to me and you don’t know how to spell my surname, you have a few other options, starting with my nickname (“Flameeyes”), which I keep using everywhere, including the domain of this blog, because, well, it’s as good a name as my “real” name. I know other developers, starting with Donnie, prefer to be recognized mainly by their real name; but since I know my name is difficult to type for most English speakers, I don’t usually ask that much. Flameeyes was, after all, more unique to me than “Diego Pettenò”, since there are three others with the latter name in my city alone.

But even without going with nicknames, which might not sound “professional”, I’m fine with being called Diego (in Gentoo I’m the only one; as far as multimedia areas are concerned, I’m Diego #2 since “the other Diego” Biurrun takes due priority), or, since a few months ago, Diego Elio (I don’t pretend to be unique in the world, but when I chose my new name, besides choosing my grandfather’s name, I also checked I wouldn’t step on the toes of another developer), or, if you really really need to type my name in full, “Diego Petteno`” (yes, there is an ASCII character to represent my accent and it’s not the usual single quotation mark; even the quotation mark, though, works as an attempt, as it does for banks and credit cards). If you’re in a particularly good mood and want to tease me, you could also use 炎目 (which is probably too literal a translation of “Flameeyes” in kanji); I think the only person who has ever used that to call me is Chris (White), and it also does not solve the issue of UTF-8.

Turns out it’s not that easy at all. I probably went a little overboard the other day about one GLSA mistyping my name (it still does), because our security guys are innocent on the matter: glsa-check breaks with UTF-8 in the GLSA XML files, which makes it hard to type my name. (That is a bug in glsa-check: you should not assume anything about the encoding of XML files, since each file declares its own encoding!) The reason why I was surprised (and somewhat annoyed) is that I was expecting it to be typed right for once: py handled it, and I’m sure he has the ò character on his keyboard.
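
To make the point about encodings concrete, here is a minimal sketch in Python. It is not glsa-check’s actual code, and the `advisory` and `credit` element names are made up for illustration; the real GLSA format is not shown here. The idea is simply to hand the parser the raw bytes and let it honour whatever the XML declaration says, rather than decoding the file yourself with a guessed encoding.

```python
import xml.etree.ElementTree as ET

NAME = "Diego Pettenò"


def make_doc(encoding: str) -> bytes:
    """Build a tiny, hypothetical advisory-like document in the given encoding."""
    text = (
        f'<?xml version="1.0" encoding="{encoding}"?>'
        f"<advisory><credit>{NAME}</credit></advisory>"
    )
    return text.encode(encoding)


def read_credit(raw: bytes):
    """Parse raw bytes; the parser reads the encoding from the XML declaration."""
    root = ET.fromstring(raw)
    return root.findtext("credit")


# The same name survives regardless of which encoding the file declares.
for enc in ("utf-8", "iso-8859-1"):
    print(enc, read_credit(make_doc(enc)))
```

The two documents differ byte for byte, yet both yield the same name, because the parser is never told to assume ASCII or any other fixed encoding.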

Curious about this, I also wanted to check how the other distributions would handle my name. A good chance to do that was provided by CVE-2008-4316 (which I already discussed briefly). The results are funny, disappointing and interesting at the same time.

The oCERT advisory has a broken encoding and shows the “unknown character” symbol (�); on the other hand, Will’s mail at SecurityFocus shows my name properly. Debian cuts my surname short, while Ubuntu simply mistypes it; Red Hat, on the other hand, shows it properly. Score one for Red Hat.
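
One common way to end up with that “unknown character” symbol is to decode bytes with a different encoding from the one they were written in. What actually happened on the oCERT page is not known here, so the Latin-1 source in this Python sketch is purely an assumption used to reproduce the symptom.

```python
surname = "Pettenò"

# Assumption for illustration: the text was stored as ISO-8859-1 (Latin-1)...
latin1_bytes = surname.encode("iso-8859-1")

# ...but read back as UTF-8. The lone byte 0xF2 is not valid UTF-8 on its own,
# so a lenient decoder substitutes U+FFFD, the replacement character.
garbled = latin1_bytes.decode("utf-8", errors="replace")
print(garbled)

# The opposite mistake (UTF-8 bytes read as Latin-1) never fails outright;
# it silently yields two wrong characters in place of the ò.
mojibake = surname.encode("utf-8").decode("iso-8859-1")
print(mojibake)
```

Either mistake goes away once the reader uses the encoding the writer actually declared.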

Only one out of four distributions handles my name correctly, and that’s not really good. (Gentoo has no GLSA on the matter, but I know what would have happened; and the CVE does not link to other distributions, just to a few more security-focused sites I’m not interested in at the moment.) I’m especially surprised that the one distribution getting it right is Red Hat, since the other two are the ones I usually see mentioned when people talk about localising Free Software packages. Gentoo at least does not pretend to be ready for internationalisation in the first place (although we have a GLEP that does).

Okay, I certainly am a nit-picker, but we’re in 2009, there are good ways to handle UTF-8, and the only obstacles I see nowadays are very old legacy software and English speakers who maintain that seven bits are enough to encode the world, which is not true by a long shot.

Comments 15
  1. I guess not many people know of the compose key in X11 ïţ ḷêţś ýøũ ṫỳṕë ļíķè ṫħïş… with a normal english keyboard.

  2. Is it about your name or UTF-8 xml implementation? Ok, you’ve got Chinese name, why don’t you translate it to English as well? My Russian name includes “ь” which is softener of a previous character. Without it, it sounds different, it sounds like … English name ;-). That’s why I don’t ask banks to spell it like: “lьkov” or “l’kov” (which is “translit” version) on a credit card. Do you really want to spell my name or any Chinese name even “he has it on his keyboard”? Yes, this is annoying although technically it’s possible.

  3. btw, I know some people want to change their co-name as “Total” to be the last one in the budgeting spreadsheet or as “‘ or ‘1’=’1 ” which might work as sql injection in some systems. I’m usually trying “O’brain” to verify it first. I’ll be using “Petteno'” from now. Sorry for the typo, but ‘ works better in this case. LOL.

  4. Actually my name is Italian, not Chinese at all :) And yes, where it’s used in an official fashion I would like it to be spelt correctly, just like I would spell yours correctly (if I knew the correct spelling, of course). If I did misspell someone’s name on my blog and they asked me to correct it, even months or years after the post, I’d sincerely gladly do so. Why? Because I think names are important. I can give in to the practicality of using “Petteno`” or “Petteno’”, although I prefer the former, but since “Petteno” is a totally different surname here, I’m not really keen on dropping the accent at all. And in this case, I confirmed that Will’s mail was correct just to make sure they could have copy-pasted my name if they were unable to type it in directly. I’m not expecting everybody to know how to use the compose key, as nico noted. I’m just asking for _one_ little step. When I give credit to someone from a Gentoo bug, I usually either make sure I type the name right or copy-paste it; I can’t see why the same cannot apply to security advisories.

  5. I think my point was about how it should be spelled by you. If you use it in an article written in English, you should use an English name in the English alphabet only, A-Z. Simple like that. Not Italian , not Chinese 炎目. Right now it is like an article written half in Italian, half in English. That’s why people don’t understand your name. Cheers, Антон

  6. Sorry but people’s names are untranslatable; they might be transliterated but not translated. And there is no “English alphabet”; the A-Z alphabet is the _Latin_ alphabet, and it contains ligatures that allow my name to be spelt correctly as Pettenò. And people are not supposed to _understand_ my name; I’m not even asking them to properly pronounce it, but I want to make sure that it’s not mistaken for a totally different surname, from a vastly different family tree at that point. Just like the more widely known difference between “Müller”:http://en.wikipedia.org/wik…, “Muller”:http://en.wikipedia.org/wik… and “Mueller”:http://en.wikipedia.org/wik… there are different people bearing both surnames Pettenò and Petteno; if it was obvious that one is just a best-match for the other (like Pettenò and Petteno` or Pettenò and Petteno’) it would be mostly fine (yet annoying because it’s just a way to work around a technical issue). But changing the whole surname is a no-go for me. Transliteration _is_ standardised and is not done as the first person writing the word pleases. The correct transliteration of my name into 7-bit ASCII (which again is *not* an “English alphabet”; the English language includes also characters like æ in words like Encyclopædia even though its use is rare nowadays) would be “Diego Petteno`” and _not_ “Diego Petteno”. Or you can just call me Flameeyes, I don’t mind at all.

  7. Anton, you give the feeling of being a microsoft user, who doesn’t care about others. Being able to write names and places correctly is important, or else it will be difficult to know whom or which place you are referring to; for example, I could ask you if you have been in Lempaala: this could be the town of Lempaala (located in Karelia occupied by Russia) or the town Lempäälä (outside Tampere, Finland). The problem grows even more when considering non-Latin alphabets or character-based written languages. Another advantage of using UTF-8 is that it’s the same regardless of which operating system you use, so no fuss with finding out the right character setup for a file you are reading. Please start to care about other people and learn how to use your keyboard to type new characters; no one requires you to start typing Chinese or any other language using a non-Latin alphabet.

  8. Ok, guys. I don’t live in Europe and am not much aware of your culture and language problem. But here is the link http://en.wikipedia.org/wik… So it does exist, 26 chars, A-Z. Trizt: I’m not Anton bloody hell! From now on, call me Антон as I typed it in my last post. LOL. You still didn’t get my point? Yes, European languages are all based on Latin, so you can ask people to type it correctly. But be ready that you’ll need to type all other languages in the world, some Asian like Chinese, Thai etc you’ll need to copy-paste and you won’t have any idea how to pronounce it. BTW, in the case of Chinese you actually need to translate names, so it’s common for them to have two names: Chinese and English. How can I make myself more clear?

  9. Trizt: I could also ask you if you have been in Moscow. There are 22 places in the World called http://en.wikipedia.org/wik… So in your example, take a dictionary, take an English map and learn hard how to call it in English since you use it. I personally have no idea what ä, ü and many other characters from UTF-8 you guys are using are. So it’s actually you who, like a microsoft user, asks people to use Latin, not even English :) Fine, let’s use unicode names, but the whole range. Be ready to copy-paste Asian, Middle East, 100 dialects of Indian languages and 1000s Africans (just because Ubuntu was born there)

  10. More often than not, foreign names get “translated” in one way or another when crossing borders. Venice, Rome, Naples uzw. Yes, names do matter – but things start to get silly once you’ve got to copy+paste the string of symbols representing a name because there’s no way you can actually parse the phonetic meaning. UTF8 is really nice but imho a little beside the point. There are lots of languages that use latin-based alphabets that produce names a little harder to reproduce in English than Italian. Example, a typical Scandinavian name: Bjørn Åge Hærstøl. Probably rendered in an English context as “Bjorn Age Harstol”, hardly even a transliteration but rather a new simplified name that resembles the Norwegian one enough to figure out it’s the same guy – as per context. - Rasmus (or, rewound and translated back to modern Italian – “Erasmo”)

  11. Rasmus, the names could also be written as Bjoern Aage Haerstoel, which is far closer to the original Norwegian than today’s oversimplified “English” way to type it, but it’s really not that difficult to type it correctly with an English keyboard; for some reason they seem to know all those characters when they try to be “cool and 3133t”.

  12. Trizt, that’s exactly my point, oversimplifying is _not_ nice, when you can do just the same; I’m not demanding to always be called in pure total UTF-8, but is it really that difficult to add a ‘ or a ` to the oversimplified (and with a different meaning at the source) “Petteno”?

  13. I think you can look to the Esperanto people for a way to work with names. They have been having the problem of international names for over a decade, now, since at Esperanto meetings you have people from every country of the world. At the same time most people there are quite sensitive to language issues (they chose to learn an international language to be able to talk on neutral ground, after all). They use several different ways, but the most popular for news articles is to write every name twice: once in a version which is pronounceable mostly correctly for every Esperanto speaker (so it can be read fluidly) and, in brackets after that, the name as seen in the person’s home language. They also do this in the case that the home language and Esperanto share the same alphabet, because the same alphabet doesn’t imply the same pronunciation (just look at Spanish, English and German). The Esperanta Wikipedia uses the inverse system: original name (Esperanto version). That way they can be sure that entries are identified by the original name, so they don’t have duplicate entries and searching by name gets easier, since the original version is authoritative most of the time. Five examples (inverse them for the wikipedia version): * Barak Obama (Barack Obama) * Ĝorĝ Buŝ (George Bush) * Flejm-Ais (Flameeyes) * Uel red rider (well read reader). The clear advantage is: everyone can pronounce the name, and it will never sound too far off the original. Still you can always find all articles about the person written in any language by just searching for the name in brackets. -> http://eo.wikipedia.org/wik
