This Time Self-Hosted

UTF-8 Forever

Ok, I know I’m being quite an ass for posting so many blog entries in a short time, but I have had some of them in mind for the last few days and I just hadn’t had time to actually write them.

This entry has a title that will probably turn out to be misleading, as I’m going to summarize a couple of things here…

Starting from the topic in the title, I’m really loving UTF-8 these days. With the surname I have (“Pettenò”), it is quite important to be sure that the final ò is handled correctly by computers. I have already received (snail) mail with the surname mangled by the wrong encoding, and sometimes I just get mail with “Petteno” as the recipient (which is a completely different surname).
Luckily, with UTF-8 it is possible to represent my surname as well as kloeri’s (Østergaard; sorry for naming you, but yours was the only other name with special characters that I had in mind) without having to juggle encodings, and at the same time also to write ?? without having to install special support for extra encodings or forcibly disable latin-extended characters (for those wondering what is written there, it’s “baka”.. and if you don’t know, just google :P).
Unfortunately, UTF-8 is not a magic solution, as there is still someone who fails to write ChangeLogs with my name spelt right.. eh Mike? (I am referring to recent ChangeLogs from vapier where I had to fix my surname as it was broken, nothing personal 🙂 ).
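To make the mangling concrete, here is a minimal Python sketch of what probably happens to the surname in both cases. The variable names are mine and purely illustrative, and I am assuming the wrong encoding on the other end is ISO-8859-1, which is only the most common culprit.

```python
surname = "Pettenò"

# In UTF-8 the final ò takes two bytes (0xC3 0xB2), so 7 characters become 8 bytes.
utf8_bytes = surname.encode("utf-8")

# Reading those UTF-8 bytes as ISO-8859-1 (an assumption about the receiving
# side) turns each of the two bytes into its own character.
mangled = utf8_bytes.decode("iso-8859-1")

# Forcing the name into plain ASCII cannot keep the ò at all.
ascii_lossy = surname.encode("ascii", errors="replace").decode("ascii")

# Østergaard survives a UTF-8 round trip without any extra encoding support.
other = "Østergaard"
round_trip = other.encode("utf-8").decode("utf-8")

# ascii() keeps the output printable even on a terminal that is not UTF-8.
print(ascii(mangled), ascii_lossy, round_trip == other)
```

The mangled form ends in two junk characters instead of ò, which is the kind of thing that shows up on envelopes.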

Then I must say that lately my (real) life has been really messed up and strange. Maybe it’s just the weather, maybe it’s the time that’s passing, but I have been feeling really depressed over the last few days. More probably, it is the knowledge that someone I care about is happy, but with someone else, that’s making me feel strange. While I’m happy that she is happy, I’m totally sad because I know I won’t be able to be at her side for all the time we have in front of us. This feeling is really messing me up, so I don’t really know whether I’ll be present on IRC, whether I’ll look at bugs or whether I’ll complete the GNOME porting without pauses… I think, I hope, I’ll remain the same as usual, also because it helps me not to think of her, but I can’t really say what I’m going to do.

On a slightly happier note, after being published on NewsForge, becoming the Developer of the Week (of three weeks ago) and now becoming Deputy Lead for the G/BSD project, I’m starting to think that I’m not wasting others’ time all day, every day, so I feel relieved about the “professional” part of my life.. I just hope I’ll be able to continue like this after I find a job, as I haven’t been paid yet for the translation, and I don’t have any other income. It sucks not being able to test G/FBSD 6 just because the only other machine requires a damn PC100 memory stick to work (the memory I had was faulty, I had to trash it).

Comments 6
  1. I U+2665 your blog. (Let’s see what it does with UTF-8: ♥) I’m a huge proponent of UTF-8 (my name is Thompson, after all) and am very glad that Gentoo makes it a default in so many places. One example: the MySQL configuration defaults sanely to UTF-8 even for client transfers whereas Debian (cough, cough) is still stuck on ISO-8859-1. Getting people to understand character encodings and to use UTF-8 is often a harder sell, especially when software doesn’t default to it. I’m not always so sure about the software that implements blogs, hence my feeding that BLACK HEART SUIT character into this message as a test to see whether UTF-8 survives in comments. I wish I could offer you consolation about your life situation, but I don’t have anything really helpful. Hang in there!

  2. If you think that is bad, imagine if said woman has your daughter. I see your point on UTF-8, however as an English speaker, all other character sets have been a huge pain for me through the years. So often they do not work correctly and you get the wrong characters all over. I have no interest in figuring this out, it is just a hassle, especially when I am trying to set up a minimal install on an embedded device. ASCII is all that I want for that.

  3. @Mike Rogers: I totally hear you on the daughter thing. I’ve been dealing with that for the past 3.5 years. I love my daughter and wouldn’t trade her for the world, but it’s not easy knowing I still have 15 years tied to a woman I once wanted to marry and now practically can’t stand. Okay, on-topic stuff now. :P UTF-8 is indeed awesome. It’s the primary reason I decided to pick up Python 3 despite a lot of libraries not being updated for it. Defaulting to UTF-8 is simply the best way to go right now. Eventually, UTF-16 will take its place, but that’s quite a while off. We need to get people to use UTF-8 first. :) Best of luck with your woman troubles, Diego. There are decent ones out there. If you find a second one, give me her number! 😛 Doing what you’re doing (focusing on your programming, maintenance, etc) is probably the best thing you could do.

  4. Ironically, the 4th paragraph only has two question marks instead of what I assume was intended to be Japanese or something.

  5. Yeah that’s what happens when you go through a b2evo update, a WordPress import and then an export. Yes I need to figure out which characters they were and fix it again, it’s just that I never re-configured ibus on my laptop…

  6. @sporkbox: I’m afraid I see UTF-16 as a dead end. Let’s see: big endian or little endian? Clumsy handling of codepoints outside of the BMP (that surrogate-pair business). Incompatibility with ASCII-friendly APIs (MS Windows has to expose two sets of APIs because of its use of UTF-16). Non-guaranteed resynchronization with a text stream after a span of garbled input. So why use UTF-16? Proponents like the certainty of addressing characters in a string because index * sizeof(uint16_t) works, or at least it *did* work before we had surrogate pairs. But what about the space UTF-16 saves for most of the Asian languages (or anything else with codepoints in the range U+0800 to U+FFFE)? These are the only codepoints where UTF-16 has an advantage over UTF-8, yet because ASCII codepoints take twice the space in UTF-16 as in UTF-8, marked-up texts (for example, web pages) often take up more space in UTF-16 than in UTF-8. So I’ve never understood the attraction of transferring information in UTF-16 by streams across networks, via API calls, or even as storage in files. With UTF-8, you can say goodbye to byte-order marks, surrogate pairs, and resynchronization issues. The only really good use case I can imagine for UTF-16 is as an internal representation. Even then, programs can’t blithely assume that characters have fixed offsets within strings. A program that wants fixed offsets could use UTF-32, but that wastes at least one whole byte per character (the full Unicode range is 21 bits, so 11 bits go unused). A program could use only three bytes per character (you could call it UTF-24, I suppose), but both the UTF-32 and UTF-24 solutions are still up against another difficulty: they still can’t assume fixed-width indexing because Unicode allows combining characters. (Here’s an example. Someone writing an academic paper we had to cite had a really crazy symbol, a k with a circumflex over it, which we pronounced k-hat. There’s no precomposed form in Unicode for that, but you can make what you need with combining characters. Whether you can make your browser display it is another matter, but you can see it in Vim by typing k, then ctl-V followed by the sequence u0302.) If you’re counting characters in a string, you have to account for the fact that combining characters don’t really count as characters but only modify the real characters that precede them. And @Mike Rogers, I have to point out that even English texts often have fluffy little niceties like curly quote marks and em-dashes. Yes, you can write everything in English using only ASCII, but a lot of people aren’t content with that. Since it is unlikely that someone puts combining characters after ASCII punctuation, most string operations valid for ASCII strings work just fine for UTF-8 strings. So long as you don’t assume that offsets in strings are the same as character positions (assumptions that generally hold true for filesystem APIs), you don’t need to sweat that your embedded device supports UTF-8.
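
The size and indexing arguments in that last comment are easy to check for yourself. What follows is only a rough Python sketch: the helper names (`encoded_sizes`, `utf16_code_units`, `count_base_characters`) and the sample strings are made up for illustration and do not come from the thread, and UTF-16 is measured as little endian without a byte-order mark.

```python
import unicodedata


def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


def utf16_code_units(text):
    """Number of 16-bit units UTF-16 needs; a surrogate pair counts as two."""
    return len(text.encode("utf-16-le")) // 2


def count_base_characters(text):
    """Count codepoints that are not combining marks (a rough 'visible' count)."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


cjk = "\u65e5\u672c"              # two arbitrary codepoints between U+0800 and U+FFFE
marked_up = "<p>" + cjk + "</p>"  # the same text wrapped in ASCII markup
outside_bmp = "\U0001F600"        # one codepoint outside the BMP
k_hat = "k\u0302"                 # k followed by COMBINING CIRCUMFLEX ACCENT

print(encoded_sizes(cjk))         # bare CJK text: UTF-16 is smaller
print(encoded_sizes(marked_up))   # with markup around it: UTF-8 is smaller
print(utf16_code_units(outside_bmp))
print(len(k_hat), count_base_characters(k_hat))
print(len(unicodedata.normalize("NFC", k_hat)))
```

With these samples the bare CJK text is smaller in UTF-16, the marked-up version is smaller in UTF-8, the single codepoint outside the BMP needs two 16-bit units, and k-hat stays two codepoints even after NFC normalization while counting as one base character. Skipping combining marks is only an approximation of what a reader would call a character; proper grapheme segmentation is more involved than this.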
