Interesting notes about filesystems and UTF-8

You probably know already that I’m a UTF-8 enthusiast. I use UTF-8 extensively in everything I do: mail, writing, IRC, and whatever else. That is not only because my name can only be spelled right when UTF-8 is used, but also because it is much nicer to write text with proper arrows rather than semigraphical ones, and with a proper ellipsis and proper dashes.

On Linux, UTF-8 is not always easy to get right. There is quite a bit of software out there that does not play nice with UTF-8 and Unicode, including our GLSA handling software, and that can really be a bother to me. There are also problems when interfacing with filesystems like FAT that don’t support UTF-8 in any way.

Not so on Mac OS X, usually, because the system was designed from the start to make use of Unicode and UTF-8, including in its filesystem, HFS+. There is, though, one big problem with this: Unicode offers more than one way to produce the same character, using either a single precomposed code point or a base character followed by combining diacritical marks, a form that is more complex but easier to compare case-insensitively. Since HFS+ can be case-insensitive (and indeed it is by default, and has to be for the operating system volume), Apple decided to force the use of the latter, decomposed form for UTF-8 text in HFS+: all file names are normalised before being used. This works fine for them, and the filenames are usually just as readable from Linux.
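To make the two spellings concrete, here is a minimal sketch using Python's standard `unicodedata` module. It assumes that plain NFD is close enough for illustration; HFS+ uses its own variant of canonical decomposition, so this is not a reproduction of what the filesystem does. The letter é is just an example of mine.

```python
import unicodedata

composed = "\u00e9"     # é as one precomposed code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

# The two strings render the same but are different sequences.
same_as_is = composed == decomposed

# After normalising both to the same form they compare equal.
same_nfd = unicodedata.normalize("NFD", composed) == decomposed
same_nfc = unicodedata.normalize("NFC", decomposed) == composed

print(same_as_is, same_nfd, same_nfc)
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))
```

The byte lengths differ too, which is why a tool that compares file names byte by byte sees two different names.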

But there is a problem. Since I have lots of music in iTunes to be synced to my iPod, I usually keep my main music archive on OS X, and then rsync it over to Linux repeatedly so I can play it on my main system (or at least try to, since most of the audio players I found are sucky for what I need). In my music archive I have many tracks by Hikaru Utada (宇多田ヒカル), which are named with their original titles (most of them come from the iTunes Store itself; others are ripped from my CDs). One EP I have is titled SAKURAドロップス; in this title there are two characters that are decomposed into a base character and a combining mark (ド and プ). While it might not be obvious, I’ll just rely on Michael Kaplan to explain to you why that happens.
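For the curious, the decomposition of that title can be inspected with a short Python sketch. Again this uses standard NFD as a stand-in for what HFS+ stores, which is an assumption on my part; the helper name `describe` is made up for the example.

```python
import unicodedata

# SAKURAドロップス written with precomposed katakana
title = "SAKURA\u30c9\u30ed\u30c3\u30d7\u30b9"

def describe(text):
    """Return one 'U+XXXX NAME' string per code point."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

decomposed = unicodedata.normalize("NFD", title)

print(len(title), len(decomposed))
for line in describe(decomposed):
    print(line)
```

In the decomposed form, ド becomes ト (U+30C8) followed by the combining voiced sound mark U+3099, and プ becomes フ (U+30D5) followed by the combining semi-voiced sound mark U+309A; the other characters are unchanged.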

Now, the synced file keeps the normalised filename, which is fine. The problem is that something does not work right in zsh, gnome-terminal, or both. On Gentoo, with a local gnome-terminal, both when showing me the completion alternatives and when actually completing the filename, instead of ド I get ト<3099>. On Fedora via SSH, the completion alternatives are fine, but I still get the non-recomposed version on the command line after completion.
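To find out which synced names are still stored decomposed, something like the following Python sketch could be used. It is only a diagnostic under my own assumptions: the function names are hypothetical, and it flags any name that is not in NFC without trying to rename anything.

```python
import os
import unicodedata

def is_nfc(name):
    """True if the name is already in composed (NFC) form."""
    return unicodedata.normalize("NFC", name) == name

def decomposed_names(root):
    """Yield paths under root whose last component is not in NFC."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not is_nfc(name):
                yield os.path.join(dirpath, name)

if __name__ == "__main__":
    # "." is a placeholder; point it at the synced music directory.
    for path in decomposed_names("."):
        print(path)
```

A name containing ト followed by U+3099 would be listed, while the same name written with the precomposed ド would not.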

Update (2017-04-28): I feel very sad to have found out over a year and a half later that Michael died. The links in this and other posts to his blog are now linked to the archive kindly provided and set up by Jan Kučera. Thank you, Jan. And thank you, Michael.
