Locales, NLS, kernels and filesystems

One issue that is yet to be solved easily by most distribution (at least those not featuring extensive graphical configuration utilities, like Fedora and Ubuntu do), is most likely localisation.

There is an interesting blog post by Wouter Verhelst on Planet Debian that talks about setting locales variables. It’s a very interesting reading, as it clarifies pretty well the different variables.

One related issue seems to be understanding the meaning of the NLS settings that are available in the kernel configuration for Linux. Some users seem to think that you have to enable the codepages in there to be able to use a certain locale as system locale.

This is not the case, the NLS settings in there are basically only used for filesystems, and in particular only VFAT and NTFS filesystems. The reason of this lies in the fact that both filesystems are case-insensitive.

In usual Unix filesystems, like UFS, EXT2/3/4, XFS, JFS, ReiserFS and so on, file names are case sensitive, and they end up being just a string of arbitrary characters. On VFAT and NTFS, instead, the filenames are case *in*sensitive.

For case sensitivity, you need equivalence tables, and those are defined by different NLS values. For instance, for Western locale, the character ‘i’ and ‘I’ are equivalent, but in Turkish, they are not, as ‘i’ pairs with ‘İ’ and ‘I’ with ‘ı’ (if you wish to get more information about this, I’d refer you to Michael S. Kaplan’s blog on the subject).

So when you need to support VFAT or NTFS, you need to support the right NLS table, or your filesystem will end up corrupted (on Turkish charset, you can have two files called “FAIL” and “fail” as the two letters are not just the same). This is the reason why you find the NLS settings in the filesystems section.

Of course, one could say that HFS+ used by MacOS is also case-insensitive, so NLS settings should apply to that too, no? Well, no. I admit I don’t know much about historical HFS+ filesystems, as I only started using MacOS from version 10.3, but at least since then, the filenames are saved encoded in UTF-8, which has very well defined equivalence tables. So there is no need for option selections, the equivalence table is defined as part of the filesystem itself.

Knowing this, why VFAT does not work properly with UTF-8, as stated by the kernel when you mount it as iocharset=utf-8? The problem is that VFAT works on a per-character equivalence basis, and UTF-8 is a variable-size encoding, which does not suit well VFAT.

Unfortunately, make oldconfig and make xconfig seem to replace, at least on my system, the default charset with UTF-8 every time, maybe because UTF-8 is the system encoding I’m using. I guess I should look up to see if it’s worth to report a bug about this, or if I can fix it myself.

Update (2017-04-28): I feel very sad to have found out over a year and a half later that Michael died. The links in this and other posts to his blog are now linked to the archive kindly provided and set up by Jan Kučera. Thank you, Jan. And thank you, Michael.

Exit mobile version