Locales, NLS, kernels and filesystems

One issue that most distributions have yet to solve easily (at least those without extensive graphical configuration utilities like the ones in Fedora and Ubuntu) is most likely localisation.

There is an interesting blog post by Wouter Verhelst on Planet Debian that talks about setting the locale variables. It’s well worth reading, as it clarifies the different variables pretty well.

One related issue seems to be understanding the meaning of the NLS settings that are available in the kernel configuration for Linux. Some users seem to think that you have to enable the codepages in there to be able to use a certain locale as system locale.

This is not the case: the NLS settings in there are basically only used for filesystems, and in particular only for the VFAT and NTFS filesystems. The reason for this lies in the fact that both filesystems are case-insensitive.

In usual Unix filesystems, like UFS, EXT2/3/4, XFS, JFS, ReiserFS and so on, file names are case sensitive, and they end up being just a string of arbitrary characters. On VFAT and NTFS, instead, the filenames are case *in*sensitive.

For case insensitivity, you need equivalence tables, and those are defined by the different NLS values. For instance, in Western locales the characters ‘i’ and ‘I’ are equivalent, but in Turkish they are not, as ‘i’ pairs with ‘İ’ and ‘I’ with ‘ı’ (if you wish to get more information about this, I’d refer you to Michael S. Kaplan’s blog on the subject).
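To make the idea of an equivalence table concrete, here is a minimal sketch in Python. The two tables and the helper names are made up for illustration; the kernel's real NLS tables are per-codepage arrays and look nothing like this, but the effect on name comparison is the same in spirit.

```python
# Hypothetical per-locale uppercase overrides (illustration only, not the
# kernel's NLS data). Anything not listed falls back to str.upper().
WESTERN_UPPER = {"i": "I"}
TURKISH_UPPER = {"i": "İ", "ı": "I"}


def fold_upper(name, table):
    """Uppercase a name one character at a time using the given table."""
    return "".join(table.get(ch, ch.upper()) for ch in name)


def same_name(a, b, table):
    """Return True if the two names are equivalent under the table."""
    return fold_upper(a, table) == fold_upper(b, table)


print(same_name("fail", "FAIL", WESTERN_UPPER))  # True
print(same_name("fail", "FAIL", TURKISH_UPPER))  # False
print(same_name("fail", "FAİL", TURKISH_UPPER))  # True
```

The same pair of names is one file under the first table and two files under the second, which is why the choice of table matters to a case-insensitive filesystem.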

So when you need to support VFAT or NTFS, you need to support the right NLS table, or your filesystem may end up corrupted (with a Turkish charset, you can have two distinct files called “FAIL” and “fail”, as ‘I’ and ‘i’ are not a case pair there). This is the reason why you find the NLS settings in the filesystems section.

Of course, one could say that HFS+, used by MacOS, is also case-insensitive, so the NLS settings should apply to that too, no? Well, no. I admit I don’t know much about historical HFS+ filesystems, as I only started using MacOS from version 10.3, but at least since then the filenames are saved as Unicode (as far as I know UTF-16 on disk, even though userland sees UTF-8), and the filesystem format comes with its own well-defined equivalence table. So there is no need to select an option: the equivalence table is defined as part of the filesystem itself.
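As a rough illustration of an equivalence that is fixed once and for all rather than chosen per locale, here is a small Python sketch. It is not the HFS+ algorithm; it simply assumes that names are normalised to a decomposed form and then case-folded with Unicode's default, locale-independent folding. The function name `canonical_key` is my own.

```python
import unicodedata


def canonical_key(name):
    """Decompose the name, then apply Unicode default case folding."""
    return unicodedata.normalize("NFD", name).casefold()


precomposed = "R\u00e9sum\u00e9.txt"   # é stored as a single code point
decomposed = "Re\u0301sume\u0301.TXT"  # e followed by a combining acute accent

print(precomposed == decomposed)                                # False
print(canonical_key(precomposed) == canonical_key(decomposed))  # True
```

With a fixed key function like this, there is nothing left to configure: two names either map to the same key or they don't.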

Knowing this, why does VFAT not work properly with UTF-8, as the kernel warns when you mount it with iocharset=utf8? The problem is that VFAT works on a per-character equivalence basis, and UTF-8 is a variable-width encoding, which does not suit VFAT well.
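The mismatch between a per-character table and a variable-width encoding can be sketched in Python. The 256-entry table below is a stand-in that I build from Python's own case rules for Latin-1; it is an assumption for the sake of the example, not the kernel's table.

```python
def build_latin1_upper():
    """Build a 256-entry byte table mapping each Latin-1 byte to its uppercase."""
    table = bytearray(range(256))
    for b in range(256):
        up = chr(b).upper()
        # Keep only mappings that stay within a single Latin-1 byte.
        if len(up) == 1 and ord(up) < 256:
            table[b] = ord(up)
    return bytes(table)


LATIN1_UPPER = build_latin1_upper()


def equal_ignoring_case(a, b, encoding):
    """Compare two names byte by byte after running them through the table."""
    raw_a = a.encode(encoding).translate(LATIN1_UPPER)
    raw_b = b.encode(encoding).translate(LATIN1_UPPER)
    return raw_a == raw_b


print(len("é".encode("latin-1")), len("é".encode("utf-8")))  # 1 2
print(equal_ignoring_case("é", "É", "latin-1"))              # True
print(equal_ignoring_case("é", "É", "utf-8"))                # False
```

In a single-byte charset one table lookup per byte is enough, while in UTF-8 the letter is spread over two bytes and a byte-wise table has no way to pair ‘é’ with ‘É’.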

Unfortunately, make oldconfig and make xconfig seem to replace the default charset with UTF-8 every time, at least on my system, maybe because UTF-8 is the system encoding I’m using. I guess I should check whether it’s worth reporting a bug about this, or whether I can fix it myself.

Update (2017-04-28): I feel very sad to have found out over a year and a half later that Michael died. The links in this and other posts to his blog are now linked to the archive kindly provided and set up by Jan Kučera. Thank you, Jan. And thank you, Michael.

5 thoughts on “Locales, NLS, kernels and filesystems”

  1. Actually, it isn’t quite that simple… (If you thought that _was_ simple).

     NLS *does* affect case-sensitive file systems such as EXT3, just not as badly. For example, every time I mess up and replace utf-8 with iso8859-1 in my kernel configuration my ext3 file system goes berserk and replaces all “å”, “ä”, and “ö” with “Ã¥”, “Ã¤”, and “Ã¶” in all my filenames…

     Also note that NTFS is case-sensitive, and that thus “FILE.TXT” and “file.txt” are always different files. Windows hides this from the user, but Linux doesn’t do that by default (a mount option exists). SMBFS and CIFS are, however, case-insensitive like FAT by default (CIFS is case-sensitive if you use the POSIX extensions though).

     Additionally, utf-8 does in fact specify multiple different equivalence tables depending on locale. For example “i” and “I” would be considered equivalent in English, but not in Turkish, even on utf-8. Of course, this doesn’t matter on case-sensitive file systems such as ext3, and I haven’t found any configuration option for it anywhere, so I’m assuming that a case-insensitive file system using utf-8 (such as CIFS) uses the default collation (which is correct for English, and as correct as possible in as many other languages as possible without breaking correctness with English).

  2. The reason for having the codepages and NLS modules in the kernel is not only that Windows-based filesystems are case-insensitive. The true reason is that they always store filenames in UCS-2, and the kernel needs to convert this to a form understandable to utilities such as “ls”. See more details in http://bugs.debian.org/cgi-…

     So, if you mount a Windows-based filesystem with an iocharset different from that of your current locale, “ls” will display non-English characters incorrectly (VFAT has a special-case “utf8” option that overrides iocharset, but still works incorrectly WRT the case, see http://bugs.debian.org/cgi-…).

  3. Hello,

     I believe you’ve made a mistake. See paragraph 4: “The reason of this lies in the fact that both filesystems are case-insensitive.” Now paragraph 5: “On VFAT and NTFS, instead, the filenames are case sensitive.”

     Anyway, thank you for your articles and good luck :)

  4. Okay, I fixed the typo about VFAT insensitivity ;)

     And I admit I didn’t know that about NTFS; nice to know at least, it might come in useful.

     As for UTF-8, I think that the distinction between multiple collation tables can be made moot by normalising the characters. If I recall correctly, that’s what OSX does for HFS+. This way it should be possible to have a character that is uniquely tied to another (like i/I). I didn’t check the code that handles that; maybe I could if I get more time.

     And for ext3, it does seem strange to me, because I used to have a system with no VFAT support and no NLS either, and it worked quite fine, and the way the filenames appear depends only on my current (userland) locale. And I don’t see any mount option to override nls or iocharsets, so it doesn’t sound like the right thing…
