This Time Self-Hosted

Why strcasecmp() and similar functions should not be used when parsing data

It might sound obvious to experienced programmers, but it is certainly not obvious to everyone, which I’m afraid is a very bad thing, since I’d really like to expect people who write code to understand at least a little of the logic behind it.

I’m not going to talk about the problems of case-insensitive comparison and locale settings (just remember that i and I are not the same letter in Turkish). I still expect most developers to ignore those, but that is totally beside the point here; they are justified by not being linguists (unless they are Turks, and then I’d worry).

What I’m talking about is the logic behind the comparison itself. A normal string comparison has a very easy workflow: the strings are compared character by character, stopping at the first character that differs, or finishing when both strings reach their end. When you want to compare two strings case-independently, the comparison cannot just happen over the characters as they are; they first have to be brought to the same case.

To achieve that you have many different options: looking up equivalence tables (up to 256 by 256 elements for an 8-bit character set), looking up case-changing tables (twice, once per string), checking whether the character is in a given range, and so on. At any rate, it’s much more work than a simple comparison.
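To make the extra work visible, here is a minimal sketch of the range-check option, restricted to ASCII. The names ascii_lower() and ascii_casecmp() are made up for illustration; this is not how any particular C library implements strcasecmp(), and it deliberately ignores locales.

```c
/* Fold an ASCII uppercase letter to lowercase with a range check;
 * every other byte is returned unchanged. */
unsigned char ascii_lower(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c - 'A' + 'a') : c;
}

/* Hypothetical ASCII-only stand-in for strcasecmp().
 * Returns <0, 0 or >0, like strcmp(). */
int ascii_casecmp(const char *a, const char *b)
{
    for (;; a++, b++) {
        /* Two case folds per position, on top of the comparison itself. */
        unsigned char ca = ascii_lower((unsigned char)*a);
        unsigned char cb = ascii_lower((unsigned char)*b);
        if (ca != cb)
            return (int)ca - (int)cb;
        if (ca == '\0')
            return 0;
    }
}
```

Every position costs two folds plus the comparison, where a plain strcmp() only needs the comparison.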

You can expect the library you’re using to be optimised enough that the comparison does not take too long, so using strcasecmp() for a one-shot comparison is fine. What is not fine is using it for parsing: taking a token out of a file and then comparing it case-insensitively against a series of known tokens. That’s a no-no, since you’re going to require the same lookups or transformations many times in a row.

The easy way out of this is to ensure that all the reference tokens have a given case (lowercase or uppercase does not matter), and then convert the read token to the same case, so that you can just use the standard, fast, and absolutely non-complex case-sensitive string comparison.
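A rough sketch of that approach, assuming ASCII-only tokens held in a writable buffer. The keyword table and the names ascii_lower_inplace() and find_keyword() are invented for the example:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical keyword table; every entry is stored in lowercase. */
static const char *const keywords[] = { "begin", "end", "include" };

/* Lowercase the ASCII letters of a token in place, once. */
void ascii_lower_inplace(char *s)
{
    for (; *s != '\0'; s++) {
        if (*s >= 'A' && *s <= 'Z')
            *s = (char)(*s - 'A' + 'a');
    }
}

/* Normalise the token once, then use plain strcmp() against each keyword.
 * Returns the keyword's index, or -1 if the token is not known. */
int find_keyword(char *token)
{
    ascii_lower_inplace(token);
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++) {
        if (strcmp(token, keywords[i]) == 0)
            return (int)i;
    }
    return -1;
}
```

The case conversion is paid once per token, no matter how many reference tokens it is then compared against.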

It’s not that difficult, is it?

Update (2017-04-28): I feel very sad to have found out over a year and a half later that Michael died. The links in this and other posts to his blog are now linked to the archive kindly provided and set up by Jan Kučera. Thank you, Jan. And thank you, Michael.
