In my day job (which actually was a weekend job, but that’s beside the point now), I’ve recently had to extend a regular expression that is used for validating a string as an email address. It made me shiver, because the expression does not really work, and I don’t like having to make changes that don’t really work, but just seem to.
The email address format is quite complex, and indeed you really cannot be sure about the validity of an email address until you actually try to deliver something to it. But trying to get it right with a simple regular expression is doomed to fail, for very good reason.
Indeed, if you Google for “valid email address regexp” you are brought to an interesting article about valid email addresses and regular expressions that gives you a lot of very interesting information. I don’t agree with all of it, though, and I’m sure it fails to account for at least a few important things.
In particular, I got two email addresses that didn’t work properly with the regular expression; the first is now fixed, the second is not: flameeyes+something@gmail.com (which is an extension of my usual email address) and diego.elio@pettenò.es (which is my email using an IDN domain). Now, the latter is not really well supported out there, because of the security risk of homograph attacks and because of the anti-Unicode developers out there; the former instead is a very common and standard form for addresses and should thus be taken into consideration. If you follow what the article above says, you would also probably go around rejecting or removing the string “nospam” from addresses; a long time ago, my email address when posting on Usenet and elsewhere was nospam@daps.cjb.net … a simple redirect to my main email address, but a lot of spambots decided to remove “nospam” and ended up without a username part to send their email to. Fun, isn’t it?
Now, there is no way to whitelist all the possibly valid email addresses out there; the obvious way would be to just allow a subset, as the article I listed above suggested, and deal with problems on a case by case basis. I don’t like that approach, because a single user who tries to use their (valid) email address and finds it refused is often a scorned user (it happened to me way too many times with just the extension; I don’t normally use the IDN domain).
What is my suggested approach then? Well, since you cannot be sure that an email address is 100% valid, you should work the other way around: reject what you’re sure is invalid, and then let the rest through; the occasional bounced messages are probably less bothersome than users being unable to register or not getting the mail they expect. So how would I do that?
First of all, I’m afraid I have to say I don’t know the email standards well enough to say with certainty what is and what is not allowed, which means I would have to read up on them before writing code to do that; I would probably do it if I needed to, but right now my job does not require it (the code that I “fixed” is already legacy), and thus I most likely won’t be doing this.
Anyway, the general sense of the rules I can be sure enough of is that there are different validation rules for the two parts that compose an email address: the username and the domain (which can actually not be a domain at all, but let’s get back to that later). For instance in the username you can have the + symbol, which selects an extension, but you cannot have it in the domain part. So my first step would be to divide the two parts; this also has the nice side effect of ensuring that the at-symbol is present.
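As a minimal sketch of that first step, the split could look like the following in Python. The function name `split_address` is mine, and splitting at the last at-symbol is an assumption: it presumes that the domain side never contains one, while an unusual quoted username might.

```python
from typing import Tuple


def split_address(address: str) -> Tuple[str, str]:
    """Split an address into (username, domain) at the last at-symbol.

    Raises ValueError when the at-symbol is missing or either side is empty.
    """
    # rpartition returns ("", "", address) when the separator is absent.
    username, at, domain = address.rpartition("@")
    if not at:
        raise ValueError("missing at-symbol")
    if not username or not domain:
        raise ValueError("empty username or domain")
    return username, domain
```

Calling `split_address("flameeyes+something@gmail.com")` gives back the two parts untouched, so each can then go through its own set of checks.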
Now you get to validate the username, with whatever rules there are for allowed characters (I have no idea whether punycode is used or which character classes are needed; for sure they are documented, somewhere). I remember it has to start with letters or numbers, no special characters; I’m not sure if all but a few characters are valid or not. Either way, you can make sure that the username at least looks valid much more easily than you can check a domain.
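To give an idea of what “reject only what you’re sure is invalid” could mean for the username, here is a hedged sketch. The function name is made up, and the checks are only the few I am reasonably confident about: the 64-octet limit on the local part from RFC 5321, and the rule that an unquoted username cannot contain whitespace, control characters, or empty pieces between dots. It deliberately does not enforce the “must start with a letter or number” rule, since I am not sure of that one.

```python
def username_is_clearly_invalid(username: str) -> bool:
    """Return True only when the username is certainly not acceptable.

    Anything not rejected here is let through; this is not a full parser.
    """
    if not username:
        return True
    # RFC 5321 limits the local part to 64 octets.
    if len(username.encode("utf-8")) > 64:
        return True
    # Quoted usernames ("like this"@example.com) follow different rules;
    # this sketch simply gives them the benefit of the doubt.
    if len(username) >= 2 and username.startswith('"') and username.endswith('"'):
        return False
    # Unquoted form: no whitespace or control characters.
    if any(ch.isspace() or ord(ch) < 32 or ord(ch) == 127 for ch in username):
        return True
    # Unquoted form: dots may only separate non-empty pieces.
    if username.startswith(".") or username.endswith(".") or ".." in username:
        return True
    return False
```

Both `flameeyes+something` and `nospam` pass through untouched, which is the whole point of doing it this way around.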
For the domain part, well, you have some more complicated choices; the article above does an offline check on the validity of the domain; this makes sense to a point, but I honestly prefer a more thorough route:
- first, the domain part can actually be a host, in the form of [127.0.0.2], and that case should probably be considered as well;
- second, the domain part can be IDN-based, and if you do support such domains (you should) you probably want to make sure the domain is correctly encoded in punycode; there are IDN libraries out there that can take care of that;
- once you have the domain cleaned up and, where needed, punycode-encoded, you can check with DNS whether it has an MX record or not; this solves all the messy talk that the article above has about supporting .museum or checking against .nospam; DNS resolution is usually fast enough that you can check this inline during registration; given that most users will use the same few email providers, having a local cache like nscd can help a lot;
- if you really want to check that the email address is not absolutely invalid, you may want to try connecting to the mail server and asking whether the user exists; this won’t work on most servers, but you can probably cut down some more invalid addresses this way.
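The first three steps of that list can be sketched roughly as below. This is only an illustration under a few assumptions: the names `normalise_domain` and `domain_accepts_mail` are hypothetical, the punycode conversion uses Python’s built-in `idna` codec (which implements the older IDNA 2003 rules, so a dedicated IDN library may be the better choice), and the DNS query is left to a `has_mx` callable that you would have to supply yourself, for instance a thin wrapper around the DNS library of your choice.

```python
import ipaddress
from typing import Callable


def normalise_domain(domain: str) -> str:
    """Return the domain in a form suitable for a DNS lookup.

    Host literals such as [127.0.0.2] are validated and returned unchanged;
    anything else is converted to its ASCII (punycode) form.
    Raises ValueError when the domain is certainly invalid.
    """
    if domain.startswith("[") and domain.endswith("]"):
        literal = domain[1:-1]
        # SMTP writes IPv6 literals as [IPv6:...]; bare ones are IPv4.
        if literal.lower().startswith("ipv6:"):
            ipaddress.IPv6Address(literal[5:])  # raises ValueError if malformed
        else:
            ipaddress.IPv4Address(literal)  # raises ValueError if malformed
        return domain
    try:
        ascii_domain = domain.encode("idna").decode("ascii")
    except UnicodeError as exc:
        raise ValueError("cannot encode domain: %s" % exc) from exc
    if not ascii_domain:
        raise ValueError("empty domain")
    return ascii_domain.lower()


def domain_accepts_mail(domain: str, has_mx: Callable[[str], bool]) -> bool:
    """Check a domain using has_mx, a caller-supplied (hypothetical) DNS lookup."""
    ascii_domain = normalise_domain(domain)
    if ascii_domain.startswith("["):
        return True  # host literal: there is nothing to look up in DNS
    return has_mx(ascii_domain)
```

A more forgiving `has_mx` might also accept domains that only have an address record, since mail servers fall back to that when no MX record exists. The last step of the list, asking the mail server whether the user exists, is left out here on purpose.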
Now of course the logic to implement here is far from trivial, and it becomes even more problematic if you also want to label the email addresses with the sender’s name, since the RFC gets so obscure that even git fails to meet the specifications that vger (the Kernel.org mailing list server) wants respected! For this reason maybe this is better suited for a library, if one doesn’t exist already; if somebody knows of something that does verify email addresses this way, please do let me know, I’m interested (Ruby and Python mostly, but whatever is good).