Recipe 1.19. Validating an Email Address
===
Discussion
Most email address validation is done with naive regular expressions like the ones given above. Unfortunately, these regular expressions are usually written too strictly, and reject many email addresses. This is a common source of frustration for people with unusual email addresses like joe(and-mary)@example.museum, or people taking advantage of special features of email, as in [email protected]. The regular expressions given above err on the opposite side: they'll accept some syntactically invalid email addresses, but they won't reject valid addresses.
Why not give a simple regular expression that always works? Because there's no such thing. The definition of the syntax is anything but simple. Perl hacker Paul Warren defined a 6343-character regular expression for Perl's Mail::RFC822::Address module, and even it needs some preprocessing to accept absolutely every allowable email address. Warren's regular expression will work unaltered in Ruby, but if you really want it, you should go online and find it, because it would be foolish to try to type it in.
Check validity, not correctness
Even given a regular expression or other tool that infallibly separates the RFC822-compliant email addresses from the others, you can't check the validity of an email address just by looking at it; you can only check its syntactic correctness.
It's easy to mistype your username or domain name, giving out a perfectly valid email address that belongs to someone else. It's trivial for a malicious user to make up a valid email address that doesn't work at all; I did it earlier with the [email protected] nonsense. !@ is a valid email address according to the regexp test, but no one in this universe uses it. You can't even compare the top-level domain of an address against a static list, because new top-level domains are always being added. Syntactic validation of email addresses is an enormous amount of work that only solves a small portion of the problem.
The only way to be certain that an email address is valid is to successfully send email to it. The only way to be certain that an email address is the right one is to send email to it and get the recipient to respond. You need to weigh this additional work (yours and the user's) against the real value of a verified email address.
It used to be that a user's email address was closely associated with their online identity: most people had only the email address their ISP gave them. Thanks to today's free web-based email, that's no longer true. Email verification no longer works to prevent duplicate accounts or to stop antisocial behavior online, if it ever did.
This is not to say that it's never useful to have a user's working email address, or that there's no problem if people mistype their email addresses. To improve the quality of the addresses your users enter, without rejecting valid addresses, you can do three things beyond verifying with the permissive regular expressions given above:
1. Use a second naive regular expression, more restrictive than the ones given above, but don't prohibit addresses that don't match. Only use the second regular expression to advise the user that they may have mistyped their email address. This is not as useful as it seems, because most typos involve changing one letter for another, rather than introducing nonalphanumerics where they don't belong.
def probably_valid?(email)
  valid = '[A-Za-z\d.+-]+' # Commonly encountered email address characters
  (email =~ /#{valid}@#{valid}\.#{valid}/) == 0
end
#These give the correct result.
probably_valid? '[email protected]' # => true
probably_valid? '[email protected]' # => true
probably_valid? '[email protected]' # => true
probably_valid? 'joe@examplecom' # => false
probably_valid? '[email protected]' # => true
probably_valid? 'joe@localhost' # => false
# This address is valid, but probably_valid thinks it's not.
probably_valid? 'joe(and-mary)@example.museum' # => false
# This address is valid, but certainly wrong.
probably_valid? '[email protected]' # => true
2. Extract from the alleged email address the hostname (the "example.com" of [email protected]), and do a DNS lookup to see if that hostname accepts email. A hostname that has an MX DNS record is set up to receive mail. The following code will catch most domain name misspellings, but it won't catch any username misspellings. It's also not guaranteed to parse the hostname correctly, again because of the complexity of RFC822.
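require 'resolv'
def valid_email_host?(email)
  # Treat everything after the last @ as the hostname.
  at = email.rindex('@')
  return false if at.nil?            # no @ at all
  hostname = email[at + 1..-1]
  return false if hostname.empty?    # nothing after the @
  begin
    Resolv::DNS.new.getresource(hostname, Resolv::DNS::Resource::IN::MX)
    true
  rescue Resolv::ResolvError
    false                            # no MX record found
  end
end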
#example.com is a real domain, but it won't accept mail
valid_email_host?('[email protected]') # => false
#lcqkxjvoem.mil is not a real domain.
valid_email_host?('[email protected]') # => false
#oreilly.com exists and accepts mail, though there might not be a 'joe' there.
valid_email_host?('[email protected]') # => true
3. Send email to the address the user input, and ask the user to verify receipt. For instance, the email might contain a verification URL for the user to click on. This is the only way to guarantee that the user entered a valid email address that they control. See Recipes 14.5 and 15.19 for this.
This is overkill much of the time. It requires that you add special workflow to your application, it significantly raises the barriers to use of your application, and it won't always work. Some users have spam filters that will treat your test mail as junk, or whitelist email systems that reject all email from unknown sources. Unless you really need a user's working email address for your application to work, very simple email validation should suffice.
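If you do decide you need verification, the bookkeeping behind a verification URL might look something like the following. This is only a minimal sketch: the EmailVerifier class, its method names, the in-memory hashes, the joe@example.com address, and the http://www.example.com/verify URL are all made up for illustration, and it doesn't send any mail (see the recipes mentioned above for that). A real application would keep pending tokens in its database and expire the old ones.

```ruby
require 'securerandom'

# Hypothetical helper: remembers which addresses are awaiting
# confirmation and which have been confirmed. Everything lives in
# memory, so it is only suitable as an illustration.
class EmailVerifier
  def initialize(base_url)
    @base_url = base_url
    @pending = {}    # token => email address awaiting confirmation
    @verified = []   # addresses whose owners followed their link
  end

  # Records the address as pending and returns the URL to mail to it.
  def verification_url_for(email)
    token = SecureRandom.hex(16)   # 32 random hex characters
    @pending[token] = email
    "#{@base_url}?token=#{token}"
  end

  # Call this when someone visits the URL. Returns the confirmed
  # address, or nil if the token is unknown or was already used.
  def confirm(token)
    email = @pending.delete(token)
    @verified << email if email
    email
  end

  def verified?(email)
    @verified.include?(email)
  end
end

verifier = EmailVerifier.new('http://www.example.com/verify')
url = verifier.verification_url_for('joe@example.com')
token = url[/token=(\h+)\z/, 1]          # what the web request would carry

verifier.verified?('joe@example.com')    # => false
verifier.confirm(token)                  # => "joe@example.com"
verifier.verified?('joe@example.com')    # => true
verifier.confirm(token)                  # => nil
```

Each token works only once, so a link that is forwarded or revisited after use confirms nothing further.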
See Also
* Recipe 14.5, "Sending Mail"
* Recipe 15.19, "Sending Mail with Rails"
* See the amazing colossal regular expression for email addresses at http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html