regular expression/regex (4): Validating E-mail Addresses with Regular Expressions

    First, here is the e-mail validation code from the article "Regular Expressions and the Java Programming Language":

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/*
 * Checks for invalid characters
 * in email addresses
 */
public class EmailValidation {
   public static void main(String[] args)
                                 throws Exception {
                                
      String input = "@sun.com";
      //Checks for email addresses starting with
      //inappropriate symbols like dots or @ signs.
      Pattern p = Pattern.compile("^\\.|^\\@");
      Matcher m = p.matcher(input);
      if (m.find())
         System.err.println("Email addresses don't start" +
                            " with dots or @ signs.");
      //Checks for email addresses that start with
      //www. and prints a message if it does.
      p = Pattern.compile("^www\\.");
      m = p.matcher(input);
      if (m.find()) {
        System.out.println("Email addresses don't start" +
                " with \"www.\", only web pages do.");
      }
      p = Pattern.compile("[^A-Za-z0-9\\.\\@_\\-~#]+");
      m = p.matcher(input);
      StringBuffer sb = new StringBuffer();
      boolean result = m.find();
      boolean deletedIllegalChars = false;

      while(result) {
         deletedIllegalChars = true;
         m.appendReplacement(sb, "");
         result = m.find();
      }

      // Add the last segment of input to the new String
      m.appendTail(sb);

      input = sb.toString();

      if (deletedIllegalChars) {
         System.out.println("It contained incorrect characters" +
                            ", such as spaces or commas.");
      }
   }
}
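    Anyone familiar with Java will see at once that the code from "Regular Expressions and the Java Programming Language" mainly checks an e-mail address for illegal characters. Since the comments in the code are already clear, the rest of this post covers a different case: using a single regular expression to check directly whether an address is valid.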
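    The illegal-character step in the listing above can also be written more compactly. The following is only a sketch: the class and method names (IllegalCharStripper, strip, hasIllegalChars) are made up for illustration, and the character class is copied from the listing.

```java
import java.util.regex.Pattern;

final class IllegalCharStripper {
    // Same character class as in the listing: anything outside it counts as illegal.
    static final Pattern ILLEGAL = Pattern.compile("[^A-Za-z0-9\\.\\@_\\-~#]+");

    // True if the input contains at least one illegal character.
    static boolean hasIllegalChars(String input) {
        return ILLEGAL.matcher(input).find();
    }

    // Returns the input with every run of illegal characters removed.
    static String strip(String input) {
        return ILLEGAL.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical input containing a space and a comma.
        String input = "jo hn,doe@sun.com";
        if (hasIllegalChars(input)) {
            System.out.println("It contained incorrect characters, such as spaces or commas.");
        }
        System.out.println(strip(input));
    }
}
```

    Matcher.replaceAll does the same work as the appendReplacement/appendTail loop when the replacement is a fixed string.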

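    Changes in Internet domain names have made e-mail address validation harder. For example, there is now a .museum top-level domain, which is quite unlike older top-level domains such as .com, .cn, and .org, so when writing the regex we can no longer simply limit the length of the final part to {2,4}. Once we widen that length, however, another problem arises: invalid addresses are accepted as valid too. Some people therefore describe this situation as a trade-off. Here is a regular expression that can validate 99% of e-mail addresses: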
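    [A-Za-z0-9._%#~-]+@[A-Za-z0-9_.-]+\.[A-Za-z]{2,4}  You can also replace [A-Za-z0-9_] with \w, which gives [\w\.%#~-]+@[\w.-]+\.[A-Za-z]{2,4}. This pattern cannot match .museum, of course. To include .museum, you can use [\w\.%#~-]+@[\w.-]+\.[A-Za-z]{2,6}. Now, however, there is a new problem: this regex also matches [email protected]. It is far more likely that John forgot to type the .com top-level domain than that he created a new .office top-level domain, which ICANN has not approved.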
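    A small sketch of this effect in Java, assuming the two \w patterns above are used with whole-string matching. The class name TldLengthDemo and the sample addresses are hypothetical and serve only to exercise the top-level domain length.

```java
import java.util.regex.Pattern;

final class TldLengthDemo {
    // The two patterns from the text; only the allowed TLD length differs.
    static final Pattern TLD_2_TO_4 =
            Pattern.compile("[\\w\\.%#~-]+@[\\w.-]+\\.[A-Za-z]{2,4}");
    static final Pattern TLD_2_TO_6 =
            Pattern.compile("[\\w\\.%#~-]+@[\\w.-]+\\.[A-Za-z]{2,6}");

    // matches() requires the whole string to match, so no anchors are needed.
    static boolean accepts(Pattern pattern, String address) {
        return pattern.matcher(address).matches();
    }

    public static void main(String[] args) {
        // Hypothetical sample addresses, not taken from the article.
        String[] samples = {
            "info@example.com",
            "curator@example.museum",
            "someone@mail.office"
        };
        for (String s : samples) {
            System.out.println(s
                    + "  {2,4}=" + accepts(TLD_2_TO_4, s)
                    + "  {2,6}=" + accepts(TLD_2_TO_6, s));
        }
    }
}
```

    Widening the quantifier lets the .museum address through, but the address ending in .office passes as well, which is exactly the trade-off described above.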
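    This is yet another trade-off. Do you want the regex to check whether the top-level domain actually exists? Allowing any combination of two to four letters covers nearly all existing and planned top-level domains except .museum. But it also matches addresses with invalid top-level domains, such as [email protected]. In other words, that expression does not restrict the top-level domain very strictly. If you want your regex to be more accurate, you have to update it every time a new top-level domain is created, whether it is a country code or a generic domain.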
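^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.(?:[A-Za-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)$ This expression allows any two-letter country-code top-level domain, plus only a specific set of generic top-level domains. By the time you read this, it may already be out of date. If you use this regex, I recommend storing it as a global constant in your application, so that later you only have to change it in one place. You could list all the country codes in the same way, even though there are about 200 of them.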
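    One possible way to keep this regex in a single place in Java is sketched below. The class name EmailPatterns, the constant names, and the helper isPlausible are assumptions made for illustration; the list of generic top-level domains is copied from the expression above and will go stale in the same way.

```java
import java.util.regex.Pattern;

final class EmailPatterns {
    // Generic TLDs copied from the expression in the text; edit this list when it changes.
    private static final String GENERIC_TLDS =
            "com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum";

    // The one shared pattern: any two-letter TLD, or one of the listed generic TLDs.
    static final Pattern EMAIL = Pattern.compile(
            "^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.(?:[A-Za-z]{2}|" + GENERIC_TLDS + ")$");

    private EmailPatterns() { }

    // matches() checks the whole string against the pattern.
    static boolean isPlausible(String address) {
        return EMAIL.matcher(address).matches();
    }

    public static void main(String[] args) {
        // Hypothetical sample addresses.
        System.out.println(isPlausible("user@example.museum"));
        System.out.println(isPlausible("user@example.de"));
        System.out.println(isPlausible("user@mail.office"));
    }
}
```

    As written, the generic top-level domains are matched case-sensitively, so an address ending in .COM would be rejected unless the pattern is compiled with Pattern.CASE_INSENSITIVE.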
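    E-mail addresses can also be on servers in a subdomain, for example [email protected]. All of the regexes above match this address, because I included a dot in the character class after the @ symbol. However, these regexes also match [email protected], which is not a valid address because of the consecutive dots. To exclude such addresses, you can replace [A-Za-z0-9.-]+\. with (?:[A-Za-z0-9-]+\.)+ in any of the regexes above. I removed the dot from the character class and instead repeat the character class followed by a literal dot. For example, \b[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,4}\b matches [email protected] but not [email protected].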
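    The difference between the two domain parts can be checked with a short Java sketch. The class name ConsecutiveDotsDemo and the sample addresses are hypothetical; the two patterns are the \b...\b form from the text with and without the dot inside the character class.

```java
import java.util.regex.Pattern;

final class ConsecutiveDotsDemo {
    // Dot inside the character class: runs of dots in the domain are accepted.
    static final Pattern LOOSE = Pattern.compile(
            "\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}\\b");

    // One or more "label." groups: two dots in the domain can never be adjacent.
    static final Pattern STRICT = Pattern.compile(
            "\\b[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\\.)+[A-Za-z]{2,4}\\b");

    static boolean accepts(Pattern pattern, String address) {
        return pattern.matcher(address).matches();
    }

    public static void main(String[] args) {
        // Hypothetical sample addresses.
        String subdomain = "john@server.department.example.com";
        String dotted = "john@example...com";
        System.out.println("loose:  " + accepts(LOOSE, subdomain) + " " + accepts(LOOSE, dotted));
        System.out.println("strict: " + accepts(STRICT, subdomain) + " " + accepts(STRICT, dotted));
    }
}
```

    Note that this only tightens the domain part. The part before the @ still uses a character class containing the dot, so consecutive dots there are still accepted.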
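    Another trade-off is that the regex allows only English letters, digits, and a few special symbols. The main reason is that I do not trust all of my e-mail software to handle anything more. Although John.O'[email protected] is a syntactically valid e-mail address, there is a risk that some software will misread the apostrophe as a delimiting quote. For example, blindly inserting this address into an SQL statement will fail, because the single quote is a string delimiter there. And of course, domain names containing non-English characters have been possible for several years. Most software, and even domain name registrars, still stick to the 37 characters they are used to.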
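    Conclusion: to decide which regular expression to use, whether you are trying to match an e-mail address exactly or something more loosely defined, you need to weigh these trade-offs. How bad is it to match an invalid address? How bad is it to miss some valid ones? How complex can your regex be? How costly would it be to change the regex later? Different answers to these questions call for different regular expressions as the solution.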
    Below is a short excerpt on the official e-mail standard, RFC 2822:
The Official Standard: RFC 2822

Maybe you're wondering why there's no "official" fool-proof regex to match email addresses. Well, there is an official definition, but it's hardly fool-proof.

The official standard is known as RFC 2822. It describes the syntax that valid email addresses must adhere to. You can (but you shouldn't--read on) implement it with this regular expression:

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

This regex has two parts: the part before the @, and the part after the @. There are two alternatives for the part before the @: it can either consist of a series of letters, digits and certain symbols, including one or more dots. However, dots may not appear consecutively or at the start or end of the email address. The other alternative requires the part before the @ to be enclosed in double quotes, allowing any string of ASCII characters between the quotes. Whitespace characters, double quotes and backslashes must be escaped with backslashes.

The part after the @ also has two alternatives. It can either be a fully qualified domain name (e.g. regular-expressions.info), or it can be a literal Internet address between square brackets. The literal Internet address can either be an IP address, or a domain-specific routing address.

The reason you shouldn't use this regex is that it only checks the basic syntax of email addresses. [email protected] would be considered a valid email address according to RFC 2822. Obviously, this email address won't work, since there's no "nospam" top-level domain. It also doesn't guarantee your email software will be able to handle it. Not all applications support the syntax using double quotes or square brackets. In fact, RFC 2822 itself marks the notation using square brackets as obsolete.

We get a more practical implementation of RFC 2822 if we omit the syntax using double quotes and square brackets. It will still match 99.99% of all email addresses in actual use today.

[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?

A further change you could make is to allow any two-letter country code top level domain, and only specific generic top level domains. This regex filters dummy email addresses like [email protected]. You will need to update it as new top-level domains are added.

[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)\b
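The regexes in this excerpt use lowercase character classes (and [A-Z]{2} in the last one), so they presumably assume case-insensitive matching. A minimal Java sketch under that assumption follows; the class name Rfc2822Practical, the helper names, and the sample addresses are hypothetical, and the pattern text is assembled from the two practical expressions above.

```java
import java.util.regex.Pattern;

final class Rfc2822Practical {
    // Building blocks copied from the practical expressions in the text.
    private static final String ATOM = "[a-z0-9!#$%&'*+/=?^_`{|}~-]+";
    private static final String LABEL = "[a-z0-9](?:[a-z0-9-]*[a-z0-9])?";
    private static final String LOCAL = ATOM + "(?:\\." + ATOM + ")*";

    // Practical version: dot-separated atoms, then dot-separated domain labels.
    static final Pattern ANY_TLD = Pattern.compile(
            LOCAL + "@(?:" + LABEL + "\\.)+" + LABEL,
            Pattern.CASE_INSENSITIVE);

    // Restricted version: two-letter country codes plus the listed generic TLDs.
    static final Pattern LISTED_TLD = Pattern.compile(
            LOCAL + "@(?:" + LABEL + "\\.)+"
                    + "(?:[A-Z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)\\b",
            Pattern.CASE_INSENSITIVE);

    static boolean accepts(Pattern pattern, String address) {
        return pattern.matcher(address).matches();
    }

    public static void main(String[] args) {
        // Hypothetical sample addresses.
        String[] samples = {
            "John.Smith@Example.COM",
            "john..smith@example.com",
            "user@example.nospam"
        };
        for (String s : samples) {
            System.out.println(s
                    + "  any=" + accepts(ANY_TLD, s)
                    + "  listed=" + accepts(LISTED_TLD, s));
        }
    }
}
```

Without Pattern.CASE_INSENSITIVE, the first pattern would reject any address containing uppercase letters, and the [A-Z]{2} branch of the second would only accept uppercase country codes.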

So even when following official standards, there are still trade-offs to be made. Don't blindly copy regular expressions from online libraries or discussion forums. Always test them on your own data and with your own applications.
