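Regular expressions in JDK 6 and earlier do not support named capturing groups, so a captured group can only be accessed by its index. When an expression is complex and mixes many capturing and non-capturing groups, counting parentheses from left to right to find a group's index is tedious. It also hurts readability, and any change to the expression can shift the indexes of the groups inside it. The fix is to give capturing groups names, as regular expressions in Python, PHP, .NET and Perl already allow.
JDK 7 introduces the following support for named capturing groups: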
(1) (?<NAME>X) to define a named group "NAME"
(2) \k<NAME> to backreference a named group "NAME"
(3) ${NAME} to reference a captured group in the matcher's replacement string
(4) group(String NAME) to return the input subsequence captured by the given named group
Here are two examples:
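// The methods in this post assume these imports and an enclosing class.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static void indexedCaptureTest() { // the approach used in JDK 6 and earlier
    String names = "fred or barney";
    Matcher m = Pattern.compile("(\\w+) or (\\w+)").matcher(names);
    if (m.find()) {
        // Groups are addressed by their position in the pattern.
        System.out.println(m.group(1) + "," + m.group(2));
    }
}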
public static void namedCaptureTest() { // JDK 7 lets you name capturing groups
    String names = "fred or barney";
    Matcher m = Pattern.compile("(?<name1>\\w+) or (?<name2>\\w+)").matcher(names);
    if (m.find()) {
        // Groups are addressed by the names declared in the pattern.
        System.out.println(m.group("name1") + "," + m.group("name2"));
    }
}
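As a small supplement to the two methods above, the sketch below checks two things about group(String): a named group is still numbered, so it can be read by index as well, and asking for a name that the pattern does not declare throws IllegalArgumentException. The class name NamedGroupAccess, its method names, and the undeclared name "name3" are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedGroupAccess {
    // Same pattern as namedCaptureTest: two named groups around the word "or".
    private static final Pattern PAIR =
            Pattern.compile("(?<name1>\\w+) or (?<name2>\\w+)");

    // True when each named group returns the same text as its numeric index.
    static boolean sameByNameAndIndex(String input) {
        Matcher m = PAIR.matcher(input);
        return m.find()
                && m.group("name1").equals(m.group(1))
                && m.group("name2").equals(m.group(2));
    }

    // True when looking up an undeclared group name is rejected.
    static boolean unknownNameThrows(String input) {
        Matcher m = PAIR.matcher(input);
        if (!m.find()) {
            return false;
        }
        try {
            m.group("name3"); // "name3" is not declared in PAIR
            return false;
        } catch (IllegalArgumentException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(sameByNameAndIndex("fred or barney"));
        System.out.println(unknownNameThrows("fred or barney"));
    }
}
```

Because index access keeps working, existing code can be migrated to names one group at a time.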
Now an example that uses backreferences and replacement strings:
String input = "aabbbccdddef";
How can this string be split into the array [aa, bbb, cc, ddd, e, f]?
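public static void indexedCaptureReplace() {
    String input = "aabbbccdddef";
    // Group 2 is the last character matched; the lookahead stops the run
    // as soon as the next character differs from it.
    String regex = "((.)+?)(?!\\2)";
    String temp = input.replaceAll(regex, "$1,");
    String[] arr = temp.split(",");
    System.out.println(java.util.Arrays.toString(arr));
}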
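public static void namedCaptureReplace() {
    String input = "aabbbccdddef";
    // name1 is the whole run, name2 is its last character. An ugly implementation!
    String regex = "(?<name1>(?<name2>.)+?)(?!\\k<name2>)";
    // Emit the whole run (name1), matching "$1," in the indexed version.
    String temp = input.replaceAll(regex, "${name1},");
    String[] arr = temp.split(",");
    System.out.println(java.util.Arrays.toString(arr));
}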
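If the replace-then-split detour feels awkward, the same runs can be collected directly with Matcher.find(). This is an alternative sketch, not the approach used above: it swaps the lookahead for a greedy backreference, and the class name RunSplitter, the method splitRuns and the group name "ch" are chosen only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RunSplitter {
    // (?<ch>.) captures one character; \k<ch>* then consumes every
    // immediate repetition of that same character.
    private static final Pattern RUN = Pattern.compile("(?<ch>.)\\k<ch>*");

    static List<String> splitRuns(String input) {
        List<String> runs = new ArrayList<>();
        Matcher m = RUN.matcher(input);
        while (m.find()) {
            runs.add(m.group()); // the whole match is one run of equal characters
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(splitRuns("aabbbccdddef")); // prints [aa, bbb, cc, ddd, e, f]
    }
}
```

This version avoids the intermediate comma-joined string, so it also behaves sensibly when the input itself contains commas.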
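Reference: http://www.iteye.com/news/6195. Note, however, that the actual JDK implementation differs from what is described there, mainly in the ${} syntax.
From the documentation of the Pattern class: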
Back references
\n Whatever the nth capturing group matched
\k<name> Whatever the named-capturing group "name" matched
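To make the \k<name> entry concrete, here is a minimal sketch that uses a named backreference to find a word typed twice in a row. The class name RepeatedWord, the method firstRepeatedWord and the group name "word" are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedWord {
    // \k<word> must match exactly the text that the group "word" captured.
    private static final Pattern DOUBLED =
            Pattern.compile("\\b(?<word>\\w+)\\s+\\k<word>\\b");

    // Returns the first word that is immediately repeated, or null if none is.
    static String firstRepeatedWord(String text) {
        Matcher m = DOUBLED.matcher(text);
        return m.find() ? m.group("word") : null;
    }

    public static void main(String[] args) {
        System.out.println(firstRepeatedWord("this is is a test")); // prints is
    }
}
```

The word boundaries keep the pattern from matching the "is" inside "this".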
From the documentation of Matcher's public Matcher appendReplacement(StringBuffer sb, String replacement):
The replacement string may contain references to subsequences captured during the previous match:
Each occurrence of ${name} or $g will be replaced by the result of evaluating the corresponding group(name) or group(g) respectively.
For $g, the first number after the $ is always treated as part of the group reference. Subsequent numbers are incorporated into g if they would form a legal group reference. Only the numerals '0' through '9' are considered as potential components of the group reference.
If the second group matched the string "foo", for example, then passing the replacement string "$2bar" would cause "foobar" to be appended to the string buffer. A dollar sign ($) may be included as a literal in the replacement string by preceding it with a backslash (\$).
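The replacement rules quoted above can be tried out with String.replaceAll, which follows the same replacement-string syntax. The sketch below covers ${name}, $g followed by literal text, and an escaped dollar sign; the class name ReplacementDemo, its method names and the group names are invented for this illustration.

```java
public class ReplacementDemo {
    // ${name}: swap the two captured words by referring to them by name.
    static String swap(String s) {
        return s.replaceAll("(?<first>\\w+) or (?<second>\\w+)", "${second} or ${first}");
    }

    // $g: "$2bar" is group 2 followed by the literal text "bar",
    // because 'b' is not a digit and so ends the group reference.
    static String secondPlusBar(String s) {
        return s.replaceAll("(\\w+) (\\w+)", "$2bar");
    }

    // \$: the Java literal "\\$" is the two characters \$ in the replacement,
    // which yields a literal dollar sign; ${amount} is then expanded as usual.
    static String toPrice(String s) {
        return s.replaceAll("(?<amount>\\d+) dollars", "\\$${amount}");
    }

    public static void main(String[] args) {
        System.out.println(swap("fred or barney"));   // prints barney or fred
        System.out.println(secondPlusBar("x foo"));   // prints foobar
        System.out.println(toPrice("10 dollars"));    // prints $10
    }
}
```

When the replacement text comes from user input and should be taken literally, Matcher.quoteReplacement(String) escapes the dollar signs and backslashes for you.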