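A few days ago a colleague asked me why a regular expression was not matching any data, even though it checked out fine in an online tool. I was stumped at first: I had not worked with regex for so long that I had forgotten how the API classes are used. My first reaction was to step into the source to find the cause, and reading it left me even more puzzled. The problem did get solved, but it also made me curious about how Java parses and processes regular expressions.
A simple regular expression example: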
    @Test
    public void testDemo() {
        String str = "this is my test 52, i want find all number";
        Pattern patter = Pattern.compile("\\d+");
        Matcher matcher = patter.matcher(str);
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }

The example above extracts every integer from a string using a regular expression and prints it. So how is a program like this handled under the hood?
First, Pattern.compile creates a new Pattern object:
    public static Pattern compile(String regex) {
        return new Pattern(regex, 0);
    }

    private Pattern(String p, int f) {
        pattern = p;
        flags = f;

        // Reset group index count
        capturingGroupCount = 1;
        localCount = 0;

        if (pattern.length() > 0) {
            compile();
        } else {
            root = new Start(lastAccept);
            matchRoot = lastAccept;
        }
    }
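As you can see, the Pattern constructor is private, so a Pattern object can only be created through the compile methods. The parameter f holds the match flags, a bitmask that may include CASE_INSENSITIVE, MULTILINE, DOTALL, UNICODE_CASE, CANON_EQ, UNIX_LINES, LITERAL and COMMENTS.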
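To make the bitmask concrete, here is a small sketch of passing flags through the public two-argument Pattern.compile(String, int). The class name FlagsDemo and its helper method are made up for illustration; only Pattern and its flag constants come from the JDK.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the JDK source discussed in this post.
class FlagsDemo {
    // Compiles regex with the given flag bitmask and reports whether it occurs in input.
    static boolean findWithFlags(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        // Flags are combined with bitwise OR into a single int.
        int flags = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE;
        System.out.println(findWithFlags("^hello$", flags, "first\nHELLO\nlast"));
        // Without the flags, ^ and $ only anchor to the whole input and case matters.
        System.out.println(findWithFlags("^hello$", 0, "first\nHELLO\nlast"));
        // With LITERAL the pattern text is matched character by character,
        // so "\\d+" means a backslash, a 'd' and a '+', not "one or more digits".
        System.out.println(findWithFlags("\\d+", Pattern.LITERAL, "52"));
        System.out.println(findWithFlags("\\d+", Pattern.LITERAL, "a\\d+b"));
    }
}
```

Passing 0, as the one-argument compile does, simply means that no flag bit is set.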
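The constructor initializes the basic fields of the class. Because the pattern in the unit test above is non-empty, execution goes straight into the compile() method. What does that method do?
Part of the compile method's code: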
        if (! has(LITERAL))
            RemoveQEQuoting();

        // Allocate all temporary objects here.
        buffer = new int[32];
        groupNodes = new GroupHead[10];

        if (has(LITERAL)) {
            // Literal pattern handling
            matchRoot = newSlice(temp, patternLength, hasSupplementary);
            matchRoot.next = lastAccept;
        } else {
            // Start recursive descent parsing
            matchRoot = expr(lastAccept);
            // Check extra pattern characters
            if (patternLength != cursor) {
                if (peek() == ')') {
                    throw error("Unmatched closing ')'");
                } else {
                    throw error("Unexpected internal error");
                }
            }
        }

        // Peephole optimization
        if (matchRoot instanceof Slice) {
            root = BnM.optimize(matchRoot);
            if (root == matchRoot) {
                root = hasSupplementary ? new StartS(matchRoot) : new Start(matchRoot);
            }
        } else if (matchRoot instanceof Begin || matchRoot instanceof First) {
            root = matchRoot;
        } else {
            root = hasSupplementary ? new StartS(matchRoot) : new Start(matchRoot);
        }

        // Release temporary storage
        temp = null;
        buffer = null;
        groupNodes = null;
        patternLength = 0;
        compiled = true;
    }
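The line matchRoot = expr(lastAccept), under the comment "Start recursive descent parsing", begins the recursive parse of the supplied pattern. A pattern such as \\d+ is parsed, wrapped into a Node object, and assigned to matchRoot.
The source of expr: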
    private Node expr(Node end) {
        Node prev = null;
        Node firstTail = null;
        Node branchConn = null;

        for (;;) {
            Node node = sequence(end); // this is where the Node objects are actually built
            Node nodeTail = root;      // double return
            if (prev == null) {
                prev = node;
                firstTail = nodeTail;
            } else {
                // Branch
                if (branchConn == null) {
                    branchConn = new BranchConn();
                    branchConn.next = end;
                }
                if (node == end) {
                    // if the node returned from sequence() is "end"
                    // we have an empty expr, set a null atom into
                    // the branch to indicate to go "next" directly.
                    node = null;
                } else {
                    // the "tail.next" of each atom goes to branchConn
                    nodeTail.next = branchConn;
                }
                if (prev instanceof Branch) {
                    ((Branch)prev).add(node);
                } else {
                    if (prev == end) {
                        prev = null;
                    } else {
                        // replace the "end" with "branchConn" at its tail.next
                        // when put the "prev" into the branch as the first atom.
                        firstTail.next = branchConn;
                    }
                    prev = new Branch(prev, node, branchConn);
                }
            }
            // check whether the next pattern character is '|'
            if (peek() != '|') {
                return prev;
            }
            next();
        }
    }
The sequence method:
    private Node sequence(Node end) {
        Node head = null;
        Node tail = null;
        Node node = null;
    LOOP:
        for (;;) {
            int ch = peek();
            switch (ch) {
            case '(':
                // Because group handles its own closure,
                // we need to treat it differently
                node = group0();
                // Check for comment or flag group
                if (node == null)
                    continue;
                if (head == null)
                    head = node;
                else
                    tail.next = node;
                // Double return: Tail was returned in root
                tail = root;
                continue;
            case '[':
                node = clazz(true);
                break;
            case '\\':
                ch = nextEscaped();
                if (ch == 'p' || ch == 'P') {
                    boolean oneLetter = true;
                    boolean comp = (ch == 'P');
                    ch = next(); // Consume { if present
                    if (ch != '{') {
                        unread();
                    } else {
                        oneLetter = false;
                    }
                    node = family(oneLetter).maybeComplement(comp);
                } else {
                    unread();
                    node = atom();
                }
                break;
            case '^':
                next();
                if (has(MULTILINE)) {
                    if (has(UNIX_LINES))
                        node = new UnixCaret();
                    else
                        node = new Caret();
                } else {
                    node = new Begin();
                }
                break;
            case '$':
                next();
                if (has(UNIX_LINES))
                    node = new UnixDollar(has(MULTILINE));
                else
                    node = new Dollar(has(MULTILINE));
                break;
            case '.':
                next();
                if (has(DOTALL)) {
                    node = new All();
                } else {
                    if (has(UNIX_LINES))
                        node = new UnixDot();
                    else {
                        node = new Dot();
                    }
                }
                break;
            case '|':
            case ')':
                break LOOP;
            case ']': // Now interpreting dangling ] and } as literals
            case '}':
                node = atom();
                break;
            case '?':
            case '*':
            case '+':
                next();
                throw error("Dangling meta character '" + ((char)ch) + "'");
            case 0:
                if (cursor >= patternLength) {
                    break LOOP;
                }
                // Fall through
            default:
                node = atom();
                break;
            }

            node = closure(node);

            if (head == null) {
                head = tail = node;
            } else {
                tail.next = node;
                tail = node;
            }
        }
        if (head == null) {
            return end;
        }
        tail.next = end;
        root = tail; // double return
        return head;
    }
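The two error paths shown so far, the "Dangling meta character" case in sequence and the "Unmatched closing ')'" check in compile, can be observed from the outside through PatternSyntaxException. The sketch below is only an illustration: SyntaxErrorDemo and describeError are names I made up, and the exact description text may differ between JDK versions, so the program just prints whatever the parser reports.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class for observing parser errors.
class SyntaxErrorDemo {
    // Returns the parser's error description, or null if the pattern compiles.
    static String describeError(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            return e.getDescription();
        }
    }

    public static void main(String[] args) {
        // A quantifier with nothing in front of it hits the '?', '*', '+' case in sequence.
        System.out.println(describeError("+\\d"));
        // sequence stops at ')', expr returns, and compile finds leftover pattern text.
        System.out.println(describeError("\\d+)"));
        // A valid pattern compiles, so there is no description to report.
        System.out.println(describeError("\\d+"));
    }
}
```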
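In expr, when the check if (peek() != '|') is true, the next pattern character is not |, so the Node is returned directly. Otherwise the loop continues and assembles a Branch object. The Branch class looks like this: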
    static final class Branch extends Node {
        Node[] atoms = new Node[2];
        int size = 2;
        Node conn;

        Branch(Node first, Node second, Node branchConn) {
            conn = branchConn;
            atoms[0] = first;
            atoms[1] = second;
        }

        void add(Node node) {
            if (size >= atoms.length) {
                Node[] tmp = new Node[atoms.length*2];
                System.arraycopy(atoms, 0, tmp, 0, atoms.length);
                atoms = tmp;
            }
            atoms[size++] = node;
        }

        boolean match(Matcher matcher, int i, CharSequence seq) {
            for (int n = 0; n < size; n++) {
                if (atoms[n] == null) {
                    if (conn.next.match(matcher, i, seq))
                        return true;
                } else if (atoms[n].match(matcher, i, seq)) {
                    return true;
                }
            }
            return false;
        }

        boolean study(TreeInfo info) {
            int minL = info.minLength;
            int maxL = info.maxLength;
            boolean maxV = info.maxValid;

            int minL2 = Integer.MAX_VALUE; // arbitrary large enough num
            int maxL2 = -1;
            for (int n = 0; n < size; n++) {
                info.reset();
                if (atoms[n] != null)
                    atoms[n].study(info);
                minL2 = Math.min(minL2, info.minLength);
                maxL2 = Math.max(maxL2, info.maxLength);
                maxV = (maxV & info.maxValid);
            }

            minL += minL2;
            maxL += maxL2;

            info.reset();
            conn.next.study(info);

            info.minLength += minL;
            info.maxLength += maxL;
            info.maxValid &= maxV;
            info.deterministic = false;
            return false;
        }
    }
When prev is already a Branch, expr appends each further alternative to it through add:

    if (prev instanceof Branch) {
        ((Branch)prev).add(node);
    }
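Branch.match walks atoms in order and returns at the first alternative that lets the rest of the pattern succeed. One visible consequence is that alternation is ordered rather than longest-match. The following sketch shows this through the public API only; BranchOrderDemo and firstMatch are hypothetical names used for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for alternation order.
class BranchOrderDemo {
    // Returns the first match of regex in input, or null when nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The first alternative "a" already succeeds, so "ab" is never tried.
        System.out.println(firstMatch("a|ab", "ab"));     // a
        // Swapping the alternatives changes the result.
        System.out.println(firstMatch("ab|a", "ab"));     // ab
        // If what follows the group fails, the next alternative is tried.
        System.out.println(firstMatch("(a|ab)c", "abc")); // abc
    }
}
```

In the last case "a" matches first, but "c" cannot follow it, so the matcher falls back to the second alternative "ab".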
That completes the execution of Pattern patter = Pattern.compile("\\d+");. What does the next step, Matcher matcher = patter.matcher(str);, do?
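    public Matcher matcher(CharSequence input) {
        if (!compiled) {
            synchronized(this) {
                if (!compiled)
                    compile();
            }
        }
        Matcher m = new Matcher(this, input);
        return m;
    }

The compiled field defaults to false, but it is set to true at the end of compile(), which Pattern.compile has already run through the constructor (so the pattern has been processed by this point). matcher then creates a Matcher object and hands it the Pattern and the string to match, which are stored in the Matcher's fields.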
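Since the expensive parsing happens once in compile and matcher() only allocates a new Matcher, a common way to use the API is to compile the Pattern once and create a fresh Matcher per input. The Javadoc states that Pattern instances are immutable and safe for concurrent use, while Matcher instances are not. The sketch below assumes nothing beyond the public API; PatternReuseDemo and numbersIn are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: one compiled Pattern, one Matcher per input.
class PatternReuseDemo {
    // Compiled once; the Node graph built by compile() is reused for every input.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Collects every run of digits in input, in order of appearance.
    static List<String> numbersIn(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = DIGITS.matcher(input); // cheap: only allocates matcher state
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(numbersIn("this is my test 52, i want find all number")); // [52]
        System.out.println(numbersIn("7 apples and 12 pears"));                      // [7, 12]
    }
}
```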
The Matcher constructors:
    Matcher() {
    }

    /**
     * All matchers have the state used by Pattern during a match.
     */
    Matcher(Pattern parent, CharSequence text) {
        this.parentPattern = parent;
        this.text = text;

        // Allocate state storage
        int parentGroupCount = Math.max(parent.capturingGroupCount, 10);
        groups = new int[parentGroupCount * 2];
        locals = new int[parent.localCount];

        // Put fields into initial states
        reset();
    }
The Matcher constructor calls reset, whose source is:
    public Matcher reset() {
        first = -1;
        last = 0;
        oldLast = -1;
        for(int i=0; i<groups.length; i++)
            groups[i] = -1;
        for(int i=0; i<locals.length; i++)
            locals[i] = -1;
        lastAppendPosition = 0;
        from = 0;
        to = getTextLength();
        return this;
    }
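reset is also public, so the same state wipe can be triggered by hand: first and last go back to their initial values and the next find() starts from the beginning of the input again. Here is a minimal sketch of that behavior, with ResetDemo and findFindResetFind as made-up names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class showing the effect of Matcher.reset().
class ResetDemo {
    // Finds twice, resets, finds once more, and joins the matches with spaces.
    static String findFindResetFind(String input) {
        Matcher m = Pattern.compile("\\d+").matcher(input);
        StringBuilder sb = new StringBuilder();
        if (m.find()) sb.append(m.group());             // first match
        if (m.find()) sb.append(' ').append(m.group()); // second match
        m.reset();                                      // back to the initial state
        if (m.find()) sb.append(' ').append(m.group()); // first match again
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(findFindResetFind("10 and 20")); // 10 20 10
    }
}
```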
With the Matcher in hand, the next step is to apply the compiled pattern, locate each match in the input string, and print it:
    while (matcher.find()) {
        System.out.println(matcher.group());
    }

The source of matcher.find():
    public boolean find() {
        int nextSearchIndex = last;
        if (nextSearchIndex == first)
            nextSearchIndex++;

        // If next search starts before region, start it at region
        if (nextSearchIndex < from)
            nextSearchIndex = from;

        // If next search starts beyond region then it fails
        if (nextSearchIndex > to) {
            for (int i = 0; i < groups.length; i++)
                groups[i] = -1;
            return false;
        }
        return search(nextSearchIndex);
    }

    boolean search(int from) {
        this.hitEnd = false;
        this.requireEnd = false;
        from = from < 0 ? 0 : from;
        this.first = from;
        this.oldLast = oldLast < 0 ? from : oldLast;
        for (int i = 0; i < groups.length; i++)
            groups[i] = -1;
        acceptMode = NOANCHOR;
        boolean result = parentPattern.root.match(this, from, text);
        if (!result)
            this.first = -1;
        this.oldLast = this.last;
        return result;
    }
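The nextSearchIndex logic in find() is easier to follow with concrete positions. Each call resumes at last, the end of the previous match. When the previous match was empty (first equals last), the index is bumped by one so the loop cannot get stuck on the same spot. The sketch below prints the start and end of every match; FindPositionsDemo and spans are names invented for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class that records where each find() call matched.
class FindPositionsDemo {
    // Returns every match as a half-open span "[start,end)", concatenated.
    static String spans(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            sb.append('[').append(m.start()).append(',').append(m.end()).append(')');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Each search resumes where the previous match ended.
        System.out.println(spans("\\d+", "a12b345")); // [1,3)[4,7)
        // \d* can match the empty string: empty at 0, "1" at 1..2, empty at 2.
        System.out.println(spans("\\d*", "a1"));      // [0,0)[1,2)[2,2)
    }
}
```

After the last empty match at index 2, nextSearchIndex becomes 3, which is past the end of the region, so find() clears groups and returns false.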
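In the call parentPattern.root.match(this, from, text), parentPattern.root is the Node graph that Pattern assembled from the pattern string. Its match method searches for the match position in the input string, records the match boundaries in the Matcher's groups array, and returns true. matcher.group(), which delegates to group(0), then reads the indices stored in groups and returns the corresponding substring: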
    public String group(int group) {
        if (first < 0)
            throw new IllegalStateException("No match found");
        if (group < 0 || group > groupCount())
            throw new IndexOutOfBoundsException("No group " + group);
        if ((groups[group*2] == -1) || (groups[group*2+1] == -1))
            return null;
        return getSubSequence(groups[group * 2], groups[group * 2 + 1]).toString();
    }
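The three branches of group(int) can each be triggered from ordinary code: the IllegalStateException when no match has been found yet, the null return for a group that did not take part in the match (its slots in groups are still -1), and the normal substring. A small sketch follows, with GroupDemo and its helpers as hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class exercising Matcher.group(int).
class GroupDemo {
    // Lists group 0..groupCount() of the first match as "index=value" pairs.
    static String describeGroups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return "no match";
        }
        StringBuilder sb = new StringBuilder();
        for (int g = 0; g <= m.groupCount(); g++) {
            if (g > 0) sb.append(';');
            sb.append(g).append('=').append(m.group(g)); // null prints as "null"
        }
        return sb.toString();
    }

    // Calling group() before any successful find() leaves first < 0.
    static boolean groupBeforeFindThrows() {
        Matcher m = Pattern.compile("\\d+").matcher("52");
        try {
            m.group();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(describeGroups("(\\d+)-(\\d+)", "range 10-20")); // 0=10-20;1=10;2=20
        System.out.println(describeGroups("(a)|(b)", "b"));                 // 0=b;1=null;2=b
        System.out.println(groupBeforeFindThrows());                        // true
    }
}
```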