If you've struggled with regular expressions that took hours to match when you needed them to complete in seconds, this article is for you. Java developer Cristian Mocanu explains where and why the regex pattern-matching engine tends to stall, then shows you how to make the most of backtracking rather than getting lost in it, how to optimize greedy and reluctant quantifiers, and why possessive quantifiers, independent grouping, and lookarounds are your friends.
Writing a regular expression is more than a skill -- it's an art.
-- Jeffrey Friedl
In this article I introduce some of the common weaknesses in regular expressions using the default java.util.regex package. I explain why backtracking is both the foundation of pattern matching with regular expressions and a frequent bottleneck in application code, why you should exercise caution when using greedy and reluctant quantifiers, and why it is essential to benchmark your regex optimizations. I then introduce several techniques for optimizing regular expressions, and discuss what happens when I run my new expressions through the Java pattern-matching engine.
For the purpose of this article I assume that you already have some experience using regular expressions and are most interested in learning how to optimize them in Java code. Topics covered include simple and automated optimization techniques as well as how to optimize greedy and reluctant quantifiers using possessive quantifiers, independent grouping, and lookarounds. See the Resources section for an introduction to regular expressions in Java.
Notation |
---|
I use double quotes ("") to delimit regular expressions and input strings, X, Y, Z to denote regular sub-expressions or a portion of a regular expression, and a, b, c, d (and so on) to denote single characters. |
The java.util.regex package uses a type of pattern-matching engine called a Nondeterministic Finite Automaton, or NFA. It's called nondeterministic because while trying to match a regular expression on a given string, each character in the input string might be checked several times against different parts of the regular expression. This is a widely used type of engine also found in .NET, PHP, Perl, Python, and Ruby. It puts great power into the hands of the programmer, offering a wide range of quantifiers and other special constructs such as lookarounds, which I'll discuss later in the article.
At heart, the NFA uses backtracking. Usually there isn't only one way to apply a regular expression on a given string, so the pattern-matching engine will try to exhaust all possibilities until it declares failure. To better understand the NFA and backtracking, consider the following example:
The regular expression is "sc(ored|ared|oring)x"
The input string is "scared"
First, the engine will look for "sc" and find it immediately as the first two characters in the input string. It will then try to match "ored" starting from the third character in the input string. That won't match, so it will go back to the third character and try "ared". This will match, so it will go forward and try to match "x". Finding no match there, it will go back again to the third character and search for "oring". This won't match either, and so it will go back to the second character in the input string and try to search for another "sc". Upon reaching the end of the input string it will declare failure.
With the above example you've seen how the NFA uses backtracking for pattern matching, and you've also discovered one of the problems with backtracking. Even in the simple example above the engine had to backtrack several times while trying to match the input string to the regular expression. It's easy to imagine what could happen to your application performance if backtracking got out of hand. An important part of optimizing a regular expression is minimizing the amount of backtracking that it does.
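The "scared" walk-through above can be reproduced with a few lines of Java. This is only a minimal sketch; the class and method names are illustrative and do not come from the original article.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class name; not part of java.util.regex.
class BacktrackingDemo {

    // Returns true if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find();
    }

    public static void main(String[] args) {
        // The engine tries each alternative, backtracks, and finally gives up.
        System.out.println(found("sc(ored|ared|oring)x", "scared")); // false
        // Without the trailing "x" the second alternative is enough.
        System.out.println(found("sc(ored|ared|oring)", "scared"));  // true
    }
}
```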
The Java pattern-matching engine has several optimizations at its disposal and can apply them automatically. I will discuss some of them later in the article. Unfortunately you can't rely on the engine to optimize your regular expressions all the time. In the above example, the regular expression is actually matched pretty fast, but in many cases the expression is too complex and the input string too large for the engine to optimize.
Because of backtracking, regular expressions encountered in real-world application scenarios can sometimes take hours to completely match. Worse, it takes much longer for the engine to declare that a regular expression did not match an input string than it does to find a successful match. This is an important fact to remember. Whenever you want to test the speed of a regular expression, test it mostly on strings that it does not match. Among those, especially use strings that almost match, because those take the longest to complete.
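One rough way to see the gap between a match and a near miss is to time both. The sketch below is an assumption-laden illustration: the expression "(x+x+)+y" is a textbook backtracking-heavy pattern chosen for the demo, not one taken from this article, and the helper names are hypothetical. A single System.nanoTime() measurement is crude (no JIT warm-up), so treat the printed numbers as indicative only.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Illustrative class name; a crude timing sketch, not a rigorous benchmark.
class NearMissTiming {

    // Builds a string made of n copies of the character c.
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    // Runs find() once and returns the elapsed time in nanoseconds.
    static long timeFindNanos(Pattern p, String input) {
        long start = System.nanoTime();
        p.matcher(input).find();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Assumed demo pattern: nested quantifiers force heavy backtracking on failure.
        Pattern p = Pattern.compile("(x+x+)+y");
        String match = repeat('x', 18) + "y"; // matches
        String nearMiss = repeat('x', 18);    // almost matches: the final "y" is missing
        System.out.println("match:     " + timeFindNanos(p, match) + " ns");
        System.out.println("near miss: " + timeFindNanos(p, nearMiss) + " ns");
    }
}
```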
Now let's consider some of the ways you can optimize your regular expressions for backtracking.
Later in the article I'll get into the more involved ways you can optimize regular expressions in Java. To start, though, here are a few simple optimizations that could save you time:
- Compile any regular expression that you use more than once with Pattern.compile() instead of the more direct Pattern.matches(). Not compiling the regular expression can be costly if Pattern.matches() is used over and over again with the same expression, for example in a loop, because the matches() method recompiles the expression every time it is called. Also remember that you can reuse the Matcher object for different input strings by calling its reset() method.
- Alternations such as "(X|Y|Z)" have a reputation for being slow, so watch out for them. First of all, the order of alternation counts, so place the more common options first so that they can be matched faster. Also, try to extract common patterns; for example, instead of "(abcd|abef)" use "ab(cd|ef)". The latter is faster because the NFA will try to match "ab" and won't try any of the alternatives if it doesn't find it. (In this case there are only two alternatives. If there were many alternatives the gains in speed would be more impressive.) Alternation really can slow down your programs. The expression ".*(abcd|efgh|ijkl).*" was three times slower in my test than using three calls to String.indexOf(), one for each alternative in the regular expression.
- When you don't need the text captured by a group, use the non-capturing form "(?:X)" instead of "(X)".

As I mentioned before, the java.util.regex engine can optimize a regular expression several ways when it is compiled. For example, if the regular expression contains a string that must be present in the input string (or else the whole expression won't match), the engine can sometimes search that string first and report a failure if it doesn't find a match, without checking the entire regular expression.
Another very useful way to automatically optimize a regular expression is to have the engine check the length of the input string against the expected length according to the regular expression. For example, the expression "\d{100}" is internally optimized such that if the input string is not 100 characters in length, the engine will report a failure without evaluating the entire regular expression.
Using benchmarks |
---|
After you have identified a possible improvement of a regular expression, even if you are certain that it will improve the speed, make a benchmark and compare the results against the previous expression. If the engine was able to internally optimize the previous expression better than the new one, it could lead to unexpected performance penalties. For instance, the Java regex engine was not able to optimize the expression " |
Whenever you write complex regular expressions, try to find a way to write them such that the regex engine will be able to recognize and optimize for these particular situations. For instance, don't hide mandatory strings inside groupings or alternations because the engine won't be able to recognize them. When possible, it is also helpful to specify the lengths of the input strings that you want to match, as shown in the example above.
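As a sketch of the simple optimizations listed earlier, the following example compiles the expression once, reuses a single Matcher through reset(), factors the common prefix out of the alternation, and uses a non-capturing group. The class and method names are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class name; shows the "simple optimizations" in one place.
class SimpleOptimizations {

    // Compiled once and reused. Calling Pattern.matches(regex, input) in a loop
    // would recompile the expression on every call.
    // "ab(?:cd|ef)" factors the common prefix out of "(abcd|abef)" and
    // uses a non-capturing group because the captured text is not needed.
    private static final Pattern AB = Pattern.compile("ab(?:cd|ef)");

    // Counts the inputs that match entirely, reusing one Matcher via reset().
    static int countMatches(String... inputs) {
        Matcher m = AB.matcher("");
        int count = 0;
        for (String input : inputs) {
            if (m.reset(input).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches("abcd", "abef", "abxx", "cdef")); // 2
    }
}
```

Note that a Matcher is not safe for use by several threads at once, so a shared Pattern with one Matcher per thread is the usual arrangement.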
You have some basic ideas of how to optimize your regular expressions, as well as some of the ways you can let the regex engine do the work for you. Now let's talk about optimizing greedy and reluctant quantifiers. A greedy quantifier such as "*" or "+" will first try to match as many characters as possible from an input string, even if this means that the input string will not have sufficient characters left in it to match the rest of the regular expression. If this happens, the greedy quantifier will backtrack, returning characters until an overall match is found or until there are no more characters. A reluctant (or lazy) quantifier, on the other hand, will first try to match as few characters in the input string as possible.
So for example, say you want to optimize a sub-expression like ".*a". If the character a is located near the end of the input string it is better to use the greedy quantifier "*". If the character is located near the beginning of the input string it would be better to use the reluctant quantifier "*?" and change the sub-expression to ".*?a". Generally, I've noticed that the lazy quantifier is a little faster than its greedy counterpart.
Another tip is to be specific when writing a regular expression. Use general sub-constructs like ".*" sparingly because they can backtrack a lot, especially when the rest of the expression can't match the input string. For example, if you want to retrieve everything between two occurrences of the character a in an input string, instead of using "a(.*)a", it's much better to use "a([^a]*)a".
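A short sketch of the two expressions side by side follows; the helper name and the input string are illustrative. Besides backtracking less, the negated character class also captures a different span when the input contains more than two a characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class name; compares "a(.*)a" with "a([^a]*)a".
class SpecificCharacterClass {

    // Returns capture group 1 of the first match, or null if there is no match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "a123a456a";
        // ".*" runs to the end of the input and backtracks to the last "a".
        System.out.println(firstGroup("a(.*)a", input));    // 123a456
        // "[^a]*" stops by itself at the next "a", so there is little to give back.
        System.out.println(firstGroup("a([^a]*)a", input)); // 123
    }
}
```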
Possessive quantifiers and independent grouping are the most useful operators for optimizing regular expressions. Use them whenever you can to dramatically improve the execution time of your expressions. Possessive quantifiers are denoted by an extra "+" sign, as in the expressions "X?+", "X*+", and "X++". The notation for an independent grouping is "(?>X)".
I have successfully used both possessive quantifiers and independent grouping to reduce the execution time of regular expressions from a few minutes to a few seconds. Both operators disable the backtracking behavior of the pattern-matching engine for the group to which they are applied. They will try to match their expression as any greedy quantifier would, but once they have matched it, they will not give back what they have matched, even if this causes the overall regular expression to fail.
The difference between them is subtle. You can see it best by comparing the possessive quantifier "(X)*+" and the independent grouping "(?>X)*". In the former case, the possessive quantifier will disable backtracking for both the X sub-expression and the "*" quantifier. In the latter case, only backtracking for the X sub-expression will be disabled, while the "*" operator, being outside the group, is not affected by the independent grouping and is free to backtrack.
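The difference shows up in a small experiment. In this sketch X is simply "a" and each expression is followed by one more "a" that can only match if the quantifier gives a character back; the class and method names are illustrative.

```java
import java.util.regex.Pattern;

// Illustrative class name; contrasts "(X)*+" with "(?>X)*" for X = "a".
class PossessiveVsIndependent {

    // Returns true if the regex matches the entire input.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Plain greedy: "*" gives back one "a" so the trailing "a" can match.
        System.out.println(matches("(a)*a", "aaa"));    // true
        // Possessive: "*+" keeps all three characters, so the trailing "a" fails.
        System.out.println(matches("(a)*+a", "aaa"));   // false
        // Independent group: only the inside of the group is atomic;
        // the "*" outside it can still backtrack.
        System.out.println(matches("(?>a)*a", "aaa"));  // true
    }
}
```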
Now let's consider an optimization example. Say you're trying to match the sub-expression "[^a]*a" on a long input string containing only the character b repeated many times. This expression will fail because the input string does not contain any instances of the character a. Because the pattern engine doesn't know this, it will try to match the expression "[^a]*". Because "*" is a greedy quantifier, it will grab all the characters until the end of the input string, and then it will backtrack, giving back one character at a time in the search for a match.
The expression will fail only when it can't backtrack anymore, which can take some time. Worse, because the "[^a]*" grabbed all characters that weren't a, even backtracking is useless.
The solution is to change the expression "[^a]*a" to "[^a]*+a" using the possessive quantifier "*+". This new expression fails faster because once it has tried to match all the characters that are not a it doesn't backtrack; instead it fails right there.
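A minimal sketch of this change, assuming an input of 10,000 b characters and illustrative helper names. Both expressions give the same answers here; the possessive form simply skips the pointless backtracking. How much time that saves depends on the input and the JVM, so measure it rather than assume.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Illustrative class name; compares "[^a]*a" with the possessive "[^a]*+a".
class PossessiveFailFast {

    // Builds a string made of n copies of the character c.
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    // Returns true if the regex matches the entire input.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String onlyBs = repeat('b', 10_000);
        // Greedy: consumes every "b", then gives them back one by one before failing.
        System.out.println(matches("[^a]*a", onlyBs));       // false
        // Possessive: consumes every "b" and fails as soon as no "a" follows.
        System.out.println(matches("[^a]*+a", onlyBs));      // false
        // The possessive form still matches when an "a" does follow.
        System.out.println(matches("[^a]*+a", "bbba"));      // true
    }
}
```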
If you want to write a regular expression that matches any character except some, you could easily write something like "[^abc]*" which means: Match any characters except a or b or c. But what if you wanted it to match strings like "cab" or "cba", but not "abc"?
For this you could use the lookaround constructs. The java.util.regex package has four of them:
- Positive lookahead: "(?=X)"
- Negative lookahead: "(?!X)"
- Positive lookbehind: "(?<=X)"
- Negative lookbehind: "(?<!X)"

The word positive in this case means that you want the expression to match, while the word negative means that you don't want the expression to match. Lookahead means that you want to search to the right of your current position in the input string. Lookbehind means that you want to search to the left. Remember that the lookaround constructs only peek forward or backward; they don't actually change the current position in the input string. That said, you could use something like "((?!abc).)*" using the negative lookahead operator "?!" to match any sequence of characters but not "abc" in the given order.
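The following sketch contrasts the character class with the negative lookahead on the strings mentioned above. The class and method names are illustrative assumptions.

```java
import java.util.regex.Pattern;

// Illustrative class name; contrasts "[^abc]*" with "((?!abc).)*".
class NegativeLookaheadDemo {

    // Rejects any input containing the character a, b, or c.
    private static final Pattern NONE_OF_ABC = Pattern.compile("[^abc]*");

    // Rejects only inputs containing the exact sequence "abc":
    // before consuming each character, the lookahead checks that "abc" does not start here.
    private static final Pattern NO_ABC_SEQUENCE = Pattern.compile("((?!abc).)*");

    static boolean hasNoneOfAbc(String input) {
        return NONE_OF_ABC.matcher(input).matches();
    }

    static boolean hasNoAbcSequence(String input) {
        return NO_ABC_SEQUENCE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasNoneOfAbc("cab"));         // false
        System.out.println(hasNoAbcSequence("cab"));     // true
        System.out.println(hasNoAbcSequence("cba"));     // true
        System.out.println(hasNoAbcSequence("abc"));     // false
        System.out.println(hasNoAbcSequence("xxabcxx")); // false
    }
}
```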
Lookaround constructs help you to be more specific when writing regular expressions, which can have a big effect on matching performance. Listing 1 shows a very common example: using a regular expression to match HTML fields.
Regular expression: "<img.*src=(\S*)/>"
Input string 1: "<img border=1 src=image.jpg/>"
Input string 2: "<img src=src=src=src= .... many src= ... src=src="
With the regular expression in Listing 1, the goal is to match the contents of the "src" attribute from an HTML image tag. I deliberately simplified the expression, assuming that no other attributes follow "src", so that I can focus on its performance aspects.
Why not be lazy? |
---|
You might be thinking that I could have used the reluctant quantifier " |
The expression is fast enough when matching input string 1, but it takes a very long time to declare failure when it tries to match input string 2, and that time grows steeply with the length of the input string. It fails because there is no "/>" at the end of the input string. To optimize this expression, look at the first ".*" construct. It is supposed to match any attributes that come before "src", but it is too generic and matches too much. In fact, the construct should match any attributes except "src".
The rewritten expression "<img((?!src=).)*src=(\S*)/>" will handle a large, non-matching string almost a hundred times faster than the previous one!
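Here is a hedged sketch that runs both expressions. It assumes an input of the form "<img border=1 src=image.jpg/>" with no whitespace before "/>", because "\S*" cannot match across a space, and it builds a non-matching input from 200 repetitions of "src=". The class and method names are illustrative. Note that the extra group in the rewritten expression moves the "src" value from capture group 1 to capture group 2.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class name; compares the original and rewritten <img> expressions.
class ImgSrcDemo {

    static final Pattern ORIGINAL = Pattern.compile("<img.*src=(\\S*)/>");
    static final Pattern REWRITTEN = Pattern.compile("<img((?!src=).)*src=(\\S*)/>");

    // Returns the requested capture group of the first match, or null if there is none.
    static String src(Pattern p, int group, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group(group) : null;
    }

    // Builds "<img src=src=...src=" with the given number of "src=" repetitions and no "/>".
    static String unclosedTag(int repetitions) {
        StringBuilder sb = new StringBuilder("<img ");
        for (int i = 0; i < repetitions; i++) {
            sb.append("src=");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String tag = "<img border=1 src=image.jpg/>"; // assumed input, no space before "/>"
        System.out.println(src(ORIGINAL, 1, tag));   // image.jpg
        System.out.println(src(REWRITTEN, 2, tag));  // image.jpg

        String noClose = unclosedTag(200);
        System.out.println(src(ORIGINAL, 1, noClose));  // null, after much backtracking
        System.out.println(src(REWRITTEN, 2, noClose)); // null, the lookahead stops at the first "src="
    }
}
```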
Sometimes the regex Pattern class will throw a StackOverflowError. This is a manifestation of the known bug #5050507, which has been in the java.util.regex package since Java 1.4. The bug is here to stay because it has "won't fix" status. This error occurs because the Pattern class compiles a regular expression into a small program which is then executed to find a match. This program is used recursively, and when too many recursive calls are made the error occurs. See the description of the bug for more details. It seems to be triggered mostly by the use of alternations.
If you encounter this error, try to rewrite the regular expression or split it into several sub-expressions and run them separately. The latter technique can sometimes improve performance as well.
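As one example of such a rewrite, an alternation of single characters under a quantifier can often be replaced by a character class. The sketch below is an assumption, not a reproduction of the bug report: whether "(a|b)*" actually overflows depends on the JVM version, the thread stack size, and the input length, so the code catches the error and reports whichever outcome occurs. The class and method names are illustrative.

```java
import java.util.regex.Pattern;

// Illustrative class name; replaces a repeated alternation with a character class.
class DeepRecursionDemo {

    // Builds a string made of n copies of s.
    static String repeat(String s, int n) {
        StringBuilder sb = new StringBuilder(s.length() * n);
        for (int i = 0; i < n; i++) {
            sb.append(s);
        }
        return sb.toString();
    }

    // Returns "matched", "no match", or "StackOverflowError" for a full-input match attempt.
    static String tryMatch(String regex, String input) {
        try {
            return Pattern.compile(regex).matcher(input).matches() ? "matched" : "no match";
        } catch (StackOverflowError e) {
            return "StackOverflowError";
        }
    }

    public static void main(String[] args) {
        String input = repeat("ab", 50_000);
        // Repeated alternation: may recurse once per repetition and overflow the stack.
        System.out.println("(a|b)*  -> " + tryMatch("(a|b)*", input));
        // Equivalent character class: handled without deep recursion.
        System.out.println("[ab]*   -> " + tryMatch("[ab]*", input));
    }
}
```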
Regular expressions shouldn't take hours to match, especially for applications that only have seconds to spare. In this article I've introduced some of the weak points of the java.util.regex package and shown you how to work around them. Simple bottlenecks like backtracking just require a little finesse whereas culprits like greedy and reluctant quantifiers require more careful consideration. In some cases you can replace them completely, in others you simply have to "lookaround" them. Either way, you've learned some good tricks for coaxing speed out of your regular expressions.
Let me know what you think about the workarounds I've proposed, and be sure to share your optimizing tips with other JavaWorld readers in the <a href="http://www.javaworld.com/javaforums/newpost.php?Cat=0&Board=112069">discussion thread about optimizing regular expressions in Java</a>.
Cristian Mocanu is a Java team leader at 1&1 Internet AG, Romania. He is a Sun Certified Programmer, Business Component Developer, and Architect with more than five years' experience working with enterprise Java.

Resources
- ... java.util.regex package and demonstrates a practical application of regular expressions.
- Bug #5050507 ("Pattern.matches throws StackOverFlow") is a known bug in the java.util.regex package since Java 1.4.
- ... java.util.regex to learn more about the Pattern and Matcher classes and for a summary of regular-expression constructs.