Java 1.4 supports regular expressions with Sun's java.util.regex package. Although competing packages are available for previous versions of Java, Sun's package is poised to become the standard. It uses a Traditional NFA match engine. For an explanation of the rules behind a Traditional NFA engine, see Section 1.2.
java.util.regex supports the metacharacters and metasequences listed in Table 1-10 through Table 1-14. For expanded definitions of each metacharacter, see Section 1.2.1.
Table 1-10
Sequence    Meaning
\a          Alert (bell).
\b          Backspace, x08, supported only in character class.
\e          ESC character, x1B.
\n          Newline, x0A.
\r          Carriage return, x0D.
\f          Form feed, x0C.
\t          Horizontal tab, x09.
\0octal     Character specified by a one-, two-, or three-digit octal code.
\xhex       Character specified by a two-digit hexadecimal code.
\uhex       Unicode character specified by a four-digit hexadecimal code.
\cchar      Named control character.
Table 1-11
Class       Meaning
[...]       A single character listed or contained in a listed range.
[^...]      A single character not listed and not contained within a listed range.
.           Any character, except a line terminator (unless DOTALL mode).
\w          Word character, [a-zA-Z0-9_].
\W          Non-word character, [^a-zA-Z0-9_].
\d          Digit, [0-9].
\D          Non-digit, [^0-9].
\s          Whitespace character, [ \t\n\f\r\x0B].
\S          Non-whitespace character, [^ \t\n\f\r\x0B].
\p{prop}    Character contained by given POSIX character class, Unicode property, or Unicode block.
\P{prop}    Character not contained by given POSIX character class, Unicode property, or Unicode block.
Table 1-12
Sequence    Meaning
^           Start of string, or after any newline if in MULTILINE mode.
\A          Beginning of string, in any match mode.
$           End of string, or before any newline if in MULTILINE mode.
\Z          End of string but before any final line terminator, in any match mode.
\z          End of string, in any match mode.
\b          Word boundary.
\B          Not-word-boundary.
\G          Beginning of current search.
(?=...)     Positive lookahead.
(?!...)     Negative lookahead.
(?<=...)    Positive lookbehind.
(?<!...)    Negative lookbehind.
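As a rough illustration of a few of the zero-width tests listed above, the following sketch compares \Z with \z and tries each kind of lookaround. The class AnchorDemo, its helper methods, and the sample strings are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not part of java.util.regex.
class AnchorDemo {
    // True if regex matches anywhere in input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // First match of regex in input, or null if there is none.
    static String first(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String line = "the end\n";
        System.out.println(found("end\\Z", line)); // true: \Z tolerates a final line terminator
        System.out.println(found("end\\z", line)); // false: \z demands the very end of input

        String prices = "cost: $42 or 17";
        System.out.println(first("(?<=\\$)\\d+", prices));    // prints 42 (digits preceded by $)
        System.out.println(first("(?<!\\$)\\b\\d+", prices)); // prints 17 (digits not preceded by $)
        System.out.println(first("\\w+(?=:)", prices));       // prints cost (word followed by a colon)
    }
}
```

The lookarounds consume no text, so the dollar sign and the colon are tested but never become part of the match.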
Table 1-13
Modifier/sequence           Mode character   Meaning
Pattern.UNIX_LINES          d                Treat \n as the only line terminator.
Pattern.DOTALL              s                Dot (.) matches any character, including a line terminator.
Pattern.MULTILINE           m                ^ and $ match next to embedded line terminators.
Pattern.COMMENTS            x                Ignore whitespace and allow embedded comments starting with #.
Pattern.CASE_INSENSITIVE    i                Case-insensitive match for ASCII characters.
Pattern.UNICODE_CASE        u                Case-insensitive match for Unicode characters.
Pattern.CANON_EQ                             Unicode "canonical equivalence" mode, where characters or sequences of a base character and combining characters with identical visual representations are treated as equals.
(?mode)                                      Turn listed modes (idmsux) on for the rest of the subexpression.
(?-mode)                                     Turn listed modes (idmsux) off for the rest of the subexpression.
(?mode:...)                                  Turn listed modes (idmsux) on within parentheses.
(?-mode:...)                                 Turn listed modes (idmsux) off within parentheses.
#...                                         Treat rest of line as a comment in x (COMMENTS) mode.
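The modes above can be requested either through flag constants passed to Pattern.compile or through mode characters embedded in the pattern. The sketch below tries both forms; ModeDemo and its method names are hypothetical, chosen only for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class ModeDemo {
    // Modes supplied as OR'd flag constants.
    static boolean viaFlags(String input) {
        Pattern p = Pattern.compile("^spider-man$",
                Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
        return p.matcher(input).find();
    }

    // The same modes supplied inline with (?im).
    static boolean viaInline(String input) {
        return Pattern.compile("(?im)^spider-man$").matcher(input).find();
    }

    // Mode limited to a group: only "spider" ignores case, "-Man" must match exactly.
    static boolean scoped(String input) {
        return Pattern.compile("(?i:spider)-Man").matcher(input).find();
    }

    public static void main(String[] args) {
        String page = "Daily Bugle\nSPIDER-MAN\nMenace";
        System.out.println(viaFlags(page));       // true
        System.out.println(viaInline(page));      // true
        System.out.println(scoped("SPIDER-Man")); // true
        System.out.println(scoped("SPIDER-MAN")); // false
    }
}
```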
Table 1-14
Sequence    Meaning
(...)       Group subpattern and capture submatch into \1, \2, ... and $1, $2, ....
\n          Contains text matched by the nth capture group.
$n          In a replacement string, contains text matched by the nth capture group.
(?:...)     Groups subpattern, but does not capture submatch.
(?>...)     Disallow backtracking for text matched by subpattern.
...|...     Try subpatterns in alternation.
*           Match 0 or more times.
+           Match 1 or more times.
?           Match 1 or 0 times.
{n}         Match exactly n times.
{n,}        Match at least n times.
{x,y}       Match at least x times, but no more than y times.
*?          Match 0 or more times, but as few times as possible.
+?          Match 1 or more times, but as few times as possible.
??          Match 0 or 1 times, but as few times as possible.
{n,}?       Match at least n times, but as few times as possible.
{x,y}?      Match at least x times, no more than y times, and as few times as possible.
*+          Match 0 or more times, and never backtrack.
++          Match 1 or more times, and never backtrack.
?+          Match 0 or 1 times, and never backtrack.
{n}+        Match exactly n times, and never backtrack.
{n,}+       Match at least n times, and never backtrack.
{x,y}+      Match at least x times, no more than y times, and never backtrack.
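To make the difference between the three quantifier families more concrete, here is a small sketch that applies a greedy, a lazy, and a possessive quantifier to the same input, followed by a backreference. QuantifierDemo and firstMatch are names made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class QuantifierDemo {
    // First match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<b>bold</b>";
        // Greedy: .+ takes as much as it can, then backtracks to the last >.
        System.out.println(firstMatch("<.+>", html));  // prints <b>bold</b>
        // Lazy: .+? takes as little as it can, stopping at the first >.
        System.out.println(firstMatch("<.+?>", html)); // prints <b>
        // Possessive: .++ swallows the final > and never gives it back,
        // so the literal > in the pattern can never match.
        System.out.println(firstMatch("<.++>", html)); // prints null
        // Backreference: \1 must repeat whatever group 1 captured.
        System.out.println(firstMatch("(\\w)\\1", "hello")); // prints ll
    }
}
```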
Java 1.4 introduces two main classes, java.util.regex.Pattern and java.util.regex.Matcher; an exception, java.util.regex.PatternSyntaxException; and a new interface, CharSequence. Additionally, Sun upgraded the String class to implement the CharSequence interface and to provide basic pattern-matching methods. Pattern objects are compiled regular expressions that can be applied to many strings. A Matcher object is a match of one Pattern applied to one string (or any object implementing CharSequence).
Backslashes in regular expression String literals need to be escaped. So \n (newline) becomes \\n when used in a Java String literal that is to be used as a regular expression.
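The escaping rule is easiest to see side by side with the regex it produces. In this sketch, which also reuses each compiled Pattern across several inputs, the class EscapeDemo and the method countMatches are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class EscapeDemo {
    // The regex \d+\.\d+ needs every backslash doubled in a Java string literal.
    static final Pattern DECIMAL = Pattern.compile("\\d+\\.\\d+");

    // The regex \\ (one literal backslash) needs four backslashes in Java source.
    static final Pattern BACKSLASH = Pattern.compile("\\\\");

    // Counts the matches of an already compiled pattern in any CharSequence.
    static int countMatches(Pattern p, CharSequence input) {
        Matcher m = p.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches(DECIMAL, "3.14 and 2.718 but not 42")); // 2
        System.out.println(countMatches(DECIMAL, new StringBuffer("1.5")));     // 1
        System.out.println(countMatches(BACKSLASH, "C:\\temp\\file"));          // 2
    }
}
```

Compiling the patterns once as constants and creating a fresh Matcher per input follows the Pattern/Matcher split described above.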
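java.lang.String
New methods for pattern matching.

boolean matches(String regex)
    Return true if regex matches the entire String.
String[] split(String regex)
    Return an array of the substrings surrounding matches of regex.
String[] split(String regex, int limit)
    Return an array of the substrings surrounding the first limit-1 matches of regex.
String replaceFirst(String regex, String replacement)
    Replace the first substring matched by regex with replacement.
String replaceAll(String regex, String replacement)
    Replace all substrings matched by regex with replacement.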
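A short sketch of these String methods in use follows. StringMethodsDemo, its helper methods, and the sample data are assumptions made for illustration.

```java
import java.util.Arrays;

// Hypothetical demo class; names are illustrative only.
class StringMethodsDemo {
    // Splits on commas with optional surrounding whitespace.
    static String[] fields(String csv) {
        return csv.split("\\s*,\\s*");
    }

    // With limit 2, only the first match is used as a separator.
    static String[] headAndRest(String csv) {
        return csv.split("\\s*,\\s*", 2);
    }

    // matches() succeeds only if the regex covers the entire string.
    static boolean isIsoDate(String s) {
        return s.matches("\\d{4}-\\d{2}-\\d{2}");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.asList(fields("a, b,c ,d")));      // [a, b, c, d]
        System.out.println(Arrays.asList(headAndRest("a, b,c ,d"))); // [a, b,c ,d]
        System.out.println(isIsoDate("2003-08-01"));                 // true
        System.out.println(isIsoDate("on 2003-08-01"));              // false
        System.out.println("a-b-c".replaceFirst("-", "+"));          // a+b-c
        System.out.println("a-b-c".replaceAll("-", "+"));            // a+b+c
    }
}
```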
java.util.regex.Pattern
extends Object and implements Serializable
Models a regular expression pattern.

static Pattern compile(String regex)
    Construct a Pattern object from regex.
static Pattern compile(String regex, int flags)
    Construct a new Pattern object out of regex and the OR'd mode-modifier constants flags.
int flags()
    Return the Pattern's mode modifiers.
Matcher matcher(CharSequence input)
    Construct a Matcher object that will match this Pattern against input.
static boolean matches(String regex, CharSequence input)
    Return true if regex matches the entire string input.
String pattern()
    Return the regular expression used to create this Pattern.
String[] split(CharSequence input)
    Return an array of the substrings surrounding matches of this Pattern in input.
String[] split(CharSequence input, int limit)
    Return an array of the substrings surrounding the first limit-1 matches of this Pattern in input.
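The following sketch exercises the Pattern methods on a single compiled pattern. PatternDemo and the separator regex are invented for this example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class PatternDemo {
    // A separator pattern compiled once with an explicit flag.
    static final Pattern SEP = Pattern.compile("[;,]\\s*", Pattern.CASE_INSENSITIVE);

    // True if the compiled pattern carries the CASE_INSENSITIVE flag.
    static boolean ignoresCase(Pattern p) {
        return (p.flags() & Pattern.CASE_INSENSITIVE) != 0;
    }

    public static void main(String[] args) {
        System.out.println(SEP.pattern());                         // [;,]\s*
        System.out.println(ignoresCase(SEP));                      // true
        System.out.println(Arrays.asList(SEP.split("x; y,z")));    // [x, y, z]
        System.out.println(Arrays.asList(SEP.split("x; y,z", 2))); // [x, y,z]
        // The static convenience method compiles and matches in one step.
        System.out.println(Pattern.matches("\\d+", "12345"));      // true
    }
}
```

The static Pattern.matches recompiles the regex on every call, so a stored Pattern is usually the better choice when the same expression is applied repeatedly.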
java.util.regex.Matcher
extends Object
Models a regular expression pattern matcher and pattern matching results.

Matcher appendReplacement(StringBuffer sb, String replacement)
    Append substring preceding match and replacement to sb.
StringBuffer appendTail(StringBuffer sb)
    Append substring following end of match to sb.
int end()
    Index of the first character after the end of the match.
int end(int group)
    Index of the first character after the text captured by group.
boolean find()
    Find the next match in the input string.
boolean find(int start)
    Find the next match after character position start.
String group()
    Text matched by this Pattern.
String group(int group)
    Text captured by capture group group.
int groupCount()
    Number of capturing groups in Pattern.
boolean lookingAt()
    Return true if Pattern matches at the beginning of the input.
boolean matches()
    Return true if Pattern matches entire input string.
Pattern pattern()
    Return Pattern object used by this Matcher.
String replaceAll(String replacement)
    Replace every match with replacement.
String replaceFirst(String replacement)
    Replace first match with replacement.
Matcher reset()
    Reset this matcher so that the next match starts at the beginning of the input string.
Matcher reset(CharSequence input)
    Reset this matcher with new input.
int start()
    Index of first character matched.
int start(int group)
    Index of first character matched in captured substring group.
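A typical Matcher loop combines find() with the position and replacement methods. The sketch below is one possible arrangement; MatcherDemo, doubleNumbers, and spans are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class MatcherDemo {
    static final Pattern NUMBER = Pattern.compile("\\d+");

    // Rewrites each number as twice its value using appendReplacement/appendTail.
    static String doubleNumbers(String text) {
        Matcher m = NUMBER.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int doubled = Integer.parseInt(m.group()) * 2;
            m.appendReplacement(sb, String.valueOf(doubled));
        }
        m.appendTail(sb); // copy whatever follows the last match
        return sb.toString();
    }

    // Lists each match as text@start-end, separated by spaces.
    static String spans(String text) {
        Matcher m = NUMBER.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(m.group()).append('@').append(m.start()).append('-').append(m.end());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "3 apples, 10 pears";
        System.out.println(doubleNumbers(text)); // 6 apples, 20 pears
        System.out.println(spans(text));         // 3@0-1 10@10-12

        Matcher m = NUMBER.matcher(text);
        System.out.println(m.lookingAt()); // true: a number starts the input
        System.out.println(m.matches());   // false: the number is not the whole input
    }
}
```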
java.util.regex.PatternSyntaxException
implements Serializable
Thrown to indicate a syntax error in a regular expression pattern.

PatternSyntaxException(String desc, String regex, int index)
    Construct an instance of this class.
String getDescription()
    Return error description.
int getIndex()
    Return error index.
String getMessage()
    Return a multiline error message containing error description, index, regular expression pattern, and indication of the position of the error within the pattern.
String getPattern()
    Return the regular expression pattern that threw the exception.
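When a pattern comes from user input or a configuration file, catching this exception allows a readable report. A minimal sketch, assuming a hypothetical helper class SyntaxErrorDemo with a describe method:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class; names are illustrative only.
class SyntaxErrorDemo {
    // Returns "ok" for a valid regex, otherwise a one-line error summary.
    static String describe(String regex) {
        try {
            Pattern.compile(regex);
            return "ok";
        } catch (PatternSyntaxException e) {
            return e.getDescription() + " at index " + e.getIndex()
                    + " in " + e.getPattern();
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("a+"));        // ok
        System.out.println(describe("(unclosed")); // a summary naming the bad pattern
    }
}
```

The exact description text and index depend on the Java release, so code should not parse them.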
java.lang.CharSequence
implemented by CharBuffer, String, StringBuffer
Defines an interface for read-only access so that regular expression patterns may be applied to a sequence of characters.

char charAt(int index)
    Return the character at the zero-based position index.
int length()
    Return the number of characters in the sequence.
CharSequence subSequence(int start, int end)
    Return a subsequence including the start index and excluding the end index.
String toString()
    Return a String representation of the sequence.
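Because Pattern.matcher accepts any CharSequence, one method can serve all three implementing classes named above. In this sketch, CharSequenceDemo and hasDigits are illustrative names only.

```java
import java.nio.CharBuffer;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class CharSequenceDemo {
    static final Pattern DIGITS = Pattern.compile("\\d+");

    // Accepts a String, StringBuffer, CharBuffer, or any other CharSequence.
    static boolean hasDigits(CharSequence input) {
        return DIGITS.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(hasDigits("abc1"));                       // true
        System.out.println(hasDigits(new StringBuffer("x9")));       // true
        System.out.println(hasDigits(CharBuffer.wrap("no digits"))); // false

        // subSequence includes the start index and excludes the end index.
        CharSequence word = "regex";
        System.out.println(word.subSequence(0, 3)); // reg
    }
}
```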
This package supports Unicode 3.0, although \w, \W, \d, \D, \s, and \S support only ASCII. In their place, you can use the Unicode properties \p{L}, \P{L}, \p{Nd}, \P{Nd}, \p{Z}, and \P{Z}. The word boundary sequences, \b and \B, do understand Unicode.
For supported Unicode properties and blocks, see Table 1-2. This package supports only the short property names, such as \p{Lu}, and not \p{Uppercase_Letter}. Block names require the In prefix and support only the name form without spaces or underscores; for example, \p{InGreekExtended}, not \p{In_Greek_Extended} or \p{In Greek Extended}.
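The gap between the ASCII-only shorthands and the Unicode properties can be checked directly. The sketch below assumes default compile flags; UnicodeDemo and its is method are hypothetical, and the sample characters are written as Unicode escapes to keep the source ASCII.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class UnicodeDemo {
    // True if regex matches the whole of s.
    static boolean is(String regex, String s) {
        return Pattern.matches(regex, s);
    }

    public static void main(String[] args) {
        String arabicThree = "\u0663"; // ARABIC-INDIC DIGIT THREE
        System.out.println(is("\\d", arabicThree));      // false: shorthand is ASCII only
        System.out.println(is("\\p{Nd}", arabicThree));  // true: Unicode decimal digit

        String eAcute = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE
        System.out.println(is("\\w", eAcute));     // false
        System.out.println(is("\\p{L}", eAcute));  // true: Unicode letter

        System.out.println(is("\\p{Lu}", "A")); // true: short name for uppercase letter
        System.out.println(is("\\p{Lu}", "a")); // false

        String alphaPsili = "\u1f00"; // GREEK SMALL LETTER ALPHA WITH PSILI
        System.out.println(is("\\p{InGreekExtended}", alphaPsili)); // true: block with In prefix
    }
}
```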
//Match Spider-Man, Spiderman, SPIDER-MAN, etc.
public class StringRegexTest {
  public static void main(String[] args) throws Exception {
    String dailybugle = "Spider-Man Menaces City!";

    //regex must match entire string
    String regex = "(?i).*spider[- ]?man.*";

    if (dailybugle.matches(regex)) {
      //do something
    }
  }
}
//Match dates formatted like MM/DD/YYYY, MM-DD-YY,...
import java.util.regex.*;

public class MatchTest {
  public static void main(String[] args) throws Exception {
    String date = "12/30/1969";
    Pattern p =
      Pattern.compile("(\\d\\d)[-/](\\d\\d)[-/](\\d\\d(?:\\d\\d)?)");

    Matcher m = p.matcher(date);

    if (m.find()) {
      String month = m.group(1);
      String day   = m.group(2);
      String year  = m.group(3);
    }
  }
}
//Convert <br> to <br /> for XHTML compliance
import java.util.regex.*;

public class SimpleSubstitutionTest {
  public static void main(String[] args) {
    String text = "Hello world. <br>";

    try {
      Pattern p = Pattern.compile("<br>", Pattern.CASE_INSENSITIVE);
      Matcher m = p.matcher(text);

      String result = m.replaceAll("<br />");
    }
    catch (PatternSyntaxException e) {
      System.out.println(e.getMessage());
    }
    catch (Exception e) {
      //System.exit requires a status code
      System.exit(1);
    }
  }
}
//urlify - turn URLs into HTML links
import java.util.regex.*;

public class Urlify {
  public static void main(String[] args) throws Exception {
    String text = "Check the website, http://www.oreilly.com/catalog/repr.";
    String regex =
        "\\b # start at word\n"
      + " # boundary\n"
      + "( # capture to $1\n"
      + "(https?|telnet|gopher|file|wais|ftp) : \n"
      + " # resource and colon\n"
      + "[\\w/\\#~:.?+=&%@!\\-] +? # one or more valid\n"
      + " # characters\n"
      + " # but take as little\n"
      + " # as possible\n"
      + ")\n"
      + "(?= # lookahead\n"
      + "[.:?\\-] * # for possible punc\n"
      + "(?: [^\\w/\\#~:.?+=&%@!\\-] # invalid character\n"
      + "| $ ) # or end of string\n"
      + ")";

    Pattern p = Pattern.compile(regex,
      Pattern.CASE_INSENSITIVE | Pattern.COMMENTS);

    Matcher m = p.matcher(text);

    //wrap each matched URL in an anchor element
    String result = m.replaceAll("<a href=\"$1\">$1</a>");
  }
}
Java NIO, by Ron Hitchens (O'Reilly), shows regular expressions in the context of Java's new I/O improvements.
Mastering Regular Expressions, Second Edition, by Jeffrey E. F. Friedl (O'Reilly), covers the details of Java regular expressions on pages 378-391.
Sun's online documentation at http://java.sun.com/j2se/1.4/docs/api/java/util/regex/package-summary.html