阅读更多
DiveIntoPython(六)
英文书地址:
http://diveintopython.org/toc/index.html
Chapter 7. Regular Expressions
7.1.Diving In
Strings have methods for searching (index, find, and count), replacing (replace), and parsing (split), but they are limited to the simplest of cases. The search methods look for a single, hard-coded substring, and they are always case-sensitive. To do case-insensitive searches of a string s, you must call s.lower() or s.upper() and make sure your search strings are the appropriate case to match. The replace and split methods have the same limitations.
7.2.Case Study:Street Addresses
example 7.1.Matching at the End of a String
>>> s = '100 NORTH MAIN ROAD'
>>> s.replace('ROAD','RD.')
'100 NORTH MAIN RD.'
>>> s = '100 NORTH BROAD ROAD'
>>> s.replace('ROAD','RD.')
'100 NORTH BRD. RD.'
>>> s[:-4] + s[-4:].replace('ROAD','RD.')
'100 NORTH MAIN RD.'
>>> import re
>>> re.sub('ROAD$','RD.',s)
'100 NORTH MAIN RD.'
My goal is to standardize a street address so that 'ROAD' is always abbreviated as 'RD.'. At first glance, I thought this was simple enough that I could just use the string method replace. After all, all the data was already uppercase, so case mismatches would not be a problem. And the search string, 'ROAD', was a constant. And in this deceptively simple example, s.replace does indeed work.
The problem here is that 'ROAD' appears twice in the address, once as part of the street name 'BROAD' and once as its own word. The replace method sees these two occurrences and blindly replaces both of them; meanwhile, I see my addresses getting destroyed.
Take a look at the first parameter: 'ROAD$'. This is a simple regular expression that matches 'ROAD' only when it occurs at the end of a string. The $ means “end of the string”. (There is a corresponding character, the caret ^, which means “beginning of the string”.)
Using the re.sub function, you search the string s for the regular expression 'ROAD$' and replace it with 'RD.'. This matches the ROAD at the end of the string s, but does not match the ROAD that's part of the word BROAD, because that's in the middle of s.
example 7.2.Matching Whole Words
>>> s = '100 BROAD'
>>> re.sub('ROAD$','RD.',s)
'100 BRD.'
>>> re.sub('\\bROAD$','RD.',s)
'100 BROAD'
>>> re.sub(r'\bROAD$','RD.',s)
'100 BROAD'
>>> s = '100 BROAD ROAD APT. 3'
>>> re.sub(r'\bROAD$','RD.',s)
'100 BROAD ROAD APT. 3'
>>> re.sub(r'\bROAD\b','RD.',s)
'100 BROAD RD. APT. 3'
What I really wanted was to match 'ROAD' when it was at the end of the string and it was its own whole word, not a part of some larger word. To express this in a regular expression, you use \b, which means “a word boundary must occur right here”.In Python, this is complicated by the fact that the '\' character in a string must itself be escaped. This is sometimes referred to as the backslash plague, and it is one reason why regular expressions are easier in Perl than in Python.
To work around the backslash plague, you can use what is called a raw string, by prefixing the string with the letter r. This tells Python that nothing in this string should be escaped; '\t' is a tab character, but r'\t' is really the backslash character \ followed by the letter t. I recommend always using raw strings when dealing with regular expressions; otherwise, things get too confusing too quickly
7.3.Case Study:Roman Numberals
In Roman numerals, there are seven characters that are repeated and combined in various ways to represent numbers.
I = 1
V = 5
X = 10
L = 50
C = 100
D = 500
M = 1000
Characters are additive. I is 1, II is 2, and III is 3. VI is 6 (literally, “5 and 1”), VII is 7, and VIII is 8.
The tens characters (I, X, C, and M) can be repeated up to three times. At 4, you need to subtract from the next highest fives character. You can't represent 4 as IIII; instead, it is represented as IV (“1 less than 5”). The number 40 is written as XL (10 less than 50), 41 as XLI, 42 as XLII, 43 as XLIII, and then 44 as XLIV (10 less than 50, then 1 less than 5).
The fives characters can not be repeated. The number 10 is always represented as X, never as VV. The number 100 is always C, never LL.
Roman numerals are always written highest to lowest, and read left to right, so the order the of characters matters very much. DC is 600; CD is a completely different number (400, 100 less than 500). CI is 101; IC is not even a valid Roman numeral (because you can't subtract 1 directly from 100; you would need to write it as XCIX, for 10 less than 100, then 1 less than 10).
7.3.1.Checking for Thousands
example7.3.Checking for Thousands
>>> import re
>>> pattern = '^M?M?M?$'
>>> re.search(pattern,'M')
<_sre.SRE_Match object at 0x01373E90>
>>> re.search(pattern,'MM')
<_sre.SRE_Match object at 0x01373E58>
>>> re.search(pattern,'MMM')
<_sre.SRE_Match object at 0x01373E20>
>>> re.search(pattern,'MMMM')
>>> re.search(pattern,'')
<_sre.SRE_Match object at 0x01373E58>
^ to match what follows only at the beginning of the string. If this were not specified, the pattern would match no matter where the M characters were, which is not what you want. You want to make sure that the M characters, if they're there, are at the beginning of the string.
M? to optionally match a single M character. Since this is repeated three times, you're matching anywhere from zero to three M characters in a row.
$ to match what precedes only at the end of the string. When combined with the ^ character at the beginning, this means that the pattern must match the entire string, with no other characters before or after the M characters.
The essence of the re module is the search function, that takes a regular expression (pattern) and a string ('M') to try to match against the regular expression. If a match is found, search returns an object which has various methods to describe the match; if no match is found, search returns None, the Python null value.
7.3.2.Checking for Hundreds
100 = C
200 = CC
300 = CCC
400 = CD
500 = D
600 = DC
700 = DCC
800 = DCCC
900 = CM
example 7.4.Checking for Hundreds
>>> import re
>>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)$'
>>> re.search(pattern,'MCM')
<_sre.SRE_Match object at 0x0132F4A0>
>>> re.search(pattern,'MD')
<_sre.SRE_Match object at 0x013752E0>
>>> re.search(pattern,'MMMCCC')
<_sre.SRE_Match object at 0x0132F4A0>
>>> re.search(pattern,'MCMC')
Then it has the new part, in parentheses, which defines a set of three mutually exclusive patterns, separated by vertical bars: CM, CD, and D?C?C?C? (which is an optional D followed by zero to three optional C characters). The regular expression parser checks for each of these patterns in order (from left to right), takes the first one that matches, and ignores the rest.
7.4.Using the {n,m} Syntax
example 7.6.The New Way:From n O m
>>> pattern = '^M{0,3}$'
>>> re.search(pattern,'M')
<_sre.SRE_Match object at 0x01373F70>
>>> re.search(pattern,'MMM')
<_sre.SRE_Match object at 0x0137A058>
This pattern says: “Match the start of the string, then anywhere from zero to three M characters, then the end of the string.” The 0 and 3 can be any numbers; if you want to match at least one but no more than three M characters, you could say M{1,3}.
7.4.1.Checking for Tens and Ones
example 7.7.Checking for Tens
>>> pattern = '^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)$'
>>> re.search(pattern, 'MCMXL')
example 7.8.Validating Roman Numberals with {n,m}
>>> pattern = '^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'
>>> re.search(pattern, 'MDLV')
7.5.Verbose Regular Expressions
So far you've just been dealing with what I'll call “compact” regular expressions. As you've seen, they are difficult to read, and even if you figure out what one does, that's no guarantee that you'll be able to understand it six months later. What you really need is inline documentation.
A verbose regular expression is different from a compact regular expression in two ways:
Whitespace is ignored. Spaces, tabs, and carriage returns are not matched as spaces, tabs, and carriage returns. They're not matched at all. (If you want to match a space in a verbose regular expression, you'll need to escape it by putting a backslash in front of it.)
Comments are ignored. A comment in a verbose regular expression is just like a comment in Python code: it starts with a # character and goes until the end of the line. In this case it's a comment within a multi-line string instead of within your source code, but it works the same way.
example 7.9.Regular Expressions with Inline Comments
>>> pattern = """
^ # beginning of string
M{0,4} # thousands - 0 to 4 M's
(CM|CD|D?C{0,3}) # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C's),
# or 500-800 (D, followed by 0 to 3 C's)
(XC|XL|L?X{0,3}) # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 X's),
# or 50-80 (L, followed by 0 to 3 X's)
(IX|IV|V?I{0,3}) # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 I's),
# or 5-8 (V, followed by 0 to 3 I's)
$ # end of string
"""
>>> re.search(pattern, 'M', re.VERBOSE)
<_sre.SRE_Match object at 0x01376110>
>>> re.search(pattern, 'MCMLXXXIX', re.VERBOSE)
<_sre.SRE_Match object at 0x01321F20>
>>> re.search(pattern, 'MMMMDCCCLXXXVIII', re.VERBOSE)
<_sre.SRE_Match object at 0x01376070>
>>> re.search(pattern, 'M')
The most important thing to remember when using verbose regular expressions is that you need to pass an extra argument when working with them: re.VERBOSE is a constant defined in the re module that signals that the pattern should be treated as a verbose regular expression.
7.6.Case study:Parsing Phone Numbers
So far you've concentrated on matching whole patterns. Either the pattern matches, or it doesn't. But regular expressions are much more powerful than that. When a regular expression does match, you can pick out specific pieces of it. You can find out what matched where.
Here are the phone numbers I needed to be able to accept:
800-555-1212
800 555 1212
800.555.1212
(800) 555-1212
1-800-555-1212
800-555-1212-1234
800-555-1212x1234
800-555-1212 ext. 1234
work 1-(800) 555.1212 #1234
Quite a variety! In each of these cases, I need to know that the area code was 800, the trunk was 555, and the rest of the phone number was 1212. For those with an extension, I need to know that the extension was 1234.
example 7.10.Finding Numbers
>>> phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$')
>>> phonePattern.search('800-555-1212').groups()
('800', '555', '1212')
What's \d{3}? Well, the {3} means “match exactly three numeric digits”; it's a variation on the {n,m} syntax you saw earlier. \d means “any numeric digit” (0 through 9). Putting it in parentheses means “match exactly three numeric digits, and then remember them as a group that I can ask for later”.
example 7.11.Finding the Extension
>>> phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})-(\d+)$')
>>> phonePattern.search('800-555-1212-12345').groups()
('800', '555', '1212', '12345')
>>> phonePattern.search('800-555-1212')
>>>
example 7.12.Handling Different Separators
>>> phonePattern = re.compile(r'^(\d{3})\D+(\d{3})\D+(\d{4})\D+(\d+)$')
>>> phonePattern.search('800 500 1211 1234').groups()
('800', '500', '1211', '1234')
>>> phonePattern.search('800-500-1211-1234').groups()
('800', '500', '1211', '1234')
\D+. What the heck is that? Well, \D matches any character except a numeric digit, and + means “1 or more”. So \D+ matches one or more characters that are not digits.
Using \D+ instead of - means you can now match phone numbers where the parts are separated by spaces instead of hyphens.
example 7.13.Handling Numbers Without Separators
>>> phonePattern = re.compile(r'^(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')
>>> phonePattern.search('80055512121234').groups()
('800', '555', '1212', '1234')
>>> phonePattern.search('800.555.1212 x1234').groups()
('800', '555', '1212', '1234')
>>> phonePattern.search('800-555-1212').groups()
('800', '555', '1212', '')
>>> phonePattern.search('(800)5551212 x1234')
>>>
Remember that + means “1 or more”? Well, * means “zero or more”. So now you should be able to parse phone numbers even when there is no separator character at all.
example7.14.Handling Leading Characters
>>> phonePattern = re.compile(r'^\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')
>>> phonePattern.search('(800)5551212 ext. 1234').groups()
('800', '555', '1212', '1234')
>>> phonePattern.search('800-555-1212').groups()
('800', '555', '1212', '')
>>> phonePattern.search('work 1-(800) 555.1212 #1234')
>>>
Why doesn't this phone number match? Because there's a 1 before the area code, but you assumed that all the leading characters before the area code were non-numeric characters (\D*). Aargh.
example 7.15.Phone Number,Wherever I May Find Ye
>>> phonePattern = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')
>>> phonePattern.search('work 1-(800) 555.1212 #1234').groups()
('800', '555', '1212', '1234')
>>> phonePattern.search('800-555-1212')
<_sre.SRE_Match object at 0x0145A338>
>>> phonePattern.search('800-555-1212').groups()
('800', '555', '1212', '')
>>> phonePattern.search('80055512121234')
<_sre.SRE_Match object at 0x0145A338>
>>> phonePattern.search('80055512121234').groups()
('800', '555', '1212', '1234')
Note the lack of ^ in this regular expression. You are not matching the beginning of the string anymore. There's nothing that says you need to match the entire input with your regular expression. The regular expression engine will do the hard work of figuring out where the input string starts to match, and go from there.
example 7.16.Parsing Phone Numbers(Final Version)
>>> phonePattern = re.compile(r'''
# don't match beginning of string, number can start anywhere
(\d{3}) # area code is 3 digits (e.g. '800')
\D* # optional separator is any number of non-digits
(\d{3}) # trunk is 3 digits (e.g. '555')
\D* # optional separator
(\d{4}) # rest of number is 4 digits (e.g. '1212')
\D* # optional separator
(\d*) # extension is optional and can be any number of digits
$ # end of string
''', re.VERBOSE)
>>> phonePattern.search('work 1-(800) 555.1212 #1234').groups()
('800', '555', '1212', '1234')
>>> phonePattern.search('800-555-1212')
('800', '555', '1212', '')
You should now be familiar with the following techniques:
^ matches the beginning of a string.
$ matches the end of a string.
\b matches a word boundary.
\d matches any numeric digit.
\D matches any non-numeric character.
x? matches an optional x character (in other words, it matches an x zero or one times).
x* matches x zero or more times.
x+ matches x one or more times.
x{n,m} matches an x character at least n times, but not more than m times.
(a|b|c) matches either a or b or c.
(x) in general is a remembered group. You can get the value of what matched by using the groups() method of the object returned by re.search.
Regular expressions are extremely powerful, but they are not the correct solution for every problem. You should learn enough about them to know when they are appropriate, when they will solve your problems, and when they will cause more problems than they solve.