```python
pattern = (
    r'^'
    r'(1[-\s.])?'  # optional '1-', '1.' or '1 '
    r'(\()?'       # optional opening parenthesis
    r'\d{3}'       # the area code
    r'(?(2)\))'    # if there was an opening parenthesis, close it
    r'[-\s.]?'     # followed by '-' or '.' or space
    r'\d{3}'       # first 3 digits
    r'[-\s.]?'     # followed by '-' or '.' or space
    r'\d{4}$'      # last 4 digits
)
```
```python
import re

numbers = ["123 555 6789",
           "1-(123)-555-6789",
           "(123-555-6789",
           "(123).555.6789",
           "123 55 6789"]

for number in numbers:
    match = re.match(r'^'
                     r'(1[-\s.])?'  # optional '1-', '1.' or '1 '
                     r'(\()?'       # optional opening parenthesis
                     r'\d{3}'       # the area code
                     r'(?(2)\))'    # if there was an opening parenthesis, close it
                     r'[-\s.]?'     # followed by '-' or '.' or space
                     r'\d{3}'       # first 3 digits
                     r'[-\s.]?'     # followed by '-' or '.' or space
                     r'\d{4}$',     # last 4 digits
                     number)
    if match:
        print('{0} is valid'.format(number))
    else:
        print('{0} is not valid'.format(number))
```
```
123 555 6789 is valid
1-(123)-555-6789 is valid
(123-555-6789 is not valid
(123).555.6789 is valid
123 55 6789 is not valid
```
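The commented pattern above can also be written as one string with the `re.VERBOSE` flag, which ignores whitespace inside the pattern and allows inline comments. The following is a minimal sketch under that assumption; the names `PHONE_RE` and `is_valid_phone` are hypothetical and chosen only for illustration.

```python
import re

# Same phone-number pattern, written with re.VERBOSE so the comments
# live inside the pattern string itself.
PHONE_RE = re.compile(r"""
    ^
    (1[-\s.])?    # optional '1-', '1.' or '1 ' prefix (group 1)
    (\()?         # optional opening parenthesis (group 2)
    \d{3}         # area code
    (?(2)\))      # closing parenthesis, required only if group 2 matched
    [-\s.]?       # optional separator
    \d{3}         # first three digits
    [-\s.]?       # optional separator
    \d{4}         # last four digits
    $
""", re.VERBOSE)


def is_valid_phone(number):
    """Return True if `number` matches the phone pattern (hypothetical helper)."""
    return PHONE_RE.match(number) is not None


for number in ["123 555 6789", "(123-555-6789"]:
    print(number, is_valid_phone(number))
```

Compiling the pattern once and reusing it also avoids repeating the long pattern at every call site.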
Regular expressions are a great feature of Python, but they are hard to debug and very easy to get wrong.
Fortunately, Python can print the parse tree of a regular expression if you pass the `re.DEBUG` flag (which is really just the integer 128) to `re.compile` or `re.match`.
```python
import re

numbers = ["123 555 6789",
           "1-(123)-555-6789",
           "(123-555-6789",
           "(123).555.6789",
           "123 55 6789"]

# Compile once with re.DEBUG so the parse tree is printed a single time.
phone = re.compile(r'^'
                   r'(1[-\s.])?'  # optional '1-', '1.' or '1 '
                   r'(\()?'       # optional opening parenthesis
                   r'\d{3}'       # the area code
                   r'(?(2)\))'    # if there was an opening parenthesis, close it
                   r'[-\s.]?'     # followed by '-' or '.' or space
                   r'\d{3}'       # first 3 digits
                   r'[-\s.]?'     # followed by '-' or '.' or space
                   r'\d{4}$',     # last 4 digits
                   re.DEBUG)

for number in numbers:
    if phone.match(number):
        print('{0} is valid'.format(number))
    else:
        print('{0} is not valid'.format(number))
```
```
at_beginning
max_repeat 0 1
  subpattern 1
    literal 49
    in
      literal 45
      category category_space
      literal 46
max_repeat 0 2147483648
  in
    category category_space
max_repeat 0 1
  subpattern 2
    literal 40
max_repeat 0 2147483648
  in
    category category_space
max_repeat 3 3
  in
    category category_digit
max_repeat 0 2147483648
  in
    category category_space
subpattern None
  groupref_exists 2
    literal 41
None
max_repeat 0 2147483648
  in
    category category_space
max_repeat 0 1
  in
    literal 45
    category category_space
    literal 46
max_repeat 0 2147483648
  in
    category category_space
max_repeat 3 3
  in
    category category_digit
max_repeat 0 2147483648
  in
    category category_space
max_repeat 0 1
  in
    literal 45
    category category_space
    literal 46
max_repeat 0 2147483648
  in
    category category_space
max_repeat 4 4
  in
    category category_digit
at at_end
max_repeat 0 2147483648
  in
    category category_space
123 555 6789 is valid
1-(123)-555-6789 is valid
(123-555-6789 is not valid
(123).555.6789 is valid
123 55 6789 is not valid
```
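Because `re.DEBUG` writes the tree to standard output, it can be handy to capture it as a string, for example to log it or to compare two variants of a pattern. The sketch below assumes Python 3, where the tree is printed with upper-case names and may be followed by a listing of the compiled code, so the exact text differs from the output shown above. The helper name `parse_tree` is hypothetical.

```python
import contextlib
import io
import re


def parse_tree(pattern):
    """Return the text that re.DEBUG prints while compiling `pattern`."""
    buffer = io.StringIO()
    # re.DEBUG prints to sys.stdout, so redirect it temporarily.
    with contextlib.redirect_stdout(buffer):
        re.compile(pattern, re.DEBUG)
    return buffer.getvalue()


tree = parse_tree(r'\d{3}[-\s.]?')
print(tree)
```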
Before explaining this concept, I'd like to show an example. We want to find the anchor tags in a piece of HTML:
```python
import re

html = 'Hello <a href="http://pypix.com" title="pypix">Pypix</a>'

# Raw string; '/' needs no escaping in a Python regular expression.
m = re.findall(r'<a.*>.*</a>', html)
if m:
    print(m)
```
```
['<a href="http://pypix.com" title="pypix">Pypix</a>']
```
Now add a second anchor tag to the text:

```python
import re

html = 'Hello <a href="http://pypix.com" title="pypix">Pypix</a>' \
       'Hello <a href="http://example.com" title="example">Example</a>'

m = re.findall(r'<a.*>.*</a>', html)
if m:
    print(m)
```

```
['<a href="http://pypix.com" title="pypix">Pypix</a>Hello <a href="http://example.com" title="example">Example</a>']
```
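This time the pattern matched the first opening tag, the last closing tag, and everything in between, producing one match instead of two separate ones. This is because matching is "greedy" by default.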
Appending `?` to a quantifier makes it non-greedy, so each match stops as early as possible:

```python
import re

html = 'Hello <a href="http://pypix.com" title="pypix">Pypix</a>' \
       'Hello <a href="http://example.com" title="example">Example</a>'

m = re.findall(r'<a.*?>.*?</a>', html)
if m:
    print(m)
```

```
['<a href="http://pypix.com" title="pypix">Pypix</a>', '<a href="http://example.com" title="example">Example</a>']
```
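The difference between greedy and non-greedy quantifiers is easier to see on a shorter string. The sketch below uses a made-up `<b>` example (not from the article) and also shows a negated character class, which is a common alternative to a non-greedy quantifier when the delimiter is a single character.

```python
import re

text = '<b>one</b> and <b>two</b>'

# Greedy: .* runs to the last </b>, so everything becomes one match.
greedy = re.findall(r'<b>.*</b>', text)

# Non-greedy: .*? stops at the first </b> it can reach.
lazy = re.findall(r'<b>.*?</b>', text)

# Negated class: [^<]* cannot cross a '<', so it also stops early.
negated = re.findall(r'<b>[^<]*</b>', text)

print(greedy)
print(lazy)
print(negated)
```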
The following pattern first matches `foo`, then checks whether it is followed by `bar`:
```python
import re

strings = ["hello foo",     # returns False
           "hello foobar"]  # returns True

for string in strings:
    match = re.search(r'foo(?=bar)', string)
    if match:
        print('True')
    else:
        print('False')
```
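This may not look very useful, since we could simply search for `foobar` directly. However, the same syntax also supports negative lookahead. The following example matches `foo` if and only if it is not followed by `bar`: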
```python
import re

strings = ["hello foo",     # returns True
           "hello foobar",  # returns False
           "hello foobaz"]  # returns True

for string in strings:
    match = re.search(r'foo(?!bar)', string)
    if match:
        print('True')
    else:
        print('False')
```
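One property that makes lookahead more than a shortcut for searching `foobar` is that the asserted text is checked but not consumed, so it is not part of the match and is left untouched by a substitution. A minimal sketch with made-up input strings:

```python
import re

# The lookahead checks for 'bar' but only 'foo' is part of the match.
match = re.search(r'foo(?=bar)', 'hello foobar')
matched_text = match.group()

# Replace every 'foo' that is NOT followed by 'bar'.
replaced = re.sub(r'foo(?!bar)', 'X', 'foobar foobaz foo')

print(matched_text)
print(replaced)
```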
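Lookbehind assertions work in a similar way, but they look at the text before the current match. You can use `(?<=...)` for a positive assertion and `(?<!...)` for a negative one. The following pattern matches a `bar` that is not preceded by `foo`: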
```python
import re

strings = ["hello bar",     # returns True
           "hello foobar",  # returns False
           "hello bazbar"]  # returns True

for string in strings:
    match = re.search(r'(?<!foo)bar', string)
    if match:
        print('True')
    else:
        print('False')
```
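As a further illustration, a positive lookbehind can pull out values that follow a marker without including the marker in the result. The sample sentence below is made up for this sketch. Note that Python's `re` module requires the pattern inside a lookbehind to have a fixed width.

```python
import re

text = 'cost $500 plus $10, arriving in 30 days'

# Digits that come directly after a dollar sign (the '$' is not captured).
prices = re.findall(r'(?<=\$)\d+', text)

# Digits that are preceded by neither '$' nor another digit.
other_numbers = re.findall(r'(?<![\$\d])\d+', text)

print(prices)
print(other_numbers)
```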
```python
import re

strings = ["<pypix>",  # returns True
           "<foo",     # returns False
           "bar>",     # returns False
           "hello"]    # returns True

for string in strings:
    match = re.search(r'^(<)?[a-z]+(?(1)>)$', string)
    if match:
        print('True')
    else:
        print('False')
```
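Here the `1` in `(?(1)>)` refers to group 1, `(<)`: the closing `>` is required if and only if that group matched.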
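The same conditional construct works for any optional opening delimiter. The sketch below applies it to double quotes, accepting a word that is either fully quoted or not quoted at all; the name `QUOTED_WORD` and the sample strings are hypothetical.

```python
import re

# Group 1 is the optional opening quote; (?(1)") demands a closing
# quote only when group 1 actually matched.
QUOTED_WORD = re.compile(r'^(")?\w+(?(1)")$')

samples = ['"hello"', 'hello', '"hello', 'hello"']
results = [bool(QUOTED_WORD.search(s)) for s in samples]

for sample, ok in zip(samples, results):
    print(sample, ok)
```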
```python
import re

string = 'Hello foobar'
match = re.search(r'(f.*)(b.*)', string)
print("f* => {0}".format(match.group(1)))  # prints f* => foo
print("b* => {0}".format(match.group(2)))  # prints b* => bar
```
Now let's change it slightly by adding another group, `(H.*)`, at the front:
```python
import re

string = 'Hello foobar'
match = re.search(r'(H.*)(f.*)(b.*)', string)
# Group 1 is now (H.*) and group 2 is (f.*), so both labels are wrong.
print("f* => {0}".format(match.group(1)))  # prints f* => Hello
print("b* => {0}".format(match.group(2)))  # prints b* => foo
```
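The group numbering has changed, and depending on how our code uses these values, this can break our script. We now have to find every place where the match groups are referenced and adjust the indices accordingly. If we really aren't interested in the contents of the newly added group, we can make it "non-capturing", like this: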
```python
import re

string = 'Hello foobar'
match = re.search(r'(?:H.*)(f.*)(b.*)', string)
print("f* => {0}".format(match.group(1)))  # prints f* => foo
print("b* => {0}".format(match.group(2)))  # prints b* => bar
```
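Adding `?:` at the start of a group makes it non-capturing, so it no longer gets a group number and the indices of the other groups stay unchanged.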
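One quick way to see the effect is to compare what `groups()` returns for the capturing and the non-capturing variant of the same pattern. A minimal sketch using the string from the examples above:

```python
import re

string = 'Hello foobar'

# Three capturing groups: the new (H.*) group shifts the others.
capturing = re.search(r'(H.*)(f.*)(b.*)', string).groups()

# (?:H.*) still has to match, but it is not reported as a group.
non_capturing = re.search(r'(?:H.*)(f.*)(b.*)', string).groups()

print(capturing)
print(non_capturing)
```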
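Like the previous technique, this is another way to keep us out of that trap. We can give groups names and then refer to them by name instead of by index. The syntax is `(?P<name>pattern)`: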
```python
import re

string = 'Hello foobar'
match = re.search(r'(?P<fstar>f.*)(?P<bstar>b.*)', string)
print("f* => {0}".format(match.group('fstar')))  # prints f* => foo
print("b* => {0}".format(match.group('bstar')))  # prints b* => bar
```
```python
import re

string = 'Hello foobar'
match = re.search(r'(?P<hi>H.*)(?P<fstar>f.*)(?P<bstar>b.*)', string)
print("f* => {0}".format(match.group('fstar')))  # prints f* => foo
print("b* => {0}".format(match.group('bstar')))  # prints b* => bar
print("h* => {0}".format(match.group('hi')))     # prints h* => Hello
```
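Named groups can also be read all at once with `groupdict()`, and they keep their positional numbers, so existing index-based code continues to work. A short sketch using the same pattern as above:

```python
import re

match = re.search(r'(?P<hi>H.*)(?P<fstar>f.*)(?P<bstar>b.*)', 'Hello foobar')

# All named groups as a dictionary keyed by group name.
groups = match.groupdict()

# A named group is still reachable by its position.
same_value = match.group(2) == match.group('fstar')

print(groups)
print(same_value)
```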
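In Python, `re.sub()` accepts a function as the replacement, and that function is called for every match. Let's look at an example, an e-mail template: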
```python
import re

template = ("Hello [first_name] [last_name], "
            "Thank you for purchasing [product_name] from [store_name]. "
            "The total cost of your purchase was [product_price] plus [ship_price] for shipping. "
            "You can expect your product to arrive in [ship_days_min] to [ship_days_max] business days. "
            "Sincerely, "
            "[store_manager_name]")

# assume dic has all the replacement data
# such as dic['first_name'], dic['product_price'], etc.
dic = {
    "first_name": "John",
    "last_name": "Doe",
    "product_name": "iphone",
    "store_name": "Walkers",
    "product_price": "$500",
    "ship_price": "$10",
    "ship_days_min": "1",
    "ship_days_max": "5",
    "store_manager_name": "DoeJohn"
}

# Non-greedy, so the pattern matches one [placeholder] at a time;
# a greedy .* would swallow everything up to the last ']'.
placeholder = re.compile(r'\[(.*?)\]')
print(placeholder.sub('John', template, count=1))  # replaces only [first_name]
```
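Notice that all the replacements have one thing in common: each is enclosed in a pair of square brackets. We can capture them with a single regular expression and let a callback function handle the specific replacement.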
```python
import re

template = ("Hello [first_name] [last_name], "
            "Thank you for purchasing [product_name] from [store_name]. "
            "The total cost of your purchase was [product_price] plus [ship_price] for shipping. "
            "You can expect your product to arrive in [ship_days_min] to [ship_days_max] business days. "
            "Sincerely, "
            "[store_manager_name]")

# assume dic has all the replacement data
# such as dic['first_name'], dic['product_price'], etc.
dic = {
    "first_name": "John",
    "last_name": "Doe",
    "product_name": "iphone",
    "store_name": "Walkers",
    "product_price": "$500",
    "ship_price": "$10",
    "ship_days_min": "1",
    "ship_days_max": "5",
    "store_manager_name": "DoeJohn"
}


def multiple_replace(dic, text):
    # Build one alternation that matches any "[key]" placeholder literally.
    pattern = "|".join(re.escape("[" + key + "]") for key in dic)
    # The callback strips the brackets and looks the key up in dic.
    return re.sub(pattern, lambda m: dic[m.group()[1:-1]], text)


print(multiple_replace(dic, template))
```
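A variation on the same idea is to match any bracketed word with one generic pattern and decide inside the callback what to do with it. The sketch below is not the article's method; it assumes placeholder names consist of word characters only, and the names `fill_template` and `replace` are hypothetical. Unknown placeholders are left in place rather than raising an error.

```python
import re


def fill_template(template, values):
    """Replace every [key] in `template` with values[key], if present."""
    def replace(match):
        key = match.group(1)
        # Fall back to the original "[key]" text for unknown keys.
        return values.get(key, match.group(0))

    return re.sub(r'\[(\w+)\]', replace, template)


message = fill_template(
    'Hello [first_name] [last_name], total: [product_price] [unknown]',
    {'first_name': 'John', 'last_name': 'Doe', 'product_price': '$500'},
)
print(message)
```

Because the pattern does not depend on the dictionary's keys, it stays the same size no matter how many replacements are defined.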
More important, perhaps, is knowing when not to use regular expressions. In many cases you can find an alternative tool.
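You should use an HTML parser instead. Python has many options:
The latter two handle even malformed HTML gracefully, which is a blessing given the large number of ugly sites out there.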
An example using ElementTree:
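```python
from xml.etree import ElementTree

# ElementTree expects well-formed markup (for example XHTML).
tree = ElementTree.parse('filename.html')
# './/h1' searches the whole tree; a bare 'h1' would only match
# direct children of the root element.
for element in tree.findall('.//h1'):
    print(ElementTree.tostring(element, encoding='unicode'))
```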
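For the anchor-tag task from earlier in this article, the standard library's `html.parser` module is another option that does not require well-formed markup. The following is a minimal sketch, assuming Python 3; the class name `AnchorCollector` is hypothetical, and the HTML string is the two-link sample used above.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == 'a':
            self.links.append(dict(attrs).get('href'))


html = ('Hello <a href="http://pypix.com" title="pypix">Pypix</a>'
        'Hello <a href="http://example.com" title="example">Example</a>')

parser = AnchorCollector()
parser.feed(html)
parser.close()
print(parser.links)
```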