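An earlier post covered Python regular expressions in theory; this one covers the re module itself. Theory post: http://blog.csdn.net/qunxingvip/article/details/45286869
Import the re module: import re
View the module's docstring: print(re.__doc__)
The docstring is printed below: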
Support for regular expressions (RE).
This module provides regular expression matching operations similar to
those found in Perl. It supports both 8-bit and Unicode strings; both
the pattern and the strings being processed can contain null bytes and
characters outside the US ASCII range.
Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like "A", "a", or "0", are the simplest
regular expressions; they simply match themselves. You can
concatenate ordinary characters, so last matches the string 'last'.
The special characters are:
"." Matches any character except a newline.
"^" Matches the start of the string.
"$" Matches the end of the string or just before the newline at
the end of the string.
"*" Matches 0 or more (greedy) repetitions of the preceding RE.
Greedy means that it will match as many repetitions as possible.
"+" Matches 1 or more (greedy) repetitions of the preceding RE.
"?" Matches 0 or 1 (greedy) of the preceding RE.
*?,+?,?? Non-greedy versions of the previous three special characters.
{m,n} Matches from m to n repetitions of the preceding RE.
{m,n}? Non-greedy version of the above.
"\\" Either escapes special characters or signals a special sequence.
[] Indicates a set of characters.
A "^" as the first character indicates a complementing set.
"|" A|B, creates an RE that will match either A or B.
(...) Matches the RE inside the parentheses.
The contents can be retrieved or matched later in the string.
(?iLmsux) Set the I, L, M, S, U, or X flag for the RE (see below).
(?:...) Non-grouping version of regular parentheses.
(?P<name>...) The substring matched by the group is accessible by name.
(?P=name) Matches the text matched earlier by the group named name.
(?#...) A comment; ignored.
(?=...) Matches if ... matches next, but doesn't consume the string.
(?!...) Matches if ... doesn't match next.
(?<=...) Matches if preceded by ... (must be fixed length).
(?<!...) Matches if not preceded by ... (must be fixed length).
(?(id/name)yes|no) Matches yes pattern if the group with id/name matched,
the (optional) no pattern otherwise.
The special sequences consist of "\\" and a character from the list
below. If the ordinary character is not on the list, then the
resulting RE will match the second character.
\number Matches the contents of the group of the same number.
\A Matches only at the start of the string.
\Z Matches only at the end of the string.
\b Matches the empty string, but only at the start or end of a word.
\B Matches the empty string, but not at the start or end of a word.
\d Matches any decimal digit; equivalent to the set [0-9].
\D Matches any non-digit character; equivalent to the set [^0-9].
\s Matches any whitespace character; equivalent to [ \t\n\r\f\v].
\S Matches any non-whitespace character; equiv. to [^ \t\n\r\f\v].
\w Matches any alphanumeric character; equivalent to [a-zA-Z0-9_].
With LOCALE, it will match the set [0-9_] plus characters defined
as letters for the current locale.
\W Matches the complement of \w.
\\ Matches a literal backslash.
This module exports the following functions:
match Match a regular expression pattern to the beginning of a string.
search Search a string for the presence of a pattern.
sub Substitute occurrences of a pattern found in a string.
subn Same as sub, but also return the number of substitutions made.
split Split a string by the occurrences of a pattern.
findall Find all occurrences of a pattern in a string.
finditer Return an iterator yielding a match object for each match.
compile Compile a pattern into a RegexObject.
purge Clear the regular expression cache.
escape Backslash all non-alphanumerics in a string.
Some of the functions in this module takes flags as optional parameters:
I IGNORECASE Perform case-insensitive matching.
L LOCALE Make \w, \W, \b, \B, dependent on the current locale.
M MULTILINE "^" matches the beginning of lines (after a newline)
as well as the string.
"$" matches the end of lines (before a newline) as well
as the end of the string.
S DOTALL "." matches any character at all, including the newline.
X VERBOSE Ignore whitespace and comments for nicer looking RE's.
U UNICODE Make \w, \W, \b, \B, dependent on the Unicode locale.
This module also defines an exception 'error'.
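A few of the constructs and flags listed in the docstring can be tried directly. The following is only a small sketch: the sample strings and variable names are made up for illustration, and print() is written with parentheses so that it runs on both Python 2 and Python 3.

```python
import re

# Greedy vs. non-greedy repetition
html = "<b>one</b><b>two</b>"
greedy = re.search(r"<b>.*</b>", html).group()
lazy = re.search(r"<b>.*?</b>", html).group()
print(greedy)  # <b>one</b><b>two</b>
print(lazy)    # <b>one</b>

# Named group (?P<name>...) and a backreference to it (?P=name)
doubled = re.search(r"(?P<word>\w+) (?P=word)", "it is is fine").group("word")
print(doubled)  # is

# Lookahead (?=...) and negative lookbehind (?<!...)
price = re.findall(r"\d+(?= yuan)", "5 apples cost 12 yuan")
no_dollar = re.findall(r"(?<!\$)\b\d+", "$5 and 7")
print(price)      # ['12']
print(no_dollar)  # ['7']

# MULTILINE flag: "^" also matches right after each newline
line_starts = re.findall(r"^\w+", "one\ntwo", re.M)
print(line_starts)  # ['one', 'two']
```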
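The docstring above summarizes the basic syntax and the functions the module exports. The syntax is explained in the post linked above; the rest of this post walks through the main functions.
View the help: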
help(re.match)
Help on function match in module re:
match(pattern, string, flags=0)
Try to apply the pattern at the start of the string, returning a match object, or None if no match was found.
re.match(pattern, string, flags=0)
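Function: tries to match pattern at the very beginning of string. On success it returns a match object (call group() on it to get the matched text); if the start of the string does not match, it returns None. flags is an optional argument that controls how the pattern is matched.
Example: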
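import re
# The dots are escaped so they match a literal "." rather than any character.
pattern = r'[w]{3}\.[a-z]+\.(com)'
str1 = "www.baidu.com.net"
str2 = "http:www.baidu.com.net"
re1 = re.match(pattern, str1)
print(re1.group(0))
re2 = re.match(pattern, str2)
# re2 is None here; calling re2.group(0) directly would raise AttributeError.
print(re2.group(0) if re2 else None)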
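The pattern matches a URL of the form www.xxx.com at the start of the string. The first call prints www.baidu.com. For the second string re.match returns None, exactly as documented, because the string starts with "http:" rather than "www". Calling group() on that None without a check raises AttributeError, which is where the error comes from.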
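Because re.match returns None on failure, a common habit is to test the result before using it. A minimal sketch of that idea follows; the helper name match_prefix is hypothetical and not part of the re module.

```python
import re

def match_prefix(pattern, text, flags=0):
    """Return the text matched at the start of `text`, or None if no match."""
    m = re.match(pattern, text, flags)
    return m.group(0) if m else None

pattern = r"[w]{3}\.[a-z]+\.(com)"
print(match_prefix(pattern, "www.baidu.com.net"))       # www.baidu.com
print(match_prefix(pattern, "http:www.baidu.com.net"))  # None
# The optional flags argument changes how the pattern is applied
print(match_prefix(pattern, "WWW.BAIDU.COM", re.I))     # WWW.BAIDU.COM
```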
View the help:
help(re.search)
Help on function search in module re:
search(pattern, string, flags=0)
Scan through string looking for a match to the pattern, returning
a match object, or None if no match was found.
re.search(pattern, string, flags=0)
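Function: scans string for the first location where pattern matches and returns a match object for it; returns None if the pattern does not match anywhere in the string.
Example: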
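import re
pattern = r'[w]{3}\.[a-z]+\.(com)'
str1 = "china.www.baidu.com.net"
str2 = "http:www.baiducom.net"
re1 = re.search(pattern, str1)
print(re1.group())
re2 = re.search(pattern, str2)
# "baiducom" has no "." before "com", so re2 is None;
# calling re2.group() directly would raise AttributeError.
print(re2.group() if re2 else None)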
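The first call prints www.baidu.com. The second string does not contain the pattern, so re.search returns None, and calling group() on it without a check raises AttributeError.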
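The difference between match and search is easiest to see by running both on the same string. This is a small sketch reusing the pattern from the examples; the variable names are illustrative only.

```python
import re

pattern = r"[w]{3}\.[a-z]+\.(com)"
text = "china.www.baidu.com.net"

at_start = re.match(pattern, text)   # None: the text starts with "china."
anywhere = re.search(pattern, text)  # finds the URL further along

print(at_start)           # None
print(anywhere.group())   # www.baidu.com
print(anywhere.span())    # (6, 19)
print(anywhere.group(1))  # com
```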
View the help:
help(re.sub)
Help on function sub in module re:
sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost
non-overlapping occurrences of the pattern in string by the
replacement repl. repl can be either a string or a callable;
if a string, backslash escapes in it are processed. If it is
a callable, it's passed the match object and must return
a replacement string to be used.
re.sub(pattern,repl,string,count=0,flags=0)
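Function: replaces the substrings of string that match pattern with repl. count defaults to 0, which replaces every match; count=2, for example, replaces only the first two.
Example: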
import re
pattern = r'[w]{3}\.[a-z]+\.(com)'
repl = 'www.google.com'
str3 = "i love www.baidu.com,tom love www.baid.com"
# count=1: only the first match is replaced
re3 = re.sub(pattern, repl, str3, count=1)
print(re3)
Output: i love www.google.com,tom love www.baid.com
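The help text also says that repl may contain backslash escapes or be a callable. The sketch below tries both forms; the pattern with one capturing group and the function name shout are assumptions made for this illustration.

```python
import re

text = "i love www.baidu.com,tom love www.baid.com"
pattern = r"www\.([a-z]+)\.com"

# String replacement with a backreference to group 1
swapped = re.sub(pattern, r"\1.example", text)
print(swapped)  # i love baidu.example,tom love baid.example

# Callable replacement: it receives the match object and returns the new text
def shout(match):
    return match.group(0).upper()

upper = re.sub(pattern, shout, text)
print(upper)  # i love WWW.BAIDU.COM,tom love WWW.BAID.COM
```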
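re.subn(pattern, repl, string, count=0, flags=0)
Function: works like re.sub, but returns a tuple of the new string and the number of substitutions made.
Example: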
import re
pattern = r'[w]{3}\.[a-z]+\.(com)'
repl = 'www.google.com'
str3 = "i love www.baidu.com,tom love www.baid.com"
# count=2: at most two matches are replaced
re3 = re.subn(pattern, repl, str3, count=2)
print(re3)
Output: ('i love www.google.com,tom love www.google.com', 2)
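The returned count is handy for checking whether anything was changed at all. A short sketch, with a made-up input string, that collapses runs of whitespace and unpacks the tuple:

```python
import re

# Collapse every run of whitespace into a single space
new_text, n = re.subn(r"\s+", " ", "a   b \t c")
print(new_text)  # a b c
print(n)         # 2

# When nothing matches, the string comes back unchanged and the count is 0
same_text, zero = re.subn(r"\d+", "#", "no digits here")
print(zero)      # 0
```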
View the help:
help(re.split)
Help on function split in module re:
split(pattern, string, maxsplit=0, flags=0)
Split the source string by the occurrences of the pattern,
returning a list containing the resulting substrings.
re.split(pattern,string,maxsplit=0,flags=0)
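Function: splits string at every occurrence of pattern and returns the pieces as a list. maxsplit is the maximum number of splits to perform; the default of 0 means no limit.
Example: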
import re
# "text" is used instead of "str" to avoid shadowing the built-in str type
text = "xiaoming,xiaohua,xiaoli,xiaoqiang,xiaozhang"
pattern = ","
print(re.split(pattern, text))
Output: ['xiaoming', 'xiaohua', 'xiaoli', 'xiaoqiang', 'xiaozhang']
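Splitting on a single comma could also be done with the string's own split method; re.split becomes useful when the separator varies. The following sketch uses an invented input with mixed separators and also shows maxsplit and a capturing group in the pattern.

```python
import re

text = "xiaoming, xiaohua;xiaoli  xiaoqiang"

# Any run of commas, semicolons or whitespace counts as one separator
parts = re.split(r"[,;\s]+", text)
print(parts)  # ['xiaoming', 'xiaohua', 'xiaoli', 'xiaoqiang']

# maxsplit=1 stops after the first split and keeps the rest intact
first_two = re.split(r"[,;\s]+", text, maxsplit=1)
print(first_two)  # ['xiaoming', 'xiaohua;xiaoli  xiaoqiang']

# A capturing group in the pattern keeps the separators in the result
kept = re.split(r"(,)", "a,b")
print(kept)  # ['a', ',', 'b']
```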
View the help:
help(re.findall)
Help on function findall in module re:
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
re.findall(pattern, string, flags=0)
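Function: finds all non-overlapping substrings of string that match the pattern and returns them in a list; the list is empty if nothing matches.
Example: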
import re
text = "xiaoming,xiaohua,xiaoli,xiaoqiang,xiaozhang"
pattern = r"\w+"
print(re.findall(pattern, text))
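The output is the same as in the split example, ['xiaoming', 'xiaohua', 'xiaoli', 'xiaoqiang', 'xiaozhang'], although the approach differs: split cuts the string at each comma, while findall collects every run of word characters.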
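The help text notes that the return value changes when the pattern contains groups. A small sketch with an invented key=value string shows the three cases:

```python
import re

text = "tom=12, amy=7"

# No groups: the whole match is returned
whole = re.findall(r"\w+=\d+", text)
print(whole)       # ['tom=12', 'amy=7']

# One group: only the group's text is returned
one_group = re.findall(r"\w+=(\d+)", text)
print(one_group)   # ['12', '7']

# Several groups: a list of tuples, one tuple per match
two_groups = re.findall(r"(\w+)=(\d+)", text)
print(two_groups)  # [('tom', '12'), ('amy', '7')]
```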
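re.finditer(pattern, string, flags=0)
Function: like findall, it finds every substring matched by the pattern, but it returns an iterator of match objects instead of a list.
Example: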
import re
text = "xiaoming,xiaohua,xiaoli,xiaoqiang,xiaozhang"
pattern = r"\w+"
re4 = re.finditer(pattern, text)
for i in re4:
    print(i.group())
finditer returns an iterator, so the matches are printed with a for loop:
>>> for i in re4:
...     print(i.group())
...
xiaoming
xiaohua
xiaoli
xiaoqiang
xiaozhang
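Since finditer yields match objects, each result also carries its position, which findall does not provide. The sketch below, with a shortened sample string, collects the positions and shows that the iterator can be consumed only once.

```python
import re

text = "xiaoming,xiaohua,xiaoli"

# Each match object knows where it was found
spans = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]
print(spans)  # [('xiaoming', 0, 8), ('xiaohua', 9, 16), ('xiaoli', 17, 23)]

# An iterator is exhausted after one pass
it = re.finditer(r"\w+", text)
first_pass = [m.group() for m in it]
second_pass = [m.group() for m in it]
print(first_pass)   # ['xiaoming', 'xiaohua', 'xiaoli']
print(second_pass)  # []
```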
View the help:
help(re.compile)
Help on function compile in module re:
compile(pattern, flags=0)
Compile a regular expression pattern, returning a pattern object.
re.compile(pattern, flags=0)
Function: compiles the regular expression pattern into a pattern object.
Example:
import re
text = "xiaoming,xiaohua,xiaoli,xiaoqiang,xiaozhang"
pattern = r"\w+"
patternobj = re.compile(pattern)
# Call finditer on the compiled object instead of passing the pattern string again
re4 = patternobj.finditer(text)
for i in re4:
    print(i.group())
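The output is the same as in the previous example. compile turns the pattern into an object whose methods (match, search, findall, finditer and so on) can then be called directly, which is convenient when the same pattern is used many times.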
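A compiled pattern is typically created once and then reused. The following sketch assumes a small list of input lines invented for the example:

```python
import re

word = re.compile(r"\w+")
lines = ["xiaoming,xiaohua", "xiaoli"]

# The same compiled object is reused for every line
counts = [len(word.findall(line)) for line in lines]
print(counts)        # [2, 1]
print(word.pattern)  # \w+

# Flags are passed once, at compile time
ci = re.compile(r"xiao\w+", re.IGNORECASE)
name = ci.match("XiaoMing").group()
print(name)          # XiaoMing
```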
View the help:
help(re.purge)
Help on function purge in module re:
purge()
Clear the regular expression cache
Function: clears the cache of compiled regular expressions.
View the help:
help(re.escape)
Help on function escape in module re:
escape(pattern)
Escape all non-alphanumeric characters in pattern.
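Function: escapes the non-alphanumeric characters in a string by putting a backslash in front of them, so that the result can be used inside a pattern to match that text literally.
Example: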
>>> pattern
'\\w+'
>>> re.escape(pattern)
'\\\\w\\+'
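The two results differ: the backslash and the plus sign have each been escaped, so the escaped pattern matches the literal text \w+ instead of one or more word characters.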
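The practical use of re.escape is matching text that happens to contain special characters. A minimal sketch, with a haystack string made up for the comparison:

```python
import re

literal = "www.baidu.com"
haystack = "wwwXbaiduYcom www.baidu.com"

# Unescaped, each "." matches any character, so both words are found
unescaped = re.findall(literal, haystack)
print(unescaped)  # ['wwwXbaiduYcom', 'www.baidu.com']

# Escaped, the dots are literal and only the real URL is found
escaped = re.findall(re.escape(literal), haystack)
print(escaped)    # ['www.baidu.com']
```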