A Guide to Python's re Library

Mind Map

Python RE

Or view the high-resolution image here

import re

pattern = ...
string = ...

regex = re.compile(pattern)
match = regex.search(string)
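
As a concrete sketch of the skeleton above, the snippet below fills in an illustrative pattern and string. Both values are assumptions chosen for the example, not values from the original notes.

```python
import re

# Hypothetical inputs, chosen only for illustration.
pattern = r"\d+"
string = "order 66 shipped"

regex = re.compile(pattern)    # compile once, reuse many times
match = regex.search(string)   # first match anywhere in the string, or None

if match is not None:
    print(match.group())       # the matched text
    print(match.span())        # (start, end) offsets of the match
```

search() returns None when nothing matches, so checking the result before calling group() avoids an AttributeError.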

{m,n}
{0,} is the same as *, {1,} is equivalent to +, and {0,1} is the same as ?.
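The equivalences above can be checked with a small sketch. The sample strings and the helper name accepted are made up for this illustration.

```python
import re

# Hypothetical sample strings used to compare the quantifiers.
samples = ["", "a", "aa", "aaa", "b"]

def accepted(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

# Each pair should accept exactly the same samples.
star_pair = (accepted(r"a{0,}"), accepted(r"a*"))
plus_pair = (accepted(r"a{1,}"), accepted(r"a+"))
opt_pair = (accepted(r"a{0,1}"), accepted(r"a?"))

print(star_pair, plus_pair, opt_pair)
```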
$
Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline.
The regular expression foo matches both 'foo' and 'foobar', while foo$ matches only 'foo'. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches 'foo2' normally, but 'foo1' in MULTILINE mode; searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.
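The behaviour described above can be reproduced directly. This is a minimal sketch using the same strings as the text; the variable names are illustrative only.

```python
import re

text = "foo1\nfoo2\n"

# Default mode: $ matches only at the very end or just before the final newline.
default_match = re.search(r"foo.$", text).group()

# MULTILINE mode: $ also matches before every newline, so the first line wins.
multiline_match = re.search(r"foo.$", text, re.MULTILINE).group()

# A lone $ finds two empty matches in 'foo\n'.
dollar_positions = [m.start() for m in re.finditer(r"$", "foo\n")]

print(default_match, multiline_match, dollar_positions)
```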
(?imsx-imsx:...)
(Zero or more letters from the set i, m, s, x, optionally followed by - followed by one or more letters from the same set.) The letters set or remove the corresponding flags for the enclosed part of the expression: re.I (ignore case), re.M (multi-line), re.S (dot matches all), and re.X (verbose).
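A short sketch of scoped inline flags, assuming Python 3.6 or later (where this syntax is available). The patterns and names are illustrative only.

```python
import re

# (?i:...) turns IGNORECASE on for the group only; "bar" stays case-sensitive.
scoped_on = re.compile(r"(?i:foo)bar")

# (?-i:...) turns IGNORECASE off for the group inside a pattern compiled with re.I.
scoped_off = re.compile(r"(?-i:foo)bar", re.IGNORECASE)

print(bool(scoped_on.fullmatch("FOObar")), bool(scoped_on.fullmatch("fooBAR")))
print(bool(scoped_off.fullmatch("fooBAR")), bool(scoped_off.fullmatch("FOObar")))
```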
\A vs. ^
When not in MULTILINE mode, \A and ^ are effectively the same. In MULTILINE mode, they’re different: \A still matches only at the beginning of the string, but ^ may match at any location inside the string that follows a newline character.
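The difference can be seen on a small multi-line string. The sample text is made up for this sketch.

```python
import re

text = "one\ntwo\nthree"

# Without MULTILINE, ^ matches only at the start of the string.
caret_default = re.findall(r"^\w+", text)

# With MULTILINE, ^ also matches right after each newline.
caret_multiline = re.findall(r"^\w+", text, re.MULTILINE)

# \A ignores MULTILINE and still matches only at the start of the string.
anchor_multiline = re.findall(r"\A\w+", text, re.MULTILINE)

print(caret_default, caret_multiline, anchor_multiline)
```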
\Z vs. $
- \Z: Matches only at the end of the string.
- $: Matches at the end of the string or just before a newline at the end of the string. In MULTILINE mode it also matches at the end of each line, that is, at any location followed by a newline character.
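
A minimal sketch contrasting the two anchors on strings that end with a newline. The sample strings are assumptions for illustration.

```python
import re

# $ tolerates one trailing newline; \Z does not.
dollar_hit = re.search(r"foo$", "foo\n")
z_hit = re.search(r"foo\Z", "foo\n")

lines = "a\nb\n"

# In MULTILINE mode $ matches at the end of every line...
dollar_lines = re.findall(r"\w+$", lines, re.MULTILINE)

# ...while \Z still matches only at the very end, after the final newline.
z_lines = re.findall(r"\w+\Z", lines, re.MULTILINE)

print(dollar_hit is not None, z_hit is not None, dollar_lines, z_lines)
```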

\b
- Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.
- Word boundary. This is a zero-width assertion. Zero-width assertions don't cause the engine to advance through the string; instead, they consume no characters at all, and simply succeed or fail. For example, \b is an assertion that the current position is located at a word boundary; the position isn't changed by the \b at all. This means that zero-width assertions should never be repeated, because if they match once at a given location, they can obviously be matched an infinite number of times.
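
The examples listed above can be verified with a short sketch. The variable names are illustrative only.

```python
import re

word = re.compile(r"\bfoo\b")

# Strings from the text that should and should not contain a whole-word "foo".
hits = [s for s in ["foo", "foo.", "(foo)", "bar foo baz"] if word.search(s)]
misses = [s for s in ["foobar", "foo3"] if word.search(s)]

# \b is zero-width: every match is empty and only marks a position.
boundaries = [m.span() for m in re.finditer(r"\b", "foo bar")]

print(hits, misses, boundaries)
```

Note the raw string prefix: in an ordinary string literal, "\b" is the backspace character rather than a word boundary.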

re.L (re.LOCALE)
- Make \w, \W, \b, \B and case-insensitive matching dependent on the current locale. This flag can be used only with bytes patterns. Its use is discouraged because the locale mechanism is very unreliable: it handles only one "culture" at a time and works only with 8-bit locales. Unicode matching is already enabled by default in Python 3 for Unicode (str) patterns, and it can handle different locales and languages. Corresponds to the inline flag (?L).
- Locales are a feature of the C library intended to help in writing programs that take account of language differences. For example, if you're processing encoded French text, you'd want to be able to write \w+ to match words, but in bytes patterns \w matches only the ASCII character class [a-zA-Z0-9_]; it won't match bytes corresponding to é or ç. If your system is configured properly and a French locale is selected, certain C functions will tell the program that the byte corresponding to é should also be considered a letter. Setting the LOCALE flag when compiling a regular expression will cause the resulting compiled object to use these C functions for \w; this is slower, but also enables \w+ to match French words as you'd expect.
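
The sketch below shows the default behaviour that motivates the flag: \w is ASCII-only in bytes patterns and Unicode-aware in str patterns. It deliberately does not switch locales, since the result of re.LOCALE depends on how the system is configured. The sample word and the Latin-1 encoding are assumptions for illustration.

```python
import re

word = "café"
as_bytes = word.encode("latin-1")   # é becomes the single byte 0xE9

# Bytes pattern: \w is ASCII-only, so the match stops before 0xE9.
bytes_words = re.findall(rb"\w+", as_bytes)

# str pattern: \w is Unicode-aware by default in Python 3.
str_words = re.findall(r"\w+", word)

# re.LOCALE is rejected for str patterns.
try:
    re.compile(r"\w+", re.LOCALE)
    locale_on_str_allowed = True
except ValueError:
    locale_on_str_allowed = False

print(bytes_words, str_words, locale_on_str_allowed)
```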
search() vs. match()