Python Data Analysis Technology Stack, Chapter 03, Section 01: Regular expressions

A regular expression is a pattern containing both characters (like letters and digits) and metacharacters (like the * and $ symbols). Regular expressions can be used whenever we want to search, replace, or extract data with an identifiable pattern, for example, dates, postal codes, HTML tags, phone numbers, and so on. They can also be used to validate fields like passwords and email addresses, by ensuring that the input from the user is in the correct format.

Steps for solving problems with regular expressions

Support for regular expressions is provided by the re module in Python, which can be imported using the following statement:

import re

The re module is part of the Python standard library, so it is available in every Python installation (including Anaconda) and does not need to be installed separately with pip.

Once the module is imported, follow these two steps.

Define and compile the regular expression: After the re module is imported, we define the regular expression and compile it. The search pattern is written as a string, usually with the prefix “r”. The “r” prefix marks a raw string, which tells Python to treat backslashes literally rather than as the start of an escape sequence. The prefix is optional, but it is recommended whenever the pattern contains a backslash. The compile function compiles the search pattern into a reusable pattern object. Here, the search pattern (“and”) is passed as the argument to the compile function.

search_pattern=re.compile(r'and')
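
To see why the “r” prefix matters, the following minimal sketch compares a raw and a non-raw pattern. The word-boundary token \b and the sample text are illustrative choices that do not appear in the original example.

```python
import re

# In a normal string literal, '\b' is the single backspace character.
# In a raw string, r'\b' stays as backslash + 'b', which the regex
# engine reads as a word boundary.
print(len('\b'), len(r'\b'))   # 1 2

text = 'cat concat cat'
raw_matches = re.findall(r'\bcat\b', text)    # "cat" as a whole word
plain_matches = re.findall('\bcat\b', text)   # backspace + "cat" + backspace

print(raw_matches)     # ['cat', 'cat']
print(plain_matches)   # []
```

For a pattern without backslashes, such as 'and', the raw and non-raw forms behave identically.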

Locate the search pattern (regular expression) in your string: In the second step, we try to locate this pattern in the string to be searched using the search method. This method is called on the variable (search_pattern) we defined in the previous step.

search_pattern.search('Today and tomorrow')

A match object is returned since the search pattern (“and”) is found in the string (“Today and tomorrow”).
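
The two steps can be put together as in the sketch below, which also shows how to read the result. The second string ('Today or tomorrow') is an added illustration of the no-match case.

```python
import re

search_pattern = re.compile(r'and')
match = search_pattern.search('Today and tomorrow')

# search returns None when nothing is found, so test the result before using it
if match is not None:
    print(match.group())   # the matched text: and
    print(match.span())    # (start, end) positions: (6, 9)

no_match = search_pattern.search('Today or tomorrow')
print(no_match)            # None
```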

Shortcut (combining the two steps)

The preceding two steps can be combined into a single step, as shown in the following statement:

re.search('and','Today and tomorrow')

With this single line of code, we define, compile, and locate the search pattern in one step.

Python functions for regular expressions

We use regular expressions for matching, splitting, and replacing text, and there is a separate function for each of these tasks. Table 3-1 provides a list of all these functions, along with examples of their usage.

re.findall( ): Searches for all possible matches of the regular expression and returns a list of all the matches found in the string.

re.findall('3','98371234')

re.search( ): Searches for a single match and returns a match object corresponding to the first match found in the string.

re.search('3','98371234')

re.match( ): This function is similar to the re.search function. The limitation of this function is that it returns a match object only if the pattern is present at the beginning of the string.

re.match('3','98371234')

re.split( ): Splits the string at the locations where the search pattern is found in the string being searched.

re.split('3','98371234')

re.sub( ): Substitutes the search pattern with another string or pattern.

re.sub('3','three','98371234')
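
The five calls above can be run side by side to compare their return values. This is a small sketch using the same pattern and string; the variable names are chosen here for illustration.

```python
import re

digits = '98371234'

all_matches = re.findall('3', digits)     # every match, as a list of strings
first_match = re.search('3', digits)      # match object for the first '3'
start_match = re.match('3', digits)       # None: the string starts with '9'
pieces = re.split('3', digits)            # the text between the matches
replaced = re.sub('3', 'three', digits)   # every '3' replaced

print(all_matches)          # ['3', '3']
print(first_match.span())   # (2, 3)
print(start_match)          # None
print(pieces)               # ['98', '712', '4']
print(replaced)             # 98three712three4
```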

Metacharacters

Metacharacters are characters used in regular expressions that have a special meaning. These metacharacters are explained in the following, along with examples to demonstrate their usage.

Dot (.) metacharacter

This metacharacter matches any single character except a newline. The character could be a digit, a letter, or even a literal dot.

In the following example, we try to match three-character words that start with the two letters “ba”. The words to be searched are given in the second argument, after the comma.

re.findall("ba.","bar bat bad ba. ban")

Note that one of the results shown in the output, “ba.”, is an instance where the . (dot) metacharacter has matched itself.
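
The output of the call can be checked with the short sketch below. The last two lines are an added illustration of standard re behavior: by default the dot does not match a newline, unless the re.DOTALL flag is passed.

```python
import re

words = re.findall("ba.", "bar bat bad ba. ban")
print(words)   # ['bar', 'bat', 'bad', 'ba.', 'ban']

# By default the dot does not match a newline character
no_newline = re.findall("ba.", "ba\n")
print(no_newline)     # []

# The re.DOTALL flag lifts that restriction
with_dotall = re.findall("ba.", "ba\n", re.DOTALL)
print(with_dotall)    # ['ba\n']
```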

Square brackets ([]) as metacharacters

To match any one character among a set of characters, we use square brackets ([ ]). Within these square brackets, we define a set of characters, exactly one of which must match the character at that position in our text.

Let us understand this with an example. In the following example, we try to match all strings that consist of one of the characters ‘c’, ‘r’, ‘b’, ‘m’, ‘d’, ‘h’, or ‘w’ followed by the string “ash”.

regex=re.compile(r'[crbmdhw]ash')
regex.findall('cash rash bash mash dash hash wash crash ash')

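Note that the string “ash” is not matched, because the pattern requires exactly one of the characters defined within the square brackets immediately before “ash”. The word “crash” is not matched as a whole either, but findall still reports the “rash” inside it, because the pattern can match anywhere within a word.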
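
Running the example shows exactly what findall returns, including a “rash” that comes from inside “crash”. The second pattern is an assumption added here, not part of the original text: it wraps the expression in \b (word boundary) anchors for the case where only whole words are wanted.

```python
import re

text = 'cash rash bash mash dash hash wash crash ash'

regex = re.compile(r'[crbmdhw]ash')
anywhere = regex.findall(text)
print(anywhere)
# ['cash', 'rash', 'bash', 'mash', 'dash', 'hash', 'wash', 'rash']
# The last 'rash' is the tail of 'crash'.

# Assumed variant: \b on both sides restricts matches to whole words
whole_words = re.compile(r'\b[crbmdhw]ash\b').findall(text)
print(whole_words)
# ['cash', 'rash', 'bash', 'mash', 'dash', 'hash', 'wash']
```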

Question mark (?) metacharacter

This metacharacter is used when you need to match at most one occurrence of a character. This means that the character we are looking for could be absent in the search string or occur just once. Consider the following example, where we try to match strings that start with the characters “Austr”, end with the characters “ia”, and have zero or one occurrence of each of the following characters in between, in this order: “a”, “l”, “a”, “s”.

regex=re.compile(r'Austr[a]?[l]?[a]?[s]?ia')
regex.findall('Austria Australia Australasia Asia')
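
The result of this example can be verified with the sketch below. The shorter pattern without square brackets is an added note: a single character followed by ? behaves the same as a one-character set followed by ?.

```python
import re

regex = re.compile(r'Austr[a]?[l]?[a]?[s]?ia')
matches = regex.findall('Austria Australia Australasia Asia')
print(matches)   # ['Austria', 'Australia', 'Australasia']

# Equivalent spelling without the one-character sets
shorter = re.findall(r'Austra?l?a?s?ia', 'Austria Australia Australasia Asia')
print(shorter == matches)   # True
```

“Asia” is not matched because it does not start with “Austr”.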

Asterisk (*) metacharacter

This metacharacter can match zero or more occurrences of a given search pattern. In other words, the search pattern may not occur at all in the string, or it can occur any number of times.

Let us understand this with an example, where we try to match all strings that start with “abc” followed by zero or more occurrences of the digit “1”.

re.findall("abc[1]*","abc1 abc111 abc1 abc abc111111111111 abc01")

Note that here we have combined the compilation and the search of the regular expression into a single call.
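
The sketch below runs the same call and prints the result. Note how “abc” alone and the “abc” at the start of “abc01” both match with zero occurrences of “1”.

```python
import re

text = "abc1 abc111 abc1 abc abc111111111111 abc01"
matches = re.findall("abc[1]*", text)
print(matches)
# ['abc1', 'abc111', 'abc1', 'abc', 'abc111111111111', 'abc']
```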

Backslash (\) metacharacter

The backslash symbol is used to indicate a character class, which is a predefined set of characters. In Table 3-2, the commonly used character classes are explained.

Table 3-2. Commonly used character classes

\d Matches a digit (0–9)
\D Matches any character that is not a digit
\w Matches a word character, which could be a lowercase letter (a–z), an uppercase letter (A–Z), a digit (0–9), or the underscore (_)
\W Matches any character that is not a word character
\s Matches any whitespace character
\S Matches any non-whitespace character
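
A quick way to get a feel for these classes is to apply each of them to the same string. The sample string below is made up for illustration.

```python
import re

sample = 'Room 42, Floor_3'

digits = re.findall(r'\d', sample)        # single digits
words = re.findall(r'\w+', sample)        # runs of word characters
spaces = re.findall(r'\s', sample)        # whitespace characters
non_words = re.findall(r'\W', sample)     # everything that is not a word character

print(digits)      # ['4', '2', '3']
print(words)       # ['Room', '42', 'Floor_3']
print(spaces)      # [' ', ' ']
print(non_words)   # [' ', ',', ' ']
```

The underscore in 'Floor_3' stays inside the match because \w treats it as a word character.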

Another usage of the backslash symbol: Escaping metacharacters

As we have seen, in regular expressions, metacharacters like . and * have special meanings. If we want to use these characters in the literal sense, we need to “escape” them by prefixing them with a backslash (\). For example, to search for the text W.H.O, we need to escape the . (dot) character to prevent it from being treated as a metacharacter.

regex=re.compile(r'W\.H\.O')
regex.search('W.H.O norms')
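
The difference between the escaped and the unescaped dot becomes visible with a string that has other characters in place of the dots. The string 'WxHyO norms' is a made-up test case; re.escape is a standard library helper that adds the backslashes automatically.

```python
import re

escaped = re.compile(r'W\.H\.O')
unescaped = re.compile(r'W.H.O')

literal_hit = escaped.search('W.H.O norms')
literal_miss = escaped.search('WxHyO norms')
loose_hit = unescaped.search('WxHyO norms')

print(literal_hit.group())   # W.H.O
print(literal_miss)          # None
print(loose_hit.group())     # WxHyO

# re.escape builds the escaped pattern from plain text
auto_escaped = re.escape('W.H.O')
print(auto_escaped)          # W\.H\.O
```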

Plus (+) metacharacter

This metacharacter matches one or more occurrences of a search pattern. The following is an example where we try to match all strings that consist of one or more lowercase letters followed by “123”.

re.findall("[a-z]+123","a123 b123 123 ab123 xyz123")
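
The sketch below prints the result of this call and, for contrast, repeats it with * in place of +. The comparison with * is an added illustration.

```python
import re

text = "a123 b123 123 ab123 xyz123"

one_or_more = re.findall("[a-z]+123", text)
print(one_or_more)    # ['a123', 'b123', 'ab123', 'xyz123']

# With *, the letters become optional, so the bare "123" also matches
zero_or_more = re.findall("[a-z]*123", text)
print(zero_or_more)   # ['a123', 'b123', '123', 'ab123', 'xyz123']
```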

Curly braces {} as metacharacters

By placing a number, or a range of numbers, within curly braces, we specify how many times the preceding element of the search pattern must be repeated.

In the following example, we find out all the phone numbers in the format “xxx-xxx-xxxx” (three digits, followed by another set of three digits, and a final set of four digits, each set separated by a “-” sign).

regex=re.compile(r'[\d]{3}-[\d]{3}-[\d]{4}')
regex.findall('987-999-8888 99122222 911-911-9111')

Only the first and third numbers in the search string (987-999-8888, 911-911-9111) match the pattern. The \d metacharacter represents a digit.
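
The phone number example can be run as follows. The second pattern is an added note showing that \d{3} is an equivalent, shorter spelling of [\d]{3}.

```python
import re

text = '987-999-8888 99122222 911-911-9111'

regex = re.compile(r'[\d]{3}-[\d]{3}-[\d]{4}')
numbers = regex.findall(text)
print(numbers)   # ['987-999-8888', '911-911-9111']

# Same pattern without the square brackets
shorter = re.findall(r'\d{3}-\d{3}-\d{4}', text)
print(shorter == numbers)   # True
```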

If we do not have an exact figure for the number of repetitions but know the minimum and the maximum number of repetitions, we can give the lower and upper limits within the curly braces, separated by a comma. In the following example, we search for all runs of word characters that are at least six and at most ten characters long.

regex=re.compile(r'[\w]{6,10}')
regex.findall('abcd abcd1234,abc$$$$$,abcd12 abcdef')
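
The sketch below shows the result of this search. The extra twelve-letter string is a made-up case that illustrates one caveat: a run longer than ten characters is not rejected, and its first ten characters are matched instead.

```python
import re

regex = re.compile(r'[\w]{6,10}')
matches = regex.findall('abcd abcd1234,abc$$$$$,abcd12 abcdef')
print(matches)   # ['abcd1234', 'abcd12', 'abcdef']

# A 12-character run: the first ten characters match, the remaining two do not
long_run = regex.findall('abcdefghijkl')
print(long_run)  # ['abcdefghij']
```

“abcd” and “abc” are skipped because they contain fewer than six word characters, and the “$” signs are not word characters at all.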

Dollar ($) metacharacter

This metacharacter matches a pattern if it is present at the end of the search string.

In the following example, we use this metacharacter to check if the search string ends with a digit.

re.search(r'[\d]$','aa*5')
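
A minimal sketch of this check, with an added counterexample ('aa*5b') in which the digit is not the last character:

```python
import re

ends_with_digit = re.search(r'[\d]$', 'aa*5')
print(ends_with_digit.group())   # 5

not_at_end = re.search(r'[\d]$', 'aa*5b')
print(not_at_end)                # None: the string ends with 'b'
```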

Caret (^) metacharacter

The caret (^) metacharacter looks for a match at the beginning of the string.

In the following example, we check if the search string begins with a whitespace character.

re.search(r'^[\s]',' a bird')
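
The sketch below runs this check and adds a counterexample. The last two lines are an added illustration of a common combination: using ^ and $ together so that the pattern has to cover the whole string, here to test whether a string consists only of digits.

```python
import re

leading_space = re.search(r'^[\s]', ' a bird')
print(leading_space is not None)   # True

no_leading_space = re.search(r'^[\s]', 'a bird')
print(no_leading_space)            # None

# ^ and $ together: every character from start to end must be a digit
all_digits = bool(re.search(r'^\d+$', '12345'))
mixed = bool(re.search(r'^\d+$', '123a5'))
print(all_digits, mixed)           # True False
```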
