Regular Expressions: Sherlock's Secret Power

Sherlock Holmes is a fictional detective. He observes closely and deduces his conclusions from very little information.

When it comes to finding text, programmers turn into detectives. The usual approach is “Command+F” to search for something in a file. But what if we don't know exactly what we are looking for: every email ID present in a file, the content of an XML tag, or any string that begins with, say, “sh” and ends with “es” (like “sherlock holmes”)?

That is where we need powers like those of detective Sherlock Holmes. Believe me, regular expressions (commonly called regex) make life a lot easier when it comes to finding unknowns.

Fig 2: Sherlock trying to find the unknown from the knowns

Regular Expressions

A regular expression, also called a regex, is basically a pattern used to match strings. It works in nearly all IDEs, and most programming languages and tools support it too, such as Java, JavaScript, shell programs, awk, sed, grep, MongoDB, MySQL, Android, etc. In this blog, I will try to explain how to use regex with proper examples.

1. Basic Matchers

A basic matcher is simply the literal text we want to find. The regex abc means: find the letter a, followed by b, followed by c.

This is the simplest case, and everyone has already used it.

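As a small illustration of a basic matcher, here is a sketch using Python's re module. The sample sentence and the variable names are made up for this example.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "The alphabet starts with abc, not acb."

# The pattern "abc" matches the letter a, then b, then c, in that order.
match = re.search(r"abc", text)

found = match.group() if match else None
position = match.start() if match else -1

# "acb" has the same letters in a different order, so a search for it
# is a separate pattern and says nothing about "abc".
print(found, position)
```

Nothing more is going on here than a plain text search, which is exactly what “Command+F” does.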

2. Special Characters

Regex gives a special meaning to a few characters, such as “.”, “$” and “^”. We will discuss them in detail shortly. To match one of them literally, escape it with a backslash “\”.

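To see why escaping matters, here is a minimal sketch in Python. The price string is an invented example. It compares an unescaped dot with an escaped one, and uses re.escape from the standard library to build the escaped pattern.

```python
import re

# Hypothetical sample text for illustration.
price_text = "Total: $5.00 (was $5x00)"

# Unescaped, "." matches any character, so "5.00" also matches "5x00".
loose = re.findall(r"5.00", price_text)

# Escaped, "\$" and "\." match only a literal dollar sign and a literal dot.
strict = re.findall(r"\$5\.00", price_text)

# re.escape builds an escaped pattern from a literal string for us.
escaped_pattern = re.escape("$5.00")
via_escape = re.findall(escaped_pattern, price_text)

print(loose, strict, via_escape)
```

Using re.escape is handy when the text to search for comes from user input and may contain special characters.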

3.1 Character Class

With a “character class”, also called a “character set”, you can tell the regex engine to match exactly one out of several characters. For example, the regex gr[ae]y matches grey or gray, but not graey.

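The gr[ae]y example can be checked quickly in Python. The word list below is an assumption made up for the demo.

```python
import re

# Hypothetical list of words to search through.
words = "gray grey graey groy"

# [ae] matches exactly one character: either a or e.
matches = re.findall(r"gr[ae]y", words)

print(matches)  # ['gray', 'grey']
```

Note that graey is skipped because the class consumes only one character, and groy is skipped because o is not in the class.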

3.2 Negated Character Classes

If we don't want gray in the output, we can put a caret “^” right after the opening bracket to negate the class. For example, the regex gr[^a]y matches grey, but not gray. Note that [^a] still has to match one character, so it means “any single character other than a”.

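A short Python sketch of the negated class, again with an invented word list:

```python
import re

# Hypothetical list of words to search through.
words = "gray grey groy gry"

# [^a] matches exactly one character that is not "a".
matches = re.findall(r"gr[^a]y", words)

print(matches)  # ['grey', 'groy']
```

The word gry is not matched: the negated class must still consume one character, and nothing is left for the final y.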

4. Shorthand Character Classes

Some character classes are so popular and used so often that shorthand character classes exist for them. The same match can be written either with a character class or with its shorthand.

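Here is a sketch that compares three common shorthands with their explicit character classes in Python. The sample text and the dictionary layout are assumptions for this example. In Python, \d, \w and \s also match non-ASCII characters by default, so the sketch passes the re.ASCII flag to make each pair strictly equivalent.

```python
import re

# Hypothetical sample text for illustration.
text = "Order 66 shipped to user_7 on 2020-01-15"

# Each pair is (explicit character class, shorthand).
pairs = {
    "digit": (r"[0-9]+", r"\d+"),
    "word": (r"[A-Za-z0-9_]+", r"\w+"),
    "whitespace": (r"[ \t\n\r\f\v]+", r"\s+"),
}

# re.ASCII limits \d, \w and \s to ASCII so both forms match the same text.
results = {
    name: (re.findall(cls, text), re.findall(short, text, flags=re.ASCII))
    for name, (cls, short) in pairs.items()
}

digits = results["digit"][1]
print(digits)  # ['66', '7', '2020', '01', '15']
```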

5. Meta Characters

Metacharacters are characters that have a special meaning in a regex. The subsections below go through the common ones and what they mean.

5.1 The Dot

The dot matches any single character (except the newline character).

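A quick Python sketch of the dot, using a made-up string that contains a newline. The re.DOTALL flag is shown only to illustrate the newline exception.

```python
import re

# Hypothetical sample text; "c\nt" has a newline between c and t.
text = "cat cut c\nt ct"

# By default the dot matches any single character except a newline.
default = re.findall(r"c.t", text)

# With re.DOTALL the dot matches a newline as well.
with_dotall = re.findall(r"c.t", text, flags=re.DOTALL)

print(default)      # ['cat', 'cut']
print(with_dotall)  # ['cat', 'cut', 'c\nt']
```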

5.2 The Caret and Dollar

The caret ^ anchors the match to the start of the input string, and the dollar $ anchors it to the end of the input string.

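The anchors can be tried out with a short Python sketch. The sample line is invented so that the word "the" appears both at the start and at the end.

```python
import re

# Hypothetical sample line for illustration.
line = "the end of the"

# Without anchors, both occurrences of "the" are found.
everywhere = [m.start() for m in re.finditer(r"the", line)]

# ^ keeps only a match at the start of the string.
at_start = [m.start() for m in re.finditer(r"^the", line)]

# $ keeps only a match at the end of the string.
at_end = [m.start() for m in re.finditer(r"the$", line)]

print(everywhere, at_start, at_end)  # [0, 11] [0] [11]
```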

5.3 Repetition

Star *: matches the preceding item zero or more times.

Plus +: matches the preceding item one or more times.

Question mark ?: makes the preceding item optional, that is, it matches the item zero or one time.

Curly braces {}: define the allowed range of repetitions of the preceding item, for example {2,3}.

Repeating multiple letters: putting the letters inside parentheses makes the repetition apply to the whole group.

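The quantifiers above can be compared side by side with a small Python sketch. The sample strings and the helper function matching are assumptions made up for this demo. It uses re.fullmatch so that each sample must match the pattern from start to end.

```python
import re

# Hypothetical sample strings: "a", a varying number of "b", then "c".
samples = ["ac", "abc", "abbc", "abbbbc"]


def matching(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]


star = matching(r"ab*c")        # zero or more b
plus = matching(r"ab+c")        # one or more b
optional = matching(r"ab?c")    # zero or one b
ranged = matching(r"ab{2,3}c")  # two or three b

# Parentheses make the + apply to the whole group "ab".
grouped = re.fullmatch(r"(ab)+", "ababab") is not None

print(star, plus, optional, ranged, grouped)
```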

5.4 Vertical bar |

The vertical bar | works like an OR operator in a regex. It is used to define alternation. For example, the regex (T|t)he matches both The and the.

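Here is the (T|t)he example as a Python sketch, with an invented sentence. It uses re.finditer and group(0) because re.findall would return only the content of the capturing group.

```python
import re

# Hypothetical sample sentence for illustration.
text = "The fox and the hound."

# (T|t) matches either an uppercase or a lowercase t, followed by "he".
matches = [m.group(0) for m in re.finditer(r"(T|t)he", text)]

print(matches)  # ['The', 'the']
```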

6. Lookbehind and Lookahead


Negative lookahead: matches expression A only where it is not followed by expression B, written A(?!B). For example, we can fetch “The” only if it is not followed by “day”.


Positive lookahead: simply put, find expression A where expression B follows, written A(?=B). For example, we can fetch “The” only if “day” follows it.

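Both lookaheads can be sketched in Python as follows. The sentence and the exact patterns (including the space before "day") are assumptions for this demo, not the patterns from the original screenshots.

```python
import re

# Hypothetical sample text for illustration.
text = "The day was long. The boy slept."

# Positive lookahead: "The" only where " day" follows.
followed_by_day = [m.start() for m in re.finditer(r"The(?= day)", text)]

# Negative lookahead: "The" only where " day" does not follow.
not_followed_by_day = [m.start() for m in re.finditer(r"The(?! day)", text)]

print(followed_by_day, not_followed_by_day)  # [0] [18]
```

In both cases the match itself is just "The". The lookahead only checks what comes next and does not consume it.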

Positive lookbehind: simply put, find expression A where expression B precedes it, written (?<=B)A. For example, we can fetch “day” or “boy” only if it comes right after “The” or “the”.

Negative lookbehind: simply put, find expression A where expression B does not precede it, written (?<!B)A. For example, we can find “was” only where “day” does not come right before it.

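The two lookbehinds can be sketched in Python like this. The sentence and patterns are made up for the demo. Python's re module requires the lookbehind pattern to have a fixed width, which [Tt]he followed by a space satisfies.

```python
import re

# Hypothetical sample text for illustration.
text = "The day was long. It was the boy who slept."

# Positive lookbehind: "day" or "boy" only right after "The " or "the ".
after_the = re.findall(r"(?<=[Tt]he )(?:day|boy)", text)

# Negative lookbehind: "was" only where "day " does not come right before it.
was_not_after_day = [m.start() for m in re.finditer(r"(?<!day )was", text)]

print(after_the)          # ['day', 'boy']
print(was_not_after_day)  # [21]
```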

Atomic group: to understand atomic groups, let's first understand backtracking in regex.

Backtracking

Backtracking is what a regex engine does naturally during matching whenever an attempt fails. For example, if I'm matching the expression

.+b

against the string

aaaaaabcd

then the greedy .+ first consumes the whole string aaaaaabcd, which leaves nothing for b to match. The engine backtracks one character, so .+ holds aaaaaabc and b is compared against the remaining d. This fails, so it backtracks again, .+ holds aaaaaab, and b is compared against c. This fails too, so it backtracks once more, tries aaaaaa for the .+, matches the b against the b, and succeeds.

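To make the steps concrete, here is a rough Python sketch that imitates this backtracking by hand. The function backtrack_trace is hypothetical and heavily simplified: it handles only the single pattern .+b and assumes the text contains no newline. It is not how a real regex engine is implemented.

```python
import re


def backtrack_trace(text):
    """Imitate `.+b`: give `.+` the longest prefix first, then shrink it."""
    attempts = []
    # .+ needs at least one character, so the prefix length never reaches 0.
    for length in range(len(text), 0, -1):
        consumed = text[:length]             # what .+ currently holds
        next_char = text[length:length + 1]  # what b is compared against
        attempts.append((consumed, next_char))
        if next_char == "b":
            return consumed + next_char, attempts
    return None, attempts


result, attempts = backtrack_trace("aaaaaabcd")
for consumed, next_char in attempts:
    print(repr(consumed), "then b vs", repr(next_char))

# The real engine agrees with the hand-made trace.
real = re.search(r".+b", "aaaaaabcd").group()
print(result, real)
```

The trace shows four attempts: .+ first holds the whole string, then gives back one character at a time until b finally lines up with the b.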

Regex quantifiers are greedy by default: the engine first tries to match as much of the string as the pattern allows, and then backtracks, as explained above, to find the string we are looking for. Using an atomic group, we can disable backtracking into the group.

Pattern: /(.*|b*)[ac]/

bbabbbabbbbc
^              -- Start matching. Look at first item in alternation: .*
bbabbbabbbbc
            ^  -- First match of .*, due to greedy quantifier
bbabbbabbbbc
            X  -- [ac] cannot match
               -- Backtrack to ()
bbabbbabbbbc
           ^   -- Continue exploring other possibilities with .*
               -- Step back 1 character
bbabbbabbbbc
            ^  -- [ac] matches, end of regex, a match is found

With atomic grouping, all the other possibilities of .* are cut off and it is limited to its first match. So after greedily eating the whole string and failing to match, the engine has to fall back to the b* alternative, where it successfully finds a match for the regex.

Pattern: /((?>.*)|b*)[ac]/

bbabbbabbbbc
^              -- Start matching. Look at first item in alternation: (?>.*)
bbabbbabbbbc
            ^  -- First match of .*, due to greedy quantifier
               -- The atomic grouping will disallow .* to be backtracked and rematched
bbabbbabbbbc
            X  -- [ac] cannot match
               -- Backtrack to ()
               -- (?>.*) is atomic, check the next possibility by alternation: b*
bbabbbabbbbc
^              -- Starting to rematch with b*
bbabbbabbbbc
  ^            -- First match with b*, due to greedy quantifier
bbabbbabbbbc
   ^           -- [ac] matches, end of regex, a match is found

So if we use the pattern /((?>.*))[ac]/, there will be no match for the string bbabbbabbbbc.

It would match, however, if we allowed backtracking by removing the ?> from the group.

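The three patterns can be compared with a hedged Python sketch. One assumption to note: Python's re module accepts the atomic group syntax (?>...) only from version 3.11 onward, so the sketch guards those calls with a version check and leaves the results as None on older interpreters.

```python
import re
import sys

text = "bbabbbabbbbc"

# Without the atomic group, .* gives back one character so [ac] can match.
backtracking = re.search(r"(.*)[ac]", text)

# Assumption: (?>...) is only supported by Python's re from 3.11 onward.
if sys.version_info >= (3, 11):
    # .* keeps the whole string and may not backtrack, so [ac] never matches.
    atomic_only = re.search(r"((?>.*))[ac]", text)
    # The b* alternative rescues the match: "bb" followed by "a".
    atomic_or_b = re.search(r"((?>.*)|b*)[ac]", text)
else:
    atomic_only = atomic_or_b = None

print(backtracking.group())
print(atomic_only)
print(atomic_or_b.group() if atomic_or_b else None)
```

On a recent interpreter, the first search matches the whole string, the second finds nothing, and the third matches only the short prefix reached through the b* alternative.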

7. Greedy and Lazy Matching

By default, a regex consumes the string greedily, which means it tries to match as much as possible. Let us try to extract XML tags with a regex. In the regex <.+>, the < matches the character < literally, the . (dot) matches any character (except line terminators), and the + (plus) repeats the dot. The plus is greedy, so the engine repeats the dot as many times as it can before > matches the character > literally. Hence the match runs from the first < all the way to the last > in the string.


But we wanted each tag to be matched separately. A slight change to the regex solves this by making the plus lazy instead of greedy: adding a ? after the + (plus) tells the regex engine to repeat the dot as few times as possible, which gives <.+?>.

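The difference between the greedy and the lazy pattern shows up clearly in a short Python sketch. The markup string is an invented example.

```python
import re

# Hypothetical sample markup for illustration.
text = "<a>first</a> and <b>second</b>"

# Greedy: .+ runs to the last ">" in the string, giving one long match.
greedy = re.findall(r"<.+>", text)

# Lazy: .+? stops at the first ">" it can, giving one match per tag.
lazy = re.findall(r"<.+?>", text)

print(greedy)  # ['<a>first</a> and <b>second</b>']
print(lazy)    # ['<a>', '</a>', '<b>', '</b>']
```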

We can create lazy quantifiers by putting a ? (question mark) after the plus, the star, the curly braces, and the question mark itself.

Conclusion

By combining all the above patterns we can find any string or, better said, any unknown from a list of knowns. Regex is a great tool when it comes to debugging, parsing, masking, etc. https://regex101.com/ is an awesome site for practicing regex. Post in the comments in case of any doubts. Thanks for reading, and be the Sherlock of your files now. :)

Translated from: https://medium.com/@ankurpratik/regular-expressions-sherlocks-secret-power-4f5710ad8fc0
