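Regular Expressions (Regex)

Preface

Today I want to add one more topic to my notes: regular expressions (regex). First, what is a regular expression? It can be understood as a pattern used to match strings. Regular expressions show up in many scenarios, such as web scraping and text search, and they are very flexible. Getting started is easy, but using them well is hard. Many text editors support regular expressions, and the core syntax differs only slightly between them, so this is a skill everyone should have.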

Four Main Concepts

  1. Matching
  2. Memorization (capturing groups)
  3. Alternation
  4. Repetition
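
The four concepts can be sketched with Python's built-in `re` module. This is only a minimal illustration; the sample strings and variable names are made up for the example.

```python
import re

# 1. Matching: a literal pattern finds itself inside the text.
matching = re.search(r"cat", "concatenate").group()

# 2. Remembering a group: parentheses capture text, and \1 refers back to it.
remembered = re.search(r"(\w)\1", "hello").group()

# 3. Alternation: | picks one of several alternatives.
alternation = re.findall(r"cat|dog", "cat, dog, bird")

# 4. Repetition: quantifiers such as {2,3} repeat the preceding item.
repetition = re.findall(r"a{2,3}", "a aa aaa aaaa")

print(matching)     # cat
print(remembered)   # ll
print(alternation)  # ['cat', 'dog']
print(repetition)   # ['aa', 'aaa', 'aaa']
```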

Usage

Some commonly used pattern elements are listed below:

a    # match 'a'
()    # group
[]    # set
{}    # repetition count, e.g. {0,2}
.    # any one char
*    # 0 or more
+    # 1 or more
?    # 0 or 1
^    # start with
$    # end with
|    # or
\    # escape
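
To see these symbols in action, here is a small sketch using Python's `re` module. The sample strings are invented for illustration; other regex engines may differ in details.

```python
import re

in_set = re.findall(r"[abc]", "aXbYc")                # [] set
grouped = re.search(r"(ab)+", "ababab").group()       # () group, repeated by +
counted = re.findall(r"o{2}", "o oo ooo")             # {} repetition count
any_char = re.findall(r"a.c", "abc a-c ac")           # .  any one char
zero_or_more = re.findall(r"ab*", "a ab abbb")        # *  0 or more
one_or_more = re.findall(r"ab+", "a ab abbb")         # +  1 or more
optional = re.findall(r"colou?r", "color colour")     # ?  0 or 1
starts = bool(re.search(r"^Hello", "Hello, world"))   # ^  start with
ends = bool(re.search(r"world$", "Hello, world"))     # $  end with
either = re.findall(r"cat|dog", "cat dog cow")        # |  or
literal_dot = re.findall(r"\.", "a.b.c")              # \  escape

print(in_set)        # ['a', 'b', 'c']
print(grouped)       # ababab
print(counted)       # ['oo', 'oo']
print(any_char)      # ['abc', 'a-c']
print(zero_or_more)  # ['a', 'ab', 'abbb']
print(one_or_more)   # ['ab', 'abbb']
print(optional)      # ['color', 'colour']
print(starts, ends)  # True True
print(either)        # ['cat', 'dog']
print(literal_dot)   # ['.', '.']
```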

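Special:

[^ ]    # inside [], ^ means "not"
\d    # a digit
\w    # a letter, digit, or underscore
\s    # whitespace: space, \t, carriage return, newline, \f
\D    # the opposite of \d
\W   # the opposite of \w
\S   # the opposite of \s
\1    # back-reference to the first group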
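
A short Python sketch of these special forms follows. The sample text is invented for the example.

```python
import re

text = "Order_42 shipped\ton 2024-01-15"

digits = re.findall(r"\d+", text)               # runs of digits
words = re.findall(r"\w+", text)                # runs of letters, digits, underscores
chunks = re.findall(r"\S+", text)               # runs of non-whitespace
not_digits = re.findall(r"[^\d\s]+", "a1 b2")   # [^...] negates the set
repeated = re.search(r"(\w+) \1", "it is is fine")  # \1 repeats group 1

print(digits)            # ['42', '2024', '01', '15']
print(words)             # ['Order_42', 'shipped', 'on', '2024', '01', '15']
print(chunks)            # ['Order_42', 'shipped', 'on', '2024-01-15']
print(not_digits)        # ['a', 'b']
print(repeated.group())  # is is
```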

Exercises

1. What does /[a-zA-Z]+/ match?

Answer: One or more letters of the (English) Latin alphabet, e.g. hello, world, 123username (note that the username part matches even though the 123 does not)
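
This answer can be checked with a quick Python sketch (the variable names are only for illustration):

```python
import re

letters = re.compile(r"[a-zA-Z]+")

all_words = letters.findall("hello, world")
partial = letters.search("123username").group()  # only the letters match
no_match = letters.search("12345")               # no letters at all

print(all_words)  # ['hello', 'world']
print(partial)    # username
print(no_match)   # None
```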

2. What does /^[A-Za-z][a-z]*$/ match?

Answer: A whole string made up of one or more letters of the Latin alphabet, where only the first may be uppercase and the rest are lowercase, i.e. a word, e.g. Alphabet
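
A small Python check of this pattern, with sample strings chosen for illustration:

```python
import re

word = re.compile(r"^[A-Za-z][a-z]*$")

samples = ["Alphabet", "alphabet", "A", "ALPHABET", "two words", ""]
results = {s: bool(word.search(s)) for s in samples}

for sample, ok in results.items():
    print(repr(sample), ok)
```

Only the first three samples match. One Python-specific caveat: `$` also matches just before a trailing newline, so `re.fullmatch` is the stricter choice when validating input.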

3. What does /p[aeiou]{0,2}t/ match?

Answer: Up to two (English) vowels between the letters p and t, e.g. pout, carpet, pt
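
The three example words can be verified in Python. The fourth string, "pooot", is a made-up counterexample with three vowels:

```python
import re

pattern = re.compile(r"p[aeiou]{0,2}t")

two_vowels = pattern.search("pout").group()
inside_word = pattern.search("carpet").group()
no_vowels = pattern.search("pt").group()
too_many = pattern.search("pooot")  # three vowels between p and t

print(two_vowels)   # pout
print(inside_word)  # pet
print(no_vowels)    # pt
print(too_many)     # None
```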

4. What does /\s(\w+)\s\1/ match?

Answer: A whitespace character followed by one or more "word" characters (letters, digits, or underscore; memorized as the first group), followed by another (possibly different) whitespace character, followed by a copy of the first "word", e.g. " test\ttesting" (with a leading space; here \t means tab), where the match is " test\ttest"
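
A Python sketch of the back-reference at work. Note that the pattern needs a whitespace character before the first word, so the sample string starts with a space:

```python
import re

pattern = re.compile(r"\s(\w+)\s\1")

m = pattern.search(" test\ttesting")
whole = m.group()    # leading space, "test", tab, "test"
word = m.group(1)    # the memorized first group

no_leading_space = pattern.search("test\ttesting")
different_words = pattern.search(" one two")

print(repr(whole))        # ' test\ttest'
print(word)               # test
print(no_leading_space)   # None
print(different_words)    # None
```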

5. Match a price ($)

Answer: Depends on how stringently we want to handle cases like $001.230. One possible solution: /\$(0|[1-9][0-9]*)(\.\d{1,2})?/
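
A Python sketch of this price pattern, assuming the `$` and the `.` are meant literally and are therefore escaped. The sample strings are invented. It also shows why anchoring matters when validating:

```python
import re

price = re.compile(r"\$(0|[1-9][0-9]*)(\.\d{1,2})?")

# search() finds a price anywhere in the text.
found = price.search("Total: $12.50 today").group()

# Without anchoring, only the "$0" prefix of "$001.230" is matched.
partial = price.search("$001.230").group()

# fullmatch() requires the whole string to be a price.
samples = ["$12.50", "$0.99", "$1000", "$001.230", "$12.345", "12.50"]
strict = {s: bool(price.fullmatch(s)) for s in samples}

print(found)    # $12.50
print(partial)  # $0
print(strict)
```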

6. Match an Australian telephone number

Answer: /(\+61|0)\d([ -]?\d){8}/
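
A Python sketch of this pattern, assuming the `+` in `+61` is a literal plus sign and so must be escaped. The phone numbers are made up. The last two samples show limits of the pattern rather than of any real numbering rules:

```python
import re

phone = re.compile(r"(\+61|0)\d([ -]?\d){8}")

samples = [
    "0412 345 678",
    "02 9876 5432",
    "+61412345678",
    "0412-345-678",
    "+61 412 345 678",  # rejected: no separator is allowed right after +61
    "0412 345",         # rejected: too few digits
]
valid = {s: bool(phone.fullmatch(s)) for s in samples}

for number, ok in valid.items():
    print(number, ok)
```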

7. Remove HTML comments from a document

Answer: The HTML standard is a bit of a moving target. One possible solution: s/<!--.*?-->//g
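
In Python, a comparable substitution might look like the sketch below. The pattern is an assumption (everything from `<!--` to the nearest `-->`), `strip_comments` is a hypothetical helper name, and a regex like this is only a rough tool; a real HTML parser is more reliable for arbitrary documents.

```python
import re

# Assumed pattern: "<!--", then as little as possible, then "-->".
# re.DOTALL lets "." match newlines, so multi-line comments are removed too.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

def strip_comments(html: str) -> str:
    """Return html with every <!-- ... --> span removed."""
    return COMMENT.sub("", html)

html = "<p>Hello</p><!-- note --><p>World</p><!--\nmulti\nline\n-->!"
cleaned = strip_comments(html)
print(cleaned)  # <p>Hello</p><p>World</p>!
```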

8. Validate an email address

Answer: ^[a-zA-Z0-9_-]+@[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+)+$
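
A Python sketch of this check, assuming the dot between domain labels is meant literally and is therefore escaped. The addresses are invented. The pattern is deliberately simple and rejects some valid addresses, such as those with a dot in the local part:

```python
import re

email = re.compile(r"^[a-zA-Z0-9_-]+@[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+)+$")

samples = [
    "user_name@example.com",
    "user@mail.example.org",
    "user@example",            # rejected: no dot in the domain
    "first.last@example.com",  # rejected: dot in the local part
    "user@@example.com",       # rejected: two @ signs
]
valid = {s: bool(email.search(s)) for s in samples}

# With an unescaped dot, any character could stand in for the ".".
loose = re.compile(r"^[a-zA-Z0-9_-]+@[a-zA-Z0-9_-]+(.[a-zA-Z0-9_-]+)+$")
loose_accepts = bool(loose.search("user@exampleXcom"))
strict_accepts = bool(email.search("user@exampleXcom"))

print(valid)
print(loose_accepts, strict_accepts)  # True False
```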

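How to Master It

Becoming fluent with regular expressions takes a lot of practice. The method itself is simple: write many patterns and build up experience. I recommend Regex101, a website for practicing, where you can also look up the syntax.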
