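Preface
Today's post covers regular expressions. First, what is a regular expression? It can be thought of as a pattern used to match strings. Regular expressions show up in many settings, such as web scraping and text search. They are flexible and easy to pick up, but hard to use well. Most text editors support them, and the syntax differs little from one tool to the next, so they are a skill worth having.
Four main concepts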
- Matching
- Memoization
- Alternation
- Repetition
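
As a rough illustration of the four concepts, here is a minimal Python sketch using the standard `re` module. The sample strings are made up for this example, and "memoization" is read here as capturing a group and reusing it through a backreference, which is an interpretation rather than something the list above spells out.

```python
import re

# Matching: a literal pattern matches itself anywhere in the text.
matching = re.search(r"cat", "concatenate")

# Memoization: parentheses remember what they matched; \1 reuses it.
memo = re.search(r"(\w+) \1", "it is is fine")

# Alternation: | chooses between alternatives.
alternation = re.findall(r"cat|dog", "cat, dog, bird")

# Repetition: quantifiers repeat the preceding item (here "b" 2 to 3 times).
repetition = re.fullmatch(r"ab{2,3}", "abbb")

print(matching.group(0))    # cat
print(memo.group(1))        # is
print(alternation)          # ['cat', 'dog']
print(repetition.group(0))  # abbb
```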
Usage
Some commonly used syntax:
a # match 'a'
() # group
[] # character set
{} # repetition count, e.g. {2} or {0,2}
. # any one char
* # 0 or more
+ # 1 or more
? # 0 or 1
^ # start with
$ # end with
| # or
\ # escape
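
To see these operators in action, the following is a small Python sketch, assuming the standard `re` module (whose syntax for these basics matches the table above). The sample text and variable names are invented for illustration.

```python
import re

text = "color colour colouur a.b a+b"

optional = re.findall(r"colou?r", text)       # ? : zero or one "u"
zero_plus = re.findall(r"colou*r", text)      # * : zero or more "u"
one_plus = re.findall(r"colou+r", text)       # + : one or more "u"
exactly_two = re.findall(r"colou{2}r", text)  # {2} : exactly two "u"
any_char = re.findall(r"a.b", text)           # . : any one character
literal_dot = re.findall(r"a\.b", text)       # \. : an escaped, literal dot
char_set = re.findall(r"a[.+]b", text)        # [] : one character from the set
at_start = re.search(r"^color", text)         # ^ : anchored at the start
at_end = re.search(r"a\+b$", text)            # $ : anchored at the end
either = re.findall(r"color|a\+b", text)      # | : either alternative

print(optional)     # ['color', 'colour']
print(one_plus)     # ['colour', 'colouur']
print(any_char)     # ['a.b', 'a+b']
print(literal_dot)  # ['a.b']
```

Note that inside `[]` most metacharacters, including `.` and `+`, lose their special meaning, which is why `[.+]` needs no escaping.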
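Special:
[^ ] # inside [], a leading ^ means negation
\d # a digit
\w # a letter, digit, or underscore
\s # whitespace: space, \t, carriage return, newline, \f
\D # the opposite of \d
\W # the opposite of \w
\S # the opposite of \s
\1 # a backreference to the first group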
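
The character classes and the backreference can be tried out with a short Python sketch like the one below. The sample string is made up, and Python's `re` is assumed; other engines may treat Unicode digits and letters slightly differently.

```python
import re

sample = "Room 42,\tfloor_3"

digits = re.findall(r"\d+", sample)          # runs of digits
words = re.findall(r"\w+", sample)           # letters, digits, underscore
spaces = re.findall(r"\s", sample)           # a space and a tab
non_digits = re.findall(r"\D+", "a1b22c")    # everything that is not a digit
negated = re.findall(r"[^\w\s]", sample)     # neither word char nor whitespace
repeated = re.search(r"(\w)\1", "hello")     # \1 repeats the first group

print(digits)             # ['42', '3']
print(words)              # ['Room', '42', 'floor_3']
print(negated)            # [',']
print(repeated.group(0))  # ll
```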
Exercises
1. What does /[a-zA-Z]+/ match?
Answer: One or more letters of the (English) Latin alphabet, e.g. hello, world, 123username (note that the username part matches even if the 123 does not)
2. What does /^[A-Za-z][a-z]*$/ match?
Answer: A string comprised of one or more letters of the Latin alphabet, where the first may be uppercase, i.e. a word, e.g. Alphabet
3. What does /p[aeiou]{0,2}t/ match?
Answer: Up to two (English) vowels between the letters p and t, e.g. pout, carpet, pt
4. What does /\s(\w+)\s\1/ match?
Answer: A whitespace character followed by one or more "word" characters (letters, digits, or underscore, remembered as the first group), followed by another (possibly different) whitespace character, followed by a copy of the first "word", e.g. " test\ttesting" (with a leading space; here \t means tab)
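
The first four answers can be checked with a short Python sketch. The test strings are chosen for this illustration, and the Perl-style `/.../` delimiters are dropped because Python's `re` takes the pattern as a plain string.

```python
import re

# Exercise 1: finds letter runs anywhere, so "123username" yields "username".
ex1 = re.findall(r"[a-zA-Z]+", "hello 123username")

# Exercise 2: the anchors force the whole string to be a single word.
candidates = ["Alphabet", "alphabet", "ALPHABET", "two words"]
ex2 = [w for w in candidates if re.search(r"^[A-Za-z][a-z]*$", w)]

# Exercise 3: zero to two vowels between "p" and "t".
ex3 = [w for w in ["pout", "carpet", "pt", "peanut"]
       if re.search(r"p[aeiou]{0,2}t", w)]

# Exercise 4: the word after the second whitespace must start with
# a copy of the first captured word.
ex4 = re.search(r"\s(\w+)\s\1", "a test\ttesting")

print(ex1)           # ['hello', 'username']
print(ex2)           # ['Alphabet', 'alphabet']
print(ex3)           # ['pout', 'carpet', 'pt']
print(ex4.group(1))  # test
```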
5. Match a price ($)
Answer: Depends on how stringently we want to handle cases like $001.230. One possible solution: /\$(0|[1-9][0-9]*)(\.\d{1,2})?/ (the $ and the . must be escaped so that they match literally)
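
A minimal Python sketch of the price pattern follows, assuming the `$` and `.` are escaped so they are matched literally. The helper name `is_price` is hypothetical, and `fullmatch` is used on the assumption that the goal is to validate a whole string rather than find a price inside longer text.

```python
import re

# Assumed pattern: "$" and "." are escaped so they are matched literally.
PRICE = re.compile(r"\$(0|[1-9][0-9]*)(\.\d{1,2})?")


def is_price(text: str) -> bool:
    """Return True if the whole string is a price such as $0, $12 or $12.50."""
    return PRICE.fullmatch(text) is not None


for candidate in ["$0", "$12", "$12.50", "$001.230", "$1.234", "12.50"]:
    print(candidate, is_price(candidate))
```

With an unanchored search instead of `fullmatch`, `$001.230` would still produce a match on its `$0` prefix, which is why the anchoring choice matters for validation.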
6. Match an Australian telephone number
Answer: /(\+61|0)\d([ -]?\d){8}/ (the + must be escaped so that it matches a literal plus sign)
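
The telephone pattern can be exercised with a Python sketch like this one, assuming the `+` is escaped to match a literal plus sign. The helper name `is_au_phone` and the sample numbers are invented for illustration, and whole-string matching is an assumption.

```python
import re

# Assumed pattern: "+" is escaped; a prefix (+61 or 0), one digit,
# then eight more digits, each optionally preceded by a space or hyphen.
AU_PHONE = re.compile(r"(\+61|0)\d([ -]?\d){8}")


def is_au_phone(text: str) -> bool:
    """Return True if the whole string fits the pattern above."""
    return AU_PHONE.fullmatch(text) is not None


for candidate in ["0412 345 678", "+61412345678", "03-9123-4567",
                  "12345", "+61 412 345 678"]:
    print(candidate, is_au_phone(candidate))
```

The last sample is rejected because the pattern allows no separator directly after the `+61` prefix, so a stricter or looser pattern may be needed depending on the input you expect.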
7. Remove HTML comments from a document
Answer: The HTML standard is a bit of a moving target. One possible solution is a non-greedy substitution such as s/<!--.*?-->//g
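
One way to sketch comment removal in Python is shown below. The pattern `<!--.*?-->` is an assumption (a common non-greedy approach), the helper name `strip_comments` is hypothetical, and `re.DOTALL` is added so that `.` also spans newlines in multi-line comments. It is not a full HTML parser and will not handle every edge case in the standard.

```python
import re

# Assumed pattern: non-greedy so each comment ends at its own "-->".
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(html: str) -> str:
    """Remove every <!-- ... --> span, including multi-line ones."""
    return COMMENT.sub("", html)


page = "<p>keep</p><!-- drop me --><p>also\nkeep</p><!--\nmulti\nline\n-->"
print(strip_comments(page))
```

A greedy `.*` would instead swallow everything from the first `<!--` to the last `-->`, deleting the paragraph between the two comments.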
8. Validate an email address
Answer: ^[a-zA-Z0-9_-]+@[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+)+$ (the . must be escaped so that it matches a literal dot)
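
The following Python sketch tries the email pattern with the dot escaped, and contrasts it with an unescaped variant to show why the escape matters. The helper name `is_email` and the sample addresses are made up, and the pattern is a simplification that rejects some valid addresses.

```python
import re

# Assumed pattern: the dot before each domain part is escaped.
EMAIL = re.compile(r"^[a-zA-Z0-9_-]+@[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+)+$")

# Same pattern with an unescaped dot, kept only for comparison.
LOOSE = re.compile(r"^[a-zA-Z0-9_-]+@[a-zA-Z0-9_-]+(.[a-zA-Z0-9_-]+)+$")


def is_email(text: str) -> bool:
    """Return True if the string fits the simplified email pattern."""
    return EMAIL.search(text) is not None


for candidate in ["user_1@example.com", "user@mail.example.org",
                  "user@example", "first.last@example.com"]:
    print(candidate, is_email(candidate))

# The unescaped dot matches any character, so a dotless domain slips through.
print(LOOSE.search("user@example") is not None)
```

The address `first.last@example.com` is rejected because the local part in this pattern does not allow dots, one of several ways it is stricter than real-world email syntax.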
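How to master it
Mastering regular expressions takes a lot of practice. The method itself is simple: work through real examples and build up experience. Regex101 is a recommended site for practicing, and it also lets you look up the syntax.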