18. 正则表达式

[TOC]

简而言之，正则表达式是一种符号表示法，被用来识别文本模式。在某种程度上，它们与匹配文件和路径名的 shell 通配符比较相似，但其规模更庞大。

之前玩 iOS 的时候有写过一篇关于 Mac 上的正则表达式工具 Expressions 的文章：
传送门：：Expressions - 正则表达式工具@独木舟的木 ⭐️⭐️⭐️

`grep` 文本搜索工具

grep 命令之前在【Lunux 重定向】章节中有提及。它能使用正则表达式搜索文本，并把匹配的行打印出来。

语法：grep [options] regix [file...]

常用的 grep 选项	描述
-i	忽略大小写。不会区分大小写字符。也可用 --ignore-case 来指定。
-v	不匹配。通常，grep 程序会打印包含匹配项的文本行。这个选项导致 grep 程序只会输出不包含匹配项的文本行。也可用 --invert-match 来指定。
-c	打印匹配的数量（或者是不匹配的数目，若指定了 -v 选项），而不是文本行本身。也可用 --count 选项来指定。
-l	打印包含匹配项的文件名，而不是文本行本身，也可用 --files-with-matches 选项来指定。
-L	相似于 -l 选项，但是只是打印不包含匹配项的文件名。也可用 --files-without-match 来指定。
-n	在每个匹配行之前打印出其位于文件中的相应行号。也可用 --line-number 选项来指定。
-h	应用于多文件搜索，不输出文件名。也可用 --no-filename 选项来指定。

示例：

# 创建测试数据文件
$ ls /bin >dirlist-bin.txt
$ ls /usr/bin > dirlist-usr-bin.txt
$ ls /sbin > dirlist-sbin.txt
$ ls /usr/sbin > dirlist-usr-sbin.txt
$ ls -l dirlist*.txt
-rw-r--r-- 1 root root 1418 Jan 19 13:13 dirlist-bin.txt
-rw-r--r-- 1 root root 2256 Jan 19 13:13 dirlist-sbin.txt
-rw-r--r-- 1 root root 9196 Jan 19 13:13 dirlist-usr-bin.txt
-rw-r--r-- 1 root root 2162 Jan 19 13:14 dirlist-usr-sbin.txt

# 简单搜索：在所有列出的文件中搜索字符串 bzip，然后找到匹配项
$ grep bzip dirlist*.txt
dirlist-bin.txt:bzip2
dirlist-bin.txt:bzip2recover

# 只打印包含匹配项的文件名
$ grep -l bzip dirlist*.txt
dirlist-bin.txt

# 只打印不包含匹配项的文件名
$ grep -L bzip dirlist*.txt
dirlist-sbin.txt
dirlist-usr-bin.txt
dirlist-usr-sbin.txt

正则表达式元字符

元字符被用来指定更复杂的匹配项。正则表达式元字符由以下字符组成：

^ $ . [ ] { } - ? * + ( ) | \

其它所有字符都被认为是原义字符，虽然在个别情况下，反斜杠会被用来创建元序列，也允许元字符被转义为原义字符，而不是被解释为元字符。

# 1. 圆点字符（.），匹配任意字符
$ grep -h '.zip' dirlist*.txt
bunzip2
bzip2
bzip2recover
gunzip
gzip
gpg-zip
preunzip
prezip
prezip-bin

# 2. 插入符号(^)和美元符号($)被看作是锚（定位点）。这意味着正则表达式 只有在文本行的开头或末尾被找到时，才算发生一次匹配。
$ grep -h '^zip' dirlist*.txt
zip
zipdetails

$ grep -h 'zip$' dirlist*.txt
gunzip
gzip
gpg-zip
preunzip
prezip
zip

$ grep -h '^zip$' dirlist*.txt
zip

# 寻找第三个字是 j，第五个字是 r，的英文单词。
$ grep -i '^..j.r$' /usr/share/dict/words

中括号表达式和字符类

中括号表达式能够从一个指定的字符集合中匹配一个单个的字符。

# 匹配包含字符串“bzip”或者“gzip”的任意行：
$ grep -h '[bg]zip' dirlist-*.txt
bzip2
bzip2recover
gzip

# zip 的前一个字符是除了 b 和 g 之外的任意字符：
$ grep -h '[^bg]zip' dirlist-*.txt
bunzip2
gunzip
gpg-zip
preunzip
prezip
prezip-bin

传统的字符区域

# 描述大写字母 A-Z：
$ grep -h '^[ABCDEFGHIJKLMNOPQRSTUVWXYZ]' dirlist-*.txt
MAKEDEV
NF
VGAuthService
X11
ModemManager
NetworkManager

# 同上
$ grep -h '^[A-Z]' dirlist-*.txt
MAKEDEV
NF
VGAuthService
X11
ModemManager
NetworkManager

$ grep -h '^[A-Za-z0-9]' dirlist*.txt

POSIX 字符集

$ ls /usr/sbin/[ABCDEFGHIJKLMNOPQRSTUVWXYZ]*
/usr/sbin/ModemManager  /usr/sbin/NetworkManager

# 但是，通过这个命令我们得到了不同的结果，WHY？？？
$ ls /usr/sbin/[A-Z]*
/usr/sbin/bcache-super-show            /usr/sbin/locale-gen               /usr/sbin/update-ca-certificates
/usr/sbin/biosdecode                   /usr/sbin/logrotate                /usr/sbin/update-catalog
/usr/sbin/chat                         /usr/sbin/luksformat               
...
...

Unix 系统 ASCII 字符集排序规则：ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz。
不同于正常的字典顺序：aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ。
POSIX 标准介绍了一种叫做 locale 的概念，其可以被调整，来为某个特殊的区域，选择所需的字符集。

因此，[A-Z] 字符区域按照字典顺序解释的时候，包含除了小写字母 a 之外的所有字母。

POSIX 字符集	说明
[:alnum:]	字母数字字符。在 ASCII 中，等价于：[A-Za-z0-9]
[:word:]	与 [:alnum:] 相同, 但增加了下划线字符。
[:alpha:]	字母字符。在 ASCII 中，等价于：[A-Za-z]
[:blank:]	包含空格和 tab 字符。
[:cntrl:]	ASCII 的控制码。包含了 0 到 31，和 127 的 ASCII 字符。
[:digit:]	数字 0 到 9
[:graph:]	可视字符。在 ASCII 中，它包含 33 到 126 的字符。
[:lower:]	小写字母。
[:punct:]	标点符号字符。
[:print:]	可打印的字符。在 [:graph:] 中的所有字符，再加上空格字符。
[:space:]	空白字符，包括空格，tab，回车，换行，vertical tab, 和 form feed. 在 ASCII 中，等价于：[\t\r\n\v\f]
[:upper:]	大写字母。
[:xdigit:]	用来表示十六进制数字的字符。在 ASCII 中，等价于：[0-9A-Fa-f]

改进示例：

$ ls /usr/sbin/[[:upper:]]*
/usr/sbin/ModemManager  /usr/sbin/NetworkManager

基本正则表达式&扩展正则表达式

POSIX 把正则表达式的实现分成了两类：

基本正则表达式（BRE）；
扩展的正则表达式（ERE）；

grep 命令加上 -E 参数，表示支持扩展特性。

$ echo "AAA" | grep -E 'AAA|BBB'
AAA
$ echo "BBB" | grep -E 'AAA|BBB'
BBB
$ echo "CCC" | grep -E 'AAA|BBB'

限定符

？- 匹配一个元素零次或一次。
- - 匹配一个元素一次或多次。
{} - 匹配一个元素特定的次数。

限定符	含义
{n}	匹配前面的元素，如果它确切地出现了 n 次。
{n,m}	匹配前面的元素，如果它至少出现了 n 次，但是不多于 m 次。
{n,}	匹配前面的元素，如果它出现了 n 次或多于 n 次。
{,m}	匹配前面的元素，如果它出现的次数不多于 m 次。

用 `find` 查找丑陋的文件名

find 命令要求路径名精确地匹配正则表达式。

# 查找包含空格和其它潜在不规范字符的路径名：
$ find . -regex '.*[^-\_./0-9a-zA-Z].*'

用 locate 查找文件

locate 程序支持基本的（--regexp 选项）和扩展的（--regex 选项）正则表达式。

$ locate --regex '/bin/(bz|gz|zip)'
/bin/bzcat
/bin/bzcmp
/bin/bzdiff
/bin/bzegrep
/bin/bzexe
/bin/bzfgrep
/bin/bzgrep
/bin/bzip2
/bin/bzip2recover
/bin/bzless
/bin/bzmore
/bin/gzexe
/bin/gzip
/usr/bin/zipdetails
/usr/lib/klibc/bin/gzip

在 less 和 vim 中查找文本

less 和 vim 两者享有相同的文本查找方法。按下 / 按键，然后输入正则表达式，来执行搜索任务。