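Regular expressions are a powerful pattern-matching facility available in many languages, such as C and Python, and the newer, fashionable Swift is no exception. After this tutorial you will have a feel for how much power regular expressions give to the people who use them.
This tutorial first introduces the matching patterns available in Swift, illustrated with examples; it then covers NSRegularExpression, the class Apple provides and the one we will use; finally, it ties everything together with a more complex example. Besides regular expressions, the tutorial also touches on error handling, closures, and reading and writing files. If you spot omissions or outright mistakes, corrections are welcome.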
Part One: Regular Expressions in Swift
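The idea behind regular expressions is simple: given a pattern (a String that describes what to match), check whether a target String satisfies that pattern; if it does, you can retrieve the part that matched.
For example, `apple` is a pattern. It matches Strings such as `apple tree` and `I love apples.`, and in both cases the matched result is `apple`.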
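Beyond literal text, regular expressions support special symbols that stand for unspecified values. For example, `d.g` matches `dog`, `dig`, `dag`, and similar Strings, and this is what makes regular expressions so powerful.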
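The two patterns above can be tried out directly. The following is a minimal sketch that assumes Foundation's `range(of:options:)` with `.regularExpression` rather than the NSRegularExpression class covered in Part Two; the helper name `firstMatch` is hypothetical and exists only for this illustration.

```swift
import Foundation

// Hypothetical helper: returns the first substring of `text` that matches
// `pattern`, or nil when nothing matches.
func firstMatch(of pattern: String, in text: String) -> String? {
    guard let range = text.range(of: pattern, options: .regularExpression) else {
        return nil
    }
    return String(text[range])
}

// A literal pattern matches wherever that text occurs.
let fruit = firstMatch(of: "apple", in: "I love apples.")

// "." stands for any single character, so "d.g" needs exactly one
// character between "d" and "g"; "dg" therefore does not match.
let samples = ["dog", "dig", "dag", "dg"]
let results = samples.map { firstMatch(of: "d.g", in: $0) }

print(fruit ?? "no match")
print(results)
```

Here `fruit` holds `apple`, and the last element of `results` is nil because `dg` has nothing for the dot to match.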
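These patterns follow their own set of rules. The rules are largely shared across languages, with minor variations from one language to another. A pattern consists of ordinary characters (for example, the letters a to z) and special characters, called "metacharacters". The table below lists the character expressions among the metacharacters supported in Swift, taken from the official documentation.
Character expression | Description | Notes |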
---|---|---|
\a | Match a BELL, \u0007 | |
\A | Match at the beginning of the input. Differs from ^ in that \A will not match after a new line within the input. | Always matches the very start of the input; unlike ^, this does not change when the anchorsMatchLines option is set. |
\b, outside of a [Set] | Match if the current position is a word boundary. Boundaries occur at the transitions between word (\w) and non-word (\W) characters, with combining marks ignored. | A hyphen is a non-word character (\W). |
\b, within a [Set] | Match a BACKSPACE, \u0008. | Backspace |
\B | Match if the current position is not a word boundary. | |
\cX | Match a control-X character | |
\d | Match any character with the Unicode General Category of Nd (Number, Decimal Digit.) | Matches decimal digits, including digits from other Unicode scripts, not only 0-9. |
\D | Match any character that is not a decimal digit. | |
\e | Match an ESCAPE, \u001B. | |
\E | Terminates a \Q ... \E quoted sequence. | |
\f | Match a FORM FEED, \u000C. | Form feed (page break) |
\G | Match if the current position is at the end of the previous match. | |
\n | Match a LINE FEED, \u000A. | Line feed (newline) |
\N{UNICODE CHARACTER NAME} | Match the named character. | |
\p{UNICODE PROPERTY NAME} | Match any character with the specified Unicode Property. | See the Unicode documentation for the full list of properties. |
\P{UNICODE PROPERTY NAME} | Match any character not having the specified Unicode Property. | |
\Q | Quotes all following characters until \E. | |
\r | Match a CARRIAGE RETURN, \u000D. | Carriage return |
\s | Match a white space character. White space is defined as [\t\n\f\r\p{Z}]. | \p{Z} covers the Unicode line separator, paragraph separator, and space separator categories. |
\S | Match a non-white space character. | |
\t | Match a HORIZONTAL TABULATION, \u0009. | Horizontal tab |
\uhhhh | Match the character with the hex value hhhh. | |
\Uhhhhhhhh | Match the character with the hex value hhhhhhhh. Exactly eight hex digits must be provided, even though the largest Unicode code point is \U0010ffff. | All eight hex digits (a full 32-bit value) are required. |
\w | Match a word character. Word characters are [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]. | |
\W | Match a non-word character. | |
\x{hhhh} | Match the character with hex value hhhh. From one to six hex digits may be supplied. | |
\xhh | Match the character with two digit hex value hh. | |
\X | Match a Grapheme Cluster. | A grapheme cluster is one user-perceived character, possibly made of several code points. |
\Z | Match if the current position is at the end of input, but before the final line terminator, if one exists. | |
\z | Match if the current position is at the end of input. | |
\n | Back Reference. Match whatever the nth capturing group matched. n must be a number ≥ 1 and ≤ total number of capture groups in the pattern. | Here n stands for a number, the index of the capture group (subexpression), not the letter n. |
\0ooo | Match an Octal character. ooo is from one to three octal digits. 0377 is the largest allowed Octal character. The leading zero is required; it distinguishes Octal constants from back references. | |
[pattern] | Match any one character from the pattern. | Square brackets match exactly one of the enclosed characters. |
. | Match any character. | Matches line terminators only when the dotMatchesLineSeparators option is set. |
^ | Match at the beginning of a line. | |
$ | Match at the end of a line. | |
\ | Quotes the following character. Characters that must be quoted to be treated as literals are * ? + [ ( ) { } ^ $ \| \ . / | |
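
A few of the entries above are easier to tell apart by counting matches. The sketch below assumes a hypothetical helper named `countMatches`, and the sample text is made up for illustration. It compares `^` with `\A` under the anchorsMatchLines option and also exercises `\d` and `\b`.

```swift
import Foundation

// Hypothetical helper: counts how often `pattern` matches in `text`
// under the given options. Returns 0 if the pattern fails to compile.
func countMatches(_ pattern: String,
                  in text: String,
                  options: NSRegularExpression.Options = []) -> Int {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: options) else {
        return 0
    }
    let whole = NSRange(text.startIndex..., in: text)
    return regex.numberOfMatches(in: text, options: [], range: whole)
}

let text = "one 1\ntwo 2\nthree 3"

// By default ^ only matches at the start of the input.
let caretDefault = countMatches("^\\w+", in: text)                              // 1
// With anchorsMatchLines, ^ also matches after each line break.
let caretPerLine = countMatches("^\\w+", in: text, options: .anchorsMatchLines) // 3
// \A keeps matching only at the start of the input, whatever the option.
let inputStart = countMatches("\\A\\w+", in: text, options: .anchorsMatchLines) // 1

// \d matches each decimal digit; \b marks word boundaries.
let digits = countMatches("\\d", in: text)                                      // 3
// The hyphen is a non-word character, so "known" is bounded on both sides.
let afterHyphen = countMatches("\\bknown\\b", in: "well-known")                 // 1

print(caretDefault, caretPerLine, inputStart, digits, afterHyphen)
```

Note that each backslash is doubled inside an ordinary Swift string literal, so the pattern `\d` is written as `"\\d"`.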
The table below lists the operators among the metacharacters supported in Swift.
Operator | Description | Notes |
---|---|---|
\| | Alternation. A\|B matches either A or B. | |
* | Match 0 or more times. Match as many times as possible. | |
+ | Match 1 or more times. Match as many times as possible. | |
? | Match zero or one times. Prefer one. | |
{n} | Match exactly n times. | |
{n,} | Match at least n times. Match as many times as possible. | |
{n,m} | Match between n and m times. Match as many times as possible, but not more than m. | |
*? | Match 0 or more times. Match as few times as possible. | |
+? | Match 1 or more times. Match as few times as possible. | |
?? | Match zero or one times. Prefer zero. | |
{n}? | Match exactly n times. | |
{n,}? | Match at least n times, but no more than required for an overall pattern match. | |
{n,m}? | Match between n and m times. Match as few times as possible, but not less than n. | |
*+ | Match 0 or more times. Match as many times as possible when first encountered, do not retry with fewer even if overall match fails (Possessive Match). | |
++ | Match 1 or more times. Possessive match. | |
?+ | Match zero or one times. Possessive match. | |
{n}+ | Match exactly n times. | |
{n,}+ | Match at least n times. Possessive Match. | |
{n,m}+ | Match between n and m times. Possessive Match. | |
(...) | Capturing parentheses. Range of input that matched the parenthesized subexpression is available after the match. | |
(?:...) | Non-capturing parentheses. Groups the included pattern, but does not provide capturing of matching text. Somewhat more efficient than capturing parentheses. | |
(?>...) | Atomic-match parentheses. First match of the parenthesized subexpression is the only one tried; if it does not lead to an overall pattern match, back up the search for a match to a position before the "(?>" | |
(?# ... ) | Free-format comment (?# comment ). | |
(?= ... ) | Look-ahead assertion. True if the parenthesized pattern matches at the current input position, but does not advance the input position. | |
(?! ... ) | Negative look-ahead assertion. True if the parenthesized pattern does not match at the current input position. Does not advance the input position. | |
(?<= ... ) | Look-behind assertion. True if the parenthesized pattern matches text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators.) | |
(?<! ... ) | Negative Look-behind assertion. True if the parenthesized pattern does not match text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators.) | |
(?ismwx-ismwx:... ) | Flag settings. Evaluate the parenthesized expression with the specified flags enabled or -disabled. The flags are defined in Flag Options. | |
(?ismwx-ismwx) | Flag settings. Change the flag settings. Changes apply to the portion of the pattern following the setting. For example, (?i) changes to a case insensitive match. The flags are defined in Flag Options. | |
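
To make some of these operators concrete, here is a small sketch contrasting greedy and lazy quantifiers, a capture group, and a look-ahead. The helper `allMatches` and the sample markup are hypothetical, written only for this illustration.

```swift
import Foundation

// Hypothetical helper: returns every match of `pattern` in `text` as a String.
// `group` selects a capture group; 0 means the whole match.
func allMatches(_ pattern: String, in text: String, group: Int = 0) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: []) else {
        return []
    }
    let whole = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, options: [], range: whole).compactMap { match in
        // range(at:) gives the NSRange of one capture group;
        // Range(_:in:) converts it back to String indices.
        Range(match.range(at: group), in: text).map { String(text[$0]) }
    }
}

let markup = "<b>bold</b> and <i>italic</i>"

// Greedy: .+ grabs as much as possible, so there is a single match
// running from the first "<" to the last ">".
let greedy = allMatches("<.+>", in: markup)

// Lazy: .+? grabs as little as possible, so each tag matches on its own.
let lazy = allMatches("<.+?>", in: markup)

// Capturing parentheses: group 1 holds just the tag name.
let tagNames = allMatches("<(\\w+)>", in: markup, group: 1)

// Look-ahead: match a word only if "</i>" follows, without consuming it.
let beforeClosingI = allMatches("\\w+(?=</i>)", in: markup)

print(greedy)
print(lazy)
print(tagNames)
print(beforeClosingI)
```

With this input, `greedy` contains the entire string as one match, `lazy` contains the four tags, `tagNames` is `["b", "i"]`, and `beforeClosingI` is `["italic"]`.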
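If you would rather not struggle through English documentation, the regular expression tutorial on Runoob (菜鸟教程) is a good place to start; to learn Swift regular expressions in depth, though, you will still need to consult the official documentation.
Part Two: The NSRegularExpression Class
An example explains this best. Given the String below, we look for every word that matches f[a-z]llow: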
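import Foundation

let sentence = "I'd like to follow my fellow to the fallow to see a hallow harrow."
do {
    // [a-z] means this position may be any single letter from a to z
    let regex = try NSRegularExpression(pattern: "f[a-z]llow", options: [])
    // NSRange is measured in UTF-16 code units, so use utf16.count rather than count
    let fullRange = NSRange(location: 0, length: sentence.utf16.count)
    // matches is an array of NSTextCheckingResult
    let matches = regex.matches(in: sentence, options: [], range: fullRange)
    print("\(matches.count) matches.")
} catch {
    // Thrown when the pattern is not a valid regular expression
    print(error.localizedDescription)
}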
Output:
3 matches.
How do we get the actual matched strings out of matches? Read the range property of each NSTextCheckingResult and apply that range to the original sentence.
...
let matches = ...
print(...)
for (i, match) in matches.enumerated() {
    let substring = (sentence as NSString).substring(with: match.range)
    print("\(i) is " + substring + ".")
}
...
Output:
3 matches.
0 is follow.
1 is fellow.
2 is fallow.
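Casting to NSString works, but the range can also be converted back to String indices with `Range(_:in:)`. The sketch below uses a made-up sentence containing an emoji to show why the search range should be measured in UTF-16 code units: the emoji is one Character but two UTF-16 units, so a length taken from `count` stops one unit short of the end.

```swift
import Foundation

let sentence = "🍎 follow the fellow to the fallow"
var words: [String] = []
var truncatedCount = 0

do {
    let regex = try NSRegularExpression(pattern: "f[a-z]llow", options: [])

    // Built from the String itself, so the length is in UTF-16 code units.
    let whole = NSRange(sentence.startIndex..., in: sentence)
    for match in regex.matches(in: sentence, options: [], range: whole) {
        // Range(_:in:) returns nil if the NSRange does not map onto valid String indices.
        if let range = Range(match.range, in: sentence) {
            words.append(String(sentence[range]))
        }
    }

    // For comparison: a length based on the Character count is one unit
    // short here, so the final "w" falls outside the searched range.
    let tooShort = NSRange(location: 0, length: sentence.count)
    truncatedCount = regex.numberOfMatches(in: sentence, options: [], range: tooShort)
} catch {
    print(error.localizedDescription)
}

print(words)
print(truncatedCount)
```

With the full range, all three words are found. With the shorter range, the trailing `fallow` is cut off and only two matches remain. For a pure ASCII sentence the two lengths happen to coincide, which is why the difference is easy to miss.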
You can also iterate over the matches with a closure:
// Handle each match directly as it is found
regex.enumerateMatches(in: sentence, options: [], range: NSRange(location: 0, length: sentence.utf16.count), using: { result, _, _ in
    // result can be nil when the closure is called without a match
    guard let result = result else { return }
    let substring = (sentence as NSString).substring(with: result.range)
    print(substring)
})
Output:
follow
fellow
fallow
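
The closure's third parameter, ignored with `_` above, is a pointer that lets you end the enumeration early. The following is a minimal sketch that collects only the first two matches; the variable name `firstTwo` is just for illustration.

```swift
import Foundation

let sentence = "I'd like to follow my fellow to the fallow to see a hallow harrow."
var firstTwo: [String] = []

do {
    let regex = try NSRegularExpression(pattern: "f[a-z]llow", options: [])
    let whole = NSRange(sentence.startIndex..., in: sentence)

    regex.enumerateMatches(in: sentence, options: [], range: whole) { result, _, stop in
        guard let result = result,
              let range = Range(result.range, in: sentence) else { return }
        firstTwo.append(String(sentence[range]))
        if firstTwo.count == 2 {
            // Setting the pointee to true asks the enumeration to stop.
            stop.pointee = true
        }
    }
} catch {
    print(error.localizedDescription)
}

print(firstTwo)
```

Stopping early avoids scanning the rest of a long input when only the first few matches are needed; here `firstTwo` ends up as `["follow", "fellow"]`.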