[JS] How Regular Expressions Backtrack

1. Background

Backtracking occurs when a regular expression pattern contains optional quantifiers or alternation constructs, and the regular expression engine returns to a previous saved state to continue its search for a match.

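In other words, backtracking arises in patterns that contain optional quantifiers or alternation constructs.
In those cases the regular expression engine goes back to a state it saved earlier
and continues matching from there.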
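
As a small illustration of both constructs, the sketch below shows one pattern where a quantifier has to give back a character and one where the first alternative has to be abandoned. The patterns, inputs, and variable names are made up for this example and do not come from the referenced article.

```javascript
// Optional quantifier: \d+ first consumes all of "1000", then gives back
// one character so that the literal 0 in the pattern can match.
const quantifierMatch = /\d+0/.exec('1000');
console.log(quantifierMatch[0]); // prints "1000"

// Alternation: "foo" is tried first. When "baz" does not follow it,
// the engine returns to the saved state and tries "foobar" instead.
const alternationMatch = /(?:foo|foobar)baz/.exec('foobarbaz');
console.log(alternationMatch[0]); // prints "foobarbaz"
```

In both cases the final match is only found because the engine was willing to undo part of what it had already matched.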

2. Linear-time comparison without backtracking

If a regular expression pattern has no optional quantifiers or alternation constructs, the regular expression engine executes in linear time. That is, after the regular expression engine matches the first language element in the pattern with text in the input string, it tries to match the next language element in the pattern with the next character or group of characters in the input string. This continues until the match either succeeds or fails. In either case, the regular expression engine advances by one character at a time in the input string.

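A regular expression that contains no optional quantifiers or alternation constructs
is matched in linear time.

The engine takes the first element of the pattern and matches it against the input string,
then takes the next element of the pattern and matches it against the next character (or group of characters) in the input.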

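For example:

/e{2}\w\b/.exec('needing a reed');

Although the pattern contains {2}, that is a fixed quantifier rather than an optional one, so the engine saves no alternative states to return to and does not backtrack. When an attempt fails, it simply starts again at the next character of the input.
The whole matching process is as follows: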

Step	Position in pattern	Position in string	Result
1	e	"needing a reed"	No match
2	e	"eeding a reed"	Possible match
3	{2}	"eding a reed"	Possible match
4	\w	"ding a reed"	Possible match
5	\b	"ing a reed"	Possible match fails
6	e	"eding a reed"	Possible match
7	{2}	"ding a reed"	Possible match fails
8	e	"ding a reed"	No match
9	e	"ing a reed"	No match
10	e	"ng a reed"	No match
11	e	"g a reed"	No match
12	e	" a reed"	No match
13	e	"a reed"	No match
14	e	" reed"	No match
15	e	"reed"	No match
16	e	"eed"	Possible match
17	{2}	"ed"	Possible match
18	\w	"d"	Possible match
19	\b	""	Match
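
The outcome of the trace above can be checked directly. This short sketch only inspects the result of the same exec call; the variable name is arbitrary.

```javascript
// Same pattern and input as in the walkthrough above.
const linearMatch = /e{2}\w\b/.exec('needing a reed');

console.log(linearMatch[0]);    // prints "eed"
console.log(linearMatch.index); // prints 11

// "eedi" at index 1 is rejected because \b does not hold between "i" and "n",
// which corresponds to step 5 in the table.
```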

3. Backtracking caused by optional quantifiers or alternation constructs

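Many regular expression engines, JavaScript's included, are based on a nondeterministic finite automaton (NFA).
The difference from a deterministic finite automaton (DFA) is what drives the match:
a DFA is driven by the input string, while an NFA is driven by the elements of the pattern.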

Therefore, the regular expression engine tries to fully match optional or alternative subexpressions. When it advances to the next language element in the subexpression and the match is unsuccessful, the regular expression engine can abandon a portion of its successful match and return to an earlier saved state in the interest of matching the regular expression as a whole with the input string. This process of returning to a previous saved state to find a match is known as backtracking.

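The engine takes the elements of the pattern one at a time and matches them against the input string.
When the current element cannot be matched against the string,
the engine returns to an earlier saved state and tries a different possibility from there. This is called backtracking.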

/.*(es)/.exec('Essential services are provided by regular expressions.');

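Step	Position in pattern	Position in string	Result
1	.	'Essential services are provided by regular expressions.'	Possible match
2	*	''	Possible match
3	e	''	Possible match fails
4	*	'.'	Possible match
5	e	'.'	Possible match fails
6	*	's.'	Possible match
7	e	's.'	Possible match fails
8	*	'ns.'	Possible match
9	e	'ns.'	Possible match fails
10	*	'ons.'	Possible match
11	e	'ons.'	Possible match fails
12	*	'ions.'	Possible match
13	e	'ions.'	Possible match fails
14	*	'sions.'	Possible match
15	e	'sions.'	Possible match fails
16	*	'ssions.'	Possible match
17	e	'ssions.'	Possible match fails
18	*	'essions.'	Possible match
19	e	'essions.'	Possible match
20	s	'ssions.'	Match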

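First, the engine matches .* (a greedy quantifier) against the entire string.
It then tries to match the next element of the pattern, e.
At this point no characters are left in the input, so the attempt fails.

The engine then backtracks to the most recent saved state,
in which .* has matched Essential services are provided by regular expressions
and the single character . is still unmatched, and tries to match e again.

That attempt fails too, so the engine keeps backtracking one character at a time
until .* has matched Essential services are provided by regular expr
and essions. remains.

From there e and s both match, and the match is complete.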
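
The effect of the greedy .* can be observed in the match result. The sketch below runs the same pattern and, for contrast, a lazy variant (.*?) that is not part of the original example; the variable names are arbitrary.

```javascript
const sentence = 'Essential services are provided by regular expressions.';

// Greedy: .* grabs everything, then backs off until "es" can match,
// so the match ends at the LAST "es" in the input.
const greedy = /.*(es)/.exec(sentence);
console.log(greedy[0]); // prints "Essential services are provided by regular expres"
console.log(greedy[1]); // prints "es"

// Lazy (added for comparison): .*? grows one character at a time,
// so the match ends at the FIRST lowercase "es", at the end of "services".
const lazy = /.*?(es)/.exec(sentence);
console.log(lazy[0]); // prints "Essential services"
```

Both patterns find a match, but they reach it from opposite directions: the greedy form gives characters back, while the lazy form takes characters on.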

4. Exponential-time comparison with nested optional quantifiers

/^(a+)+$/.exec('aaaaa!');

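The table lists every point at which the engine tests $, in the order a backtracking engine reaches them. Each row shows how the repetitions of the group (a+) have divided up the a characters at that moment.

Step	Groups matched by (a+)+	Position in string	Result
1	(aaaaa)	'!'	Possible match fails
2	(aaaa)(a)	'!'	Possible match fails
3	(aaaa)	'a!'	Possible match fails
4	(aaa)(aa)	'!'	Possible match fails
5	(aaa)(a)(a)	'!'	Possible match fails
6	(aaa)(a)	'a!'	Possible match fails
7	(aaa)	'aa!'	Possible match fails
8	(aa)(aaa)	'!'	Possible match fails
9	(aa)(aa)(a)	'!'	Possible match fails
10	(aa)(aa)	'a!'	Possible match fails
11	(aa)(a)(aa)	'!'	Possible match fails
12	(aa)(a)(a)(a)	'!'	Possible match fails
13	(aa)(a)(a)	'a!'	Possible match fails
14	(aa)(a)	'aa!'	Possible match fails
15	(aa)	'aaa!'	Possible match fails
16	(a)(aaaa)	'!'	Possible match fails
17	(a)(aaa)(a)	'!'	Possible match fails
18	(a)(aaa)	'a!'	Possible match fails
19	(a)(aa)(aa)	'!'	Possible match fails
20	(a)(aa)(a)(a)	'!'	Possible match fails
21	(a)(aa)(a)	'a!'	Possible match fails
22	(a)(aa)	'aa!'	Possible match fails
23	(a)(a)(aaa)	'!'	Possible match fails
24	(a)(a)(aa)(a)	'!'	Possible match fails
25	(a)(a)(aa)	'a!'	Possible match fails
26	(a)(a)(a)(aa)	'!'	Possible match fails
27	(a)(a)(a)(a)(a)	'!'	Possible match fails
28	(a)(a)(a)(a)	'a!'	Possible match fails
29	(a)(a)(a)	'aa!'	Possible match fails
30	(a)(a)	'aaa!'	Possible match fails
31	(a)	'aaaa!'	Match fails

The engine first matches ^ at the start of the string.
The inner a+ is greedy and consumes all five a characters.
The outer + then tries to repeat the group, but no a is left, so the group has matched exactly once.
The remaining input is !, which $ (end of string) cannot match, so the attempt fails.

The engine then backtracks. The outer + is already at its minimum of one repetition,
so the engine returns to the inner a+ and makes it give back one character, leaving it with four a characters.
The outer + now repeats the group, and the second repetition consumes the remaining a: (aaaa)(a)!.
$ fails again. The second repetition cannot shrink below one a, so it is dropped, and $ is tested against a!, which also fails.

This continues, and as the first group shrinks, the number of ways to divide the remaining a characters among later repetitions grows.
With a first group of aa, the engine tries (aa)(aaa), (aa)(aa)(a), (aa)(aa), (aa)(a)(aa), (aa)(a)(a)(a), (aa)(a)(a), (aa)(a),
and finally (aa) on its own. Each repetition of the group chooses its own length; the repetitions do not have to be equal.

With a first group of a single a, the same process runs over the four remaining a characters,
from (a)(aaaa)! down through (a)(a)(a)(a)(a)! to (a) followed by aaaa!.

In total the engine tests $ after every way of splitting every leading run of a characters into consecutive groups:
1 + 2 + 4 + 8 + 16 = 31 attempts for five a characters. Only when the last one has failed are all possibilities exhausted, and the overall match fails.
Each additional a roughly doubles the work, which is why the running time is exponential in the length of the input.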
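
The count of attempts can be reproduced with a small model. The function below is a hypothetical helper written for this article, not part of any regex engine: it mimics the order in which a naive backtracking matcher would try group lengths for /^(a+)+$/ on n a characters followed by !, and counts how often $ gets tested. Real engines may organize the work differently, so treat it as a sketch of the growth rate only.

```javascript
// Hypothetical model: counts how many times `$` is tested (and fails) when a
// naive backtracking matcher runs /^(a+)+$/ on `remaining` 'a's followed by '!'.
function countEndAnchorAttempts(remaining) {
  let attempts = 0;
  // Greedy inner a+: try the longest group first, then shorter ones.
  for (let len = remaining; len >= 1; len--) {
    const rest = remaining - len;
    // Greedy outer +: first explore further repetitions of the group.
    if (rest > 0) {
      attempts += countEndAnchorAttempts(rest);
    }
    // Then test `$` right after this group; it always fails because of '!'.
    attempts += 1;
  }
  return attempts;
}

console.log(countEndAnchorAttempts(5));  // prints 31
console.log(countEndAnchorAttempts(10)); // prints 1023
```

The model returns 2^n - 1 for n a characters, so five more characters multiply the number of failed attempts by about 32.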

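5. An example that drives the CPU to 100%

Backtracking can consume a large amount of CPU time.
The following example can push CPU usage to 100% while it runs: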
console.time('backtracking');
/^((((((.*).)*.)*.)*.)*.)*x$/.exec('123456789012345!');
console.timeEnd('backtracking');

backtracking: 18457.3369140625ms
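
A single timing does not show how quickly the cost grows. The sketch below times the simpler pattern /^(a+)+$/ on inputs of increasing length; the helper name, the chosen lengths, and the use of Date.now() are all choices made for this illustration. On a backtracking engine the time should roughly double with each extra a, but the actual numbers depend on the engine and the machine, so none are shown here.

```javascript
// Times /^(a+)+$/ against n 'a's followed by '!'. The match always fails,
// so the engine has to exhaust every possibility before giving up.
function timeFailingMatch(n) {
  const input = 'a'.repeat(n) + '!';
  const start = Date.now();
  const result = /^(a+)+$/.exec(input);
  return { n, matched: result !== null, ms: Date.now() - start };
}

// Lengths are kept small on purpose so the script finishes quickly.
const timings = [];
for (let n = 14; n <= 22; n += 2) {
  timings.push(timeFailingMatch(n));
}
console.table(timings);
```

Raising the upper bound by a few characters at a time is a safer way to explore this than jumping straight to a long input.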

References

Backtracking in Regular Expressions