glibc中的strstr的two-way算法,two-way算法主要依据Critical Factorization理论。
要理解Critical Factorization理论,先要理解字符串的period:
设w是定义在字符集A上的非空字符串。设|w|是w的长度。存在正整数p,对所有满足模p同余的
非负整数i,j (i,j < |x|),等式:
w[i] = w[j]
都成立。则p称为字符串w的period。w最小的period记作p(w)。
若存在正整数wp使等式:
w[i] = w[wp+i]
对所有使等式两边有意义的i都成立,那么wp称为字符串w的weak period。
若正整数lp称为非空字符串w在位置l的local period,则有定义在A上的字符串u, v, x,使
w=uv; |u|=l+1;|x|=lp;并使字符串r,s满足下列条件之一:
1. u = rx && v = xs
2. x = ru && v = xs (r不是空字符串)
3. u = rx && x = vs (s不是空字符串)
4. x = ru && x = vs (r,s不是空字符串)
字符串w在位置l的最小local period记作p(w, l)。若此时p(w, l)=p(w),则字符串对(u, v)
称为w的Critical Factorization, l称为critical position。
two-way算法的第一步就是找到Critical Factorization。使用maximal suffix方法,依据如下理论:
字符串w的子字符串记作w[i..e)=w[i]w[i+1]...w[e-1]。w的最大后缀定义为:
存在整数i(i >= 0 && i < |w|),使得对字符串w有意义的所有整数j满足下式:
u[j..|w|) <= u[i..|w|) (4)
字符串的逆向最大后缀的定义与上述定义相似,只是将(4)式改成:
u[j..|w|) >= u[i..|w|) (5)
可以证明字符串w(|w| >= 2)至少有一个critical factorization,且l < p。此外设v是w的
maximal suffix,且w = uv。设m是w的tilde maximal suffix,且w = nm。如果v <= m那么
(n,m)是w的是critical factorization。如果v > m那么(u, v)是w的critical factorization。
Boyer and Moore的BM算法
对齐后从匹配串后面开始匹配,不匹配时的移位算法:
1. 坏字符规则:源串中当前比较的字符不相等且不出现在匹配串中,则把匹配串移位到坏字符的背后
2. 好字符规则: 后移位数 = 好后缀的位置 - 搜索词中的上一次出现位置, 然后继续从后开始匹配
举例来说,如果字符串"ABCDAB"的后一个"AB"是"好后缀"。那么它的位置是5(从0开始计算,取最后的"B"的值),在"搜索词中的上一次出现位置"是1(第一个"B"的位置),所以后移 5 - 1 = 4位,前一个"AB"移到后一个"AB"的位置。
再举一个例子,如果字符串"ABCDEF"的"EF"是好后缀,则"EF"的位置是5 ,上一次出现的位置是 -1(即未出现),所以后移 5 - (-1) = 6位,即整个字符串移到"F"的后一位。
可以看到,"坏字符规则"只能移3位,"好后缀规则"可以移6位。所以,Boyer-Moore算法的基本思想是,每次后移这两个规则之中的较大值。
更巧妙的是,这两个规则的移动位数,只与搜索词有关,与原字符串无关。因此,可以预先计算生成《坏字符规则表》和《好后缀规则表》。使用时,只要查表比较一下就可以了。
KMP算法(Knuth–Morris–Pratt 三人提出的,这几个中性能相对差)
原理:
cababababadcaddecbd查询匹配ababad
ababad 不匹配的时候,
aba 我们可以直接整块右边移动2位继续当前位置继续往下比较,避免了6次冗余操作
下面附上glibc2.6的相关代码
char * STRSTR (const char *haystack_start, const char *needle_start) { const char *haystack = haystack_start; const char *needle = needle_start; size_t needle_len; /* Length of NEEDLE. */ size_t haystack_len; /* Known minimum length of HAYSTACK. */ bool ok = true; /* True if NEEDLE is prefix of HAYSTACK. */ /* Determine length of NEEDLE, and in the process, make sure HAYSTACK is at least as long (no point processing all of a long NEEDLE if HAYSTACK is too short). */ while (*haystack && *needle) ok &= *haystack++ == *needle++; if (*needle) return NULL; if (ok) return (char *) haystack_start; /* Reduce the size of haystack using strchr, since it has a smaller linear coefficient than the Two-Way algorithm. */ needle_len = needle - needle_start; haystack = strchr (haystack_start + 1, *needle_start); if (!haystack || __builtin_expect (needle_len == 1, 0)) return (char *) haystack; needle -= needle_len; haystack_len = (haystack > haystack_start + needle_len ? 1 : needle_len + haystack_start - haystack); /* Perform the search. Abstract memory is considered to be an array of 'unsigned char' values, not an array of 'char' values. See ISO C 99 section 6.2.6.1. */ if (needle_len < LONG_NEEDLE_THRESHOLD) return two_way_short_needle ((const unsigned char *) haystack, haystack_len, (const unsigned char *) needle, needle_len); return two_way_long_needle ((const unsigned char *) haystack, haystack_len, (const unsigned char *) needle, needle_len); } static RETURN_TYPE two_way_short_needle (const unsigned char *haystack, size_t haystack_len, const unsigned char *needle, size_t needle_len) { size_t i; /* Index into current byte of NEEDLE. */ size_t j; /* Index into current window of HAYSTACK. */ size_t period; /* The period of the right half of needle. */ size_t suffix; /* The index of the right half of needle. */ /* Factor the needle into two halves, such that the left half is smaller than the global period, and the right half is periodic (with a period as large as NEEDLE_LEN - suffix). */ suffix = critical_factorization (needle, needle_len, &period); /* Perform the search. Each iteration compares the right half first. */ if (CMP_FUNC (needle, needle + period, suffix) == 0) { /* Entire needle is periodic; a mismatch can only advance by the period, so use memory to avoid rescanning known occurrences of the period. */ size_t memory = 0; j = 0; while (AVAILABLE (haystack, haystack_len, j, needle_len)) { /* Scan for matches in right half. */ i = MAX (suffix, memory); while (i < needle_len && (CANON_ELEMENT (needle[i]) == CANON_ELEMENT (haystack[i + j]))) ++i; if (needle_len <= i) { /* Scan for matches in left half. */ i = suffix - 1; while (memory < i + 1 && (CANON_ELEMENT (needle[i]) == CANON_ELEMENT (haystack[i + j]))) --i; if (i + 1 < memory + 1) return (RETURN_TYPE) (haystack + j); /* No match, so remember how many repetitions of period on the right half were scanned. */ j += period; memory = needle_len - period; } else { j += i - suffix + 1; memory = 0; } } } else { /* The two halves of needle are distinct; no extra memory is required, and any mismatch results in a maximal shift. */ period = MAX (suffix, needle_len - suffix) + 1; j = 0; while (AVAILABLE (haystack, haystack_len, j, needle_len)) { /* Scan for matches in right half. */ i = suffix; while (i < needle_len && (CANON_ELEMENT (needle[i]) == CANON_ELEMENT (haystack[i + j]))) ++i; if (needle_len <= i) { /* Scan for matches in left half. */ i = suffix - 1; while (i != SIZE_MAX && (CANON_ELEMENT (needle[i]) == CANON_ELEMENT (haystack[i + j]))) --i; if (i == SIZE_MAX) return (RETURN_TYPE) (haystack + j); j += period; } else j += i - suffix + 1; } } return NULL; } static size_t critical_factorization (const unsigned char *needle, size_t needle_len, size_t *period) { /* Index of last byte of left half, or SIZE_MAX. */ size_t max_suffix, max_suffix_rev; size_t j; /* Index into NEEDLE for current candidate suffix. */ size_t k; /* Offset into current period. */ size_t p; /* Intermediate period. */ unsigned char a, b; /* Current comparison bytes. */ /* Invariants: 0 <= j < NEEDLE_LEN - 1 -1 <= max_suffix{,_rev} < j (treating SIZE_MAX as if it were signed) min(max_suffix, max_suffix_rev) < global period of NEEDLE 1 <= p <= global period of NEEDLE p == global period of the substring NEEDLE[max_suffix{,_rev}+1...j] 1 <= k <= p */ /* Perform lexicographic search. */ max_suffix = SIZE_MAX; j = 0; k = p = 1; while (j + k < needle_len) { a = CANON_ELEMENT (needle[j + k]); b = CANON_ELEMENT (needle[max_suffix + k]); if (a < b) { /* Suffix is smaller, period is entire prefix so far. */ j += k; k = 1; p = j - max_suffix; } else if (a == b) { /* Advance through repetition of the current period. */ if (k != p) ++k; else { j += p; k = 1; } } else /* b < a */ { /* Suffix is larger, start over from current location. */ max_suffix = j++; k = p = 1; } } *period = p; /* Perform reverse lexicographic search. */ max_suffix_rev = SIZE_MAX; j = 0; k = p = 1; while (j + k < needle_len) { a = CANON_ELEMENT (needle[j + k]); b = CANON_ELEMENT (needle[max_suffix_rev + k]); if (b < a) { /* Suffix is smaller, period is entire prefix so far. */ j += k; k = 1; p = j - max_suffix_rev; } else if (a == b) { /* Advance through repetition of the current period. */ if (k != p) ++k; else { j += p; k = 1; } } else /* a < b */ { /* Suffix is larger, start over from current location. */ max_suffix_rev = j++; k = p = 1; } } /* Choose the longer suffix. Return the first byte of the right half, rather than the last byte of the left half. */ if (max_suffix_rev + 1 < max_suffix + 1) return max_suffix + 1; *period = p; return max_suffix_rev + 1; } /* Return the first location of non-empty NEEDLE within HAYSTACK, or NULL. HAYSTACK_LEN is the minimum known length of HAYSTACK. This method is optimized for LONG_NEEDLE_THRESHOLD <= NEEDLE_LEN. Performance is guaranteed to be linear, with an initialization cost of 3 * NEEDLE_LEN + (1 << CHAR_BIT) operations. If AVAILABLE does not modify HAYSTACK_LEN (as in memmem), then at most 2 * HAYSTACK_LEN - NEEDLE_LEN comparisons occur in searching, and sublinear performance O(HAYSTACK_LEN / NEEDLE_LEN) is possible. If AVAILABLE modifies HAYSTACK_LEN (as in strstr), then at most 3 * HAYSTACK_LEN - NEEDLE_LEN comparisons occur in searching, and sublinear performance is not possible. */ static RETURN_TYPE two_way_long_needle (const unsigned char *haystack, size_t haystack_len, const unsigned char *needle, size_t needle_len) { size_t i; /* Index into current byte of NEEDLE. */ size_t j; /* Index into current window of HAYSTACK. */ size_t period; /* The period of the right half of needle. */ size_t suffix; /* The index of the right half of needle. */ size_t shift_table[1U << CHAR_BIT]; /* See below. */ /* Factor the needle into two halves, such that the left half is smaller than the global period, and the right half is periodic (with a period as large as NEEDLE_LEN - suffix). */ suffix = critical_factorization (needle, needle_len, &period); /* Populate shift_table. For each possible byte value c, shift_table[c] is the distance from the last occurrence of c to the end of NEEDLE, or NEEDLE_LEN if c is absent from the NEEDLE. shift_table[NEEDLE[NEEDLE_LEN - 1]] contains the only 0. */ for (i = 0; i < 1U << CHAR_BIT; i++) shift_table[i] = needle_len; for (i = 0; i < needle_len; i++) shift_table[CANON_ELEMENT (needle[i])] = needle_len - i - 1; /* Perform the search. Each iteration compares the right half first. */ if (CMP_FUNC (needle, needle + period, suffix) == 0) { /* Entire needle is periodic; a mismatch can only advance by the period, so use memory to avoid rescanning known occurrences of the period. */ size_t memory = 0; size_t shift; j = 0; while (AVAILABLE (haystack, haystack_len, j, needle_len)) { /* Check the last byte first; if it does not match, then shift to the next possible match location. */ shift = shift_table[CANON_ELEMENT (haystack[j + needle_len - 1])]; if (0 < shift) { if (memory && shift < period) { /* Since needle is periodic, but the last period has a byte out of place, there can be no match until after the mismatch. */ shift = needle_len - period; } memory = 0; j += shift; continue; } /* Scan for matches in right half. The last byte has already been matched, by virtue of the shift table. */ i = MAX (suffix, memory); while (i < needle_len - 1 && (CANON_ELEMENT (needle[i]) == CANON_ELEMENT (haystack[i + j]))) ++i; if (needle_len - 1 <= i) { /* Scan for matches in left half. */ i = suffix - 1; while (memory < i + 1 && (CANON_ELEMENT (needle[i]) == CANON_ELEMENT (haystack[i + j]))) --i; if (i + 1 < memory + 1) return (RETURN_TYPE) (haystack + j); /* No match, so remember how many repetitions of period on the right half were scanned. */ j += period; memory = needle_len - period; } else { j += i - suffix + 1; memory = 0; } } } else { /* The two halves of needle are distinct; no extra memory is required, and any mismatch results in a maximal shift. */ size_t shift; period = MAX (suffix, needle_len - suffix) + 1; j = 0; while (AVAILABLE (haystack, haystack_len, j, needle_len)) { /* Check the last byte first; if it does not match, then shift to the next possible match location. */ shift = shift_table[CANON_ELEMENT (haystack[j + needle_len - 1])]; if (0 < shift) { j += shift; continue; } /* Scan for matches in right half. The last byte has already been matched, by virtue of the shift table. */ i = suffix; while (i < needle_len - 1 && (CANON_ELEMENT (needle[i]) == CANON_ELEMENT (haystack[i + j]))) ++i; if (needle_len - 1 <= i) { /* Scan for matches in left half. */ i = suffix - 1; while (i != SIZE_MAX && (CANON_ELEMENT (needle[i]) == CANON_ELEMENT (haystack[i + j]))) --i; if (i == SIZE_MAX) return (RETURN_TYPE) (haystack + j); j += period; } else j += i - suffix + 1; } } return NULL; }
google下来的一些性能测试,随机字串性能测试比较
Boyer-Moore | 1734 ms |
Boyer-Moore-Horspool (Joel’s original implementation) | 1101 ms |
Boyer-Moore-Horspool (our implementation) | 1080 ms |
Turbo Boyer-Moore | 2683 ms |
strstr | 2116 ms |
memmem | 3047 ms |