28 Implement `strstr()`


title: Implement strstr()
tags:
- implement-strstr
- No.28
- simple
- string
- rabin-karp
- finite-automata
- kmp


Problem

Implement strStr().

Return the index of the first occurrence of needle in haystack, or -1 if needle is not part of haystack.

Example 1:

Input: haystack = "hello", needle = "ll"
Output: 2

Example 2:

Input: haystack = "aaaaa", needle = "bba"
Output: -1

Clarification:

What should we return when needle is an empty string? This is a great question to ask during an interview.

For the purpose of this problem, we will return 0 when needle is an empty string. This is consistent with C's strstr() and Java's indexOf().

Corner Cases

  • empty haystack:
haystack: ""
needle  : ""
  • short haystack:
haystack: "a"
needle  : "abcdefg"
  • empty needle:
haystack: "abaskdjflsdf"
needle  : ""
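
As a quick sanity check, the corner cases above can be run against Java's built-in String.indexOf, which the clarification cites as the reference behavior. This is only a sketch; the class name CornerCases and the helper reference() are illustrative and not part of the original solution.

```java
class CornerCases {
    // Reference behavior: delegate to the JDK's String.indexOf.
    static int reference(String haystack, String needle) {
        return haystack.indexOf(needle);
    }

    public static void main(String[] args) {
        System.out.println(reference("", ""));             // empty haystack and needle: 0
        System.out.println(reference("a", "abcdefg"));     // needle longer than haystack: -1
        System.out.println(reference("abaskdjflsdf", "")); // empty needle: 0
    }
}
```

Any implementation of strStr() in this post should return the same values for these inputs.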

Solutions

Rabin-Karp

Use the Rabin-Karp algorithm to match the pattern. Suppose a function RabinKarp(String s) returns a hash value for any string of a fixed length l. Then compare the hash of needle with the hash of every substring of length l in haystack.

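If the hash of a substring equals the hash of needle (a hit), compare the two strings character by character. Otherwise, skip the substring. In other words, every match is a hit, but not every hit is a match:

MATCH \subseteq HIT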

The hash function RabinKarp() is designed as follows:

Take a large prime number q. Every substring of length l is hashed into the range [0, q). The larger q is, the less frequent spurious hits are.

Take a radix r for the character set \Sigma, usually r = |\Sigma|. For extended ASCII, r = 256.

Starting from val = 0, compute the hash value by applying val = (r * val + s[i]) % q for each character s[i] from left to right.
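
The three steps above can be sketched in a few lines. The sketch below assumes q = 2671 and r = 256 (the values used in this section's solution); the class name, hash() and roll() are hypothetical helpers. The rolling update in roll() is the usual Rabin-Karp step for moving the window by one character and is an assumption here, since the text above only gives the left-to-right formula. The main method also shows a spurious hit: "ab" and "v@" are different strings with the same hash.

```java
class RabinKarpHashSketch {
    static final int Q = 2671; // prime modulus (same value as q in this section's solution)
    static final int R = 256;  // radix for extended ASCII

    // Horner's rule over s[from .. from + len - 1]: val = (R * val + s[i]) % Q.
    static int hash(String s, int from, int len) {
        int val = 0;
        for (int i = from; i < from + len; i++) {
            val = (R * val + s.charAt(i)) % Q;
        }
        return val;
    }

    // Assumed rolling step: slide a window of length len one position right,
    // dropping the character `out` on the left and appending `in` on the right.
    static int roll(int val, char out, char in, int len) {
        int rm = 1; // R^(len - 1) % Q, the weight of the leading character
        for (int k = 1; k < len; k++) {
            rm = (rm * R) % Q;
        }
        val = (val + Q - rm * out % Q) % Q; // remove the leading character
        return (val * R + in) % Q;          // shift and append the new one
    }

    public static void main(String[] args) {
        // Spurious hit: different strings, same hash.
        // "ab" -> (97 * 256 + 98)  % 2671 = 891
        // "v@" -> (118 * 256 + 64) % 2671 = 891
        System.out.println(hash("ab", 0, 2) == hash("v@", 0, 2)); // true

        // Rolling from "he" to "el" agrees with hashing "el" directly.
        String h = "hello";
        int rolled = roll(hash(h, 0, 2), h.charAt(0), h.charAt(2), 2);
        System.out.println(rolled == hash(h, 1, 2));              // true
    }
}
```

Because roll() costs O(1) per step once R^(len - 1) % Q is known, a real scan would compute that power once rather than inside every call, as this sketch does for brevity.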

The expected running time is O(n):

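class Solution {
    private int q = 2671; // Prime modulus
    private int r = 256;  // Radix: size of the extended ASCII alphabet

    public int strStr(String haystack, String needle) {
        char[]  h_arr = haystack.toCharArray();
        char[]  n_arr = needle.toCharArray();
        int     lh    = haystack.length();
        int     ln    = needle.length();
        int     pn    = RabinKarp(n_arr, ln); // hash of needle
        int     rm    = 1;                    // becomes r^(ln-1) % q

        if (ln == 0)            { return 0;  }
        if (lh == 0 || lh < ln) { return -1; }

        // Assumed completion: rolling-hash scan that verifies every hit.
        int ph = RabinKarp(h_arr, ln);        // hash of the first window
        for (int k = 1; k < ln; k++) { rm = (rm * r) % q; }
        for (int i = 0; i <= lh - ln; i++) {
            // startsWith compares character by character from offset i.
            if (ph == pn && haystack.startsWith(needle, i)) { return i; }
            if (i + ln < lh) {                // roll the window one step right
                ph = (ph + q - rm * h_arr[i] % q) % q;
                ph = (ph * r + h_arr[i + ln]) % q;
            }
        }
        return -1;
    }
    // Assumed helper: hash of the first len characters of s.
    private int RabinKarp(char[] s, int len) {
        int val = 0;
        for (int i = 0; i < len; i++) { val = (r * val + s[i]) % q; }
        return val;
    }
}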

Knuth-Morris-Pratt & Finite Automata

Finite automata (FA) are widely used for regular-expression matching, for example in the lexical-analysis phase of a compiler. The ingenious method used here to construct the FA comes from KMP.

For a pattern string p of length m, the FA has m+1 states, including the initial state 0. Assuming extended ASCII input, each state has a transition for each of the 256 possible character values. The FA accepts the input string (or char array) only when the input drives it into the final state m.

Take the pattern ababac and the character set {a, b, c} as an example. Each row lists the next state for inputs a, b and c; the accepted column shows the pattern character that leads into that state:

state   a   b   c   accepted
0       1   0   0
1       1   2   0   a
2       3   0   0   b
3       1   4   0   a
4       5   0   0   b
5       1   4   6   a
6                   c

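While computing the transition table, keep a restart state x: the state the FA would be in after reading p[1 : i], that is, the matched part of the pattern without its first character. On a mismatch in state i the FA behaves as if it were in state x, so row i is first copied from row x and then the matching transition dfa[i][p[i]] = i + 1 is set. After that, x is updated as x = dfa[x][p[i]]. In effect, x says how far to roll back: the prefix p[0 : x] overlaps the last x characters s[i-x : i] of the text read so far: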

x = 3:
s[..., i-4, i-3, i-2, i-1, i]
          p[0,   1,   2,   3,   4, ...]
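
The construction described above can be sketched as follows. To keep the table small, this sketch indexes columns by a character's position in a given alphabet string instead of using 256 columns; that choice, the class name DfaSketch, and the methods build() and search() are assumptions for illustration. It expects a non-empty pattern and input made only of characters from the alphabet.

```java
class DfaSketch {
    // Builds dfa[state][symbol] for pattern p. Symbols are indexed by their
    // position in `alphabet` (assumption: every character of p is in it).
    static int[][] build(String p, String alphabet) {
        int m = p.length();
        int[][] dfa = new int[m][alphabet.length()];
        dfa[0][alphabet.indexOf(p.charAt(0))] = 1;
        int x = 0; // restart state
        for (int i = 1; i < m; i++) {
            for (int c = 0; c < alphabet.length(); c++) {
                dfa[i][c] = dfa[x][c];  // mismatch: behave like state x
            }
            int pc = alphabet.indexOf(p.charAt(i));
            dfa[i][pc] = i + 1;         // match: advance to the next state
            x = dfa[x][pc];             // update the restart state
        }
        return dfa;
    }

    // Runs the DFA over s and returns the start index of the first
    // occurrence of p, or -1 if the final state is never reached.
    static int search(String s, String p, String alphabet) {
        int[][] dfa = build(p, alphabet);
        int q = 0;
        for (int i = 0; i < s.length(); i++) {
            q = dfa[q][alphabet.indexOf(s.charAt(i))];
            if (q == p.length()) { return i - p.length() + 1; }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[][] dfa = build("ababac", "abc");
        for (int state = 0; state < dfa.length; state++) {
            System.out.println(state + ": " + java.util.Arrays.toString(dfa[state]));
        }
        System.out.println(search("abababacab", "ababac", "abc")); // 2
    }
}
```

The final state m needs no row of its own here because the search returns as soon as it is reached.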

Running time is O(m \times |\Sigma|) = O(256m) for computing char transition table, O(n) for matching.

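class Solution {
    private int[][] dfa; // dfa[state][c]: next state after reading character c
    public int strStr(String haystack, String needle) {
        char[]  h_arr = haystack.toCharArray();
        char[]  n_arr = needle.toCharArray();
        int     lh    = haystack.length();
        int     ln    = needle.length();
        int     q     = 0; // current state

        if (ln == 0)            { return 0;  }
        if (lh == 0 || lh < ln) { return -1; }

        ASCIItransition(n_arr);
        // Assumed completion of the scan: feed haystack through the DFA.
        for (int i = 0; i < lh; i++) {
            q = dfa[q][h_arr[i]];
            if (q == ln) { return i - ln + 1; } // final state reached
        }
        return -1;
    }

    // Assumed helper: builds the table; every character must be below 256.
    private void ASCIItransition(char[] p) {
        dfa = new int[p.length][256];
        dfa[0][p[0]] = 1;
        for (int x = 0, i = 1; i < p.length; i++) {
            for (int c = 0; c < 256; c++) { dfa[i][c] = dfa[x][c]; }
            dfa[i][p[i]] = i + 1;
            x = dfa[x][p[i]];
        }
    }
}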
