title: Implement strstr()
tags:
- implement-strstr
- No.28
- simple
- string
- rabin-karp
- finite-automata
- kmp
Problem
Implement strStr().
Return the index of the first occurrence of needle in haystack, or -1 if needle is not part of haystack.
Example 1:
Input: haystack = "hello", needle = "ll"
Output: 2
Example 2:
Input: haystack = "aaaaa", needle = "bba"
Output: -1
Clarification:
What should we return when needle
is an empty string? This is a great question to ask during an interview.
For the purpose of this problem, we will return 0 when needle
is an empty string. This is consistent to C's strstr() and Java's indexOf().
Corner Cases
- empty haystack:
haystack: ""
needle : ""
- short haystack:
haystack: "a"
needle : "abcdefg"
- empty needle
haystack: "abaskdjflsdf"
needle : ""
Solutions
Rabin-Karp
Use Rabin-Karp algorithm to match string pattern. If function RabinKarp(String s)
returns a hash value for any string with a certain length l
, then compare the hash value between needle
and substrings in length l
in haystack
.
If hash value hits the needle
, then compare the string character by character. Else, skip. In another word,
HIT \subsetneq MATCH
The design of hash function RabinKarp()
goes as following:
Take a large prime number q
. Any substring in length l
are hashed to this q
pool. Thus the larger q
is, the fewer frequently the spurious hits are.
Take a radix r
for character set \Sigma, which usually is |\Sigma|. For ASCII, r=256
.
Compute the hash value according to val = (r * val + s[i]) % q
.
The expected running time is O(n):
class Solution {
private int q = 2671; // Prime number
private int r = 256; // size of ASCII
public int strStr(String haystack, String needle) {
char[] h_arr = haystack.toCharArray();
char[] n_arr = needle.toCharArray();
int lh = haystack.length();
int ln = needle.length();
int pn = RabinKarp(n_arr);
int ph;
boolean match;
if (ln == 0) { return 0; }
if (lh == 0 || lh < ln) { return -1; }
for (int i=0; i
Knuth-Morris-Pratt & Finite Automata
Finite Automata is widely used in regular matching in the part of lexical analysis in compiler. And the ingenious constructing method of FA belongs to KMP.
For a pattern string p
, suppose the length of it is m
. Then we have m+1
kinds of states for FA, including an initialization state. And we have 256 characters for char
type. Only when the state is transfered to the final one by input, we say that FA accept the input string(or char
array).
Take input ababac
and character set {a, b, c}
as an instance:
state | a | b | c | accepted |
---|---|---|---|---|
0 | 1 | 0 | 0 | |
1 | 1 | 2 | 0 | a |
2 | 3 | 0 | 0 | b |
3 | 1 | 4 | 0 | a |
4 | 5 | 2 | 0 | b |
5 | 1 | 4 | 6 | a |
6 | c |
When compute the transition table, use a state b
to record the reset state for mis-matching input. Then b
can be updated as b = dfa[x][p[i]]
. This b
means rolling backwards. It indicates the prefix p[0 : b]
should overlap the suffix s[i-b : b]
for mis-matching:
x = 3:
s[..., i-4, i-3, i-2, i-1, i]
p[0, 1, 2, 3, 4, ...]
Running time is O(m \times |\Sigma|) = O(256m) for computing char
transition table, O(n) for matching.
class Solution {
private int[][] dfa;
public int strStr(String haystack, String needle) {
char[] h_arr = haystack.toCharArray();
char[] n_arr = needle.toCharArray();
int lh = haystack.length();
int ln = needle.length();
int q = 0;
if (ln == 0) { return 0; }
if (lh == 0 || lh < ln) { return -1; }
ASCIItransition(n_arr);
for (int i=0; i