Pattern Matching Problem : Given strings $T$ (text) and $P$ (pattern), find a substring of $T$ equal to $P$
Brute-Force Algorithm
Compare the pattern $P$ with the text $T$ until either
a match is found, or
all placements of the pattern have been tried
Input : text $T$ of size $n$ and pattern $P$ of size $m$
Output : starting index or $-1$
Algorithm BruteForceMatch(T, P)
for i = 0 to n - m // test shift i of the pattern
j = 0;
while j < m
if T[i + j] = P[j]
j = j + 1;
else
break while loop // mismatch
if j = m
return i // match at i
return -1; // no match anywhere
Skip the comparision by the number of overlapping numbers
Failure function $F(j)$
Defined as the size of overlapping numbers
the largest prefix of $P[0...j]$ that is also a suffix of $P[0...j]$
Algorithm KMPMatch(T, P)
F = failureFunction(P);
i = 0;
j = 0;
while i < n
if T[i] = P[i]
if j = m - 1
return i - j; // match
else
i = i + 1;
j = j + 1;
else
if j > 0
j = F[j - 1];
else
i = i + 1;
return -1; // no match
Algorithm failureFunction(P)
F[0] = 0;
i = 1;
j = 0;
while i < m
if P[i] = P[j] // we have matched j + 1 chars
F[i] = j + 1;
i = i + 1;
j = j + 1;
else if j > 0 // use failure function to shift P
j = F[j - 1];
else // no match
F[i] = 0;
i = i + 1;
Computing time of the failure function : $O(m)$
Running time of KMP's algorithm : $O(m + n)$ -> Optimal
if $P$ contains $c$, shift $P$ to align the last occurrence of $c$ in $P$ with $T[i]$
else shift $P$ to align $P[0]$ with $T[i+1]$
Last-Occurrence Function $L(\delta )$
The largest index $i$ such that $P[i]=\delta or -1$ if no such index exists
Initialize all array as $-1$ and scan pattern
Algorithm BoyerMooreMatch(T, P, Σ)
L = lastOccurenceFunction(P, Σ);
i = m - 1;
j = m - 1;
while i > n -1
if T[i] = P[j]
if j = 0
return i; // match at i
else
i = i - 1;
j = j - 1;
else // Character-jump
l = L[T[i]];
i = i + m - min(j, 1 + l);
j = m - 1;
return -1; // no match
Computing time of the last-occurrence function : $O(m+s)$
Scan a pattern $P$ : $O(m)$
Initialize a array $\sum$ with size $s$ : $O(s)$
Worst-case running time : $O(nm + s)$
Example of worst case
$T=aaa...a \ (a^n), P=baaa \ (ba^{m-1})$
알파벳 갯수가 적은 경우 주로 발생 (Images or DNA sequences)
알파벳 갯수가 많은 경우 잘 발생하지 않음 (English text)
Significantly faster than the Brute-Force algorithm on English text