CS/알고리즘
String Matching
재호맴매
2023. 11. 23. 15:33
String
- Definition : A sequence of characters
- $P$ : Patttern string of size $m$
- Pattern Matching Problem : Given strings $T$ (text) and $P$ (pattern), find a substring of $T$ equal to $P$
Brute-Force Algorithm
- Compare the pattern $P$ with the text $T$ until either
- a match is found, or
- all placements of the pattern have been tried
- Input : text $T$ of size $n$ and pattern $P$ of size $m$
- Output : starting index or $-1$
Algorithm BruteForceMatch(T, P)
for i = 0 to n - m // test shift i of the pattern
j = 0;
while j < m
if T[i + j] = P[j]
j = j + 1;
else
break while loop // mismatch
if j = m
return i // match at i
return -1; // no match anywhere
- Worst-case running time : $O(nm)$
- Example of worst case
- $T = aaa\cdots ah \ (a^nb)$, $P=aaah \ (a^{m-1}b)$
- 알파벳 갯수가 적은 경우 주로 발생 (Images나 DNA sequences)
- 알파벳 갯수가 많은 경우 잘 발생하지 않음 (English text)
Knuth-Morris-Pratt Algorithm
- Compare the pattern to the text in left-to-right
- Skip the comparision by the number of overlapping numbers
- Failure function $F(j)$
- Defined as the size of overlapping numbers
- the largest prefix of $P[0...j]$ that is also a suffix of $P[0...j]$
Algorithm KMPMatch(T, P)
F = failureFunction(P);
i = 0;
j = 0;
while i < n
if T[i] = P[i]
if j = m - 1
return i - j; // match
else
i = i + 1;
j = j + 1;
else
if j > 0
j = F[j - 1];
else
i = i + 1;
return -1; // no match
Algorithm failureFunction(P)
F[0] = 0;
i = 1;
j = 0;
while i < m
if P[i] = P[j] // we have matched j + 1 chars
F[i] = j + 1;
i = i + 1;
j = j + 1;
else if j > 0 // use failure function to shift P
j = F[j - 1];
else // no match
F[i] = 0;
i = i + 1;
- Computing time of the failure function : $O(m)$
- Running time of KMP's algorithm : $O(m + n)$ -> Optimal
Boyer-Moore Algorithm
- Boyer-Moore Heuristics
- Looking-glass heuristic : Compare moving backwards
- Character-jump (Bad-character) heuristic
- When a mismatch occurs at $T[i] = c$
- if $P$ contains $c$, shift $P$ to align the last occurrence of $c$ in $P$ with $T[i]$
- else shift $P$ to align $P[0]$ with $T[i+1]$
- Last-Occurrence Function $L(\delta )$
- The largest index $i$ such that $P[i]=\delta or -1$ if no such index exists
- Initialize all array as $-1$ and scan pattern
Algorithm BoyerMooreMatch(T, P, Σ)
L = lastOccurenceFunction(P, Σ);
i = m - 1;
j = m - 1;
while i > n -1
if T[i] = P[j]
if j = 0
return i; // match at i
else
i = i - 1;
j = j - 1;
else // Character-jump
l = L[T[i]];
i = i + m - min(j, 1 + l);
j = m - 1;
return -1; // no match
- Computing time of the last-occurrence function : $O(m+s)$
- Scan a pattern $P$ : $O(m)$
- Initialize a array $\sum$ with size $s$ : $O(s)$
- Worst-case running time : $O(nm + s)$
- Example of worst case
- $T=aaa...a \ (a^n), P=baaa \ (ba^{m-1})$
- 알파벳 갯수가 적은 경우 주로 발생 (Images or DNA sequences)
- 알파벳 갯수가 많은 경우 잘 발생하지 않음 (English text)
- Significantly faster than the Brute-Force algorithm on English text