CS/알고리즘

String Matching

재호맴매 2023. 11. 23. 15:33

String

  • Definition : A sequence of characters
  • $P$ : Patttern string of size $m$
  • Pattern Matching Problem : Given strings $T$ (text) and $P$ (pattern), find a substring of $T$ equal to $P$

 

Brute-Force Algorithm

  • Compare the pattern $P$ with the text $T$ until either
    1. a match is found, or
    2. all placements of the pattern have been tried
  • Input : text $T$ of size $n$ and pattern $P$ of size $m$
  • Output : starting index or $-1$
Algorithm BruteForceMatch(T, P)
	for i = 0 to n - m // test shift i of the pattern
		j = 0;
		while j < m 
			if T[i + j] = P[j]
				j = j + 1;
			else
				break while loop // mismatch
		if j = m
			return i // match at i
	return -1; // no match anywhere
  • Worst-case running time : $O(nm)$
  • Example of worst case
    • $T = aaa\cdots ah \ (a^nb)$, $P=aaah \ (a^{m-1}b)$
    • 알파벳 갯수가 적은 경우 주로 발생 (Images나 DNA sequences)
    • 알파벳 갯수가 많은 경우 잘 발생하지 않음 (English text)

 

Knuth-Morris-Pratt Algorithm

  • Compare the pattern to the text in left-to-right
  • Skip the comparision by the number of overlapping numbers
  • Failure function $F(j)$
    • Defined as the size of overlapping numbers
    • the largest prefix of $P[0...j]$ that is also a suffix of $P[0...j]$

Compute the failure function
Flow of the KMP algorithm

Algorithm KMPMatch(T, P)
	F = failureFunction(P);
	i = 0;
	j = 0;
	while i < n
		if T[i] = P[i]
			if j = m - 1
				return i - j; // match
			else
				i = i + 1;
				j = j + 1;
		else
			if j > 0
				j = F[j - 1];
			else
				i = i + 1;
		return -1; // no match

Algorithm failureFunction(P)
	F[0] = 0;
	i = 1;
	j = 0;
	while i < m
		if P[i] = P[j] // we have matched j + 1 chars
			F[i] = j + 1;
			i = i + 1;
			j = j + 1;
		else if j > 0 // use failure function to shift P
			j = F[j - 1];
		else // no match
			F[i] = 0;
			i = i + 1;
  • Computing time of the failure function : $O(m)$
  • Running time of KMP's algorithm : $O(m + n)$ -> Optimal

 

Boyer-Moore Algorithm

  • Boyer-Moore Heuristics
    1. Looking-glass heuristic : Compare moving backwards
    2. Character-jump (Bad-character) heuristic
      • When a mismatch occurs at $T[i] = c$
      • if $P$ contains $c$, shift $P$ to align the last occurrence of $c$ in $P$ with $T[i]$
      • else shift $P$ to align $P[0]$ with $T[i+1]$
  • Last-Occurrence Function $L(\delta )$
    • The largest index $i$ such that $P[i]=\delta or -1$ if no such index exists
    • Initialize all array as $-1$ and scan pattern

Results of Last-Occurrence Function
Flow of the Boyer-Moore algorithm

Algorithm BoyerMooreMatch(T, P, Σ)
	L = lastOccurenceFunction(P, Σ);
	i = m - 1;
	j = m - 1;
	while i > n -1
		if T[i] = P[j]
			if j = 0
				return i; // match at i
			else
				i = i - 1;
				j = j - 1;
		else // Character-jump
			l = L[T[i]];
			i = i + m - min(j, 1 + l);
			j = m - 1;
	return -1; // no match
  • Computing time of the last-occurrence function : $O(m+s)$
    • Scan a pattern $P$ : $O(m)$
    • Initialize a array $\sum$ with size $s$ : $O(s)$
  • Worst-case running time : $O(nm + s)$
  • Example of worst case
    • $T=aaa...a \ (a^n), P=baaa \ (ba^{m-1})$
    • 알파벳 갯수가 적은 경우 주로 발생 (Images or DNA sequences)
    • 알파벳 갯수가 많은 경우 잘 발생하지 않음 (English text)
  • Significantly faster than the Brute-Force algorithm on English text