ABOUT ME

-

Today
-
Yesterday
-
Total
-
  • String Matching
    CS/알고리즘 2023. 11. 23. 15:33

    String

    • Definition : A sequence of characters
    • $P$ : Patttern string of size $m$
    • Pattern Matching Problem : Given strings $T$ (text) and $P$ (pattern), find a substring of $T$ equal to $P$

     

    Brute-Force Algorithm

    • Compare the pattern $P$ with the text $T$ until either
      1. a match is found, or
      2. all placements of the pattern have been tried
    • Input : text $T$ of size $n$ and pattern $P$ of size $m$
    • Output : starting index or $-1$
    Algorithm BruteForceMatch(T, P)
    	for i = 0 to n - m // test shift i of the pattern
    		j = 0;
    		while j < m 
    			if T[i + j] = P[j]
    				j = j + 1;
    			else
    				break while loop // mismatch
    		if j = m
    			return i // match at i
    	return -1; // no match anywhere
    • Worst-case running time : $O(nm)$
    • Example of worst case
      • $T = aaa\cdots ah \ (a^nb)$, $P=aaah \ (a^{m-1}b)$
      • 알파벳 갯수가 적은 경우 주로 발생 (Images나 DNA sequences)
      • 알파벳 갯수가 많은 경우 잘 발생하지 않음 (English text)

     

    Knuth-Morris-Pratt Algorithm

    • Compare the pattern to the text in left-to-right
    • Skip the comparision by the number of overlapping numbers
    • Failure function $F(j)$
      • Defined as the size of overlapping numbers
      • the largest prefix of $P[0...j]$ that is also a suffix of $P[0...j]$

    Compute the failure function
    Flow of the KMP algorithm

    Algorithm KMPMatch(T, P)
    	F = failureFunction(P);
    	i = 0;
    	j = 0;
    	while i < n
    		if T[i] = P[i]
    			if j = m - 1
    				return i - j; // match
    			else
    				i = i + 1;
    				j = j + 1;
    		else
    			if j > 0
    				j = F[j - 1];
    			else
    				i = i + 1;
    		return -1; // no match
    
    Algorithm failureFunction(P)
    	F[0] = 0;
    	i = 1;
    	j = 0;
    	while i < m
    		if P[i] = P[j] // we have matched j + 1 chars
    			F[i] = j + 1;
    			i = i + 1;
    			j = j + 1;
    		else if j > 0 // use failure function to shift P
    			j = F[j - 1];
    		else // no match
    			F[i] = 0;
    			i = i + 1;
    • Computing time of the failure function : $O(m)$
    • Running time of KMP's algorithm : $O(m + n)$ -> Optimal

     

    Boyer-Moore Algorithm

    • Boyer-Moore Heuristics
      1. Looking-glass heuristic : Compare moving backwards
      2. Character-jump (Bad-character) heuristic
        • When a mismatch occurs at $T[i] = c$
        • if $P$ contains $c$, shift $P$ to align the last occurrence of $c$ in $P$ with $T[i]$
        • else shift $P$ to align $P[0]$ with $T[i+1]$
    • Last-Occurrence Function $L(\delta )$
      • The largest index $i$ such that $P[i]=\delta or -1$ if no such index exists
      • Initialize all array as $-1$ and scan pattern

    Results of Last-Occurrence Function
    Flow of the Boyer-Moore algorithm

    Algorithm BoyerMooreMatch(T, P, Σ)
    	L = lastOccurenceFunction(P, Σ);
    	i = m - 1;
    	j = m - 1;
    	while i > n -1
    		if T[i] = P[j]
    			if j = 0
    				return i; // match at i
    			else
    				i = i - 1;
    				j = j - 1;
    		else // Character-jump
    			l = L[T[i]];
    			i = i + m - min(j, 1 + l);
    			j = m - 1;
    	return -1; // no match
    • Computing time of the last-occurrence function : $O(m+s)$
      • Scan a pattern $P$ : $O(m)$
      • Initialize a array $\sum$ with size $s$ : $O(s)$
    • Worst-case running time : $O(nm + s)$
    • Example of worst case
      • $T=aaa...a \ (a^n), P=baaa \ (ba^{m-1})$
      • 알파벳 갯수가 적은 경우 주로 발생 (Images or DNA sequences)
      • 알파벳 갯수가 많은 경우 잘 발생하지 않음 (English text)
    • Significantly faster than the Brute-Force algorithm on English text

    'CS > 알고리즘' 카테고리의 다른 글

    Dynamic Programming  (0) 2023.11.22
    Graph Optimization Problems and Greedy Algorithms  (0) 2023.11.22
    Graphs and Graph Traversals  (0) 2023.11.21
    Sorting  (0) 2023.11.14
    Analyzing Algorithms and Problem Principles  (0) 2023.11.13
Designed by Tistory.