CPS222 Lecture: Pattern Matching in Strings         - Last revised 4/07/2015

OBJECTIVES:

1. To become familiar with the Knuth-Morris-Pratt algorithm for pattern
   matching in strings.
2. To become familiar with the basic wildcard matching algorithm for strings.

MATERIALS:

1. Projectable of brute force matching algorithm
2. Projectable of applying brute force algorithm to a pathological case
3. Projectable of KMP algorithm
4. Projectable of applying KMP to some example strings
5. Projectable of computation of KMP failure function
6. Projectable of applying this to some example patterns
7. Projectable of wildcard matching algorithm

I. Introduction
-  ------------

   A. When we speak of a string, we are in general speaking of a (possibly
      empty) sequence of symbols drawn from some alphabet. There are,
      therefore, as many different types of strings as there are possible
      alphabets:

      1. Bit strings are drawn from the alphabet {0,1}. Ultimately, all data
         is represented in memory as bit strings - but some hardware systems
         and some programming languages provide facilities for manipulating
         bit strings of arbitrary length (not necessarily 8 or 32 or whatever
         the word size happens to be).

      2. An individual's DNA can be represented by a string on the alphabet
         { A, C, G, T } - symbolic names for four chemical bases.

      3. Character strings are drawn from the alphabet
         {c | c is of type char }. Most often, when we speak of a string
         without further qualification, this is what we mean. We will focus
         our discussion on character strings.

   B. The key problem in implementing strings in any language is variable
      length. Over the course of a program's execution, the space needed by a
      given string variable may vary widely - especially when working with
      character strings.

   C. C++ has two different facilities for working with character strings:

      1. One facility is inherited from C - hence such strings are often
         called "C strings".
         A C string is an array of characters, with the end of the string
         (and hence its length) marked by a null character ('\0').

         a. When the C++ compiler sees a sequence of characters enclosed in
            double quotes, it interprets it as a C string.

         b. Example: the string "hello" has internal representation

            _______________________________
            | h  | e  | l  | l  | o  | \0 |
            -------------------------------

            (Note that, although the string is 5 characters long, the
            representation needs six characters to allow for the terminating
            null character)

         c. When declaring a variable to hold a C string, the declared size
            of the array must be big enough to hold the longest string the
            variable will ever hold. This will, of course, be 1 more than
            the number of characters. When the variable holds a shorter
            value, some number of elements in the array will be unused.

         d. Because an array name is converted to a pointer to its first
            element in most C expressions, such a string can be regarded
            equivalently as being of type char ... [] or of type char *.

      2. The other facility - which is unique to C++ - is the library string
         class. This facility supports variable length strings. The details
         vary between library implementations; one reference-counted design
         works as follows:

         a. A string object contains a single field, which points to a
            representation that looks like this:

                current length
                space reserved (can be larger than current length)
                # of references
                "selfish"
                data (array of characters at least big enough to hold
                      current value)

         b. Operations on a string that would change its length (e.g.
            assignment of a new value, inserting or appending characters)
            may result in a new representation being created and the pointer
            being reset, if the total space available is less than the new
            needed length.

         c. For efficiency, two or more strings can share the same
            representation structure. The reference count keeps track of
            the number of strings sharing a representation - when it goes to
            zero the representation is deleted.

         d.
            The selfish flag indicates that a representation is subject to
            being modified (is being accessed by a non-const method) and
            hence cannot be shared.

II. Pattern matching algorithms
--  ------- -------- ----------

   A. From an efficiency standpoint, the most challenging string operation to
      implement is pattern matching: given a pattern string and a subject
      string, determine whether the pattern occurs in the subject and, if so,
      where the match begins.

      1. This is an important operation.

         a. For example, it is what string methods such as find() in C++ and
            indexOf() in Java do.

         b. But pattern matching is also used for other kinds of strings -
            e.g. matching DNA sequences in a database of DNA samples.

      2. A fairly straightforward approach is to use brute force as follows.
         We use the variable i to point to a character in the subject and j
         to point to a character in the pattern. At each iteration of the
         loop, we compare subject[i] to pattern[j]:

         // Return first position where pattern matches subject or -1 if no match
         int match(const string & pattern, const string & subject)
         {
             int n = subject.length();   // Length of subject
             int m = pattern.length();   // Length of pattern
             int i = 0;                  // Current position in subject
             int j = 0;                  // Current position in pattern
             while (j < m && i < n)
             {
                 if (pattern[j] == subject[i])
                 {
                     i ++; j ++;         // Characters match - advance both
                 }
                 else
                 {
                     i = i - j + 1;      // Restart one beyond the previous start
                     j = 0;
                 }
             }
             if (j >= m)
                 return i - j;           // Position where the match began
             else
                 return -1;
         }

         PROJECT

         Example: search for 'lo' in 'hello'

                                                             i    j
         Initialize                                          0    0
         if fails ('h' != 'l') - set i = i-j+1, j = 0        1    0
         if fails ('e' != 'l') - set i = i-j+1, j = 0        2    0
         if succeeds ('l' == 'l') - increment i, j           3    1
         if fails ('l' != 'o') - set i = i-j+1, j = 0        3    0
         if succeeds ('l' == 'l') - increment i, j           4    1
         if succeeds ('o' == 'o') - increment i, j           5    2
         exit while loop - j >= length('lo')
         declare match starting at (5 - 2) = 3 and exit

      3. If we let n be the length of the subject and m of the pattern, then
         in most cases, this algorithm has performance close to O(n+m).

         a. When pattern and subject match, i is incremented by 1

         b. When pattern and subject don't match, i can be decremented.
            However, observe that if j = 0, i is in fact INCREMENTED, and if
            j = 1, i is left alone. Since mismatches generally occur early
            with typical subjects and patterns, we expect that most
            iterations of the while loop will result in increasing i by 1,
            and i will rarely be decremented. Since the while loop exits
            when i >= n, we expect slightly more than n iterations of the
            loop, i.e. O(n).

         c. We arrive at the approximation O(n+m) by arguing that we will
            have O(n) "false starts" and will match m characters when we
            finally succeed.

      4. However, this algorithm has worst case performance O(n^2). To see
         this, consider an (admittedly contrived) example:
         search for 'aaaab' in 'aaaaaaaaab'

                                          i     j
         Initialize                       0     0
         'a' = 'a' - increment            1     1
         'a' = 'a' - increment            2     2
         'a' = 'a' - increment            3     3
         'a' = 'a' - increment            4     4
         'a' != 'b' - back up             1     0
         'a' = 'a' - increment            2     1
         'a' = 'a' - increment            3     2
         'a' = 'a' - increment            4     3
         'a' = 'a' - increment            5     4
         'a' != 'b' - back up             2     0
         'a' = 'a' - increment            3     1
         'a' = 'a' - increment            4     2
         'a' = 'a' - increment            5     3
         'a' = 'a' - increment            6     4
         'a' != 'b' - back up             3     0
         'a' = 'a' - increment            4     1
         'a' = 'a' - increment            5     2
         'a' = 'a' - increment            6     3
         'a' = 'a' - increment            7     4
         'a' != 'b' - back up             4     0
         'a' = 'a' - increment            5     1
         'a' = 'a' - increment            6     2
         'a' = 'a' - increment            7     3
         'a' = 'a' - increment            8     4
         'a' != 'b' - back up             5     0
         'a' = 'a' - increment            6     1
         'a' = 'a' - increment            7     2
         'a' = 'a' - increment            8     3
         'a' = 'a' - increment            9     4
         'b' = 'b' - increment            10    5
         Exit reporting match starting at (10 - 5) = 5

         PROJECT

         What in fact happens here is we have a series of 5 trial matches
         that match almost all the characters in the pattern and fail only
         on the last character, but the 6th such trial succeeds. That is, we
         compare (n - m + 1)(m) = nm - m^2 + m characters (30 in this
         example), which is O(mn). So the worst case for our algorithm is
         O(mn); and since the pattern can be of length comparable to the
         subject (e.g. m = n/2), this becomes O(n^2).

   B.
      The pathological behavior of the algorithm in cases like the above can
      be avoided (and a guaranteed O(n+m) search can be achieved) by taking
      advantage of information acquired up to the time a match fails.

      1. In the above example, our brute force algorithm backs up all the way
         to the beginning of the pattern when it discovers the mismatch
         between the "b" in the pattern and the "a" in the subject. This is
         not necessary, since we know that the pattern begins with a series
         of 4 a's, all of which already matched in the subject. When we go to
         consider a possible match at the next starting position in the
         subject, three of the subject a's are already known to match the
         corresponding pattern a's, so we could resume by comparing the
         fourth a in the pattern with the corresponding character in the
         subject. That is, instead of backing up from j = 4 to j = 0
         (4 positions), we could simply back up from j = 4 to j = 3
         (1 position), leaving i where it is.

         First trial match:

            a a a a a a a a a b
            a a a a b               (we consider all five positions)

         Second trial match:

            a a a a a a a a a b
              a a a a b
                    ^ ^
                    | |             (we only need to consider these positions)

         Third trial match:

            a a a a a a a a a b
                a a a a b
                      ^ ^
                      | |           (we only need to consider these positions)

         ...

         Sixth trial match:

            a a a a a a a a a b
                      a a a a b
                            ^ ^
                            | |     (we only need to consider these positions)

         The total number of comparisons needed now is five on the first
         trial and two each on the second through sixth trials - or
         15 total = n + m.

      2. Now consider another slightly different example. Suppose the pattern
         is as before, but the subject is a a a a c a a a a b. What happens
         when we discover the mismatch between b and c at position 4 in the
         subject and pattern?

         a. In this case, because the pattern begins with four a's, the c in
            the subject at position 4 allows us to conclude that no match can
            possibly start at positions 1, 2, 3, or 4 in the subject.

         b.
            Thus, after failing to find a match starting at position 0 in the
            subject, we can skip ahead to consider a possible match starting
            at position 5 in the subject - which of course succeeds.

         c. Here, we should be able to complete the search with only 10
            comparisons (though the algorithm we will consider will, in fact,
            use 14 - close to the previous case.)

      3. These observations are incorporated into an algorithm known as the
         Knuth-Morris-Pratt algorithm.

         a. The basic idea is this: before starting the actual matching
            process, we compute a table next[j] with one entry for each
            position in the pattern, defined as follows. (The book calls
            this function f instead of next):

               next[j] = length of the longest prefix of the pattern that is
                         also a suffix of pattern[1..j] (the part of the
                         pattern starting at position 1 and ending at
                         position j)

               with special case next[0] = 0

            Example: for our pattern aaaab, we compute next[] as follows

               j   next[j]   Rationale
               0   0         Special case
               1   1         pattern[0..0] = pattern[1..1]
               2   2         pattern[0..1] = pattern[1..2]
               3   3         pattern[0..2] = pattern[1..3]
               4   0         No suffix of pattern[1..4] matches any prefix of
                             the pattern

            (We will discuss the mechanics of computing this table shortly.)

         b. Now we use this table in the search as follows:

            i.  Suppose that we have matched j characters of the pattern with
                the subject, but pattern[j] fails to match subject[i].

            ii. In the brute force algorithm, we would continue the search by
                comparing pattern[0] with subject[i-j+1] - i.e. we start the
                whole matching process over beginning 1 beyond where the one
                that failed started. Now, instead, we proceed as follows:

                start with i = 0, j = 0;
                while i < length of subject and j < length of pattern
                    if subject[i] == pattern[j]
                        increment both i and j
                    else if j > 0
                        set j = next[j-1] and leave i alone
                        (i.e. look for a match in which the current subject
                        character matches the pattern at position next[j-1]
                        since it didn't match at position j, though previous
                        characters are already known to match.)
                    else
                        set i = i + 1.
                        (No match can begin with subject[i], so the first
                        possible match would begin at subject[i+1]).
                if j >= length of pattern
                    declare match found beginning at position i - j
                else
                    declare no match

                PROJECT

            iii. Example: match aaaab with aaaaaaaaab

                 Action                                          i    j
                 Initialize                                      0    0
                 'a' == 'a' - increase i,j                       1    1
                 'a' == 'a' - increase i,j                       2    2
                 'a' == 'a' - increase i,j                       3    3
                 'a' == 'a' - increase i,j                       4    4
                 'a' != 'b'. Since next[3] = 3, set j = 3
                   - leave i alone                               4    3
                 'a' == 'a' - increase i,j                       5    4
                 'a' != 'b'. Since next[3] = 3, set j = 3
                   - leave i alone                               5    3
                 'a' == 'a' - increase i,j                       6    4
                 'a' != 'b'. Since next[3] = 3, set j = 3
                   - leave i alone                               6    3
                 'a' == 'a' - increase i,j                       7    4
                 'a' != 'b'. Since next[3] = 3, set j = 3
                   - leave i alone                               7    3
                 'a' == 'a' - increase i,j                       8    4
                 'a' != 'b'. Since next[3] = 3, set j = 3
                   - leave i alone                               8    3
                 'a' == 'a' - increase i,j                       9    4
                 'b' == 'b' - increase i,j                       10   5
                 Exit loop - announce match beginning at position 10-5 = 5
                 (15 comparisons total)

            iv. Another example: match a a a a b with a a a a c a a a a b.

                 Action                                          i    j
                 Initialize                                      0    0
                 'a' == 'a' - increase i,j                       1    1
                 'a' == 'a' - increase i,j                       2    2
                 'a' == 'a' - increase i,j                       3    3
                 'a' == 'a' - increase i,j                       4    4
                 'c' != 'b'. Since next[3] = 3, set j = 3
                   - leave i alone                               4    3
                 'c' != 'a'. Since next[2] = 2, set j = 2
                   - leave i alone                               4    2
                 'c' != 'a'. Since next[1] = 1, set j = 1
                   - leave i alone                               4    1
                 'c' != 'a'. Since next[0] = 0, set j = 0
                   - leave i alone                               4    0
                 'c' != 'a'. Since j == 0, leave j at 0
                   and increment i                               5    0
                 'a' == 'a' - increase i,j                       6    1
                 'a' == 'a' - increase i,j                       7    2
                 'a' == 'a' - increase i,j                       8    3
                 'a' == 'a' - increase i,j                       9    4
                 'b' == 'b' - increase i,j                       10   5
                 Exit loop - announce match beginning at position 10-5 = 5
                 (14 comparisons total)

                 PROJECT

         c. Of course, we still have the problem of computing the table
            next[j]. This must be done as a preliminary step before matching
            begins, and is done by matching the pattern against itself, as
            follows:

            i.  Set i = 1, j = 0, next[0] = 0

            ii.
                As long as i < the length of the pattern:

                if pattern[i] == pattern[j]
                    set next[i] = j+1 and increment both i and j
                else if j > 0
                    set j = next[j-1] but leave i alone
                else
                    set next[i] = 0 and increment i but leave j at 0

                PROJECT

            iii. Example: for the pattern a a a a b:

                 i    j    Comparison              next[0..4]
                 1    0    Initial                 0 ? ? ? ?
                 1    0    'a' == 'a'              0 1 ? ? ?
                 2    1    'a' == 'a'              0 1 2 ? ?
                 3    2    'a' == 'a'              0 1 2 3 ?
                 4    3    'b' != 'a', j > 0       0 1 2 3 ?
                 4    2    'b' != 'a', j > 0       0 1 2 3 ?
                 4    1    'b' != 'a', j > 0       0 1 2 3 ?
                 4    0    'b' != 'a', j == 0      0 1 2 3 0
                 5    0    Done

            iv. Example: for the pattern a b a b a:

                 i    j    Comparison              next[0..4]
                 1    0    Initial                 0 ? ? ? ?
                 1    0    'b' != 'a', j == 0      0 0 ? ? ?
                 2    0    'a' == 'a'              0 0 1 ? ?
                 3    1    'b' == 'b'              0 0 1 2 ?
                 4    2    'a' == 'a'              0 0 1 2 3
                 5    3    Done

                 PROJECT

         d. To see that the Knuth-Morris-Pratt algorithm is in fact O(n+m),
            observe:

            i.  In the main loop of the matching algorithm, we never decrease
                i, and we increment i at most n times before the loop exits.

            ii. It is possible to have a series of one or more steps where we
                don't increment i; but since next[j-1] < j, each such step
                decreases j by at least 1, and j only grows when i does. So
                if we have a series of k steps where we don't increment i,
                these must follow k steps where we did increment both i and
                j. Thus, the number of times we don't increment i cannot
                exceed the number of times we do.

            iii. Thus, the total number of loop iterations is <= 2n, and the
                 main match is O(n).

            iv. By similar argument, the preliminary process for creating
                next[] is O(m), so the total process is O(n+m).

      4. There are several other ways to improve the matching process. For
         example, one algorithm developed by Boyer and Moore relies on
         working BACKWARD through the pattern, and can be very fast in those
         cases where characters near the end of the pattern do not occur
         elsewhere in the pattern.

      5. Still other techniques (e.g. the Rabin-Karp algorithm) are based on
         HASHING, rather than direct comparison. We will not discuss these
         now - see a book on Algorithms.

   C. Another interesting situation arises when we allow the use of wildcard
      characters in the pattern.

      1.
         For example, Unix shells such as the C shell allow the use of the
         wildcard characters ? and * in arguments to commands. ? matches any
         one character, and * matches any sequence of characters (including
         an empty sequence). If a directory contains the files

            foe
            foo
            foreign

            fo?  would match the first two, but not foreign
            foe* would match foe but not the other two
            fo*  would match all three

      2. The simple pattern matching we looked at earlier can easily be
         extended to handle wildcards like these by using a recursive
         auxiliary function, using symbolic names WILDCARD_SINGLE and
         WILDCARD_MANY for the wildcards:

         // Recursive auxiliary - return true if pattern starting at position p
         // matches subject starting at position s
         bool matchAux(const string & pattern, const string & subject,
                       int p, int s)
         {
             int m = pattern.length(), n = subject.length();
             while (p < m)
             {
                 // WILDCARD_MANY is tested first, and even when the subject
                 // is used up, so that a trailing wildcard can match the
                 // empty sequence
                 if (pattern[p] == WILDCARD_MANY)
                 {
                     return (s < n && matchAux(pattern, subject, p, s+1))
                         || matchAux(pattern, subject, p+1, s);
                 }
                 else if (s < n && (pattern[p] == subject[s] ||
                                    pattern[p] == WILDCARD_SINGLE))
                 {
                     p ++; s ++;
                 }
                 else
                 {
                     return false;
                 }
             }
             return true;    // The entire pattern has been matched
         }

         // Return first position where pattern matches subject or -1 if no match
         int match(const string & pattern, const string & subject)
         {
             int n = subject.length();
             // start == n is allowed so that a pattern that can match the
             // empty sequence is still found at the very end of the subject
             for (int start = 0; start <= n; start ++)
                 if (matchAux(pattern, subject, 0, start))
                     return start;
             return -1;
         }

         PROJECT

      3. The extension to handle WILDCARD_SINGLE was easy. The extension to
         handle WILDCARD_MANY is the tricky part. Notice how we consider two
         possibilities:

         a. The wildcard matches at least the current character in the
            subject, and maybe more - so we consider what happens if we
            increment the position in the subject but not in the pattern.

         b. The wildcard matches no more characters in the subject, so we
            move on to the next character in the pattern while remaining at
            the same position in the subject.

         c. The fact that there are two possible ways of going at this point
            leads to the need to use a recursive auxiliary.

      4.
         It is possible - though we won't pursue it here - to extend an
         algorithm like KMP to allow the use of wildcards while still
         preserving O(n) behavior.

   D. A final issue concerns allowing generalized regular expressions to be
      used in the pattern.

      1. Both the Java and C++ standard libraries provide support for this -
         the package java.util.regex and the standard header <regex> (part
         of C++ since C++11), respectively.

      2. However, since regular expressions are a topic in CPS320 we won't
         discuss them further here.
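
   As a supplement to II.B.3, the KMP search and the computation of next[]
   described there can be sketched in C++ roughly as follows. This is a
   minimal sketch rather than the projected code: the names buildNext and
   kmpMatch are hypothetical, chosen here for illustration, and positions are
   stored as std::size_t instead of int to avoid signed/unsigned comparisons.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Build the failure table (hypothetical helper name).
// next[j] = length of the longest prefix of the pattern that is also a
// suffix of pattern[1..j]; next[0] = 0 by definition.
std::vector<std::size_t> buildNext(const std::string& pattern) {
    std::vector<std::size_t> next(pattern.size(), 0);
    std::size_t i = 1, j = 0;
    while (i < pattern.size()) {
        if (pattern[i] == pattern[j]) {
            next[i] = j + 1;        // prefix of length j+1 ends at position i
            ++i;
            ++j;
        } else if (j > 0) {
            j = next[j - 1];        // fall back to a shorter prefix, keep i
        } else {
            next[i] = 0;            // no prefix ends here
            ++i;
        }
    }
    return next;
}

// Return the first position where pattern matches subject, or -1 if none
// (hypothetical function name; mirrors the pseudocode in II.B.3.b).
long kmpMatch(const std::string& pattern, const std::string& subject) {
    const std::vector<std::size_t> next = buildNext(pattern);
    std::size_t i = 0;              // position in subject, never decreases
    std::size_t j = 0;              // position in pattern
    while (i < subject.size() && j < pattern.size()) {
        if (subject[i] == pattern[j]) {
            ++i;
            ++j;
        } else if (j > 0) {
            j = next[j - 1];        // reuse what is already known to match
        } else {
            ++i;                    // no match can begin at subject[i]
        }
    }
    if (j >= pattern.size())
        return static_cast<long>(i - j);
    return -1;
}
```

   For instance, kmpMatch("aaaab", "aaaaaaaaab") and
   kmpMatch("aaaab", "aaaacaaaab") both return 5, and buildNext("ababa")
   yields 0 0 1 2 3.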