The Good Suffix Rule

The good suffix rule presents a complementary method to enhance our search for valid shifts. To identify when the good suffix rule is applicable, let's look at the example given in the following table:

i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
T A A B A B A B A C B A C A B B C A B
P A A C C A C C A C
Table 5.2: Illustration of the good suffix rule

When found in a situation where we have matched a suffix of P but have found a mismatch, using the good suffix rule, and considering t as the matched suffix, we can try to find the next shift that solves the mismatch by carrying out either of the following cases:

  • Find another occurrence of t to the left in P
  • Find a prefix of P which matches a suffix of t
  • Move P past t

Considering case 1, we can try to shift P by three to align other occurrences of t in P (starting at P[4]). As we can see, the letter to the left of that occurrence of t (in P[3]) is C, which is exactly the same as the one that provoked the mismatch. Therefore, we should always try to find a t that is followed, on the left, by a character that is different from the one that provoked the mismatch. A variant of the good suffix rule which ignores the character on the left of t is called the weak good suffix rule.


The good suffix rule takes into account that the character on the left of 
t is also called the strong good suffix rule.

If we can't find another occurrence of t in P, the best we can do with this rule is to find a prefix of P that matches a suffix of t, entering case 2. Table 5.3 illustrates this case:

i 0 1 2 3 4 5 6 7 8 9 10
T A A B A B A B A C B A
P A B B A B
Table 5.3: Finding prefix of P matching the suffix of T

In this case, we found a mismatch at P[1], but we can't find another occurrence of BAB to the left of it. We can, however, find a prefix of AB that matches a suffix of t AB and shift P so that these align.

Whenever we can neither find another occurrence nor a prefix of t, we're left with moving P past t in T.

The implementation of the good suffix rule also requires some preprocessing on P. To understand the preprocessing that is necessary, we need to introduce the concept of a border and proper prefix and suffix. A prefix of string S is a substring of S that occurs at the beginning of S. A proper prefix of string S is a prefix of S that is different than S (consider that S is always a prefix of S).

A suffix of string S is a substring of S that occurs at the end of S. A proper suffix of string S is a suffix of S that is different from S (consider that S is always a suffix of S). A border is a substring of a given string that is both a proper prefix and a proper suffix. For example, given the string ccacc, there are two borders: c and cc. cca is not a border.

The preprocessing step for the good suffix rule is divided into two steps: one for case 1 of the rule, and another for case 2.

In case 1, the matching suffix is a border of a suffix of a pattern. For example, if P = AACCACCAC and we have t = AC (a suffix of P), then we need to find a suffix of P that has AC as a prefix (constituting a border of the suffix). The string ACCAC is a suffix of P and has AC as a border.

Therefore, we need to find the borders of the suffixes of the pattern. But, even after finding them, we need to be able to map a given border to the shortest suffix that has this border so that we're able to shift accordingly. Moreover, to follow the strong good suffix rule, the border cannot be extended to the left by the same symbol that caused the mismatch.

The preprocessing algorithm for case 1 is displayed in the following snippet:

int i = m, j = m + 1;
int[] f = new int[m + 1];
int[] s = new int[m + 1];
f[i] = j;
while (i > 0) {
while (j <= m && P.charAt(i - 1) != P.charAt(j - 1)) {
if (s[j] == 0)
s[j] = j - i;
j = f[j];
}
i--; j--;
f[i] = j;
}
Snippet 5.4: Preprocessing algorithm for Case 1 of the good suffix rule. Source class name: GoodSuffixRule
Go to https://goo.gl/WzGuVG to access this code.
To better understand the preprocessing algorithm for
case 1, put some println statements on the relevant steps of the algorithm and run it using some sample input. You can use string ABBABAB, whose output is shown in Table 5.4.

In the previous snippet, we compute an array f, whose entries f[i] contain the starting position of the widest border of the suffix of the pattern that starts at position i. f[m] is equal to m + 1, as the empty string has no border. The idea behind the previously shown preprocessing algorithm is to compute each border by checking whether a shorter border that is already known can be extended to the left by the same symbol. The array s is used to store shift distances; we can save entries in array s whenever we can't extend a border to the left (when P[i - 1] != P[j - 1]), provided that s[j] is not already occupied.

To better understand what this algorithm produces, let's look at its output for string ABBABAB, which is shown in the following table:

i 0 1 2 3 4 5 6 7
P A B B A B A B
f 5 6 4 5 6 7 7 8
s 0 0 0 0 2 0 4 1
Table 5.4: Output of the preprocessing algorithm for Case 1 of the good suffix rule with string ABBABAB

The widest border of suffix BABAB, which starts at 2, is BAB, which starts at 4, and therefore f[2] = 4. The widest border of suffix AB, which starts at 5, is "", which starts at 7. Therefore, f[5] = 7. The suffix BABAB, whose widest border is BAB, cannot be extended to the left (since P[1] != P[3]). Therefore, the shift distance of BAB is matched and then a mismatch occurs, which is s[4] = 4 - 2 = 2. The suffix BABAB has border B as well, which also cannot be extended to left, which ensures that s[6] = 6 - 2 = 4. The suffix B beginning at position 6 has border "", beginning at position 7; therefore, s[7] = 7 - 6 = 1, which corresponds to the shift distance if nothing has matched.

In case 2, a suffix of the matching suffix of the pattern occurs at the beginning of the pattern, which constitutes a border of the pattern. Therefore, the pattern can be shifted as far as its widest border allows. What we need to do for the preprocessing step for case 2 is to find, for each suffix, the widest border of the pattern that is contained in that suffix. We can build upon the f array that was previously computed to do that. The following snippet illustrates this:

j = f[0];
for (i = 0; i <= m; i++) {
if (s[i] == 0)
s[i] = j;
if (i == j)
j = f[j];
}
Snippet 5.5: Preprocessing algorithm for Case 2 of the good suffix rule. Source class name: GoodSuffixRule

Go to https://goo.gl/ckoTu6 to access this code.

The widest border of the pattern is stored at f[0]. The idea of the preprocessing algorithm for case 2 is to use that value until the pattern becomes shorter than f[0], in which case we go with the next wider border of the pattern (f[j]).

Integrating the good suffix case with the naive search algorithm allows us to improve on the skips performed, as shown in the following code:

for (i = 0; i < n - m + 1; i += skip) {
boolean hasMatch = true;
skip = 0;
for (j = m - 1; j >= 0; j--) {
if (P.charAt(j) != T.charAt (i + j)) {
skip = s[j + 1];
hasMatch = false;
break;
}
}
if (hasMatch) {
shifts.add(i);
skip = s[0];
}
}
Snippet 5.6: The Boyer-Moore algorithm using only the good suffix rule. Source class name: Goodsuffixrule

Go to https://goo.gl/1uCgeh to access this code.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset