The Bad Character Rule

The idea of the bad character rule is to identify mismatches between a character in the pattern and a character in the text so that we can safely skip certain shifts. To identify the occurrence of a bad character, let's look at the example in the following table:

i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
T H C B B A H C C A B A H A H B C C
P A B A H A H
Table 5.1: Identifying bad characters

In the example provided in Table 5.1, we successfully matched the suffix AH, but then arrived at a bad character, since B != H. Whenever this happens, we're sure that it will only be possible to find a valid shift starting from the next shift that solves this mismatch. This means that we can shift P until either of the following conditions are true:

  • The mismatch is turned into a match
  • The pattern moves past the mismatched character

We can turn a mismatch into a match whenever the pattern has characters to the left of the mismatched character that match the character in the text. Otherwise, we must move the pattern past the mismatched character. In the example provided in Table 5.1, we have another B at P[1], so we can shift P until P[1] aligns with T[3] as follows:

i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
T H C B B A H C C A B A H A H B C C
P A B A H A H
Table 5.1.1: Using the bad character rule to skip a shift

We've safely skipped the check for 1 shift. Now, we have a mismatch right in the first character. Let's try to apply the bad character rule. First, let's see if we can turn the mismatch into a match.

Unfortunately, that is not possible because the character C is absent from P. In this case, we shift the pattern past the mismatched character as follows:

i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
T H C B B A H C C A B A H A H B C C
P A B A H A H
Table 5.1.2: Pattern moving past a mismatched character

We've successfully skipped checking five shifts and have arrived at a valid shift.

The bad character rule will help us optimize the naive string matching algorithm, but only if we can efficiently find the correct number of times to shift. Let's assume we have access to a two-dimensional array [1...m][1...e], e being the size of our alphabet. For convenience, let's call this array left and assume that left[i][j] gives us the closest index k of character j in P so that k < i, or is -1 if character j isn't found to the left of i in P. If we're able to build such an array, we could improve our naive string search algorithm by considering possibly larger skips (given by the information in left). The following code snippet shows how we can use the left array to improve our naive string searching algorithm as follows:

int skip;
for (int i = 0; i < n - m + 1; i += skip) {
skip = 0;
for (int j = m - 1; j >= 0; j--) {
if (P.charAt(j) != T.charAt(i + j)) {
skip = Math.max(1, j - left[j][T.charAt(i + j)]);
break;
}
}
if (skip == 0) {
shifts.add(i);
skip = 1;
}
}
Snippet 5.3: Using the bad character rule to improve our skips

Go to https://goo.gl/cCYnfp to access this code.

We're left to filling in the left array, which will be performed in the next activity.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset