63. Longest common substring

The following code uses the obvious approach described in the problem statement that examines all of the first string's possible substrings:

// Find the longest substring by examining every possible
// substring in string1.
private string FindLongestSubstring1(string string1, string string2)
{
string bestSubstring = "";
int bestLength = 0;

// Loop over all possible starting positions.
for (int startPos = 0; startPos < string1.Length - 1; startPos++)
{
// Loop over possible lengths starting at this position.
for (int length = bestLength + 1;
length <= string1.Length - startPos;
length++)
{
string testSubstring = string1.Substring(startPos, length);
int testPos = string2.IndexOf(testSubstring);
if (testPos >= 0)
{
bestLength = length;
bestSubstring = testSubstring;
}
}
}

return bestSubstring;
}

This method initializes the bestSubstring and bestLength variables to indicate that it hasn't found a solution.

The code then enters two nested loops. First, the code loops over all possible starting positions in the first string.

For each starting position, the code loops over all possible lengths that are greater than the longest length found so far and that a substring can have starting at that position. There's no point looking at shorter lengths because they would be shorter than the best one found so far. There's also no point looking at a string of length 10 if the position is 2 characters from the end of the string.

Within the inner loop, the code uses the second string's IndexOf method to see if the test substring appears in the second string. If the test substring is present then the code updates the best solution it has found.

After it finishes examining every possible substring, the method returns the longest substring that it found.

There's actually an easy way to make this method much faster. You may want to stop reading for a minute and try to figure it out on your own.

Suppose the inner loop is examining substrings that start at position startPos, and string2 does not contain a substring of a given length. In that case, it cannot contain any longer substrings starting at position startPos either.

For example, suppose string1 is MAYBE and string2 is ALWAYS. Now, suppose the outer loop is considering position 1 in string1. The possible substrings at that position are A, AY, AYB, and AYBE.

The inner loop checks to see if substring A appears in string2 and finds it. It then looks for substring AY and finds that one, too. The loop then looks for substring AYB, but that substring isn't in string2. Because AYB isn't present, that means the longer substring, AYBE, also must not be present. When the inner loop fails to find AYB, it can exit the loop without finishing it. This test saves a lot of time because, unless the strings have a really strange structure, most common substring tests will fail for relatively short substrings.

The following code shows the modified method with the new test highlighted in bold:

// Find the longest substring by examining every possible
// substring in string1.
private string FindLongestSubstring1(string string1, string string2)
{
string bestSubstring = "";
int bestLength = 0;

// Loop over all possible starting positions.
for (int startPos = 0; startPos < string1.Length - 1; startPos++)
{
// Loop over possible lengths starting at this position.
for (int length = bestLength + 1;
length <= string1.Length - startPos;
length++)
{
string testSubstring = string1.Substring(startPos, length);
int testPos = string2.IndexOf(testSubstring);
if (testPos < 0) break;

bestLength = length;
bestSubstring = testSubstring;
}
}

return bestSubstring;
}

This method is much faster than the original version, particularly for long strings.

This version is fast enough to be practical, but I want to describe another method, mostly because it demonstrates a useful dynamic programming technique. Dynamic programming is any technique that uses known information to calculate new information. Solution 5. Fibonacci numbers, provides a good example. One of the approaches for finding Fibonacci numbers used a table of previously calculated values to find new values.

For the longest substring problem, you can build an array that gives the lengths of the longest substrings that end at a particular letter in each word. For example, if the array is called num, then num[i, j] is the length of the longest substring that ends at position i in the first string and position j in the second string.

The following diagram shows the array for the strings IGUANA and BANANA. The bold entries show the places where the two strings have common substrings. For example, num[6, 4] holds the value 3, indicating that the substring ANA with length 3 ends at that position. (When numbering the rows and columns, ignore the picture's first row and column, which show the words' letters and are not really part of the array.):

After you build this table, you can find the longest substring by finding the biggest value and then looking backwards through the strings. The real trick is building the table, and that turns out to be easier than you might imagine.

The first row and column in the table are filled with 0s to make later bookkeeping easier.

Suppose you've filled in the table's rows up to row R - 1 and now you're considering row R, column C. You need to figure out the longest substring that ends with the letter at position C in the first string and the letter at position R in the second. There are two possible cases.

First, suppose the two strings' letters at those positions are different. In that case, there is no matching substring here, so the table's entry num[R, C] should be 0.

Second, suppose the two strings' letters at those positions are the same. In that case, this will be num[R, C] should be num[R – 1, C – 1] + 1. In other words, the longest substring at position [R, C] is the same as the longest substring at position [R – 1, C – 1], with the new character at position [R, C] added at the end.

For example, look again at the previous table and suppose you’re looking at num[7, 7] in the lower right corner. That position corresponds to the second A in IGUANA and the third A in BANANA. The letters match, so you look one row up and one column left at num[6, 6]. That value is 2, so num[7, 7] = 2 + 1 = 3.

If you think about the substrings, this means we're adding A to the end of the previously found matching substring, AN, to get ANA.

You might want to stop reading at this point and try to implement this new algorithm. The following code shows the implementation used by the example solution:

// Use a dynamic table to find the longest substring.
private string FindLongestSubstring3(string string1, string string2)
{
int[,] num = new int[string1.Length + 1, string2.Length + 1];
int bestLength = 0;
int bestEnd = -1;

for (int i = 1; i < string1.Length + 1; i++)
{
char ch1 = string1[i - 1];
for (int j = 1; j < string2.Length + 1; j++)
{
char ch2 = string2[j - 1];

// Compare the characters.
if (ch1 != ch2)
// They don't match. No matching substring here.
num[i, j] = 0;
else
{
// They match.
num[i, j] = num[i - 1, j - 1] + 1;

// See if this is better than the best found so far.
if (num[i, j] > bestLength)
{
bestLength = num[i, j];
bestEnd = i;
}
}
}
}
int bestStart = bestEnd - bestLength + 1;
return string1.Substring(bestStart - 1, bestLength);
}

In theory, this last approach is probably the fastest of the three described here. In practice, however, the second approach is slightly faster. I suspect that's because it takes advantage of the string class's IndexOf method, which is remarkably fast.

Download the LongestCommonSubstring example solution to see additional details.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset