61. Edit distance

This is a fairly advanced problem, so I'll outline a solution before I present any code. You may want to stop reading after the outline and try to implement the algorithm on your own.

The approach I'm going to describe uses an edit graph. The edit graph is a network of nodes that represents possible changes that you can make to change the start string into the end string.

The following diagram shows an edit graph that represents changing the word Dungeon into the word Dragon:

The starting word's letters fill the top row of nodes, not counting the node in the upper left corner. The end word's letters fill the left column of the nodes, also not counting the upper left corner.

Each horizontal arrow represents removing one of the starting word's letters. For example, the arrow in the top row between the n and the g represents removing the g from the string.

Each vertical arrow represents adding a letter that belongs in the end word. For example, the arrow in the left column between the D and the r represents adding the r to the word.

The shaded nodes have the same letter in their leftmost columns and topmost rows. Those nodes, which are called match points, represent leaving a letter unchanged. For example, the shaded node near the middle of the bottom row represents using the first n in Dungeon as the n in Dragon. The match points get additional arrows pointing from their upper left to indicate that we have skipped two steps that remove and re-add the same letter.

Any path through this graph represents a way to change the starting word into the end word. For example, you could move horizontally across the top row, removing all of the starting word's letters. Then, you could move down the rightmost column, adding all of the end word's letters. That would give a valid transformation, although generally not the shortest one.

Alternatively, you could move down the leftmost column, adding all of the end word's letters, and then move horizontally across the bottom row, removing all of the starting word's letters. Again, this probably isn't the shortest path.

To find the shortest path, assign a cost of 1 to all of the horizontal and vertical links. Give the match points' diagonal links the cost zero. Now, to find the smallest edit distance, you need to find the shortest path from the upper left node to the lower right node, following the links.

The preceding diagram shows the shortest path in bold. Because the match point links have zero cost, and because each of those links replaces a horizontal move plus a vertical move, the shortest path is also the one that uses the most diagonal links.

If you want to try implementing the algorithm on your own, stop reading and do so now.

The program must perform two main tasks. First, it must build the edit graph. Second, it needs to use the graph to find the shortest way to modify the first string into the second.

To perform the second task, the code needs to keep track of how it travels through the edit graph. To store information about a position within the graph, the example solution uses the following Direction enumeration and Node structure:

private enum Direction
{
FromAbove,
FromLeft,
FromDiagonal
}

private struct Node
{
public int Distance;
public Direction Direction;
}

The Direction enumeration indicates how you reached a node, whether from the node above, to the left, or diagonally to the left and above (for match points). The Node structure keeps track of the distance traveled to a spot in the edit graph and the link that we used to get to that node.

The following method builds the example solution's edit graph:

// Fill in the edit graph for two strings.
private Node[,] MakeEditGraph(string string1, string string2)
{
// Make the edit graph array.
int numCols = string1.Length + 1;
int numRows = string2.Length + 1;
Node[,] nodes = new Node[numRows, numCols];

// Initialize the leftmost column.
for (int r = 0; r < numRows; r++)
{
nodes[r, 0].Distance = r;
nodes[r, 0].Direction = Direction.FromAbove;
}

// Initialize the top row.
for (int c = 0; c < numCols; c++)
{
nodes[0, c].Distance = c;
nodes[0, c].Direction = Direction.FromLeft;
}

// Fill in the rest of the array.
char[] chars1 = string1.ToCharArray();
char[] chars2 = string2.ToCharArray();
for (int c = 1; c < numCols; c++)
{
// Fill in column c.
for (int r = 1; r < numRows; r++)
{
// Fill in entry [r, c].
// Check the three possible paths to here.
int distance1 = nodes[r - 1, c].Distance + 1;
int distance2 = nodes[r, c - 1].Distance + 1;
int distance3 = int.MaxValue;
if (chars1[c - 1] == chars2[r - 1])
{
// There is a diagonal link.
distance3 = nodes[r - 1, c - 1].Distance;
}

// See which is cheapest.
if ((distance1 <= distance2) && (distance1 <= distance3))
{
// Come from above.
nodes[r, c].Distance = distance1;
nodes[r, c].Direction = Direction.FromAbove;
}
else if (distance2 <= distance3)
{
// Come from the left.
nodes[r, c].Distance = distance2;
nodes[r, c].Direction = Direction.FromLeft;
}
else
{
// Come from the diagonal.
nodes[r, c].Distance = distance3;
nodes[r, c].Direction = Direction.FromDiagonal;
}
}
}
return nodes;
}

The method first builds an array of nodes big enough to hold all of the nodes shown in the earlier diagram. It then sets the Distance value in node i in the leftmost column to i. For example, it takes three steps to move from the upper left corner node to the third node below it. The code also sets each of those nodes' Direction values to FromAbove because the only way to get to those nodes is from the nodes above.

The code then uses a similar process to initialize the Distance and Direction values in the topmost row.

Next, the method loops through the rest of the array. There are three ways that you might be able to reach any given node, other than those in the top row and left column.

First, the cost to reach a node from the left is the cost to get to its left neighbor, plus 1 to move across the horizontal link. Similarly, the cost to reach a node from above is the cost to get to its upper neighbor, plus 1 to move down the vertical link.

The cost to move across a diagonal link is zero, so the total distance to a node via its diagonal link is the cost to get to its upper left neighbor plus zero. Note that the diagonal link only exists if the corresponding characters in the two words match, so this is a match point.

After calculating the distance to a node via each of the three possible methods, the code picks the cheapest route for that node and records it in the node's Distance and Direction values.

After initializing the edit graph, the MakeEditGraph method returns its array of nodes.

The second main task that the program must perform is to read the shortest sequence of edits out of the edit graph.

Learning the shortest edit distance is easy. It's stored in the Distance value of the node in the lower right corner of the graph. Each increment in the distance represents adding or removing a letter, so that node's Distance value tells you the number of moves it took to make the full conversion.

Retracing the steps used is also relatively easy. Start at the lower right node and follow the nodes' Direction values back through the graph until you reach the start node. This gives you the path backwards, but you can simply reverse it to get the path forward from the start node to the end node.

Understanding exactly what each move means is a bit more confusing. The example solution uses the following method to generate a list of steps taken during the transformation:

// Make a list showing the transformation from word1 to word2.
private List<string> DescribePath(Node[,] graph,
string word1, string word2)
{
// Build the path backward.
int numRows = graph.GetUpperBound(0) + 1;
int numCols = graph.GetUpperBound(1) + 1;
int r = numRows - 1;
int c = numCols - 1;

List<string> moves = new List<string>();
string word = word2;
moves.Add("End with: " + word);

while ((r > 0) || (c > 0))
{
switch (graph[r, c].Direction)
{
case Direction.FromAbove:
// We added letter r. Remove it.
moves.Add("Add " + word2.Substring(r - 1, 1) +
" to get: " + word);
word = word.Remove(r - 1, 1);
r--;
break;
case Direction.FromLeft:
// We removed letter c. Re-add it.
moves.Add("Remove " + word1.Substring(c - 1, 1) +
" to get: " + word);
word = word.Insert(r, word1.Substring(c - 1, 1));
c--;
break;
case Direction.FromDiagonal:
// We did nothing.
//moves.Add("Keep " + word1.Substring(c - 1, 1) +
// " to get: " + word);
r--;
c--;
break;
}
}
moves.Add("Start with word: " + word1);

// Reverse the moves.
moves.Reverse();
return moves;
}

This method sets the variables r  and c to the row and column of the lower right node. It creates a list to hold the transformation's steps and sets the word variable equal to the end word.

The code then enters a loop that continues until r and c reach the upper left corner node. Inside the loop, the code checks the current node's Direction value and takes one of three actions.

If the Direction value is FromAbove, then we got to this node from the node above by adding the letter in position r – 1 in the end word. The code adds a statement indicating what it is doing to the moves list and removes the letter. It then decrements r to move to the node above.

If the Direction value is FromLeft, then we got to this node from the node to the left by removing the original word's letter in position r from position c – 1. The code adds a statement indicating what it is doing to the moves list and inserts the letter. It then decrements c to move to the node on the left.

Finally, if the Direction value is FromDiagonal, then we got to this node by following a match point link. The program did not make a change to the word at this step, so the method adds nothing to the moves list. It then decrements both r and c to move to the node above and to the left.

After it finishes following nodes through the edit graph, the method adds a message, indicating its starting word. It then reverses the moves list to put the moves in their proper order and returns the reordered list of moves.

Download the EditDistance example solution to see additional details.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset