Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

5. Approximate search

Thomas Mailund¹

(1)

Aarhus N, Denmark

We are not always satisfied with finding locations where a pattern matches exactly. Sometimes we could, for example, want to find all occurrences of a word and include occurrences where the word is misspelt. Searching for “almost matches” is called approximate search. We will look at a suffix tree–based approach and a BWT-based approach. Considering the experimental results from the last chapter, it might seem odd to consider a BWT approach instead of a suffix array approach, but there is a reason for this: we can add a trick to the BWT approach to make it faster than the suffix tree solution that was the most efficient for exact search.

Local alignment and CIGAR notation

An approximative match is one where we can edit the key to make it match at a given location. The operations we can do are

Substitute, or replace, one character for another
Insert a character
Delete a character

If we have a match of such an edited pattern at a given location, we say that we can align the pattern there. The idea is that we conceptually put the pattern on top of the reference string at the location where we have an approximative match. We then describe how the reference string should be edited to become the reference string, that is, whether we should change a character in the reference, a substitution, delete a character that is in the reference but not the pattern, or insert a character in the pattern string that is not in the reference.

For matching and substitution strings, we have one character on top of another. When we insert a character, we show it as a dash in the reference string (the inserted character does not match any of the characters in the reference, but this is where it should be inserted). If we delete a character, we put a dash in the reference string (the character would have been at this position if we hadn’t deleted it). The following are two examples:

AACTTTCTGAA

...TTAAAAAATTTCT-AACAACA...

* *

AACTTTCTG-AA

...TGGAAAA-TTTCTGGAATGGAT...

* *

The first alignment has a substitution where A should replace the C in the pattern, and it has an insertion of G (shown as a dash in the string we align to; it isn’t at that position but it is where we must insert it). The second alignment has an insertion of C (also shown as a dash in the string below) and a deletion of G.

This type of alignment is also called a local alignment . A global alignment is one where we edit one string to match the entirety of another string. We will not consider global alignments in this book.

When we report an occurrence of an approximative match, we want to report both the position of the occurrence and how the pattern must be edited to get the match. Reporting an alignment is not convenient but what we can do is to report the transformations needed on the pattern. For example, we can report matches as M, substitutions as S, insertions as I, and deletions as D as a string. The sequence of operations we perform for the alignments earlier is MMSMMMMMIMM and MMIMMMMMMDMM. The CIGAR format is a compressed form of this sequence of events. Whenever there is a sequence of the same operation, it contains the length of the sequence followed by the operation. It does not distinguish between M and S; if we know that one character should go above another, we can always check if we have a match or substitution anyway. The CIGAR for the first alignment above is 8M1I2M and the CIGAR for the second is 2M1I6M1D2M. The reason that there are 8 Ms in the first CIGAR, although there are only two Ms in the string of operations, is that we use M for substitutions as well in CIGARs.

When matching a pattern, I find it easiest to collect each transformation step individually, that is, the first format earlier, and then transform it into a CIGAR string to report once I am done. The following edits_to_cigar() function will take a string where each edit is represented by a letter and produce the corresponding CIGAR string:

static const char *scan(

const char *edits

) {

const char *p = edits;

while (*p == *edits)

++p;

return p;

}

void edits_to_cigar(

char *cigar_buffer,

const char *edits

) {

while (*edits) {

const char *next = scan(edits);

cigar_buffer = cigar_buffer + sprintf(

cigar_buffer, "%d%c",

(int)(next - edits), *edits

);

edits = next;

}

*cigar_buffer = '';

}

The code might be a little hard to decipher. The sprintf() function writes to the front of the cigar_buffer and returns the length of the string it writes. When we add that length to cigar_buffer, we get a pointer to where we should continue writing next time we call sprintf(). The (int)(next - edits) expression gives us the number of characters we have scanned past, and *edits is the operation we repeated for this number of times.

cigar_buffer = cigar_buffer + sprintf(

cigar_buffer, "%d%c",

(int)(next - edits), *edits

);

Brute force approach

The straightforward way to do approximative matching is to construct all the patterns at a certain edit distance (number of changes) from the pattern and then do an exact search. We call all such strings the edit cloud around the pattern, and we can construct it recursively.

Building an edit cloud

An easy way to build a string and a CIGAR from a pattern is to recursively handle the three operations (four if you separate matching into (actual) matchind and mismatching). Assume we have processed the pattern up to some point, for example, a pattern_front pointer . Put the modified string in a buffer and have a pointer, string_front , pointing at the next position we should add characters to, and have the edits so far in a buffer where pointer edit_fronts points to the next position where we should add to the buffer. The situation is shown in Figure 5-1.

If we want to add an insertion to the pattern, we skip past the current symbol in the pattern, we do not add anything to the string, but we record the operation in the edits. The pattern_front and edits_front are incremented. If it seems odd to you that we do not add a symbol to the string in an insertion operation, then remember that insertion is something we do to transform the string into the pattern. The symbol we skip past in the pattern is the one we have inserted there. Substitutions and matching are the same operation (except that substitutions increase the edit distance). We add a character to the string and increment string_front , increment pattern_front past the character matched or substituted to, and record the operation in the edits buffer. For deletion, we do almost the same as for matching. We add a character to the string and increment string_front, and we add a D to the edits buffer. The difference to matching is that we do not increment pattern_front. The character we insert into the string is deleted in the pattern so we should continue the recursion from the current position.

The idea is implemented in the following function; I will explain the at_beginning variable after the code listing.

void recursive_generator(

const uint8_t *pattern_front,

const uint8_t *alphabet,

// To avoid initial deletions.

bool at_beginning,

// Write the edited string here.

uint8_t *string_front,

// Holds the beginning of full buffer

// so we can report the string.

uint8_t *string,

// We write the output cigar here.

char *cigar,

// We build the edit string here.

char *edits_front,

// and use the beginning of the edits buffer

// when we report

char *edits,

int max_edit_distance)

{

if (*pattern_front == '') {

// No more pattern to match...

// Terminate the buffer and report.

*string_front = '';

*edits_front = '';

edits_to_cigar(cigar, edits);

report(string, cigar);

} else if (max_edit_distance == 0) {

// We can't edit anymore, so just move the

// pattern to buffer and report.

uint32_t rest = strlen((char *)pattern_front);

for (uint32_t i = 0; i < rest; ++i) {

string_front[i] = pattern_front[i];

edits_front[i] = 'M';

}

string_front[rest] = cigar[rest] = '';

edits_to_cigar(cigar, edits);

report(string, cigar);

} else {

// RECURSION

// Insertion

*edits_front = 'I';

recursive_generator(pattern_front + 1,

alphabet,

false,

string_front, string,

cigar,

edits_front + 1, edits,

max_edit_distance - 1);

// Deletion

if (!at_beginning) {

for (const uint8_t *a = alphabet; *a; a++) {

*string_front = *a;

*edits_front = 'D';

recursive_generator(pattern_front,

alphabet,

at_beginning,

string_front + 1,

string,

cigar,

edits_front + 1, edits,

max_edit_distance - 1);

}

// Match/substitution

for (const uint8_t *a = alphabet; *a; a++) {

if (*a == *pattern_front) {

*string_front = *a;

*edits_front = 'M';

recursive_generator(pattern_front + 1,

alphabet,

false,

string_front + 1,

string,

cigar,

edits_front + 1, edits,

max_edit_distance);

} else {

*string_front = *a;

*edits_front = 'M';

recursive_generator(pattern_front + 1,

alphabet,

false,

string_front + 1,

string,

cigar,

edits_front + 1, edits,

max_edit_distance - 1);

}

../images/490521_1_En_5_Chapter/490521_1_En_5_Fig1_HTML.png — Figure 5-1
Recursion for constructing an edit cloud

I use the at_beginning variable to avoid initial deletions. We cannot easily avoid that many sequences of edits can lead to the same string, but we know that all edits that start or end with deletions will have the same string as the edits that do not include them. Whenever we reach the end of the pattern, we report the result, and we do not add possible deletions as results as well. So terminal deletions are naturally avoided. We can avoid initial deletions if we never do a deletion unless we have already done an insertion or a match. The at_beginning parameter ensures exactly that.

To get an iterator version of the recursion, we need an explicit stack. Stack frames will contain the information we need for an operation and information about which operation to do. I found it easiest to have four operations: one that creates the recursive calls and three operations for insertion, deletion, and matches. Stack frames look like this:

enum edit_op {

RECURSE,

INSERTION,

DELETION,

MATCH

};

struct deletion_info {

char a;

};

struct match_info {

char a;

};

struct edit_iter_frame {

enum edit_op op;

// The character we should delete or match

uint8_t a;

// Have we inserted or matched yet?

bool at_beginning;

// Fronts of buffers

const uint8_t *pattern_front;

uint8_t *string_front;

char *cigar_front;

// Number of edits left

int max_dist;

// The rest of the stack

struct edit_iter_frame *next;

};

Deletion and match operations need to know which character to add to the string. In the recursive function, we put those characters into the buffer before we called recursively. With the explicit stack, we need to iterate through the alphabet to push operations, and if we just update the string buffer, we would override all but the last character we put there. Instead, we remember the character in the stack frame and add it to the string when we get to the operation.

Pushing information to the stack is straightforward.

static struct edit_iter_frame *

push_edit_iter_frame(

enum edit_op op,

bool at_beginning,

const uint8_t *pattern_front,

uint8_t *string_front,

char *cigar_front,

int max_dist,

struct edit_iter_frame *next

) {

struct edit_iter_frame *frame =

malloc(sizeof(struct edit_iter_frame));

frame->op = op;

frame->at_beginning = at_beginning;

frame->pattern_front = pattern_front;

frame->string_front = string_front;

frame->cigar_front = cigar_front;

frame->max_dist = max_dist;

frame->next = next;

return frame;

}

An iterator will hold the beginning of the buffers, a cigar buffer that we use to translate the edits string into a CIGAR representation, and a pointer to the remainder of the stack below the frame.

struct edit_iter {

const uint8_t *pattern;

const char *alphabet;

uint8_t *string;

char *edits;

char *cigar;

struct edit_iter_frame *frames;

};

When we initialize an iterator, we allocate the buffers we need. We can never have more than twice the string length edits—regardless of the maximum edit distance, we will explore. If we remove all characters and then insert them again, we get this maximum, and that is the most distant string we can ever create. So that is an upper bound on the size of the buffers we need. After allocating the buffers, we push the first recursion unto the stack.

void init_edit_iter(

struct edit_iter *iter,

const uint8_t *pattern,

const char *alphabet,

int max_edit_distance

) {

uint32_t n = 2 * (uint32_t)strlen((char *)pattern);

iter->pattern = pattern;

iter->alphabet = alphabet;

iter->string = malloc(n); iter->string[n - 1] = '';

iter->edits = malloc(n); iter->edits[n - 1] = '';

iter->cigar = malloc(n);

iter->frames = push_edit_iter_frame(

RECURSE,

true,

iter->pattern,

iter->string,

iter->edits,

max_edit_distance,

);

}

When we increment the iterator, we check if we have reached the end of the pattern or the total number of edits allowed. In either case, we report an occurrence. If not, we perform the operation from the frame and then proceed to the next frame in a recursive call. There is not much more to say about the function; it closely follows the recursive version.

struct edit_pattern {

const uint8_t *pattern;

const char *cigar;

};

bool next_edit_pattern(

struct edit_iter *iter,

struct edit_pattern *result

) {

if (iter->frames == 0) return false;

// Pop top frame

struct edit_iter_frame *frame = iter->frames;

iter->frames = frame->next;

const uint8_t *pattern = frame->pattern_front;

uint8_t *buffer = frame->string_front;

char *cigar = frame->cigar_front;

if (*pattern == '') {

// No more pattern to match...

*buffer = '';

*cigar = '';

edits_to_cigar(iter->cigar, iter->edits);

result->pattern = iter->string;

result->cigar = iter->cigar;

free(frame);

return true;

} else if (frame->max_dist == 0) {

// We can't edit anymore, so just move

// pattern to the string and report.

uint32_t rest = (uint32_t)strlen((char *)pattern);

for (uint32_t i = 0; i < rest; ++i) {

buffer[i] = pattern[i];

cigar[i] = 'M';

}

buffer[rest] = cigar[rest] = '';

edits_to_cigar(iter->cigar, iter->edits);

result->pattern = iter->string;

result->cigar = iter->cigar;

free(frame);

return true;

}

switch (frame->op) {

case RECURSE:

for (const char *a = iter->alphabet; *a; a++) {

if (!frame->at_beginning) {

iter->frames = push_edit_iter_frame(

DELETION,

false,

frame->pattern_front,

frame->string_front,

frame->cigar_front,

frame->max_dist,

iter->frames

);

iter->frames->a = *a;

}

iter->frames = push_edit_iter_frame(

MATCH,

false,

frame->pattern_front,

frame->string_front,

frame->cigar_front,

frame->max_dist,

iter->frames

);

iter->frames->a = *a;

}

iter->frames = push_edit_iter_frame(

INSERTION,

false,

frame->pattern_front,

frame->string_front,

frame->cigar_front,

frame->max_dist,

iter->frames

);

break;

case INSERTION:

*cigar = 'I';

iter->frames = push_edit_iter_frame(

RECURSE,

false,

frame->pattern_front + 1,

frame->string_front,

frame->cigar_front + 1,

frame->max_dist - 1,

iter->frames

);

break;

case DELETION:

if (frame->at_beginning) break;

*buffer = frame->a;

*cigar = 'D';

iter->frames = push_edit_iter_frame(

RECURSE,

false,

frame->pattern_front,

frame->string_front + 1,

frame->cigar_front + 1,

frame->max_dist - 1,

iter->frames

);

break;

case MATCH:

if (frame->a == *pattern) {

*buffer = frame->a;

*cigar = 'M';

iter->frames = push_edit_iter_frame(

RECURSE,

false,

frame->pattern_front + 1,

frame->string_front + 1,

frame->cigar_front + 1,

frame->max_dist,

iter->frames

);

} else {

*buffer = frame->a;

*cigar = 'M';

iter->frames = push_edit_iter_frame(

RECURSE,

false,

frame->pattern_front + 1,

frame->string_front + 1,

frame->cigar_front + 1,

frame->max_dist - 1,

iter->frames

);

}

break;

}

free(frame);

return next_edit_pattern(iter, result); // recurse...

}

When we are done with the iterator, we need to remove what might remain of the stack (in case we stop iterating before we reach the end of it) and free the buffers.

void dealloc_edit_iter(

struct edit_iter *iter

) {

struct edit_iter_frame *frame = iter->frames;

while (frame) {

struct edit_iter_frame *next = frame->next;

free(frame);

frame = next;

}

free(iter->string);

free(iter->edits);

free(iter->cigar);

}

Once you have created the edit cloud for a pattern, you can use any exact pattern-matching algorithm of your choosing. The number of strings grows exponential with the maximum edit distance, however. We cannot get around this in general; the recursion works that way. But if we do the search at the same time as we explore edits, we can break when we know that we cannot match a pattern further. With suffix trees and BWT search, we can do exactly that.

Suffix trees

If we search for approximative matches in a suffix array, we can stop our search if we exceed the number of edits we allow, simply by aborting the recursive search. In my implementation, I collect all CIGARs and roots of a matching hit when I initialize the iterator. It is possible to iterate one hit at a time, but the code gets substantially harder to read, so I have chosen this approach. I have also decided to write the search as a recursive function. I do not expect that patterns are very long, so the recursion doesn’t get too deep. It is trivial to replace the recursion stack with an explicit stack if the recursion depth becomes a problem. I collect the hits in two vectors, one that holds the root of the tree where we have hits and one that holds the CIGAR used for this hit. I need to copy the CIGARs the function generates when I store the hits, and for that, I use this function:

uint8_t *str_copy(const uint8_t *x)

{

uint32_t n = strlen((char *)x);

uint8_t *copy = malloc(sizeof(uint8_t) * n + 1);

strncpy((char *)copy, (char *)x, n);

copy[n] = 0;

return copy;

}

The iterator will hold the suffix tree, so we have access to it when we needed it. It also holds the two vectors we collect hits in. For iterating through the hits, we have a flag, processing_tree, that tells us if we are in the middle of traversing a tree or not. Finally, it holds an index that tells us which tree we are processing and a leaf iterator for doing the actual processing.

struct st_approx_match_iter {

struct suffix_tree *st;

struct pointer_vector nodes;

struct string_vector cigars;

bool processing_tree;

uint32_t current_tree_index;

struct st_leaf_iter leaf_iter;

};

When searching for the hits, we also need three string buffers—one that contains the operations we have so far, a pointer to the beginning of this buffer for when we need to construct a CIGAR from it, and then a buffer in which we construct the CIGAR.

struct collect_nodes_data {

struct st_approx_match_iter *iter;

char *edits_start;

char *edits;

char *cigar_buffer;

};

The iterator initialization closely follows the recursion we used in the previous section. A difference is that we search along an edge in the suffix tree using a pointer x to look at the current character we are processing and another, end, that is the end of the edge. We also have a pointer to where we currently are in the patter, p, and a pointer to where we are in the edits string, edits. Then, we have a flag, at_beginning, that tells us if we have seen insertions or matches yet so we avoid initial deletions (similar to what we did earlier). Finally, we have a counter that keeps track of how many edit operations we have left.

The function should be relatively easy to read. We will terminate the search if we reach the end of the string the suffix was built from.

If we reach that point, we know we cannot continue matching. If we do not have any edits left, we also terminate the search. If we have reached the end of the pattern, we report the result by adding the current CIGAR and the current node to the vectors. If we reach the end of the edge we are scanning, we will continue searching from the node’s children. If none of the previously discussed applies, then we recurse with the different edit operations.

static void collect_approx_hits(

struct collect_nodes_data *data,

struct suffix_tree_node *v,

bool at_beginning,

const uint8_t *x, const uint8_t *end,

const uint8_t *p,

char *edits,

int edits_left

) {

struct suffix_tree *st = data->iter->st;

// We need to know this one so we never move past the end

// of the string (and access memory we shouldn't).

const uint8_t *string_end = st->string + st->length;

if (x == string_end)

return; // Do not move past the end of the buffer.

if (edits_left < 0) {

// We have already made too many edits.

return;

}

if (*p == '') {

// A hit. Save the data in the iterator.

*edits = '';

edits_to_cigar(

data->cigar_buffer,

data->edits_start

);

string_vector_append(

&data->iter->cigars,

str_copy((uint8_t*)data->cigar_buffer));

pointer_vector_append(

&data->iter->nodes, (void *)v);

return;

}

if (x == end) {

// We ran out of edge: recurse on children.

recurse_children(

data, v,

at_beginning,

edits, p, edits_left);

return;

}

if (edits_left == 0 && *x != *p) {

// We cannot do any more edits and

// we need at least a substitution.

return;

}

// Recursion

int match_cost = *p != *x;

*edits = 'M';

collect_approx_hits(

data, v,

false,

x + 1, end,

p + 1,

edits + 1,

edits_left - match_cost,

);

if (!at_beginning) {

*edits = 'D';

collect_approx_hits(

data, v,

false,

x + 1, end,

p, edits + 1,

edits_left - 1

);

}

*edits = 'I';

collect_approx_hits(

data, v,

false,

x, end,

p + 1, edits + 1,

edits_left - 1

);

}

I have moved the code for recursing on a node’s children to a separate function that looks like this:

static void recurse_children(

struct collect_nodes_data *data,

struct suffix_tree_node *v,

bool at_beginning,

char *edits,

const uint8_t *p,

int max_edits

) {

struct suffix_tree_node *child = v->child;

while (child) {

const uint8_t *x = child->range.from;

const uint8_t *end = child->range.to;

collect_approx_hits(data, child, at_beginning,

x, end, p, edits, max_edits);

child = child->sibling;

}

When we initialize the iterator, we allocate the strings we use to build CIGARs and the vectors we use to collect the hits, and then we collect the hits recursively. We also initialize the leaf iterator. This instantiation of the leaf iterator is not used for anything—we initialize it again in the next function—but by always keeping the iterator initialized, we know that the deallocation function can always release the resources in it. At the end of the initialization, we mark that we are not in the process of iterating through leaves—the first step in the next function will then start from the first tree—and we set the current tree index to zero so the next function will start there.

void init_st_approx_iter(

struct st_approx_match_iter *iter,

struct suffix_tree *st,

const uint8_t *pattern,

int edits

) {

iter->st = st;

uint32_t n = strlen((char *)pattern);

struct collect_nodes_data data;

data.iter = iter;

data.edits_start = data.edits = malloc(2*n + 1);

data.cigar_buffer = malloc(2*n + 1);

init_pointer_vector(&iter->nodes, 10);

init_string_vector(&iter->cigars, 10);

collect_approx_hits(&data, st->root, true,

st->root->range.from, st->root->range.to,

pattern, data.edits, edits, 0);

free(data.edits_start);

free(data.cigar_buffer);

// We only initialize this to make resource management

// easier. We keep this iterator initialized at all

// time except when we deallocate it and immediately initialize.

// it again.

init_st_leaf_iter(&iter->leaf_iter, st, st->root);

iter->processing_tree = false;

iter->current_tree_index = 0;

}

The information we want to report for each match is the position of the match and the CIGAR:

struct st_approx_match {

uint32_t pos;

const char *cigar;

};

When we increment the iterator, we check if we are in a leaf iteration. If not, we need to pick the next tree or terminate the iteration if we do not have any more trees. When we have a tree, we initialize the leaf vector and tag that we are now processing a tree. We then call the function recursively so it can handle the new situation. If we are processing a tree, we increment the leaf iterator. If we do have more trees, we initialize the leaf iterator so it can process the next tree. We call recursively to get the next tree and start processing it.

bool next_st_approx_match(

struct st_approx_match_iter *iter,

struct st_approx_match *match

) {

if (!iter->processing_tree) {

if (iter->current_tree_index == iter->nodes.used) {

return false;

}

dealloc_st_leaf_iter(&iter->leaf_iter);

init_st_leaf_iter(

&iter->leaf_iter, iter->st,

pointer_vector_get(

&iter->nodes,

iter->current_tree_index

)

);

iter->processing_tree = true;

return next_st_approx_match(iter, match);

} else {

struct st_leaf_iter_result res;

bool more_leaves =

next_st_leaf(&iter->leaf_iter, &res);

if (!more_leaves) {

iter->processing_tree = false;

iter->current_tree_index++;

return next_st_approx_match(iter, match);

} else {

uint32_t i = iter->current_tree_index;

match->pos = res.leaf->leaf_label;

match->cigar = (const char *)iter->cigars.data[i];

return true;

}

The resources we need to free when the iterator is deallocated are the nodes vectors, the CIGAR strings and the CIGAR vector, and then the leaf iterator.

void dealloc_st_approx_iter(

struct st_approx_match_iter *iter

) {

dealloc_pointer_vector(&iter->nodes);

for (uint32_t i = 0; i < iter->cigars.used; ++i) {

free(iter->cigars.data[i]);

}

dealloc_string_vector(&iter->cigars);

dealloc_st_leaf_iter(&iter->leaf_iter);

}

The Li-Durbin algorithm

You can take the same approach with the BWT search as you can with the suffix tree—write a recursion that explores all edits while you search until you cannot match any more with the edits you have available—but the Li-Durbin algorithm adds an idea to this. They build an additional table for the BWT search that they use to terminate a search early. The table gives a minimum number of edits you need to match the rest of a string, and if the number of edits is smaller than this, the recursion stops.

The BWT search algorithm processes a pattern from the end toward the beginning. If we build a suffix array from the reversed string and search in that, starting at the beginning of the pattern and moving toward the end, then we find out where the reversed pattern sits in the reversed string. It is not hard to see this. If we reversed both the string and the pattern and did the BWT search, then we would locate the reversed pattern in the reverse string. The BWT algorithm doesn’t care that it is the reversed string we are searching in. Processing the pattern in the beginning-to-end order will give the algorithm the characters in the order it would get them if we reversed the pattern and went from end to beginning.

We can determine if a prefix of the pattern is in the string by searching from the beginning against the suffix array of the reversed string. The reversed pattern prefix is in the reversed string if and only if the prefix is in the original string. The same applies to any substring of the pattern. We can determine if that substring is in the string either by searching from the end to the beginning of the pattern using the original suffix array or by searching from the beginning to the end in suffix array of the reversed string. We are interested in prefixes when building the table that lets the Li-Durbin algorithm terminate early, so we will search from beginning to end in the reversed string.

The idea is to build a table with an entrance per index in the pattern, and at each index, we will record a lower bound in the number of edits we need. We do an exact matching search from left to right in the pattern, searching in the reversed string. Each time we get to a point where we do not have a match, we record that at least one edit is needed to match the prefix. We then start from the full range of the reversed string and the point we got to in the prefix and continue until we cannot match any longer. There, we record that at least two edits are needed. We continue like this until we have processed the entire pattern. Then, when we do an approximative match from right to left in the pattern, we always check how many edits are needed to match the rest of the pattern (the prefix remaining). If we do not have enough edit operations left to match the pattern, we stop the recursion.

To search in the reversed string, we add an O table from the suffix of the reversed string to our bwt_table data struct and add the suffix array to the initialization function. Rather than having separate tables and initializers, we take an argument for the suffix array of the reversed string that can be null. If it is, we do not use it, and if it is not, we build the O table from it.

struct bwt_table {

struct remap_table *remap_table;

struct suffix_array *sa;

uint32_t *c_table;

uint32_t *o_table;

uint32_t *ro_table; // NEW TABLE

uint32_t **ro_indices; // NEW TABLE

};

#define RO(a,i) (bwt_table->ro_indices[i][a])

void init_bwt_table(

struct bwt_table *bwt_table,

struct suffix_array *sa,

struct suffix_array *rsa,

struct remap_table *remap_table

) {

assert(sa);

uint32_t alphabet_size = remap_table->alphabet_size;

bwt_table->remap_table = remap_table;

bwt_table->sa = sa;

// ---- COMPUTE C TABLE -----------------------------------

uint32_t char_counts[remap_table->alphabet_size];

memset(char_counts, 0, remap_table->alphabet_size * sizeof(uint32_t));

for (uint32_t i = 0; i < sa->length; ++i) {

char_counts[sa->string[i]]++;

}

bwt_table->c_table = calloc(remap_table->alphabet_size, sizeof(*bwt_table->c_table));

for (uint32_t i = 1; i < remap_table->alphabet_size; ++i) {

C(i) = C(i-1) + char_counts[i - 1];

}

// ---- COMPUTE O TABLE -----------------------------------

// The table has indices from zero to n, so it must have size.

// Sigma x (n + 1)

uint32_t o_size = remap_table->alphabet_size *

(sa->length + 1) *

sizeof(*bwt_table->o_table);

bwt_table->o_table = malloc(o_size);

bwt_table->o_indices =

malloc((sa->length + 1) *

sizeof(*bwt_table->o_indices));

for (uint32_t i = 0; i < sa->length + 1; ++i) {

uint32_t *ptr = bwt_table->o_table +

alphabet_size * i;

bwt_table->o_indices[i] = ptr;

}

for (uint8_t a = 0; a < remap_table->alphabet_size; ++a) {

O(a, 0) = 0;

}

for (uint8_t a = 0; a < remap_table->alphabet_size; ++a) {

for (uint32_t i = 1; i <= sa->length; ++i) {

O(a, i) = O(a, i - 1) + (bwt(sa, i - 1) == a);

}

// NEW CODE

if (rsa) {

bwt_table->ro_table = malloc(o_size);

bwt_table->ro_indices =

malloc((sa->length + 1) *

sizeof(bwt_table->ro_indices));

for (uint32_t i = 0; i < sa->length + 1; ++i) {

bwt_table->ro_indices[i] =

bwt_table->ro_table +

alphabet_size * i;

}

for (uint8_t a = 0; a < remap_table->alphabet_size; ++a) {

RO(a, 0) = 0;

}

for (uint8_t a = 0; a < remap_table->alphabet_size; ++a) {

for (uint32_t i = 1; i <= rsa->length; ++i) {

RO(a, i) = RO(a, i - 1) + (bwt(rsa, i - 1) == a);

}

} else {

bwt_table->ro_table = 0;

bwt_table->ro_indices = 0;

}

void dealloc_bwt_table(

struct bwt_table *bwt_table

) {

free(bwt_table->c_table);

free(bwt_table->o_table);

// NEW CODE

if (bwt_table->ro_table) free(bwt_table->ro_table);

if (bwt_table->ro_indices) free(bwt_table->ro_indices);

}

The approximative matching iterator looks like this:

struct bwt_approx_iter {

struct bwt_table *bwt_table;

const uint8_t *remapped_pattern;

uint32_t L, R, next_interval;

struct index_vector Ls;

struct index_vector Rs;

struct string_vector cigars;

uint32_t m;

char *edits_buf;

uint32_t *D_table;

};

It contains pointers to the BWT tables and the (remapped) pattern. We need these for the recursive search. We use the L, R, and next_interval variables when we traverse the interval for a hit. The intervals for hits that we find are stored in the Ls and Rs vectors and the corresponding CIGARs in the cigars vector. The m variable will contain the length of the pattern. We will use it for allocating CIGAR strings; it tells us how long they can maximally be. The edits_buf variable points to the beginning of the string that holds our edits and the D_table variable holds the D table we use to terminate searches early.

When we initialize our approximative match iterator, we build the table of lower bounds, called D in the code. We only build it if we have the suffix array for the reversed string, so it is also possible to search without the D table if one so wishes. Building D is done as described earlier. We do a normal BWT search except it is in the ro_table suffix array and from the beginning to the end. We search until we get an empty interval and then record that we need one edit more.

In the initializer, we also handle the recursive search for hits. I have taken a different approach to avoid initial deletion here, just to show the alternative. We call the recursion after matches and insertions so we are never in the situation where we can have an initial deletion.

void init_bwt_approx_iter(

struct bwt_approx_iter *iter,

struct bwt_table *bwt_table,

const uint8_t *remapped_pattern,

int max_edits)

{

// Initialize resources for the recursive search.

iter->bwt_table = bwt_table;

iter->remapped_pattern = remapped_pattern;

init_index_vector(&iter->Ls, 10);

init_index_vector(&iter->Rs, 10);

init_string_vector(&iter->cigars, 10);

if (bwt_table->ro_table) {

// Build D table.

uint32_t m = (uint32_t)strlen((char *)remapped_pattern);

iter->D_table = malloc(m * sizeof(uint32_t));

int min_edits = 0;

uint32_t L = 0, R = bwt_table->sa->length;

for (uint32_t i = 0; i < m; ++i) {

uint8_t a = remapped_pattern[i];

L = C(a) + RO(a, L);

R = C(a) + RO(a, R);

if (L >= R) {

min_edits++;

L = 0;

R = bwt_table->sa->length;

}

iter->D_table[i] = min_edits;

}

} else {

iter->D_table = 0;

}

// Set up the edits buffer.

uint32_t m = (uint32_t)strlen((char *)remapped_pattern);

uint32_t buf_size = 2 * m + 1;

iter->m = m;

iter->edits_buf = malloc(buf_size + 1);

iter->edits_buf[0] = '';

// Start searching.

uint32_t L = 0, R = bwt_table->sa->length; int i = m - 1;

struct remap_table *remap_table = bwt_table->remap_table;

char *edits = iter->edits_buf;

// M-operations

unsigned char match_a = remapped_pattern[i];

// Iterating alphabet from 1 so

// I don't include the sentinel.

for (unsigned char a = 1;

a < remap_table->alphabet_size;

++a) {

uint32_t new_L = C(a) + O(a, L);

uint32_t new_R = C(a) + O(a, R);

int edit_cost = (a == match_a) ? 0 : 1;

if (max_edits - edit_cost < 0) continue;

if (new_L >= new_R) continue;

*edits = 'M';

rec_approx_matching(iter, new_L, new_R, i - 1,

1, max_edits - edit_cost,

edits + 1);

}

// I-operation

*edits = 'I';

rec_approx_matching(iter, L, R, i - 1, 0,

max_edits - 1, edits + 1);

// Make sure we start at the first interval.

iter->L = m; iter->R = 0;

iter->next_interval = 0;

}

The recursive function follows the suffix tree recursion closely. The main difference is in how we handle the CIGAR at a hit. We search for the pattern from the end to the beginning, so we build the edit operations in that order as well. To build the CIGAR for a match, we must first reverse the edits and then build the CIGAR. We cannot reverse the edits inside the edits buffer. This would affect all the recursive calls since it is a shared buffer. Instead, we allocate a new string, move the edits into it, reverse it, and then compute the CIGAR and store it in the vector for the hits.

static void rec_approx_matching(

struct bwt_approx_iter *iter,

uint32_t L, uint32_t R, int i,

int edits_left,

char *edits

) {

struct bwt_table *bwt_table = iter->bwt_table;

struct remap_table *remap_table = bwt_table->remap_table;

int lower_limit =

(i >= 0 && iter->D_table) ? iter->D_table[i] : 0;

if (edits_left < lower_limit) {

return; // We can never get a match from here.

}

if (i < 0) { // We have a match.

index_vector_append(&iter->Ls, L);

index_vector_append(&iter->Rs, R);

// Extract the edits and reverse them.

*edits = '';

char *rev_edits =

(char *)str_copy((uint8_t *)iter->edits_buf);

str_inplace_rev((uint8_t*)rev_edits);

// Build the cigar from the edits.

char *cigar = malloc(2 * iter->m);

edits_to_cigar(cigar, rev_edits);

// Free the reversed edits; we do not need them now.

free(rev_edits);

string_vector_append(&iter->cigars, (uint8_t *)cigar);

return; // Done down this path of matching...

}

uint32_t new_L;

uint32_t new_R;

// M-operations

unsigned char match_a = iter->remapped_pattern[i];

// Iterating alphabet from 1 so

// I don't include the sentinel.

for (unsigned char a = 1;

a < remap_table->alphabet_size;

++a) {

new_L = C(a) + O(a, L);

new_R = C(a) + O(a, R);

int edit_cost = (a == match_a) ? 0 : 1;

if (edits_left - edit_cost < 0) continue;

if (new_L >= new_R) continue;

*edits = 'M';

rec_approx_matching(iter, new_L, new_R, i - 1,

edits_left - edit_cost,

edits + 1);

}

// I-operation

*edits = 'I';

rec_approx_matching(iter, L, R, i - 1,

edits_left - 1, edits + 1);

// D-operation

*edits = 'D';

for (unsigned char a = 1;

a < remap_table->alphabet_size;

++a) {

new_L = C(a) + O(a, L);

new_R = C(a) + O(a, R);

if (new_L >= new_R) continue;

rec_approx_matching(

iter, new_L, new_R, i,

edits_left - 1, edits + 1

);

}

The expression checks if we can use the D table . We can only do this if we haven’t reached the beginning of the pattern and the D table was calculated (it will only be if we have the suffix array of the reversed string).

int lower_limit = (i >= 0 && iter->D_table) ?

iter->D_table[i] : 0;

When we increment the iterator, we use the L and R variables to determine whether we are processing an interval or if we should move to the next interval. If R is less than L, the current interval is empty, and we move to the next. If there aren’t any intervals left, we return false to report that we have iterated over all matches. If we are in an interval, we extract the CIGAR and the current position in the interval (where iter->L points).

bool next_bwt_approx_match(

struct bwt_approx_iter *iter,

struct bwt_approx_match *match

) {

if (iter->L >= iter->R) { // Done with current interval

if (iter->next_interval >= iter->Ls.used)

return false; // No more intervals

// Start the next interval

iter->L = iter->Ls.data[iter->next_interval];

iter->R = iter->Rs.data[iter->next_interval];

iter->next_interval++;

}

match->cigar =

(char *)iter->cigars.data[iter->next_interval - 1];

match->position = iter->bwt_table->sa->array[iter->L];

iter->L++;

return true;

}

When we deallocate the iterator, we deallocate all the vectors and the D table if it was computed.

static void free_strings(

struct string_vector *vec

) {

for (int i = 0; i < vec->used; i++) {

free(string_vector_get(vec, i));

}

void dealloc_bwt_approx_iter(

struct bwt_approx_iter *iter

) {

dealloc_index_vector(&iter->Ls);

dealloc_index_vector(&iter->Rs);

free_strings(&iter->cigars);

dealloc_string_vector(&iter->cigars);

free(iter->edits_buf);

if (iter->D_table) free(iter->D_table);

}

Comparisons

I will not compare edit cloud–based exact search with the other algorithms in this chapter. Being able to build the edit cloud gives a good intuition about approximative searching, and using an edit cloud with an exact search is an excellent way to run tests of the more complex algorithms. In practice, though, the size of the edit cloud explodes when the pattern gets large, and in practice, it is not practical to use this approach. Instead, I will compare searching using the suffix tree and the BWT with and without the D table. The results are shown in Figure 5-2.

I have performed the experiments with a maximum edit distance of 1, 2, and 3 (shown at the top of the figure). The pattern lengths I have used are 50, 100, and 150; see the x axis. What we see is that for small edit distances, there is not much difference between the algorithms, but that the difference in running time increases with the edit distance. Using the BWT approach without the D table is the slowest. Then comes the suffix tree approach and, finally, the BWT approach with the D table. The latter is dramatically faster and should be your first choice if you need to do approximative searches.

Of course, there is also a penalty for building the D table. It takes roughly twice as long to build both the O and the RO table than just the O table; see Figure 5-3. You only construct the tables once and might search in the millions of times, so this extra construction time might not be an issue.

You can find the code I used for the experiments here:

https://github.com/mailund/stralg/blob/master/performance/bwt_search.c

https://github.com/mailund/stralg/blob/master/performance/bwt_construction.c