Wildcard queries

Wildcard queries use * to specify what to search for. It can be seen in different places like at the starting of the word or ending of the word. The search term may have a beginning such as *its, which means find words that end with its. Such queries are called suffix queries. The search term may use * at the end, such as its*, which means find words starting with its. Such queries are called prefix queries. In term of trees, prefix queries are easy, as they require us to find terms between its <= t <= itt. Suffix queries require extra trees that maintain terms for backward movement. The next kind, which require more operations, are queries that have * in the middle, such as "fil*er", "se*te", and "pro*cent". To solve such queries, it requires to find "fil*" and "*er", and intersects the result of the two sets. This is an expensive operation, as one needs to traverse in both directions of the tree; this needs a workaround to make it simpler. One approach is to modify the query so that it contains "*" at the end only. The permuterm index approach adds a special character, "$", to words; for example, the term "hello" can be represented as hello$, ello$h, llo$he, lo$hel, or o$hell. Let's assume the query is for hel*o, so it will look for hel and o, ending up in o$hel. It simply rotates the wildcard so that it appears at the end only. It adds all rotations in the B-tree. It also takes up a lot of space. Another approach is to use bigram (k-gram) indexes, which are more efficient than permuterm indexes. In bigram indexes, all k-grams are enumerated. For example, "April is the cruelest month", split into 2-grams (bigrams) will look like the following:

$a, ap, pr, ri, il, l$, $i, is, s$, $t, th, he, e$, $c, cr, ru, ue, el, le, es, st, t$, $m, mo, on, nt, h$

$ is used to denote the start or end of the term. It maintains the second index in inverted form for all bigrams, and dictionary terms that contain the bigram. It retrieves all the postings that match the bigrams and intersects the whole list. Now, a query such as hel* is run as $h and he and el. It applies a post filter to filter results that are not relevant. It is fast and space efficient.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset