Dictionaries and tolerant retrieval

Dictionary data structures store the term vocabulary, along with, for each term, the list of documents that contain it, also known as its postings.

Dictionaries can be implemented in two different ways: using hash tables or using trees. A naive approach to storing the dictionary leads to performance issues as the corpus grows. Some IR systems use hash tables, whereas others use trees. Both approaches have their pros and cons.

Hash tables map each vocabulary term to an integer, which is obtained by hashing. Lookups in a hash table are fast, taking constant time, O(1), on average. A prefix-based search, such as finding the terms that start with "abc", does not work well when the terms are stored in a hash table, because hashing scatters terms that share a prefix to unrelated positions. For the same reason, it is not easy to find minor variants of a term. As the vocabulary grows, the table has to be rehashed, which is expensive.
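
As a minimal sketch of this trade-off, the following Python snippet uses the built-in dict, which is a hash table, as a toy dictionary. The terms, document IDs, and the function names lookup and prefix_scan are made up for illustration and are not taken from any particular IR system.

```python
# Toy dictionary: term -> postings list (document IDs).
# All terms and IDs here are invented for illustration.
index = {
    "abacus": [1, 4],
    "abc": [2],
    "abcde": [2, 7],
    "zebra": [3],
}


def lookup(term):
    """Exact-match lookup: one hash computation, O(1) on average."""
    return index.get(term, [])


def prefix_scan(prefix):
    """Prefix query on a hash table.

    The hash of "abc" says nothing about where "abcde" is stored,
    so every term has to be tested: O(M) for M terms.
    """
    return sorted(t for t in index if t.startswith(prefix))


if __name__ == "__main__":
    print(lookup("abc"))
    print(prefix_scan("abc"))
```

The exact-match call touches a single slot, while the prefix query has no choice but to visit the whole vocabulary.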

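A tree-based approach uses a search tree, normally a binary tree, which is efficient for searching. It handles prefix-based searches efficiently, because terms that share a prefix sit next to each other in sorted order. It is slower than hashing, as a lookup takes O(log M) time, where M is the number of terms in the vocabulary, and this bound holds only while the tree is kept balanced. Each rebalancing of the tree is expensive.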
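
The effect of keeping terms in sorted order can be sketched without building a full tree. The snippet below is an assumption-laden stand-in, not a balanced binary tree: it uses a sorted Python list with binary search from the standard bisect module, which gives the same O(log M) lookup and the same ordered traversal that a balanced tree provides. The term list and the function names contains and prefix_search are hypothetical.

```python
import bisect

# Sorted vocabulary, standing in for an in-order walk of a balanced tree.
# The terms are invented for illustration.
terms = sorted(["abacus", "abc", "abcde", "abd", "zebra"])


def contains(term):
    """Exact lookup by binary search: O(log M) comparisons."""
    i = bisect.bisect_left(terms, term)
    return i < len(terms) and terms[i] == term


def prefix_search(prefix):
    """Return all terms starting with prefix.

    Binary search finds the first candidate in O(log M); the matches
    are then contiguous, so the scan stops at the first non-match.
    """
    i = bisect.bisect_left(terms, prefix)
    matches = []
    while i < len(terms) and terms[i].startswith(prefix):
        matches.append(terms[i])
        i += 1
    return matches


if __name__ == "__main__":
    print(contains("abd"))
    print(prefix_search("abc"))
```

A sorted list makes insertions O(M), which is the price paid here for simplicity; a balanced tree keeps insertions logarithmic at the cost of the rebalancing work.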
