Boolean retrieval

Boolean retrieval covers retrieval systems or algorithms in which the IR query is a Boolean expression of terms combined with the operations AND, OR, and NOT. A Boolean retrieval model views each document as a set of words and matches documents against query terms using Boolean expressions. A standard example is Shakespeare's collected works, with a query for the plays that contain the words "Brutus" and "Caesar" but not "Calpurnia." Such a query can be answered with the grep command, which is available on Unix-based systems, by scanning the text of every play.

Scanning the text this way is effective when the collection is small, but it cannot quickly process a large collection, such as the amount of data available on the web, and it cannot rank the results by occurrence count.

The alternative is to index the documents in advance by the terms they contain. One approach is to build a term-document incidence matrix, a binary matrix that records whether each term is present in each play:

Term       Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Brutus     1                     1              0            1       0        0
Caesar     1                     1              0            1       1        1
Calpurnia  0                     1              0            0       0        0
Mercy      1                     0              1            1       1        1
Worser     1                     0              1            1       1        0

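Now, to answer the earlier query for "Brutus" and "Caesar" but not "Calpurnia," take the vectors for Brutus and Caesar and the complement of the vector for Calpurnia, and combine them with a bitwise AND: 110100 AND 110111 AND 101111 = 100100. The 1s in the result fall in the first and fourth columns, so Antony and Cleopatra and Hamlet are the plays that satisfy the query.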
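The bitwise computation above can be reproduced in a few lines of Python. This is only a minimal sketch: the names PLAYS, INCIDENCE, vector, and plays_for are illustrative assumptions, and the bit strings are the three vectors used in the query (the Calpurnia vector is stored uncomplemented and negated in the code).

```python
# Play titles in column order, left to right (most significant bit first).
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Incidence vectors for the three query terms, written as bit strings.
INCIDENCE = {
    "Brutus": "110100",
    "Caesar": "110111",
    "Calpurnia": "010000",
}

def vector(term):
    """Return the incidence vector of a term as an integer bit mask."""
    return int(INCIDENCE[term], 2)

def plays_for(mask):
    """Translate a bit mask back into the list of matching play titles."""
    bits = format(mask, "0{}b".format(len(PLAYS)))
    return [play for play, bit in zip(PLAYS, bits) if bit == "1"]

# An all-ones mask keeps the result of NOT confined to the six columns.
ALL = (1 << len(PLAYS)) - 1

# Brutus AND Caesar AND NOT Calpurnia
result = vector("Brutus") & vector("Caesar") & (~vector("Calpurnia") & ALL)

print(format(result, "06b"))  # 100100
print(plays_for(result))      # ['Antony and Cleopatra', 'Hamlet']
```

Python integers have no fixed width, so the complement is masked with ALL. Without the mask, the NOT would produce a negative number instead of the six-bit vector 101111.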

The preceding matrix works for a small example, but it grows too large for a realistic corpus. Consider a matrix of 500,000 terms by 1 million documents: it has 500,000 x 1 million entries, that is, half a trillion 0s and 1s. Most of those entries are 0, because each document contains only a small fraction of the terms, so it is far more economical to record only the 1s. This is what an inverted index does. It stores a dictionary of terms and, for each term, a list of the documents in which the term occurs, as in the following diagram:

Taken from https://nlp.stanford.edu/IR-book/pdf/01bool.pdf
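To make the storage argument concrete, the following sketch first computes the size of the full matrix and then converts a few incidence rows into postings lists by keeping only the positions of the 1s. The document numbering (1 to 6, in column order) and the names rows and to_postings are assumptions made for illustration; they do not come from the diagram.

```python
# Size of the full term-document matrix described in the text.
terms = 500_000
documents = 1_000_000
cells = terms * documents
print(cells)  # 500000000000

# Toy incidence rows: one bit per document, documents numbered from 1.
rows = {
    "Brutus": [1, 1, 0, 1, 0, 0],
    "Caesar": [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}

def to_postings(rows):
    """Keep only the document IDs of the 1 entries for each term."""
    return {
        term: [doc_id for doc_id, bit in enumerate(bits, start=1) if bit]
        for term, bits in rows.items()
    }

index = to_postings(rows)
print(index["Brutus"])     # [1, 2, 4]
print(index["Caesar"])     # [1, 2, 4, 5, 6]
print(index["Calpurnia"])  # [2]
```

The dictionary keys play the role of the term dictionary, and each list holds only as many entries as there are documents containing the term, instead of one cell per document.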

The documents in which a term appears form a list, known as the postings list, and each individual document entry is known as a posting. To create such a structure, each document is tokenized, and the resulting tokens are normalized by linguistic preprocessing. Once the normalized tokens are formed, the dictionary and the postings are created. To support ranking, the frequency of the term is also stored, as shown in the following diagram:

The extra information stored is useful for search engines that use a ranked retrieval model. Each postings list is also kept sorted for efficient query processing. This method reduces the storage requirement compared with the m x n matrix of 1s and 0s, and it also helps in processing Boolean queries.
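The construction steps and the benefit of sorted postings can be sketched as follows. This is an illustrative sketch under stated assumptions, not a definitive implementation: the three toy documents are made up, tokenize uses lowercasing as a crude stand-in for real linguistic preprocessing, and intersect is one common way to answer an AND query by walking two sorted postings lists in a single pass.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Split text into lowercase word tokens (a very crude normalization)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each term to a postings list of (doc_id, term_frequency) pairs.

    Documents are visited in increasing doc_id order, so every postings
    list ends up sorted by doc_id.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term, freq in Counter(tokenize(docs[doc_id])).items():
            index[term].append((doc_id, freq))
    return dict(index)

def intersect(p1, p2):
    """Return the doc_ids present in both sorted postings lists."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i][0] == p2[j][0]:
            answer.append(p1[i][0])
            i += 1
            j += 1
        elif p1[i][0] < p2[j][0]:
            i += 1
        else:
            j += 1
    return answer

# Hypothetical toy collection, keyed by document ID.
docs = {
    1: "Brutus praised Caesar; Caesar trusted Brutus.",
    2: "Calpurnia warned Caesar.",
    3: "Brutus spoke of mercy.",
}

index = build_index(docs)
print(index["caesar"])                              # [(1, 2), (2, 1)]
print(intersect(index["brutus"], index["caesar"]))  # [1]
```

Because both lists are sorted by document ID, the intersection advances one pointer per step and never revisits an entry, so its cost is proportional to the combined length of the two lists.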
