Vector space model

Boolean retrieval works, but its output is binary: a term either matches a document or it does not. That is adequate when the collection is small. As the number of documents grows, the result set becomes too large for a person to go through. Suppose a term X is searched for in 1 million documents and half of them match. The next step is to order those documents on some basis, such as rank, before showing the results.

If ranking is required, the search engine must attach a score to each document. For an ordinary user, writing a Boolean query is itself difficult, because the query has to be built with and, or, and not. In practice, queries can be as simple as a single word or as complex as a sentence containing many words.

The vector space model can be divided into three stages:

  • Document indexing, where the terms are extracted from the documents
  • Weighting of the indexed terms, so that retrieval can be improved
  • Ranking the documents on the basis of query and similarity measures
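
The three stages above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method the text prescribes: the text does not name a weighting scheme or a similarity measure, so TF-IDF weights and cosine similarity are assumed here, and all function names (tokenize, build_index, idf_weights, weight, cosine, rank) are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; real systems do more.
    return text.lower().split()


def build_index(docs):
    # Stage 1: document indexing. Extract terms and count them per document.
    return [Counter(tokenize(doc)) for doc in docs]


def idf_weights(index):
    # Stage 2 (part 1): inverse document frequency for every indexed term.
    n_docs = len(index)
    doc_freq = Counter()
    for counts in index:
        doc_freq.update(counts.keys())
    return {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}


def weight(counts, idf):
    # Stage 2 (part 2): TF-IDF weight; terms unseen in the collection get 0.
    return {term: tf * idf.get(term, 0.0) for term, tf in counts.items()}


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    # Stage 3: score every document against the query, best score first.
    index = build_index(docs)
    idf = idf_weights(index)
    doc_vectors = [weight(counts, idf) for counts in index]
    query_vector = weight(Counter(tokenize(query)), idf)
    scores = [(cosine(query_vector, vec), i) for i, vec in enumerate(doc_vectors)]
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))


if __name__ == "__main__":
    sample_docs = [
        "the cat sat on the mat",
        "dogs chase cats",
        "the stock market fell",
    ]
    for score, doc_id in rank("cat mat", sample_docs):
        print(doc_id, round(score, 3))
```

Unlike a Boolean match, each document now receives a numeric score, so the result list can be ordered instead of merely split into matching and non-matching documents.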

A document usually has metadata associated with it, carrying various types of information, such as the following:

  • Author details
  • Creation date
  • Format of the document
  • Title
  • Date of publication
  • Abstract (although not always)

This metadata helps in forming queries such as "search for all documents whose author is xyz and that were published in 2017" or "search for the document whose title contains the word AI and whose author is ABC." Such queries are called parametric searches, and a parametric index is maintained to answer them. Normally, a separate parametric index is prepared for each parameter. Zones, such as the title or abstract, contain free text, which a parametric index cannot handle, so searching a title or abstract requires a zonal approach. A separate index is prepared for each zone, as shown in the following diagram:

This ensures efficient storage and retrieval of data. The approach still works well for Boolean queries and for retrieval on fields and zones.
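
As a rough illustration of the idea, the sketch below keeps one parametric index per field and one inverted index per zone, then answers the two example queries by intersecting sets of document IDs. The field names (author, year), the zone names (title, abstract), the sample documents, and the function name build_indexes are all assumptions made for this example.

```python
from collections import defaultdict


def build_indexes(docs):
    # One parametric index per field: exact field value -> set of doc IDs.
    parametric = {"author": defaultdict(set), "year": defaultdict(set)}
    # One zone index per free-text zone: term -> set of doc IDs.
    zones = {"title": defaultdict(set), "abstract": defaultdict(set)}
    for doc_id, doc in docs.items():
        for field, index in parametric.items():
            if field in doc:
                index[doc[field]].add(doc_id)
        for zone, index in zones.items():
            for term in doc.get(zone, "").lower().split():
                index[term].add(doc_id)
    return parametric, zones


if __name__ == "__main__":
    sample_docs = {
        1: {"author": "xyz", "year": 2017, "title": "AI for search"},
        2: {"author": "ABC", "year": 2016, "title": "Introduction to AI"},
        3: {"author": "xyz", "year": 2015, "title": "Databases"},
    }
    parametric, zones = build_indexes(sample_docs)

    # "All documents whose author is xyz and that were published in 2017"
    by_author_and_year = (
        parametric["author"].get("xyz", set())
        & parametric["year"].get(2017, set())
    )
    print(sorted(by_author_and_year))

    # "Title contains the word AI and the author is ABC"
    by_title_and_author = (
        zones["title"].get("ai", set())
        & parametric["author"].get("ABC", set())
    )
    print(sorted(by_title_and_author))
```

Keeping the indexes separate means a query touches only the fields and zones it names, and the Boolean and in each query becomes a set intersection.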

Representing a set of documents as vectors in a common vector space is known as the vector space model.
