Evaluation of information retrieval systems

The standard way to evaluate an information retrieval system requires a test collection, which should have the following components (a small code sketch of these components follows the list):

  • A collection of documents
  • A test set of queries expressing the information needs
  • A binary assessment of each document as either relevant or not relevant to each information need
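Here is a minimal sketch of these three components as plain Python data structures; all document IDs, texts, and queries are made up purely for illustration:

    # A toy test collection; every value here is illustrative, not taken from
    # any standard collection.

    # 1. A collection of documents, keyed by document ID.
    documents = {
        "d1": "Python is a high-level programming language.",
        "d2": "The ball python is a popular pet snake.",
        "d3": "Monty Python is a British comedy group.",
    }

    # 2. A test set of queries expressing information needs.
    queries = {
        "q1": "python programming language",
    }

    # 3. Binary relevance judgments: for each query, the set of document IDs
    #    judged relevant to the underlying information need.
    relevance_judgments = {
        "q1": {"d1"},
    }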

The documents in the collection are classified into two categories, relevant and not relevant. The test collection should be of a reasonable size, so that performance can be averaged over enough queries to be meaningful. Relevance is always assessed relative to the information need, not to the query itself. In other words, a result that contains a query word is not necessarily relevant. For example, if the query is "Python," the results may include the Python programming language or a pet python; both results contain the query term, but whether a result satisfies the user's information need is what matters. If the system contains a parameterized index, it can be tuned for better performance; in that case, a separate development collection is required for tuning the parameters, because the weights assigned change as the parameters change, and tuning and evaluating on the same collection would overstate performance.

There are some standard test collections available for the evaluation of information retrieval systems. Some of them are listed here:

  • The Cranfield collection contains 1398 abstracts from aerodynamics journals and 225 queries, with exhaustive relevance judgments for all query-document pairs.
  • The Text REtrieval Conference (TREC) has maintained a large IR test series for evaluation since 1992. It consists of 1.89 million documents and relevance judgments for 450 information needs.
  • GOV2 has a collection of 25 million web pages.
  • NTCIR provides test collections focusing on East Asian languages and cross-language information retrieval. [http://ntcir.nii.ac.jp/about/]
  • REUTERS consists of 806,791 documents.
  • 20 newsgroups is another collection used widely for classification; see the loading sketch after this list.
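If the 20 newsgroups collection is needed locally, one convenient way to obtain it (assuming the scikit-learn library is installed; this is not the only distribution of the collection) is:

    # Downloads the 20 newsgroups posts on first use and caches them locally.
    from sklearn.datasets import fetch_20newsgroups

    newsgroups = fetch_20newsgroups(subset="train")

    print(len(newsgroups.data))         # number of posts in the training split
    print(newsgroups.target_names[:5])  # first few of the 20 category names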

Two measures that are used to find the effectiveness of a retrieval system are precision and recall. Precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved.
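As an illustration, both measures can be computed directly from the set of retrieved documents and the set of relevant documents; the document IDs below are invented for the example:

    def precision_recall(retrieved, relevant):
        """Compute precision and recall from sets of document IDs."""
        retrieved = set(retrieved)
        relevant = set(relevant)
        found = retrieved & relevant  # retrieved documents that are also relevant

        precision = len(found) / len(retrieved) if retrieved else 0.0
        recall = len(found) / len(relevant) if relevant else 0.0
        return precision, recall

    # 3 of the 4 retrieved documents are relevant (precision = 0.75),
    # and 3 of the 6 relevant documents were retrieved (recall = 0.5).
    retrieved_docs = {"d1", "d2", "d3", "d4"}
    relevant_docs = {"d1", "d2", "d3", "d5", "d6", "d7"}
    print(precision_recall(retrieved_docs, relevant_docs))  # (0.75, 0.5)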
