In this chapter I present a solution to the previous exercise. Then you will write code to combine multiple search results and sort them by their relevance to the search terms.
First, let’s go over our solution to the previous exercise. I provided an outline of WikiCrawler; your job was to fill in crawl. As a reminder, here are the fields in the WikiCrawler class:
public class WikiCrawler {
    // keeps track of where we started
    private final String source;

    // the index where the results go
    private JedisIndex index;

    // queue of URLs to be indexed
    private Queue<String> queue = new LinkedList<String>();

    // fetcher used to get pages from Wikipedia
    final static WikiFetcher wf = new WikiFetcher();
}
When we create a WikiCrawler, we provide source and index. Initially, queue contains only one element, source.
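Based on that description, a constructor consistent with these fields might look like this (a sketch; check the repository for the version provided there):

public WikiCrawler(String source, JedisIndex index) {
    this.source = source;
    this.index = index;
    queue.offer(source);   // the queue starts with just the source URL
}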
Notice that the implementation of queue is a LinkedList, so we can add elements at the end, and remove them from the beginning, in constant time. By assigning a LinkedList object to a Queue variable, we limit ourselves to using methods in the Queue interface; specifically, we’ll use offer to add elements and poll to remove them.
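For example, here is a minimal sketch of the queue operations crawl relies on; the URLs are placeholders:

Queue<String> queue = new LinkedList<String>();
queue.offer("https://en.wikipedia.org/wiki/Java_(programming_language)");
queue.offer("https://en.wikipedia.org/wiki/Computer_science");
String next = queue.poll();   // removes and returns the first URL, in FIFO order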
Here’s my implementation of WikiCrawler.crawl:
public String crawl(boolean testing) throws IOException {
    if (queue.isEmpty()) {
        return null;
    }
    String url = queue.poll();
    System.out.println("Crawling " + url);

    if (testing == false && index.isIndexed(url)) {
        System.out.println("Already indexed.");
        return null;
    }

    Elements paragraphs;
    if (testing) {
        paragraphs = wf.readWikipedia(url);
    } else {
        paragraphs = wf.fetchWikipedia(url);
    }
    index.indexPage(url, paragraphs);
    queueInternalLinks(paragraphs);
    return url;
}
Most of the complexity in this method is there to make it easier to test. Here’s the logic:
If the queue is empty, it returns null to indicate that it did not index a page.
Otherwise it removes and stores the next URL from the queue.
If the URL has already been indexed, crawl doesn’t index it again, unless it’s in testing mode.
Next it reads the contents of the page: if it’s in testing mode, it reads from a file; otherwise it reads from the web.
It indexes the page.
It parses the page and adds internal links to the queue.
Finally, it returns the URL of the page it indexed.
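To see how crawl might be used, here is a hypothetical driver loop, assuming the constructor sketched earlier; it is an illustration, not part of the provided code:

WikiCrawler wc = new WikiCrawler(source, index);
int indexed = 0;
for (int i = 0; i < 100 && indexed < 10; i++) {
    String url = wc.crawl(false);   // live mode: fetch from the web
    if (url != null) {
        indexed++;   // crawl returns null when it skips a page or the queue is empty
    }
}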
I presented an implementation of Index.indexPage in “The Redis-Backed Indexer”. So the only new method is WikiCrawler.queueInternalLinks.
I wrote two versions of this method with different parameters: one takes an Elements object containing one DOM tree per paragraph; the other takes an Element object that contains a single paragraph.
The first version just loops through the paragraphs. The second version does the real work:
void queueInternalLinks(Elements paragraphs) {
    for (Element paragraph: paragraphs) {
        queueInternalLinks(paragraph);
    }
}

private void queueInternalLinks(Element paragraph) {
    Elements elts = paragraph.select("a[href]");
    for (Element elt: elts) {
        String relURL = elt.attr("href");

        if (relURL.startsWith("/wiki/")) {
            String absURL = elt.attr("abs:href");
            queue.offer(absURL);
        }
    }
}
To determine whether a link is “internal,” we check whether the URL starts with “/wiki/”. This might include some pages we don’t want to index, like meta-pages about Wikipedia. And it might exclude some pages we want, like links to pages in non-English languages. But this simple test is good enough to get started.
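For instance, here is how that test classifies a few typical href values (illustrative examples):

// relURL values and how the startsWith("/wiki/") test treats them:
//   "/wiki/Computer_science"                  -> queued (regular article)
//   "/wiki/Wikipedia:Verifiability"           -> queued, although it is a meta-page
//   "#cite_note-1"                            -> skipped (same-page anchor)
//   "https://en.wiktionary.org/wiki/computer" -> skipped (absolute URL to another site)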
That’s all there is to it. This exercise doesn’t have a lot of new material; it is mostly a chance to bring the pieces together.
The next phase of this project is to implement a search tool. These are the pieces we’ll need:

1. An interface where users can provide search terms and view results

2. A lookup mechanism that takes each search term and returns the pages that contain it

3. Mechanisms for combining search results from multiple search terms

4. Algorithms for ranking and sorting search results
The general term for processes like this is information retrieval, which you can read more about at http://thinkdast.com/infret.
In this exercise, we’ll focus on steps 3 and 4. We’ve already built a simple version of step 2. If you are interested in building web applications, you might consider working on step 1.
Most search engines can perform boolean searches, which means you can combine the results from multiple search terms using boolean logic. For example:
The search “java AND programming” might return only pages that contain both search terms: “java” and “programming”.
“java OR programming” might return pages that contain either term but not necessarily both.
“java -indonesia” might return pages that contain “java” and do not contain “indonesia”.
Expressions like these that contain search terms and operators are called “queries”.
When applied to search results, the boolean operators AND, OR, and - correspond to the set operations intersection, union, and difference. For example, suppose s1 is the set of pages containing “java”, s2 is the set of pages containing “programming”, and s3 is the set of pages containing “indonesia”. In that case:
The intersection of s1 and s2 is the set of pages containing “java” AND “programming”.
The union of s1 and s2 is the set of pages containing “java” OR “programming”.
The difference of s1 and s3 is the set of pages containing “java” and not “indonesia”.
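If it helps to see the correspondence concretely, here is a sketch using java.util.Set; in the exercise you will implement the same operations on WikiSearch objects, which also carry relevance scores:

Set<String> s1 = new HashSet<>(Arrays.asList("pageA", "pageB", "pageC"));
Set<String> s2 = new HashSet<>(Arrays.asList("pageB", "pageC", "pageD"));

Set<String> intersection = new HashSet<>(s1);
intersection.retainAll(s2);   // {pageB, pageC}: pages containing both terms

Set<String> union = new HashSet<>(s1);
union.addAll(s2);             // {pageA, pageB, pageC, pageD}: either term

Set<String> difference = new HashSet<>(s1);
difference.removeAll(s2);     // {pageA}: the first term but not the second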
In the next section you will write a method to implement these operations.
In the repository for this book you’ll find the source files for this exercise:
WikiSearch.java, which defines an object that contains search results and performs operations on them
WikiSearchTest.java, which contains test code for WikiSearch
Card.java, which demonstrates how to use the sort method in java.util.Collections
You will also find some of the helper classes we’ve used in previous exercises.
Here’s the beginning of the WikiSearch class definition:
public class WikiSearch {

    // map from URLs that contain the term(s) to relevance score
    private Map<String, Integer> map;

    public WikiSearch(Map<String, Integer> map) {
        this.map = map;
    }

    public Integer getRelevance(String url) {
        Integer relevance = map.get(url);
        return relevance == null ? 0 : relevance;
    }
}
A WikiSearch object contains a map from URLs to their relevance score. In the context of information retrieval, a relevance score is a number intended to indicate how well a page meets the needs of the user as inferred from the query. There are many ways to construct a relevance score, but most of them are based on term frequency, which is the number of times the search terms appear on the page. A common relevance score is called TF-IDF, which stands for “term frequency–inverse document frequency”. You can read more about it at http://thinkdast.com/tfidf.
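In its most common form, the TF-IDF score of a term t in a document d is the product of the term frequency and the inverse document frequency: tfidf(t, d) = tf(t, d) * log(N / df(t)), where tf(t, d) is the number of times t appears in d, N is the number of documents in the collection, and df(t) is the number of documents that contain t.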
You’ll have the option to implement TF-IDF later, but we’ll start with something even simpler, TF:
If a query contains a single search term, the relevance of a page is its term frequency; that is, the number of times the term appears on the page.
For queries with multiple terms, the relevance of a page is the sum of the term frequencies; that is, the total number of times any of the search terms appear.
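One way to realize the sum rule, as a sketch rather than the solution to the exercise, is a helper that merges two relevance maps, adding the scores of URLs that appear in both; totalRelevance is a hypothetical name:

private Map<String, Integer> totalRelevance(Map<String, Integer> map1,
                                            Map<String, Integer> map2) {
    // start with a copy of the first map, then fold in the second,
    // summing scores for URLs that appear in both
    Map<String, Integer> total = new HashMap<String, Integer>(map1);
    for (Map.Entry<String, Integer> entry: map2.entrySet()) {
        total.merge(entry.getKey(), entry.getValue(), Integer::sum);
    }
    return total;
}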
Now you’re ready to start the exercise. Run ant build to compile the source files, then run ant WikiSearchTest. As usual, it should fail, because you have work to do.
In WikiSearch.java, fill in the bodies of and, or, and minus so that the relevant tests pass. You don’t have to worry about testSort yet.
You can run WikiSearchTest without using Jedis because it doesn’t depend on the index in your Redis database. But if you want to run a query against your index, you have to provide a file with information about your Redis server. See “Making a Redis-Backed Index” for details.
Run ant JedisMaker to make sure it is configured to connect to your Redis server. Then run WikiSearch, which prints results from three queries:
“java”
“programming”
“java AND programming”
Initially the results will be in no particular order, because WikiSearch.sort is incomplete.
Fill in the body of sort so the results are returned in increasing order of relevance. I suggest you use the sort method provided by java.util.Collections, which sorts any kind of List. You can read the documentation at http://thinkdast.com/collections.
There are two versions of sort:
The one-parameter version takes a list and sorts the elements using the compareTo method, so the elements have to be Comparable.
The two-parameter version takes a list of any object type and a Comparator, which is an object that provides a compare method that compares elements.
If you are not familiar with the Comparable and Comparator interfaces, I explain them in the next section.
The repository for this book includes Card.java, which demonstrates two ways to sort a list of Card objects. Here’s the beginning of the class definition:
public class Card implements Comparable<Card> {

    private final int rank;
    private final int suit;

    public Card(int rank, int suit) {
        this.rank = rank;
        this.suit = suit;
    }
A Card object has two integer fields, rank and suit. Card implements Comparable<Card>, which means that it provides compareTo:
public int compareTo(Card that) {
    if (this.suit < that.suit) {
        return -1;
    }
    if (this.suit > that.suit) {
        return 1;
    }
    if (this.rank < that.rank) {
        return -1;
    }
    if (this.rank > that.rank) {
        return 1;
    }
    return 0;
}
The specification of compareTo indicates that it should return a negative number if this is considered less than that, a positive number if it is considered greater, and 0 if they are considered equal.
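For example, using the encoding you will see in makeDeck below, where ranks run from 1 to 13 and suits from 0 to 3:

Card jack = new Card(11, 3);
Card ace = new Card(1, 3);
int cmp = jack.compareTo(ace);   // positive: the suits match and 11 > 1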
If you use the one-parameter version of Collections.sort, it uses the compareTo method provided by the elements to sort them. To demonstrate, we can make a list of 52 cards like this:
public static List<Card> makeDeck() {
    List<Card> cards = new ArrayList<Card>();
    for (int suit = 0; suit <= 3; suit++) {
        for (int rank = 1; rank <= 13; rank++) {
            Card card = new Card(rank, suit);
            cards.add(card);
        }
    }
    return cards;
}
and sort them like this:
Collections.sort(cards);
This version of sort puts the elements in what’s called their “natural order” because it’s determined by the objects themselves.
But it is possible to impose a different ordering by providing a Comparator object. For example, the natural order of Card objects treats Aces as the lowest rank, but in some card games they have the highest rank. We can define a Comparator that considers “Aces high”, like this:
Comparator<Card> comparator = new Comparator<Card>() {

    @Override
    public int compare(Card card1, Card card2) {
        if (card1.getSuit() < card2.getSuit()) {
            return -1;
        }
        if (card1.getSuit() > card2.getSuit()) {
            return 1;
        }
        int rank1 = getRankAceHigh(card1);
        int rank2 = getRankAceHigh(card2);

        if (rank1 < rank2) {
            return -1;
        }
        if (rank1 > rank2) {
            return 1;
        }
        return 0;
    }

    private int getRankAceHigh(Card card) {
        int rank = card.getRank();
        if (rank == 1) {
            return 14;
        } else {
            return rank;
        }
    }
};
This code defines an anonymous class that implements compare, as required. Then it creates an instance of the newly defined, unnamed class. If you are not familiar with anonymous classes in Java, you can read about them at http://thinkdast.com/anonclass.
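As an aside, in Java 8 and later the same ordering can be built more compactly with comparator-combinator methods; this sketch is equivalent in behavior, assuming the getters used above:

// compare by suit first, then by rank, mapping Aces (rank 1) to 14
Comparator<Card> acesHigh = Comparator
        .comparingInt(Card::getSuit)
        .thenComparingInt(card -> card.getRank() == 1 ? 14 : card.getRank());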
Using this Comparator, we can invoke sort like this:
Collections.sort(cards, comparator);
In this ordering, the Ace of Spades is considered the highest card in the deck; the two of Clubs is the lowest.
The code in this section is in Card.java if you want to experiment with it. As an exercise, you might want to write a comparator that sorts by rank first and then by suit, so that all the Aces are together, all the twos, and so on.
If you get a basic version of this exercise working, you might want to work on these optional exercises:
Read about TF-IDF at http://thinkdast.com/tfidf and implement it. You might have to modify JavaIndex to compute document frequencies; that is, the total number of times each term appears on all pages in the index.
For queries with more than one search term, the total relevance for each page is currently the sum of the relevance for each term. Think about when this simple version might not work well, and try out some alternatives.
Build a user interface that allows users to enter queries with boolean operators. Parse the queries, generate the results, then sort them by relevance and display the highest-scoring URLs. Consider generating snippets that show where the search terms appeared on the page. If you want to make a web application for your user interface, consider using Heroku as a simple option for developing and deploying web applications using Java. See http://thinkdast.com/heroku.