In the next few exercises we will get back to building a web search engine. To review, the components of a search engine are:
We’ll need a program that can download a web page, parse it, and extract the text and any links to other pages.
We’ll need an index that makes it possible to look up a search term and find the pages that contain it.
And we’ll need a way to collect results from the index and identify pages that are most relevant to the search terms.
If you did “Exercise 6”, you implemented an index using Java maps. In this exercise, we’ll revisit the indexer and make a new version that stores the results in a database.
If you did “Exercise 5”, you built a crawler that follows the first link it finds. In the next exercise, we’ll make a more general version that stores every link it finds in a queue and explores them in order.
And then, finally, you will work on the retrieval problem.
In these exercises, I provide less starter code, and you will make more design decisions. These exercises are also more open-ended. I will suggest some minimal goals you should try to reach, but there are many ways you can go farther if you want to challenge yourself.
Now, let’s get started on a new version of the indexer.
The previous version of the indexer stores the index in two data structures: a TermCounter
that maps from a search term to the number of times it appears on a web page, and an Index
that maps from a search term to the set of pages where it appears.
These data structures are stored in the memory of a running Java program, which means that when the program stops running, the index is lost. Data stored only in the memory of a running program is called volatile, because it vaporizes when the program ends.
Data that persists after the program that created it ends is called persistent. In general, files stored in a file system are persistent, as well as data stored in databases.
A simple way to make data persistent is to store it in a file. Before the program ends, it could translate its data structures into a format like JSON (http://thinkdast.com/json) and then write them into a file. When it starts again, it could read the file and rebuild the data structures.
But there are several problems with this solution:
Reading and writing large data structures (like a web index) would be slow.
The entire data structure might not fit into the memory of a single running program.
If a program ends unexpectedly (for example, due to a power outage), any changes made since the program last started would be lost.
A better alternative is a database that provides persistent storage and the ability to read and write parts of the database without reading and writing the whole thing.
There are many kinds of database management systems (DBMS) that provide different capabilities. You can read an overview at http://thinkdast.com/database.
The database I recommend for this exercise is Redis, which provides persistent data structures that are similar to Java data structures. Specifically, it provides:
Lists of strings, similar to Java List
Hashes, similar to Java Map
Sets of strings, similar to Java Set
Redis is a key-value database, which means that the data structures it contains (the values) are identified by unique strings (the keys). A key in Redis plays the same role as a reference in Java: it identifies an object. We’ll see some examples soon.
Redis is usually run as a remote service; in fact, the name stands for “REmote DIctionary Server”. To use Redis, you have to run the Redis server somewhere and then connect to it using a Redis client. There are many ways to set up a server and many clients you could use. For this exercise, here is my advice:
Rather than install and run the server yourself, consider using a service like RedisToGo (http://thinkdast.com/redistogo), which runs Redis in the cloud. They offer a free plan with enough resources for the exercise.
For the client I recommend Jedis, which is a Java library that provides classes and methods for working with Redis.
Here are more detailed instructions to help you get started:
Create an account on RedisToGo, at http://thinkdast.com/redissign, and select the plan you want (probably the free plan to get started).
Create an instance, which is a virtual machine running the Redis server. If you click the “Instances” tab, you should see your new instance, identified by a host name and a port number. For example, I have an instance named “dory-10534”.
Click the instance name to get the configuration page. Make a note of the URL near the top of the page, which looks like this:
redis://redistogo:[email protected]:10534
This URL contains the server’s host name, dory.redistogo.com
, the port number, 10534
, and the password you will need to connect to the server, which is the long string of letters and numbers in the middle. You will need this information for the next step.
In the repository for this book, you’ll find the source files for this exercise:
JedisMaker.java
contains example code for connecting to a Redis server and running a few Jedis methods.
JedisIndex.java
contains starter code for this exercise.
JedisIndexTest.java
contains test code for JedisIndex
.
WikiFetcher.java
contains the code we saw in previous exercises to read web pages and parse them using jsoup.
You will also need these files, which you worked on in previous exercises:
Index.java
implements an index using Java data structures.
TermCounter.java
represents a map from terms to their frequencies.
WikiNodeIterable.java
iterates through the nodes in a DOM tree produced by jsoup.
If you have working versions of these files, you can use them for this exercise. If you didn’t do the previous exercises, or you are not confident in your solutions, you can copy my solutions from the solutions
folder.
The first step is to use Jedis to connect to your Redis server. RedisMaker.java
shows how to do this. It reads information about your Redis server from a file, connects to it and logs in using your password, then returns a Jedis
object you can use to perform Redis operations.
If you open JedisMaker.java
, you should see the JedisMaker
class, which is a helper class that provides one static method, make
, which creates a Jedis
object. Once this object is authenticated, you can use it to communicate with your Redis database.
JedisMaker
reads information about your Redis server from a file named redis_url.txt
, which you should put in the directory src/resources
:
Use a text editor to create end edit ThinkDataStructures/code/src/resources/redis_url.txt
.
Paste in the URL of your server. If you are using RedisToGo, the URL will look like this:
redis://redistogo:[email protected]:10534
Because this file contains the password for your Redis server, you should not put this file in a public repository. To help you avoid doing that by accident, the repository contains a .gitignore
file that makes it harder (but not impossible) to put this file in your repo.
Now run ant build
to compile the source files and ant JedisMaker
to run the example code in main
:
public static void main(String[] args) { Jedis jedis = make(); // String jedis.set("mykey", "myvalue"); String value = jedis.get("mykey"); System.out.println("Got value: " + value); // Set jedis.sadd("myset", "element1", "element2", "element3"); System.out.println("element2 is member: " + jedis.sismember("myset", "element2")); // List jedis.rpush("mylist", "element1", "element2", "element3"); System.out.println("element at index 1: " + jedis.lindex("mylist", 1)); // Hash jedis.hset("myhash", "word1", Integer.toString(2)); jedis.hincrBy("myhash", "word2", 1); System.out.println("frequency of word1: " + jedis.hget("myhash", "word1")); System.out.println("frequency of word1: " + jedis.hget("myhash", "word2")); jedis.close(); }
This example demonstrates the data types and methods you are most likely to use for this exercise. When you run it, the output should be:
Got value: myvalue element2 is member: true element at index 1: element2 frequency of word1: 2 frequency of word2: 1
In the next section, I’ll explain how the code works.
Redis is basically a map from keys, which are String
s, to values, which can be one of several data types. The most basic Redis data type is a string. I will write Redis types in italics to distinguish them from Java types.
To add a string to the database, use jedis.set
, which is similar to Map.put
; the parameters are the new key and the corresponding value. To look up a key and get its value, use jedis.get
:
jedis.set("mykey", "myvalue"); String value = jedis.get("mykey");
In this example, the key is "mykey"
and the value is "myvalue"
.
Redis provides a set structure, which is similar to a Java Set<String>
. To add elements to a Redis set, you choose a key to identify the set and then use jedis.sadd
:
jedis.sadd("myset", "element1", "element2", "element3"); boolean flag = jedis.sismember("myset", "element2");
You don’t have to create the set as a separate step. If it doesn’t exist, Redis creates it. In this case, it creates a set named myset
that contains three elements.
The method jedis.sismember
checks whether an element is in a set. Adding elements and checking membership are constant time operations.
Redis also provides a list structure, which is similar to a Java List<String>
. The method jedis.rpush
adds elements to the end (right side) of a list:
jedis.rpush("mylist", "element1", "element2", "element3"); String element = jedis.lindex("mylist", 1);
Again, you don’t have to create the structure before you start adding elements. This example creates a list named “mylist” that contains three elements.
The method jedis.lindex
takes an integer index and returns the indicated element of a list. Adding and accessing elements are constant time operations.
Finally, Redis provides a hash structure, which is similar to a Java Map<String, String>
. The method jedis.hset
adds a new entry to the hash:
jedis.hset("myhash", "word1", Integer.toString(2)); String value = jedis.hget("myhash", "word1");
This example creates a hash named myhash
that contains one entry, which maps from the key word1
to the value "2"
.
The keys and values are strings, so if we want to store an Integer
, we have to convert it to a String
before we call hset
. And when we look up the value using hget
, the result is a String
, so we might have to convert it back to Integer
.
Working with Redis hashes can be confusing, because we use a key to identify which hash we want, and then another key to identify a value in the hash. In the context of Redis, the second key is called a field, which might help keep things straight. So a “key” like myhash
identifies a particular hash, and then a “field” like word1
identifies a value in the hash.
For many applications, the values in a Redis hash are integers, so Redis provides a few special methods, like hincrby
, that treat the values as integers:
jedis.hincrBy("myhash", "word2", 1);
This method accesses myhash
, gets the current value associated with word2
(or 0 if it doesn’t already exist), increments it by 1, and writes the result back to the hash.
Setting, getting, and incrementing entries in a hash are constant time operations.
You can read more about Redis data types at http://thinkdast.com/redistypes.
At this point you have the information you need to make a web search index that stores results in a Redis database.
Now run ant JedisIndexTest
. It should fail, because you have some work to do!
JedisIndexTest
tests these methods:
JedisIndex
, which is the constructor that takes a Jedis
object as a parameter.
indexPage
, which adds a web page to the index; it takes a String
URL and a jsoup Elements
object that contains the elements of the page that should be indexed.
getCounts
, which takes a search term and returns a Map<String, Integer>
that maps from each URL that contains the search term to the number of times it appears on that page.
Here’s an example of how these methods are used:
WikiFetcher wf = new WikiFetcher(); String url1 = "http://en.wikipedia.org/wiki/Java_(programming_language)"; Elements paragraphs = wf.readWikipedia(url1); Jedis jedis = JedisMaker.make(); JedisIndex index = new JedisIndex(jedis); index.indexPage(url1, paragraphs); Map<String, Integer> map = index.getCounts("the");
If we look up url1
in the result, map
, we should get 339, which is the number of times the word “the” appears on the Java Wikipedia page (that is, the version we saved).
If we index the same page again, the new results should replace the old ones.
One suggestion for translating data structures from Java to Redis: remember that each object in a Redis database is identified by a unique key, which is a string. If you have two kinds of objects in the same database, you might want to add a prefix to the keys to distinguish between them. For example, in our solution, we have two kinds of objects:
We define a URLSet
to be a Redis set that contains the URLs that contain a given search term. The key for each URLSet
starts with "URLSet:"
, so to get the URLs that contain the word “the”, we access the set with the key "URLSet:the"
.
We define a TermCounter
to be a Redis hash that maps from each term that appears on a page to the number of times it appears. The key for each TermCounter
starts with "TermCounter:"
and ends with the URL of the page we’re looking up.
In my implementation, there is one URLSet
for each term and one TermCounter
for each indexed page. I provide two helper methods, urlSetKey
and termCounterKey
, to assemble these keys.
At this point you have all the information you need to do the exercise, so you can get started if you are ready. But I have a few suggestions you might want to read first:
For this exercise I provide less guidance than in previous exercises. You will have to make some design decisions; in particular, you will have to figure out how to divide the problem into pieces that you can test one at a time, and then assemble the pieces into a complete solution. If you try to write the whole thing at once, without testing smaller pieces, it might take a very long time to debug.
One of the challenges of working with persistent data is that it is persistent. The structures stored in the database might change every time you run the program. If you mess something up in the database, you will have to fix it or start over before you can proceed. To help you keep things under control, I’ve provided methods called deleteURLSets
, deleteTermCounters
, and deleteAllKeys
, which you can use to clean out the database and start fresh. You can also use printIndex
to print the contents of the index.
Each time you invoke a Jedis
method, your client sends a message to the server, then the server performs the action you requested and sends back a message. If you perform many small operations, it will probably take a long time. You can improve performance by grouping a series of operations into a Transaction
.
For example, here’s a simple version of deleteAllKeys
:
public void deleteAllKeys() { Set<String> keys = jedis.keys("*"); for (String key: keys) { jedis.del(key); } }
Each time you invoke del
requires a round-trip from the client to the server and back. If the index contains more than a few pages, this method would take a long time to run. We can speed it up with a Transaction
object:
public void deleteAllKeys() { Set<String> keys = jedis.keys("*"); Transaction t = jedis.multi(); for (String key: keys) { t.del(key); } t.exec(); }
jedis.multi
returns a Transaction
object, which provides all the methods of a Jedis
object. But when you invoke a method on a Transaction
, it doesn’t run the operation immediately, and it doesn’t communicate with the server. It saves up a batch of operations until you invoke exec
. Then it sends all of the saved operations to the server at the same time, which is usually much faster.
Now you really have all the information you need; you should start working on the exercise. But if you get stuck, or if you really don’t know how to get started, you can come back for a few more hints.
Don’t read the following until you have run the test code, tried out some basic Redis commands, and written a few methods in JedisIndex.java
.
OK, if you are really stuck, here are some methods you might want to work on:
/** * Adds a URL to the set associated with term. */ public void add(String term, TermCounter tc) {} /** * Looks up a search term and returns a set of URLs. */ public Set<String> getURLs(String term) {} /** * Returns the number of times the given term appears at the given URL. */ public Integer getCount(String url, String term) {} /** * Pushes the contents of the TermCounter to Redis. */ public List<Object> pushTermCounterToRedis(TermCounter tc) {}
These are the methods I used in my solution, but they are certainly not the only way to divide things up. So please take these suggestions if they help, but ignore them if they don’t.
For each method, consider writing the tests first. When you figure out how to test a method, you often get ideas about how to write it.
Good luck!