In this chapter, I define MyBetterMap
, a better implementation of the Map
interface than MyLinearMap
, and introduce hashing, which makes MyBetterMap
more efficient.
To improve the performance of MyLinearMap
, we’ll write a new class, called MyBetterMap
, that contains a collection of MyLinearMap
objects. It divides the keys among the embedded maps, so the number of entries in each map is smaller, which speeds up findEntry
and the methods that depend on it.
Here’s the beginning of the class definition:
public class MyBetterMap<K, V> implements Map<K, V> { protected List<MyLinearMap<K, V>> maps; public MyBetterMap(int k) { makeMaps(k); } protected void makeMaps(int k) { maps = new ArrayList<MyLinearMap<K, V>>(k); for (int i=0; i<k; i++) { maps.add(new MyLinearMap<K, V>()); } } }
The instance variable, maps
, is a collection of MyLinearMap
objects. The constructor takes a parameter, k
, that determines how many maps to use, at least initially. Then makeMaps
creates the embedded maps and stores them in an ArrayList
.
Now, the key to making this work is that we need some way to look at a key and decide which of the embedded maps it should go into. When we put
a new key, we choose one of the maps; when we get
the same key, we have to remember where we put it.
One possibility is to choose one of the sub-maps at random and keep track of where we put each key. But how should we keep track? It might seem like we could use a Map
to look up the key and find the right sub-map, but the whole point of the exercise is to write an efficient implementation of a Map
. We can’t assume we already have one.
A better approach is to use a hash function, which takes an Object
, any Object
, and returns an integer called a hash code. Importantly, if it sees the same Object
more than once, it always returns the same hash code. That way, if we use the hash code to store a key, we’ll get the same hash code when we look it up.
In Java, every Object
provides a method called hashCode
that computes a hash function. The implementation of this method is different for different objects; we’ll see an example soon.
Here’s a helper method that chooses the right sub-map for a given key:
protected MyLinearMap<K, V> chooseMap(Object key) { int index = 0; if (key != null) { index = Math.abs(key.hashCode()) % maps.size(); } return maps.get(index); }
If key
is null
, we choose the sub-map with index 0, arbitrarily. Otherwise we use hashCode
to get an integer, apply Math.abs
to make sure it is non-negative, then use the remainder operator, %
, which guarantees that the result is between 0 and maps.size()-1
. So index
is always a valid index into maps
. Then chooseMap
returns a reference to the map it chose.
We use chooseMap
in both put
and get
, so when we look up a key, we get the same map we chose when we added the key. At least, we should—I’ll explain a little later why this might not work.
Here’s my implementation of put
and get
:
public V put(K key, V value) { MyLinearMap<K, V> map = chooseMap(key); return map.put(key, value); } public V get(Object key) { MyLinearMap<K, V> map = chooseMap(key); return map.get(key); }
Pretty simple, right? In both methods, we use chooseMap
to find the right sub-map and then invoke a method on the sub-map. That’s how it works; now let’s think about performance.
If there are n entries split up among k sub-maps, there will be entries per map, on average. When we look up a key, we have to compute its hash code, which takes some time, then we search the corresponding sub-map.
Because the entry lists in MyBetterMap
are k times shorter than the entry list in MyLinearMap
, we expect the search to be k times faster. But the runtime is still proportional to n, so MyBetterMap
is still linear. In the next exercise, you’ll see how we can fix that.
The fundamental requirement for a hash function is that the same object should produce the same hash code every time. For immutable objects, that’s relatively easy. For objects with mutable state, we have to think harder.
As an example of an immutable object, we’ll define a class called SillyString
that encapsulates a String
:
public class SillyString { private final String innerString; public SillyString(String innerString) { this.innerString = innerString; } public String toString() { return innerString; }
This class is not very useful, which is why it’s called SillyString
, but I’ll use it to show how a class can define its own hash function:
@Override public boolean equals(Object other) { return this.toString().equals(other.toString()); } @Override public int hashCode() { int total = 0; for (int i=0; i<innerString.length(); i++) { total += innerString.charAt(i); } return total; }
Notice that SillyString
overrides both equals
and hashCode
. This is important. In order to work properly, equals
has to be consistent with hashCode
, which means that if two objects are considered equal—that is, equals
returns true
—they should have the same hash code. But this requirement only works one way; if two objects have the same hash code, they don’t necessarily have to be equal.
equals
works by invoking toString
, which returns innerString
. So two SillyString
objects are equal if their innerString
instance variables are equal.
hashCode
works by iterating through the characters in the String
and adding them up. When you add a character to an int
, Java converts the character to an integer using its Unicode code point. You don’t need to know anything about Unicode to understand this example, but if you are curious, you can read more at http://thinkdast.com/codepoint.
This hash function satisfies the requirement: if two SillyString
objects contain embedded strings that are equal, they will get the same hash code.
This works correctly, but it might not yield good performance, because it returns the same hash code for many different strings. If two strings contain the same letters in any order, they will have the same hash code. And even if they don’t contain the same letters, they might yield the same total, like "ac"
and "bb"
.
If many objects have the same hash code, they end up in the same sub-map. If some sub-maps have more entries than others, the speedup when we have k maps might be much less than k. So one of the goals of a hash function is to be uniform; that is, it should be equally likely to produce any value in the range. You can read more about designing good hash functions at http://thinkdast.com/hash.
Strings are immutable, and SillyString
is also immutable because innerString
is declared to be final
. Once you create a SillyString
, you can’t make innerString
refer to a different String
, and you can’t modify the String
it refers to. Therefore, it will always have the same hash code.
But let’s see what happens with a mutable object. Here’s a definition for SillyArray
, which is identical to SillyString
, except that it uses an array of characters instead of a String
:
public class SillyArray { private final char[] array; public SillyArray(char[] array) { this.array = array; } public String toString() { return Arrays.toString(array); } @Override public boolean equals(Object other) { return this.toString().equals(other.toString()); } @Override public int hashCode() { int total = 0; for (int i=0; i<array.length; i++) { total += array[i]; } System.out.println(total); return total; }
SillyArray
also provides setChar
, which makes it possible to modify the characters in the array:
public void setChar(int i, char c) { this.array[i] = c; }
Now suppose we create a SillyArray
and add it to a map:
SillyArray array1 = new SillyArray("Word1".toCharArray()); map.put(array1, 1);
The hash code for this array is 461. Now if we modify the contents of the array and then try to look it up, like this:
array1.setChar(0, 'C'); Integer value = map.get(array1);
the hash code after the mutation is 441. With a different hash code, there’s a good chance we’ll go looking in the wrong sub-map. In that case, we won’t find the key, even though it is in the map. And that’s bad.
In general, it is dangerous to use mutable objects as keys in data structures that use hashing, which includes MyBetterMap
and HashMap
. If you can guarantee that the keys won’t be modified while they are in the map, or that any changes won’t affect the hash code, it might be OK. But it is probably a good idea to avoid it.
In this exercise, you will finish off the implementation of MyBetterMap
. In the repository for this book, you’ll find the source files for this exercise:
MyLinearMap.java
contains our solution to the previous exercise, which we will build on in this exercise.
MyBetterMap.java
contains the code from the previous chapter with some methods you will fill in.
MyHashMap.java
contains the outline of a hash table that grows when needed, which you will complete.
MyLinearMapTest.java
contains the unit tests for MyLinearMap
.
MyBetterMapTest.java
contains the unit tests for MyBetterMap
.
MyHashMapTest.java
contains the unit tests for MyHashMap
.
Profiler.java
contains code for measuring and plotting runtime versus problem size.
ProfileMapPut.java
contains code that profiles the Map.put
method.
As usual, you should run ant build
to compile the source files. Then run ant MyBetterMapTest
. Several tests should fail, because you have some work to do!
Review the implementation of put
and get
from the previous chapter. Then fill in the body of containsKey
. Hint: Use chooseMap
. Run ant MyBetterMapTest
again and confirm that testContainsKey
passes.
Fill in the body of containsValue
. Hint: Don’t use chooseMap
. Run ant MyBetterMapTest
again and confirm that testContainsValue
passes. Notice that we have to do more work to find a value than to find a key.
Like put
and get
, this implementation of containsKey
is linear, because it has to search one of the embedded sub-maps. In the next chapter, we’ll see how we can improve this implementation even more.