Dictionary
CollectionsA dictionary is the general term for a collection of key–value pairs. Section 17.9.2 introduced the generic Dictionary
collection. In this section, we discuss fundamentals of how a Dictionary
works, then demonstrate the related SortedDictionary
collection.
Dictionary
FundamentalsWhen an app creates objects of new or existing types, it needs to manage those objects efficiently. This includes sorting and retrieving objects. Sorting and retrieving information with arrays is efficient if some aspect of your data directly matches the key value and if those keys are unique and tightly packed. If you have 100 employees with nine-digit social security numbers and you want to store and retrieve employee data by using the social security number as a key, it would nominally require an array with 1,000,000,000 elements, because there are 1,000,000,000 unique nine-digit numbers. If you have an array that large, you could get high performance storing and retrieving employee records by simply using the social security number as the array index, but it would be a huge waste of memory. Many apps have this problem—either the keys are of the wrong type (i.e., not nonnegative integers), or they’re of the right type but are sparsely spread over a large range.
What’s needed is a high-speed scheme for converting keys such as social security numbers and inventory part numbers to unique array indices. Then, when an app needs to store something, the scheme could convert the key rapidly to an index and the record of information could be stored at that location in the array. Retrieval occurs the same way—once the app has a key for which it wants to retrieve the data record, the app simply applies the conversion to the key, which produces the array index where the data resides in the array and retrieves the data.
The scheme we describe here is the basis of a technique called hashing. Why the name? Because, when we convert a key into an array index, we literally scramble the bits, making a “hash” of the number. The number actually has no real significance beyond its usefulness in storing and retrieving this particular data record. Data structures that use hashing are commonly called hash tables (like class Hashtable
in the System.Collections
namespace). A hash table is one way to implement a dictionary—class Dictionary<K,V>
in the System.Collections.Generic
namespace is implemented as a hash table.
A glitch in the scheme occurs when there are collisions (i.e., two different keys “hash into” the same cell, or element, in the array). Since we cannot sort two different data records to the same space, we need to find an alternative home for all records beyond the first that hash to a particular array index. One scheme for doing this is to “hash again” (i.e., to re-apply the hashing transformation to the key to provide a next candidate cell in the array). The hashing process is designed so that with just a few hashes, an available cell will be found.
Another scheme uses one hash to locate the first candidate cell. If the cell is occupied, successive cells are searched linearly until an available cell is found. Retrieval works the same way—the key is hashed once, the resulting cell is checked to determine whether it contains the desired data. If it does, the search is complete. If it does not, successive cells are searched linearly until the desired data is found.
The most popular solution to hash-table collisions is to have each cell of the table be a hash “bucket”—typically, a linked list of all the key–value pairs that hash to that cell. This is the solution that the .NET Framework’s Dictionary
class implements.
The load factor affects the performance of hashing schemes. The load factor is the ratio of the number of objects stored in the hash table to the total number of cells of the hash table. As this ratio gets higher, the chance of collisions tends to increase.
The load factor in a hash table is a classic example of a space/time trade-off: By increasing the load factor, we get better memory utilization, but the app runs slower due to increased hashing collisions. By decreasing the load factor, we get better speed because of reduced hashing collisions, but we get poorer memory utilization because a larger portion of the hash table remains empty.
A hash function performs a calculation that determines where to place data in the hash table. The hash function is applied to the key in a key–value pair of objects. Any object
can be used as a key. For this reason, class object
defines method GetHashCode
, which all objects inherit. Most classes that are candidates to be used as keys in a hash table, such as string
, override this method to provide one that performs efficient hash-code calculations for a specific type.
SortedDictionary
CollectionThe .NET Framework provides several implementations of dictionaries that implement the IDictionary<K,V>
interface (described in Fig. 21.1). The app in Fig. 21.4 demonstrates the generic class SortedDictionary
. Unlike class Dictionary
—which is implemented as a hash table—class SortedDictionary
stores its key–value pairs in a binary search tree (introduced in Section 19.7). As the class name suggests, the entries in Sort-edDictionary
are sorted by key. For key types that implement IComparable<T>
, the SortedDictionary
uses the results of IComparable<T>
method CompareTo
to sort the keys. Despite these implementation details, we use the same public
methods, properties and indexers with classes Dictionary
and SortedDictionary
. In many scenarios, these classes are interchangeable—this is the beauty of object-oriented programming.
Because class SortedDictionary
keeps its elements sorted in a binary tree, obtaining or inserting a key–value pair takes O(log n) time, which is fast compared to linear searching, then inserting.
Lines 4–6 contain using
directives for namespaces System
(for class Console
), System.Text.RegularExpressions
(for class Regex
) and System.Collections.Generic
(for class SortedDictionary
). The generic class SortedDictionary
takes two type arguments:
the first specifies the type of key (i.e., string
) and
the second the type of value (i.e., int
).
Class SortedDictionaryTest
declares three static
methods:
Method CollectWords
(lines 19–48) inputs a sentence and returns a Sorted-Dictionary<string, int>
in which the keys are the words in the sentence and the values are the number of times each word appears in the sentence.
Method DisplayDictionary
(lines 51–65) displays the SortedDictionary
passed to it in column format.
Method Main
(lines 10–16) simply invokes CollectWords
(line 13), then passes the SortedDictionary<string
, int>
returned by CollectWords
to Display-Dictionary
in line 15.
CollectWords
Method CollectWords
(lines 19–48) begins by initializing local variable dictionary
with a new SortedDictionary<string, int>
(line 22). Lines 24–25 prompt the user and input a sentence as a string
. We use static
method Split
of class Regex
(introduced in the online section of Chapter 16) in line 28 to divide the string
by its whitespace characters. In the regular expression s+
, s
means whitespace and + means one or more of the expression to its left—so the words are separated by one or more whitespace characters, which are discarded. This creates an array of “words,” which we then store in local variable words
.
SortedDictionary
Methods ContainsKey
and Add
Lines 31–45 iterate through the array words
. Each word is converted to lowercase with string
method ToLower
, then stored in variable key
(line 33). Next, line 36 calls Sorted-Dictionary
method ContainsKey
to determine whether the word is in the dictionary. If so, that word occurred previously in the sentence. If the SortedDictionary
does not contain an entry for the word, line 43 uses SortedDictionary
method Add
to create a new entry in the dictionary, with the lowercase word as the key and 1
as the value.
Using the Add
method to add a key that already exists in the hash table causes an ArgumentException
.
SortedDictionary
IndexerIf the word is already a key in the hash table, line 38 uses the SortedDictionary
’s indexer to obtain and set the key’s associated value (the word count) in the dictionary. Using the set
accessor with a key that does not exist in the hash table creates a new entry, as if you had used the Add
method, so line 43 could have been written as
dictionary[key] = 1;
Invoking the get
accessor of a SortedDictionary
indexer with a key that does not exist in the collection causes a KeyNotFoundException.
DisplayDictionary
Line 47 returns the dictionary to the Main
method, which then passes it to method DisplayDictionary
(lines 51–65), which displays all the key–value pairs. This method uses read-only property Keys
(line 59) to get an ICollection<T>
that contains all the keys. Because the interface ICollection<T>
extends IEnumerable<T>
, we can use this collection in the foreach
statement in lines 59–62 to iterate over the keys. This loop accesses and outputs each key and its corresponding value using the iteration variable and the Sorted-Dictionary
indexer’s get
accessor. Each key and value is displayed left aligned in a field width of 12
positions. Because a SortedDictionary
stores its key–value pairs in a binary search tree, the key–value pairs are displayed with the keys in sorted order. Line 64 uses SortedDictionary
property Count
to get the number of key–value pairs in the dictionary.
SortedDictionary
’s KeyValuePair
sLines 59–62 could have also used the foreach
statement with the SortedDictionary
object itself, instead of using the Keys
property. If you use a foreach
statement with a SortedDictionary
object, the iteration variable will be of type KeyValuePair<K,V>
. The enumerator of a SortedDictionary
uses the KeyValuePair<K,V> struct
value type to store key–value pairs. KeyValuePair<K,V>
provides properties Key
and Value
for retrieving the key and value of the current element.
SortedDictionary
’s Values
PropertyIf you do not need the keys, class SortedDictionary
also provides a read-only Values
property that gets an ICollection<T>
of all the values stored in the SortedDictionary
. You could use this property to iterate through the values stored in the SortedDictionary
without regard for their corresponding keys.