In the previous chapter, you learned that one of the essential strategies behind mitigating operational cost risk in your software is to optimize your code for performance efficiency. Imagine, for example, that you want to create a new social media company. Successful social media companies have potentially billions of users with complicated connections between them. If you want people to use your service, you will need to ensure consistent uptime and responsive retrieval of information across the network. This is an extremely difficult engineering challenge.
In order to ensure an optimal experience in your software, you first need to determine how to measure the performance of a program using Big O notation. As you might recall, Big O notation is a measure of the order of magnitude of operations performed on an input data set of length n. In the case of a social media company, the input data will likely be either the number of connections in a particular user’s network or the amount of activity in their feed. But the important thing to note is that performance is always measured based on an input data set, or group of data.
You were previously introduced to a variety of basic data types, some of which, like Lists and Maps, are groups of data. Data structures are also groups of data, available in varying implementations, that we can introduce to our programs in order to maximize their performance. Each of the following data structures, or groups of data, supports a very similar set of operations. We will analyze each operation using Big O notation in order to compare and contrast their varying performance. Once you have a grasp on what each data structure can do, you can start to apply them to your software projects to optimize your program’s algorithmic efficiency. Again, picking the right data structures could determine whether your software is profitable, or even possible.
Arrays
You were briefly introduced to the notion of an array in a previous chapter. Some languages consider lists and arrays to be synonymous, or at least, they abstract the implementation of this type of grouped data away from the user so that the user is not aware of the difference. But an array is a very particular arrangement of grouped data.
Instantiating an array in Scala
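One way this could look is sketched below; the value assigned at the end of the array is purely illustrative.

```scala
// Instantiate an array of 10 positions that will hold String data.
val users = new Array[String](10)

// Fill positions using the index accessor; index 0 is the first position.
users(0) = "me"

// The last position of a 10-element array is index 9.
users(9) = "them"
```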
It is important to identify the type of the array that you wish to instantiate; otherwise, Scala will infer that you want the array to be a group of data of type Nothing. This means you cannot fill it with any meaningful data, which wouldn’t be very useful. In this example, we told Scala that we want this array to contain 10 positions and that those positions will be filled with String type data.
Once you have initialized the array, you can fill it with data using the index accessor. In this example, we added a user, represented by the string "me", to the group’s first position, index 0. Because a reference to the array was stored in the variable users, the computer was able to immediately find where this array was stored in memory. From there, the computer simply adds the requested number of positions to that starting memory location to find the position where you want to store the data. In this example, we are asking the computer to store the value "me" in the starting memory position of the users array plus 0 positions. If we wanted to store a user at the end of the array, we would ask the computer to find the array’s starting memory location plus 9 positions, or users(9). Because accessing memory in this way is just a simple arithmetic operation, finding or assigning data in an array is immediate. The computer doesn’t have to look all over its memory to find where the data is stored; it has an instant reference point. Because the operation is immediate no matter how big the array is, assigning to an index position of an array is of constant algorithmic complexity, or O(1).
If you recall, O(1) complexity is the fastest we can hope to achieve in our software. So, if this is the fastest group of data, why don’t we use it for everything? Well, what if you don’t know how big your group of data is going to be? How could you possibly hope to guess? You might just try to instantiate an array so large that it would never be filled, but that would be a terrible waste of memory. Not to mention, determining the length of the data in your array would not be efficient. For example, determining the length of a full array is O(1) time since that information is decided upon instantiation. However, if you have an array of length 10,000 that holds only 100 actual items while the rest are null, you would have to traverse the entire array to determine how many items it really contains. That would mean you would have to do 1 operation for each index position, which is considered O(n) time complexity.
So, it should be easy to see that the disadvantage of an array is that, in order to realize its full benefits, you must know the size of your data ahead of time. If your data grows too big for the array, you can always instantiate a new array with more positions and move all your items into the new array. However, that would require one assignment operation for each data point, which is O(n) time. So, in situations where you don’t know the size of your group of data, or it needs to grow and shrink fairly dynamically, you might consider using a linked list.
Linked Lists
A linked list is a group of data in which each node keeps track of its own data value and a reference pointer to the next node in the list. To insert a new node between the first and second nodes of an existing list, all that is required is to:
1. Assign the reference pointer of the first node to the new node.
2. Assign the reference pointer of the new node to the second node.
In a full array, an insertion of this nature would require you to instantiate a new array, traverse the existing array to assign all its values to the new array, pause where the insertion should occur to insert the new value, and then continue on assigning the remaining values. Deletion from an array follows the same procedure, except that the removed value is skipped rather than a new value being inserted. Insertion and deletion on an array, therefore, are of O(n) complexity. In a linked list that is free to grow and shrink, both insertion and deletion are simple operations of reassigning pointers once you have reached the point of insertion. Therefore, insertion and deletion on linked lists are considered O(1) complexity operations. This is the major advantage of linked lists over arrays and is especially important when your data structures are required to maintain a certain order, because adding new items to a list that needs to stay sorted requires, by its nature, inserting them at a specified position rather than at the end of the group of data.
The disadvantage of linked lists is that accessing an item in the list is O(n) complexity. Consider a linked list with ten nodes. If you want to access and use the seventh node in the list, you must first access the head and follow its reference pointer to the second node in the list. From the second node, you follow its reference pointer to the third node and so on until you reach the seventh node. In the best case scenario, the item you want to access is the head of the list and it only requires one operation. In the worst case scenario, the item you want to access is at the very end of the list and requires you to traverse the entire list of n nodes, hence the O(n) complexity. We always measure the complexity of data structures based on the worst case scenario since we cannot rely on the best case when assessing risk in our software.
An implementation of a Linked List
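A minimal sketch of such an implementation, with class and method names chosen for illustration, might look like this:

```scala
// Each node keeps track of its own data value and a reference to the next node.
class Node[T](var data: T, var next: Node[T] = null)

class LinkedList[T] {
  var head: Node[T] = null

  // Adding to the end requires traversing the list to find the tail: O(n).
  def add(data: T): Unit = {
    val newNode = new Node(data)
    if (head == null) {
      head = newNode
    } else {
      var current = head
      while (current.next != null) {
        current = current.next
      }
      current.next = newNode
    }
  }
}

val friends = new LinkedList[String]()
friends.add("Kim")
```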
As you can see in this example, all that is required of the individual node in a linked list is to keep track of its own data value and the reference to the next node. This makes reassignment to its surrounding nodes a fairly trivial operation. You’ll notice that we’ve created a method for adding data to the end of the linked list which requires traversal of the list to find the tail. This is an O(n) operation since it is seeking the last node. In an array, finding the last node would be an O(1) operation. Deciding whether to use a linked list or an array depends on what type of operations will be performed most frequently on the group of input data.
You might also notice some new syntax in this example. In both the LinkedList class and the Node class, there is a reference to a generic type [T]. A generic type parameter lets the downstream developer decide what type of data the linked list will use. If I were to instantiate this new linked list, I would assign it to a variable of type LinkedList[String] in order to tell the linked list that it expects all of its Nodes to contain String types in their data property.
It is worth noting that you do not need to write your own linked list implementation for each project. Scala has built-in linked lists that are optimized for performance. The List data type that you were introduced to earlier in this book is an implementation of an immutable linked list. The scala.collection.mutable package also provides list-like structures, such as ListBuffer, if you need a mutable alternative (the older mutable.LinkedList has been deprecated and removed in recent versions of Scala).
Exercise 12-1
Consider which of the data structures covered so far would best fit each of the following scenarios:
1. You are creating a program that keeps track of users in alphabetical order. More users are added to the group every hour.
2. You need to design a program that can list every tenth person from a list for statistical analysis on a regular basis.
Queues and Stacks
Queues and stacks are both specialized implementations of a linked list. The big restriction that both of them share is that neither data structure allows insertion into the middle of the group. However, adding and removing items from a queue or a stack are both O(1) operations. Given this, these data structures might be considered fairly efficient, albeit limited.
A queue is a first-in, first-out (FIFO) structure that follows three rules:
1. Items can only be added to the queue at the tail.
2. Items cannot be inserted into the middle of the queue.
3. Items can only be removed from the queue at the head.
Adding and removing items from a queue
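A sketch of this using scala.collection.mutable.Queue, with illustrative names, might look like the following:

```scala
import scala.collection.mutable

val line = mutable.Queue[String]()

// Items can only be added at the tail of the queue.
line.enqueue("Kim")
line.enqueue("Bob")
line.enqueue("Eric")
println(line.length) // 3

// dequeue removes the item at the head and returns it for further use.
val firstPerson = line.dequeue()
println(firstPerson) // Kim
println(line.length) // 2
```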
As you can see from this example, to add items to a mutable queue in Scala, you use the enqueue method, and to remove items from the queue, you use the dequeue method. The dequeue method not only removes the item from the queue, but it also returns that item so that you can perform operations with it. You’ll notice that we have stored the first item to be removed from this queue in the firstPerson variable. Just like a linked list, you can check the length at any point. Before the dequeuing operation occurs, the length is three and after it is two.
A stack is a last-in, first-out (LIFO) structure that follows three similar rules:
1. Items can only be added to the stack at the head.
2. Items cannot be inserted into the middle of the stack.
3. Items can only be removed from the stack at the head.
Adding and removing items from a stack
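A sketch using scala.collection.mutable.Stack, with illustrative page names, might look like this:

```scala
import scala.collection.mutable

val history = mutable.Stack[String]()

// Items can only be added at the head (top) of the stack.
history.push("home")
history.push("profile")
history.push("settings")
println(history.length) // 3

// pop removes the item at the head and returns it for further use.
val lastViewed = history.pop()
println(lastViewed)     // settings
println(history.length) // 2
```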
Just like the queue, you can add and remove items to and from the stack. When you add an item, you use the push method, and when you remove items, you use the pop method. Items that are popped off the stack are returned from the pop method so that you can use them in your program. In this case we have stored that item in the lastViewed variable for later reference. As you can see, the length of the stack before the pop operation was 3, and after the operation, it is now of length 2.
As you can see, arrays, linked lists, queues, and stacks all have their advantages and disadvantages when it comes to inserting, deleting, and accessing known items in the group of data. But what if you have no idea what values a group of data contains and you want to search for an arbitrary value? All of these data structures perform “search” operations in O(n) time. Might we be able to improve upon that? If so, at what cost? Hash tables and trees may provide an answer to these questions.
Hash Table
A hash table is a data structure that uses a hashing function to determine the location of each item in the group of data. The hash function can have various implementations, ranging from extremely simple to increasingly complex. However, just like any function, a hashing function takes an input (the item to be added, removed, or looked up in the group of data) and returns a single output (the location of the item in the group).
An O(n) search operation on an array
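Without hashing, checking whether a group of friends contains a particular name means potentially examining every position, as in this sketch (the names are illustrative):

```scala
val friends = Array("Amy", "Kim", "Eric")

// In the worst case, every index position must be checked: O(n).
def contains(items: Array[String], target: String): Boolean = {
  for (item <- items) {
    if (item == target) return true
  }
  false
}

println(contains(friends, "Eric")) // true
println(contains(friends, "Bob"))  // false
```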
Modulo division hashing function
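A very simple hashing function of this kind might use the length of the name, wrapped around the capacity of the backing array with modulo division (the capacity of 10 here is an assumption):

```scala
val capacity = 10

// Hash a name to an index position based on its length.
def hash(item: String): Int = item.length % capacity

println(hash("Bob")) // 3
println(hash("Kim")) // 3 -- the same index as "Bob"
```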
Hashing items before insertion and upon searching for items
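Putting the two together, a sketch of hashing on insertion and on lookup might look like this:

```scala
val capacity = 10
val friends = new Array[String](capacity)

def hash(item: String): Int = item.length % capacity

// Store each friend at the index position produced by the hashing function.
def insert(item: String): Unit = friends(hash(item)) = item

// To check for a friend, hash the search term and look at that single index: O(1).
def contains(item: String): Boolean = friends(hash(item)) == item

insert("Kim")
insert("Eric")

println(contains("Eric")) // true
println(contains("Bob"))  // false: index 3 holds "Kim", so "Bob" is not in the group
```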
As you can see, by inserting each friend into the array according to the result of their hash function, we can guarantee an O(1) operation when searching to see if the array “contains” a friend, as demonstrated by the contains function. In the second call to the contains function in this example, we perform the hash function on the input string “Bob” which evaluates to 3. We immediately check in the array at index 3 to see if Bob is there. Since that index position contains the value “Kim,” we can automatically determine that “Bob” is not in the list of friends without needing to search the rest of the array.
You might notice that this methodology will not work if we try to add two friends whose names are the same length. If we wanted to add “Bob” to our list of friends, his hash code would be index 3, which is already occupied by “Kim.” In this situation we encounter what is called a hash collision. To deal with a hash collision, we can insert a list into the index position where the collision has occurred and store any name of that same length in the new inner list. So, index 3 would contain an inner list like so: List("Kim", "Bob"). Then, when performing the contains function, we can iterate through each item in the list at that index position to see if the search term exists in the inner list. This is slightly worse than O(1) time complexity since there is a bit of traversal going on within the inner list. However, it is still far more performant than traversing the entire array. More complicated hashing functions minimize the probability of hash code collisions.
It is important to note that, in order to ensure the uniqueness of hash code values, data structures that use hashes to store values must contain only unique values. You cannot store any duplicate data in hash tables. Also, given how the hash function determines where the data should be stored, it might be obvious to you that data stored via hashing cannot be kept in any kind of sorted order. If you attempt to sort a hash table, you will forfeit the benefits of its O(1) search times.
In the Scala mutable collections library, you can see that the implementation of the HashTable is written as a trait. Therefore, you cannot use a hash table directly in Scala. You can, however, mix the HashTable trait into your own data structure if you desire. That being said, the HashTable trait is used in various implementations of both Maps and Sets, both of which have native Scala collections, which we will cover later in this chapter.
Trees
If you wish to have increased performance when searching for an item in a group of data but you must also mandate that the list maintain some kind of sorted order, a Binary Search Tree data structure might be what you are looking for. The binary search tree will not provide you the same O(1) time complexity achieved when searching for items in a hash table; however, it does allow for better than O(n) search time. There are many types of trees that can be used in software engineering, but the binary search tree is fairly common for searching items in sorted order.
In order to allow for increased search performance, values inserted into the tree must be placed so that the tree remains in sorted order. At no point can a binary search tree ever be unsorted. When adding an item to the tree, if there are no items in the tree, then a root node is created and the value of the item is added to this root node. When the next item is added to the tree, if its value is less than the value of the root node, it is added as a node that the root’s leftChild property points to. If its value is greater than the value of the root node, it is added as a node that the root’s rightChild property points to. If leftChild or rightChild already have a node assigned to them, then the new item will be compared to the value of that node. The new item will continue to traverse the tree of nodes, comparing to the left and to the right, until it finds a null reference point to assign itself to. If the value of the node being inserted ends up being equal to the value of a node that already exists in the tree, then the new node will be ignored. Just like hash tables, the values in a tree must be unique in order to obtain the performance benefits that a tree can provide.
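A sketch of this insertion logic, using Int values for simplicity, might look like the following:

```scala
// Each node holds a value and references to a left and right child.
class TreeNode(var value: Int,
               var leftChild: TreeNode = null,
               var rightChild: TreeNode = null)

// Compare left and right down the tree until a null reference is found.
def insert(node: TreeNode, value: Int): TreeNode = {
  if (node == null) new TreeNode(value)
  else {
    if (value < node.value) node.leftChild = insert(node.leftChild, value)
    else if (value > node.value) node.rightChild = insert(node.rightChild, value)
    // A value equal to an existing node is ignored.
    node
  }
}

var root: TreeNode = null
root = insert(root, 8)
root = insert(root, 3)   // becomes the root's leftChild
root = insert(root, 10)  // becomes the root's rightChild
```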
Note
There are edge cases in tree data structures where the items being added are already in sorted order. In this case, a straight-line tree will occur, where either every leftChild or every rightChild is null, which essentially creates a linked list, nullifying the benefits of the binary nature of the tree. To handle this, many tree structures have built-in balancing algorithms that reorganize the tree to maintain performance. For further study, investigate red-black trees or weight-balanced binary trees.
Just like hash tables, a tree is typically not used alone in Scala. In other words, like the hash table, a tree is an abstract data structure. Instead of using a tree on its own, its organizing principles are generally extended by either sets or maps. If you desire, you can also create your own data structures that extend the capabilities of a tree as well.
Sets
A set, in both mathematics and computer science, is a collection of items that are inherently unique or distinct, meaning that there are no duplicates. Sound familiar? One of the main side effects of the performance gains of both hash tables and trees is that the items contained within them must be unique. Thus, the abstraction of hash tables and trees can be applied to the Set data type.
An example of using a HashSet from the Scala mutable collections library
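A sketch of this example might look like the following:

```scala
import scala.collection.mutable

val friends = mutable.HashSet[String]()

friends += "Kim"
friends += "Bob"   // same length as "Kim"; the built-in hashing handles the collision
friends += "Eric"
friends += "Eric"  // the duplicate is ignored by the set

println(friends)   // order is determined by the hashing function, not insertion order
```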
Notice in this example that we’ve added two names to the set that have the same length, “Kim” and “Bob.” The hashing function that the native HashSet implementation of Scala uses handles the hashing collision that we would have run into if we were using our own modulo division implementation of the hashing function. We also attempt to add “Eric” to the set twice, which is ignored by the set, as you can see when the final set is printed to the terminal. You might also notice that the order of the items the set ultimately prints is not the same order in which they were added. This is confirmation that they are being positioned according to the hashing function rather than according to the order in which they were inserted. Remember, data structures that implement hashes, while gaining O(1) access and searching performance, cannot maintain any kind of sorted order.
An example of using a TreeSet from the Scala mutable collections library
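A sketch of this example, with illustrative numbers, might look like this:

```scala
import scala.collection.mutable

val numbers = mutable.TreeSet[Int]()

numbers += 7
numbers += 2
numbers += 10
numbers += 5
numbers += 10  // the duplicate 10 is ignored

println(numbers)             // TreeSet(2, 5, 7, 10) -- stored in sorted order
println(numbers.contains(5)) // true
```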
In this example, we add a collection of numbers to a set in random order. We also attempt to add the number 10 twice, which, per the properties of a set, is ignored. When printing out the resulting TreeSet, you can see that the items are stored in sorted order. Also, when printing whether or not the set of numbers contains the number 5, we receive a true value that was determined in O(log(n)) time.
Maps
Demonstration of a HashMap and a TreeMap from the Scala mutable collections library
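Like sets, maps are collections of key/value pairs that can be backed by either a hash table or a tree. A sketch of this demonstration, with illustrative names and ages, might look like the following:

```scala
import scala.collection.mutable

val hashAges = mutable.HashMap[String, Int]()
hashAges += ("Kim" -> 28)
hashAges += ("Bob" -> 34)
hashAges += ("Eric" -> 41)

println(hashAges)                 // keys appear in no particular order
println(hashAges.contains("Bob")) // true, matched by key in O(1) time
println(hashAges("Kim"))          // 28, accessed by key in O(1) time

val treeAges = mutable.TreeMap[String, Int]()
treeAges += ("Kim" -> 28)
treeAges += ("Bob" -> 34)
treeAges += ("Eric" -> 41)

println(treeAges)                 // keys stored in sorted order: Bob, Eric, Kim
println(treeAges.contains("Bob")) // true, found in O(log(n)) time
println(treeAges("Kim"))          // 28, accessed by key in O(log(n)) time
```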
This example provides a collection of key/value pairs of the names and ages of the friends in our program. Notice in the terminal output that the items stored in the HashMap appear to be in no particular order. The contains method of the HashMap searches for any matching keys in O(1) time. Accessing an item by key in the HashMap will also occur in O(1) time since the key itself is stored based on the hashing function.
In the TreeMap output, you can see that the keys are stored in sorted order. Determining whether the age of a friend exists in the Map by key is performed in O(log(n)) time. Accessing any individual friend by key is also performed in O(log(n)) time as each friend is stored in the sorted tree structure.
Exercise 12-2
Consider which data structure would best fit each of the following scenarios:
1. Your program needs to access the medical records of a patient by their id as quickly as possible.
2. You have a program that generates error logs based on the order of events, and you want to be able to search the logs for events or display them in order.
3. You want to create a search engine that indexes the entire Internet, and you should be able to search for the URL of each page as quickly as possible.
Performance Reference
A reference guide of Big O algorithmic efficiency for each data structure
| Data Structure | Access | Search | Insertion/Deletion |
|---|---|---|---|
| Array | O(1) | O(n) | O(n) |
| Linked List | O(n) | O(n) | O(1) |
| Stack/Queue | O(1) | O(n) | O(1) |
| Hash Tables∗+ | O(1) | O(1) | O(1) |
| Trees∗ | O(log(n)) | O(log(n)) | O(log(n)) |

∗ Values stored must be unique.
+ Items cannot be kept in any sorted order.
Summary
In this chapter, you learned how to optimize the performance of your programs by weighing the pros and cons of the different data structures that you might introduce to your code. You learned that arrays are optimal for accessing items by index position but inflexible when it comes to inserting or deleting items. In order to gain efficiency in programs that need flexibility in that area, you learned that linked lists can grow and shrink with relative ease. From there, you were introduced to queues and stacks, which are specialized implementations of the linked list. Finally, to gain performance in searching for items in a collection, you learned that Hashes and Trees can be applied to either Sets or Maps to give you optimal algorithmic efficiency.