It must certainly be apparent to you by now that software engineering requires significantly more consideration than simply programming or, more idiomatically, “coding.” In the previous chapter, you were introduced to an engineering strategy meant to help optimize the performance and minimize operational cost risk in your programs, which was that of determining the appropriate data structures to use for a given scenario. In this chapter, you will be introduced to a set of algorithms that will provide strategies to further optimize your code. An algorithm is a defined set of instructions meant to solve particular problems with contextually similar circumstances, agnostic of the programming language used to solve the problem. An example of a problem that could be solved with algorithms is the Rubik’s Cube puzzle game. Given a particular combination of different colors on each side of the cube, a discrete set of steps can be enacted to solve part of the puzzle. With enough executed combinations of different algorithms to solve color patterns for either an individual side or a layer of the cube, you can solve the entire puzzle.
While the topic of algorithms is exceptionally large, increasingly complicated, and often considered one of the most difficult topics in computer science to learn, its importance cannot be overstated. Knowing some of the basic design principles of algorithms will help you to design your own algorithms in the future, to uncover solutions to previously unsolved problems, and to bring new knowledge to the world. New algorithms tend to become the basis for valuable intellectual property and advanced breakthroughs in technology. In addition to this, understanding existing algorithms will allow you to recognize when you should use previously discovered solutions to a problem rather than reinvent the proverbial wheel on your own. An exhaustive list of core algorithms will not be provided in this chapter, as that could instead be a book on its own. However, this chapter will provide you a solid basis of understanding that can ideally lead you toward a path of continued education on this topic.
In this chapter, we will continue to build out our model operating system. You will first be introduced to the performance tuning strategy of greedy algorithms demonstrated via a merge sort algorithm and a binary search algorithm to look for files in a folder. Afterward, in order to optimally search for files in an entire operating system of folders, you will learn about various graph traversal algorithms. Finally, you will be shown an optimization strategy known as dynamic programming which we will use to issue fuzzy match searching within our operating system.
Greedy Algorithms
Our model operating system has come a long way since we started developing a basic while loop. Let’s imagine that its value has become clear to a group of power users and they have begun to use it at scale. They are starting to create hundreds or thousands of text files in the system, and their businesses have become dependent upon being able to access them dependably. As we know, scale challenges and their associated risks create unique engineering problems. As these power users begin to ask for new features in order to help them manage their files, and subsequently their businesses, we will need to keep the engineering principles we’ve learned thus far in mind.
The first request your users have asked for is the ability to sort the files in a directory. This will make it easier to navigate through the text files when they are printed to the screen via the “list” command. While it is likely that the underlying operating system that our Nebula shell runs on already sorts the files in the directory by name, let’s assume for the sake of example that they are always printed to the terminal in the order in which they were created. Given that, the most intuitive, or “naive,” way to sort these files might be to iterate through each file i and compare i to each file that comes before it. During that comparison, we would find the first file that should come after i, insert i right before it, and then move on to the next iteration. This algorithm is known as insertion sort and, in the worst-case scenario, if the list were sorted in reverse order, each file would have to be checked against every single item that comes before it. This results in quadratic time complexity, or O(n²). Given that quadratic time complexity is quite slow for large inputs, our users will likely not be very happy with that type of solution. Surely we can do better than this “brute force” approach.
The second request that your users have asked for is a way to search for an individual file based on an input keyword. This would make it easier to see if a file exists in a directory without having to scroll through a long list of files that have already been created. To accomplish this, we could implement a naive solution known as sequential search, which performs a “brute force” check of every item in the directory one at a time. This approach takes O(n) time which, albeit better than O(n²), is still fairly slow. How might we satisfy our users’ requests in the most efficient manner possible?
Instead of using naive algorithms, let’s use what are known as greedy algorithms to accomplish both of these requests. A greedy algorithm is an algorithm that, given a series of iterations, chooses the locally optimal solution at each iteration regardless of the effect that choice will have on the future outcome of the algorithm. An example of this might be that of a software engineer trying to maximize their total earnings over the course of their career. Let’s assume, given that software engineers are in high demand, that they are recruited by a new firm every 2 years and that each new firm offers them a significant pay increase. The software engineer would be left with the choice of either staying with their current firm or leaving for the new firm to accept the pay increase. A greedy algorithm would suggest that, for each iteration (every 2 years), the engineer should accept the locally optimal solution. Therefore, in order to maximize their career earnings, they must take the pay increase and leave their current firm every time they are offered more money to leave. For greedy algorithms to produce a correct result, they must satisfy two properties:
1. They must have a “Greedy Choice Property,” meaning that making locally optimal choices derives a globally optimal solution.

2. They must have an optimal substructure, meaning the problem must be able to be broken into smaller sub-problems wherein the optimal solution can be found.
The algorithms that we are going to use that satisfy these two properties are the merge sort algorithm and binary search algorithm. Both of these algorithms employ a useful strategy known as “divide and conquer” in their implementations. The divide and conquer technique is considered an extremely efficient strategy for solving algorithmic problems and can be used when designing your own algorithms outside of the context of the two algorithms that we will use for sorting and searching.
Divide and Conquer
A recursive divide and conquer algorithm that prints the nth Fibonacci number
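A minimal sketch of such a function in Scala might look like the following (the function name fib is an assumption; the book’s actual listing may differ):

```scala
// Naive divide and conquer Fibonacci: each call divides the problem into
// two smaller sub-problems until the base cases (n = 0 or n = 1) are reached.
def fib(n: Int): Int =
  if (n <= 1) n
  else fib(n - 1) + fib(n - 2)

// Print the nth Fibonacci number, here for n = 10.
println(fib(10)) // prints 55
```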
Note
Recall that recursion and recursive functions refer to a function that calls itself within the body of the function. Recursive functions often take the place of iterative loops in the functional programming paradigm. Recursive functions will continue to call themselves repeatedly until a set of criteria is met to keep the function from calling itself again.
As you might have guessed, this implementation of a Fibonacci algorithm is not greedy. In fact, it has a time complexity of O(2ⁿ), which makes it the slowest algorithm we have examined in this book so far. However, it is a great representation of how you can approach solving seemingly impossible problems by breaking them down into smaller and smaller sub-problems. Let’s combine this divide and conquer strategy with the two properties of greedy algorithms to solve our users’ requests for our model operating system.
Merge Sort
The first request from our users is the ability to sort the files in our operating system. Since we know that the insertion sort algorithm is not going to be fast enough to meet our users’ needs, we must identify an algorithm that can sort the files in faster than O(n²) time. Merge sort uses a divide and conquer method that yields O(n log n) time, which is quite fast compared to the naive insertion sort. What a merge sort does is take in two lists that are already sorted and merge them together, taking the lesser of the two head values at each step as it builds up a new list. This satisfies the “Greedy Choice” property of the greedy algorithm pattern since taking the locally optimal choice (the lesser of the two head values) results in a globally optimal result.
For example, assume we have two previously sorted lists of files with values “Alpha.txt”, “Jason.txt”, and “Zordon.txt” in the first list and “Billy.txt”, “Kimberly.txt”, “Tina.txt”, and “Zack.txt” in the second list. The merge sort will examine the first item in both lists and determine that “Alpha.txt” should be added to its newly constructed list and not “Billy.txt”. Then the merge sort will examine what remains of the two lists and compare “Billy.txt” to “Jason.txt”, at which point it will add “Billy.txt” to its newly constructed list which now contains “Alpha.txt” and “Billy.txt” in sorted order. It will continue to examine both lists pulling the lesser value from the head of each list until an entirely new list is constructed in sorted order.
At this point, you might wonder how we can make the assumption that the two lists that are being merged are already in sorted order themselves. The answer to that comes from a recursive divide and conquer function. In our merge sort algorithm, we can continue to divide our list in half recursively until each list essentially contains only one item. This satisfies the optimal substructure property of the greedy algorithm pattern. In our example scenario, once the lists have been divided all the way down to their base case, we would have seven lists with only one item in each. From there, the recursive function that divided those lists will compare the two halves that it divided and merge them in sorted order.
Implementation of the merge sort algorithm
In this example, instead of using a basic List data structure, the optimal data structure to use is an Array because we will need to continually access the midpoint of each list. To know the midpoint of a list, we must first determine its length, which takes O(n) time for a List but O(1) time for an Array. Using an array helps keep the operational cost of our algorithm down. You’ll notice that we are actually using an ArrayBuffer from the Scala mutable collections library. The ArrayBuffer collection implements various convenience methods that handle growing and shrinking the array (via copying to new backing arrays) so that we don’t have to implement them ourselves.
In our first recursive function, mergeSort, we first check our base case, which is that of whether or not the midpoint is 0. The midpoint would be zero only if the length of the list were less than or equal to one, which would be the most basic list. In that scenario, we simply return the list as it is already sorted. If the midpoint of the list is anything besides zero, as denoted by the underscore operator, then we split the list into two lists, left and right, at its midpoint. From there, we call the merge function and pass a recursive call to mergeSort on the left list and a recursive call to mergeSort on the right list, which will continue to divide and merge all the way down to the base case.
The merge function is also a recursive function, but this one has two base cases. The first base case matches if the left list has values but the right list is empty. In this case, it simply returns the left list. The second base case matches if the right list has values but the left list is empty. In that case, the right list is returned. If both lists have values, then the heads of the two lists are compared. The head with the lesser value is added to a new ArrayBuffer, and then that ArrayBuffer is prepended to the result of a recursive merge call on the remnants of the two lists. When all the recursive calls have reached their base cases, returned a value, and calculated back up the recursive chain, a new sorted ArrayBuffer will be returned in O(n log n) time. This should be fast enough to satisfy our users when using our file system.
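The mergeSort and merge functions just described might be sketched as follows (a minimal, self-contained sketch; the book’s actual listing may differ in details):

```scala
import scala.collection.mutable.ArrayBuffer

// Greedily take the lesser head of the two sorted lists at each step.
def merge(left: ArrayBuffer[String], right: ArrayBuffer[String]): ArrayBuffer[String] = {
  if (left.isEmpty) right      // base case: only the right list remains
  else if (right.isEmpty) left // base case: only the left list remains
  else if (left.head <= right.head) ArrayBuffer(left.head) ++ merge(left.tail, right)
  else ArrayBuffer(right.head) ++ merge(left, right.tail)
}

// Recursively divide the list at its midpoint until the base case of a
// list with 0 or 1 items, then merge the sorted halves back together.
def mergeSort(files: ArrayBuffer[String]): ArrayBuffer[String] = {
  val mid = files.length / 2
  mid match {
    case 0 => files // base case: a list of 0 or 1 items is already sorted
    case _ =>
      val (left, right) = files.splitAt(mid)
      merge(mergeSort(left), mergeSort(right))
  }
}
```

Calling mergeSort(ArrayBuffer("Zordon.txt", "Alpha.txt", "Billy.txt")) returns the files in sorted order.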
Note
In many implementations of common data structures in the functional programming paradigm, including the ArrayBuffer, the tail of the data structure refers to every item of the list that is not the head rather than just the last item in the list.
Refactoring the list function to sort the results
Now the command for list splits the user input and calls a lift function on the results to pull out the value at index 1, if it exists. The result of the lift function is an Option type. An Option in Scala is a data type that wraps the underlying data that you are interested in (in this case, we are using an Option that wraps a String) and will indicate to the code whether the data in question exists or not. If it does exist, it is said to be of type Some; otherwise, it is of type None. Some and None are both wrapped up within the type Option, and we can use a pattern match to pull out whether the underlying data actually exists.
In the method signature of our list function, we have added the parameter sort: Option[String]. This tells Scala that we are expecting an Option type to be passed to this function, and if any underlying data exists, it will be of type String. We then pattern match on the sort parameter. If the Option is a Some, meaning it does contain data, then we take the contents of the directory, add them to an ArrayBuffer, pass them to the mergeSort function, and print the sorted results to the console. If the Option is a None, then we just print each item in the directory in its existing order. We can test this command by typing list in our Nebula shell without any additional arguments to see the list printed normally, or list sort (or any other additional argument) to see the list printed in sorted order.
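A sketch of this refactored list command might look like the following; directoryContents is a hypothetical stand-in for reading the file system, and the standard sorted method stands in for the mergeSort call:

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for reading the current directory's file names;
// the real command would pull these from the file system.
def directoryContents(): ArrayBuffer[String] =
  ArrayBuffer("Jason.txt", "Alpha.txt", "Billy.txt")

// The standard sorted method stands in here for the mergeSort function.
def list(sort: Option[String]): ArrayBuffer[String] = {
  val files = directoryContents()
  sort match {
    case Some(_) => files.sorted // any extra argument triggers sorting
    case None    => files        // no argument: keep the existing order
  }
}

// The shell derives the Option by splitting the input and lifting index 1.
val sortArg: Option[String] = "list sort".split(" ").lift(1)
list(sortArg).foreach(println)
```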
Exercise 13-1
The example of merge sort provided was implemented in a completely recursive manner. In order to cognitively cement the concept of merge sort, try re-implementing the algorithm using a procedural paradigm.
Binary Search
The next request from the users of our model operating system is to be able to search for a file to see if it exists. We could simply check every file in the directory to see if it matches a given search term, but how might we use the concepts of greedy algorithms and divide and conquer to perform this operation in a more efficient manner? The binary search algorithm improves the performance of a search function over a sequential search from O(n) time to O(log n) time.
Binary search algorithm implementation
The input parameters to our search function are a keyword, which is the file that you are looking for, and optionally a file list (which is set to None by default), which is used during recursion but not for the initial call. The first thing our algorithm does is check to see if a file list was provided. If it is provided, then it sets the variable files to the provided value. If it is not provided, it initializes a new file list by pulling all of the files out of the current directory, adding them to an ArrayBuffer, and performing our mergeSort function on them. The result of the merge sort is then stored in the files variable. By performing the merge sort, we are satisfying the “Greedy Choice” property that allows us to choose a locally optimal path when we start dividing the list in half and searching through each half of the list. The fact that the sorted list can be continually divided and checked also inherently satisfies the optimal substructure property of the greedy algorithm requirements.
Next, the algorithm initializes a midpoint variable by performing integer division on the length of the list. This midpoint is used to check for values or to continually divide the list in half, similar to how we used the midpoint in the merge sort algorithm. Finally, we perform a pattern match on the length of the file list that contains three cases. The first case is that the length of the ArrayBuffer is 1, in which case there is only one file to check the search keyword against. If it matches, we return the value; otherwise, we return a message that says “No match found.” The next case is if the file list length is 2. In this case, we can simply check both files to see if they match the search keyword or return our failure message otherwise. The final case handles any longer list: the keyword is compared against the file at the midpoint, and the search recurses into whichever sorted half could contain the keyword, discarding the other half entirely. This halving is what reduces the search to O(log n) time.
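The search function described above might be sketched as follows; sortedDirectoryContents is a hypothetical stand-in for reading and merge-sorting the current directory, and an extra guard for an empty list is included for safety:

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for pulling the current directory's files and
// sorting them; the real implementation would call mergeSort here.
def sortedDirectoryContents(): ArrayBuffer[String] =
  ArrayBuffer("Jason.txt", "Alpha.txt", "Zack.txt", "Billy.txt").sorted

def search(keyword: String, fileList: Option[ArrayBuffer[String]] = None): String = {
  // Use the provided list during recursion, or build a sorted one initially.
  val files = fileList.getOrElse(sortedDirectoryContents())
  val mid = files.length / 2
  files.length match {
    case 0 => "No match found." // guard: nothing to search
    case 1 =>
      if (files(0) == keyword) files(0) else "No match found."
    case 2 =>
      if (files(0) == keyword) files(0)
      else if (files(1) == keyword) files(1)
      else "No match found."
    case _ =>
      // Recurse into whichever sorted half could contain the keyword,
      // discarding the other half entirely.
      if (keyword < files(mid)) search(keyword, Some(files.take(mid)))
      else search(keyword, Some(files.drop(mid)))
  }
}
```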
In order to call this search function, we’ve added a command to our NebulaOS command list that checks if the user input contains the word “search.” If it does, we split the input into an array and pull out the second position of that array (index 1) as the search keyword to pass to the search function. Thus, if our user provides the input “search sample.txt”, it will traverse the files in the directory, and if it finds a file named “sample.txt”, it will print “sample.txt” to the terminal; otherwise, it will print “No match found.”
This all works great if the users are happy to continue creating all of their files in a single directory. But at some point, for scalability’s sake, they are going to want to organize their files a little better to make them easier to search. Let’s take our operating system one step further and add the ability to create folders and navigate between them. Then we can put functions in place to search for files either within an individual directory or in the operating system as a whole. In order to do that, we must create a graph data structure to represent the folders in our operating system and implement graph traversal algorithms to search through them.
Graph Traversal Algorithms
By creating a folder structure for our operating system, we are essentially deriving an acyclic graph (in fact, a tree) with a folder at each node and an edge referencing each of its sub-folders. Each node, therefore, would contain a list of file values as a property of the node. Once we have this mental model of our operating system, we can apply various graph traversal algorithms to satisfy the requirement of searching for files at either a global level or at the level of a given folder.
Implementing infrastructure to keep track of current directory in the Nebula OS command line shell
As you can see, when passing an argument to the readLine function to prompt the user for input, we can reference the path variable that was implemented to keep track of the directory path. If the path has a length of 0, then we must be in the root folder, and therefore our original prompt can be displayed as normal. If the path has a length greater than 0, then we can append the path to the prompt to show the user what directory they are in and how they got there.
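A minimal sketch of this prompt logic follows; the “NebulaOS” prompt text is an assumption, not necessarily the book’s exact prompt string:

```scala
// path tracks the trail of directories the user has navigated into.
var path = ""

// Append the path to the prompt only when we are below the root folder.
def prompt(): String =
  if (path.length > 0) s"NebulaOS:$path> "
  else "NebulaOS> "
```

The shell’s read loop would then call something like readLine(prompt()) on each iteration.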
Addition of mkdir and cd commands to the pattern matching expression in the NebulaOS shell
Implementation of the mkDir and changeDir functions, along with helper functions as necessary
In this implementation, the changeDir function first checks if the passed-in argument is the double period operator and if the operating system is not currently in the root directory. If both conditions are true, it takes the curDir argument, splits it into an array, uses the init method to keep all items except the last item, and then joins the array back into a string separated by slashes. It then returns this newly constructed directory so that the operating system can set the curdir variable to the directory one level up from where it was before.
The next thing the changeDir function does, assuming the user did not pass in the double period operator, is check whether the new directory the user requested exists. It does this by passing the curDir argument and the newDir argument to a helper function called dirExists. The dirExists function initializes an array of the files and folders that exist in the directory referenced by curDir. It then iterates through each of those items, passing each one to a stripDir helper function that removes the path prefix from the name, and checks the result against the newDir argument. If it finds a match at any point, it returns true; otherwise, it returns false. Back in changeDir, if the directory that the user is trying to navigate to exists, changeDir appends that directory to the current directory and returns that value to the operating system so that it can set the curdir variable to this new value. If it does not exist, execution falls through to the final else condition, which prints out an error message and returns the curDir that was originally passed into the function with no changes.
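The changeDir, dirExists, and stripDir helpers just described might be sketched as follows, with an in-memory Map standing in for the real file system:

```scala
// In-memory stand-in for the folder structure; the real dirExists would
// list the file system contents of curDir instead.
val folderContents = Map(
  ""      -> List("/docs", "/music"),
  "/docs" -> List("/docs/work")
)

// Remove the path prefix, leaving only the final name of the item.
def stripDir(fullPath: String): String = fullPath.split("/").last

// Check whether newDir exists directly inside curDir.
def dirExists(curDir: String, newDir: String): Boolean =
  folderContents.getOrElse(curDir, List()).exists(item => stripDir(item) == newDir)

def changeDir(curDir: String, newDir: String): String =
  if (newDir == ".." && curDir.nonEmpty)
    curDir.split("/").init.mkString("/") // drop the last path segment
  else if (dirExists(curDir, newDir))
    curDir + "/" + newDir                // navigate down into the new directory
  else {
    println(s"Directory $newDir does not exist")
    curDir                               // no change on error
  }
```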
Addition of the current directory as an argument to the search command
As you can see, a curdir parameter has been added to the method signature for the search function in the Utilities.scala file, and the nebula.scala file passes its curdir value as an argument to the search function. The search function now uses the curdir parameter to initialize the list of files in the None case, and it has to continue to pass that curdir parameter to itself as it iterates through recursive calls. Now that this is in place, users can search for files in their current directory. Next, we’ll need to provide them the ability to search for a file in the entire graph of folders. There are two main algorithms that help us accomplish this, depth first search and breadth first search.
Exercise 13-2
By adding a current directory variable to the search function, we have given our users the ability to search for files locally within a given folder. They will want this same functionality applied to their list command and write command now that folders are available. Refactor the list command and the write command on your own to provide these features.
Depth First Search
Not all algorithms can be implemented in a greedy fashion. In the case of searching a graph for a value, it is not possible to use a divide and conquer strategy since the folders contain values independent of one another. Because of this, we are not able to sort the values in our graph to eliminate part of the search space as we traverse through it. Also, a graph like ours is not binary and could therefore have any number of child nodes per node. This means that we cannot satisfy the optimal substructure or greedy choice properties required of greedy algorithms.
Because of this, we will need to resort to a more naive algorithm that checks every node for the values we are looking for. However, depending on the user’s knowledge of their file structure, they might be able to optimize their search by choosing one of two graph traversal algorithms. The first algorithm is called depth first search.
Implementation of depth first search algorithm
In Listing 13-9, we start by adding a search command with a -dfs parameter, signifying that the user wishes to use depth first search. Ensure that this command is placed before the original search command in our pattern matching expression so that it does not match the original search command first and ignore our depth first search command. This depth first search command should take a keyword in index position 2 after the command keyword and the -dfs parameter. Next, we define our depth first search function that is referenced in the command. The only argument it takes is the keyword to be searched for.
We start the algorithm by initializing a Stack data structure, imported from the scala.collection.mutable library, and adding the root directory to it. We call this stack foldersToSearch. Next, we initialize a variable called folderFound, which we set to an empty string for now. After that, we start a while loop that kicks off our graph traversal. That while loop will continue to iterate until the stack is empty or the folderFound variable contains a value. In the while loop, we first pop the top item off the stack and then pass that value to our binary search algorithm to search for the keyword in that directory. If the value is found, it is added to the folderFound variable and our algorithm is done. If the keyword is not found in that directory, the node that was searched is passed to a helper called extractFolders. This function checks all the contents of the given folder and pulls out only the items that are other folders, that is, other nodes in the graph. Once extracted, these nodes are all pushed onto the stack for future investigation, ensuring that the while loop continues to traverse them. The loop will continue to iterate until there are no more nodes to visit or until it finds the folder that contains the keyword. Once we break out of the loop, we print to the terminal where the keyword was found or that it was not found at all.
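A self-contained sketch of this depth first search follows; the in-memory maps stand in for the file system, and a simple contains check stands in for the binary search and extractFolders helpers:

```scala
import scala.collection.mutable.Stack

// In-memory stand-ins for the folder graph and each folder's files.
val subFolders = Map(
  "/"          -> List("/docs", "/music"),
  "/docs"      -> List("/docs/work"),
  "/music"     -> List(),
  "/docs/work" -> List()
)
val filesIn = Map(
  "/"          -> List("readme.txt"),
  "/docs"      -> List("notes.txt"),
  "/music"     -> List("song.txt"),
  "/docs/work" -> List("report.txt")
)

def depthFirstSearch(keyword: String): String = {
  val foldersToSearch = Stack("/") // begin at the root directory
  var folderFound = ""
  while (foldersToSearch.nonEmpty && folderFound.isEmpty) {
    val current = foldersToSearch.pop() // visit the next node
    if (filesIn(current).contains(keyword)) // stand-in for the binary search
      folderFound = current
    else
      // extractFolders stand-in: push child folders for future investigation
      subFolders(current).foreach(f => foldersToSearch.push(f))
  }
  if (folderFound.nonEmpty) s"$keyword found in $folderFound"
  else "No match found."
}
```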
Note
If you recall, the binary search function makes a call to our merge sort function. Thus, both of these algorithms are employed each time a node is visited in these graph traversal algorithms. Because the graph traversal algorithms are not greedy, it is important that the algorithms that they depend on are performant since stacking algorithms on top of algorithms can add a lot of time complexity and potentially operational cost to our programs.
Breadth First Search
Implementation of breadth first search algorithm
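Breadth first search visits all of a node’s siblings before descending a level deeper. Structurally, the only change from the depth first search sketch is swapping the Stack (last in, first out) for a Queue (first in, first out); the in-memory maps again stand in for the file system:

```scala
import scala.collection.mutable.Queue

// Same hypothetical folder graph as the depth first search sketch.
val subFolders = Map(
  "/"          -> List("/docs", "/music"),
  "/docs"      -> List("/docs/work"),
  "/music"     -> List(),
  "/docs/work" -> List()
)
val filesIn = Map(
  "/"          -> List("readme.txt"),
  "/docs"      -> List("notes.txt"),
  "/music"     -> List("song.txt"),
  "/docs/work" -> List("report.txt")
)

def breadthFirstSearch(keyword: String): String = {
  val foldersToSearch = Queue("/") // begin at the root directory
  var folderFound = ""
  while (foldersToSearch.nonEmpty && folderFound.isEmpty) {
    // Dequeue from the front so siblings are searched before children.
    val current = foldersToSearch.dequeue()
    if (filesIn(current).contains(keyword)) // stand-in for the binary search
      folderFound = current
    else
      subFolders(current).foreach(f => foldersToSearch.enqueue(f))
  }
  if (folderFound.nonEmpty) s"$keyword found in $folderFound"
  else "No match found."
}
```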
Our users now have the ability to create folders, navigate between them, add files to folders, search within a folder, and search globally using both depth first search and breadth first search. However, their search is still predicated on matching a keyword exactly. We could add in some logic to match on sub-strings rather than full strings, but that does not protect our user in the event that they made a small typo. How might we still return a matching result in the event our user made a mistake? To do that, we could introduce fuzzy matching using an algorithmic strategy known as dynamic programming.
Dynamic Programming
Dynamic programming is an algorithm design strategy that seeks to minimize the operational cost of exhaustive operations, such as the brute force searching in our graph traversal algorithms. It does this using a process known as memoization. Memoization refers to storing the values of previously solved sub-problems in a hash table during the execution of an algorithm. For example, Figure 13-1 depicted the recursive call chain that our Fibonacci function created while employing a recursive divide and conquer algorithm. If you look closely, that recursion chain ends up calculating the same value several times throughout the lifespan of the algorithm. To optimize this, any time the function needs to calculate the nth Fibonacci number, it could first check a memoized hash table of known Fibonacci numbers. If the answer is there, it does not need to be recalculated. If it is not, the function calculates it on its own and stores the result in the hash table for future iterations to check. This cuts down the execution time of the algorithm significantly.
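As a sketch of memoization, the exponential Fibonacci function from earlier can be made fast with a hash table (a hypothetical reworking, not a listing from the book):

```scala
import scala.collection.mutable.HashMap

// Memoization table of previously solved sub-problems.
val memo = HashMap[Int, BigInt]()

// Each Fibonacci value is computed at most once; subsequent requests are
// answered from the hash table instead of recursing again.
def fib(n: Int): BigInt =
  memo.getOrElseUpdate(n, {
    if (n <= 1) BigInt(n) else fib(n - 1) + fib(n - 2)
  })

println(fib(50)) // fast, despite the naive version's O(2^n) blowup
```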
In our model operating system, we want to add a way for our users to search for files in graph nodes that don’t exactly match the input keyword. This would allow for minor spelling differences in the input search term or account for files that were unintentionally saved with a bad filename. One way to do this would be to calculate all the possible changes you could make to the input search term to make it match the file name in question and then return any file that requires fewer than a certain threshold number of changes. This type of calculation is known as an edit distance calculation, and the Levenshtein distance algorithm can help us determine the minimum number of operations that could be performed on one string to turn it into another.
Levenshtein Distance
Given two strings to compare, the Levenshtein distance algorithm provides three operations to transform the first string into the second string: insertion, deletion, or replacement. Each operation that needs to be performed in the comparison incurs a cost that is counted. This dynamic programming algorithm tallies up all possible transformations to turn the first string into the second string and then returns a count of the minimum number of transformations required to satisfy the comparison.
Levenshtein distance implementation function
In this example, the first thing that we do is initialize a hash map called memoizedCosts that will store the cost of operations for any two strings of given integer lengths. This satisfies the memoization property of the dynamic programming design pattern. Next, it creates an inner function that takes the length of the first string and the length of the second string as input parameters. The first thing the inner function does is check to see if it has already calculated a minimum cost for two strings of those lengths (with the characters defined at their given index position) that it has already stored in the hash table. If it has, then the inner function is complete. If it has not, then it must calculate the value and store it in the hash table. This is all wrapped in the getOrElseUpdate method call.
Next, the algorithm does a pattern match on the lengths of the strings which evaluates two base cases. First, a scenario wherein the first string has some length and the second string has length 0. The cost of transforming the second string of length 0 into the first string is the length of the first string since we will need to insert every character from the first string. Second, it performs this same logic but in reverse. After the base cases have been evaluated, in the event that both strings have some length, it initializes a list that contains three recursive function calls.
The first call represents a scenario wherein the last character position of the first string could be considered deleted. This would be the only operation needed if, for example, the first string were “turns” and the second string were “turn”: all we would have to do is delete the last character of the first string. In that scenario, we count the cost (1) and then add in the cost of performing these evaluations on the remaining sub-string of string one against all the characters of string two.
The second call in the list represents a scenario where instead of deleting the last character position in the first string, we insert the last character in string two into the last position in string one. In that scenario, we now know that the last character from string two should now match, so we only need to finish comparing string one to a sub-string of all characters in string two except the last character. So we add one to a recursive call that removes one character length from string two.
The last recursive call checks to see if the last character positions in both strings actually match. If they do, the cost is zero, and if they don’t, the cost of replacement is 1. Once we’ve determined what the cost will be for a replacement operation, we subtract one from the character lengths of both strings and advance it through recursion. In our Fibonacci recursive function, you were able to see how each iteration through the recursion split out two additional function calls into a tree that evaluated all the way down to the base case in each branch. In this edit distance algorithm, each of these recursive calls will make three additional recursive calls all the way down to their base cases. However, in many of those calls, the cost will have already been calculated and stored in the memoization table, thus speeding up the algorithm. This is the power of dynamic programming.
It is worth noting that the unique part of this algorithm is that, as it returns values up through the recursion chain, the function keeps only the minimum cost evaluated for the three scenarios in the list. This is denoted by the call to the min method at the end of the list initialization. This demonstrates how useful dynamic programming can be in optimization problems that seek to find a minimum or maximum value from a set of scenarios. You will see these types of strategies employed in many algorithms that seek to find the shortest path between two nodes on a graph.
After defining the inner function, we actually call that inner function with the lengths of the two strings provided to the outer function. This call to the inner function builds up our memoization table. Once it has been built, we can access the key in that table that represents the lengths of the two strings that we are comparing to find the minimum edit cost of the two strings, and return that as the overall value of the outer function.
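A self-contained sketch of the Levenshtein distance function as described, using a memoized inner function keyed on prefix lengths (the name memoizedCosts follows the description above; other details are assumptions):

```scala
import scala.collection.mutable.HashMap

def editDistance(a: String, b: String): Int = {
  // Memoization table keyed by the prefix lengths of the two strings.
  val memoizedCosts = HashMap[(Int, Int), Int]()

  // Minimum cost to turn the first i characters of a into the first j of b.
  def lev(i: Int, j: Int): Int =
    memoizedCosts.getOrElseUpdate((i, j), (i, j) match {
      case (x, 0) => x // base case: delete every remaining character of a
      case (0, y) => y // base case: insert every remaining character of b
      case _ =>
        List(
          lev(i - 1, j) + 1, // deletion of a's last character
          lev(i, j - 1) + 1, // insertion of b's last character
          lev(i - 1, j - 1) + (if (a(i - 1) == b(j - 1)) 0 else 1) // replacement
        ).min
    })

  lev(a.length, b.length)             // builds up the memoization table
  memoizedCosts((a.length, b.length)) // minimum edit cost of the full strings
}
```

With this in place, the search functions can treat editDistance(keyword, file) < 2 as a match rather than requiring strict equality.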
Given this useful functionality, instead of comparing equality in our three search functions, you can check to see if the edit distance of the search keyword and the file being evaluated falls below a certain threshold (you can decide whatever threshold you want that to be). In my case, I elected to check if the edit distance between the words was less than 2, meaning that there could only be at most a one-character difference between the words for my operating system to consider it a match. Now my users have a small margin for error in spelling differences when searching for files which adds a marginal level of convenience that hopefully contributes to the overall user experience of the application.
Exercise 13-3
Go back and refactor the three search functions in your operating system. Instead of checking the keyword against the files in the operating system for equality, try comparing their edit distance to a threshold number. If the edit distance falls below the threshold, return the file as a match.
Summary
In this chapter, the algorithm design strategies of divide and conquer, greedy algorithms, and dynamic programming laid the groundwork for your future exploration into the vast world of known algorithms. You were exposed to a few common algorithms, namely, merge sort, binary search, depth first search, breadth first search, and Levenshtein distance. Having a base understanding of known algorithms is important; however, knowing how to design your own algorithms can be just as useful in constructing large-scale software engineering projects. As you might have noticed, we layered each algorithm on top of the others in our operating system as we built up functionality. This demonstrates how important it is to ensure that each function is as efficient as possible because, when combined with other operations, programs can become incredibly computationally costly. Knowing the right algorithm for the job, and how much time complexity it adds to your program, might be the difference in determining whether your project is practical or even possible.