Chapter 13

Sorting and Searching

Some of the most fundamental things that we do on computers are to order data according to certain values and to search through data for various things. These activities are known as sorting and searching, respectively. Due to their importance in solving many different types of problems, a lot of work has gone into doing them as quickly and efficiently as possible. In this chapter we will start your exploration of the topics by working through some of the simpler approaches.

13.1 Basic Comparison Sorts

There are many different ways in which to sort data. The most general ones do not have many requirements on the nature of the data. All they require is the ability to compare elements to see if one should come before another. These comparison sorts are the ones that are used most frequently, mainly because they are so flexible. There are limits on how fast they can be because they are not customized to the data, but typically their generality trumps speed.

At this point we are not even going to consider the fastest comparison-based sorts. Instead, we will start with some of the conceptually simpler sorts. Before we do that though, you should take a second to think about how you sort things as a human. There are some different scenarios you could consider. When you were younger, odds are that you were given assignments in school where you had to sort a list of words into alphabetical order. What approach did you take to doing that? Could you write down a step-by-step procedure to explain to a small child who was doing it for the first time how he/she might go about it?

The procedure that you use might vary depending on what you have to work with. In the list of words, you might need to write it down as a sorted list and you do not want to have to erase and move things around if at all possible. You might be sorting folders in a filing cabinet. The folders can slide backward and forward. They do not have any set position, and moving a folder is not all that hard to do. On the other hand, imagine the silly situation of having to sort cars in a parking lot by their license plates. The cars do not move easily. Each move takes effort and they certainly do not slide around.

Take some time to consider how you would sort in these different scenarios. How does the type of object impact your approach? Space can also be an issue. If you are sorting folders in a big room you might pick a different approach than if you are in a small closet with a file cabinet. For the cars, you might pick one approach if you are moving the cars from one long row to a completely different row than you would if there was only one open space for you to park cars in as you move things around.

The sorts that we are going to work with here not only work with a basic comparison operation, they are also intended to work with arrays and can work in-place. This implies that they can be done using just a single array. You do not have to make a copy. Not all sorts have this property. So this is like the parking lot with only a few open spaces.

13.1.1 Bubble Sort

The first sort that we will describe is something of a classic in Computer Science. It is not efficient. In fact, you would be hard pressed to convince any person to use this sort of method when sorting things by hand. It works for computers because they will do whatever you tell them, even if it is extremely repetitive and pedantic. It is only taught broadly because it is so remarkably simple to describe and code.

The basic idea of the bubble sort is that you want to run through the array and look at items that are next to one another. If two items are out of order, you swap them. One pass through the array will not get it sorted unless it was very close to sorted to begin with. However, if you repeat this process over and over, the array will eventually get to a situation where it is sorted. Using this description, there are a few ways in which the bubble sort can be written. They all have an inner loop that looks something like the following.

  for(i <- 0 until a.length-1) {
  if(a(i) > a(i+1)) {
   val tmp=a(i)
   a(i)=a(i+1)
   a(i+1)=tmp
  }
  }

The end index of this loop can change over different iterations to make it more efficient, but the idea is the same. The value i runs through the array comparing each element to the one after it. If the first is larger than the second, then they do a swap.

The swap code is the three lines inside of the if. To picture what this is doing, imagine the analogy of cars in a parking lot. The tmp variable is an extra space where you can move one car. So you start by moving the first car to the empty space. After that you move the second car into the space the first one had been in. To finish it off, you pull the car from the extra space into the second spot. The car analogy is not perfect. When we move a car from one spot to another, the spot that is vacated is left empty. When we do an assignment, the variable or Array location we are getting the value from does not lose the value. Instead, we have two memory locations that both have references to the value. That would be like having a second version of the car. Much of the time this is only a technicality, but there are times when it can be significant as those extra references can hold onto memory that you really are not using anymore.

In order to do a full sort, this loop needs to be repeated over and over again. How many times must it be repeated? To answer that we can observe that each time through, the largest unsorted value is pushed all the way to where it belongs. If there are n elements in the Array, then after n − 1 passes everything will certainly be in place.1 Using this logic, we can write the following code.

def bubbleSort(a:Array[Double]) {
  for(j <- 0 until a.length-1) {
  for(i <- 0 until a.length-1-j) {
  if(a(i) > a(i+1)) {
   val tmp=a(i)
   a(i)=a(i+1)
   a(i+1)=tmp
  }
  }
 }
}

In this code, the outer loop executes n − 1 times as the value j counts up from 0 to n − 2. You will notice one other change in the inner loop. The end value for i has been changed from a.length-1 to a.length-1-j. This is based on the observation that after one pass, the largest element is in the correct location at the end. After two passes, the second largest element is also in the correct place and so on. For this reason, the inner loop can stop one spot earlier each time through because there is no need to check those values that we already know are sorted. The sort would still work fine with subtracting off 1, but it would do roughly twice as much work.

The other thing that this full sort includes that was not apparent from just the inner loop is that this function has been written to sort an Array of Doubles. As written, this will not work if you pass in anything other than an Array[Double]. If you need to sort something else, you have to write a different sort. There are many ways in which this is less than ideal, especially considering that everything except the type declaration would work equally well for Array[Int], Array[String], or an Array of anything else that works with greater than. We will talk about how to get around this in chapter 19.

Technically, this version of bubble sort might get the data sorted and then keep working on it. Smaller elements only move forward one slot on each pass. If the data is not too far out of order so that nothing has to move all that far forward, the sort might get the data in order before the n − 1 iterations are complete. For this reason, you might want to use a flagged bubble sort. This is a small variation on the normal bubble sort that uses a while loop for the outer loop. It exits after n − 1 iterations or if it goes through a pass of the inner loop without doing any swaps as that implies that everything is already sorted.

def flaggedBubbleSort(a:Array[Double]) {
  var flip=true
  var j=0
  while(flip && j<a.length-1) {
  flip=false
  for(i <- 0 until a.length-1-j) {
  if(a(i) > a(i+1)) {
   val tmp=a(i)
   a(i)=a(i+1)
   a(i+1)=tmp
   flip=true
  }
  }
  j+=1
  }
}

This code adds a few more lines to deal with the flag and the fact that we have to do the counting ourselves with a while loop. The flag is called flip and it starts off as true. However, before the inner loop starts it is always set to false and only get set back to true when a swap is done.

13.1.2 Selection Sort (Min/Max)

The next sort we want to consider is one that humans are far more likely to employ, especially in a situation like sorting the cars where moving cars around takes a significant amount of effort. The sort is the selection sort and it is called this because it runs through the values and selects one and places it where it belongs. In order for this to work you have to know where the value should go. This is easy if the value is either the largest or the smallest in the collection. As such, the selection sort is typically implemented as either a minimum sort (min sort) or a maximum sort (max sort). In theory one can also write a min-max sort that picks both and moves them to where they belong.

We will write a min sort here. It is fairly trivial to convert this to a max sort. The inner loop of the min sort will run through all of the currently unsorted collection and find the minimum element in that. If that element is not already where it should be, it is swapped into place. For an Array, the fact that this is a swap is significant. To understand this, think about the cars. Shifting all the cars over one is hard. Swapping two is much easier. The Array is like the parking lot with fixed locations. Certain other data structures could act more similar to file folders which slide around easily and might allow shifting. If you use an Array and a shift instead of a swap, the code you produce will be both longer and significantly slower.

def minSort(a:Array[Double]) {
  for(j <- 0 until a.length-1) {
  var min=j
  for(i <- j+1 until a.length) {
   if(a(i) < a(min)) min=i
  }
  if(min!=j) {
   val tmp=a(j)
   a(j)=a(min)
   a(min)=tmp
  }
  }
}

As with the standard bubble sort, the outer loop happens n − 1 times as it has to swap n − 1 elements into the locations they belong. The contents of the inner loop here are quite short, just a check that changes the min variable if needed. The swap itself is longer than the loop.

The selection sort is not really more efficient than the bubble sort in general. However, because the swap is outside of the inner loop, this sort can be much more efficient in those situations where the swaps are expensive. This situation can not happen for normal Arrays in Scala, but it can be true for Arrays in some other languages or if the values being sorted are contained in a file instead of the memory of the computer.

13.1.3 Insertion Sort

The last sort we will discuss in this section is another one that humans might consider using. It is called the insertion sort because the way it works is to build up a sorted group of elements at the beginning of the Array and each new element is inserted into the proper location in that group. The advantage of this sort is that it can do less checking. Once it finds the place to put the element it can stop and move to the next element. It is particularly efficient if the Array is fairly close to sorted as each element typically does not have to be moved very far.

The code for an insertion sort can be written in the following way.

def insertionSort(a:Array[Double]) {
  for(j <- 1 until a.length) {
  var i=j-1
  val tmp=a(j)
  while(i>=0 && a(i)>tmp) {
   a(i+1) = a(i)
   i -= 1
  }
  a(i+1) = tmp
  }
}

The outer loop here still executes n − 1 times, but in a different way and for a different reason. Instead of starting at 0 and stopping one short of the last element, this loop starts at 1 and goes to the last element of the Array. The reason for this is that an Array of one element can be viewed as a sorted Array. So we take the first element as being where it should be in a group of 1. Then we start with the second element and potentially move it forward if needed.

The code inside the outer loop can use some explaining as well, as this is a bit different from what we saw with the other two sorts. It begins with the declaration of the variable i, which begins at j-1. This is the index of the next element that we need to compare to the one we are moving. We also declare a temporary variable called tmp that stores the value in the Array that we are moving forward. Having the temporary variable here prevents us from doing full swaps and makes the code both shorter and faster. To understand this, consider again the example of moving cars in a parking lot. Consider you have ten cars, the first eight of which are already sorted. You need to move the ninth one forward to the place it belongs, potentially several spots down. You would not do this by swapping cars one after the other. Instead, you would pull that car out and put it in the one empty spot to the side. Then you would move the sorted cars down, one at a time, until you had cleared out the proper spot. At that point you would drive the car from the spot on the side into the correct position.

That mental image is exactly what this is doing. The while loop executes as long as we have not gotten down to the beginning of the Array and the element we are considering is greater than the one we are inserting. As long as that is true, we move the element up and shift our attention down to the next element. When we are done, we put the temporary value where it belongs.

13.1.4 Testing and Verifying Sorts

The last three subsections simply presented sorts as working products. In reality, sorts are like any other code and have to be tested. Fortunately, sorts are fairly easy to test. After the sort has completed, you should be able to run through the elements and check that each element is less than or equal to the one after it. As long as this is true for all of the elements, the array has been properly sorted. An imperative approach to doing this might look like the following.

def isSorted(a:Array[Double]):Boolean = {
  for(i <- 0 until a.length-1) {
  if(a(i)>a(i+1)) return false
  }
  true
}

This code does something that we have not seen before. It uses a return statement. We have seen that normally, the last statement in a function should be an expression and the value of that expression is what the function returns. It is also possible to force a function to return earlier using a return statement as is done here. If at any point we find consecutive elements that are out of order, there is a forced return of false. If it gets all the way through the array it returns true.

This style is frowned on by some because it complicates things for those reading the code. Normally when you see a for loop, you know that it will run through the entire collection and the code will exit when it gets to the end of the collection. The return statement forces the for loop to exit earlier and as such, forces the reader to look more closely to figure out what is going on here. Code that behaves in slightly unexpected ways like this is a common source of errors, especially if the code gets modified later on.

You could get around having the return statement by using a while loop. This would remove the requirement of a return statement, but it does not really make the code more functional or easier to read. If anything, it might make it a bit harder to read and deal with. An alternative to this is to use the forall method of collections. The only challenge in this is picking a way to compare sequential elements using forall, as that method only works on one argument at a time. One way to get around this is to use forall on a Range that is the index. This is logically very similar to the original for loop except that your part of the code is completely functional.

def isSorted(a:Array[Double]):Boolean = {
  (0 until a.length-1).forall(i => a(i) <= a(i+1))
}

The fact that random access is a fast operation on an Array allows this code to be efficient. A second approach, which is also functional, is to use forall on a zipped tuple as is shown here.

def isSorted(a:Array[Double]):Boolean = {
  (a,a.view.tail).zipped.forall(_ <= _)
}

This code is a bit shorter, but significantly more advanced. In fact, it uses advanced topics from sections 8.5 and 10.4. The approach here is to make a tuple that contains the original elements first and everything but the first element second. If we zipped these two collections with the zip method, the result would be a sequence of tuples with each element and the one after it in a tuple and the first and last element appear only once in the first and last tuples. That zipped collection could then be dealt with using foreach. Such a solution would be very inefficient however, as it would actually construct a lot of new objects and do a lot of copying. The advanced topics are used to make it more efficient.

The first advanced topic is the view from section 8.5. A view is a representation of a collection that does not make a full copy unless forced to do so. In this code, taking a tail of a normal Array would produce a whole new Array and copy a lot of data over into it. However, the tail of the view is just a small object that will interact with code the same way the tail would, but through logic instead of making a real copy.

The second advanced topic is the use of the tuple zipped method from section 10.4. This is a method of the 2-and 3-tuple types that allows you to more efficiently run common functions that need to go through two or three collections at the same time. Instead of making a whole new zipped collection of tuples, this gives you back an object that has methods like map, filter, and forall declared so they take as many arguments as are in the tuple. So we get a version of forall, which needs a function of two arguments we can easily provide using the shorthand syntax.

Whichever approach you choose, you can test a particular sort with code like the following.

val nums=Array.fill(args(0).toInt)(math.random)
flaggedBubbleSort(nums)
assert(isSorted(nums))

This code creates an Array of random Doubles with a length determined by a command-line argument. It then calls a sort on that Array and then uses the assert function to make sure it is true. Assert is a standard function in Scala that can be called with one or two arguments. The first argument is a Boolean that should be true. If it is not true, an error results terminating the program. A second argument can be provided that is passed by-name and should provide a useful message if the assert fails.

The assert function can be used generally in your code. There is also a require function that performs similarly, but should be used for argument values. So at the top of a function, if the function will only work on certain argument values, you can use a call to require to make it so the code fails in an informative manner when bad arguments are passed in.

13.1.5 Sort Visualization

It can help you understand sorts if you get to see them in action. There are many different websites that have animations that will show you how these and other sorts work. One advantage of having already studied graphics is that we can write our own code to visualize what the sorts are doing. This is part of why the sorts above were set up to work with the Double type. It makes it easy to generate with the call to math.random and to draw because the random numbers are uniformly distributed between 0.0 and 1.0.

To set things up for the visualization, we will use the nums Array from above and put in some other code before the sort is called. We will start with the basics.

val img=new BufferedImage(nums.length,500,BufferedImage.TYPE_INT_ARGB)
val panel=new Panel {
  override def paint(g:Graphics2D) {
 g.drawImage(img,0,0,null)
  }
  preferredSize=new Dimension(img.getWidth,img.getHeight)
}
val frame=new MainFrame {
  title="Sort Animation"
  contents=panel
  resizable=false
}

This creates an Image that has a width of the number of values we have in the Array with a height of 500. It then makes a Panel that draws the Image and has a preferredSize equal to the size of the Image. Lastly, we make a Frame and add the Panel into it. We also use a flag in the Frame that we have not used before to make it so the Frame can not be resized.

With this framework in place, we can define a method that will render an Array of Doubles to the Image. The function should also have the ability to draw two marker lines to help indicate what is happening in a selection sort where we really do not move things very often, but spend a lot more time searching for the smallest element.

def renderSort(a:Array[Double],m1:Int = -1,m2:Int = -1) {
  import java.awt._
  val g=img.createGraphics
  g.setPaint(Color.white)
  g.fillRect(0,0,a.length,500)
  g.setPaint(Color.red)
  for(i <- 0 until a.length) {
  g.drawLine(i,500,i,500-(a(i)*400).toInt)
  }
  g.setPaint(Color.black)
  g.drawLine(m1,0,m1,50)
  g.setPaint(Color.green)
  g.drawLine(m2,0,m2,50)
  panel.repaint()
  Thread.sleep(20)
}

Starting from the top, this function takes the Array and two Ints. The Ints, locations for the markers, have default values of -1. This mean that you can call this function and pass only the Array if you do not need the markers, and they will be drawn out of bounds. The first line in the function is a local import of the java.awt package. Local imports are something that Scala allows so that you do not have to pollute the namespace as much importing everything at the top of the file. Instead you can import what you need, when and where you need it.

The middle of the function is basic drawing. It gets the Graphics2D for the Image and clears the Image with white. Then it draws in red, drawing one line for each element of the Array. The lines go from the bottom up and are taller for larger values. After drawing the Array values the markers are drawn in two different colors and the panel is told to repaint.

The last line of this method is something that we have not seen before, but which is required if we want to actually be able to see what is going on. The Thread.sleep(millis:Long) method causes the current execution to pause for the specified number of milliseconds. The number 20 was chosen to give 50 updates every second. This is slow enough you can see what is going on, but fast enough that you can sort a decent number of elements in a reasonable amount of time. We will talk much more about threads in chapter 21.

Now all that remains is to add some calls to renderSort in the sort algorithms. Putting them into the inner loop works best. For minSort you need to pass i and min as the marker values. The others do not need to draw the markers.

13.1.6 Order Analysis

We said previously that the sorts presented above are not the most efficient sorts. These sorts are presented first because they are fairly simple and straightforward. However, it is a reasonable question to ask how "fast" they really are. This is a challenging question to answer because speed can depend on many different things including the computer you are running on and exactly what you are sorting. To address this, algorithms are generally described in terms of their "order" in a particular operation. The order is a rough functional description of how the number of operations scales with the size of the input. We will talk in more detail about order in section 24.3. For now we will use the term fairly loosely.

So the idea of order analysis is to pick one or more operations that are particularly important to the algorithm and try to find a function that models how the number of those operations changes with the size of the input. In the case of sorts, there are two operations that are particularly significant, comparisons and assignments. Most of the time we worry about comparisons and for the code we have written here, that is appropriate.

So how many comparisons does the full bubbleSort do on an Array with n elements? The first pass through the loop does n − 1 comparisons. The second pass does n − 2. The third does n − 3 and so on. In math terms we can write the number of comparisons in this way:

C = i=1n1i = n(n1)2 = 12 n2  12 n.

This gives the exact number of comparisons between elements that will be done. It happens to be be exactly the same for the selection sort. We typically refer to this as O(n2). For order analysis we generally ignore coefficients and everything except the highest power. This does not provide a perfect picture of performance, but it works very well in general, especially when you talk about larger inputs.

The most important thing to note about an O(n2) algorithm is that if you double the number of elements, there will be four times as many comparisons done. As a result, you can generally expect the sort to take four times longer on twice as much stuff. The nature of the scaling is what we often care about the most. Different computers might be faster or slower, but just knowing this scaling and a runtime for a problem of a certain size gives you a good feel for what the runtime will be if you change the size of the problem.2

The number of comparisons in the selection sort and the bubble sort depends only on the size of the data set being sorted. It does not depend on the nature of the data. This is not the case for the flagged bubble sort or the insertion sort. These sorts can do very different numbers of comparisons depending on the data that is given to them. For example, if you give a flagged bubble sort an array that is already sorted, it will only run through it once, doing n − 1 comparisons. We would call that O(n). Granted, that is a best-case scenario. The best-case is rarely of interest. Instead, we typically worry about average-case and worst-case situations.3

So what are the average and worst-case behaviors of insertion sort? The outer loop clearly happens n − 1 times. So we can do a summation much like what we did above. However, the inner loop happens a variable number of times. In the worst-case, which would be getting data in reverse order, the inner loop will run i times and we get exactly the same behavior as with the bubble and selection sorts. In the best-case, which is data in the proper order, the inner loop does only one check. This gives linear performance, O(n), because

C= i=1n11 = n  1.

On average we expect something halfway between these extremes, we expect the inner loop to do i/2 comparisons. This gives

C = i=1n1i2 = n(n1)4 = 14 n2  14 n.

While this is better than the behavior of the other sorts by a factor of two, it is still O(n2). This is because doubling the size of the input still makes it take four times longer and that is what we generally care about the most.

So how good or bad is O(n2)? Functions like this that grow as a polynomial function are referred to in a very general way as being tractable, because you can use them for fairly large values of n and still expect the calculation to finish in a reasonable time. We will see later on that not all algorithms have this behavior. However, O(n2) is not great if the value of n gets extremely large. We have no problem using these sorts for Arrays with 100 or 1000 values. However, as you continue to add zeros, these methods will prove to be too slow. Each factor of 10 in the size causes the program to do 100 times as many comparisons and generally this leads to it taking 100 times longer to complete. As such, these methods become unusable if n gets big enough. Fortunately, there are alternatives.

13.1.7 Shell Sort (Diminishing Gap Sort)

The first alternative sort we will look at is the Shell sort, also called the diminishing gap sort. This sort was first proposed in 1959 by Donald Shell and it was one of the first general sort algorithms developed that is faster than O(n2), though some minor changes had to be made in order to get that performance for the worst case. The basic idea of this sort is one that might seem a bit counterintuitive. It performs an insertion sort repeatedly on different subsets of the full array. To start with, the subsets are taken to be groups of elements that are widely spaced. The spacing between the elements in each subset is called the "gap". As the alternate name implies, the gap is then decreased in size for each run through the sort.

The counterintuitive aspect of this sort is that performing insertion sorts multiple times should sort the array faster than doing doing it outright. This can work because a sort with a smaller gap size maintains the ordering at the larger gap size so work is not undone, and the insertion sort is very efficient on partially ordered data. The sorts with large gaps very little work compared to the smaller gaps because they do not contain many elements, but once you get to the smaller gaps the data is mostly sorted so you get close to the best-case performance of the insertion sort. The sort always ends with a gap size of 1, which is doing a normal insertion sort, so you know that the result is fully sorted.

The only trick to the Shell sort is figuring out what to do about the gap. The initial suggestion was to start at half the size of the Array and decrease it by factors of 2. This often works well, but spacings that are factors of 2 apart will keep even and odd subsets separate until you get down to a gap of 1 and in certain cases that can actually lead to O(n2) behavior. For this reason, we use a slightly different factor.

def shellSort(a:Array[Double]) {
  var gap=a.length/2
  while(gap>=1) {
  for(j <- gap until a.length) {
   var i=j-gap
   val tmp=a(j)
   while(i>=0 && a(i)>tmp) {
  a(i+gap) = a(i)
  i -= gap
  renderSort(a)
   }
   a(i+gap) = tmp
   renderSort(a)
  }
  gap=(gap/2.2).round.toInt
 }
}

This code was created by defining the var gap and writing the while loop, then cutting and pasting the insertion sort inside of the while loop. Once there, all the places where i was incremented or decremented by 1 were changed so the 1 was gap. The factor of 2.2 gives better spacings though you do need to round and convert back to an Int. Two calls to renderSort were left in this code so that you can easily visualize it. Of the sorts we have discussed so far, this one is probably the most important to visualize to get a grasp on what it is doing and why the process of reducing the gap is significant.

The exact order of the Shell sort is harder to pin down. This is largely because it varies with the selection of gap scaling and can range from O(n2) at the poor end all the way to O(n log2 n) at the good end. Reasonable implementations like the one shown here will tend to give O(n3/2) performance. This might not seem like a tremendous improvement, but if n gets large, the difference can be quite dramatic.

13.2 Searching

While sorting data is something that we do a fair bit with computers, searching for data is a far more common task. A lot of the things that you do with computers on a regular basis involve searching for information either just to retrieve it or to modify information related to what is found. This makes sense, as running through data is something that computers do particularly quickly and efficiently.

13.2.1 Sequential Search (Linear Search)

The most basic form of search is the sequential or linear search. This involves running through data one element at a time and checking each one to see if it matches what you are looking for. If it finds what you are looking for, it either returns the data or an index where it can be found. If it gets to the end without finding it, it will return something to indicate that the data was not there. The following code is a linear search through an Array of Ints.

def linearSearch(a:Array[Int],value:Int):Int = {
  var i=0
  while(i < a.length && a(i)!=value) {
  i+=1
  }
  if(i > a.length) -1 else i
}

This search returns the index of the first element in the Array whose value is the same as what was passed in. It does this with a while loop so that it can stop early if it is found.

The if statement is needed because the common idiom when returning an index is to return -1 when the value is not found. We can get rid of this conditional at the end if we count backwards.

def linearSearchForLast(a:Array[Int],value:Int):Int = {
  var i=a.length-1
  while(i >= 0 && a(i)!=value) {
  i-=1
  }
  i
}

This modified version of the code starts at the end and goes backwards. It is a bit simpler, but it fundamentally alters the description of the function as we now find the last occurrence instead of the first.

There are quite a few different methods on the Scala collections that do searching. For the collections that we have learned about, these are all performing linear searches. They include the following, roughly as they are defined in Seq[A]:4

  • def find(p: (A) => Boolean): Option[A]
  • def indexOf(elem: A, from: Int): Int
  • def indexOf(elem: A): Int
  • def indexOfSlice(that: Seq[A], from: Int): Int
  • def indexOfSlice(that: Seq[A]): Int
  • def indexWhere(p: (A) => Boolean, from: Int): Int
  • def indexWhere(p: (A) => Boolean): Int
  • def lastIndexOf(elem: A, end: Int): Int
  • def lastIndexOf(elem: A): Int
  • def lastIndexWhere(p: (A) => Boolean, end: Int): Int
  • def lastIndexWhere(p: (A) => Boolean): Int

Many of these methods come in pairs where one of the two takes an extra Int argument for an index in the collection that it should start working from. Only the first method returns an element from the collection, all the others return indices. The one that does return an element wraps it in an Option so that if no match is found it can return None. If no match is found for the methods that return an index, they will return -1.

13.2.2 Binary Search

Even though computers are very fast, linear search is far from ideal, mainly because searching is something that is done so very frequently. If the data is not ordered in any way, linear search is your only option. To understand this, imagine you are handed a normal phone book and asked to find the person who has a given phone number. Due to the fact that phone books are not ordered by phone numbers, your only recourse is to go through each and every line and check the numbers against what you are looking for. In any reasonably sized city this is something that no human would actually undertake.

If this were the only way to look through a telephone book people would not bother to keep them.5 However, people do keep telephone books because they rarely look things up by number. Instead, people normally look things up by name and the telephone book is sorted by name. This ordering of the elements can lead to much more efficient searches. You might not be able to write a good algorithm for how you really look things up in a telephone book, but we can consider your first step and use that as a direction for writing an efficient algorithm.

Given a large phone book and a name, you will open it up and look at what is on the page you open to. Odds are good that you will not get exactly the right page. However, comparing what you are looking for to what is on the page gives you a significant piece of information. If what you are looking for comes earlier in the alphabet than what is on the page you will only look at other pages before that one. You basically throw everything after that page out of your search without even looking at it. Similarly, if what you are looking for comes after the page you have opened to you will only consider pages after the current one. You will generally repeat this process in a manner that is not easy to describe in an algorithm just because your method of picking pages might be impacted by things like the binding of the book and whether one page sticks out a bit further than another.

The idea of looking at a location and only considering things before or after it based on a sorted order can be used to create fast searching algorithms. The most general of which is the binary search. In a binary search, you keep track of a range of elements that you are considering by two integer indexes. We will call them start and end. At any given time, you know that the value you are looking for, if it is present, will be at an index in the range i ∈ [start,end).6 So if we are searching the whole Array, then initially start is 0 and end is the length of the Array. We consider the midpoint of the range, mid=(start+end)/2, and check if the element at mid. If it is, we return mid. Otherwise we check if what we are looking for is greater or less than the element at mid and cut down our range accordingly.

To begin with, we will present an imperative version of this algorithm that uses a while loop and several var declarations.

def binarySearch(a:Array[Int],value:Int):Int = {
  var start=0
  var end=a.length
  var mid=(end+start)/2
  while(end>start && a(mid)!=value) {
  if(value < a(mid)) {
   end=mid
  } else {
   start=mid+1
  }
  mid=(end+start)/2
  }
  if(end<=start) -1 else mid
}

The while loop continues as long as the range includes at least one element and the midpoint is not the value we want. Inside the loop, a check is performed to see if the midpoint is less than or greater than what we are looking for. If it is less, we set end=mid. This works because end is exclusive and we have just verified that the element is not at mid. Otherwise we set start=mid+1. The start is inclusive so we have to move it one element beyond the mid. When the loop is completed we return either the value of mid or -1 based on whether the element was found or not.

This version of the code is fairly straightforward, but there is a simpler approach. Binary search happens to be an algorithm that lends itself very nicely to implementation as a recursive algorithm. The following code shows what this might look like.

def binarySearchRecur(a:Array[Int],value:Int,start:Int,end:Int):Int = {
  if(end <= start) -1 else {
  val mid=(start+end)/2
  if(a(mid) == value) mid
  else if(a(mid) < value) binarySearchRecur(a,value,start,mid)
  else binarySearchRecur(a,value,mid+1,end)
  }
}

Clearly this code is shorter than what we had before. Most people would also find this code a bit easier to read than the imperative version. The only drawback is that the function has two extra arguments. The normal way to get around that is to provide a wrapper function that only has two arguments and have it call this version. An appropriate wrapper could be written this way.

def binarySearch(a:Array[Int],value:Int):Int =
  binarySearchRecur(a,value,0,a.length)

The logic of the recursive version is identical to the iterative version. Only the approach has changed.

Now that we have code to do a binary search, it is interesting to ask what order this function is. Again we say that the array has n elements in it. The worst case is the situation when the element is not found and we get down to one element in the range that is not what we are looking for. So we need to figure out how many comparisons happen to narrow the range down from n to 1. After one comparison the range is cut to n/2. After two comparisons it is n/4. In general, after t comparisons, there are roughly n/2t elements left in the range.7

We now have enough information to find the maximum number of comparisons.

n/2t=1n=2tt=log2n

This is typically called O(log n) as the difference between logs of different bases is simply a constant multiplier. This order is generally considered to be quite fast, as it goes slowly as the input gets bigger.

To get a feel for this, let us look at a few examples of how log2 n scales with n. A list of approximate values for this are given in table 13.1. To really put this in perspective, consider the fact that the first number is the worst case for a sequential search and the second number is the worst case for a binary search. When n is small, the difference is not all that significant. However, as n gets large the cost savings of doing a binary search become quite apparent. The last two values of n are large enough that they pose a problem even for computers, despite their speed.

Table 13.1

Table of approximate values of log2 n as a function of n. We use the approximation that 210 ≈ 103. The reality is that 210 = 1024, but this approximation is rather close and is a good one to keep in your head for quick approximations

n

~log2 n

1,000

10

1,000,000

20

1,000,000,000

30

1,000,000,000,000

40

Of course, you can only use a binary search on sorted data and attempting an O(n2) sort on even a million items can be time consuming. So the real power of this O(log n) scaling is purely academic until we discuss some better ways to sort.

Linear Binary Search (Advanced)

The binary search is ideal for general sorted data. However, if you happen to know that your data is fairly evenly distributed you can do even better. Instead of having mid be the midpoint of the range, you can place it where you would expect the value to be based on data being linearly distributed between the values at start and end-1. Assuming the Array contains Doubles, this could be done with a line like the following.

val mid = start+(((value-a(start)/(a(end-1)-a(start))*(end-1-start)).toInt

This picks the value of mid based on a linear approximation. Some care would have to be taken to insure that this does not fall into an infinite loop. In addition, if the data is not uniformly distributed, this approach can wind up begin much slower than the normal binary search. In fact, it can degrade to O(n).

Searching for Doubles (Advanced)

Careful readers might have noticed that between the section on sorts and the section on searches, a small change was made. The sorting section used an Array[Double] while the searching section used Array[Int]. The choice of the Double type for sorting was motivated by the fact that they are easy to generate and visualize. However, they are not ideal for searching. This is due to the nature of the Double type. In our discussions we have only presented the Double type as the type we use for numbers when a fractional part is required. We have glossed over the details of what is happening with the Double type. To understand the reason that Doubles are not good for search algorithms requires us to dig a bit deeper into what they are.

The term Double stands for double precision floating point number. The Float type, which we have generally ignored, is a single precision floating point number. The Float type can also represent fractional numbers, but it has a smaller range and lower precision. As with all numbers on a computer, oating point numbers, be they single or double precision, are really stored in binary. The best way to think about a oating point number is to think of numbers in normalized scientific notation. In decimal, you can think of a number in scientific notation as being in the following form,

(1)s*m*10e

where s ∈ [0, 1], 1 ≤ m < 10 or m = 0. We call s the sign, m the mantissa, and e the exponent. Not much changes when the number goes to binary except that s, m, and e are stored in bits instead of decimal digits, and we want powers of 2 instead of 10. So scientific notation in binary would be

(1)s*m*2e.

This still does not explain why it is hard to do searching with a floating point number. The key to that comes from the fact that we only have a finite number of bits to use to store m. To understand the implication of this, consider a situation where you only have 7 decimal digits to write m. Now try to write the decimal form of the fraction 1/3. You would write (−1)0 ∗ 3.333333 ∗ 10−1. This is not exactly the same as 1/3, but it is as close as you can get with only 7 digits. The reality is that to write 1/3 in decimal you need an infinite number of digits. Lots of fractions require an infinite repeating representation in decimal.

This same thing happens in binary on the computer. Not having infinite binary digits can lead to some interesting results for factional numbers that can not be perfectly represented. To understand this, consider the following simple example from the Scala REPL.

scala> 0.1==1.0-0.9
res0: Boolean = false

Mathematically you expect the expression 0.1==1.0-0.9 to be true, but the decimal value 0.1 is an infinite repeating sequence in binary. As such, it is truncated at some point and we get an approximation. Similarly, 0.9 can not be represented perfectly either. The result is that subtracting 0.9 from 1 gives a value that is not exactly the same as the approximation to 0.1. To see how different the two are we can subtract one from the other.

scala> 0.1-(1.0-0.9)
res1: Double = 2.7755575615628914E-17

This is an extremely small number, but it is not zero and because it is not zero, the two are not equal.

This is a well-known challenge with floating point numbers and people who do numeric work have learned not to do checks for equality on them as the results are generally unpredictable because of the rounding that is part of even the simplest arithmetic. For our discussion, what we have seen is evidence that using == in a search algorithm on the Double type is likely to produce unexpected results. So the next question is, how do we get around that.

The basic idea behind the solution is that we generally consider floating point numbers to be equivalent as long as they are close enough to one another. So how close is close enough? That depends on whether you are working with single or double precision numbers. In math the Greek symbol ϵ, epsilon, is typically used to represent a vanishingly small value. In computer numerics it is used to describe the smallest value that you can add to one and still get a number greater than one. If you go smaller than ϵ, the value will be rounded off in the sum and all you will get back is one. Here is code that declares and calculates this for both the Double and Float types.

scala> val doubleEpsilon = {
  | var eps=1.0
  | while(1.0+eps > 1.0) eps*=0.5
  | eps*2.0
  |}


    
doubleEpsilon: Double = 2.220446049250313E-16


    
scala> val floatEpsilon = {
  | var eps=1.0f
  | while(1.0f+eps > 1.0f) eps*=0.5f
  | eps*2.0
  |}
floatEpsilon: Double = 1.1920928955078125E-7

Any single operation can be expected to have errors on the order of ϵ. When you string many operations together the error grows. As such, it is standard practice to consider two values equal if the relative error is less than the square root of ϵ. This means you only trust half the bits of precision in the number. As such, the following values are what you really find important.

scala> val sqrtDoubleEpsilon=math.sqrt(doubleEpsilon)
sqrtDoubleEpsilon: Double = 1.4901161193847656E-8


    
scala> val sqrtFloatEpsilon=math.sqrt(floatEpsilon)
sqrtFloatEpsilon: Double = 3.4526698300124393E-4

For simple order of magnitude purposes, you might remember that for Double this is about 10−8 and for a Float it is 10−4. We will use this approximate value in the code that follows.

So how could we modify our searches to work with the Double type? We simply need to replace the check for equality with a check of the relative difference between the two. For the linear search that would look like the following.

def linearSearch(a:Array[Double],value:Double):Int = {
 var i=0
 while(i < a.length && (a(i)-value).abs > 1e-8*value.abs) {
  i+=1
 }
 if(i > a.length) -1 else i
}

The second half of the condition in the while loop is performing the critical check. It takes the absolute value of the difference between the value in the Array and the value we are looking for. It compares to see if that is bigger than our approximate value for the square root of ϵ times the value we are looking for. You can not simply compare to 1e-8 because if both value and a(i) have a magnitude much smaller than unity, such a comparison could be erroneous. For example, imagine if the values were positions of atoms in an object measured in meters. A separation of 10−8 meters apart would likely be quite significant.

13.3 Performance and Timing

This chapter has introduced a number of different sorting and searching algorithms. We have discussed the performance of these algorithms in terms of order. This tells us how they scale as the number of inputs is changed and quite often this is all you really care about. You might not know or care about details of the hardware or the data sets you will be running on, or you know that such details will change over time. There are times though when you really do care exactly how fast an algorithm is on a particular machine and you need to compare it to other similar algorithms on that same machine. When you do this, you need to do timing tests on the algorithm.

There are a number of different ways to get timing information on a program. Linux has a command called time that will let you know how much time a program consumes when it is running. For some applications this is ideal. However, for our purposes we only want to measure how long a particular part of the program takes. We do not want to measure the time taken to start things up or to initialize the Array. We only want to measure the time spent sorting the Array. You can get very detailed information like this from a profiler,8 but that is overkill for what we want to do. For our purposes, we just need the ability to determine a start time and a stop time and take the difference between the two. We can do this by calling System.nanoTime(). This is a call to the Java libraries that returns a Long measuring the current time in nanoseconds.9

Another thing that we want to do to make the tests even is sort the same numbers for each sort. To make this happen, we really need to sort a copy of the Array and keep the original so that we can make other copies of it. We need to do this for each of the sorts so it is nice to put the code into a function that we can easily call with different sorts. To make it work with different sorts, we need to pass in the sort function as an argument. The following function does this for us.

def timeFunc(sortFunc:(Array[Double])=>Unit,a:Array[Double]) {
  val copy=Array.tabulate(a.length)(i => a(i))
  val start=System.nanoTime()
  sortFunc(copy)
  val end=System.nanoTime()
  println("Time:"+(end-start)/1e9)
  assert(isSorted(copy))
}

The print statement divides the time difference by 1e9. This gives us back a value in seconds instead of nanoseconds, which is much easier for us to read and deal with. We can invoke this by putting the following code at the end of our script.

args(1) match {
  case "bubble" => timeFunc(bubbleSort,nums)
  case "flagged" => timeFunc(flaggedBubbleSort,nums)
  case "min" => timeFunc(minSort,nums)
  case "insert" => timeFunc(insertionSort,nums)
  case "shell" => timeFunc(shellSort,nums)
}

Before you use this, you will need to make sure you take out the calls that do the rendering as the renders intentionally slow things down so you can see what they are doing.

If you play around with this some you will notice a few things. First, you have to get up to at least 10000 numbers in the Array before the timing means much. If the Array is too small, the sort will be so fast that the resolution of the machine’s clock will become a problem. Second, the amount of time spent doing the sort can vary on different invocations. For this reason, any true attempt to measure performance will run the code multiple times and take an average of the different values.

13.4 Classifying Bugs

By this point you have certainly learned that simply writing your code does not mean that it is correct and that it will work. Errors in programs are commonly referred to as bugs. The term bug is historical in origin because one of the first ones was a moth. The earliest computers were huge. They took up rooms the size of gymnasiums. An early malfunction in one turned out to be a moth that had flown in and was causing a short circuit.

While the term "bug" has stuck around, the implications of this term no longer fit. The term and its history imply that it is something beyond the control of the programmer that just put itself into the code and now the programmer has to search for it. In reality, virtually all modern bugs are really mistakes on the part of the programmer. They are things that the programmer put into the code and now the programmer needs to correct.

Not all bugs are the same. There are three fundamentally different types of errors.

  1. Syntax error – Error in the structure of the code that is found by the compiler.
  2. Runtime error – Error that causes the program to crash while running.
  3. Logic error – Error that does not crash the code, but causes it to produce the wrong answer.

Each of these deserves a fair bit of discussion. We will also talk about how they compare to one another and the way in which they impact programmers.

When you are first learning to program, the errors that you likely run into the most are syntax errors. These can be as simple as typos or misspellings. They can also be more complex like type mismatch errors or calling functions or methods with the wrong number of arguments. The common element that makes something a syntax error is that it is discovered by the compiler when it is trying to translate the human written code into a format that is more computer friendly. Different programming languages do different amounts of checking for errors at compile time. This is often called static checking because it can be done without actually running the program.

Scala does a significant amount of static checking for errors.10 The way we have been running our programs in the REPL or with scripts, the compile stage is not clearly separated from the running of the program. The Scala system is running a compiler in the background and then executing the results of the compile. The syntax errors display a message like the following:

timing.scala:5: error: not found: value This This
line will not compile
^
one error found

They tell you the name of the file along with a line number. Then they describe the error and show the line and where on the line the error occurred.

The second type of bug is a runtime error. This type of error occurs when everything is syntactically correct and the program compiles and runs. However, during the run this error causes the program to crash. In Scala a runtime error will produce a message that looks similar to the following:

java.lang.ArrayIndexOutOfBoundsException: -1
  at Main$$anon$1.<init>(timing.scala:5)
  at Main$.main(timing.scala:1)
  at Main$.main(timing.scala)
 ...

This message tells you what went wrong and then prints a stack trace that includes line numbers. In this simple example, the code failed on line 5.

There are many different reasons why runtime errors happen, and it might be dependent on user input. So a runtime error might not be found from running the program once or twice. To reduce the number of runtime errors in a program you have to run it with multiple different input test inputs as was discussed in section 8.5. In a general sense, it is impossible to prove that code has no runtime errors.

The last type of error is a logic error. An error is a logic error if the code compiles and runs to normal termination, but provides an incorrect output or behaves incorrectly in some other manner. These errors can come in all types of forms and there is no message that tells you there is a problem. You know there is a problem by checking the output or behavior of the program to see if it matches expectations. Like runtime errors, logic errors might not occur for all inputs. There might be only specific inputs that trigger the error.

If you have a choice, you want errors of types higher on the list. As a novice programmer you probably get tired of dealing with syntax errors and find them frustrating. However, the reality is that syntax errors are by far the best type of error. This is because syntax errors give you the most information on how to fix them and are detected by the compiler in a way that does not depend on inputs. Your second choice would be a runtime error for some of the same reasons, it provides you with some information related to what is wrong and, as a result, helps you fix the problem. By contrast, logic errors provide no information on how to fix them.

Different languages do more or less to help you with error detection. One of the significant advantages of Scala is that it is designed to maximize the number of errors that are syntax errors and reduce the number of runtime and logic errors. Part of this is the type checking system of Scala. The language does significant static type checking to make sure that all the values you are using are of a type that is appropriate for the usage.

Many of the higher-order functions and methods in Scala also help to prevent common errors that programmers face in other languages. This is also true of rules such as the requirement to initialize variables at declaration and the general preference of val declarations over var declarations. To understand this, simply consider the following line of code.

val dbl=nums.filter(_>0.5).map(_*2)

This line of code will give us a new collection of numbers that are twice the magnitude of the elements but only for the elements that were originally bigger than 0.5. There are not too many ways to mess up this line without having it be a syntax error. It is possible to get a logic error if you mistype the greater than or the multiplication, but that is something no programming language can really fix. The programmer has to correctly communicate the key logic, but this line does not have all that much other than that key logic. Despite this brevity, it is fairly easy to read.

To understand the real value of this line of code, you have to consider the alternative, which is what you would have to write in most other programming languages. We will write equivalent code that is specific for an Array. The Scala code works equally well for any sequence, but we will ignore that advantage for now. It will become more significant later on.

var cnt=0
for(i <- 0 until nums.length) {
  if(nums(i)>0.5) cnt += 1
}
val dbl=new Array[Double](cnt)
cnt=0
for(i <- 0 until nums.length) {
  if(nums(i)>0.5) {
  dbl(cnt)=nums(i){*}2
  cnt += 1
  }
}

The first loop counts how many elements are greater than 0.5. This is required because Arrays have to be given a size when they are created. Once we know how many there will be, we can make the Array. The second loop fills in the Array with values that are twice those in the original Array.

Clearly this second version is longer. More importantly, that are a lot more places where typos become runtime or logic errors. The reality is that the one line version is doing basically this same thing. However, most of the code is in the libraries and is not rewritten by each programmer every time. This works, in large part, because of the first-class functions and the ease with which function literals can be written in Scala.

13.5 Memory Layout

The memory of the computer is basically like a huge array of bytes. It is shared between the operating system and many different programs. Different parts of memory can be allocated for different things or associated with different devices. Scala hides most of the intricacies of memory from you. It is not a language designed for doing low-level system programming. At some point in your Computer Science training you should learn about the details of computer memory. For our purposes here, we will only care about the organization of memory inside of the allocation of a single program.

The memory for a program is broken into two broad pieces, the stack and the heap. These terms were chosen intentionally and the images they invoke in your mind are probably fairly accurate. A stack is orderly with one item placed on top of the previous one. A heap is much less organized with items placed almost at random. Local variables and function arguments are allocated on the stack. As was discussed in section 7.8, the memory model in Scala is such that the variables are references and they refer to objects. In this memory model, the objects are allocated on the heap. Every time a new function is called, the memory for the arguments and the local variables, along with some memory for bookkeeping, is allocated in a block that is referred to as the stack frame. If that function calls another, then another frame is allocated on top of it. When a function returns, the stack frame for that function is freed up. That same memory will be used later for another function.

This should help explain the output from a runtime error. The stack implicitly keeps track of where you are in each function when it calls the next. You can picture each of those functions stacked on top of the one that called it. That is what gets printed in the stack trace. Each line tells you what function has been called followed by the file name and line number.

The objects on the heap are allocated in free spaces. The memory for objects is freed up automatically when the object is no longer in use. The automatic freeing of heap memory is accomplished by a process called garbage collection. An object can be collected if it can no longer be reached by following references that start on the stack. Not all languages include garbage collectors. In those that do not, the programmer is responsible for freeing memory that was allocated on the heap.

13.6 Sorting/Searching with case classes

In the sorts and searches that we have looked at, we were working with numeric types. More generally, the code that was written will work with any type that works with the comparison operators. As such, our code would have worked with Array[Char] or Array[String] if you simply altered the type that was passed into the sort. The case classes we have written do not meet this requirement. As such, we need to make other alterations to the code beyond the type if we want to sort a case class. Fortunately, these alterations are not all that significant.

We will work with the following case class.

case class Weather(id:String,year:Int,month:Int,precip:Double,tmax:Double,
  tmean:Double,tmin:Double)

This case class was created to store historical weather data. It was used to represent records for monthly data on temperature and precipitation. You would load an entire file of these records into an Array. If you want to see the hottest ten months you could sort the array by the high temperatures. Code for such a sort is shown here.

def bubbleSortWeatherHighTemps(a:Array[Weather]) {
  for(j <- 0 until a.length-1) {
  for(i <- 0 until a.length-1-j) {
  if(a(i).tmax > a(i+1).tmax) {
   val tmp=a(i)
   a(i)=a(i+1)
   a(i+1)=tmp
  }
  }
  }
}

A bubble sort was picked because of the simplicity of the sort. It is very easy to see what was changed. The only changes are the type in the parameter for the Array that is passed in and the fact that the comparison is done between fields of the Array elements. After applying this sort to the Array you can use take or takeRight to pull off the elements at one end or the other.

case classes also present another alternative that we did not really get with single values, the ability to have the comparison based on more than one field. For the weather data, it is likely to be stored in the data file in chronological order. Proper chronological order is a combination of both the year and month fields. If you wanted to have the ability to search for entries by time you might want to have a binary search that can look for a particular year and month. Code for doing that is listed here.

def binarySearchWeather(a:Array[Weather],year:Int,month:Int):Int = {
  var start=0
  var end=a.length
  var mid=(end+start)/2
  while(end>start && (a(mid).year!=year || a(mid).month!=month)) {
  if(year < a(mid).year || (year==a(mid).year && month < a(mid).month)) {
   end=mid
  } else {
  start=mid+1
  }
  mid=(end+start)/2
  }
  if(end <= start) -1 else mid
}

Note that the comparison has become significantly more complex. It has to compare the year first and then if there is a tie, break that tie with the month.

Unfortunately, when written in this way, we have to write a completely separate sort or search for each ordering we might want on the case class. For example, if you wanted wettest months instead of hottest months, you would need a separate sort. You might feel that copying a whole sort or search function only to make such small changes is not very efficient and that there should be a way to make a sort or search that works more generally. We can help improve these functions for case classes here. Later, in chapter 19, we will gain the ability to abstract these ideas so that the function works with multiple types, not just different sort orders on a single type.

Back in section 6.4 we saw how we could make some recursive functions more powerful by passing functions into them. This approach is exactly what is taken by the higher-order functions in the Scala collections libraries. As such, we have not had to go to it ourselves much since then. However, the desire to not write a completely new sort or search function for each and every possible ordering on a case class provides motivation to pull out these ideas again.

If you were to write a new version of bubble sort that sorts the Weather objects by precipitation, you would find that the only thing you change is the code related to the comparison of the elements. In order to get a sort that can sort by high temperature, precipitation, or anything else related to the Weather type, all we have to do is make it so that we can vary the comparison from the outside. This can be accomplished by passing in a function that does the comparison. For our sorts, we have been using less than and greater than for comparison so we just need to pass in a function that represents one of these. We pick less than here, though the code could easily be rewritten with greater than. The code for the sort after this change looks like the following.

def bubbleSortWeather(a:Array[Weather],lessThan:(Weather,Weather)=>Boolean) {
  for(j <- 0 until a.length-1) {
  for(i <- 0 until a.length-1-j) {
  if(lessThan(a(i+1),a(i))) {
   val tmp=a(i)
   a(i)=a(i+1)
   a(i+1)=tmp
  }
  }
  }
}

The comparison operator is represented as a function that takes two Weather objects and returns a Boolean. A call to this function is used in the if statement.

Using this modified version, we could sort by the high temperatures with a call like this.

bubbleSortWeather(weather,(w1,w2)=>{w1.tmax < w2.tmax})

Alternately, the same method could be used to sort by precipitation with a call like this.

bubbleSortWeather(weather,(w1,w2)=>{w1.precip < w2.precip})

What is more, using this version of the sort, it is easy to change it so that it sorts from greatest to least instead of the standard least to greatest. In the case of precipitation this is done by changing the call in the following way.

bubbleSortWeather(weather,(w1,w2)=>{w1.precip > w2.precip})

All that is changed in the direction of the comparison operator. As you can see, this abstracted version that uses a function instead of a hard-coded comparison operator is far more flexible.

Having seen the benefits we can get from using this in our sort, it would be nice to enable searching in the same way. If you start to write the search code with the same comparison operator you will find that there is a significant problem, the search requires more than just a less than or greater than. The search requires that we be able to tell if two things are equal. The standard way to deal with this is to have a function that returns an Int instead of a Boolean. The Int is negative, zero, or positive to represent less than, equal to, or greater than, respectively. The other change that makes search a bit different is that the function only takes one Weather object because we just want to know where the thing we are looking for is relative to this value we are searching for.

Translating all of this into code gives the following.

def binarySearchWeather(a:Array[Weather],comp:(Weather)=>Int):Int = {
  var start=0
  var end=a.length
  var mid=(end+start)/2
  var c=0
  while(end>start && {c=comp(a(mid)); c!=0}) {
  if(c<0) {
  end=mid
  } else {
  start=mid+1
  }
  mid=(end+start)/2
  }
  if(end <= start) -1 else mid
}

The function is called comp and it takes a Weather and returns an Int. In order to prevent having the code call comp more times than it needs to, we introduce a variable named c that stores the result of the most recent comparison. We assign this in a code block in the condition. The loop checks to make sure it is not zero. The if checks if it is negative. After we calculate a new mid we also make a new comparison.

Now the question is, how could we use this search to find different elements in the Array. The first example we will give is one that duplicates the search we had above searching for a specific year and month.

binarySearchWeather(data,w => {
  if(w.year>1896) -1
  else if(w.year<1896) 1
  else if(w.month>2) -1
  else if(w.month<2) 1
  else 0
})

In this case, we are searching for February of 1896. This version uses a sequence of if expressions to determine if the element w comes before or after that time. With a little math we can make a shorter version taking advantage of the fact that there are 12 months in each year.

binarySearchWeather(data,w => (1896*12+2)-(w.year*12+w.month))

Searching for data based on date is made more complex because it depends on two values. If we use our flexible sort to reorder the Array by precipitation we could use the following to search for a month in which there were 1.43 inches of rain.

binarySearchWeather(data,w => {
  val diff = 1.43-w.precip
  if(diff > 1e-8) 1
  else if(diff < -1e-8) -1
  else 0
})

The logic in this can be challenging to write. Thankfully, the behavior of returning negative, zero, or positive is a common standard and the Scala library contains code that does this with the built-in numeric types. As such, the code can be expressed more simply in this way.

binarySearchWeather(data, 1.43 compare _.precip)

The compare call is actually a method we can call on a Double. The fact that a period has other meaning for numbers led to us using operator notation for the method instead of the more standard notation with dot and parentheses.

Bucket/Radix Sort (Advanced)

The sorts discussed earlier in this chapter require nothing more than the ability to compare elements. There are situations where you have more information about the values and you can use that to sort them more efficiently. Examples of such non-general sorts include bucket sort and radix sort. Both use knowledge about the data being sorted to break them into groups.

To understand how this works, consider the situation where you are given a large number of folders to sort by name and you are in a big room. Many people would start off by going through the stack and breaking it into different piles. The piles would be for different parts of the alphabet. If the original stack were really large and you had a really big room, you might make one stack for each starting letter. You could continue to break things down in this way, again for each pile or switch to some other approach when the pile is smaller. This general approach is called a bucket sort and the piles would be called buckets. The bucket sort is really a general approach to breaking sorting problems into smaller, more manageable pieces.

The radix sort uses a similar approach, but it is a specific algorithm for sorting integer values. It organizes values, one digit at a time. In a counterintuitive way, it starts from the least significant digit. This only works because each pass through preserves the order of the elements. So if A has a lower digit than B in one pass, and the next pass puts them in the same bin, A will come before B in that bin.

By convention we will implement our radix sort using decimal digits using division and modulo with powers of ten. The following code is a reasonably efficient implementation that sorts an Array[Int] using an Array[List[Int]] for the bins. Note that moving items to the bins is not an in-place operation. The values are copied back to the Array after binning, but it does require at least twice the memory of an in-place sort.

def radixSort(a:Array[Int]) {
  var max = a.max max a.min.abs
  var powerOf10 = 1
  while(max>0) {
  val byDigit = Array.fill(19)(List[Int]())
  for(num <- a) {
  val digit = num/powerOf10%10+9
  byDigit(digit) ::= num
  }
  var i = 0
  for(bin <- byDigit; num <- bin.reverse) {
  a(i) = num
  i += 1
  }
  powerOf10 *= 10
  max /= 10
  }
}

The max variable starts with the largest magnitude value and is divided by ten each time so that we know when to stop. The powerOf10 variable keeps track of what digit we are currently binning. Each time through the loop, 19 empty bins are set up. You might wonder why there are 19 bins instead of just 10. The answer is that division and modulo preserve sign. If you only use 10 bins, this sort can only work on positive values. By going up to 19, it is able to handle negative numbers as well. To make that work, the bin value we get from division and modulo is incremented by 9 so that the values slide up to the domain of the Array index values.

The binning runs through all the numbers in the array and conses them onto the List for the proper digit. Consing to a List adds to the front. For that reason, the for loop that moves the valus back to the Array has to run through each bin in reverse.

This sorting algorithm is O(kn) where k = [log10(max)]. In the case of an Int, the value of k can not be larger than 10. So the performance scales linearly for really large Arrays.

13.7 Putting it Together

To illustrate concepts in this chapter and link them together with earlier chapters we will return to the theme park example. Every month the theme park picks a top employee. This is based on performance relative to the average for that month. For every day of the month you have data that tells you what operators were working each ride, and how many people went on the ride. From this, we can calculate average ridership for each ride during the month as well as how many riders rode each ride for each day that a given operator was working it. Each operator can be given an efficiency for any particular ride as the average number of people who rode on days he/she was working divided by the average for all days. Averaging these efficiencies gives an overall rating to each operator. Those can be sorted and displayed to show relative performance. Code for doing that is shown here.

import scala.io.Source


  
case class DailyData(ride:String, operators:Array[String], numRiders:Int)
case class RideAverage(ride:String, avNum:Double)
case class OperatorDailyData(name:String, ride:String, numRiders:Int)
case class OperatorRideAverages(name:String, rideAvs:Array[RideAverage])
case class OperatorEfficiencyFactor(name:String,factor:Double)


  
def parseDailyData(line:String):DailyData = {
  val parts = line.split("*; *")
  DailyData(parts(0), parts.slice(1, parts.length-1), parts.last.toInt)
}


  
def readData(fileName:String):Array[DailyData] = {
  val source = Source.fromFile(fileName)
  val lines = source.getLines
  val ret = (lines.map(parseDailyData)).toArray
  source.close
  ret
}


  
def insertionSortByEfficiency(a:Array[OperatorEfficiencyFactor]) {
  for(j <- 1 until a.length) {
  var i=j-1
  val tmp=a(j)
  while(i>=0 && a(i).factor>tmp.factor) {
  a(i+1) = a(i)
  i -= 1
  }
  a(i+1) = tmp
  }
}


  
val data = readData(args(0))
val rides = data.map(_.ride).distinct
val averages = for(ride <- rides) yield {
  val days = data.filter(_.ride==ride)
  RideAverage(ride, days.map(_.numRiders).sum.toDouble/days.length)
}
val dataByOperator = for(day <- data; op <- day.operators) yield {
  OperatorDailyData(op, day.ride, day.numRiders)
}
val operators = dataByOperator.map(_.name).distinct
val opRideAverages = for(op <- operators) yield {
  val opDays = dataByOperator.filter(_.name == op)
  val rideAvs = for(ride <- rides; if opDays.exists(_.ride==ride)) yield {
  val opRides = opDays.filter(_.ride == ride)
  RideAverage(ride, opRides.map(_.numRiders).sum.toDouble/opRides.length)
  }
  OperatorRideAverages(op, rideAvs)
}
val operatorFactors = for(OperatorRideAverages(op, rideAvs) <- opRideAverages)
  yield {
  val factors = for(RideAverage(ride,av) <- rideAvs) yield {
  av/averages.filter(_.ride==ride).head.avNum
  }
  OperatorEfficiencyFactor(op,factors.sum/factors.length)
}
insertionSortByEfficiency(operatorFactors)
operatorFactors.foreach(println)

This code assumes that the file has a line of data for each day that starts with the ride name, followed by operator names, with number of riders at the end. Each of these is separated by semicolons. That data is read and used to calculate the various values needed for efficiency. Once the efficiencies have been calculated, all the employees are sorted with an insertion sort and the results are printed.

13.8 End of Chapter Material

13.8.1 Summary of Concepts

  • The act of ordering data according to some value is called sorting. It is a common operation on a computer as it benefits humans who view the data as well as programs when they process it. Several types of sorts were discussed in this chapter.
    • A bubble sort runs through the Array comparing adjacent elements and swapping them if they are out of order. This action is repeated until all the values are in proper order.
    • A selection sort picks specific elements and puts them into place. We demonstrated a minSort, which finds the smallest unsorted element and swaps it to the correct location. Selection sort does few swaps so it is most useful in a situation where moving data is an expensive operation.
    • An insertion sort takes each element and pushes it forward through the Array until it gets it into sorted order with the elements that came before it. Insertion sort is extremely efficient in situations where the data starts off close to the proper order.
    • The Shell sort is also called the diminishing gap sort. It does insertion sorts over partial data in such a way that things are moved toward proper ordering. It winds up being more efficient that any of the other three.
  • When we talk about the performance of different algorithms in Computer Science, we typically use order analysis. This gives a rough idea of how the number of times an operation is performed will scale with the size of the input. The first three sorts above all do O(n2) comparisons.
  • One of the most common tasks done on computers is looking for data, an activity we call searching.
    • If data is unorganized, that only approach is to go through all elements in a sequential/linear search. This type of search is O(n).
    • When the data has a sorted order to it, you can use a binary search to find elements. A binary search is significantly faster that a sequential search because it effectively throws out half the data with each comparison. This provides O(log(n)) performance.
  • To compare the performance of different algorithms when you really care about how quickly a program runs, you can measure how much time it takes. To measure the runtime of a whole program you can use the Linux time command. To measure only specific parts of code call System.nanoTime() before and after the section you want to time and subtract the results. Profilers can also provide more detailed information on what is taking time in a program.
  • Programmers often put errors into their code by accident. These errors are commonly called bugs. The process of fixing the errors is called debugging. Bugs can be categorized into different types.
    • Sytnax errors are errors where the programmer enters something that is not valid for the language. These are found when the program is compiled. The compiler can generally give you helpful information about what is wrong and where the error is in the code.
    • Runtime errors occur when the code compiles fine, but crashes during the run. The crash itself can print helpful information for you. Unfortunately, runtime errors often only occur under certain situations, mkaing them harder to track down and fix.
    • Logic errors are the term we use to describe when the code compiles and runs without error, but the result is inaccurate. These are generally the worst of the three as the computer does not give you pointers to where the error is occurring or what it is caused by. Instead, the programmer has to track it down. This can be done using print statements at the level we are working at or using a debugger.
  • The memory of the computer can be thought of as a really big array. Each program is given a different section of memory that is divided into the heap and the stack. The stack is well organized and is where local variables are allocated. Each function that is called gets a chunk of memory called a stack frame. When the function returns, the frame is released. The heap is where all objects are allocated in Scala. A garbage collector deals with objects that are no longer in use.
  • Sort case classes requires minor alterations to the sort so that the appropriate fields are compared.

13.8.2 Exercises

  1. Write a minMaxSort that finds both the minimum and maximum values in each pass and swaps them to the proper locations.
  2. Do timing tests on Arrays of Ints and Doubles for the sorts presented in this chapter. Note that the radix sort that is presented only works with integer types. You want to have the number of elements in the Array or List grow exponentially so that you can see variation over a large range. Recommended sizes could be 100, 300, 1000, 3000, 10000, 30000, etc. Plot the data as a log-log scatter plot.11
  3. Following onto the timing results, do a comparison of the number of comparisons done for different sorts presented in this chapter. Plot the results in the same way.
  4. While most of the work for these sorts is in comparisons, it is also interesting to look at the number of memory moves. A standard swap is three assignments. Add code to count how many assignments are done in each of the sorts and plot the results.
  5. Section 7.5.2 showed how you can abstract functions over types. This is something that will be dealt with in detail in the second half of the book. Review that section and see if you can write a version of insertion sort and min sort that is general with regards to type.
  6. Do a little web searching to find some other type of sort not described in this chapter and write it.

13.8.3 Projects

  1. This project is intended for people who have been working on graphics and ray tracing, but it does not immediately link so you can do it even if you haven’t been doing the previous ones. For this problem you will draw polygons to a Graphics2D using a "painter’s algorithm." That is where you draw things from the back to the front so that the things in front appear on top of the things behind them. Doing this properly is a challenging problem. You should base your drawing on the point in the polygon that is closest to the viewer.

    To keep things simple, the viewer will be at the origin facing out the z-axis. That way the x and y coordinates are roughly what you expect. To make it so that things that are further away are smaller you divide the actual x and y by the z value to get the location you would draw them at. Given a point (x,y,z) you would want it drawn on an image or panel at ((x/z+1)*size.width/2,(1-y/z)*size.height/2). To represent a polygon for drawing to the Graphics2D you should use a java.awt.geom.Path2D.Double. You can moveTo the first point, then use lineTo to make lines. When you get to the last point you call closePath to close it off.

    Store your polygons in a file. You can decide the exact format. In addition to having the points in each polygon, each one should have a color that it will be drawn in. Remember to sort them by the z value so that the one with the smallest z value is drawn last.

  2. The BASIC programming language was created to be a simple language for novice programmers. The original versions were organized by line number. Each statement of the program was a single line that has a number associated with it. The lines were ordered according to that number and flow control was implemented by allowing the program to jump to other line numbers. For this project you will create a simple GUI that lets you edit a simplified version of BASIC.

    For this simplified version you have a very limited set of possible commands. The GUI will also use a "line editor" style. The allowed commands including the following: GOTO, IF-THEN, INPUT, LET, and PRINT. Each of these must be preceded by a line number that is an integer. For our purposes, variable names are single characters and they will all be numbers. You can use a Double. The format of the commands is as follows.

    • The GOTO command should be followed by an integer number that is the line number the program should execute next. (Example: 100 GOTO 50)
    • The IF-THEN command has the following syntax: IF comp THEN #. The comp is a comparison that can have a variable or a number on either side and either = or < between them. The # is a line number that the execution will jump to if the comparison in the condition is true. If the comparison is false, the execution continues on the next line. (Example: 110 IF a<21 GOTO 50)
    • The INPUT command should be followed by a single variable name and when it is executed the program pauses and waits for the user to input a value that is stored in that variable. (Example: 120 INPUT b)
    • The LET command has the keyword LET followed by a variable name with an equal sign and then either a single number/variable or two number/variable operands that separated by an operator. The operator can be +, -, *, or /. The result of the operation should be stored in the variable before the equal sign. (Example: 130 LET a=b+3)
    • The PRINT command can be followed either by a variable name or a string in double quotes. When executed, this will print either the value of the variable or the string to the output.

    Variables do not have to be declared. They come into existence when first used and they should have a value of 0 to start with if no other value was given to them.

    The GUI for this program should have three main elements. The program itself is displayed in a ListView. This makes it simple for users the select a line to edit without letting them type random text. There should also be a TextField where the user can enter/edit lines. If a line is selected in the ListView, the text from it should appear in the TextField. The user can edit that line or enter anything else. The number at the beginning of the line will be user to put it in place. If a number is used that duplicates an existing line, the new one replaces the old one. Lastly there is a TextArea that shows the output when the program is run.

    When the user hits enter on the TextField your program should check if what was entered is a valid command. If it is, it should put it in the program and clear. If it is not, it should leave the text there and not alter the existing program.

    There should be at least four menu items for this program: "Save", "Open", "Remove", and "Run". The "Save" option saves the program to a text file. The "Open" option allows the user to select a file and open it up as a program. The "Remove" option will remove the currently selected lines in the program. Nothing happens if nothing is selected in the ListView. The "Run" option runs the program. If starts running at the first line and continues until execution goes beyond the last line.

  3. On the book, website, under this chapter, you will find files with historical weather data for some different cities in the United States along with a link to the source of the data. You should read this data into an Array of some case class that you create. The data is separated by commas. Note that the second line tells you what each column of data represents. You will skip the top two lines when reading the information. Write a script that will report the months with the five warmest average temperatures and those with the five lowest average temperatures.

    For a bit of an extra challenge make it so that the user can tell the program whether to report the top and bottom five months for any of the values in the file. You could do this with a lot of typing, but by passing in a function pointer you can cut down on the length of the code greatly.

    If you did any of the problems in the last chapter that included plotting points or data, you should consider sticking that functionality onto this project. That way you can bring up a GUI and show the plots of whatever field(s) the user selects.

  4. If you did a game for one of the projects in chapter 12, for this one you can enhance it by adding a high scores list. That means having some way to score games. It also means saving scores to a file in a format of your choosing. Lastly, the high scores need to be sorted. Each score record needs to have at least a score and an identifying name or initials. All other details are up to the student.

  5. If you have been working with the recipe projects, you now have the ability to order a shopping list according to where things are in the store. This follows most logically from project 11 (p.312), but it does not require the graphical functionality, only a data file listing what a isles different items are in and the ability for a user to make a shopping list.

    Your script should have the ability to sort the grocery list they build to go through the store in either ascending or descending order by row. The sorted list should be displayable in a TextArea so that the user can cut and paste it for printing.

  6. This project fits in with the sequence of schedule building projects. In particular, it makes sense to do if you have already done project 10 (p.282) where schedules were built and displayed in a GUI. That project showed the selected courses in the order the user selected them. However, it would often be helpful to have them displayed based on other criteria. For example, having them sorted by department and number or how much interest the user expressed in them could be helpful.

    For a bit of extra challenge, you could include some additional information with each course indicating things like whether it is required for graduation or what graduation requirement it fulfills. Allow the user to select from multiple orderings using a ComboBox.

  7. If you did project 15 (p.313) or 12 (p.282) looking at box scores you might have noticed that one significant feature that was missing was the ability to change the player listing so that it orders the players based on a particular stat. Now that you know how to sort you can fix this. How you do this depends on how you chose to display the box score, but it can be as simple as adding a few Buttons for sorting by different statistical categories.

Additional exercises and projects, along with data files, are available on the book’s website.

1We do not have to put n things in place because if there are only n places and n ‒ 1 are in the right place, the last one most also be in the right place as it is the only one left.

2Real hardware can break down these scaling arguments at certain critical points. For example, if the input set becomes larger than the cache of the machine, you will typically see a slowdown that does not scale quite as O(n2).

3The exception to this is if we happen to know that we will very often have inputs that produce best-case behavior.

4The exact signatures of some of the methods have been simplied so that they make sense at this point in the book.

5Thanks to rapidly changing technology and remarkable changes in computer speed and usability, it is not clear people are bothering to keep phone books anyway.

6As a reminder, this notation implies that the range is inclusive for start and exclusive for end.

7This is only exact if n is a power of two. Otherwise some rounding will occur, but that is a detail we can ignore when talking about the order.

8You can invoke the Java profiler with the Java -Xprof option. To get Scala to run this option you set the JAVA_OPTS environment variable to include -Xprof.

9A nanosecond is 10‒9 seconds.

10The significant static checking of types was a major influence in the selection of Scala for this textbook.

11This means that the N value should be the x-axis and the time it takes should be the y-axis where both axes use a log scale. The advantage of this is that functions of the form f(x) = xn appear as straight lines in this type of plot with a slope equal to n.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset