Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 3. Data Structures and Algorithms

This chapter looks at how the principles of functional programming influence the design of data structures and algorithms. We won’t have the space to study either in depth, but we’ll learn some universal principles by studying a few important examples.

Functional languages provide a core set of common data structures with combinator operations that are very powerful for working with data. Functional algorithms emphasize declarative structure, immutable values, and side-effect-free functions.

This chapter is dense with details and it might be hard to digest on a first reading. However, the ideas discussed here are the basis for functional programming’s elegance, conciseness, and composability.

Let’s start with an in-depth discussion of lists, followed by a brief discussion of maps.

Lists

The linked list has been the central data structure in functional languages since the days of Lisp (as its name suggests). Don’t confuse the following classic definition with Java’s built-in List type.

As you read this code, keep a few things in mind. First, List is an Algebraic Data Type with structural similarities to Option<T>. In both cases, a common interface defines the protocol of the type, and there are two concrete subtypes, one that represents “empty” and one that represents “non-empty.”

Second, despite the similarities of structure, we’ll introduce a few more implementation idioms that get us closer to the requirements of a true algebraic data type, such as preventing undesired subtypes:

package datastructures;

public class ListModule {
  public static interface List<T> {

    public abstract T       head();
    public abstract List<T> tail();
    public abstract boolean isEmpty();
  }

  public static final class NonEmptyList<T> implements List<T> {

    public T       head()    { return _head; }
    public List<T> tail()    { return _tail; }
    public boolean isEmpty() { return false; }

    protected NonEmptyList(T head, List<T> tail) {
      this._head = head;
      this._tail = tail;
    }

    private final T _head;
    private final List<T> _tail;

    @Override
    public boolean equals(Object other) {
      if (other == null || getClass() != other.getClass())
        return false;
      List<?> that = (List<?>) other;
      return head().equals(that.head()) && tail().equals(that.tail());
    }

    @Override
    public int hashCode() { return 37*(head().hashCode()+tail().hashCode()); }

    @Override
    public String toString() { return "(" + head() + ", " + tail() + ")"; }
  }

  public static class EmptyListHasNoHead extends RuntimeException {}

  public static class EmptyListHasNoTail extends RuntimeException {}

  public static final List<? extends Object> EMPTY = new List<Object>() {

    public Object       head()    { throw new EmptyListHasNoHead(); }
    public List<Object> tail()    { throw new EmptyListHasNoTail(); }
    public boolean      isEmpty() { return true; }

    @Override
    public String toString() { return "()"; }
  };

  /* See the text for an explanation of this code */
  @SuppressWarnings(value = "unchecked")
  public static <T> List<T> emptyList() {
    return (List<T>) EMPTY; // Dangerous!?
  }

  public static <T> List<T> list(T head, List<T> tail) {
    return new NonEmptyList<T>(head, tail);
  }
}

First, we surround everything with a “module”, a class named ListModule. This is not strictly necessary, but it provides a place for us to define Factory methods that we’ll use as part of the public interface, rather than public constructors. Also, it’s convenient to define everything in one file. I’ll discuss some other benefits of ListModule below.

Next, we define an interface List<T> that holds items of type T (or subtypes of T). Following convention, a linked list is represented by a head, the left-most element, and a tail, the rest of the list. That is, the tail is itself a List, so the data structure is recursive. We’ll exploit this feature when implementing methods.

Member functions provide read-only access to the head and tail of the list. Hence, Lists will be immutable, although we can’t prevent the user from modifying the state within a particular list element itself. The isEmpty method is a convenience method to determine if the list has elements or not.

Next we have the class NonEmptyList that represents a list with one or more elements. Because a list is an algebraic data type, we need to control the allowed subtypes of List. Therefore, NonEmptyList is declared final.

Now the head and tail methods are getters for the corresponding fields, which are declared final so they are immutable.^[4] We’ll retain control over the structure of the list. Hopefully, the user will make the list elements immutable, too.

Because NonEmptyList never represents empty lists, isEmpty always returns false.

Why is the constructor protected? We want to control how lists are constructed, too. We will use static factory methods that are defined at the end of ListModule. This is not required, but it lets us use a construction “style” that is similar to the idioms used in functional languages.

The equals and hashCode method are somewhat conventional, but notice that both exploit the recursive structure of Lists. For equals, we compare the heads and then call List.equals on the tails. Similarly, hashCode effectively calls itself on the tail.

Recursion is also used in toString. It calls List.toString again when it formats the tail.

Now let’s discuss the representation of empty lists. What should happen if you call head or tail on an empty list? Neither method can return valid values, so we declare two exceptions that will be thrown if head or tail is called on an empty list.

Before we continue, those of you who know the Liskov Substitution Principle (which we’ll discuss in Chapter 5) might be crying, “foul!” Our List abstraction says that implementers should return valid objects, not throw exceptions. Isn’t this a violation of LSP?

After our discussion of the Option type in Chapter 2, we better not return null! We could change head to return Option<T> and tail to return Option<List<T>>. You should try this yourself (see the Exercises for this chapter).

Another approach, however, is to say that the list type specifies a protocol that you should never call head or tail on an empty list. To do so is an “exceptional” condition. If you think about it, you will have to check any list to see if it’s empty, one way or the other. You can either call isEmpty first and only call head or tail if it is not empty, or you can use Option as the return type and test for when None is returned, meaning the list is empty.

This checking may sound tedious, but it sure beats debugging NullPointerExceptions in production. Fortunately, you don’t need to do these checks very often, as we’ll see when we add combinator methods to List later on.

Back to the implementation. Recall that we defined None with a conventional class, even though all instances of None<T> for all types T are equivalent, because None carries no state information. It is effectively just a “marker” object. Empty lists are the same, stateless and used as list terminators and occasionally on their own. Now, however, we’ll really use just one instance, a Singleton object, to represent all empty lists.

ListModule declares a static final List<? extends Object> named EMPTY, an instance of an anonymous inner class. Its head and tail methods throw the exceptions we described above and its isEmpty method always returns true. Note the type parameter, ? extends Object, which means you could assign any List<X> for some X to EMPTY. This is needed for how we use EMPTY, which we’ll discuss in a moment. The following sidebar discusses what this type expression means.

No equals and hashCode methods are required, since there is only one empty list object, the default implementations for Object are sufficient. Also, toString returns empty parentheses to represent a list of zero elements.

Now we come to the public static Factory methods that clients use to instantiate lists, rather than calling constructors directly. Just as there are two concrete types, there are two factory methods, one for each type.

The first static method, emptyList “creates” an empty list. In fact, it returns EMPTY, but it appears to do something unspeakably evil; it downcasts from List<Object> to the correct List<T> type!

Well, this actually isn’t evil, because EMPTY carries no state, just like None. No ClassCastExceptions will ever occur when you use it. So, in practical terms, we are safe and our factory method hides our hack from users. We added the annotation to suppress warnings from the compiler.

Type parameters for generic methods like this are one of the few places where Java uses type inference when you call the method. Java will figure out the appropriate value for T from the type of the variable to which you assign the returned value.

One Subtype to Rule Them All?

Having to downcast EMPTY like this reflects a few limitations in Java’s type system that some other languages don’t have. Some languages define a special type that is the subtype of all other types, e.g., Nothing in Scala, where List<Nothing> would be a proper subtype of List<T> for all <T>. However, this feature wouldn’t be quite enough, due to another Java limitation. You can’t declare in the implementation of List<T> that any List<T2> is a subtype of List<T> if T2 is a subtype of T (called covariant subtyping). You can only make these declarations when List<T> is used to declare an instance, which is what we did for EMPTY. Here’s another example that demonstrates covariant subtyping:

List<? extends Object> EMPTY = new List<String>();

So, we are stuck with our hack if we want to use a Singleton for all empty lists.

The second factory method creates a non-empty list. We call it list to look similar to a constructor. Really, it’s effectively just a shorthand way of saying new NonEmptyList<T>(…) with less noise. Even the type parameter is inferred, as you’ll see when we discuss the test.

The primary benefit of factories is the way they create an abstraction for construction. Calling new is a form a strong coupling and prevents the substitution of instances of different concrete types, depending on the context. As a simple example, the list factory method could determine if an identical list already exists and return it instead. This would be safe since the lists are immutable (ignoring the possibility of mutable list elements).

We can see all this in action by looking at a test, ListTest. It’s long, so I’ll just show excerpts. For example, we’ll omit the equality tests^[5]:

package datastructures;
import static datastructures.ListModule.*;
...
public class ListTest {
  List<String> EMPTYLS = emptyList(); // The String parameter is inferred!
  List<Long>   EMPTYLL = emptyList();

  @Test(expected = EmptyListHasNoHead.class)
  public void callingHeadOnAnEmptyListRaises() {
    EMPTYLS.head();
  }

  @Test(expected = EmptyListHasNoTail.class)
  public void callingTailOnAnEmptyListRaises() {
    EMPTYLS.tail();
  }

  @Test
  public void callingTailOnAListWithMultiplElementsReturnsANonEmptyList() {
    List<String> tail = list("one", list("two", EMPTYLS)).tail();
    assertEquals(list("two", EMPTYLS), tail);
  }

  @Test
  public void callingHeadOnANonEmptyListReturnsTheHead() {
    String head = list("one", EMPTYLS).head();
    assertEquals("one", head);
  }

  @Test
  public void AllEmptyListsAreEqual() {
    assertEquals(EMPTYLS, EMPTYLL);
  }

  @Test
  public void ListsAreRecursiveStructures() {
    List<String> list1 = list("one", list("two", list("three", EMPTYLS)));
    assertEquals("(one, (two, (three, ())))", list1.toString());
  }
  ...
}

The test makes two “different” empty lists, one of type List<String> and one of type List<Long>, using the emptyList factory methods. However, the second to last test verifies that they are actually equal.

The first two tests verify that the appropriate exceptions are thrown if head and tail are called on empty lists. The next two tests verify that the head and tail of non-empty lists can be extracted.

The last test shows the nice recursive-looking representation that toString returns:

(one, (two, (three, ())))

Recursion is used in ListModule. A successful recursion must eventually terminate. You would have an infinite recursion if loops in a list were possible. The factory methods prevent this as they can only create lists terminated by EMPTY. Hence, the API enforces good behavior.

Tip

Pure functional programming uses recursion instead of loops, since a loop counter would have to be mutable.

We used a few idioms to enforce the algebraic data type constraint that the type hierarchy must be closed, with only two concrete types to represent all lists. The final keyword prevents subclassing NonEmptyList and using an anonymous class for EMPTY accomplishes the same goal. However, Java doesn’t give us a way to prevent other implementations of the List<T> interface itself, if we want to keep it public.

We are accustomed to saying that instances of a class can only have certain valid states and state transitions. Notice that algebraic data types are making the same kinds of assertions about types themselves, imposing a rigor that helps us think about allowed representations of state and transitions from an instance representing one state to an instance representing another state.

Maps

Let’s talk briefly about maps, which associate keys with values, as in this familiar Java example:

Map<String,String> languageToType = new HashMap<String,String>();
languageToType.put("Java",    "Object Oriented");
languageToType.put("Ruby",    "Object Oriented");
languageToType.put("Clojure", "Functional");
languageToType.put("Scala" ,  "Hybrid Object-Functional");

Maps don’t make good algebraic data types, because the value of defining an “empty” vs. a “non-empty” type (or similar decomposition) is less useful. In part, this reflects the fact that the “obvious” implementation of List is strongly implied by the head and tail design.

There is no such obvious implementation of Map. In fact, we need flexibility to provide different implementations for different performance goals. Instead, Map is a good example of an abstract data type (see Algebraic Data Types and Abstract Data Types).

I’ll leave it as an exercise for you to implement a functional-style map (see Exercises). Instead, let’s look at operations that work for lists, maps, and other collections.

Combinator Functions: The Collection Power Tools

You already think of lists, maps, etc. as “collections,” all with a set of common methods. Most collections support Java Iterators, too. In functional programming, there are three core operations that are the basis of almost all work you do with collections:

Filter: Create a new collection, keeping only the elements for which a filter method returns true. The size of the new collection will be less than or equal to the size of the original collection.
Map: Create a new collection where each element from the original collection is transformed into a new value. Both the original collection and the new collection will have the same size. (Not to be confused with the Map data structure.)
Fold: Starting with a “seed” value, traverse through the collection and use each element to build up a new final value where each element from the original collection “contributes” to the final value. An example is summing a list of integers.

Many other common operations can be built on top of these three. Together they are the basis for implementing concise and composable behaviors. Let’s see how.

Returning to our ListModule implementation, let’s add these methods (plus one other). Here is version 2 of ListModule, where I’ll only show what’s new to save space^[6]:

package datastructures2;
...
public class ListModule {
  public static interface List<T> {
    ...
    public      List<T>  filter    (Function1<T,Boolean> f);
    public <T2> List<T2> map       (Function1<T,T2> f);
    public <T2> T2       foldLeft  (T2 seed, Function2<T2,T,T2> f);
    public <T2> T2       foldRight (T2 seed, Function2<T,T2,T2> f);
    public      void     foreach   (Function1Void<T> f);
  }
  
  public static final class NonEmptyList<T> implements List<T> {
    ...
    public List<T> filter (Function1<T,Boolean> f) {
       if (f.apply(head())) {
        return list(head(), tail().filter(f));
      } else {
        return tail().filter(f);
      }
    }
    
    public <T2> List<T2> map (Function1<T,T2> f) {
      return list(f.apply(head()), tail().map(f));
    }
    
    public <T2> T2 foldLeft (T2 seed, Function2<T2,T,T2> f) {
      return tail().foldLeft(f.apply(seed, head()), f);
    }
 
    public <T2> T2 foldRight (T2 seed, Function2<T,T2,T2> f) {
      return f.apply(head(), tail().foldRight(seed, f));
    }
 
    public void foreach (Function1Void<T> f) {
      f.apply(head()); 
      tail().foreach(f);
    }
  }

  public static final List<? extends Object> EMPTY = new List<Object>() {
    ...
    public      List<Object> filter (Function1<Object,Boolean> f) { return this; }
    public <T2> List<T2>  map (Function1<Object,T2> f) { return emptyList(); }

    public <T2> T2 foldLeft  (T2 seed, Function2<T2,Object,T2> f) { return seed; }
    public <T2> T2 foldRight (T2 seed, Function2<Object,T2,T2> f) { return seed; }

    public void foreach (Function1Void<Object> f) {}
  };
}

There are five new methods declared in the List interface. We need two versions of fold, foldLeft and foldRight, for reasons we’ll discuss in a moment. Also, I’ve added a foreach method for convenience.

Each implementation for the five new methods in NonEmptyList is recursive, yet there are no checks for the end of the recursion! The corresponding implementation in EMPTY terminates the recursion. This means we have eliminated the need for conditional tests, replacing them with object-oriented polymorphism!

Recall that the filter method will return a new List. It takes a Function1<T,Boolean> f and applies f to each element. In Empty, filter just returns EMPTY. In NonEmptyList, if the result of applying f to head (f.apply(head())) is true, then filter constructs a new list with head and the result of calling filter on the tail. Otherwise, filter just returns the result of applying filter to the tail, thereby discarding head. So, filter is recursive and it terminates when it is called on an empty list.

The map method is slightly simpler, since it never discards an element. It also uses recursion to traverse the list, applying f to each element and building up a new list with the results. Note that f is now of type Function1<T,T2>, because the goal is to allow the original elements of type T to be transformed into instances of the new type, T2. This time, EMPTY’s map method calls emptyList, because it must return an object of type List<T2>, instead of an object of the original type.

The foldLeft and the foldRight methods are the hardest to understand, but they are actually the most important, as all other methods could be implemented using them! We’ll start with a general discussion of how these methods work, then return to the implementation details.

The reason there are two versions is because they traverse the collection and apply the function in different orders. In some cases, the ordering doesn’t matter. In others, the results will be different. There are other important differences we’ll see in a moment.

In a nutshell, foldLeft groups the elements from left to right, while foldRight groups them from right to left. It might help to start with an illustration of how these two methods work. Suppose I have a list of the integers 1 through 4. I want to add them using fold. Consider the following example:

List<Integer> listI =
  list(1, list(2, list(3, list(4, emptyList()))));
listI.foldLeft(0, new Function2<Integer, Integer, Integer>() {
  public Integer apply(Integer seed, Integer item) { return seed + item; }
});

Here is how foldLeft would add these numbers together:

((((0 + 1) + 2) + 3) + 4) == 10

The seed of 0 is first added to 1, then the result is added to 2, etc.

Now, here is the foldRight version:

List<Integer> listI =
  list(1, list(2, list(3, list(4, emptyList()))));
listI.foldRight(0, new Function2<Integer, Integer, Integer>() {
  public Integer apply(Integer item, Integer seed) { return item + seed; }
});

Here is how foldRight would add these numbers together. The result is:

(1 + (2 + (3 + (4 + 0)))) == 10

In this case, I exchanged item and seed in the body of apply to be consistent with the output and functional programming conventions.

Notice the similarity between the appearance of how listI is declared and how the foldRight algorithm is written in the comment. In fact, repeated application of our factory method list builds lists in a right-recursive way.

Since addition is associative, the answer is the same in both cases. You would get different answers if you did subtraction, for example.

So, we need two versions of fold because the order matters for non-associative operations. There are two other important differences.

First, imagine that listI is actually all positive integers, the natural numbers. We showed a simple representation in Lazy vs. Eager Evaluation. The NaturalNumbers class has a static value representing zero and the next method computes a value from the previous value you pass in.

Now look at the foldRight example again. Let’s rewrite our previous expression to make it infinite and let’s replace the literal numbers with calls to next (assuming we did a static import of everything in NaturalNumbers). For clarity, I’ll first show the expression with the literal numbers:

(1          + (2                + (3                      + (...))))
(next(ZERO) + (next(next(ZERO)) + (next(next(next(ZERO))) + (...))))

Of course, ZERO and 0 are actually equal. NaturalNumbers also defines take(n), which returns a List of the first n positive integers. Effectively, the recursion in foldRight will now terminate when it hits the end of this List, as if nested calls to next stop after n. If we call take(3), our expression reduces to the following:

(1          + (2                + (3                      + 0)))
(next(ZERO) + (next(next(ZERO)) + (next(next(next(ZERO))) + 0)))

When the recursion terminates in foldRight, it just returns the original seed value of 0.

So, we can see that foldRight can be used with infinite data structures, if only the first n elements will be evaluated.

However, foldRight has a drawback; it is not tail recursive. Why? Notice that we do an addition after the recursive call returns. The recursive call isn’t the last thing done, the tail of the algorithm. The tail-call optimization can’t be applied to foldRight.

However, foldLeft is tail recursive. Let’s write the left-recursive version of our last next example:

(((0 + 1         ) + 2               ) + 3)
(((0 + next(ZERO)) + next(next(ZERO))) + next(next(next(ZERO))))

Recall that (0 + next(ZERO)), etc. are recursive calls to foldLeft, but the addition now happens before the call, to construct the argument passed to the next invocation of foldLeft. Hence the recursion is a tail call, the last calculation done.

However, foldLeft can’t be used for infinite data structures. There is no place where we can replace a call to next with the seed, as for foldRight. So, foldLeft will eagerly evaluate the expression, blowing up on an infinite data structure.

Now let’s return to the implementations, starting with foldLeft. First, the function f is of type Function2<T2,T,T2>. The first T2 type parameter represents the seed. Recall that we are building up a new value that could be just about anything; a new list, a String, an Integer (for sums), etc. So, we have to pass a starter or “seed” value. Another conventional name for this argument is accumulator, since it will contain the “accumulation” of the work done up to a given point.

The second type parameter T for f is the type of the elements in the original list. The last type parameter T2 is the final return type of the call to foldLeft. Note that it must be the same as the seed type parameter.

Empty’s version of foldLeft simply returns the seed, terminating the recursion. In NonEmptyList’s foldLeft, foldLeft is called on the tail, passing as the new seed the result of applying f to the input seed and head.

The implementation of foldRight is similar. The seed is returned by Empty’s version of foldRight. However, the version in NonEmptyList has key differences compared to its version of foldLeft. Note that f is applied to the head and the result of the recursive call to tail().foldRight after the latter has returned. As we discussed above, this is why foldRight is not tail recursive.

Tip

Consider these concise and precise definitions: foldLeft “is the fundamental list iterator” and foldRight “is the fundamental list recursion operator” [Shivers].

To end our discussion of fold, note that there is a similar operation called reduce, which is like fold, but the initial value of the collection is used as the seed. Hence, fold is more general, because the type of the result doesn’t have to be the same as the type of the collection elements. Also, unlike fold, reduce will fail if used on an empty collection, since there is no “first” value!

Finally, we have foreach, the simplest of all these methods. Technically, foreach would be disallowed in “pure” functional programming, because it performs only side effects, as it returns void! The only useful work that can be done is for the input function f to do I/O or other state modifications. For example, you might use foreach in a main method as the outer loop for all other computations. Here is a contrived example that converts the input String[] args to a List<String> and then uses foreach to print out the list of arguments:

package datastructures2;
import datastructures2.ListModule.List;
import static datastructures2.ListModule.*;
import functions.Function1Void;

public class ForeachExample {
  public static void main(String[] args) {
    argsToList(args).foreach(new Function1Void<String>() {
      public void apply(String arg) {
        System.out.println("You entered: "+arg);
      }
    });
  }

  private static List<String> argsToList(String[] args) {
    List<String> result = emptyList();
    for (String arg: args) {
      result = list(arg, result);
    }
    return result;
  }
}

Actually, there’s a bug here; it prints the arguments in reverse order! (See Exercises).

I said that filter, map and fold are composable. All three are methods on List, of course. Two of them, filter and map, return a new List, while fold can return anything we want. One of our oldest problem-solving techniques is divide and conquer, where we decompose a hard problem into smaller, easier problems. We can divide complex computations into pieces using filter, map, and fold, then compose the results together to get the final result.

The following JUnit test shows how we can start with a list of integers, filter them to keep only the even values, multiple each of those by 2, then add them up:

package datastructures2;
import org.junit.Test;
import static org.junit.Assert.*;
import functions.*;
import static datastructures2.ListModule.*;

public class FunctionCombinatorTest {
 @Test
 public void higherOrderFunctionCombinatorExample() {
  List<Integer> listI =
    list(1, list(2, list(3, list(4, list(5, list(6, emptyList()))))));
  Integer sum = listI.filter(new Function1<Integer,Boolean>() {
     public Boolean apply(Integer i) { return i % 2 == 0; }
   })
   .map(new Function1<Integer, Integer>() {
    public Integer apply(Integer i) { return i * 2; }
   })
   .foldLeft(0, new Function2<Integer, Integer, Integer>() {
     public Integer apply(Integer seed, Integer item) { return seed + item; }
   });
  assertEquals(new Integer(24), sum);
 }
}

In fact, we call filter, map, and fold combinators, because they “combine” with their function arguments and they combine with each other to build more complex computations from simpler pieces. Combinators are arguably the most reusable constructs we have in programming.

Tip

The filter, map, and fold functions are combinators, composable building blocks that let us construct complex computations from simpler pieces. They are highly reusable. The combination of map and reduce was the inspiration for the MapReduce approach to processing massive data sets [Hadoop].

Finally, recall that I implemented these functions using recursion, but code that uses them avoids recursion, as in our FunctionCombinatorTest example. That means users of filter, map, and fold don’t have the drawbacks of recursion, namely the inefficient stack usage and the potential complexity that can arise in non-trivial recursive functions. We could even reimplement filter, map, and fold to eliminate recursion for better performance. Because these functions are used heavily, we would gain significant performance benefits at the expense of a less elegant implementation, but one that remains hidden behind the abstraction.

That’s a lot to digest! Once you’re ready for more, see [Bird2010] and [Hutton1999] for more on what these powerful operations can do.

Persistent Data Structures

It seems that if we want immutable values, we have to make a copy whenever we change a value. While this may be fine for small objects, it will be too expensive for large objects, like long lists and large maps.

Fortunately, we can have both immutability and acceptable performance if we only allocate memory for what is actually changing and we share the unchanged parts with the original object. This approach is called structure sharing. Tree data structures provide the balance of performance and space efficiency required to do this. The public abstraction might still be a List, a Map, or another data structure. The tree is only used for the internal storage. Note that the trees themselves and their nodes must be immutable. Otherwise, structure sharing will be dangerous, as mutations through one object will be seen by others that share the same substructure!

To simplify the discussion, let’s use unbalanced binary trees. They provide average O(log₂(N)) search times (unless the tree is really unbalanced). Real persistent data structures often use one of several 16- or 32-way balanced tree variants to further reduce search times and to optimize other performance goals. We’ll skip over these details and we won’t cover how you might implement a List, Map, or other object using a tree. However, [Spiewak2011] is an excellent presentation on several widely used persistent data structures (warning: Scala syntax). More technical details can be found in [Okasaki1998] and [Rabhi1999].

Figure 3-1 shows a tree at time “0” referenced as an object named value1.

Figure 3-1. Time 0, One Value

Now imagine a user wants to create a new tree that prunes off nodes a1 and its left branch, node a2, but keep node a3 and its right branch, node a4. All we have to do is create a new root node that points to a3 as its left branch and b1 as its right branch, as shown in Figure 3-2.

Figure 3-2. Time 1, Two Values, with Shared Substructures

Six of the original 8 nodes are shared by both trees. Only one new node allocation was required, the root node, value2.

Note that a history of the evolving values is being maintained. We still have value1 and as long as some code has a reference to it, it won’t be garbage collected. This is why these data structures are called persistent, not in the database sense (they aren’t normally saved to disk), but in the sense that old versions of an evolving structure will remain available as long as needed. We will exploit this feature in Software Transactional Memory.

Some Final Thoughts on Data Structures and Algorithms

From these examples, we can see how immutable values lead us to structure sharing as a way of making new values efficiently, where we share data that isn’t changing, rather than make full copies. This can only work if all the data elements are immutable. Different kinds of trees are the most useful data structures for implementing immutable collection types, because they can be chosen for optimizing various operations, like fast searching for values vs. fast updates.

The use of recursion is also universal, instead of looping. Recursion avoids mutable loop counters and it’s a natural fit for recursive data structures, like lists and trees.

However, we can avoid many uses of recursion by using our combinators, filter, map, and fold. We can do anything we want with collections using these modular, reusable, and composable functions.

Consider another example, a List of email addresses for our customers. We can filter for just the gmail addresses. We can map each address in the list to an appropriate anchor tag for displaying in a web page. We can fold over the list to group the users by domain. That is, we can build a map where each domain name is a key and the list of users at that domain is the corresponding value.

In contrast, now imagine that we wrote our own custom EmailAddresses class, for example, with one-off methods to do the filtering, mapping, and grouping I just described. We would write a lot more code (and tests) and the special-purpose nature of that code would make the class less attractive for reuse. If we follow this approach with our other domain concepts, we end up with far more code than we really need, with a relatively low density of value per line of code. There would be lots of little ad-hoc types and methods, most of which are seldom invoked and rarely reused.

You might argue that these custom types and methods provide a self-documentation feature. For example, EmailAddresses.groupUsersByDomain tells the reader exactly what’s going on. That’s useful, but there is a better way.

Interest in Domain-Specific Languages is another recent, popular trend (see, for example, [Ghosh2011a] and [Ghosh2011b]). DSLs try to capture the idiomatic language used by domain experts directly in the code. You can implement DSLs in both object-oriented and functional languages. Some languages provide better support for custom DSL syntax than others.

Back to our example, it can be useful to represent a domain with a DSL at the upper levels of abstraction, but questionable to create explicit object representations under this surface. We can have a DSL that says, for example groupUsersByDomain in emailAddresses, but implement it with List<EmailAddresses>.foldLeft(new HashMap<…>(), groupingFunction);, where groupingFunction does the “group by” magic on the users and domains.

In Functional Programming Is More Modular, I argued that objects operate at the wrong level of abstraction and they lack a standard set of protocols that are essential for the kind of reuse we want. The core data structures of functional programming and the combinators like filter, map, and fold bring us closer to that ideal.

Exercises

Add a factory method to ListModule that takes a variable argument list of elements and returns a properly constructed list.
Implement a new ListModule where head and tail return Options. This eliminates the slight smell of throwing exceptions for the empty list case. However, using Options makes some other code more awkward, as a unit test will show.
Re-implement the Option hierarchy following the idioms used for List; e.g., make None a static constant.
Implement a MapModule with an abstract data type Map. The implementation classes should use side-effect-free functions and immutability. How can you enable the use of alternative implementations that optimize performance and memory usage? What implementations would optimize the following:
1. A map that contains just a few key-value pairs.
2. A map that contains a few million key-value pairs.
3. A map that optimizes insertion performance.
4. A map that optimizes search performance.
5. A map that retains the order of insertion (e.g., for subsequent traversal).
ForeachExample prints the arguments in reverse order. Determine the cause and implement a fix. Hint: consider adding a useful method to ListHelper that is commonly found in List classes.
Reimplement the equals and toString methods in NonEmptyList using foldLeft or foldRight. Does the choice of fold method affect the results?
Reimplement the filter and map methods for Lists using foldLeft or foldRight.
Reimplement foldLeft and foldRight so they don’t use recursion. If you use mutable values, preserve thread safety.

^[4]We don’t care about using JavaBeans conventions for accessors in this case, because that convention doesn’t serve a useful purpose here.

^[5]The full listing is in the downloadable code examples, test/datastructures/ListTest.java.

^[6]The full listing is in the downloadable code examples, src/datastructures2/ListModule.java.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 3. Data Structures and Algorithms

Create new playlist

Sign In

Sign Up