2. Composite Data

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 2. Composite Data

2.0. Introduction

Now that we’ve got primitives out of the way, we need to start doing something with them. Single atomic values are great and all, but things get much more interesting when we start globbing them all together. As you’ll see soon enough, data manipulation is one of Clojure’s strong suits.

What makes Clojure so good at manipulating collections? It comes down to three things: immutability, persistence, and the sequence abstraction. Every one of Clojure’s built-in collection types has these properties and is thus unified in its API’s appearance and behavior.

As the great Alan J. Perlis (an early computer science pioneer) put it:

It is better to have 100 functions operate on one data structure than to have 10 functions operate on 10 data structures.

This chapter introduces Clojure collections and where/how to use them. Finally, we wrap things up by showing you how to build your own feature-complete types that look and behave just like the rest of Clojure’s collections by leveraging Clojure’s capacity for interface polymorphism.

Immutability

Immutability means that a Clojure data structure, once created, can never change. You can only “modify” an immutable data structure by creating a new data structure that is a copy of the old, with the desired changes in place.

Immutability also means that Clojure data structures, however deeply nested, are simple values, just like the number 3 or the character z. It doesn’t make sense to speak of “changing” the value of 3— it just is. If you “change” it by, say, incrementing it, you don’t modify 3 itself. Instead, you end up with an entirely new and different value, 4. Clojure extends this notion of value to all data structures. In Clojure, any action that in another language would perform any kind of update on a data structure will instead return an entirely new one. You can continue to pass around and use both the old and the new versions with confidence, knowing that nothing you can do will cause any unintended changes elsewhere in your program.

This feature is extremely important in concurrent and parallel programming, where unexpected mutation is the source of a large class of bugs. With immutable data, any number of threads can read from the same data without any worrying about locks or race conditions—it’s always safe to read something that can’t change. “Clone” operations are not only free, but unnecessary.

Persistence

But, you may ask, how can that possibly be efficient? Surely it is impractical to do a full copy of an object every time you need to add something?

Yes, it would be, except for the feature of persistence. Persistence means that Clojure’s data structures, although logically immutable, can still share pieces of their internal structure for efficiency in both time and space. Essentially, updated versions of immutable data only need to store the deltas from pervious versions, rather than doing a full deep copy.

To make it performant, all of this uses some extremely clever algorithms, of course. See the book Purely Functional Data Structures by Chris Okasaki (Cambridge University Press) for a detailed description of how they work.

The Sequence Abstraction

From vectors, maps, sets, and lists to strings and streams, every last one of Clojure’s collections behaves in a similar, predictable fashion. They are simple tools for a more civilized age. This is on account of Clojure’s sequence abstraction.

The array of collection-manipulating functions in Clojure are all implemented in terms of one simple abstraction: every collection can be treated as a sequence of values. By implementing first, rest, and cons, any data structure—even ones you build yourself—can participate in the ISeq interface.

Then, you can use Clojure’s huge library of functions that can operate on sequences, with any of the data structures. All of functional programming’s most beloved functions (map, reduce, filter, etc.) will work interchangeably on any data structure. In essence, the sequence abstraction allows all the expressiveness of traditional list-based LISP programming without forcing you to actually use lists. Instead, use whatever type is most efficient for the task, knowing that you can consume them all in the same way when it makes sense to do so.

2.1. Creating a List

by Luke VanderHart

Problem

You want to create a list data structure in your source code.

Solution

There are two basic ways to specifically construct a list (a clojure.lang.PersistentList).

You can use parentheses in combination with a single quote to indicate that the list should only be read as a data structure, not immediately evaluated:

'(1 :2 "3")
;; -> (1 :2 "3")

Or, more commonly, you can use the list function, which takes a variadic number of arguments and constructs a list from them:

(list 1 :2 "3")
;; -> (1 :2 "3")

Discussion

Typically, between these two approaches, using the list function is the better choice. The problem with constructing quoted lists is that the quote also prevents evaluation of everything inside the list, which means that symbols will be returned as literal symbols, instead of resolving variables or calling functions. list, however, will evaluate its arguments in the normal way before constructing the list and is usually what is desired for nonmacro code:

(def x 2)

'(1 x)
;; -> (1 x)

(list 1 x)
;; -> (1 2)

That said, '() is the idiomatic way to create an empty list—it is more terse, and the concern about evaluating its contents is irrelevant when it’s empty.

Clojure includes both list and vector types. Both are sequential data structures. However, for most purposes, vectors are a better fit and are more commonly used in idiomatic Clojure.

There are a couple of reasons for this. Vectors have a cleaner literal syntax than lists and are just as space-efficient and performant. In addition, vectors support near-constant lookup time by index (O(log₃₂ n)), as opposed to lists, which require linear time (O(n)).

In general, the only reason to explicitly choose a list over a vector is if you need a data structure that supports efficient insertions at the beginning, which lists do; vectors are most efficient when appending items to the end.

2.2. Creating a List from an Existing Data Structure

by Luke VanderHart

Problem

You have an existing sequential data structure that you would like to convert into a list as its concrete data type.

Solution

The easiest solution is: don’t. Having a concrete list provides little or no advantage over simply using the sequence abstraction directly on your existing data, and for large data structures, conversion can be expensive.

If you do know that you need an explicit conversion of the concrete data structure, there are two ways to do it.

First, you could use the apply function to call the list function, passing it your existing data structure as its arguments:

(apply list [1 2 3 4 5])
;; -> (1 2 3 4 5)

Alternatively, you could use the into function to repeatedly conjoin elements from your original data onto a list. Note, however, that this approach has the effect of reversing the order of the original collection:

(into '() [1 2 3 4 5])
;; -> (5 4 3 2 1)

Discussion

These two approaches are both viable choices. However, what actually happens in each case is very different.

When using apply, you are actually invoking the list function with however many arguments are in the data structure. This may sound strange, particularly if the data structure contains millions of items. What does it mean to invoke a function with a million arguments? How does that even work, given that the JVM limits methods to 255 arguments (see the JVM class file specification)?

As it turns out, functions with variadic arguments (such as list) are handled in a somewhat special way: the argument list is passed in as a sequence. apply knows this and passes this sequence view of the original structure directly through to the receiving function. This is why it works; there is never actually a JVM method invocation with a million arguments.

into works quite differently: it takes two arguments, the first being a data structure and the second being a sequence. It then repeatedly conjoins items from the sequence onto the data structure provided using the conj function (discussed in greater detail elsewhere). This is why the sequence is reversed; items are always pulled from the front of the sequence, but conj on a list prepends the element being added. Therefore, the first element in the input sequence will end up being the last item in the list, and so on.

So why ever choose into over apply, given that it reverses the order? Speed. into utilizes Clojure transients, which provide a considerable performance improvement. On the author’s machine, converting a million-item vector to a list using apply took an average of 750 milliseconds, while using into took about half that time, for an average of 350 milliseconds. Of course, the list was in reverse order, and reversing either the input or the output negates the speed advantage. In the end, into is only advantageous in situations where a reversed order is acceptable.

2.3. “Adding” an Item to a List

by Luke VanderHart

Problem

You want to add an item to a list; or, putting it in functional terms, you want to derive from an existing list a new list that contains an additional item.

Solution

Use the conj function. conj is used to add an item or items to a logical collection and is polymorphic, meaning it works on multiple concrete data types, including lists:

(conj (list 1 2 3) 4)
;; -> (4 1 2 3)

You can also add multiple items at once:

(conj (list 1 2 3) 4 5)
;; -> (5 4 1 2 3)

Discussion

The behavior of conj may vary slightly depending on the concrete type. It always “adds” an item to an immutable collection by returning a new collection containing the new item, but may add the item to different places in the collection depending on what is most efficient for the particular type.

In the case of lists, conj will always add the item at the beginning of the list, since a linked list data structure supports constant-time insertion only at the beginning.

If you are familiar with Common Lisp or Scheme, you were probably expecting to see the cons function used, instead of conj. Clojure does have a cons function, but it has a slightly different purpose.

While conj will return a new concrete clojure.lang.PersistentList (when used on a list), cons will always construct a new sequence, where the item added is the first and the collection is the rest.

This distinction is subtle, especially since a clojure.lang.PersistentList and a sequence constructed using cons are both types of persistent linked lists, algorithmically speaking.

The best way to think about it is that conj is a concrete data structure operation, which will not change the concrete type of the data structure it is applied to, while cons is a sequence operation and only guarantees that it will return a sequence: in fact, it returns a cons cell (clojure.lang.Cons) that implements the sequence interface, no matter what type of seq-able collection you gave it to start with.

Unlike conj, cons is also guaranteed to return a sequence with the item prepended to the beginning no matter what the collection type is.

2.4. “Removing” an Item from a List

by Luke VanderHart

Problem

You want to obtain a list without a particular item in it, removing an item from the original list.

Solution

Removing the first item from a list is easily accomplished using one of two functions, rest or pop. Both work identically when used on a nonempty list:

(pop '(1 2 3))
;; -> (2 3)

(rest '(1 2 3))
;; -> (2 3)

Discussion

rest is actually a sequence function, used to obtain the tail of a sequence. Since Clojure lists implement the sequence interface directly, using rest on a list will always return another (possibly empty) list.

pop is similar to conj in that it operates on concrete data structures rather than the sequence interface. Like conj, it is polymorphic; also like conj, the position it removes the item from depends on what’s most efficient for the concrete type.

When used on an empty list, the behavior does differ; pop will throw an exception, while rest will return an empty list:

(pop '())
;; -> IllegalStateException Can't pop empty list ...

(rest '())
;; -> ()

Lists do not support removing items except at the first position. If you need to remove an item in the middle or at the end of a list, you’ll have to do so using the sequence manipulation functions, then convert the result back into a concrete list (if you absolutely need it to be a list, for some reason).

2.5. Testing for a List

by Steve Miner

Problem

You want to test if a value is a list.

Solution

The list? function may seem like the obvious choice, but in most cases, it’s better to use the more general seq? function as your test.

Discussion

The list? function specifically tests if the argument implements clojure.lang.IPersistentList, but in most cases, you really want to know if the value is a seq (implements clojure.lang.ISeq), which is a more general abstraction than a list.

Not everything that prints as a list (in parentheses) actually satisfies the list? test. In practice, you’ll often receive Cons and LazySeq values when manipulating lists. By focusing on the fundamental seq abstraction, you don’t need to worry about the details of those concrete implementations:

;; A list constructed via list satisfies both list? and seq?
(list? (list 1 2 3))
;; -> true
(seq? (list 1 2 3))
;; -> true

;; cons, however *looks* like a list, but is actually a Cons
(list? (cons 1 '(2 3)))
;; -> false
(type (cons 1 '(2 3)))
;; -> clojure.lang.Cons
(seq? (cons 1 '(2 3)))
;; -> true

;; range's lazy return value is a seq, but not a list
(list? (range 3))
;; -> false
(seq? (range 3))
;; -> true
(type (range 3))
;; -> clojure.lang.LazySeq

It’s almost always better to use seq? instead of list?.

2.6. Creating a Vector

by Luke VanderHart

Problem

You want to create a vector data structure, either as a literal or from an existing data structure.

Solution

By far, the easiest way to create a vector is using the literal vector notation of square brackets. However, it is also possible to use the vector function, which creates a vector of its arguments:

[1 :2 "3"]
;; -> [1 :2 "3"]

(vector 1 :2 "3")
;; -> [1 :2 "3"]

To construct a vector from an existing data structure, you can use the vec function, which takes any collection and returns a vector containing the same items:

(vec '(1 :2 "3"))
;; -> [1 :2 "3"]

Alternatively, you can use the into function, which takes two collections and repeatedly invokes conj on the first with items from the second:

(into [] '(1 :2 "3"))
;; -> [1 :2 "3"]

Discussion

There is rarely any reason to use the vector function over the literal vector syntax. Unlike lists, vectors are not evaluated as function calls (or anything else) in Clojure, so quoting is not a concern as it is with list literals.

Oddly enough, when constructing a vector from an existing collection, using the into approach is currently about 30% more performant on large collections compared to vec due to its use of transients to speed things up. If you’re converting large collections and speed matters, consider using into. Otherwise, vec is usually more readable.

2.7. “Adding” an Item to a Vector

by Luke VanderHart

Problem

You want to add an item to a vector, yielding a new vector containing the item.

Solution

When used on a vector, the conj function returns a vector with one or more items appended to the end:

(conj [1 2 3] 4)
;; -> [1 2 3 4]

(conj [1 2 3] 4 5)
;; -> [1 2 3 4 5]

Discussion

Vectors do not support adding new items anywhere aside from the end. If you need to insert an item in the middle, you will have to use a sequence manipulation function and convert back to a vector (if necessary) when you’re done.

Since vectors are associative (mapping integer indexes to values), you can also use the assoc function with an index equal to the current length of the vector (one greater than the maximum index) to append an item:

(assoc [:a :b :c] 3 :x)
;; -> [:a :b :c :x]

However, this approach is somewhat more fragile than conj. If the index you provide is too small, you might simply “overwrite” an earlier value in the vector; and if it’s greater than the vector’s current length, it will throw an IndexOutOfBoundsException.

Still, this technique is worth remembering. If you have code that is assoc-ing to a vector already, you can use this technique to produce new vectors with updated values.

2.8. “Removing” an Item from a Vector

by Luke VanderHart

Problem

You want to remove an item from a vector, obtaining a new vector without the item.

Solution

To efficiently remove an item from the end of a vector, use the pop function, which takes a vector and returns a new vector without the last item:

(pop [1 2 3 4])
;; -> [1 2 3]

Discussion

Although there is no operation designed specifically to remove items from the beginning of a vector, as pop does from the end, there is a function, subvec, that can be used to efficiently remove any number of items from the beginning or end of a vector. Given a vector, a start index, and an (optional) end index, it will return a vector from the start (inclusive) to end (exclusive) indexes.

The following example drops a single item from the beginning of a vector. You can use subvec like so:

(subvec [:a :b :c :d] 1)
;; -> [:b :c :d]

Or, to remove items from the beginning and the end of a vector, pass an end index to subvec as well:

(subvec [:a :b :c :d] 1 3)
;; -> [:b :c]

Because subvec exploits the internal representation of a vector to create a subvector that shares the internal structure of the original, it is extremely efficient and runs in constant time. It is the only way to efficiently remove items from the beginning of a vector.

While it is certainly also possible to use a function like rest or drop on a vector, these are technically sequence operations, not vector operations. The value they return is only guaranteed to be a sequence, not a concrete vector, and as such will not support the same features or performance guarantees that vectors do.

Of course, you can convert any sequence back into a concrete vector using vec or into [], but this can be an expensive operation for large vectors.

2.9. Getting the Value at an Index

by Luke VanderHart

Problem

You have a vector, and you want to retrieve the value the vector contains at a particular location (index).

Solution

There are several ways to do this.

Using nth

The nth function, which works on all sequences, is special-cased to provide constant-time performance when used with indexed collections such as vectors:

(nth [:a :b :c :d] 2)
;; -> :c

If given an index greater than the size of the vector, nth will throw an exception unless you pass it an optional third argument, which will be returned if the provided index is out of bounds:

(nth [:a :b :c] 4)
;; -> IndexOutOfBoundsException

(nth [:a :b :c] 4 :not-found)
;; -> :not-found

Using vectors as functions of their indexes

Vectors are themselves functions that when called with an integer argument, will return the value at that index:

(def v [:a :b :c])
(v 2)
;; -> :c

Using an out-of-range index when invoking a vector as a function will result in an IndexOutOfBoundsException.

Using get

Because vectors support the associative interface with integer indexes as keys, you can also use the get function to retrieve values by index:

(get [:a :b :c] 2)
;; -> :c

Unlike nth, when you pass an out-of-range index to get it will return nil, not throw an exception—unless, that is, you provide a default value to be returned if the key (the index, in this case) is not found:

(get [:a :b :c] 5)
;; -> :nil

(get [:a :b :c] 5 :not-found)
;; -> :not-found

Discussion

Which technique should you use? All work equally well, but the choice does emphasize the way in which you’re looking at your vector. nth focuses on its sequential nature, whereas get emphasizes its indexed, associative quality. Using the vector as a function is also consistent with the way all associative collections in Clojure act as functions of their keys.

Ultimately, when making a choice like this, you should consider:

What would make the code most evident?
What is the nature of the data in this case? For example, is it most fundamentally a sequence and only coincidentally a vector (implying nth), or fundamentally a correlation of values to indexes (implying get)?
What is the failure mode of the proposed technique? For example, would a nil return value or an exception be preferable?

2.10. Setting the Value at an Index

by Luke VanderHart

Problem

Given a vector, you would like to obtain a new vector with a different value at a particular index.

Solution

Use assoc to set the value at a particular index:

(assoc [:a :b :c ] 1 :x)
;; -> [:a :x :c]

assoc can also be used to set multiple indexes at the same time, by providing additional index/value pairs:

(assoc [:a :b :c ] 1 :x 2 :y)
;; -> [:a :x :y]

Discussion

As you may have noticed, assoc is the same function used to set the values of keys in a map. This is because vectors, like maps, are associative and implement the same interface (clojure.lang.Associative), which is what assoc uses under the hood.

Unlike with maps, however, the keys used when using assoc on a vector must be integer indexes within the range of the vector. Attempting to use a noninteger key will cause an IllegalArgumentException, and attempting to assoc an index greater than the size of the vector will throw an IndexOutOfBoundsException.

Note that it is possible to assoc to an index equal to the current size of the vector (one greater than the maximum index). This will have the result of appending the item to the end.

2.11. Creating a Set

by Luke VanderHart

Problem

You want to create an unordered collection of distinct objects, which can be tested for membership quickly.

Solution

Use a set literal to create a set of objects:

#{:a :b :c}
;; -> #{:a :c :b}

;; Duplicate elements in set literals are an error
#{:x :y :z :z :z}
;; -> IllegalArgumentException Duplicate key: :y :z ...

Use hash-set to create a set from arguments:

(hash-set :a :b :c)
;; -> #{:a :c :b}

(apply hash-set :a [:b :c])
;; -> #{:a :c :b}

Use set to create a set from another collection:

(set "hello")
;; -> #{e h l o}

Alternatively, use into with a set to create a new set:

(into #{} [:a :b :c :a])
;; -> #{:a :b :c}

(into #{:a :b} [:b :c :d])
;; -> #{:a :b :c :d}

Set construction performance

At the time of writing, the into technique is about three times faster than set for large collections of objects. Use it whenever you’re working with large sets where performance is a concern:

(def largeseq (doall (range 1e5)))

(time (dotimes [_ 100] (set largeseq)))
;; *out*
;; "Elapsed time: 5594.961 msecs"

(time (dotimes [_ 100] (into #{} largeseq)))
;; *out*
;; "Elapsed time: 1329.66 msecs"

Create a sorted set with sorted-set:

(sorted-set 99 4 32 7)
;; -> #{4 7 32 99}

(into (sorted-set) "the quick brown fox jumps over the lazy dog")
;; -> #{space a  c d e f g h i j k l m 
 o p
;;      q 
 s 	 u v w x y z}

Discussion

Sets are very useful data structures. They are commonly used when you have a collection of values but you are only concerned with the distinct values. Lookup of an element in a set is typically O(1).

The techniques just shown all construct hash sets—sets that are unordered and use a hash table as their internal representation.

Clojure also supports creating sorted sets, which maintain the order of their elements. Sets created with sorted-set keep their elements in ascending order using compare. This is useful when treating the set as a seq:

(def alphabet (into (sorted-set) "qwertyuiopasdfghjklzxcvbnm"))
(last alphabet)
;; -> z
(second (disj alphabet ))
;; -> c

Warning

All of the elements in a sorted set must be comparable against one another (e.g., you cannot have a sorted set that contains both strings and numbers). Attempting to add an uncomparable value will result in a runtime error.

Adding or removing objects in a sorted set will always return another sorted set.

If the values you want to store don’t have a natural sort order (or you don’t want to use their natural ordering), you can specify a custom comparator using sorted-set-by. The comparator used to create the set is preserved when adding or removing objects:

(def descending-set (sorted-set-by > 1 2 3))

(into descending-set [-1 4])
;; -> #{4 3 2 1 -1}

There are some performance trade-offs to consider when choosing between hash sets and sorted sets. Hash sets are based on hash tables, which offer constant time insert and lookup in most cases. However, they do require some degree of memory overhead. Sorted sets, based on a balanced red-black binary tree, are more memory-efficient but slower for lookup and insertion.

2.12. Adding and Removing Items from Sets

by Luke VanderHart

Problem

You want to obtain a new set with items added or removed.

Solution

The conj function supports sets, just as it does lists, vectors, and maps. Use it to add an item or items to a set: just pass it the set and any number of items to add:

(conj #{:a :b :c} :d)
;; -> #{:a :c :b :d}

(conj #{:a :b :c} :d :e)
;; -> #{:a :c :b :d :e}

To remove one or more items, use the disj function, which is specific to sets. It takes a set and one or more keys to remove:

(disj #{:a :b :c} :b)
;; -> #{:a :c}

(disj #{:a :b :c} :b :c)
;; -> #{:a}

Discussion

Since sets are unordered, there is no concept of “where” items are added or removed; either a set contains an item, or it doesn’t.

Note that both conj and disj return a set of the same concrete type as the original. A hash set will remain a hash set, and a sorted set will remain a sorted set.

Also worth noting is that these operations are simply no-ops if the set already does or does not contain the item being added or removed. conj returns the same set if it already contains the item, just as disj does if the specified item was already absent.

If you’re adding or removing large numbers of items to or from sets, you should also consider using the dedicated set manipulation functions from the clojure.set namespace: particularly clojure.set/union, which can be used to add the items of multiple sets together, and clojure.set/difference, which can be used to obtain a set of items not contained in another set. These are typically a far more natural expression of set operations than issuing many calls to conj or disj, or invoking them with large numbers of arguments.

2.13. Testing Set Membership

by Luke VanderHart

Problem

You want to check if an item is a member of a set.

Solution

The easiest way to check a single item is with the contains? function, which takes a set and an item and returns true if the item is a member of the set:

(contains? #{:red :white :green} :blue)
;; -> false

(contains? #{:red :white :green} :green)
;; -> true

The get function also works with sets and does basically the same thing, except instead of returning true or false, it returns the value itself if it is a member, or nil if it is not:

(get #{:red :white :green} :blue)
;; -> nil

(get #{:red :white :green} :green)
;; -> :green

Finally, sets are also functions. When you invoke them with a single argument, they work just like get, returning the argument if it is a member, and nil otherwise:

(def my-set #{:red :white :green})

(my-set :blue)
;; -> nil

(my-set :green)
;; -> :green

Note as well that keywords behave in the same manner for sets as they do with maps. Thus, the following is equivalent to having used get:

(:blue #{:red :white :green})
;; -> nil

(:green #{:red :white :green})
;; -> :green

Discussion

The choice between contains? and get is mainly an aesthetic one. However, if your set might contain nil as an actual value you care about, you’ll definitely need to use contains?, since a nil return from get wouldn’t tell you anything in that case.

The ability to use a set as a function is interesting, but it’s especially useful when you want to use it as a predicate function on a sequence operation. For example, it’s fairly common to want to filter a sequence to only contain items in a set. In this case, using the set itself is both easy and idiomatic:

(take 10
  (filter #{1 2 3}
          (repeatedly #(rand-int 10))))
;; -> (2 1 2 3 2 2 1 2 2 1)

This snippet first creates an infinite lazy sequence consisting of random numbers between 1 and 10, using repeatedly to call rand-int (wrapped in an anonymous function) over and over. Then it feeds this sequence through a filter, with a set of the numbers 1–3 as the filter predicate.

The result is another infinite lazy sequence, but containing only members of the predicate set.

This example is contrived. However, using sets as predicate functions is an extremely useful technique that that pops up quite frequently in Clojure projects.

2.14. Using Set Operations

by Luke VanderHart

Problem

You want to perform common operations on sets, such as taking the union, intersection, or difference of two sets, or you want to test if one set is a subset or superset of another.

Solution

All these functions are available in the clojure.set namespace, which is built into Clojure.

union takes any number of sets as arguments and returns a set containing their union (i.e., a set containing all the elements from all the sets):

(clojure.set/union #{:red :white} #{:white :blue} #{:blue :green})
;; -> #{:white :red :blue :green}

intersection also takes any number of sets as args and returns their intersection (a set consisting only of the items shared by all the argument sets):

(clojure.set/intersection #{:red :white :blue}
                          #{:red :blue :green}
                          #{:yellow :blue :red})
;; -> #{:red :blue}

difference takes a set as its first argument and returns it without elements from the sets given in the additional arguments:

(clojure.set/difference #{:red :white :blue :yellow}
                        #{:red :blue}
                        #{:white})
;; -> #{:yellow}

subset? returns true if and only if the first argument is a subset of the second (that is, if every member of the first set is also a member of the second):

(clojure.set/subset? #{:blue :white}
                     #{:red :white :blue})
;; -> true

(clojure.set/subset? #{:blue :black}
                     #{:red :white :blue})
;; -> false

superset? works the same way, except it returns true only if the first set is a superset of the second.

As you may have noticed, superset? is actually identical to subset?, only with the order of the arguments reversed.

Discussion

In general, you should try to use these set manipulation functions wherever they are applicable. Sets represent a sizable portion of the data most developers work with day to day, whether they are recognized and explicitly modeled as sets or not.

There are a large number of bugs that can be caused by assumptions regarding the behaviors of collections. In programming, the type of data structure used for a given purpose is actually a communication, from the initial writer of the code to future programmers, that tells a number of things about the nature of the collection. Sets are unordered, unique collections—they emphasize that the important fact is whether an item is a member of the set, not the order or number of occurrences.

If your data does represent a logical set, then model it using set data structures, and try to think about manipulating it in terms of set operations. In many cases, you will find that this makes your program substantially easier to reason about, and makes it more self-documenting regarding the source and intended use of the data it contains.

2.15. Creating a Map

by Luke VanderHart

Problem

You want to create an association that maps keys to values. You possibly want to maintain a specific ordering of keys.

Solution

Use map literals (curly braces) with alternating keys and values to create simple maps:

{:name ""
 :class :barbarian
 :race :half-orc
 :level 20
 :skills [:bashing :hacking :smashing]}

Keys and values can be any type. Commas may be used to delimit key/value pairs where the structure would be hard to discern at a glance:

{1 1, 8 64, 2 4, 9 81}

Tip

In Clojure, commas are whitespace, which means that they can be used anywhere in a form with no effect on the value; it is just one way to make your code easier to read.

Create an empty, unsorted map with a pair of braces: {}.

Create specific types of maps with map constructor functions. array-map, hash-map, and sorted-map each return a map of the corresponding type:

(array-map)
;; -> {}

(sorted-map :key1 "val1" :key2 "val2")
;; -> {:key1 "val1" :key2 "val2"}

If a key occurs multiple times in the argument list, the last value will be that used in the final return map.

Use sorted-map-by to create a sorted map using a custom comparator:

(sorted-map-by #(< (count %1) (count %2))
               "pigs" 14
               "horses" 2
               "elephants" 1
               "manatees" 3)
;; -> {"pigs" 14, "horses" 2, "manatees" 3, "elephants" 1}

Discussion

Clojure maps can have one of three distinct concrete implementations:

Array maps, clojure.lang.PersistentArrayMap: These are backed by a simple array. They are efficient for very small maps, but not for larger sizes.
Hash maps, clojure.lang.PersistentHashMap: These are backed by a hash table data structure. Hash tables support near constant-time lookup and insertion, but also require a certain amount of overhead space, using up slightly more heap space.
Sorted maps, clojure.lang.PersistentTreeMap: These are backed by a balanced red-black binary tree. They are more space-efficient than hash maps, but have slower insertion and access times.

Array maps are the default implementation for small maps (under 10 entries at the time of writing), and hash maps are the default for larger ones. Sorted maps can only be created by explicitly invoking the sorted-map or sorted-map-by functions.

Using assoc or conj on a sorted map will always yield another sorted map. However, assoc-ing onto an array map will yield a hash map once it reaches a certain size. The inverse is not true; using dissoc on a hash map will not yield an array map, even if it becomes very small.

2.16. Retrieving Values from a Map

by Luke VanderHart

Problem

You want to retrieve the value stored at a particular key in a map.

Solution

As with sets, there are several ways to retrieve the value of a key.

The most straightforward way is to use the get function, which, given a map and a key, returns the value stored at the key or nil if the map does not contain the key:

(get {:name "Kvothe" :class "Bard"} :name)
;; -> "Kvothe"

(get {:name "Kvothe" :class "Bard"} :race)
;; -> nil

If desired, you can also pass a third argument to be used as the default return value instead of nil if a map doesn’t contain the key:

(get {:name "Kvothe" :class "Bard"} :race "Human")
;; -> "Human"

If your map uses keywords as keys, you can use the keyword itself as a function. Keywords implement the IFn interface, and when invoked with a map as an argument, they will look themselves up in the map, returning the value if it is present or nil if not. You can also pass a second argument that will be used as a default return value in the case of a missing key, just as you can with get:

(:name {:name "Marcus" :class "Paladin"})
;; -> "Marcus"

(:race {:name "Marcus" :class "Paladin"} "Human")
;; -> "Human"

Finally, the third basic way to look up a value in a map is to use the map itself as a function, passing the key to be retrieved as the argument. As with get and keyword functions, it is also possible to pass a second argument for use as a default value if the key is not found; otherwise, nil is returned:

(def character {:name "Brock" :class "Barbarian"})

(character :name)
;; -> "Brock"

(character :race)
;; -> nil

(character :race "Human")
;; -> "Human"

There is a convenience function for looking up items in nested maps: get-in. Instead of passing a single key, you can pass a sequence of keys, and they will be successively looked up in a nested structure, as if repeatedly calling get on each level of the nested data structure. nil is returned if any key is missing:

(get-in {:name "Marcus" :weapon {:type :greatsword :damage "2d6"}}
        [:weapon :damage])
;; -> "2d6"

(get-in {:name "Marcus"}
        [:weapon :damage])
;; -> nil

get-in also takes an optional default value, which will be returned if any key in the nested hierarchy is missing:

(get-in {:name "Marcus"}
        [:weapon :damage]
        "1d2 (fists)")
;; -> "1d2 (fists)"

Note that get-in works with any associative data structure, not just maps. This means that it can be combined to work with, for example, indexes of vectors:

(get-in [{:name "Marcus" :class "Paladin"}
         {:name "Kvothe" :class "Bard"}
         {:name "Felter" :class "Druid"}]
        [1 :class])
;; -> "Bard"

Discussion

Which technique of the three discussed is the best to use? All have identical semantics, but in idiomatic Clojure, they convey different implications about the scenario in which they are used.

Typically, keyword-as-a-function lookup is used when maps are being used as “objects” and the keys as “fields”; where the map contains a relatively small, well-known set of keys; and when there is a reasonable expectation that the key will actually be present.

The get function and map-as-a-function lookup techniques, on the other hand, are more frequently used with large maps where the set of possible keys is more open-ended. There is less motivation for choosing between these two; the only difference to be aware of is that when the map provided is nil for some reason, using it in function position will throw an exception, while applying get to nil will simply return nil.

It is interesting to note, as well, that the ability to use a map itself as a function is not just an arbitrary convenience. In the technical sense of the word “function,” maps are functions of keys to values. Consider the following function definition and map:

(defn square [x] (* x x))

(def square {1 1, 2 4, 3 9, 4 16, 5 25})

Using an invocation of the form (square 3), the caller can actually be agnostic as to whether square is a “real” function or a map. Of course, the normally defined function has a number of advantages in this case. For one, it has an unlimited domain instead of just the keys enumerated. And the multiplication function is fairly fast, so precomputing results is not a win. But in some cases, for functions that do have a more naturally constrained domain and are more expensive to compute, being able to use a map implementation of a function can be a real boon to performance.

Because all of the different techniques for retrieving values from a map return nil if the key is not present, special handling is required if you need to differentiate the case in which a key does exist in a map with a value of nil from the case in which the key does not exist at all.

The easiest way to do this is to always provide a default value to be returned in the case of a missing key. To be absolutely sure that you can differentiate the default value from any possible value the map might contain, you can use a namespace-qualified keyword (e.g., ::not-found).

It is also possible to use the contains? function, which takes a collection and a key, and returns true if and only if the collection has a specific entry for that key (even if the value is nil).

The exact behavior of the contains? function often causes confusion, especially since many other languages have a function with a similar name that does something different.

In Clojure, contains? is not a search function—it does not inspect a collection to see if the item is present. Rather, it is a lookup function, and only works on associative or indexed collections. In other languages, it is often named containsKey or similar.

This means it works as one would expect on maps and sets, returning true if the specified key is a valid key or member in the collection. But for vectors, it will return true only if passed an integer between zero and the maximum index of the vector. And it will throw an exception if used at all with a list or a sequence.

2.17. Retrieving Multiple Keys from a Map Simultaneously

by Leonardo Borges

Problem

You want to retrieve multiple values from a map at one time.

Solution

Use vals and select-keys when the order of returned values is not important:

;; How many red and green beans are there?
(def beans {:red 10
            :blue 3
            :green 1})

(reduce + (vals (select-keys beans [:red :green])))
;; -> 11

Use juxt when order matters:

;; What are the red and green bean totals?
((juxt :red :green) beans)
;; -> [10 1]

Discussion

juxt and the combination of vals and select-keys are both apt tools for retrieving multiple keys from a map. There are subtleties to their behavior that are important to understand, though.

At first glance, the juxt approach seems to be the clear winner of the two. However, it only goes so far: the approach falls apart when any of the keys you wish to retrieve is not a keyword (more specifically, not a function). This is because juxt merely juxtaposes the return values of multiple functions. Since keywords are functions, it’s possible to juxt them and retrieve a strongly ordered list of values.

If the keys in the beans map were strings, it would not be possible to retrieve their values with juxt:

((juxt "a" "b") beans)
;; -> ClassCastException java.lang.String cannot be cast to clojure.lang.IFn ...

select-keys, on the other hand, is capable of pulling values for any number of arbitrary keys. The select-keys function takes a map and a sequence of keys and returns a new map populated with only those keys:

(def weird-map {"a" 1, {:foo :bar} :baz, 13 31})

(select-keys weird-map
             ["a" {:foo :bar}])
;; -> {{:foo :bar} :baz, "a" 1}

(vals {{:foo :bar} :baz, "a" 1})
;; -> (1 :baz)

Since maps are not ordered, it is not safe to assume that the ordering of keys and values is identical (even if you stumble upon an example where it is). In cases where you’re pulling multiple values from nonkeyword maps, it is probably easiest to wrap that interaction up via juxt:

(def a-str-then-foo-bar-map
  (juxt #(get % "a")
        #(get % {:foo :bar})))

(a-str-then-foo-bar-map weird-map)
;; -> [1 :baz]

You’ll avoid weird maps now, won’t you?

2.18. Setting Keys in a Map

by Luke VanderHart

Problem

You want to “change” a map by adding, setting, or removing keys.

Solution

The most basic way to change a map is using the assoc function. Given a map and any number of additional key/value pairs as arguments, it will return an updated map containing the respective keys and values:

(def villain {:honorific "Dr." :name "Mayhem"})
(assoc villain :occupation "Mad Scientist" :status :at-large)
;; -> {:honorific "Dr.", :name "Mayhem",
;;     :occupation "Mad Scientist", :status :at-large}

If used on a map that already contains a key, the assoc function will return an updated map with the newly specified value for the key:

(def villain {:honorific "Dr.", :name "Mayhem",
              :occupation "Mad Scientist", :status :at-large})
(assoc villain :status :deceased)
;; -> {:honorific "Dr.", :name "Mayhem",
;;     :occupation "Mad Scientist", :status :deceased}

To remove keys, use the dissoc function. Given a map and any number of keys, it returns a map minus those keys:

(def villain {:honorific "Dr.", :name "Mayhem",
              :occupation "Mad Scientist", :status :deceased})
(dissoc villain :occupation :honorific)
;; -> {:name "Mayhem", :status :deceased}

Discussion

It’s fairly common to have maps contained in other maps. If it is necessary to update a deeply nested value, nested calls to assoc quickly become inconvenient, especially since they need to be “inside-out.” Consider the following data structure:

(def book {:title "Clojure Cookbook"
           :author {:name "Ryan Neufeld"
                    :residence {:country "USA"}}})

If Ryan were to move back to his native land of Canada, fully updating the map representing this book using only assoc would look something like the following:

(assoc book :author
  (assoc (:author book) :residence
    (assoc (:residence (:author book)) :country "Canada")))

Obviously, this is inconvenient and difficult to read.

The assoc-in function removes this inconvenience, allowing you to specify a key path instead of a sole key. Instead of changing a value one key deep, a key path lists a sequence of keys, applied recursively to change a deeply nested value:

(assoc-in book
          [:author :residence :country]
          "Canada")
;; -> {:author {:name "Ryan Neufeld"
;;              :residence {:country "Canada"}}
;;              :title "Clojure Cookbook"}

The preceding sample first looks up the map associated with the :residence key in the nested data structure, then associates "Canada" with the :country key. Finally, the entire data structure is returned.

What if you needed to update a value based on its previous value, instead of just changing it?

Fortunately, Clojure provides update-in expressly for this purpose. Instead of taking a new value, update-in takes an update function. This function is invoked with the value retrieved at key path and any trailing arguments passed to update-in. It’s a peculiar function to wrap your head around at first. Perhaps it is best to illustrate with an example:

(def website {:clojure-cookbook {:hits 1236}})

;; Register 101 new hits to the Cookbook website
(update-in website                   ; 
           [:clojure-cookbook :hits] ; 
           +                         ; 
           101)                      ; 
;; -> {:clojure-cookbook {:hits 1337}}

: The map
: The key path
: The update function, +
: Additional arguments to +

update-in will also actually create maps for any of the keys in the vector that don’t exist. This means it can be used to create structure as well as to update values:

(update-in {} [:author :residence] assoc :country "USA")
;; -> {:author {:residence {:country "USA"}}}

Even though the starting map is empty, two empty maps are created for the values of the :author and :residence keys, meaning the assoc will be applied to a new, empty map.

One other common use case for maps is as the values of one of Clojure’s state constructs: atoms, refs, or agents. Clojure maps themselves are immutable values. In a very literal sense, if you “add” a key to a map, it is no longer the same value any more. But sometimes, it is necessary to preserve a logical identity for different values across time. That’s when to use one of the state management tools.

To update the value of a piece of state (ref, atom, or agent), you invoke its specific state transition function (alter, swap!, or send, respectively). State transition functions share a common form: they take the reference as the first argument, the function to apply to the value as the second argument, and any arguments to the function as additional arguments.

So, for example, to deeply update an item contained in a map referenced by an atom, you can invoke the swap! function (the state transition function for atoms), passing it your atom and the update-in function, along with the list of keys and the function to use to update the value:

(def retail-data (atom {:customers [{:id 123 :name "Luke"}
                                    {:id 321 :name "Ryan"}]
                        :orders [{:sku "Q2M9" :customer 123 :qty 4}
                                 {:sku "43XP" :customer 321 :qty 1}]}))

(swap! retail-data update-in [:orders] conj
       {:sku "9QED" :customer 321 :qty 2})

This will add a new order map to the list of orders contained in the map contained in the retail-data atom.

Although such triple combos are not terribly common, they illustrate the general consistency of functions that take other functions and arguments, and how they can be combined arbitrarily deeply. In this case, what starts with a single call to swap! ends up also updating a map and conjoining to a vector in the same form.

2.19. Using Composite Values as Map Keys

by Luke VanderHart

Problem

You’d like to use a value that isn’t a simple primitive type as a lookup key in a map. For example:

You’d like to use geographic or Cartesian coordinates as map keys.
You’d like to associate values with functions.

Solution

Because of its robust identity semantics on composite values, Clojure fully supports using any immutable value as a map key. More importantly, doing so is reasonably efficient.

For example, consider the data structure to represent a chessboard, an 8 × 8 grid where each position can have one of six possible types of piece. Rows are represented by the numbers 1–8, and columns by the letters a–h.

In Clojure, you can represent this directly as a map:

(def chessboard {[:a 5] [:white :king]
                 [:a 4] [:white :pawn]
                 [:d 4] [:black :king]})

Moving a piece then requires two operations, dissoc-ing the old position for a piece and assoc-ing the new position:

(defn move
  "Given a map representing a chessboard, move the piece at src
  to dest"
  [board source dest]
  (-> board
      (dissoc source)
      (assoc dest (board source))))

(move chessboard [:a 5] [:a 4])
;; -> {[:d 4] [:black :king]
;;     [:a 4] [:white :king]}

As another example of nontraditional map keys, consider the situation where you have a set of functions, and you want to be able to assign them each a “weighting” and multiply the return value by the corresponding weight whenever the function is called.

An easy way to do this is to store the functions and weights in a map, with the functions as keys:

(def plus-two (partial + 2))
(def plus-three (partial + 3))
(def weight-map {plus-two 1.0
                 plus-three 0.8})

Then you can use a simple wrapper function to apply the functions with the weights applied:

(defn apply-weighted
  "Given a weight map, a function, and args, applies the function
  to the args, multiplying the result by the weighting for the
  function. If the weight map does not specify a weight for the
  function, a default of 1.0 is used."
  [weight-map f & args]
  (* (get weight-map f 1.0)
     (apply f args)))

(apply-weighted weight-map plus-two 2)
;; -> 4.0

(apply-weighted weight-map plus-three 1)
;; -> 3.2

Discussion

A more traditional way to model the chess game would be with a two-dimensional array, or, in Clojure’s case, with a vector of vectors.

This is certainly a reasonable thing to do, and is (possibly) slightly more performant. However, it is a less clean model of the actual problem domain. It requires a translation, for example, from chess’s row/column numbers and letters to zero-indexed indexes. Using a map lets you store the positions directly, in native chess terminology.

Similarly, there are alternative implementations for the function-weighting example. It could be implemented using a cond statement with all the functions and weights enumerated, or by replacing the functions altogether with a protocol method that could then have varying implementations with different weights.

However, storing the functions and weights in a map has the benefit of making it obvious at a glance what the weightings for particular functions are. More importantly, it is possible to store multiple different sets of weights, and switch between different weight schemes dynamically at runtime.

2.20. Treating Maps as Sequences (and Vice Versa)

by Luke VanderHart

Problem

You want to treat the contents of a map as a sequence of entries. Alternatively, you want to convert a sequence of entries back into a map.

Solution

To obtain a sequence view of a map, simply call seq on it. Note that most sequence-processing functions call seq on their arguments themselves, so it’s usually not necessary to do this explicitly:

(seq {:a 1, :b 2, :c 3, :d 4})
;; -> ([:a 1] [:c 3] [:b 2] [:d 4])

This creates a sequence of key/value pairs, which you can then process as you would any sequence.

To create a map from a sequence, you can exploit the fact that conj, when applied to a map, can take a two-element vector as a key/value pair and use it to add the respective key and value on to the map:

(def m {:a 1, :b 2})
(conj m [:c 3])
;; -> {:a 1, :b 2, :c 3}

Because the into function uses repeated applications of conj to add items from one sequence onto a collection, this means it can be used to transform a sequence of pairs into a single map:

(into {} [[:a 1] [:b 2] [:c 3]])
;; -> {:a 1, :b 2, :c 3}

It is also possible to construct a map from two sequences: one containing keys and one containing values. This is the purpose of the zipmap function. Given two sequences, it will return a single map with keys from the first argument sequence and values from the second:

(zipmap [:a :b :c] [1 2 3])
;; -> {:c 3, :b 2, :a 1}

If one of the sequences passed to zipmap is shorter than the other, the extra values will be ignored, and the output map will only contain entries up to the length of the shortest sequence.

Discussion

When obtaining a sequence view of a hash map, the map entries will be returned in an arbitrary or undefined order. Conveniently, this order (although arbitrary) is guaranteed to be consistent if the same map is turned into a sequence multiple times.

When using a sorted map, the entries will be returned according to their sort order in the map. For example:

(seq (hash-map :a 1, :b 2, :c 3, :d 4))
;; -> ([:a 1] [:c 3] [:b 2] [:d 4])

(seq (sorted-map :a 1, :b 2, :c 3, :d 4))
;; -> ([:a 1] [:b 2] [:c 3] [:d 4])

There is another interesting fact about the entry values in this sequence. They are printed as vectors, and they are vectors insofar as they implement the full vector interface. However, their concrete type is not actually clojure.lang.PersistentVector; rather, they are a different kind of vector called a map entry, which not only is a vector but also supports the clojure.lang.MapEntry interface.

The MapEntry interface provides key and val functions that can be used to retrieve the key and value of an entry:

(def entry (first {:a 1 :b 2}))

(class entry)
;; -> clojure.lang.MapEntry

(key entry)
;; -> :a

(val entry)
;; -> :1

These functions should be preferred to using the first and second functions on map entries when processing maps as sequences, since they preserve the semantic of key/value pairs, making the code easier to read.

2.21. Applying Functions to Maps

by Luke VanderHart

Problem

You’d like to apply a transformation function to the keys or the values of a map.

Solution

Use one of these simple general-purpose functions, modified to suit any needs you have:

(defn map-keys
  "Given a map and a function, returns the map resulting from applying
  the function to each key."
  [m f]
  (zipmap (map f (keys m)) (vals m)))

(map-keys {"a" 1 "b" 2} keyword)
;; -> {:b 2, :a 1}

(defn map-vals
  "Given a map and a function, returns the map resulting from applying
  the function to each value."
  [m f]
  (zipmap (keys m) (map f (vals m))))

(map-vals {:a 1, :b 1} inc)
;; -> {:b 2, :a 2}

(defn map-kv
  "Given a map and a function of two arguments, returns the map
  resulting from applying the function to each of its entries. The
  provided function must return a pair (a two-element sequence.)"
  [m f]
  (into {} (map (fn [[k v]] (f k v)) m)))

(map-kv {"a" 1 "b" 1} (fn [k v] [(keyword k) (inc v)]))
;; -> {:a 2, :b 2}

Discussion

map-keys and map-vals are extremely straightforward. They each start by breaking the map, m, down into a sequence of keys and a sequence of values using the keys and vals functions, which return a sequence of the keys or values of a map, respectively. Then, they use the map function to transform either the sequence of keys or the sequence of vals. Finally, the zipmap function is used to recombine the key and value sequences into a single map, with the updates in place.

map-kv works a bit differently. It starts by converting the map into a sequence of map entries, then uses map to apply them to an anonymous function that destructures the key and value, and then passes the key and value to the caller-provided function. Finally, it uses into to repeatedly conjoin the resulting pairs onto an empty map, returning a new map consisting of the transformed keys and values.

The following example is identical, but does not use destructuring, so the high-level structure is a bit more clear:

(defn map-kv
  [m f]
  (into {} (map (fn [entry]
                  (f (key entry) (val entry)))
                m)))

It is easy to see that these three functions are all riffs on the standard map function, applied to map data structures. What about the other staple of functional programming, reduce?

Clojure already has a reduce-kv function built in, which was added in version 1.4.

reduce-kv takes three arguments: a function, an initial value, and an associative collection. The provided function must also take three arguments. reduce-kv reduces the provided collection by first applying the function to the initial value, the first key, and its corresponding value from the map. The resulting value is then reapplied along with the second key and value, the resulting value with the third key and value, and so on.

The following example uses reduce-kv to obtain the sum of all the values in a map:

(reduce-kv (fn [agg _ val]
             (+ agg val))
           0
           {:a 1 :b 2 :c 3})
;; -> 6

Note that an underscore (_) is used instead of key in the function argument declaration. This is idiomatic in Clojure to name any argument that isn’t actually used in the body.

It’s also possible to define map-kv using reduce-kv:

(defn map-kv
    [m f]
    (reduce-kv (fn [agg k v] (conj agg (f k v))) {} m))

which could be used in this example:

(map-kv {:one 1 :two 2 :three 3}
        #(vector (-> %1 str (subs 1)) (inc %2)))
;; -> {"one" 2, "three" 4, "two" 3}

2.22. Keeping Multiple Values for a Key

by Luke VanderHart

Problem

Normally, maps are strictly one value per key: if you assoc an existing key, the old value is replaced. However, sometimes it would be useful to have a map-like interface (a “multimap”) capable of storing multiple values for the same key.

You would like to create a map-like data structure that implements a multimap-like interface in Clojure.

Solution

To introduce such a capability on top of normal maps, create and extend a protocol MultiAssociative that defines this behavior:

(defprotocol MultiAssociative
  "An associative structure that can contain multiple values for a key"
  (insert [m key value] "Insert a value into a MultiAssociative")
  (delete [m key value] "Remove a value from a MultiAssociative")
  (get-all [m key] "Returns a set of all values stored at key in a
                    MultiAssociative. Returns the empty set if there
                    are no values."))

(defn- value-set?
  "Helper predicate that returns true if the value is a set that
  represents multiple values in a MultiAssociative"
  [v]
  (and (set? v) (::multi-value (meta v))))

(defn value-set
  "Given any number of items as arguments, returns a set representing
  multiple values in a MultiAssociative. If there is only one item,
  simply returns the item."
  [& items]
  (if (= 1 (count items))
    (first items)
    (with-meta (set items) {::multi-value true})))

(extend-protocol MultiAssociative
  clojure.lang.Associative
  (insert [this key value]
    (let [v (get this key)]
      (assoc this key (cond
                       (nil? v) value
                       (value-set? v) (conj v value)
                       :else (value-set v value)))))
  (delete [this key value]
    (let [v (get this key)]
      (if (value-set? v)
        (assoc this key (apply value-set (disj v value)))
        (if (= v value)
          (dissoc this key)
          this))))
  (get-all [this key]
    (let [v (get this key)]
      (cond
       (value-set? v) v
       (nil? v) #{}
       :else #{v}))))

and, of course, corresponding unit tests (using clojure.test):

(require '[clojure.test :refer :all])

(deftest test-insert
  (testing "inserting to a new key"
    (is (= {:k :v} (insert {} :k :v))))
  (testing "inserting to an existing key (single existing item)"
    (let [m (insert {} :k :v1)]
      (is (= {:k #{:v1 :v2}}
             (insert m :k :v2)))))
  (testing "inserting to an existing key (multiple existing items)"
    (let [m (insert (insert {} :k :v1) :k :v2)]
      (is (= {:k #{:v1 :v2 :v3}}
             (insert m :k :v3))))))

(deftest test-delete
  (testing "deleting a non-present key"
    (is (= {:k :v} (delete {:k :v} :nosuch :nada))))
  (testing "deleting a non-present value"
    (is (= {:k :v} (delete {:k :v} :k :nada))))
  (testing "deleting a single value"
    (is (= {} (delete {:k :v} :k :v))))
  (testing "deleting one of two values"
    (let [m (insert (insert {} :k :v1) :k :v2)]
      (is (= {:k :v1} (delete m :k :v2)))))
  (testing "deleting one of several values"
    (let [m (insert (insert (insert {} :k :v1) :k :v2) :k :v3)]
      (is (= {:k #{:v1 :v2}} (delete m :k :v3))))))

(deftest test-get-all
  (testing "get a non-present key"
    (is (= #{} (get-all {} :nosuch))))
  (testing "get a single value"
    (is (= #{:v} (get-all {:k :v} :k))))
  (testing "get multiple values"
    (is (= #{:v1 :v2} (get-all (insert (insert {} :k :v1) :k :v2) :k)))))

(run-tests)
;; -> {:type :summary, :pass 11, :test 3, :error 0, :fail 0}

Discussion

First, this code defines a protocol to implement the set of functions that comprises the multimap behavior. A protocol is a great choice in this situation: it ties together several methods that perform related operations on the same object, and it allows for multiple concrete implementations.

In this case, there are three methods required to implement the desired functionality. Note that the protocol implementation does not override or reimplement any of the core map methods (assoc, dissoc, etc.). It is only the semantics of the new behavior that differ from those of regular maps. Clojure defines very strong semantics around core functions. Breaking or overriding these expectations is always a bad idea, especially when using a distinct set of functions makes it clear when multimap functionality is being used.

The concrete implementation of MultiAssociative extends the protocol to the clojure.lang.Associative interface. It would certainly be possible to implement it on something more targeted, such as IPersistentSet, but since it only requires something associative for the implementation, it’s best not to be too specific. Coding against clojure.lang.Associative also gives several additional capabilities for free. For example, there is now automatically a “multivector” that can store multiple values at each index (provided they are added using insert).

Reading the code, you’ll notice that a good deal of the actual logic is devoted to making sure that single values are stored plainly, while multiple values are wrapped in a set. This is maintained both when inserting and when deleting items, requiring the functions to run a check on what type the value is and wrap or unwrap accordingly. Similarly, get-all needs to wrap single values in a set before returning, since it specifies that it must return a set.

This is a design decision that has several benefits and trade-offs. The alternative would be to always wrap the values in a set, even single values. This would make the code a bit simpler and would eliminate most of the type checking as well as the wrapping and unwrapping of forms.

However, the simplicity would come with a price. If values (even single values) were always wrapped in a set, the map being used as a multimap would not be easily usable via the normal map functions. It would contain a lot of odd-looking single-item sets, and if anything were added to it using assoc, it would be incompatible with future uses of insert on that key.

In essence, the wrapping and unwrapping is to allow any map to be usable via both the standard Associative and the MultiAssociative interfaces, without requiring the user to keep track of which “kind” of a map it is. Values inserted using assoc can be read with get-all, and values inserted using insert can be removed with dissoc. All expectations regarding normal maps should hold. In the case of a normal get on a key with multiple values, a set containing multiple items will be returned. This is probably what the user would expect upon inspecting the data.

There is one more feature of this code that deserves commentary: the use of ::multi-value metadata on the sets used to store multiple values, applied and tested using the value-set and value-set? functions.

This is to handle the edge case where the intended value for a key is itself a set. The code needs a way to disambiguate between sets it creates in order to manage multiple keys for a value and sets that are simply values provided by users.

This is accomplished by placing metadata on sets created to contain values. A namespace-scoped keyword is used to ensure that it will not collide with any possible existing metadata on values provided by the user. Then, all the code has to do is check if a set has the ::multi-value metadata to know whether it’s a set containing values, or is itself a value.

2.23. Combining Maps

by Tom Hicks

Problem

You have two or more maps you wish to combine to produce a single map.

Solution

Use merge to combine two or more maps with no keys in common:

(def arizona-bird-counts {:cactus-wren 8})
(def florida-bird-counts {:gull 20 :pelican 14})

(merge florida-bird-counts arizona-bird-counts)
;; -> {:pelican 14, :cactus-wren 8, :gull 20}

Use merge-with when you want more explicit control of the merge strategy for keys that exist in more than one map:

(def florida-bird-counts    {:gull 20 :pelican 1 :egret 4})
(def california-bird-counts {:gull 12 :pelican 4 :jay 3})

;; Merge values with + to get their totals
(merge-with + california-bird-counts florida-bird-counts)
;; -> {:pelican 5, :egret 4, :gull 32, :jay 3}

Discussion

In both merge and merge-with, maps are combined from left to right, returning a new immutable map as a result. This functions much like a “left fold.” merge is the simpler function of the pair, always returning the last value it sees for every key.

When mappings for the same key exist in more than one map, the latter mapping is used in the result. This can be useful, for example, when you receive new totals throughout the day, but only for values that have changed:

;; Favorite ice cream flavor votes throughout the day
(def votes-am {:vanilla 3 :chocolate 5})
(def votes-pm {:vanilla 4 :neapoliton 2})
(merge votes-am votes-pm)
;; -> {:vanilla 4, :chocolate 5, :neapoliton 2}

merge-with facilitates powerful recipes for map combination by allowing you to control how values are merged. You can imagine merge-with as reduce for maps with common keys. The first argument to merge-with is a merge function that will be invoked for each pair of duplicated values.

With careful choice of map value types, merge-with provides some concise solutions to common problems. For example, by merging with clojure.set/intersection, you can find the intersection of “like” and “dislike” sets in a team of programmers:

(def Alice {:loves #{:clojure :lisp :scheme} :hates #{:fortran :c :c++}})
(def Bob   {:loves #{:clojure :scheme}       :hates #{:c :c++ :algol}})
(def Ted   {:loves #{:clojure :lisp :scheme} :hates #{:algol :basic :c
                                                      :c++ :fortran}})

(merge-with clojure.set/intersection Alice Bob Ted)
;; -> {:loves #{:scheme :clojure}, :hates #{:c :c++}}

It is also possible to merge nested maps by creating a recursive merge function:

(defn deep-merge
  [& maps]
  (apply merge-with deep-merge maps))

(deep-merge {:foo {:bar {:baz 1}}}
            {:foo {:bar {:qux 42}}})
;; -> {:foo {:bar {:qux 42, :baz 1}}}

As you saw in the previous examples, merge-with is a versatile tool: we used + to add values of the same key, clojure.set/intersection to find shared values of multiple sets, and a recursive function deep-merge to recursively merge nested maps. merge-with is a very powerful function, indeed.

2.24. Comparing and Sorting Values

by Luke VanderHart

Problem

You want to compare two values according to some comparison function, or you want to sort a collection by comparing all the items in it.

Solution

Use the clojure.core/compare function to compare two items. They must be comparable with respect to each other. For example, a double can be compared to a ratio because they’re both numbers, but a string can’t be compared to a vector.

compare returns a negative number if the first argument is less than the second, zero if it is logically equal, and a positive number if it is greater:

(compare 5 2)
;; -> 1

(compare 0.5 1)
;; -> -1

(compare (/ 1 4) 0.25)
;; -> 0

(compare "brewer" "aardvark")
;; -> 1

To sort an entire collection, pass it to the clojure.core/sort function. sort applies compare as needed and returns a sorted sequence.

For example, the following code breaks down a string into a sequence of characters (sort calls seq on its argument), then sorts them. The result is concatenated back to a string, for better readability:

(apply str (sort "The quick brown fox jumped over the lazy dog"))
;; -> "        Tabcddeeeefghhijklmnoooopqrrtuuvwxyz"

As seen previously, many of Clojure’s data types have a natural comparison order, which is what compare uses. For example, numbers, dates, and strings all sort as one would expect, from low to high, based on the well-understood and accepted inherent ordering between them.

If you want to sort a data type that does not have a natural ordering, or if you want to override the natural sort (such as sorting a set from high to low), you are not limited to using the built-in comparator function. sort allows you to specify a custom comparison function that can perform any operation you like to determine the relative ordering between two items. This function must take two arguments. It can return values like compare does (that is, a positive integer, a negative integer, or zero). Alternatively, it can return a Boolean value (i.e., a predicate function). The predicate function should return true if and only if the first argument should be sorted before the second argument.

This means that you can pass regular Clojure predicates to sort:

(sort > [1 4 3 2])
;; -> (4 3 2 1)

(sort > [1 4 3 2])
;; -> (1 2 3 4)

Or, you can write your own arbitrary comparator. For example, the custom comparator used in the next example cares only about the length of a string, not the contents of it; strings will be sorted as equal if they have the same number of characters, whatever those characters are:

(sort #(< (count %1) (count %2)) ["z" "yy" "zzz" "a" "bb" "ccc"])
;; -> ("z" "a" "yy" "bb" "zzz" "ccc")

Discussion

Under the hood, Clojure uses Java’s built-in sort mechanism. Java uses a slightly modified merge sort algorithm that is highly performant for the vast majority of cases. It requires n log(n) comparisons in the worst case and performs at near O(n) when the input is largely sorted already.

The sort is also stable, meaning that if two items are equal in terms of the comparator function being used, their relative ordering will remain unchanged after sorting.

Although you can use any predicate as a comparison function or write your own comparison function that returns a positive/negative/zero integer, the actual function must behave properly in order to work. Specifically, it must:

Have a consistent total order for all the members being sorted: If x is sorted before y, and y is sorted before z, then x must always be sorted before z. In other words, there must always be a single fully deterministic sort order for a given collection and comparator, without any contradictions or inconsistencies caused by the comparison function.
Be consistent with the .equals method and Clojure’s = function: If two items are logically equal, then that must be reflected in the comparison function. When using the integer return values, the function ought to return 0. When using a predicate function, (pred x y) and (pred y x) should return the same thing in the case where x and y are equal.
Have no side effects: The comparison function may be called an arbitrary number of times as the sort is evaluated.

Comparators and the JVM

Clojure fully participates in Java’s comparison and sorting mechanisms. All Clojure objects that have a natural order implement java.lang.Comparable and implement the compareTo method.

More importantly, every Clojure function actually implements the java.util.Comparator interface. This means that you can pass a Clojure function to any Java method that requires an instance of java.util.Comparator, and it will invoke the function with two arguments. This is what allows you to pass arbitrary Clojure functions as the comparator to sort. The function object itself is actually being used as a Java comparator, and invoking the Java .compare method on a Clojure function will actually call it, passing it the two values being compared as two arguments.

Because predicate functions (those returning a Boolean value) do not map exactly to the positive/negative/zero integer return values expected from a java.util.Comparator, Clojure itself handles the logical mapping between them. If a function used as a comparator (that is, (pred x y)) returns true, the implementation will return -1, indicating that x is less than y in the given sort. If not, it will invoke the function again with the arguments reversed. If (pred x y) and (pred y x) are both false, it is assumed that the objects are equal, and the implementation returns 0. Otherwise, it presumes x is greater than y and returns 1.

sort-by

Sometimes, you want to sort a collection not by the values themselves, but by some derivative function of the values. For example, say you have the following data, and you’d like to sort alphabetically by name. Unfortunately, maps don’t have a natural sort, so you’ll need to tell Clojure how to sort the data:

(def people [{:name "Luke"   :role :author}
             {:name "Ryan"   :role :author}
             {:name "John"   :role :reviewer}
             {:name "Travis" :role :reviewer}
             {:name "Tom"    :role :reviewer}
             {:name "Meghan" :role :editor}])

One option would be to use a custom comparator, which extracts the :name key and then invokes compare on it:

(sort #(compare (:name %1) (:name %2)) people)
;; -> ({:name "John", :role :reviewer}
;;     {:name "Luke", :role :author}
;;     {:name "Meghan", :role :editor}
;;     {:name "Ryan", :role :author}
;;     {:name "Tom", :role :reviewer}
;;     {:name "Travis", :role :reviewer})

However, there’s an easier way. The sort-by function works the same as sort, but takes an additional function keyfn as an argument to apply to the elements before sorting them. Instead of sorting on the elements themselves, it sorts the result of applying keyfn to the elements.

So, passing in :name as the keyfn (as discussed in Recipe 2.16, “Retrieving Values from a Map”, keywords are functions that look themselves up in a map), you can call:

(sort-by :name people)
;;->  ({:name "John", :role :reviewer}
;;     {:name "Luke", :role :author}
;;     {:name "Meghan", :role :editor}
;;     {:name "Ryan", :role :author}
;;     {:name "Tom", :role :reviewer}
;;     {:name "Travis", :role :reviewer})

Like sort, sort-by also takes an optional comparator function that it will use to compare the values extracted by the keyfn.

For another example, the following expression uses the str function as a keyfn to sort the numbers from 1 to 20 not on their numeric value, but lexographically as strings (meaning that “2” is greater than “10,” etc.). It also demonstrates using a custom comparator to specify the results in descending order:

;; Descending lexographic order
(sort-by str #(* -1 (compare %1 %2)) (range 1 20))
;; -> (9 8 7 6 5 4 3 2 19 18 17 16 15 14 13 12 11 10 1)

Natural sort of data structures

Some compositive data structures can also be compared if they implement Comparable, are of the same type, and contain comparable values. The comparison order is implementation dependent. For example, by default, vectors are compared first by their length, then by the result of applying compare to their first value, then to their second value if the first is equal, etc.:

(sort [[2 1] [1] [1 2] [1 1 1] [2]])
;; -> ([1] [2] [1 2] [2 1] [1 1 1])

Some data structures are not comparable. For example, the fact that a set is defined to be unordered means that a meaningful greater-than/less-than comparison is not possible in the general case, so no comparison is provided.

2.25. Removing Duplicate Elements from a Collection

by John Cromartie

Problem

You have a sequence of elements and you want to remove any duplicates, while possibly preserving the order of elements.

Solution

When the sequence of elements you’re working with is of a bounded, reasonable size, use set to coerce the collection into a hash set containing only distinct values:

(set [:a :a :g :a :b :g])
;; -> #{:a :b :g}

When the sequence is infinite, or you wish to maintain ordering, use distinct to return a lazy sequence of unique values in a collection in the order they appear:

(distinct [:a :a :g :a :b :g])
;; -> (:a :g :b)

Discussion

There are a number of trade-offs between these two approaches. For starters, set consumes the entire sequence to produce a new set collection. Because of this, set cannot be used to filter an infinite sequence. distinct, on the other hand, is designed for consuming lazy sequences. The value of distinct is a lazy view or projection over another sequence, yielding new values the first time they appear:

(defn rand-int-seq
  "Returns an infinite sequence of ints from [0, n)"
  [n]
  (repeatedly #(rand-int n)))

;; Taking the set of an infinite sequence will *never* return:
;; (set (rand-int-seq 10)) ; don't do it!

;; However, if you limit the seq, set will work
(set (take 10 (rand-int-seq 10)))
;; -> #{0 1 2 3 4 7 8 9}

;; distinct works no matter what
(take 10 (distinct (rand-int-seq 10)))
;; -> (8 3 4 6 0 5 9 7 1 2)

Since distinct produces new values as it sees them, it does maintain ordering. set, on the other hand, returns an unordered set.

If distinct is both ordered and lazy over sequences of any length, what is the advantage of set? Speed. Using distinct is by far the slowest option; simply calling set is about two times faster.

2.26. Determining if a Collection Holds One of Several Values

by John Touron

Problem

You have a collection and you want to determine if it holds one of several possible values.

Solution

Use some, along with a set:

(some #{1 2} (range 10))
;; -> 1

(some #{10} (range 10))
;; -> nil

Discussion

Since sets can act like functions, they can be used as predicates to test whether the argument is a member of the set. This idiom will test each item in a collection, returning either the first match or nil if a match couldn’t be found. However, a problem arises when nil or false is a member of the set you’re using to test a collection with. Consider the following:

(if (some #{nil} [1 2 nil 3])
  ::found
  ::not-found)
;; -> :user/not-found

(if (some #{false} [1 2 false 3])
  ::found
  ::not-found)
;; -> :user/not-found

Because the some function returns the value returned from the predicate function, not just true or false, using it with sets that contain nil or false probably isn’t what you want—it will return nil or false if the item actually is in the set. The simplest solution is to test for nil or false separately, using the nil? or false? predicate functions built into Clojure:

(if (some nil? [nil false])
  ::found
  ::not-found)
;; -> :user/found

(if (some false? [nil false])
  ::found
  ::not-found)
;; -> :user/found

Or, to test both at once:

(if (some #(or (false? %)
               (nil? %))
      [nil false])
  ::found
  ::not-found)
;; -> :user/found

2.27. Implementing Custom Data Structures: Red-Black Trees—Part I

by Leonardo Borges

Problem

You want to implement a data structure in Clojure with very specific performance characteristics.

For example, you need fast, efficient in-memory searches across a large, random, and ever-changing dataset.

Solution

After identifying that Clojure’s core data structures are not appropriate for your domain, your first step is to determine what data structure is appropriate.

For the purpose of this recipe, assume you are trying to choose and implement a data structure appropriate for fast in-memory search of a large, random, and ever-changing dataset. At first, a binary search tree (BST) seems like a good solution. A BST is most efficient over a sorted dataset, however. Adding and removing large amounts of data may unbalance a BST and degenerate its performance to that of a linked list.

Red-black trees (RBTs) are similar to BSTs, but are self-balancing. This would be an appropriate data structure for the dataset in question.

The next step is to implement the data structure itself. The implementation of RBTs relies on pattern matching. Use core.match to simplify the implementation of an RBT. Add [org.clojure/core.match "0.2.0"] to your project’s dependencies, or start a REPL with lein-try:

$ lein try org.clojure/core.match

First, implement the core of an RBT, the balance and insert-val functions. By using core.match, it is possible to succinctly express the required behaviors based on the shape of the tree:

(require '[clojure.core.match :refer [match]])

(defn balance
  "Ensures the given subtree stays balanced by rearranging black nodes
  that have at least one red child and one red grandchild"
  [tree]
  (match [tree]
         [(:or ;; Left child red with left red grandchild
               [:black [:red [:red a x b] y c] z d]
               ;; Left child red with right red grandchild
               [:black [:red a x [:red b y c]] z d]
               ;; Right child red with left red grandchild
               [:black a x [:red [:red b y c] z d]]
               ;; Right child red with right red grandchild
               [:black a x [:red b y [:red c z d]]])] [:red [:black a x b]
                                                            y
                                                            [:black c z d]]
               :else tree))

(defn insert-val
  "Inserts x in tree.
  Returns a node with x and no children if tree is nil.

  Returned tree is balanced. See also `balance`"
  [tree x]
  (let [ins (fn ins [tree]
              (match tree
                     nil [:red nil x nil]
                     [color a y b] (cond
                                    (< x y) (balance [color (ins a) y b])
                                    (> x y) (balance [color a y (ins b)])
                                    :else tree)))
        [_ a y b] (ins tree)]
    [:black a y b]))

With insertion and balance out of the way, the only remaining function to implement is a find-val function for testing if a value is present in an RBT. The easiest way to do this is by breaking down individual tree nodes with match and recursively scanning for the desired value:

(defn find-val
  "Finds value x in tree"
  [tree x]
  (match tree
         nil nil
         [_ a y b] (cond
                    (< x y) (recur a x)
                    (> x y) (recur b x)
                    :else x)))

With all of this in place, it is now possible to create and query an RBT:

(def rb-tree (reduce insert-val nil (range 4)))

rb-tree
;; -> [:black [:black nil 0 nil] 1 [:black nil 2 [:red nil 3 nil]]]

(find-val rb-tree 2)
;; -> 2

(find-val rb-tree 100)
;; -> nil

Discussion

For anyone who has ever had to implement a red-black tree—or at least attended a class in computer science where the algorithm was taught—the implementation of balance might seem extremely short. The reason for this is threefold:

Our red-black tree is persistent: operations on it, such as insert and balance, are not destructive.
balance and find-val use core.match to codify logic as patterns to match.
Nodes are represented as vectors.

The two latter points are related, as you’ll see shortly.

Conveniently enough, core.match allows us to match on the shape and values of a data structure and perform structural binding at the same time. For example, the following tries to match a-vector against two clauses:

(def a-vector [1 2 3])

(match a-vector
       [_ y] (str "Got y: " y)
       [_ _ z] (str "Got z: " z))
;; -> "Got z: 3"

The first clause matches a two-element vector, whereas the second matches a three-element vector. Given that a-vector has exactly three elements, it matches the second clause. In the expression that follows, named values (such as z) are bound to the positions they match.

This is why it was convenient to represent nodes as vectors—it makes pattern matching against them a breeze:

(def rb-node [:red nil 3 [:black nil 4 nil]])

(match rb-node
       [:red left value right]   (str "Red node with value: " value)
       [:black left value right] (str "Black node with value: " value))
;; -> "Red node with value: 3"

Assuming this new custom data structure meets your performance criteria, what is left? (You are benchmarking all of your custom data structures, right?) Unlike built-in data structures, this custom data structure doesn’t work with core functions such as map and filter.

In the second part of this recipe, Recipe 2.28, “Implementing Custom Data Structures: Red-Black Trees—Part II”, we’ll rectify this situation by participating in the core sequence abstraction.

2.28. Implementing Custom Data Structures: Red-Black Trees—Part II

by Ryan Neufeld; originally submitted by Leonardo Borges

Problem

You want to use Clojure’s core sequence functions (conj, map, filter, etc.) with your custom data structure.

Solution

In part one of this recipe (Recipe 2.27, “Implementing Custom Data Structures: Red-Black Trees—Part I”), you implemented all the functions necessary for creating an efficient red-black tree. What’s missing is participation in Clojure’s sequence abstraction.

The most important part of participating in sequence abstraction is the ability to expose values of a data structure sequentially. The built-in tree-seq is well suited for this task. One extra step is needed, however; tree-seq returns a sequence of nodes, not values.

Here’s the final rb-tree->seq function:

(defn- rb-tree->tree-seq
  "Return a seq of all nodes in an red-black tree."
  [rb-tree]
  (tree-seq sequential? (fn [[_ left _ right]]
                          (remove nil? [left right]))
            rb-tree))

(defn rb-tree->seq
  "Convert a red-black tree to a seq of its values."
  [rb-tree]
  (map (fn [[_ _ val _]] val) (rb-tree->tree-seq rb-tree)))

(rb-tree->seq (-> nil
                  (insert-val 5)
                  (insert-val 2)))
;; -> (5 2)

Since RBTs most closely resemble sets, they should adhere well to the IPersistentSet interface. Extend the IPersistentSet and IFn protocols to a new RedBlackTree type, implementing all of the necessary functions. It’s also wise to implement the multimethod print-method for RedBlackTree, as the default implementation will fail for RedBlackTree as implemented:

(deftype RedBlackTree [tree]
  clojure.lang.IPersistentSet
  (cons [self v] (RedBlackTree. (insert-val tree v)))
  (empty [self] (RedBlackTree. nil))
  (equiv [self o] (if (instance? RedBlackTree o)
                    (= tree (.tree o))
                    false))
  (seq [this] (if tree (rb-tree->seq tree)))
  (get [this n] (find-val tree n))
  (contains [this n] (boolean (get this n)))
  ;; (disjoin [this n] ...) ;; Omitted due to complexity
  clojure.lang.IFn
  (invoke [this n] (get this n))
  Object
  (toString [this] (pr-str this)))

(defmethod print-method RedBlackTree [o ^java.io.Writer w]
  (.write w (str "#rbt " (pr-str (.tree o)))))

Note

disjoin and the corresponding remove-val functions are left as exercises for the reader.

It is now possible to use a RedBlackTree instance like any other collection—in particular, instances act like sets:

(into (->RedBlackTree nil) (range 2))
;; -> #rbt [:black nil 0 [:red nil 1 nil]]

(def to-ten (into (->RedBlackTree nil) (range 10)))

(seq to-ten)
;; -> (3 1 0 2 5 4 7 6 8 9)

(get to-ten 9)
;; -> 9

(contains? to-ten 9)
;; -> true

(to-ten 9)
;; -> 9

(map inc to-ten)
;; -> (4 2 1 3 6 5 8 7 9 10)

Discussion

In the end, it doesn’t take a lot to participate in the sequence abstraction. By implementing a small handful of interface functions, the red-black tree implementation from Recipe 2.27, “Implementing Custom Data Structures: Red-Black Trees—Part I”, can participate in an array of sequence-oriented functions: map, filter, reduce, you name it.

At its essence, clojure.lang.IPersistentSet is an abstraction of what it means to represent a mathematical set structure; this matches a tree data structure well. A set isn’t a list or sequence, though. So how is RedBlackTree then said to be participating in the sequence abstraction?

In Clojure, types extending the clojure.lang.ISeq interface are true sequences, represented as a logical list of head and tail. While IPersistentSet does not inherit from ISeq, it does share a common ancestry with it. Both interfaces extend clojure.lang.IPersistentCollection and its parent clojure.lang.Seqable. As luck would have it,^[5] sequence functions rely on collections being Seqable, not ISeq. Since RedBlackTree can be read as a sequence, it is Seqable and can be operated on by all of the sequence functions you know and love.

Most of the functions in the IPersistentSet interface are self-explanatory, but some deserve further explanation. The function cons is a historical name for constructing a new list by appending a value to an existing list. seq is intended to produce a sequence from a collection, or nil if empty:

IPersistentSet.java:

public interface IPersistentSet extends IPersistentCollection, Counted {
  public IPersistentSet disjoin(Object key);
  public boolean contains(Object key);
  public Object get(Object key);
}

IPersistentCollection.java:

public interface IPersistentCollection extends Seqable {
  int count();
  IPersistentCollection cons(Object o);
  IPersistentCollection empty();
  boolean equiv(Object o);
}

Seqable.java:

public interface Seqable {
  ISeq seq();
}

The most challenging part of any Seqable implementation is actually making a sequence out of the underlying data structure. This would be particularly challenging if you needed to write your own lazy tree-traversal algorithms, but luckily Clojure has a built-in function, tree-seq, that does precisely this. By leveraging tree-seq to produce a sequence of nodes, it is trivial to write an rb-tree->seq conversion function that lazily traverses a RedBlackTree, yielding node values as it goes.

tree-seq accepts three arguments:

branch?: A conditional that returns true if a node is a branch (not a leaf node). For RedBlackTree, sequential? is an adequate check, as every node is a vector.
children: A function that returns all of the children for a given node.
root: The node to begin traversal on.

Note

tree-seq performs a depth-first traversal of trees. Given how red-black trees are represented, this will not be an ordered traversal.

With a sequence conversion function in hand, it is easy enough to write the seq function. Similarly, cons and empty are a breeze—simply utilize the existing tree functions. Equality testing can be a bit more difficult, however.

For the sake of simplicity, we chose to implement equality (equiv) between only RedBlackTree instances. Further, the implementation compares a sorted sequence of their elements. In this case, equiv is answering the question, “Do these trees have the same values?” and not the question, “Are these the same trees?” It’s an important distinction, one you’ll need to consider carefully when implementing your own data structures.

As discussed in Recipe 2.26, “Determining if a Collection Holds One of Several Values”, one of the big bonuses of sets is their ability to be invoked just like any other function. It’s easy enough to provide this ability to RedBlackTrees too. By implementing the single-arity invoke function of the clojure.lang.IFn interface, RedBlackTrees can be invoked like any other function (or set, for that matter):

(some (rbt [2 3 5 7]) [6])
;; -> nil

((rbt (range 10)) 3)
;; -> 3

Even with the full IPersistentSet interface implemented, there are still a number of conveniences RedBlackTree is lacking. For one, you need to use the kludgy /→RedBlackTree or RedBlackTree. functions to create a new RedBlackTree and add values to it manually. By convention, many built-in collections provide convenience functions for populating them (aside from literal tags like [] or {}, of course).

It’s easy enough to mirror vec and vector for RedBlackTrees:

(defn rbt
 "Create a new RedBlackTree with the contents of coll."
 [coll]
 (into (->RedBlackTree nil) coll))

(defn red-black-tree
  "Creates a new RedBlackTree containing the args."
  [& args]
  (rbt args))

(rbt (range 3))
;; -> #rbt [:black [:black nil 0 nil] 1 [:black nil 2 nil]]

(red-black-tree 7 42)
;; -> #rbt [:black nil 7 [:red nil 42 nil]]

You may also have noticed printing is not a concern of the sequence abstraction, although it is certainly an important consideration to make for developing developer- and machine-friendly data structures. There are two types of printing in Clojure: toString and pr-based printing. The toString function is intended for printing human-readable values at the REPL, while the pr family of functions are meant (more or less) to be readable by the Clojure reader.

To provide our own readable representation of RBT, we must implement print-method (the heart of pr) for the RedBlackTree type. By writing in a “tagged literal” format (e.g., #rbt), it is possible to configure the reader to ingest and hydrate written values as first-class objects:

(require '[clojure.edn :as edn])

;; Recall ...
(defmethod print-method RedBlackTree [o ^java.io.Writer w]
  (.write w (str "#rbt " (pr-str (.tree o)))))

(def rbt-string (pr-str (rbt [1 4 2])))
rbt-string
;; -> "#rbt [:black [:black nil 1 nil] 2 [:black nil 4 nil]]"

(edn/read-string rbt-string)
;; -> RuntimeException No reader function for tag rbt ...

(edn/read-string {:readers {'rbt ->RedBlackTree}}
                 rbt-string)
;; -> #rbt [:black [:black nil 1 nil] 2 [:black nil 4 nil]]

Table of Contents for 2. Composite Data

Create new playlist

Sign In

Sign Up

Chapter 2. Composite Data

2.0. Introduction

Immutability

Persistence

The Sequence Abstraction

2.1. Creating a List

Problem

Solution

Discussion

See Also

2.2. Creating a List from an Existing Data Structure

Problem

Solution

Discussion

See Also

2.3. “Adding” an Item to a List

Problem

Solution

Discussion

See Also

2.4. “Removing” an Item from a List

Problem

Solution

Discussion

See Also

2.5. Testing for a List

Problem

Solution

Discussion

See Also

2.6. Creating a Vector

Problem

Solution

Discussion

See Also

2.7. “Adding” an Item to a Vector

Problem

Solution

Discussion

See Also

2.8. “Removing” an Item from a Vector

Problem

Solution

Discussion

See Also

2.9. Getting the Value at an Index

Problem

Solution

Using nth

Using vectors as functions of their indexes

Using get

Discussion

See Also

2.10. Setting the Value at an Index

Problem

Solution

Discussion

See Also

2.11. Creating a Set

Problem

Solution

Set construction performance

Discussion

Warning

See Also

2.12. Adding and Removing Items from Sets

Problem

Solution

Discussion

See Also

2.13. Testing Set Membership

Problem

Solution

Discussion

See Also

2.14. Using Set Operations

Table of Contents for
2. Composite Data