Chapter 7
Performance

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

—Donald Knuth

In this chapter you will implement tools and techniques to make code execute faster. How fast can a dynamic, lazy, functional, immutable language be, you ask? Pretty fast. You can expect Clojure code to use around 2–5 times the CPU of equivalent Java code, and with a little work you can tweak it to be as fast as Java code. In turn, you can expect Java code to perform at around 1.2 times the cost of C. This is amazing considering the high-level abstractions you get to leverage from Clojure. Compared to dynamic languages like Ruby, Clojure is blazingly fast.

So, how does Clojure achieve such great performance? The answer is the JVM. Clojure embraces its host platform and relies on its mature byte code optimization. The same holds true for ClojureScript (relies on the Closure Compiler) and ClojureCLR (embraces the .NET runtime). There is an intentional tradeoff made to embrace the underlying platform and not completely abstract it away. The benefit is performance and interoperability. The downside is you have to live with some unsightly details of the underlying platform leaking into your code.

Clojure has several advantages for producing high performance code:

  • Clojure is data oriented. Choosing the right data structure is the prerequisite for writing fast code. Clojure encourages usage of O(1) and O(log n) data structures by providing literal notation for maps, sets, and vectors.
  • Clojure is a concise language, with powerful functional abstractions. Concise expressions aid algorithmic clarity. Being able to express algorithms succinctly is a great advantage in writing correct and efficient algorithms.
  • Interop allows you to delegate heavy lifting tasks to Java libraries. When you really need every drop of CPU for a task, chances are there is a library for that. If not, you always have the option to create and use your own Java classes for critical bottlenecks in your system.
  • Clojure exposes a range of options for trading off between high-level dynamic code and annotated bare metal code.

There are also several aspects of Clojure that commonly produce poor performance. Memory allocation is the biggest aspect to be alert to. Reflection in tight loops commonly causes unexpected slowness, and Clojure is often cited as slow due to its terrible startup time, which is a constant and visible tax. There is no way around it: Clojure is just not good at that. If you look past startup time though, you will find it performs great once it is up and running.

High-level abstractions have a cost. But being able to reason with high-level abstractions is very powerful. If you are able to identify bottlenecks, you can split your code into elegant and performant code, and get large speed gains by tuning the performant code. You can always code with algorithmic complexity in mind. There is no point looking up items from a sequence when you can use a map instead. But how do you approach the performance of your code beyond that?

  • Think—Thinking is by far the most important part. Define your performance goals as an implicit functional requirement for systems you build. Be very clear about what level of performance is acceptable, at what point performance becomes irrelevant, and what are reasonable system requirements to achieve the desired performance.
  • Measure
    • Time your suspicious functions, then benchmark, and profile.
    • Check that appropriate data structures and algorithms are used. Look for reduce calls that build up data structures; these are opportunities to use transients.
    • Check your concurrency, especially if the bottleneck involves external resources. Clojure has concise concurrency abstractions; take advantage of them.
    • Memoize judiciously.
    • Use type hints. For interop code that is called very frequently, reflection costs quickly add up and can become significant.
    • Measure memory allocation and garbage collection. Check your use of laziness. This can be a source of memory consumption and garbage collection.
  • Monitor—Monitoring is the only way to know that you are allocating system resources effectively. You are unable to reach all the way down to first principles of computation to fully reason about your system because the computation stack is so complex. It is inevitable that your system assumptions will break, so you need to monitor for it and learn from it.

WHAT IS PERFORMANCE?

A basic view of performance is how quickly your program calculates some result. A more sophisticated definition of performance is the ratio of resources your system consumes, relative to how much value it produces. The practical definition of performance might be: “The system is down, CPU usage is at 100%, I need to fix this right now!” It is important to think of performance in a holistic fashion, because all of these definitions hold some truth, and yet in and of themselves are too myopic.

Any system is built making a trade-off between how much time was invested in creating the system, how well it provides value, and the costs of running and maintaining it. Be aware of where the curve pivots, and at what distance from that pivot your activities are wasteful in either direction.

Focusing too deeply on any one aspect of performance can be wasteful and even counter-productive. It is no use spending hours annotating code with types to avoid reflection, when using a different data structure would have given a far better outcome. Even worse, micro-optimizing the wrong thing can be a barrier to refactoring to a better solution.

Examining performance teaches you how your system actually works. Small changes can have huge impacts. Software typically changes often, so it is unlikely that every change will be examined for performance, unless you have measurement and automated tests that produce metrics.

Achieving high performance requires experimenting with code, with libraries, and with analysis tools. Experiments are only effective when measurement is involved. Keep a record of your results as you try new approaches.

This chapter is all about experimenting and measuring. In order to provide some examples of the techniques, we will be measuring the performance of the example project tracking solution used throughout this book. In order to provide self-contained examples, the focus will be on a feature for running Monte Carlo project simulations. The idea is that you will take a project and iterate to a new state in the project with some simple rules. This will give you one possible path that the project could take. If you run many such simulations, you may be able to draw some insights into important variables to control.

CHOOSING THE RIGHT DATA STRUCTURE IS A PREREQUISITE FOR PERFORMANCE

Vectors have O(1) lookup by index. Sets and hashmaps have O(log n) lookup by key. Sequences by contrast are O(n). Choosing the right data structure usually boils down to not doing combinatorial sequential lookups of O(n²) or higher. That kind of complexity blowup is the primary performance assassin for algorithms.

Checking for the existence of an element in a sequence is usually a built-in feature of a language. In Java you have indexOf. But Clojure does not have an equivalent for sequences. This is a case of Clojure purposefully discouraging a common bad choice. When you need to check membership of an element in a sequence, you should not be using a sequence at all. A set is the right data structure for testing membership.
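As a minimal sketch of the difference, using the team names from the example project that follows, a membership check against a list has to walk the elements, while a set answers directly:

(def team-list '("Jeremy" "Justin" "Michael" "Nick" "Timothy"))
(def team-set #{"Jeremy" "Justin" "Michael" "Nick" "Timothy"})

;; Linear scan: some walks the sequence until it finds a match, O(n).
(some #{"Nick"} team-list)
;;=> "Nick"

;; Set membership: contains? is an effectively constant-time lookup.
(contains? team-set "Nick")
;;=> true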

Consider two possible data models you can design for the project simulation code. The first represents the team as a list of names, and the stories as a list of stories containing details of the story and an id. This sort of data is very much what you might expect as the result of an SQL query.

(def project1-bad
  {:title "Build a task tracking solution"
   :team '("Jeremy" "Justin" "Michael" "Nick" "Timothy")
   :stories '({:id "aaa"
               :title "Build a ClojureScript front end"
               :status "Ready"}
              {:id "bbb"
               :title "Build backend services"
               :status "Done"}
              {:id "ccc"
               :title "Tune performance"
               :status "In Progress"})})

As soon as the number of stories starts to grow, any use of the list representation will suffer dramatically. Even if the list representation shows no problems today, it is only a matter of time before a larger dataset makes it break down completely.

A better choice is to use a set to store the team, and make the stories a map of story id to story details. This sort of data is more like how a database might internally be structured in the sense that using the unique identifier as a key in a map is similar to indexing the stories.

(def project1-better
  {:title "Build a task tracking solution"
   :team #{"Jeremy" "Justin" "Michael" "Nick" "Timothy"}
   :stories {"aaa" {:title "Build a ClojureScript front end"
                    :status "Ready"}
             "bbb" {:title "Build backend services"
                    :status "Done"}
             "ccc" {:title "Tune performance"
                    :status "In Progress"}}})

Is this really better? At such a small scale, the only difference is semantic. Using a set is explicitly stating that only unique names are allowed in our data model. Similarly ids must be unique, because they are keys in a map. This alone is reason enough to choose the right data structure. It communicates intent.

Keeping a watchful eye out for the right data structure is very different from trying to optimize individual code statements. The latter is highly situational and impossible to do without benchmarking. The former is correct thinking that leads to efficient algorithms. Choosing the right data structure is never a premature optimization. Thinking about the correct data representation helps you think about the problem domain, the data you are storing, and how that data will be consumed. It is always worth spending the time upfront to choose a data model that best represents sensible access semantics. Do not defer thinking about the complexity implications of data representation on the grounds that it is merely an optimization. Choosing the right data structure is the prerequisite for correct and performant code.

BENCHMARKING

You often know roughly what part of your system is not performing at an acceptable level. You might have a clue such as "When the user adds a story with a very long description, the interface hangs." Begin by thinking about why this may be the case, how this part of the system is bottlenecking, and what you can reasonably hope to achieve. We have a plan of attack and some hypotheses in hand. So now it is time to measure, experiment, and try some techniques to improve the situation.

When constructing experiments you should keep in mind that coarse measurements are more practical than fine-grained ones. This is a truth that is easy to lose sight of as you delve into the complexities of computer behavior. When you move down to measuring and tuning operations instead of algorithms, you enter into a battle to optimize a complex stack of layered yet intertwined abstractions.

These algorithms are implemented in Clojure, which in turn relies on Java classes to provide fundamental abstractions, which in turn relies on the Java compiler to create efficient byte code. The JVM goes to great length to execute the code efficiently. The JVM inlines function calls, does hotspot analysis, removes redundant loads, does copy propagation, and eliminates dead code. The byte code is an abstraction around CPU instructions and memory. On the CPU itself further optimization is occurring such as caching and common branch detection. This tall stack of abstractions allows you to be algorithmically expressive, but along the way you sacrifice optimal use of computational resources. There is a tradeoff between precise control over exactly what the hardware does, and the ability to compose abstract concepts to express useful calculations.

Optimizations in one layer may directly counteract optimizations in another layer. So, you are affected by both the complexity of a tall stack of abstractions, and also how those layers interact with each other. Because of these complex interactions, performance is almost never the sum of its parts. Micro-benchmarking operations in isolation will not accurately match the performance of the combination of those operations.

Micro-benchmarks don't compose, so don't micro-benchmark everything. More detail is not better; a very small change can have a large impact over a wide scope. So focus on larger scopes, and on the most heavily used functions.

Do use benchmarks to reason about performance. Don't randomly try things to improve performance, but instead make educated guesses about how the code can be made to run faster. Just be cautious about any assumptions you make when it comes to overall performance in the context of the code you will be executing in this chapter.

Timing Slow Things

Measuring slow things sounds a little boring, but most of the big performance wins covered in this chapter are going to come from measuring slow things; finding unexpectedly large chunks of sunk time. Slower is better when it comes to finding a scope of code to time. Try to organize your performance experiments in a way that decomposes the time in halves. Binary search is an excellent way to narrow in on the important aspects of your program.

Often you will be surprised that decomposition is nowhere near the midpoint. This surprise is a good thing, since it means that one of the assumptions is wrong. Let's face it, assumptions about performance are very often wrong. The good news is that finding false assumptions presents an opportunity to better understand your system, and will likely reveal a key insight to improving it.

Profiling tools allow you to avoid having to invest heavily in up front binary time-split style timings. Profiling automatically percolates up the scopes of code that you should investigate to discover bottlenecks.

Performance tuning requires a lot of experimentation. So using a REPL to time and adjust functions is convenient. Start a REPL and follow along with the examples in this book, and take some time to stray from the examples as well. Let's time the top level entry point.

(time (sim-project project1))

You get the final simulation state returned back to you. Along with the result, a message is printed: "Elapsed time: 7299.223 msecs". The time macro is great because you can wrap any part of your code in (time ...) to see how long it is taking. You can quickly try wrapping a few suspicious expressions and discover a bottleneck that can be improved. Perhaps you suspect that a lot of time is being spent in standup, so you can check that without changing the behavior of the code.

(defn standup [project]
  (time
    (-> project
      (complete-stories)
      (start-stories))))

Now you can see dozens of output messages like "Elapsed time: 70.007 msecs." This certainly seems like a good place to investigate. Ad hoc timing is often easier than going to the trouble of profiling execution, but is less direct. Often you can quickly spot something obvious and get some quick wins, so it is always a good place to start. Be aware that this approach has limitations and switch to another approach when ad hoc timing is no longer yielding traction.

Depending on your environment and goals, relying on printed output can be distracting. Sometimes you want the actual time taken as a value. Let's adapt the time macro to return the number of milliseconds as a value, so that you can make some assertions about your performance goals.

(defmacro time-msecs
  "Returns the time in msecs taken to evaluate expr."
  [expr]
  `(let [start# (. System (nanoTime))]
     ~expr
     (/ (double (- (. System (nanoTime)) start#)) 1000000.0)))

Now you can write a test to describe your goal.

(deftest slow-tests
  (is (< (b/time-msecs (sim/sim-project sim/project1))
         1000)
      "A full project simulation should take less than one second"))

This approach works great for measuring calculations longer than 10 milliseconds, but breaks down for shorter periods.

Benchmarking external resource access is less susceptible to JVM variance. It doesn't make any sense to worry about nanosecond accuracy when timing web service request round trips. Monitoring is often done at the service level, which is usually perfect. You should rely on monitoring instead of benchmarking for external resources as much as possible.

Use Criterium for Timing Fast Things

Benchmarking code is usually not as simple as calling (time (some-fn ...)). The time of execution can vary wildly depending on how and when it is invoked, due to advanced techniques such as adaptive optimization and dynamic recompilation employed by the JVM. This becomes a big problem when trying to time short calculations that take less than 10 milliseconds.

Consider this puzzling case.

(time (inc 1))
"Elapsed time: 0.028 msecs"
(time (dotimes [i 1000] (inc 1)))
"Elapsed time: 0.23 msecs"

Doing 1000 times more calculations takes only 10 times as long. Clearly there is more going on than meets the eye.

A naïve approach to writing benchmark tests might be something like start a JVM, run benchmark1, benchmark2, benchmark3, and shutdown. After benchmark1, the JVM is optimized for benchmark1, but wrongly optimized for benchmark2. The name for this order dependent performance is path dependency. Changing the ordering changes the benchmark results. You need a fresh JVM for each benchmark to isolate them.

The Criterium library (https://github.com/hugoduncan/criterium) provides a way to work around most of those issues. It includes a warm-up period, during which the Just-In-Time (JIT) compiler can work its magic. The library also mitigates the effects of garbage-collection (GC) prior to measuring the speed. Finally, it computes not just the mean time, but also upper and lower quantiles, and standard deviation. You get a wealth of insightful statistics about your benchmarks to better understand the performance profile of the code.
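Here is a sketch of how you might invoke it from the REPL, assuming the single-step sim-week function used later in this chapter is available through the sim alias; bench runs the expression many times and prints a report like the one that follows.

(require '[criterium.core :as cr])
(cr/bench (sim/sim-week sim/project1))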

Evaluation count : 18900 in 60 samples of 315 calls.
             Execution time mean : 3,574752 ms
    Execution time std-deviation : 1,127694 ms
   Execution time lower quantile : 2,304367 ms ( 2,5%)
   Execution time upper quantile : 5,652621 ms (97,5%)
                   Overhead used : 3,443486 ns
Found 4 outliers in 60 samples (6,6667 %)
     low-severe     2 (3,3333 %)
     low-mild      2 (3,3333 %)
Variance from outliers : 96,4221 % Variance is severely inflated by outliers

Where and how should benchmarks be run? The easy answer is from the REPL. This has the same advantages and drawbacks as using the time macro for ad hoc timing. Keep in mind that Criterium is aimed at CPU/memory bound tasks and is not appropriate for slow functions such as IO bound operations. Trying to get 60 samples of a function that takes 10 seconds is going to take 10 minutes. That's not very interactive, nor is the additional accuracy valuable in such a scenario.

Consider using Criterium inside tests, because it is good to assert performance to avoid releasing bugs. If you gather this data regularly you can plot performance per source control commit, to see how your system is being affected by changes. You can make assertions using quick-benchmark, which returns the benchmarks instead of printing a report.
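As a minimal sketch, assuming remove-done-stories and project1 from this chapter and an illustrative one-microsecond target:

(require '[criterium.core :as cr])

(deftest remove-done-stories-performance
  ;; quick-benchmark returns a map of results; :mean is a vector whose
  ;; first element is the mean execution time in seconds.
  (let [results (cr/quick-benchmark (remove-done-stories project1) nil)]
    (is (< (first (:mean results)) 1e-6)
        "Removing done stories should take less than a microsecond")))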

To gather all of the results into a form convenient for charting, you can use a fixture.

;; benchmarks is an atom of accumulated results; git-commit-info and
;; benchmark-filename are assumed to be defined elsewhere in this namespace.
(defn add-results! [k results]
  (swap! benchmarks assoc-in [(:time git-commit-info) k] results))
(defn save-gathered-benchmarks [f]
  (f)
  (with-open [w (io/writer (doto (io/file "benchmarks" benchmark-filename)
                             (io/make-parents)))]
    (clojure.pprint/pprint @benchmarks w)))

Then, create a macro to define a benchmark to be recorded.

(defmacro bench-press [title acceptable-time & form]
  `(deftest ~(symbol (kebab/->kebab-case title))
     (testing ~title
       (let [results# (assoc (cr/quick-benchmark ~@form nil)
                        :acceptable-time ~acceptable-time)]
         (add-results! ~title results#)
         (is (< (first (:mean results#)) ~acceptable-time))))))

Now you can very concisely express benchmarks that will be run in continuous integration builds.

(use-fixtures :once b/save-gathered-benchmarks)
(b/bench-press "Remove 1 done story" (* 1000 b/nsecs)
  (remove-done-stories project1))

The fixture produces a file containing the full performance results relative to source control commits, suitable for charting. With this information on hand, finding out what change introduced a performance bottleneck is easy.

{#inst "2016-02-01T08:37:59.000-00:00"
 {:git-commit-info
  {:id "d2cbd0ded4b16300d76231b6d553edbfc3fb1668",
   :message "namespacing and adding some example improvements" ...}
  "Bug?"
  {:mean
   [3.4741639538209525E-6
    (3.3825290092479442E-6 3.5348896790521507E-6)],
   :final-gc-time 95103961,
   :execution-count 38459,
   :variance
   [1.2185041088601622E-14
    (3.4091764724026322E-15 2.3422252112283815E-14)] ...}}}

Criterium is the de facto standard library for benchmarking CPU/memory bound calculation. It addresses many of the challenges to accurately measuring the time taken for code to execute, and how variable the time span is over many runs. Use it to measure fast things.

Use Test Selectors for Performance Tests

Tests and namespaces should be labeled to limit benchmark execution to continuous integration test runs, or to when you explicitly request them in development. Prevent benchmarks from running in development test runs by default; they are too slow for interactive development. To label a namespace or test, attach a metadata flag.

(deftest ^:perf slow-tests ...)

Attaching a metadata flag to the namespace is equivalent to labeling all of the tests in the namespace.

(ns ^:perf mush.simulation-test ...)

You can configure test selectors in the project.clj file. Let's set up a selector for the :perf label, and make the default be everything except :perf tests. The default occurs when no selector is specified.

:test-selectors {:perf :perf
                 :default (complement :perf)}

When you execute lein test :perf, only the tests labeled with :perf are run. If you execute lein test, only the tests not labeled with :perf are run. There is a special selector :all, which you don't need to specify in the configuration. If you execute lein test :all, then all of the tests are run regardless of labels. Test selectors work with lein-test-refresh as well. With this configuration in place, you can have your CI server execute lein test :all to include your performance tests, but not be bothered by them during interactive development.

PARALLELISM

Say that you identified that calculating completion-rate is a major bottleneck, which in turn is spending the most wall clock time in member-info.

(defn member-info [member]
  (let [history (fetch-history member)
        profile (fetch-profile member)]
    (merge history profile)))

Here you see how two external services are being queried for data. These requests will be made sequentially; the history will be fully fetched before the profile is requested. You can instantly improve this by starting both requests immediately, and only returning the combined result when both are available. One way to express this is making both requests inside a future, and deref them for the final result.

(defn member-info-better [member]
  (let [history (future (fetch-history member))
        profile (future (fetch-profile member))]
    (merge @history @profile)))

Now instead of waiting 21 milliseconds to get all of the data, it arrives in 11 milliseconds! Clojure has very concise and expressive abstractions for dealing with concurrency. See pmap, pcalls, future, and core.async.
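Running many independent simulations is also a natural fit for these abstractions. Here is a sketch using pmap, assuming the sim-project function and project1 data from earlier; each simulation run is independent, so the runs can be spread across the available cores:

(defn run-simulations [n project]
  ;; pmap is semi-lazy, so force the work with doall before returning.
  (doall (pmap (fn [_] (sim-project project)) (range n))))

(run-simulations 1000 project1)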

MEMOIZATION

Caching is the wildcard of performance optimization. If your program is slow, just remember the answers instead of recalculating them! Caching is very effective at improving performance.

Clojure's memoize function takes a function and returns a function that does exactly the same thing as the original function, but remembers the results per input arguments it has previously calculated. Let's see it in action by defining a function f that prints a message and returns the increment of its argument.

(defn f [x]
  (println "Calculating...")
  (inc x))

Now define g to be the memoization of function f, and call the function g with the same input twice.

(def g (memoize f))
(g 1)
(g 1)

Notice that the message "Calculating…" is only displayed once. The second time that you call g it does not execute the body of the function, because g was able to look up the answer in its map of already calculated results. When the calculations occurring in the body of the function are intensive, looking up the answer is far faster than re-calculating it.

Returning to completion-rate you will observe that this function is called many times per step, but the return value never changes for the entire simulation. Using memoization provides a massive improvement.

(def completion-rate-better (memoize completion-rate))

Memoization enables you to write expressively without concern for the cost of repeatedly calling a function. With some care, you can organize code so that intensive but useful calculations, or slow IO requests, front-load their work and remember the answers.

Beware! Caching can subtly change the logic of your system. Imagine if completion-rate changed every step. Clearly if that is the case you should then only memoize it per step. This is easy to accomplish in this case since you can see that you only need to call the function once.

(defn complete-stories-alternative [project]
  (let [rate (completion-rate project)]
    (reduce (fn complete-story [acc story]
              (update-story-status acc story "Done"))
            project
            (take (rand-int (int (* rate 2)))
                  (stories-by-status project "In Progress")))))

But say, instead you were relying on data fetched from the member-info function in several paths of your simulation. Ensuring that member-info is only queried once per step would be much more challenging. In this case you can avoid messing with the code by using a new memoization of the function for every step. Thus, your code can remain free of the concern of resource access cost. One way to express this is to wrap the sim-week step with a redef of member-info as a memoization. Redundant calls are eliminated.

(defn sim-week-alternative [project]
  (with-redefs [member-info (memoize member-info)]
    (sim-week project)))

The important thing is to have a good grasp on what is both correct and fast. There are many other ways to do small scope memoization, such as passing a memoized function to other functions to use, so be alert for opportunities.

Beware! Memoizing everything is going to eat up more memory than you have available, resulting in an Out of Memory failure. The domain of inputs must be bounded to some reasonable size.

The core.memoize library provides time-to-live and least-recently-used memoization, which are well proven strategies for trading off time and memory. Most long-lived service applications will have opportunities to cache, but need to carefully control the extent to which the cache can grow. Think about how the cache should be controlled when you use memoization. As a rule of thumb, most long-lived memoized functions should be created with core.memoize.
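A sketch of what that looks like, assuming the member-info and completion-rate functions from this chapter; the thresholds are illustrative:

(require '[clojure.core.memoize :as memo])

;; Remember member-info results for at most 60 seconds, so stale
;; profile data is eventually refreshed.
(def member-info-cached
  (memo/ttl member-info :ttl/threshold 60000))

;; Keep only the 256 most recently used completion-rate results,
;; bounding how much memory the cache can consume.
(def completion-rate-cached
  (memo/lru completion-rate :lru/threshold 256))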

INLINING

The shape of your code can have unexpected performance implications. The Just-In-Time compiler (JIT) will try to inline frequently called functions to avoid the overhead of invocation. The heuristic it uses depends upon how often the function is called and how big it is. As well as removing the function call, inlining enables other optimizations by reorganizing execution.

Extracting functions can lead to better performance. Functions that are too big can't be inlined without bloating the call sites. Thus, smaller functions are often faster because they help the JVM find the most efficient inlining.

This may seem counter intuitive at first. If function calls have a cost, many smaller functions should cost more, right? Thinking this way is reasonable, but wrong. It highlights the dangers of making assumptions about how written code actually executes at the nanosecond scale. The only real way to know is to benchmark it.

Check for deeply nested function calls in hot code paths because there are depth limits to where the JIT can apply inlining. For example, Ring handlers often have many middleware wrappers, which may be preventing optimization that would otherwise occur.

To go deep into the impact of code shape on the byte code produced, you can use the nodisassemble library to discover what byte code is being produced from your source code.

When you discover a hot bottleneck, it is worth experimenting with rearranging your code scopes, because it will affect the ability of the JVM to optimize your calculations. Always do this guided by a benchmark, and avoid assuming that a very reasonable rule of thumb actually results in the performance impact you expect.

Persistent Data Structures

Amazingly, the cost of immutability is low. So low that immutability is a good default. The low cost of updates to Clojure's collections is achieved through an implementation of ideal hash trees. Clojure data structures use shared structure to allow efficient creation of new values. If you start with a map of data and add a new key value, you are returned a new map. But that map is not a naïve copy of the original map. Instead it is a new data structure that shares the original data, and provides novelty.

Clojure's built-in collections perform, across a wide range of operations, around half an order of magnitude slower than their mutable counterparts. For most usages this is blazingly fast. So, they make an excellent default because of the strong safety guarantee that immutability brings. However, if you are really searching for speed, you need to be willing to use mutable collections where warranted.
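One way to get a feel for the gap on your own machine is to compare lookups against a Clojure map and a java.util.HashMap built from the same data; the exact numbers are machine dependent, but this kind of comparison is safe to run at the REPL:

(def clj-map (into {} (map (fn [i] [i i]) (range 10000))))
(def java-map (java.util.HashMap. ^java.util.Map clj-map))

(criterium.core/quick-bench (get clj-map 5000))
(criterium.core/quick-bench (.get ^java.util.HashMap java-map 5000))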

Clojure programs tend to have many small maps, vectors, and sets. There are small versions of these collections (such as array maps) that are even more efficient, and they are converted to the full implementations as they grow. Generally you can be comfortable relying on Clojure to do the right thing. If these costs are significant, you can avoid them by carefully managing the types used.

Nested maps are also very commonly used in Clojure programs. This is a good thing. But there can be occasions where nested map operations such as assoc-in and get-in may be improved by unrolling. (get (get (get (get m :d) :c) :b) :a) is not the same thing as (get-in m [:d :c :b :a]) in terms of the byte code produced. The latter is implemented as recursive function calls.
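If you suspect such a lookup is hot, a quick experiment might look like this; m here is a hypothetical nested map standing in for your real data:

(def m {:d {:c {:b {:a 42}}}})

(criterium.core/quick-bench (get-in m [:d :c :b :a]))
(criterium.core/quick-bench (-> m (get :d) (get :c) (get :b) (get :a)))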

The ultimate way to manage the types in your program is to use protocols and records. These constructs make no compromises in relation to the host platform. They are literally as fast as Java classes. You do not need to leave Clojure to get the highest performing data structures possible.

A datatype is created using deftype, defrecord, or reify; the first two take explicit fields. They provide access to the highest-performance primitive representation and polymorphism mechanism of the host. Accessing fields in these types is faster than using maps. Records behave like maps for undefined fields, so you can check whether your record has accumulated unexpected key values during use by examining the __extmap field.
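A small sketch, assuming a Story record for the example domain:

(defrecord Story [title status])

(def s (->Story "Tune performance" "In Progress"))

;; Declared fields compile to direct field access on the generated class.
(.title ^Story s)
(:status s)

;; An undeclared key still works, but its value is stored in the record's
;; extension map rather than in a real field.
(def s2 (assoc s :estimate 3))
(.__extmap ^Story s2) ;;=> {:estimate 3}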

SAFE MUTATION WITH TRANSIENTS

The best way to make use of mutation safely is to limit the scope in which it exists. So long as mutable objects don't escape into your program at large, they can't cause trouble. Clojure has an explicit mechanism for defining well-scoped mutation called transients.

The function transient returns a new, transient version of a collection, in constant time. The function persistent! returns a new, persistent version of the transient collection, in constant time. The transient collection can't be used after it is persisted. Any such use will throw an exception. Consider this example for constructing a set.

(persistent! (reduce conj! (transient #{}) coll))

Clearly this is the safe use of mutation, because the mutation is limited to building up a collection, and only an immutable version of that collection is returned for use by the rest of the program. Clojure uses transients for creating its collections, so you don't need to do this yourself, but you can use the same pattern where you may otherwise be doing many modifications in a well-scoped way.

The secret formula to using transients is:

  1. Identify looping code that is using immutable datastructures.
  2. Inside the scope of the function use a transient accumulator.
  3. Change modifying functions to their transient equivalents.
    assoc!, conj!, dissoc!, disj!, pop!.
  4. Persist before returning the result.

Let's apply this formula to create a transient version of a function that will remove all "Done" stories from a simulation.

(defn remove-done-stories-with-transient [stories]
  ;; stories is a map of story id -> story details, as in project1-better
  (persistent!
    (reduce
      (fn [acc [id _story]]
        (dissoc! acc id))
      (transient stories)
      (filter (fn [[_id story]] (= (:status story) "Done")) stories))))

PROFILING

Profiling is observing the runtime characteristics of a program, such as time spent by function or memory consumption, in order to better understand and optimize them. The main question profiling seeks to answer is "Where does the program spend most of its time?" Even experienced programmers are often surprised by the results—performance is a complex matter. Performance problems and opportunities hide in unexpected places.

The JVM has a solid selection of excellent profilers, all of which work great with Clojure programs. The widely used profilers are:

  • Java Flight Recorder and Java Mission Control
  • YourKit
  • VisualVM

Ultimately the most useful thing they produce is a table of where the time is spent in your program. But they also have a bunch of features such as measuring memory allocations, which are extremely useful for understanding how your code is performing. Clojure programs are especially susceptible to memory allocation and garbage collection, so be sure to look beyond just the timing results (see Figure 7.1).

Figure 7.1 Snapshot of a Clojure program showing memory allocation and garbage collection.

AVOIDING REFLECTION WITH TYPE HINTING

Reflection is the inspection of classes, interfaces, fields, and methods at runtime. Clojure is a dynamic language in the sense that code gets compiled at runtime, but Clojure code is always compiled. You rarely have to deal with types in Clojure, and generally eschew them, but the compiler will take advantage of types that it knows about. Type hinting is a special syntax to tell the compiler precisely what methods and fields are being used when doing interop. Where interop is ambiguous, the compiled code will use reflection to discover at runtime the method or field and call it.

Let's take a look at an example. Imagine part of your simulation algorithm involved counting how many stories were bugs. The definition of what a bug is might be to check for the string bug in the title of a story.

(defn bug? [s]
  (not (neg? (.indexOf s "bug"))))

The intention is that the input parameter s is a string, and you will use the Java method indexOf to see if it contains the substring "bug". The result of indexOf is -1 when the substring is not found. The compiler has no way to know that s will be a string. So it will compile byte code that will use reflection to work with any input. But reflection has a cost.

You can bind the global var *warn-on-reflection* to true in order to have the compiler notify you that it is producing reflection code. The easiest way to do this is to specify it in your project.clj file.

:global-vars {*warn-on-reflection* true}

Now, if you restart your test-refresh or repl session, you will see in the output console a reflection warning.

Reflection warning, mush/simulation.clj:150:3 - call to method indexOf can't be
  resolved (target class is unknown).

So let's make a type-hinted version to benchmark against. Annotate s with ^String to indicate to the compiler that you want to call the String instance indexOf method on s.

(defn bug?? [^String s]
  (not (neg? (.indexOf s "bug"))))

And now you benchmark both approaches to see the impact.

(def stories
  ["When the user clicks new story, the app crashes. (bug)"
   "Add a team member profile page. (feature)"
   "Build a backend service to store stories."
   "Every second story created splits into three stories. (bug)"])
(criterium.core/quick-bench
  (count (filter sim/bug? stories)))
=> Execution time mean : 18.578044 µs
(criterium.core/quick-bench
  (count (filter sim/bug?? stories)))
=> Execution time mean : 739.032651 ns

Look very carefully at the time units. 18.578 µs is 18578 ns. The type hinted version is 25 times faster! Clearly, reflection can have a big impact on your simulation if the bug? function is called many times over many stories over many steps.

There is an additional benefit to type hinting. Type hinting makes inlinable byte code. The JVM can further optimize execution beyond simply avoiding reflection.

Annotating types can be quite effective for bottleneck code. But it is a good style to be minimalistic about type hinting. Avoid annotating unnecessarily, as it makes your code noisy and does not provide a benefit. Even worse, an incorrect type hint can be extremely confusing. When type hinting, focus on input parameters, and try to avoid returning something that needs a type hint. This will help limit the amount of hinting necessary.

JAVA FLAGS

Leiningen sets the -XX:+TieredStopAtLevel=1 flag by default to optimize for startup time. You can remove this flag by using the ^:replace metadata on your :jvm-opts setting.

:jvm-opts ^:replace ["-server"                 ;; maximize peak operating speed
                     "-Xms256m"                ;; heap minimum
                     "-Xmx256m"                ;; heap maximum
                     "-XX:+AggressiveOpts"     ;; point performance compiler optimizations
                     "-XX:+UseCompressedOops"] ;; neutralize penalty imposed by 64 bit JVM

There is widespread disagreement about what flags are the best to use for Clojure and Java applications. Many flags are situational and get a good or bad reputation from specific instances of them being useful or horrible in a particular scenario.

MATH

Clojure's numeric stack is much more forgiving than Java's numeric stack. Arithmetic operators are much more powerful, but they are slower. In Java, if you add two integers whose sum is too large to fit in an integer, the result silently wraps around. In Clojure, the default operators detect overflow and throw an exception, while the promoting operators (+', -', *', inc', dec') upcast the result to the next biggest type that can hold it. Generally this is very convenient, but for some applications you may want to take direct control over numerical behavior for best performance.

When dealing with numbers, things that can make for faster math are listed here; a short sketch combining them follows the list:

  • unchecked-add and *unchecked-math*.
  • Avoiding boxing, using primitives by casting (int x) and hinting ^int.
  • Using Java arrays.
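Here is a sketch combining these ideas, assuming story estimates kept in a primitive long array; the function and data are illustrative rather than part of the example project:

(set! *unchecked-math* :warn-on-boxed)

(defn sum-estimates ^long [^longs estimates]
  ;; loop/recur over a primitive array with unchecked arithmetic,
  ;; so no boxed numbers are allocated in the hot loop.
  (let [n (alength estimates)]
    (loop [i 0 total 0]
      (if (< i n)
        (recur (unchecked-inc i)
               (unchecked-add total (aget estimates i)))
        total))))

(sum-estimates (long-array [3 5 8 2 1]))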

One especially interesting library for math is core.matrix, which provides several different implementations of the same matrix abstractions with different tradeoffs.
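A tiny sketch of core.matrix, assuming the vectorz-clj implementation is on the classpath; the default implementation uses ordinary persistent vectors:

(require '[clojure.core.matrix :as m])

;; Select a faster backing implementation for the matrix operations.
(m/set-current-implementation :vectorz)

(def a (m/matrix [[1 2] [3 4]]))
(def b (m/matrix [[5 6] [7 8]]))
(m/mmul a b)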

SUMMARY

The performance of a computer program or system depends on many factors. Fortunately, Clojure and ClojureScript have great answers to those considerations. Clojure is an abstraction of the host platform, but there are many facilities to reach down to lower levels.

General solutions are difficult when it comes to performance. You should keep an open mind to narrowing the scope of your solution. Define your performance goals and think about what is possible and what is acceptable. You have many tools at your disposal for optimization, so creating a plan of attack is useful. Be ready to change your hypothesis as assumptions become invalidated, and send your experiments in a different direction where necessary.

Be wary of making assumptions about performance based on the results of micro-benchmarks. Fast operations are not the sum of their component operations. Focus on measuring larger scopes of code that are important to your system. Measuring slow things is usually more valuable than measuring fast things.

Think, measure, and monitor.
