CHAPTER 9
Algorithmic Adjoint Differentiation

In this second introductory chapter, we discuss in detail how to automatically generate calculation graphs and conduct adjoint calculations through them to produce all the differentials of any calculation code in constant time, automatically and in a noninvasive manner.

We consider a mathematical calculation that takes a number of scalar inputs and produces a scalar output. We assume that the calculation is written in C++ (including perhaps nested function calls, overwriting of variables, control flow, and object manipulation). Our goal is to differentiate this calculation code in constant time, like we did in the previous chapter, but automatically. For this purpose, we develop AAD code, that is, code that runs adjoint differentiation over any calculation code. In order to do that, the AAD code must understand the sequence of operations involved in the calculation code, a notion known to programmers as "meta-programming" or "introspection." It is achieved through operator overloading, whereby all mathematical operators and functions are replaced by custom code when applied to specific types, as opposed to native C++ types like double. So the calculation code must be templated on the number type, the type used for the representation of numbers. Save for the templating, the calculation code remains untouched. All this will be clarified throughout the chapter.

The differentiation code in this chapter is only meant for pedagogical purposes. The implementation is incomplete and not optimal, and much of the code explored here is not needed in the final implementation of AAD. Its purpose is to make clear exactly what is going on inside AAD. We will also see that calculation graphs turn out to be unnecessary for adjoint differentiation, and that, instead, an optimal implementation uses "tapes," data structures that "record" all the mathematical operations involved in the calculation as they are executed. However, it is only by exploring calculation graphs in detail that we understand exactly how AAD works its magic and why tapes naturally appear as the better solution.

A complete, optimal AAD implementation is delivered in the next chapter, and further improved in the rest of the book. The resulting AAD library is available from our online repository.

9.1 CALCULATION GRAPHS

Definition and examples

The first notion we must understand to make sense of AAD is that of a calculation graph. Every calculation defines a graph. The nodes are all the elementary mathematical operations involved: +, -, *, /, log, exp, sqrt, pow, and so forth.1 The edges that join two nodes represent operation-to-argument relations: child nodes are the arguments to the calculation in the parent node. The leaves (childless nodes) are scalar numbers and inputs to the calculation.2

The structure of the graph reflects precedence among operators: for instance, the graph for xy + z is:

Structure of a nodal graph for xy + z that reflects precedence among operators; child nodes are the arguments to the calculation in the parent node.

because multiplication has precedence, unless explicitly overridden with parentheses. The graph for x(y + z) is:

Structure of a nodal graph node for x(y + z), where multiplication has precedence, and explicitly overwritten with parentheses.

The graph also reflects the order of calculation selected by the compiler among equivalent ones. For instance, the graph for images could be:

Structure of a nodal graph for x + y that reflects the order of calculation selected by the compiler among equivalent ones.

or:

Structure of a unique nodal graph for x + y that reflect the compiler's choice of the calculation order, and always represent calculations left to right.

The compiler selects one of the possible calculation orders. Hence, the graph for the calculation code is not unique, but the graph for a given execution of the code is, the compiler having made a choice among the equivalent orders.3 We work with the unique graphs that reflect the compiler's choice of the calculation order, and always represent calculations left to right. Hence, the graph:

Illustration of a nodal graph of nested function, where the compiler respects the left-to-right order in the code.

represents images and:

Illustration of a calculation nodal graph for y = y + x referring to mathematical operations.

represents images.

The following example computes a scalar result out of 5 scalars x_0, ..., x_4 (where x_4 is purposely left out of the calculation) with the formula:

$$ y = \left[\, x_2 (5 x_0 + x_1) + x_3 \log\!\left( x_2 (5 x_0 + x_1) \right) \right] \left[\, x_2 (5 x_0 + x_1) + \log\!\left( x_2 (5 x_0 + x_1) \right) \right] $$

We also compute and export the partial derivatives by finite differences for future reference. We will be coming back to this code throughout the chapter.4

images
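The original listing is not reproduced in this excerpt. The sketch below shows one way such a code could look, reverse engineered from the intermediate results and derivatives reported later in the chapter; the function name f, the array layout, and the bump size are our own choices, and the value of the unused fifth input is immaterial.

#include <cmath>
#include <iostream>

//  The example calculation, written in terms of doubles
//  (the function name f and the array layout are our own choices)
double f(const double x[5])
{
    double y1 = x[2] * (5.0 * x[0] + x[1]);
    double y2 = std::log(y1);
    return (y1 + x[3] * y2) * (y1 + y2);
}

int main()
{
    const double x[5] = { 1.0, 2.0, 3.0, 4.0, 5.0 };
    std::cout << "y = " << f(x) << "\n";                //  797.751

    //  Partial derivatives by central finite differences (bump size arbitrary)
    const double h = 1.0e-8;
    for (int i = 0; i < 5; ++i)
    {
        double xu[5], xd[5];
        for (int j = 0; j < 5; ++j) xu[j] = xd[j] = x[j];
        xu[i] += h;
        xd[i] -= h;
        std::cout << "dy/dx" << i << " = " << (f(xu) - f(xd)) / (2 * h) << "\n";
    }
    //  950.736, 190.147, 443.677, 73.2041, 0
}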

Its graph (assuming the compiler respects the left-to-right order in the code) is:

Illustration of the calculation graph for the example code, one node per mathematical operation, assuming the compiler respects the left-to-right order in the code.

It is important to understand that complicated, compound calculations define similar graphs, even those that use OO (object-oriented) architecture and nested function calls, like the ones that compute the value of an exotic option with MC simulations or FDM, or even the value of the xVA on a large netting set. Any kind of valuation or other mathematical or numerical calculation defines such a graph, perhaps with billions of nodes, but a calculation graph all the same, where nodes are elementary operations and edges refer to their arguments.

Calculation graphs and computer programs

It is also important to understand that calculation graphs refer to mathematical operations, irrespective of how computer programs store them in variables. For instance, the calculation graph for:

images

is most definitely not:

Structure of an incorrect graph, in which the new value of y points back to the variable y that it overwrites.

but:

Structure of the correct calculation graph, in which the sum defines a new node of its own.

Although the variable y is reused and overwritten in the calculation of the sum, the sum defines a new calculation in its own right in the graph. Graph nodes are calculations, not variables.

Similarly, recursive function calls define new calculations in the associated graph: the following recursive implementation for the factorial:

images

defines the following graph when called with 3 as argument:

Structure of the calculation graph for the recursive factorial implementation called with 3 as argument.

The fact that the function is defined recursively is irrelevant; the graph is only concerned with mathematical operations and their order of execution.

Directed Acyclic Graphs (DAGs)

Computer programs do not define just any type of graph.

First, notice that a calculation graph is not necessarily a tree, a particularly simple type of graph most people are used to working with, where every child has exactly one parent.5 In the example on page 323, the log node has two parents and the node it takes as argument has three. Calculation graphs are not trees, so reasoning and algorithms that apply to trees don't necessarily apply to them.

Second, the edges are directed, from calculation to arguments. An edge from node u to node v means that v is an argument to the calculation in u, as in u = log(v), which is of course not the same as v = log(u). Graphs where all edges are directed are called directed graphs. Calculation graphs are always directed.

Finally, in a calculation graph, a node cannot refer to itself, either directly or indirectly. The following graph, for instance:

Structure of a graph containing a cycle, where a node refers to itself, directly or indirectly.

is not a calculation graph, and no code could ever generate a graph like this. The reason is that the arguments to a calculation must be evaluated prior to the calculation.

Graphs that satisfy this property are called acyclic graphs. Therefore, calculation graphs are directed acyclic graphs, or DAGs. We will be applying some basic results from graph theory shortly, so we must identify exactly the class of graphs that qualify as calculation graphs.

DAGs and control flow

The DAG of a calculation only represents the mathematical operations involved, not the control flow. There are no nodes for if, while, for, and friends. A DAG represents the chain of calculations under one particular control branch. For instance, the code

y = x > 0 ? x * a : x + a;

has the DAG:

Structure of the DAG for this code when executed with a positive input x, where the result is x * a.

when executed with a positive input x, and the different DAG:

Structure of the DAG for this code when executed with a negative input x, where the result is x + a.

when executed with a negative input. It does not have a composite DAG of the type:

Illustration of a composite graph that includes the control flow branch; such a graph is not a calculation DAG.

This is not a calculation DAG.

We work with calculation DAGs, not code DAGs. Calculation DAGs refer to one particular execution of the code, with a specific set of inputs, and a unique control branch, resulting in a specific output. They don't refer to the code that produced the calculation. AAD differentiates calculations, not code. It follows that AAD does not differentiate through the control flow itself: since there are no “if” nodes in the calculation DAG, the AAD “derivative” of an “if” statement is 0.

This may look at first sight like a flaw with AAD, but how is any type of differentiation supposed to differentiate through control flow, something that is by nature discontinuous and not differentiable?

We will discuss control flow and discontinuities further in Chapter 11. For now, we focus on building DAGs and applying them for the computation of differentials in constant time.

9.2 BUILDING AND APPLYING DAGS

Building a DAG

A DAG is not just a theoretical representation of some calculation. We apply operator overloading to explicitly build a calculation DAG in memory at run time, as follows. This code is incomplete (we only instantiate the plus, times, and log nodes) and extremely inefficient, because every operation allocates memory for its node on the DAG. We also don't implement const correctness and focus on functionality. This code is for demonstration purposes only. We suggest readers take the time to fully understand it. Inefficient as it is, it nonetheless demonstrates the DNA of AAD.

First, we create classes for the various types of nodes (plus, times, log, and leaf) in an object-oriented hierarchy:

images

We apply a classical composite pattern (see GOF, [25]) for the nodes in the graph. Concrete nodes are derived for each operation type and hold their arguments by base class pointers. The systematic use of smart pointers guarantees automatic (if inefficient) memory management. They have to be shared pointers, because multiple parents may share a child: we can't have a parent release a child while other nodes are still referring to it. When all its parents are gone, the node is released automatically.

Next, we design a custom number type. This type holds a reference to the node on the graph corresponding to the operation last assigned to it. When such a number is initialized, it creates a leaf on the graph. We use this custom type in place of doubles in our calculation code.

images

dynamic_pointer_cast is a C++11 standard function associated with smart pointers: it returns a copy of the smart pointer of the proper type, with its stored pointer cast dynamically.6 This is necessary because only leaf nodes store a value.

Next, we overload the mathematical operators and functions for the custom number type so that when these operators and functions are executed with this type, the program does not evaluate the operator as normal, but instead constructs the corresponding node on the graph.

images
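Gathering the three pieces just described (the node classes, the Number type, and the overloaded operators), a minimal sketch could look as follows. It supports only sum, product, and logarithm, and all class, member, and function names are our own choices, not the book's listing.

#include <memory>
#include <vector>

//  Base node: holds its arguments as shared pointers to other nodes
struct Node
{
    virtual ~Node() {}
    std::vector<std::shared_ptr<Node>> arguments;
};

//  Concrete nodes, one per operation type
struct PlusNode : Node {};
struct TimesNode : Node {};
struct LogNode : Node {};

//  Leaves store a value; the other nodes don't (yet)
struct Leaf : Node
{
    Leaf(double v) : value(v) {}
    double value;
};

//  The custom number type: a handle on the node for its last assignment
class Number
{
    std::shared_ptr<Node> myNode;

public:
    //  Initializing a Number creates a leaf on the graph
    Number(double v) : myNode(std::make_shared<Leaf>(v)) {}
    Number(std::shared_ptr<Node> node) : myNode(node) {}

    std::shared_ptr<Node> node() const { return myNode; }

    //  Read / write the value: only leaves store values, hence the cast
    double getValue() const
    {
        return std::dynamic_pointer_cast<Leaf>(myNode)->value;
    }
    void setValue(double v)
    {
        std::dynamic_pointer_cast<Leaf>(myNode)->value = v;
    }
};

//  Overloaded operators: build a node instead of computing anything
inline Number operator+(const Number& lhs, const Number& rhs)
{
    auto n = std::make_shared<PlusNode>();
    n->arguments = { lhs.node(), rhs.node() };
    return Number(n);
}

inline Number operator*(const Number& lhs, const Number& rhs)
{
    auto n = std::make_shared<TimesNode>();
    n->arguments = { lhs.node(), rhs.node() };
    return Number(n);
}

inline Number log(const Number& arg)
{
    auto n = std::make_shared<LogNode>();
    n->arguments = { arg.node() };
    return Number(n);
}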

This completes our "AAD" code; we can now build graphs automatically. This is where the first A in AAD comes from: we don't build the DAG by hand; the code does it for us when we instantiate the calculation with our number type. We could hard-replace "double" by "Number" in the calculation code, or maintain two copies, one with doubles, one with Numbers. The best practice is to template the calculation code on its number type, like this:

images

and call it with our custom type like that:

images

Because we called the calculation code with Numbers in place of doubles, line 5 does not calculate anything. All the mathematical operations within the function execute our overloaded operators and functions. So, in the end, what this line of code does is construct in memory the DAG for the calculation of the result.

We templated the calculation code on its number representation type. Templating calculation code in view of running it with AAD is called instrumentation. In Chapter 12, we instrument our simulation code.
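A sketch of what such instrumented calculation code and its call could look like, assuming the Number type and operators sketched above (the function name f remains our own choice):

//  The calculation code, templated on its number representation type:
//  the same code runs with doubles or with Numbers
template <class T>
T f(const T x[5])
{
    using std::log;     //  std::log for doubles, our overload (found
                        //  by argument-dependent lookup) for Numbers
    T y1 = x[2] * (5.0 * x[0] + x[1]);
    T y2 = log(y1);
    T y = (y1 + x[3] * y2) * (y1 + y2);
    return y;
}

void buildDAG()
{
    //  Called with the custom Number type: nothing is evaluated,
    //  the overloaded operators build the DAG in memory instead
    Number x[5] = { 1.0, 2.0, 3.0, 4.0, 5.0 };
    Number y = f(x);    //  y refers to the top node of the DAG
}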

Traversing the DAG

Lazy evaluation What can we do with the DAG we just built?

First, we can calculate it. We evaluate the DAG itself, without reference to the calculation function, the execution of which built the DAG in the first place.

How do we proceed exactly? First, note that arguments to an operation must be computed before the operation. This means that the processing of a node starts with the processing of its children.

Second, we know that nodes may have more than one parent in a DAG. Those nodes will be processed multiple times if we are not careful. For this reason, we start the processing of a node with a check that the node was not processed previously. If it was not, we process its children, and then, and only then, we evaluate the operation on the node itself, mark the node as processed, and cache the result of the operation on the node in case it is needed again from a different parent. If the node was previously processed, we simply pick up its cached result.

In this sequence, the only step that is specific to evaluation is the execution of the operation on the node. All the rest is graph traversal logic and applies to evaluation as well as other forms of visits we explore next.

If we follow these rules, starting with the top node, we are guaranteed to visit (evaluate) each node exactly once and in the correct order for the evaluation of the DAG.

Hence, we have the following graph traversal and visit algorithm: starting with the top node,

  1. Check if the node was already processed. If so, exit.
  2. Process the child nodes.
  3. Visit the node: conduct the calculation, cache the result.
  4. Mark the node as processed.

The third line of the algorithm is the only one specific to evaluation. The rest only relates to the order of the visits. More precisely, the traversal strategy implemented here is well known in graph theory. It is called "depth-first postorder," or simply postorder. Depth-first, because it follows every branch down to a leaf before moving on to the next branch. Postorder, because the children (arguments) of a node are always visited (evaluated) before the node (operation). It follows that visits start on the leaf nodes and end on the top node. Postorder implements a bottom-up traversal strategy and guarantees that every node in a DAG is processed exactly once, with children always visited before their parents. It is the natural order for an evaluation. But we will see that other forms of visits may require a different traversal strategy.

We implement our evaluation algorithm, separating traversal and visit logic into different pieces of code. The traversal logic goes to the base class, which also stores a flag that tells if a node was processed and caches the result of its evaluation. For our information only, the order of calculation is also stored on the (base) nodes. Finally, we code a simple recursive method to reset the processed flags on all nodes.

images

This code isn't optimal, like most of the code in this chapter. Next, we code the specific evaluation visitor as a virtual method: evaluation is different for the different concrete nodes (sum in a plus node, product in a times node, and so on):

images

We start the evaluation of the DAG from the Number that holds the result and, therefore, refers to the top node of the DAG. Recall that Numbers contain a shared pointer to the last assignment.

images

Now we can evaluate the DAG:

images

The evaluation of the mathematical operations in the calculation did not happen on line 5. That line just constructed the DAG; nothing was calculated. The calculation happened on line 7, in the call to the evaluation method, directly from the DAG after it was built. That it returns the exact same result as a direct function call (with double as the number type) indicates that our code is correct.

To see this more clearly, we can change the inputs on the DAG and evaluate it again, without any further call to the function that built the DAG. The new result correctly reflects the change of inputs, as readers may easily check:

images

This pattern, whereby a calculation is stored in the form of a DAG (or another form that lends itself to evaluation) instead of being evaluated straightaway (or eagerly), and is evaluated at a later time, directly from the DAG, possibly repeatedly and possibly with different inputs set directly on the DAG, is called lazy evaluation. Lazy evaluation is what makes AAD automatic; it is also the principle behind, for example, expression templates (see Chapter 15) and scripting (see our publication [11]).
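Putting the traversal and the visits together, a sketch along these lines could look as follows. It revises the node classes sketched earlier and assumes the Number type, the overloaded operators, and the templated calculation function from the previous sketches; names remain our own.

#include <cmath>
#include <memory>
#include <vector>

//  Revised node classes: same structure as before, plus a processed flag,
//  a cached result, the postorder traversal and a virtual evaluation visit
struct Node
{
    virtual ~Node() {}
    std::vector<std::shared_ptr<Node>> arguments;

    bool processed = false;
    double result;

    //  Depth-first postorder: children first, each node exactly once
    void postorder()
    {
        if (processed) return;                          //  1. already done? exit
        for (auto& arg : arguments) arg->postorder();   //  2. children first
        evaluate();                                     //  3. the visit proper
        processed = true;                               //  4. mark as processed
    }

    //  Reset the processed flags on all nodes (not optimal: shared subgraphs
    //  are traversed more than once, which is harmless here)
    void resetProcessed()
    {
        for (auto& arg : arguments) arg->resetProcessed();
        processed = false;
    }

    virtual void evaluate() = 0;    //  the visit, specific to each node type
};

struct PlusNode : Node
{
    void evaluate() override { result = arguments[0]->result + arguments[1]->result; }
};
struct TimesNode : Node
{
    void evaluate() override { result = arguments[0]->result * arguments[1]->result; }
};
struct LogNode : Node
{
    void evaluate() override { result = std::log(arguments[0]->result); }
};
struct Leaf : Node
{
    Leaf(double v) : value(v) {}
    double value;
    void evaluate() override { result = value; }
};

//  Start the evaluation from the Number holding the final result,
//  whose node is the top of the DAG
double evaluate(const Number& result)
{
    result.node()->resetProcessed();
    result.node()->postorder();
    return result.node()->result;
}

//  Usage: build once, evaluate lazily, change an input, evaluate again
void demo()
{
    Number x[5] = { 1.0, 2.0, 3.0, 4.0, 5.0 };
    Number y = f(x);                //  builds the DAG, computes nothing
    double v1 = evaluate(y);        //  797.751, evaluated from the DAG
    x[0].setValue(2.5);             //  change an input directly on the DAG
    double v2 = evaluate(y);        //  new result, no further call to f()
}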

Calculation order There is more we can do with a DAG than evaluate it lazily. For instance, we can number the nodes in calculation order, which will be useful for what follows. We identify the nodes by their calculation order and store that number on the node. We reuse our postorder code; we just need a new visit function. In this case, the visits don't depend on the concrete type of the node, so we don't need a virtual visit function, and we can code it directly on the base node.

images

The lambda defines an internal counter that starts at 0. Since C++14, lambdas not only capture environment variables, but can also define internal variables of their own in the capture clause. The evaluation of the lambda modifies its counter, so the lambda must be marked mutable.

images
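A sketch of how such a numbering could be written, assuming a generic postorder traversal that applies an arbitrary visitor to each node and an integer order member added to the base Node (the names postorderVisit and numberNodes are ours):

//  Generic postorder traversal: apply a visitor to every node, children first
template <class Visitor>
void postorderVisit(Node& node, Visitor& visit)
{
    if (node.processed) return;
    for (auto& arg : node.arguments) postorderVisit(*arg, visit);
    visit(node);
    node.processed = true;
}

//  Number the nodes in calculation (postorder) order.
//  C++14: the lambda defines its own counter i in the capture clause;
//  the lambda modifies i, hence it must be marked mutable.
void numberNodes(const Number& result)
{
    result.node()->resetProcessed();
    auto numberer = [i = 0](Node& node) mutable { node.order = ++i; };
    postorderVisit(*result.node(), numberer);
}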

Our DAG is numbered after the execution of the code above, and each node is identified by its postorder. We can log the intermediate results in the evaluation as follows:

images

after evaluation of the DAG. When we evaluate the DAG repeatedly with different inputs, we get different logs accordingly:

images

For the first evaluation we get the log:

  • Processed node 1 result = 3
  • Processed node 2 result = 5
  • Processed node 3 result = 1
  • Processed node 4 result = 5
  • Processed node 5 result = 2
  • Processed node 6 result = 7
  • Processed node 7 result = 21
  • Processed node 8 result = 4
  • Processed node 9 result = 3.04452
  • Processed node 10 result = 12.1781
  • Processed node 11 result = 33.1781
  • Processed node 12 result = 24.0445
  • Processed node 13 result = 797.751

For the second evaluation, we get different logs reflecting the new inputs.

We can display the DAG, this time identifying nodes by their evaluation order:

Illustration of the DAG with the nodes identified by their evaluation order.

and intermediate results (for the first evaluation):

Illustration of the DAG with the intermediate results for the first evaluation.

Reverse engineering There is even more we can do with the DAG. For instance, we can reverse engineer an equivalent calculation program:

images

We get the log:

  • y1 = 3
  • y2 = 5
  • y3 = 1
  • y4 = y2 * y3
  • y5 = 2
  • y6 = y4 + y5
  • y7 = y1 * y6
  • y8 = 4
  • y9 = log(y7)
  • y10 = y8 * y9
  • y11 = y7 + y10
  • y12 = y7 + y9
  • y13 = y11 * y12

We reconstructed an equivalent computer program from the DAG. Of course, we can change the inputs and generate a new computer program that reflects the new values of the leaves and produces different results.

Note that we managed to make visits (evaluation, instruction logging) depend on both the visitor and the concrete type of the visited node. Different things happen according to the visitor (evaluator, instruction logger) and the node (plus, times, log, and so forth). We effectively implemented GOF's visitor pattern; see [25]. This pattern is explained in more detail and systematically applied to scripts in our publication [11] (where we study trees, not DAGs, but the logic is the same).

We learned to automatically generate DAGs from templated calculation code with operator overloading. We learned to traverse DAGs in a specific order called postorder and to visit their nodes in different ways: lazily evaluate the DAG or reverse engineer an equivalent sequence of instructions.

This covered the “Automatic” in “Automatic Adjoint Differentiation.”7 We now move on to “Adjoint Differentiation” and use the DAG to calculate differentials, all of them, in a single traversal; hence the key constant time property.

Before we do this, we need a detour through basic formalism and adjoint mathematics.

9.3 ADJOINT MATHEMATICS

We must now further formalize and generalize the adjoint mathematics introduced in the previous chapter.

In what follows, we identify the nodes in the DAG by their postorder traversal number i = 1, ..., N. We call y_i the result of the operation on node i.

We denote y = y_N the final result, held on the top node N.

We denote C(i) the set of children (argument) nodes of node i.

Mathematical operations are either unary (they take one argument, like log or sqrt, or the unary minus that changes a number into its opposite) or binary (two arguments, like pow or the operators +, -, *, /). Hence, C(i) is a set of either 0, 1, or 2 nodes. Besides, because of postorder, the numbers of the nodes in C(i) are always less than i. The only nodes i for which C(i) is empty are, by definition, the leaves.

The leaf nodes are processed first in postorder traversal, so their node number images is typically low. Leaves represent the inputs to the calculation.

We define the adjoint a_i of the result y_i on node i as the partial derivative of the final result y to y_i:

$$ a_i \equiv \frac{\partial y}{\partial y_i} $$

Then, obviously:

$$ a_N = \frac{\partial y}{\partial y_N} = 1 $$

and, directly from the derivative chain rule, we have the fundamental adjoint equation:

$$ a_i = \sum_{j \,:\, i \in C(j)} a_j \, \frac{\partial y_j}{\partial y_i} $$

The adjoint of a node is the sum of the adjoints of its parents (calculations for which that node is an argument) weighted by the partial derivatives of parent to child, operation to argument. The operations on the nodes are elementary operations and their derivatives are known analytically.
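As a concrete illustration, consider node 7 in the example DAG numbered earlier. It is an argument to nodes 9, 11, and 12, with y_9 = log(y_7), y_11 = y_7 + y_10, and y_12 = y_7 + y_9, so the adjoint equation gives:

$$ a_7 = a_9 \frac{\partial y_9}{\partial y_7} + a_{11} \frac{\partial y_{11}}{\partial y_7} + a_{12} \frac{\partial y_{12}}{\partial y_7} = \frac{a_9}{y_7} + a_{11} + a_{12} $$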

Note the reverse order to evaluation: in an evaluation, the result of an operation depends on the values of its arguments. In the adjoint equation, it is the adjoint of an argument that depends on the adjoints of the operations that consume it.

Hence, just like evaluation computes results bottom-up, children first, adjoint computation propagates top down, parents first.

More precisely, the adjoint equation turns into the following algorithm that computes the adjoints for all the nodes in the DAG (all the differentials of the final result) and is a direct translation, in graph terms, of the back-propagation algorithm of the previous chapter.

Traversal starts on the top node, whose adjoint is known to be 1, and makes its way toward the leaf nodes, whose adjoints are the derivatives of the final result to the inputs.

The adjoints of every node are seeded to 0 (except the top node, which is seeded to 1) and the visit to node i consists in the propagation of its adjoint a_i:

$$ a_j \mathrel{{+}{=}} a_i \, \frac{\partial y_i}{\partial y_j} $$

to the adjoints a_j of all its arguments (child nodes) j in C(i). In terms of the previous chapter, the visit of a node implements the adjoint of the operation represented on the node. After every node has been processed, exactly once and in the correct order, the adjoints of all nodes have correctly accumulated in accordance with the adjoint equation.

As we observed in the previous chapter, the local derivative of an operation to its argument is known analytically, but it depends on the values of the arguments. This means that we must know all the intermediate results before we conduct adjoint propagation. Adjoint propagation is a post-evaluation step. It is only after we built the DAG and evaluated it that we can propagate adjoints through it. At this point, all the intermediate results are cached on the corresponding nodes, so the necessary partial derivatives can be computed.8

We may now formalize and code adjoint propagation: we set a_N = 1 and start with all other adjoints zeroed: a_i = 0 for i < N. We process all the nodes in the DAG, top to leaves. When visiting node i, we must know its adjoint a_i, so we can implement:

$$ a_j \mathrel{{+}{=}} a_i \, \frac{\partial y_i}{\partial y_j} $$

for all its children j in C(i). This effectively computes the adjoints of all the nodes in the DAG, hence all the partial derivatives of the final result, with a single DAG traversal, hence in constant time.

This point is worth repeating, since the purpose of AAD is constant time differentiation. All the adjoints of the calculation are accumulated from the top-down traversal of its DAG. In this traversal, every node is visited once and its adjoints are propagated to its child nodes, weighted by local derivatives. Hence, the adjoint accumulation, like the evaluation, consists in one visit to every node (operation) on the calculation DAG. Adjoint accumulation, which effectively computes all derivative sensitivities, is of the same order of complexity as the evaluation and constant in the number of inputs.

It is easy enough to code the visit routines. We store adjoints on the nodes and write methods to set and access them, like we did for evaluation results. We also need a method to reset all adjoints to 0, like we did for the processed flag. Finally, we log the visits so we can see what is going on.

images

Notice how the propagation code uses the intermediate results stored on the child nodes to compute partial derivatives, as mentioned earlier. The propagation of adjoints on the unary and binary operator nodes implements the adjoint equation. Leaf nodes don't propagate anything; by the time propagation hits them, their adjoints are entirely accumulated, so their override of the propagation method simply logs the results.
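A sketch of what such visit methods could look like on the concrete nodes sketched earlier, assuming the base Node also holds a double adjoint (initialized to 0) and the order number from the numbering step, and declares a pure virtual propagateAdjoint(); method and member names are our own, and the logging mirrors the output reproduced further down:

#include <cmath>
#include <iostream>

//  y = a + b :  aA += aY,  aB += aY
struct PlusNode : Node
{
    void evaluate() override { result = arguments[0]->result + arguments[1]->result; }
    void propagateAdjoint() override
    {
        std::cout << "Propagating node " << order << " adjoint = " << adjoint << "\n";
        arguments[0]->adjoint += adjoint;
        arguments[1]->adjoint += adjoint;
    }
};

//  y = a * b :  aA += aY * b,  aB += aY * a
//  the partial derivatives use the intermediate results cached on the children
struct TimesNode : Node
{
    void evaluate() override { result = arguments[0]->result * arguments[1]->result; }
    void propagateAdjoint() override
    {
        std::cout << "Propagating node " << order << " adjoint = " << adjoint << "\n";
        arguments[0]->adjoint += adjoint * arguments[1]->result;
        arguments[1]->adjoint += adjoint * arguments[0]->result;
    }
};

//  y = log(a) :  aA += aY / a
struct LogNode : Node
{
    void evaluate() override { result = std::log(arguments[0]->result); }
    void propagateAdjoint() override
    {
        std::cout << "Propagating node " << order << " adjoint = " << adjoint << "\n";
        arguments[0]->adjoint += adjoint / arguments[0]->result;
    }
};

//  Leaves propagate nothing: when propagation reaches them,
//  their adjoint is fully accumulated, so we only log it
struct Leaf : Node
{
    Leaf(double v) : value(v) {}
    double value;
    void evaluate() override { result = value; }
    void propagateAdjoint() override
    {
        std::cout << "Accumulating leaf " << order << " adjoint = " << adjoint << "\n";
    }
};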

Coding the visits was the easy part; what is not so easy is to start the propagation and traverse the DAG in a correct order. It is clear that doing the same as for evaluation, something like:

images

is not going to fly. Postorder traverses the DAG arguments first, but for adjoint propagation it is the other way around. Adjoints are propagated parent to child. We must find the correct traversal order.

9.4 ADJOINT ACCUMULATION AND DAG TRAVERSAL

Can we find a traversal strategy for adjoint propagation such that each node is visited exactly once and all adjoints accumulate correctly?

Preorder traversal

An intuitive candidate is preorder. Like postorder, preorder is a depth-first traversal pattern, but one that visits each node before its children. We reprint the postorder code and write the preorder code below (on the base Node class). We also add a method propagateAdjoints() on the Number class to start adjoint propagation visits in preorder from the result's node:

images
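A sketch of such a preorder propagation, written here as a free function over the node classes sketched earlier, together with the starter; propagateAdjoints() on the Number class follows the text, while resetAdjoints() is assumed to zero the adjoints of all nodes:

//  Depth-first preorder: visit the node first, then its children.
//  There is deliberately no "processed" check: a node with several parents
//  is visited once per incoming path, which is exactly the behavior
//  observed in the log below.
void preorderPropagate(Node& node)
{
    node.propagateAdjoint();                //  visit the node first
    for (auto& arg : node.arguments)        //  then its children
        preorderPropagate(*arg);
}

//  Starter, called on the Number holding the final result
void Number::propagateAdjoints()
{
    node()->resetAdjoints();        //  seed all adjoints to 0
    node()->adjoint = 1.0;          //  except the top node, seeded to 1
    preorderPropagate(*node());
}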

Here is the result:

  • Propagating node 13 adjoint = 1
  • Propagating node 11 adjoint = 24.0445
  • Propagating node 7 adjoint = 24.0445
  • Accumulating leaf 1 adjoint = 168.312
  • Propagating node 6 adjoint = 72.1336
  • Propagating node 4 adjoint = 72.1336
  • Accumulating leaf 2 adjoint = 72.1336
  • Accumulating leaf 3 adjoint = 360.668
  • Accumulating leaf 5 adjoint = 72.1336
  • Propagating node 10 adjoint = 24.0445
  • Accumulating leaf 8 adjoint = 73.2041
  • Propagating node 9 adjoint = 96.1781
  • Propagating node 7 adjoint = 28.6244
  • Accumulating leaf 1 adjoint = 368.683
  • Propagating node 6 adjoint = 158.007
  • Propagating node 4 adjoint = 230.14
  • Accumulating leaf 2 adjoint = 302.274
  • Accumulating leaf 3 adjoint = 1511.37
  • Accumulating leaf 5 adjoint = 230.14
  • Propagating node 12 adjoint = 33.1781
  • Propagating node 7 adjoint = 61.8025
  • Accumulating leaf 1 adjoint = 801.3
  • Propagating node 6 adjoint = 343.414
  • Propagating node 4 adjoint = 573.555
  • Accumulating leaf 2 adjoint = 875.829
  • Accumulating leaf 3 adjoint = 4379.14
  • Accumulating leaf 5 adjoint = 573.555
  • Propagating node 9 adjoint = 129.356
  • Propagating node 7 adjoint = 67.9623
  • Accumulating leaf 1 adjoint = 1277.04
  • Propagating node 6 adjoint = 547.301
  • Propagating node 4 adjoint = 1120.86
  • Accumulating leaf 2 adjoint = 1996.69
  • Accumulating leaf 3 adjoint = 9983.43
  • Accumulating leaf 5 adjoint = 1120.86
  • a0 = 9983.43
  • a1 = 1120.86
  • a2 = 1277.04
  • a3 = 73.2041
  • a4 = 0

We conducted 35 visits through a DAG with 13 nodes. All nodes were visited, but many were visited multiple times. For example, node 7 propagated its adjoint 4 times.

However, the derivatives are wrong. We recall that we computed them with finite differences earlier, and the correct values are: 950.736, 190.147, 443.677, 73.2041, and 0.

Actually, the values are incorrect because nodes are visited multiple times. Every time, the whole adjoint accumulated so far is propagated again, so the arguments accumulate adjoints multiple times. This is easily remedied by setting the adjoint to 0 after its propagation to the child nodes. This way, only the part of the adjoint accumulated since the previous propagation is propagated the next time around:

images
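On the times node of the earlier sketch, for instance, the modified propagation could read as follows; the same change applies to the other operator nodes, while leaves are unchanged and keep accumulating:

void TimesNode::propagateAdjoint()
{
    arguments[0]->adjoint += adjoint * arguments[1]->result;
    arguments[1]->adjoint += adjoint * arguments[0]->result;
    adjoint = 0.0;      //  already propagated: only what accumulates
                        //  from now on is propagated on the next visit
}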

Now we get the correct values, exactly the same derivatives we had with finite differences, but we still conducted 35 visits over 13 nodes, and the remedy we implemented is really a hack: if every node were visited exactly once, as it should be, such a remedy would not be necessary.

Note that AAD and finite differences agree on differentials, at least to the fourth significant figure. This illustrates that analytic derivatives produced by AAD are not different, in general, from the numerical derivatives produced by bumping. They are just computed a lot faster.

At least AAD is faster when implemented correctly. A poor implementation may well perform slower than finite differences. Our implementation so far certainly does. Efficient memory management and friends are addressed in the next chapter, but DAG traversal is addressed now. Our implementation will not fly unless we visit each node in the DAG once; 35 visits to 13 nodes is not acceptable. Preorder is not the correct traversal strategy.

Breadth-first traversal

Preorder is still a depth-first strategy in the sense that it sequentially follows paths on the DAG down to a leaf.

We may try a breadth-first strategy instead, where visits are scheduled by level in the hierarchy: top node first, then its children, left to right, then all the children of the children, and so forth.

Breadth-first traversal cannot be implemented recursively.9 It is implemented as follows: start on the top node with an empty queue of (pointers to) nodes. Process a node in the following manner:

  1. Send the children to the back of the queue.
  2. Visit the node.
  3. Process all nodes in the queue from the front until empty.

Easy enough. The implementation takes a few lines:

images
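A sketch of such a breadth-first propagation, written as a free function over the node classes sketched earlier, with std::queue scheduling the visits:

#include <queue>

//  Breadth-first propagation, scheduled with a FIFO queue of node pointers
void breadthFirstPropagate(Node& top)
{
    std::queue<Node*> q;
    q.push(&top);                           //  start on the top node
    while (!q.empty())
    {
        Node* node = q.front();
        q.pop();
        node->propagateAdjoint();           //  visit the node
        for (auto& arg : node->arguments)   //  send its children to the back
            q.push(arg.get());
    }
    //  the starter on the Number class calls this in place of the preorder
}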

The propagateAdjoints() caller on the Number class is unchanged.

Unfortunately, breadth-first is no better than preorder: it computes the correct results (provided we still zero adjoints after propagation) but still conducts 35 visits, and node 7, for instance, still propagates four times. The order is not the same, but the performance is identical.

Breadth-first is not the answer.

9.5 WORKING WITH TAPES

We believe it was useful to experiment with different unsuccessful traversal strategies to gain a solid understanding of DAGs and how they work, and we took the time to introduce some essential pieces of graph theory and put them into practice in modern, modular (although inefficient) C++ code.

Now is the time to provide the correct answer.

What we need is a traversal strategy that guarantees that the entire, correct adjoints are accumulated after a single visit to every node. The only way this can happen is if every node has finished accumulating its adjoint before it propagates it. In other words, all the parents of a node must be visited before the node.

This particular order is well known in graph theory. It is called the topological order. In general, it takes specialized algorithms to sort a graph topologically. But for DAGs we have a simple result:

The topological order for a DAG is the reverse of its postorder.

This is simple enough, and the demonstration is intuitive: postorder corresponds to the evaluation order, where arguments are visited before calculations. Hence, in the reverse order, all the parents (operations) of a node (argument) are visited before the node. In addition, postorder visits each node once; hence, so does the reverse. Therefore, the reverse postorder is the topological order.10

The correct order of traversal in our example is simply: 13, 12, 11,…, 1.

What this means is that we don't need DAGs after all. We need tapes. We don't need a graph structure for the nodes; we need a sequence of nodes in the order they evaluate (postorder). After we complete the evaluation and build the tape, we traverse it in the reverse order to propagate adjoints. One single (reverse) traversal of the tape correctly computes the adjoints of all the nodes. We compute all the derivative sensitivities in constant time.

We propagate adjoints once through each node, just as we evaluate each node once. The complexity of the propagation, in terms of visits, is therefore the same as for the evaluation. Both conduct the same number of visits, corresponding to the number of operations. Propagation visits are slightly more complex, though; for example, to propagate through a multiplication takes two multiplications. So the total complexity of AAD is around three evaluations (depending on the code): one evaluation, plus one propagation that costs roughly two evaluations. In addition, we have a significant administrative overhead: with our toy code, allocations alone could make AAD slower than bumping. We will reduce admin costs to an absolute minimum in the next chapter, and further in Chapter 15.

For now, we refactor our code to use a tape.

Conceptually, a tape is a DAG sorted in the correct order.

Programmatically, this simplifies data structures. The tape is a sequence of nodes, so it is natural to store the nodes in a vector of unique pointers. That way, a node's lifespan is the same as the tape's. We also save the overhead of shared pointers. Nodes still need references to their arguments to propagate adjoints, but they don't need to own their memory: we can use dumb pointers for that.

We also simplify the code and remove lazy evaluation and numbering: these were for demonstration and we don't need them anymore. We need the tape only for adjoint propagation. Therefore, we modify the code so it evaluates calculations (eagerly) as it builds the tape, and, when the calculation is complete, propagates adjoints through the tape in the reverse order.

The complete refactored code is listed below:

images
images
images
images
images

This code returns the correct derivatives, having traversed the 13 nodes on the tape exactly once.

We just completed our first implementation of AAD.

This code is likely slower than bumping, and the Number type is incomplete: it only supports sum, product, and logarithm. The code is not optimal in many ways and uses shortcuts such as a static tape and unsafe casting. In the next chapter, we produce professional code, work on memory management, and implement various optimizations and conveniences. We make AAD orders of magnitude faster than bumping.

But the essence of AAD lies in this code and the explanations of this chapter.
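The refactored listing itself is not reproduced in this excerpt. To close the chapter, the following self-contained sketch illustrates the same ideas on the chapter's example, with details of our own choosing: a global tape (standing in for the static tape mentioned above), nodes that store their local derivatives and raw pointers to their arguments' adjoints, and eager evaluation inside the overloaded operators.

#include <cmath>
#include <iostream>
#include <memory>
#include <vector>

//  A node on the tape: its adjoint, the local derivatives to its arguments
//  and dumb pointers to the adjoints of its arguments
struct Node
{
    double adjoint = 0.0;
    unsigned nArg = 0;
    double derivatives[2];      //  d(result) / d(argument), at most 2 arguments
    double* argAdjoints[2];     //  adjoints of the arguments on the tape

    void propagateAdjoint()
    {
        for (unsigned i = 0; i < nArg; ++i)
            *argAdjoints[i] += adjoint * derivatives[i];
    }
};

//  The tape: the nodes, in the order in which they were evaluated (postorder)
std::vector<std::unique_ptr<Node>> tape;

//  The custom number: an eagerly computed value + a pointer to its tape node
class Number
{
    double myValue;
    Node* myNode;

    static Node* record(unsigned nArg)
    {
        tape.push_back(std::make_unique<Node>());
        tape.back()->nArg = nArg;
        return tape.back().get();
    }

public:
    Number(double v) : myValue(v), myNode(record(0)) {}     //  leaf

    double value() const { return myValue; }
    double adjoint() const { return myNode->adjoint; }

    //  Operations: compute eagerly and record node, derivatives and pointers
    friend Number operator+(const Number& a, const Number& b)
    {
        Number r(a.myValue + b.myValue);
        r.myNode->nArg = 2;
        r.myNode->derivatives[0] = 1.0;       r.myNode->argAdjoints[0] = &a.myNode->adjoint;
        r.myNode->derivatives[1] = 1.0;       r.myNode->argAdjoints[1] = &b.myNode->adjoint;
        return r;
    }
    friend Number operator*(const Number& a, const Number& b)
    {
        Number r(a.myValue * b.myValue);
        r.myNode->nArg = 2;
        r.myNode->derivatives[0] = b.myValue; r.myNode->argAdjoints[0] = &a.myNode->adjoint;
        r.myNode->derivatives[1] = a.myValue; r.myNode->argAdjoints[1] = &b.myNode->adjoint;
        return r;
    }
    friend Number log(const Number& a)
    {
        Number r(std::log(a.myValue));
        r.myNode->nArg = 1;
        r.myNode->derivatives[0] = 1.0 / a.myValue;
        r.myNode->argAdjoints[0] = &a.myNode->adjoint;
        return r;
    }

    //  Seed the result and propagate adjoints through the tape, in reverse
    void propagateAdjoints()
    {
        myNode->adjoint = 1.0;
        for (auto it = tape.rbegin(); it != tape.rend(); ++it)
            (*it)->propagateAdjoint();
    }
};

//  The instrumented calculation code, as earlier in the chapter
template <class T>
T f(const T x[5])
{
    using std::log;
    T y1 = x[2] * (5.0 * x[0] + x[1]);
    T y2 = log(y1);
    return (y1 + x[3] * y2) * (y1 + y2);
}

int main()
{
    Number x[5] = { 1.0, 2.0, 3.0, 4.0, 5.0 };
    Number y = f(x);            //  eager evaluation + recording on the tape
    y.propagateAdjoints();      //  one single reverse sweep over the tape
    for (int i = 0; i < 5; ++i)
        std::cout << "a" << i << " = " << x[i].adjoint() << "\n";
    //  950.736, 190.147, 443.677, 73.2041, 0
}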
