CHAPTER 12
Differentiation of the Simulation Library

12.1 ACTIVE CODE

In this chapter, we apply the AAD library from the previous chapter to differentiate our simulation library from Part II. More precisely, we apply AAD to compute the differentials of prices to model parameters. In the next chapter, we discuss differentials of prices to market variables, also called market risks or hedge coefficients.

What will clearly appear is that an efficient AAD library is certainly necessary, but absolutely not sufficient for the efficient differentiation of complex valuation code. It takes work, art, and skill to correctly instrument calculation code. An efficient AAD library provides all the necessary pieces, but articulating them into an efficient instrumentation requires an intimate knowledge of the instrumented algorithms (known as domain-specific knowledge) and a comprehensive understanding of the inner mechanics of the AAD library. The notion that one can template calculation code, link to an efficient AAD library, and obtain differentials with AAD speed is a myth. Instrumentation takes at least as much skill and effort as the production of the AAD library itself. A naive instrumentation does produce differentials in constant time, but it may be a very long constant time, often slower than finite differences. While the notion of an efficient instrumentation depends on the differentiated code, this chapter provides general advice and battle-tested methodologies, applied to the instrumentation of the simulation code of Part II with remarkable results.

It is especially hard to instrument calculation code that was not written with AAD in mind to start with. We wrote our simulation library knowing that we would eventually instrument it for AAD. This makes its instrumentation substantially easier. In fact, we wrote templated code throughout the simulation library from the start. Not all of it, though. In particular, we did not template the code for the generation of uniform and Gaussian random numbers. All of that code uses doubles to represent numbers.

Why did we code it this way? We know that the final result, the average payoff across the simulated paths, depends on the random numbers. Those numbers have non-zero adjoints. However, our goal is to differentiate the final result with respect to the model parameters: spot, local volatilities, and so forth. We know that the random numbers do not depend on those parameters. Hence, their adjoints do not propagate all the way back to them.

Formally, with the notations of previous chapters, we differentiate the Monte-Carlo estimate

$$V\left(a\right) = \frac{1}{N}\sum_{i=1}^{N} g\left[h\left(a, G_i\right)\right]$$

with respect to the vector of model parameters $a$, where $G_i$ is the vector of Gaussian random numbers for path $i$, $h$ is the path generation function, and $g$ is the payoff function. It is clear from those notations that $G_i$ is another argument to $h$ and has no dependency on $a$. The derivatives we want to compute are:

$$\frac{\partial V}{\partial a} = \frac{1}{N}\sum_{i=1}^{N} \frac{\partial g\left[h\left(a, G_i\right)\right]}{\partial a}$$

The $G_i$s do not depend on $a$. In particular,

$$\frac{\partial G_i}{\partial a} = 0$$

In AAD lingo, we call the $G_i$s inactive.

We know that because we know exactly how MC simulations work: this is domain-specific knowledge in action. Because we know that, we don't need to propagate from the random numbers at all. Hence, we don't need to propagate to them, either. Their adjoints don't contribute to the final derivatives, so we don't need to compute these adjoints at all. Instantiating the code that produces random numbers with the Number type would result in unnecessary AAD overhead at evaluation time (putting operations and local derivatives on tape) and propagation time (conducting unnecessary propagations to and from the random numbers). Because we know that, we can easily leave the random numbers and their generation outside of the adjoint logic. All it takes is to use doubles for those calculations in place of a templated number type.

Code that uses the templated number type is said to be instrumented. This code is recorded on tape and participates in the adjoint propagation, at the cost of AAD overhead. Code that uses native types like doubles is said to be noninstrumented. It is not recorded or propagated and is free of AAD overhead. Calculations that depend directly or indirectly on the inputs and, therefore, must participate in adjoint propagation are called active. Code that is independent of the inputs and, therefore, does not contribute to their adjoints is called inactive. We have a first, and probably most important, rule for the efficient differentiation of calculation code:

  • Only instrument active code.

We call this major optimization selective instrumentation. On the other hand, if we forget to instrument active code, we produce wrong derivatives. Active adjoints are not propagated and their contribution is lost. If we instrument inactive code, like random number generators, we produce unnecessary AAD overhead. We must instrument all the active code, but only the active code. The identification of inactive code is a key step of an efficient differentiation.

Readers are encouraged to review our simulation code and verify that we correctly applied selective instrumentation.

12.2 SERIAL CODE

We start with the serial code. We have nothing to do about random numbers, and the models and products are already instrumented (although we will need minor extensions on models). Therefore, our work focuses on the template algorithm of Chapter 6, which we recall here:

[Code listing: the templated Monte-Carlo simulation algorithm of Chapter 6]

We now differentiate this code, instantiating valuation code with the Number type in place of doubles and inserting additional logic for the back-propagation of adjoints and, more generally, the management of the tape. Following the discussion of the previous chapter, we back-propagate adjoints path-wise to avoid unreasonable tape growth.

[Code listing: the serial differentiation algorithm, instantiated with Numbers, with path-wise back-propagation of adjoints]

The function has the same inputs as the original, except that they are now instantiated with Numbers instead of doubles, plus an additional “aggregator.” Following the discussion of the previous chapter, we are differentiating one result for now. In case the product has multiple payoffs, we differentiate an aggregate provided by client code as a function of the payoffs, represented by a lambda or another callable type. For example, the product may be a portfolio of transactions, and we may want to differentiate the value of the entire book: the sum of the values of the transactions, weighted by the notionals. In this case, we would apply an aggregator that computes the “payoff” of the portfolio as the sum of the payoffs of the products, weighted by the notionals:

[Code listing: an aggregator computing the notional-weighted sum of the payoffs]

where inner_product is a standard C++ library algorithm from the header <numeric> that does exactly what its name says. Note that it is not the values, but the payoffs that are aggregated. The aggregator effectively turns multiple payoffs into one aggregated payoff, and it is this aggregated payoff that is differentiated. The default aggregator simply picks the first payoff.

The return type is different: we return a matrix of path-wise payoffs as previously, but we also return a vector of path-wise aggregated payoffs and the vector of all the sensitivities of its value (path-wise average) to all the model parameters.

As in valuation, we start by cloning the model and the RNG, since both are modified by the subsequent code. We allocate the path and the model's working memory, but we don't initialize them yet, and pick the dimensions: the number of payoffs and the number of parameters. We take a reference to the vector of pointers to the model parameters so that we can easily address it thereafter.

What follows is new and deserves an explanation:

[Code listing: accessing and initializing the tape, and putting the model parameters on tape]

We take a reference to the tape and initialize it.

The next thing we do (line 13) is put all the model parameters on tape. They are the inputs to our calculation with respect to which we compute derivatives. They must go on tape before any calculation is executed. See the discussion on page 391 for details and a simple example.

This is the modification we need in our simulation code: we must develop a method on the base Model class to put all the parameters on tape. The modification is minor, because we can implement it directly on the base class without modification to concrete models. The (small) difficulty is that the Model class is templated in its number type, but only Numbers go on tape. We must implement some simple template specialization magic so that the Number instantiation puts its parameters on tape, whereas for any other type T, the method does nothing.1 Readers unfamiliar with template specialization may learn all about it in [76].

The code below completes the base Model class in mcBase.h.

[Code listing: the method on the base Model class, in mcBase.h, that puts the parameters on tape]

Back to our simulation code, the next thing we do (lines 18 and 21) is initialize the model and the scenario. It is crucial to do this here and not before. Initialization must be conducted after the parameters are on tape. The initialization of a model typically conducts calculations with the model parameters. In Black and Scholes, we pre-calculate deterministic amounts for the subsequent generation of the paths. In Dupire, we pre-interpolate the local volatilities in time. So the parameters must be on tape, or these calculations result in nodes with dangling pointers to nonexistent parameter adjoints. We again refer to the detailed discussion on page 391.

At this point the parameters, and the initial calculations, those that are not repeated path-wise, are all on tape and correctly linked. We mark the tape there (line 24). Remember, we are going to compute path-wise derivatives. We will construct the tape for a path, propagate adjoints through it, and rewind it for the next path. We are going to rewind it up to the mark so that the inputs and the initialization remain on tape throughout the whole simulation, their adjoints accumulating contributions across simulations.

Illustration of a tape constructed for a path, propagating adjoints through it, and rewinding it for the next path so that the inputs and the initialization remain on tape throughout the whole simulation.

Next in our simulation code, we initialize the RNG, preallocate working memory for the simulations, and allocate results. This is admin code; it is not recorded and no different than valuation.

[Code listing: initialization of the RNG, preallocation of working memory, and allocation of results]

The important initialization phase is complete; we now proceed with the simulations:

[Code listing: the simulation loop of the serial differentiation algorithm]

A path is processed in three stages:

  1. Rewind the tape to the mark (line 10) so as to reuse the simulation piece of the tape across simulations and avoid growth, while leaving the model parameters and initialization on tape throughout all simulations so that their adjoints correctly accumulate contributions from all the simulations.
  2. Generate and evaluate the path (lines 13–20). This code is identical to the valuation code. It executes the three steps of the processing of a path: generate Gaussian numbers, simulate a path, evaluate payoffs over the path. The difference is that this code is now instantiated with Numbers. So it goes on tape when executed and builds the simulation piece of the tape, overwriting the previous simulation so that the size of the tape remains fixed to the size of one simulation; see figure on page 414.
    We have an additional line 20, where the payoffs for the current path are aggregated in one result, with an aggregator provided by the client code, as explained earlier.
  3. Propagate adjoints from the result (whose adjoint is seeded to 1) down to the mark. We back-propagate the adjoints of the current simulation, but only those. As a result, the adjoints on the persistent piece of the tape, where the parameters and initialization were recorded, accumulate their contribution from the current simulation.

After the simulation loop completes, the adjoints on the persistent piece of the tape have accumulated their contributions from all the simulations. What is left is to back-propagate adjoints through the persistent piece so that the adjoints of the entire Monte-Carlo valuation accumulate into the model parameters.

Illustration of a tape back-propagating adjoints through the persistent piece so that the adjoints of the entire Monte-Carlo valuation accumulate into the model parameters.

This takes one line of code:

[Code listing: back-propagation through the persistent piece of the tape]

It follows that we back-propagate simulation adjoints repeatedly, but initialization adjoints only once. This is another one of the strongest optimizations in our code. If we propagated adjoints all the way, repeatedly over each simulation, differentiating a Monte-Carlo valuation would be many times slower. This optimization, to a large extent, is the dual of the strongest optimization we included in our simulation code of Chapter 6. We gathered as much work as possible into the initialization, conducted once, so as to minimize work during the repeated simulations. It follows that we have a large initialization tape and small simulation tapes. In the differentiation algorithm, we back-propagate repeatedly over the small tapes and once over the large tape, reaping the benefits of our optimization of Chapter 6 for differentiation, too.

All that remains is to pick the sensitivities from the parameter adjoints, pack,2 clean, and return:

[Code listing: picking the sensitivities from the parameter adjoints, packing, cleaning, and returning]

The code is found in mcBase.h.

12.3 USER INTERFACE

We must upgrade our user interface in main.h to run the differentiation algorithm. We develop two functions: one that differentiates a single selected payoff in a product, and one that differentiates a portfolio of payoffs with given notionals. Both functions take a model and a product in the memory store, developed in Section 6.5, run the differentiation algorithm, and return:

  1. The labels of the payoffs in a vector<string>
  2. The corresponding values in a vector<double>
  3. The value of the differentiated aggregate, as a double
  4. The names of all the parameters in a vector<string>
  5. The corresponding risk sensitivities of the aggregate to parameters in a vector<double>

[Code listing: the signatures of the user interface functions in main.h]

The differentiation algorithm works with instrumented models and products, Model<Number> and Product<Number>. We upgrade the memory store in store.h so we can store and retrieve both instrumented and non-instrumented models and products. In practice, we store a pair of models or products, one instrumented and one not:

[Code listing: the upgraded memory store in store.h, holding pairs of non-instrumented and instrumented objects]

We upgrade the functions to initialize models and products in memory accordingly:

[Code listing: the upgraded functions to initialize models and products in memory]

We template the functions to retrieve models and products, so client code calls the double instantiation for pricing or the Number instantiation for risk:

[Code listing: the templated functions to retrieve models and products from the store]

The entire code is found in store.h. With the store upgraded to deal with instrumented objects, we develop the user interface for the differentiation of simulations in main.h below. We write two functions, one for the differentiation of one named payoff, and one for the differentiation of a portfolio. The project in our repository exports these functions to Excel.

[Code listings: the user interface functions for the differentiation of one named payoff and of a portfolio, in main.h]

where the parallel version of the differentiation algorithm, called here, is developed next.

For test purposes, and to assess the speed and the accuracy of our derivatives, we also implement bump risk. Bumps are implemented in a trivial manner to produce itemized risk reports.3 Bump risk is up to thousands of times slower than AAD, and the resulting risk sensitivities are expected to be virtually identical. The following self-explanatory function is found in main.h:

[Code listing: the bump risk function in main.h]

12.4 SERIAL RESULTS

We can now differentiate the results of our simulation library with respect to model parameters. For instance, we compute the sensitivities of a European or barrier option to all local volatilities. Sensitivities to local volatilities, what Dupire calls a microbucket, are valuable information for research purposes. For the purpose of trading, however, we really need the sensitivities to tradable market variables, not model parameters. In the context of Dupire's model, we want the sensitivities to the market-implied volatilities, what Dupire calls a superbucket, because those are risks that may be directly hedged by trading European options in the market. Local volatilities are calibrated to implied volatilities, so superbuckets can be computed from microbuckets, as discussed in detail in Chapter 13. In this chapter, we focus on the production of microbuckets.

We differentiate the example of Section 6.6: a 3y European call of strike 120, together with a 150 barrier, in Dupire's model with a local volatility of 15% over 60 times and 30 spots, simulating 500,000 paths over 156 weekly steps, where a pricing took around 3 seconds with Sobol. With 1,801 sensitivities to compute, we need 1,802 prices with finite differences, so the bump risk takes around 1.5 hours.4

Our AAD implementation produces the entire risk in around 8 seconds (8,250 ms), less than three times a single valuation. With the improvements of Chapter 15, when we further accelerate AAD by factor around 2, we produce this report in less than 5 seconds (4,500 ms). With the parallel algorithm, it takes half a second (575 ms).

The results are displayed on the charts below, first for the European, then for the barrier, with a smoothing of 5. The charts display the differentials to local volatilities on seven different times, as functions of the spot.

Chart displaying the results for the 3y European call with a smoothing of 5: differentials to local volatilities on seven different times, as functions of the spot.
Chart displaying the results for the Barrier 120/150 with a smoothing of 5: differentials to local volatilities on seven different times, as functions of the spot.

The speed is remarkable and may look suspect. Didn't we establish that the theoretical best case for AAD is around three times the cost of a valuation? With the improvements of Chapter 15, our example computes in the time of 1.5 valuations, half that theoretical minimum.

This is mainly due to our selective code instrumentation. A lot of the valuation time is spent generating random numbers and turning them into Gaussians with the not-so-cheap inverse Gaussian function, executed once for every dimension of every path. All those calculations are inactive. We did not instrument them. They execute without AAD overhead, in the exact same time in differentiation or evaluation mode. Only the instrumented code bears AAD overhead. The remarkable speed is not only due to AAD's constant time and the implementation of the AAD library, but also to the optimizations in our instrumented code, mainly the selective instrumentation and the partial back-propagation.

Risk sensitivities are almost identical to bumping. Sensitivities to local volatilities match to many significant figures. The delta, however, differs slightly. The reason is subtle. In the barrier payoff code, we set the smoothing factor as a fraction of the current spot. When the spot is bumped, the smoothing factor changes and affects the price. But this is numerical noise, not market risk. Hence, we made the smoothing factor a double in our code so that it does not contribute to derivatives, and the result we get is market risk alone. Our delta is not off; on the contrary, it is more accurate.

12.5 PARALLEL CODE

We now move on to the AAD instrumentation of our parallel code from Chapter 7, combining the idioms of that chapter for parallelism with the techniques of this chapter for differentiation. The instrumentation of the parallel code is more complicated, but well worth the effort, since it is really the combination of parallel simulations and AAD (and expression templates, see Chapter 15) that gives us this “magic” performance. The AAD library helps: the static tape pointer on the Number type is thread local, so all the calculations executed on a thread land on that thread's own tape. But that doesn't mean our work is done. First, the tape pointer is thread local, but we still must declare the tapes for all the threads and set the thread-local pointer to each thread's tape from that thread. More importantly:

  1. With AAD, the model is a mutable object. Even though it is marked const, this is not really the case any more; see the discussion on page 391. When the Numbers stored on the model (parameters and initialized amounts) are read, their adjoints on tape are written into during back-propagation. So, we must make copies of the model for every thread, like other mutable objects.5
  2. In the single-thread version, we were careful to put the model parameters on tape and initialize the model so the initialized amounts are recorded, too. In the parallel version, we must put the parameters and initializers, not on one tape, but on every thread's tape. The only way to do this is to put the parameters on tape and initialize the model from each thread, before the first simulation on that thread.
  3. Adjoints are accumulated on thread local tapes, so after the simulation loop, the adjoints of the parameters and initializers are scattered across different tapes. We must back-propagate over initialization on multiple tapes and aggregate all the sensitivities to parameters.

The code is essentially a combination of the parallel code of Chapter 7 with the differentiation code of this chapter, with a special response to these specific challenges.

To address number 2, we separate the initialization into a dedicated function:

[Code listing: the dedicated initialization function]

The template algorithm has the same signature as the single-threaded one and returns results in the same format:

[Code listing: the signature of the parallel differentiation algorithm]

and the entry-level function in main.h is already set up to fire it.

Next, we allocate working memory (without initializing anything quite yet), one mutable object for every thread (including the main thread, so number of worker threads + 1):

[Code listing: allocation of working memory, per-thread mutable objects, and the vector of tapes]

We also initialize a vector of tapes on line 35, one for each worker thread. The main thread already owns a tape: the static tape declared on the Number type instantiates a copy for the main thread when the application starts, as for any static object.

Finally, we have a vector of booleans on line 38, one for every thread, tracking which threads have completed the initialization phase, starting at false for everyone.6

We must also initialize the RNGs and preallocate the Gaussian vector, but for that we need the simulation dimension, and only an initialized model can provide this information. So the next step is to initialize the main thread's model (the main thread always has index 0, the worker threads having indices 1 to nThread):

[Code listing: initialization of the main thread's model, the RNGs, and the Gaussian vectors]

Next, we dispatch the simulation loop for parallel processing to the thread pool, applying the same techniques as in Chapter 7:

[Code listing: dispatch of the simulation loop to the thread pool]

The body of the lambda executes on different threads. This is the concurrent code. Here, the tape is the executing thread's tape, and the thread pool provides the index of the executing thread. The first thing we do is note this index and set the thread-local tape accordingly:

[Code listing: noting the index of the executing thread and setting the thread-local tape]

This is the core of parallel AAD right here. The tape is accessed through the thread-local pointer on the Number type. We set it from the executing thread to the tape we allocated for that thread in our vector of tapes. From there on, all the subsequent calculations coded in the lambda and executed on this thread are recorded on that tape. The main thread (0) already correctly refers to its global tape; it does not need resetting.

Only at this stage can we put the parameters on tape and initialize the model (initialization other than allocation, that is: we allocated all the models earlier, on the main thread, knowing from Chapter 3 that we must avoid concurrent allocations):

[Code listing: putting the parameters on tape and initializing the model on the executing thread]

Next, we pick this thread's RNG and skip it ahead, like in Chapter 7:

[Code listing: picking the executing thread's RNG and skipping it ahead]

Finally, the rest of the concurrent code executes the inner batch loop of Chapter 7 with the same contents as the serial differentiation code: rewind the tape (to the mark), generate random numbers, simulate the scenario, compute the payoffs, aggregate them, back-propagate adjoints (to the mark), store the results, repeat:

[Code listing: the inner batch loop of the concurrent code]

As with serial differentiation, all adjoints are accumulated on the persistent piece of the tape at this stage, and we must propagate the initializers' adjoints back into the parameters' adjoints; see figure on page 416. The difference is that the adjoints are now spread across the multiple tapes belonging to multiple threads, but we are out of the concurrent code and back on the main thread. Never mind: this is one propagation conducted one time; we propagate all the threads' tapes from the main thread:

[Code listing: propagation over all the threads' tapes, conducted from the main thread]

We first propagate over the main thread's tape on line 7. Then, we sequentially propagate the other threads' tapes, still from the main thread. In order to propagate a given thread's tape from the main thread, we temporarily set the main thread's tape pointer to that thread's tape (line 15), and propagate (line 17).

In the end, we reset the main thread's tape to its static tape on line 21.7

We are almost home safe. All the adjoints are correctly accumulated, but still across multiple tapes and multiple models. The tape of each thread contains the sum of the adjoints over the paths processed on that thread. The adjoints are accessible from every thread's copy of the model. We sum up the adjoints across models, hence across tapes, normalize, pack, and return:

[Code listing: summing the adjoints across models and tapes, normalizing, packing, and returning]

The code is found in mcBase.h.

12.6 PARALLEL RESULTS

We reproduce the example on page 424, this time in parallel: a 3y call 120 with a 150 barrier, in Dupire's model with 1,800 local volatilities (of which 1,080 are active) at 15% over 60 times and 30 spots, simulating 500,000 paths over 156 weekly steps with Sobol. With the serial implementation, it took around 8 seconds to compute the microbuckets (precisely 8,250 ms). With the improvements of Chapter 15, it takes less than 5 seconds (precisely 4,500 ms).

With the parallel implementation, we get the exact same results in exactly one second, with perfect parallel efficiency. With the improvements of Chapter 15, we get them in half a second (precisely 575 ms).

We produce more than 1,000 active sensitivities over 500,000 paths, 156 steps, in half a second.

With an efficient AAD library, coupled with an effective instrumentation of the simulation code, we computed derivatives over simulations with unthinkable speed. Not only can we produce lightning-fast risk reports on workstations and laptops, including for xVA, we can also conduct research that would have been impossible otherwise.

For instance, a parallel price with Sobol takes 350 ms in our example, that is, more than five minutes for a finite difference risk over the 1,080 active local volatilities (which perhaps may be improved to two to three minutes with substantial effort and invasive code). This is too slow. It is therefore common practice to reduce accuracy to produce risk reports in a manageable time. With 36 monthly time steps and 100,000 paths, a parallel price takes 20 ms, so a finite difference risk report could perhaps be produced in as little as ten seconds. The results, however, are not pretty, as seen in the charts below, first for the European, then for the barrier, to be compared to the charts shown earlier.

Chart displaying the finite difference risk report for the 3y European call, produced in around ten seconds.
Chart displaying the finite difference risk report for the Barrier 120/150, produced in around ten seconds.

Slow financial software has the unfortunate consequence that users are encouraged to settle for approximate, inaccurate risk computed in reasonable time. With AAD, we produce the accurate charts on page 425 in half a second. We could run 1M paths to produce even better, smoother risk reports in a second.

Interestingly perhaps, one second is approximately the time it takes to produce a similar risk report with comparable accuracy with FDM in place of Monte-Carlo, differentiated with finite differences on a single core. The combination of AAD and parallelism roughly brings FDM speed to Monte-Carlo simulations.

With 3M paths, we produce the much smoother charts below in around three seconds (that is, three seconds each, although we produce them both in around three seconds in Chapter 14; the three seconds are with the improvements of Chapter 15):

Chart displaying the smoother results for the 3y European call, produced with 3M paths.
Chart displaying the smoother results for the Barrier 120/150, produced with 3M paths.

It would have taken at least half an hour with smart, multi-threaded bumping, and at least four hours in serial mode. For all intents and purposes, obtaining such precise microbuckets with Monte-Carlo simulations is not feasible without AAD. AAD opens new research possibilities into financial risk management; one example is that it produces accurate microbuckets in a few seconds. For other, perhaps unexpected, applications, see, for example, Reghai's [97] and Andreasen's [82].

We reiterate that the main purpose of derivatives systems is not to produce sensitivities to model parameters, but risks to market instruments. For this reason, this comparison between AAD and bumping is somewhat invidious. With bumping, sensitivities to hedge instruments could be produced faster by directly bumping a smaller number of market prices and repeating the whole valuation process, including the calibration of the model parameters. However, to produce market risks in this manner could lead to instabilities from differentiating through calibration. Further, the sensitivity to model parameters would not be computed, and that is valuable information in its own right. Finally, it would not offer the flexibility to compute risks with different instruments than those used to calibrate model parameters.

We have seen that AAD produces sensitivities to a large number of model parameters extremely fast. With the concern for costs of computing a large number of sensitivities out of the way, the better strategy for producing risks is to first compute sensitivities to model parameters, as explained in this chapter, and next turn them into sensitivities to hedge instruments, as explained in the next chapter, where we calibrate Dupire's model to market data and produce risks, not to local volatilities, but to implied volatilities, which directly reflect the market prices of European options.

NOTES
