Build a little. Test a little. Learn a lot.
—Rear Admiral Wayne Meyer, Aegis Weapon System Program Manager
Since the publication of the first edition of this book, the authors have had the opportunity to consult with many dozens of CISOs and their security teams. While some wanted detailed quantitative risk assessments for a complex organization, others just wanted to get started in the simplest way possible. Simple is the objective of this chapter.
In this chapter we will propose some first steps toward building a quantitative risk assessment, which shouldn't be much more complicated than qualitative risk registers or risk matrices. We've cut about as many corners as we could while still introducing some basic quantitative risk assessment concepts.
The first model we will describe in this chapter produces just a single monetary amount representing the average of possible losses. The next slightly more elaborate version we will describe introduces the Monte Carlo simulation—a powerful risk analysis tool. Instead of producing a single number, it can show a distribution of possible losses and their probabilities. Both of these tools and many others in this book require only an Excel spreadsheet. Excel templates for the examples explained in this chapter are already made for you and can be downloaded from this book's website: www.howtomeasureanything.com/cybersecurity.
Later, we will explore more detailed models (starting in Chapter 6) and more advanced statistical methods (starting in Chapter 8). But for now, we will start with a model that merely replaces the common risk matrix. It will simply be a way to capture subjective estimates of likelihood and impact but do so with unambiguous quantities.
Imagine it's your first day on the job as the new CISO of some mid‐sized organization. It has grown large enough to warrant establishing a security organization. You're the first hire. You have no staff. You have no budget. To build and fund a plan you need a rapid method of assessing your enterprise's cyber risk.
This is what the “Rapid Risk Audit” was made for. Its goal is to quickly assess significant risks that can be readily assessed. It could be the first step for the new CISO, with deeper technical and quantitative assessments to follow.
A rapid risk audit could take an hour or two to complete if you keep it simple. It consists of a few interviews, and your job is to turn the interview results into rough loss forecasts. Let's review some basic terminology you will use in the audit and then walk through the steps for completing it.
There are six key terms used in this interview. Ultimately, they provide the backbone for the quantification of risk in support of security budgets, cyber insurance policies, and board reporting:
TABLE 3.1 List of Potential Cyber Threats for the Enterprise
(a spreadsheet with another version of this list is ready‐made at www.howtomeasureanything.com/cybersecurity)
Cause (Attack Vector) | Type of Loss |
---|---|
Ransomware | Data breach |
Ransomware | Business disruption |
Ransomware | Extortion |
Business email compromise | Fraud |
Cloud compromise | Data breach |
Cloud compromise | Business disruption |
SaaS compromise | Data breach |
SaaS compromise | Business disruption |
Insider threat | Data breach |
Insider threat | Business disruption |
Insider threat | Fraud |
…and so on |
Like any audit, this rapid audit should follow a structured checklist. Again, we are keeping it very simple for now. Since you will probably rely on other people for some of the inputs, we need to make it easy to explain. As an optional audit step, you can download a spreadsheet from www.howtomeasureanything.com/cybersecurity. The math at this point is so simple you hardly need it worked out for you in a spreadsheet, but it should provide some structure.
See Table 3.2 for an example with computed AEL values.
TABLE 3.2 Example of Rapid Risk Audit
Attack Vector | Loss Type | Annual Probability of Loss | 90% CI of Impact: Lower Bound | 90% CI of Impact: Upper Bound | Annual Expected Loss |
---|---|---|---|---|---|
Ransomware | Data breach | 0.1 | $50,000 | $500,000 | $20,750 |
Ransomware | Business disruption | 0.05 | $100,000 | $10,000,000 | $178,250 |
Ransomware | Extortion | 0.01 | $200,000 | $25,000,000 | $88,800 |
Business email compromise | Fraud | 0.03 | $100,000 | $15,000,000 | $159,450 |
Cloud compromise | Data breach | 0.05 | $250,000 | $30,000,000 | $533,125 |
Etc. | Etc. | Etc. | Etc. | Etc. | Etc. |
Once you have completed this rapid risk audit, you have a total dollar amount that approximates the total probability‐weighted losses for all the listed threats. It gives you a sense of the real scale of your risks. The AEL for each risk also gives you some idea of which risks should be more concerning.
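If you prefer to check the arithmetic behind Table 3.2 outside a spreadsheet, it can be sketched in a few lines of code. Note that the 0.65/0.35 weighting of the bounds below is an assumption we infer from the table's computed AEL values (it approximates the mean of the skewed, lognormal-style distribution introduced later in this chapter); treat it as one reasonable convention, not the only one.

```python
# Sketch of the Annual Expected Loss (AEL) arithmetic behind Table 3.2.
# The 0.65/0.35 bound weighting is an assumption consistent with the
# table's values; it approximates the mean of a lognormal distribution.

def expected_impact(lb: float, ub: float) -> float:
    """Approximate the mean impact from a 90% CI (lower bound, upper bound)."""
    return 0.65 * lb + 0.35 * ub

def annual_expected_loss(p: float, lb: float, ub: float) -> float:
    """Probability-weighted loss for one row of the audit table."""
    return p * expected_impact(lb, ub)

risks = [
    ("Ransomware", "Data breach",          0.10,  50_000,    500_000),
    ("Ransomware", "Business disruption",  0.05, 100_000, 10_000_000),
    ("Ransomware", "Extortion",            0.01, 200_000, 25_000_000),
    ("Business email compromise", "Fraud", 0.03, 100_000, 15_000_000),
    ("Cloud compromise", "Data breach",    0.05, 250_000, 30_000_000),
]

total = 0.0
for vector, loss_type, p, lb, ub in risks:
    ael = annual_expected_loss(p, lb, ub)
    total += ael
    print(f"{vector} / {loss_type}: ${ael:,.0f}")
print(f"Total AEL: ${total:,.0f}")
```

Running this reproduces the AEL column of Table 3.2 row for row.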
Again, this is not much more complicated than a risk matrix, which, according to our surveys, about half of organizations already use. We simply replaced each risk on a risk matrix, one for one. So how is any of this different from what many organizations do now?
Since we are still making some simple judgments about likelihoods and impacts, it may seem like we haven't changed much. After all, the reason we refer to our methods of replacing the risk matrix as a one‐for‐one substitution is because we can directly map the previous risks to the proposed model. However, even though the approach we described is very simple and may still be based mostly on subjective estimates, it introduces some basic quantitative principles the risk matrix excludes. The following table (Table 3.3) contrasts the two approaches.
TABLE 3.3 The Simple Substitutions of the Quantitative Model vs. the Risk Matrix
Instead of: | We Substitute: |
---|---|
Rating likelihood on a scale of 1 to 5 or “low” to “high.” Example: “Likelihood of X is a 2” or “Likelihood of X is medium.” | Estimating the probability of the event occurring in a given period of time (e.g., 1 year). Example: “Event X has a 10% chance of occurring in the next 12 months.” For some items we can use industry data. If we feel stumped, we use Laplace's rule of succession (LRS) from Chapter 2 for guidance. |
Rating impact on a scale of 1 to 5 or “low” to “high.” Example: “Impact of X is a 2” or “Impact of X is medium.” | Estimating a 90% CI for a monetized loss. Example: “If event X occurs, there is a 90% chance the loss will be between $1 million and $8 million.” |
Plotting likelihood and impact scores on a risk matrix and further dividing the risk matrix into risk categories such as “low/medium/high” or “green/yellow/red” | Using the quantitative likelihood and impact to generate an annualized expected loss (AEL), which we can use to prioritize risks and controls. We can build on the AEL further when we use the same inputs in a very simple Monte Carlo simulation. |
In the following chapters, we will describe in greater detail the problems introduced by the risk matrix and the measurable improvements observed when using more quantitative methods, even when the quantitative assessments are subjective.
In the remainder of this chapter we will discuss the two sources of input for the initial estimates used in this simple model: actual data (from your own firm or the industry) and the judgments of your experts. We'll discuss both of these next. After that, we will cover how to use this simple model for decisions about controls and how to improve the output of the risk audit with simple simulations.
Recall a key point from the previous chapter: you have more data than you think. Using Laplace's rule of succession gives you some basis for interpreting what might seem like very limited observations. But you also have access to a wide variety of publicly available data. Here are just a few publicly available sources.
If you want to dig deeper, there are other potential sources. These might be out of the scope of a rapid audit, but there are some sources that can give you some guidance if you are willing to spend some time and/or money on getting the estimates.
For example, many public sources estimate the total number of breaches in the industry, but if you want to turn that into a likelihood that a threat will damage your firm, you will need to estimate a baseline. If you are a major retailer, you will want to look up the number of breaches in that industry in a given year. Then, to convert that into a baseline likelihood, compare that to the total number of major retailers and apply LRS.
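As a sketch of that baseline calculation (with purely hypothetical counts), Laplace's rule of succession estimates a probability as (hits + 1) / (trials + 2):

```python
def laplace_rule_of_succession(hits: int, trials: int) -> float:
    """Laplace's rule of succession: (hits + 1) / (trials + 2)."""
    return (hits + 1) / (trials + 2)

# Hypothetical example: 4 major breaches observed among 200 peer
# retailers in a given year. The counts are illustrative only.
baseline = laplace_rule_of_succession(hits=4, trials=200)
print(f"Baseline annual breach probability: {baseline:.1%}")
```

With these made-up counts the baseline comes out to roughly 2.5% per firm per year, which you could then adjust up or down for your firm's specifics.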
There are some sources that can do some of the work for you. Advisen Ltd. is a firm that tracks details on breaches. Firms like Cyentia Institute (which the authors have worked with) specialize in the analysis of this data. We will cover more of these sources in later chapters.
There is also new academic research coming out frequently about every risk. If you work for a large firm or have access to university resources, you may already have access to large databases of published research. Even without that, free sources such as Google Scholar can be useful. But this is a bit like prospecting for gold. You may have to search a while to find something that directly or indirectly informs what you need for estimating your firm's own risks.
If you were using qualitative scoring methods or risk matrices before, you were relying on expert judgment. The rapid audit uses that, too. We only propose that instead of using scales such as “high, medium, low” or 1 to 5, experts learn how to subjectively assess the actual quantities behind those scales—that is, probability and dollar impact.
The method proposed is, like the risk matrix, really just another expression of your current state of uncertainty. It does not yet reflect a proper “measurement” in the sense that we have further reduced uncertainty based on additional observations. We are merely stating our prior uncertainty. But now we have expressed this level of uncertainty in a way that allows us to unambiguously communicate risk and update this uncertainty with new information.
Some may object to the idea of subjectively assessing probabilities. Some analysts who had no problem saying likelihood was a 4 on a scale of 1 to 5 or a medium on a verbal scale will argue that there are requirements for quantitative probabilities that make quantification infeasible. Somehow, the problems that were not an issue using more ambiguous methods are major roadblocks when attempting to state meaningful probabilities.
This is a common misunderstanding. As we first introduced in Chapter 2, it is a mathematically valid position to use a subjective probability to represent the prior state of uncertainty of a subject matter expert. In fact, there are problems in statistics that can only be solved by using a probabilistically expressed prior state of uncertainty, which may require a belief expressed in the form of a probability estimate. And these are actually the very situations most relevant to decision making in any field, including cybersecurity.
Later, we will discuss the sources supporting this approach, including some very large empirical studies demonstrating its validity. Additionally, we have a chapter dedicated to helping readers measure and improve their own skill at assessing probabilities using a short series of exercises that can help them continue to improve it over time. We call this “calibrated probability assessment,” and we will show that there is quite a bit of research backing up the validity of this approach. Most experts are consistently overconfident in estimates and calibration training reduces this error. We will also show methods for combining the estimates of multiple experts in a way that measurably outperforms the best individual expert.
For now, just recognize that most experts can be trained to subjectively assess probabilities and that this skill is objectively measurable (as ironic as that sounds). Remember, if the primary concern about using probabilistic methods is the lack of data, then you also lack the data to use nonquantitative methods. As we've stated, both methods are based on the same source of data so far—that is, the expert opinion of cybersecurity specialists. And we cannot assume that whatever errors you may be introducing to the decision by using quantitative probabilities without being trained are being avoided by using qualitative methods. In Chapter 5, we'll talk more about this among other common misconceptions and objections to quantitative methods.
In addition to correcting for overconfidence, the expert can also be improved by using methods that account for two other sources of error in judgment: the high degree of expert inconsistency and a tendency to make common inference errors when it comes to thinking probabilistically. These improvements will also be addressed in upcoming chapters. (Of course, these sources of error are not dealt with in the typical risk matrix at all.)
Ultimately, the point of risk analysis—even with the risk matrix we are replacing—is to support decisions. But the difficulty we had before was making specific resource‐allocation choices for specific controls. What is it worth, after all, to move one high risk to a medium? Is it $5,000 or $5 million? Or what if we have a budget of $8 million for cybersecurity and 80 lows, 30 mediums, and 15 highs? And what if we can mitigate ten lows for the same money as one medium?
If you have observed (as the authors have) someone asking a question such as, “If we spent another million dollars, can we move this risk from a red to a yellow?” then you may have felt the dissatisfaction with popular qualitative approaches. Clearly the traditional risk matrix offers little guidance once the CISO actually has to make choices about allocating resources. You might think that you may as well do without these methods altogether. But, as we will show later, the CISO should definitely not assume they handle these decisions well using their expert intuition alone.
What the CISO needs is a “return on control” calculation. That is the monetized value of the reduction in expected losses divided by the cost of the control. If we look only at the benefits in a single year (and ignore other time‐value considerations), we can show this as:

Return on Control (ROC) = (Reduction in Expected Losses / Cost of Control) − 1
The difference in the loss before and after the control is the “reduction in expected losses” in the simple formula above. If the reduction in expected losses was exactly as much as the cost, then this formula would say the return on control (ROC) was 0%. This would be the convention for other forms of investment. Suppose we had an AEL of $100,000 for a given threat and we had a control that we estimated would reduce the chance of that loss by half. That would be a reduction in expected losses of $50,000. If the control cost $10,000, then the ROC is 50,000/10,000 − 1, or 400%.
A control might provide a benefit even if it reduces only the impact, not the chance that a threat event would occur. Methods to detect and stop attacks before they do more damage, or methods to quickly recover after a business disruption, may not reduce the chance the event would occur in the first place but do reduce its impact. This particular type of control might be referred to as a risk mitigation. Regardless of whether a control reduces likelihood, impact, or both, the calculation for ROC is the same.
In this example, we assumed that you paid for a risk reduction for one year only. If the control is an investment meant to provide risk reductions over multiple years, then you would also have to identify over what period of time this expected reduction in losses would occur. If the control was just an ongoing expense that could be started and stopped at any time, then this simple formula could just be applied to a year's worth of benefits (loss reduction) and a year's worth of costs. If the control is a one‐time investment that could provide benefits over a longer period of time, then follow the financial conventions in your firm for capital investments. You will probably be required then to compute the benefits as a “present value” of the stream of future loss reductions at a given discount rate. Or you may be asked to produce an “internal rate of return.” We won't spend time on those methods here, but these are fairly simple financial calculations that can, again, be done entirely with simple functions in Excel.
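The one-year ROC arithmetic, and its multiyear present-value variant, can be sketched as follows. The three-year horizon and 8% discount rate in the second example are illustrative assumptions, not recommendations:

```python
def return_on_control(loss_reduction: float, cost: float) -> float:
    """One-year return on control: benefit / cost - 1 (0% = break-even)."""
    return loss_reduction / cost - 1

# Example from the text: $50,000 reduction in expected losses, $10,000 cost.
print(f"One-year ROC: {return_on_control(50_000, 10_000):.0%}")  # 400%

def present_value(annual_benefit: float, years: int, rate: float) -> float:
    """Present value of a level stream of annual loss reductions."""
    return sum(annual_benefit / (1 + rate) ** t for t in range(1, years + 1))

# Hypothetical: a one-time $100,000 control yielding a $50,000/year
# reduction in expected losses for 3 years, discounted at 8%.
pv = present_value(50_000, years=3, rate=0.08)
print(f"Multiyear ROC: {return_on_control(pv, 100_000):.0%}")
```

For a one-time investment, substituting the present value of the benefit stream into the same ROC formula follows the capital-budgeting convention mentioned above.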
A note of caution is needed if you plan on decomposing our simple range for impact further into more variables that go into computing impact. In the example we have shown so far, computing the expected losses is a very simple calculation. We can just multiply the computed mean of the distribution for the impact times the probability of the event occurring (the spreadsheet we provide on the website does this for you). But if we decompose impact into multiple components that need to be multiplied together (e.g., records breached times cost per record, duration of outage times number of people affected times cost per person per hour), then working with the averages is no longer a reasonable way to estimate. We effectively have to run a separate simulation on each row. But in our simplest model we can ignore that for now. As we make the model more advanced, we can add more detail. Later chapters will describe how you can evolve the model by adding elements that incrementally improve realism.
The calculations for ROC are also provided in the example Excel spreadsheet you can download from www.howtomeasureanything.com/cybersecurity.
You have just reviewed how to run a very simple rapid audit of cybersecurity risks. We left out a lot of detail about probabilistic methods, but it is enough to get by for the first day of being a CISO. Now, before you continue on from this chapter, let's show what a simple simulation can do with the ranges and probabilities you already collected.
Using ranges and probabilities to represent your uncertainty instead of unrealistically precise point values clearly has advantages. When you allow yourself to use ranges and probabilities, you don't really have to assume anything you don't know for a fact. But precise values have the advantage of being simple to add, subtract, multiply, and divide in a spreadsheet. If you knew each type of loss exactly, it would be easy to compute the total loss.
In the previous rapid audit method, we asked for intervals, but we still boiled them down to a single point. We could have two ranges with the same mean but where one is far wider than the other. We lost something about the uncertainty when we converted that 90% CI to a single number. If multiple separate events occur in one year, how do we add up the ranges so that we have an aggregated range for the total? We would like to be able to answer questions such as “Given all the threats, what is the chance that the total losses will exceed $20 million in a single year?” We can't answer that by adding up AELs. We need to do the math with the entire range of possibilities.
So how do we add, subtract, multiply, and divide in a spreadsheet when we have no exact values, only ranges? Fortunately, there is a practical, proven solution, and it can be performed on any modern personal computer—the Monte Carlo simulation method. A Monte Carlo simulation uses a computer to generate a large number of scenarios based on probabilities for inputs. For each scenario, a specific value would be randomly generated for each of the unknown variables. Then these specific values would go into a formula to compute an output for that single scenario. This process usually goes on for thousands of scenarios.
In the 1940s, some mathematicians and scientists started using such simulations of thousands of random trials to help solve certain very hard mathematical problems. Stanislaw Ulam, John von Neumann, and Nicholas Metropolis had developed a way to use this method on the rudimentary computers available at the time to help solve math problems related to the development of the first atomic bomb in the Manhattan Project. They found that randomly running a large number of trials was a way to work out the probabilities of various outcomes when a model has a number of highly uncertain inputs. At the suggestion of Metropolis, Ulam named this computer‐based method of generating random scenarios after Monte Carlo, a famous gambling hotspot, in honor of Ulam's uncle, a gambler.1 Now, with the advantage of greater computing power (easily billions of times greater than what was available on the Manhattan Project, by almost any measure), Monte Carlo simulations have been used to simulate models on power plants, supply chains, insurance, project risks, financial risks, and, yes, cybersecurity.
If you have no experience with Monte Carlo simulations, they're probably easier than you think. The authors and many of their staff routinely apply Monte Carlo simulations on a variety of practical business problems. We have seen that many people who initially were uncomfortable with the idea of using Monte Carlo simulations eventually became avid supporters after tinkering with the tools themselves.
Let's start with the same likelihoods and 90% CIs you already gathered for the initial rapid audit. You have a set of possible events that could occur in a given one‐year period. Each event has an assigned probability and, if it occurs, a range of possible losses. You may have some events that have a probability of 1% and perhaps some have a probability of more than 10%. In any given year it is possible that no significant loss event occurs, and it is also possible that several events occur. It is even possible that the same event could happen multiple times in the same year. There is a solution to that, but for now we will keep it simple and just model an event as an either/or outcome that happens no more than once per year. The risk matrix doesn't make this distinction anyway (and it misses quite a few other issues we will introduce later), so this will help keep the example closer to a one‐for‐one substitution.
Just like the previous example, we have all the tools we are about to describe already set up as templates in Excel on the website www.howtomeasureanything.com/cybersecurity.
As stated earlier, we offer this in Excel to make it more accessible. But we've also been asked to provide solutions in Python and R. There are also existing tools specifically for Monte Carlo simulations in cybersecurity. Those tools might allow you to gloss over the basics on Monte Carlo simulations in the remainder of this chapter. But for those who want to understand some basics regardless of the tool or would like to see how the Excel spreadsheets work, read on through the rest of this chapter.
Armed with this information and some of the more detailed content of upcoming chapters, you should be able to model your uncertainty and answer questions such as “What is the chance we will lose more than X next year due to a cyberattack?”
The prepared spreadsheets on the book's website will have all of the following set up so that you don't have to get into the details of how to run a simulation for risk assessment. But if you want to eventually make more elaborate models on your own, this will be a useful primer.
A Monte Carlo simulation can involve thousands of randomly generated scenarios. This can be done in virtually any programming language, and it can even be done in an Excel spreadsheet. First, let's generate one random scenario for a single threat, starting with whether the event occurred for a single risk in a single scenario. To simulate whether a particular event occurs, we can randomly generate a “1” if it occurs and a “0” if it does not occur, where the probability of a “1” is equal to the stated probability of the event. In Excel, we can write this as

=IF(RAND() < Event_Probability, 1, 0)
For example, if the event probability per year is .15, then this equation would produce a “1” (meaning the event occurred) in 15% of the scenarios. In Excel, every time you recalculate this (press F9), you will see a different result. If you did this a thousand times, you would see the event occurs about 150 times. Note that this would be for each individual risk you are listing in your simulation. So if you have 100 risks each with different probabilities and you run 1,000 scenarios, this little formula would have been executed 100,000 times.
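The same either/or draw can be sketched outside Excel, with `random.random()` playing the role of rand():

```python
import random

def event_occurs(p: float, rng: random.Random) -> int:
    """Return 1 with probability p, else 0 -- the =IF(RAND()<p,1,0) idea."""
    return 1 if rng.random() < p else 0

rng = random.Random(42)  # seeded so the run is reproducible
trials = 10_000
hits = sum(event_occurs(0.15, rng) for _ in range(trials))
print(f"Event occurred in {hits} of {trials} scenarios (~15% expected)")
```

As the text notes, the count will hover around 1,500 out of 10,000 but will vary a bit from run to run (and seed to seed).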
One note we'll mention early is that we are using the Excel “rand()” function here as a shorthand for any random number generator. There actually is a rand() function in Excel, and the above equation will work exactly as described. But wherever we mention rand(), our spreadsheets will actually be using a different method for generating a random number. See the inset “HDR PRNG” for more details. This is explained further in the spreadsheets.
Now, for the impact, we need to generate not just a “0” or “1” but a continuum of values. We can do this using one of Excel's “inverse probability functions.” Some probability functions in Excel will tell you the probability of a given result in a particular probability distribution. For example, normdist(x, mean, standard deviation, 1) will tell you the probability that a normal distribution with a given mean and standard deviation will produce a value of x or less. The inverse probability function, on the other hand, tells you the value of x given a probability. In Excel, the inverse probability function for a normal distribution is

=norminv(probability, mean, standard deviation)
(Note: Recent versions of Excel also use “norm.inv()” but “norminv()” will still work.) If you use the Excel rand() function (or HDR PRNG) in place of the probability term, this will produce a normally distributed random value with the stated mean and standard deviation. The standard deviation is a sort of measure of the width of a probability distribution, but it is not really a very intuitive quantity for an expert to estimate. It will be better if we just ask the expert for a 90% CI as described earlier. This can be used to compute the required parameters such as mean and standard deviation based on an upper bound (UB) and lower bound (LB) describing a range of potential losses provided by the expert.
We are going to turn that range into a probability distribution of a particular type that we will use often: the “lognormal” distribution. When we determined how to weight the bounds to compute the expected value in the simple rapid audit at the beginning of this chapter, this was the type of distribution we were assuming.
The lognormal distribution is a variation of the more familiar, bell‐shaped normal distribution. It is just a normal distribution on the log of a value we want to simulate and it is a distribution that is usually a much better representation of reality. Figure 3.1 illustrates an example of this distribution compared to the normal distribution. Notice how the lognormal distribution is lopsided or “skewed,” unlike the normal distribution. The lognormal distribution can't generate a zero or negative amount, but it has a tail to the right that allows for the possibility of extremely large outcomes. This is why it is often a realistic representation of losses that may be shockingly high. A normal distribution wide enough to capture some extreme events could also produce illogical negative results on the other end of the scale (you can't have a negative number of records breached or a negative downtime for a system). This is why the lognormal is also used to model a variety of quantities that can't be negative but could possibly (but rarely) be very large.
To generate a lognormal distribution, the tool provided on the book's website uses the following formula in Excel:

=lognorm.inv(rand(), mean of ln(X), standard deviation of ln(X))
where:
standard deviation of ln(X) | = | (ln(UB) − ln(LB))/3.29 |
mean of ln(X) | = | (ln(UB) + ln(LB))/2 |
So if we had a 90% CI for an impact of $100,000 to $8 million, then the mean and standard deviation we need to use for lognorm.inv (which are the mean and standard deviation of the log of the original distribution) would be:

mean of ln(X) = (ln(8,000,000) + ln(100,000))/2 ≈ 13.70
standard deviation of ln(X) = (ln(8,000,000) − ln(100,000))/3.29 ≈ 1.33
To generate the loss for an event with a 5% chance of occurrence and an impact of $1 million to $9 million, we would write

=IF(RAND() < 0.05, LOGNORM.INV(RAND(), 14.91, 0.668), 0)
Most of the time (95%), this function would produce a zero. And just 5% of the time it would generate a value with a 90% chance of falling between $1 million and $9 million. Note that since this is a 90% CI, there is a 5% chance of being below the lower bound (but above zero, since the lognormal distribution can only produce positive values) and a 5% chance of being above the upper bound and sometimes well above. If the event occurs in the example above, there is a 1% chance the loss could exceed $14.2 million.
Here is one note of caution in using lognormal distributions. The extreme losses for a given 90% CI may be unrealistic when the upper bound is many times the lower bound. This can happen when the expert estimating the value makes the mistake of believing the upper bound represents a worst‐case extreme, which it is not. The upper bound of a 90% CI allows for a 5% chance the value is higher. Extreme outcomes are also sensitive to the lower bound. If the 90% CI is $10,000 to $1 million, then the upper bound is 100 times the lower bound. In this case there is a 1% chance the loss will exceed 2.6 times the stated upper bound ($2.6 million). If the 90% CI was $1,000 to $10 million, then there is a 1% chance the loss could be more than 6.7 times the upper bound ($67 million). If that seems like too much, then reconsider the width of the range or simply truncate the generated value at some maximum. If we wanted to say that $10 million was the maximum loss, then we could use the Excel function =MIN(loss, 10000000) to take the lesser of the loss or $10 million. Appendix A provides more distributions that will be more appropriate for certain kinds of problems, along with their Excel formulas and a description of when each is appropriate. Later, we will review some considerations for choosing distributions.
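The lognormal draw can also be sketched with Python's standard library, with `NormalDist.inv_cdf` playing the role of norminv(). The 3.29 divisor is the book's convention (twice the ~1.645 z-score of a 90% interval):

```python
import math
import random
from statistics import NormalDist

def lognormal_params(lb: float, ub: float) -> tuple:
    """Mean and standard deviation of ln(X) from a 90% CI (LB, UB)."""
    mu = (math.log(ub) + math.log(lb)) / 2
    sigma = (math.log(ub) - math.log(lb)) / 3.29
    return mu, sigma

def simulate_loss(p_event: float, lb: float, ub: float,
                  rng: random.Random) -> float:
    """Zero if the event does not occur; otherwise one lognormal draw."""
    if rng.random() >= p_event:
        return 0.0
    mu, sigma = lognormal_params(lb, ub)
    return math.exp(NormalDist(mu, sigma).inv_cdf(rng.random()))

# The 1% tail check for the $1M-$9M CI mentioned in the text:
mu, sigma = lognormal_params(1e6, 9e6)
p99 = math.exp(NormalDist(mu, sigma).inv_cdf(0.99))
print(f"1% of event losses exceed ${p99:,.0f}")  # about $14.2 million
```

Computing the 99th percentile this way is a quick sanity check on whether a stated 90% CI implies tail losses you actually believe.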
For a large number of events and impacts, we could make a table like Table 3.2 to simulate all of the losses for all of the events. Table 3.4 is like Table 3.2 but with an additional column on the right showing a single randomly generated result. An example is provided for download at www.howtomeasureanything.com/cybersecurity.
The value of interest in this particular trial shown in Table 3.4 is the total losses: $23,345,193. All you have to do now is run a few thousand more trials to see what the distribution of losses will be. Every time you recalculate this table you would see a different value come up in the total. (If you are an MS Office user on a PC, “recalculate” should be your F9 key.) If you could somehow record every result in a few thousand trials, then you have the output of a Monte Carlo simulation.
The easiest way to do this in Excel is with a “data table” in the “What‐If Analysis” tools. You can run as many trials as you like and show each individual result without having to copy the result in Table 3.4 thousands of times. The data table lets the Excel user see what a series of answers would look like in a formula if you could change one input at a time. For example, you might have a very big spreadsheet for computing retirement income that includes current savings rates, market growth, and several other factors. You might want to see how the estimate of retirement income changes if you modified your monthly savings from $100 to $5,000 in $100 increments. A data table would automatically show all of the results as if you manually changed that one input each time yourself and recorded the result. The spreadsheet you can download at www.howtomeasureanything.com/cybersecurity uses this method.
TABLE 3.4 The One‐for‐One Substitution with Random Scenarios
Attack Vector | Loss Type | Annual Probability of Loss | 90% CI of Impact: Lower Bound | 90% CI of Impact: Upper Bound | Random Result (zero when the event did not occur) |
---|---|---|---|---|---|
Ransomware | Data breach | 0.05 | $100,000 | $10,000,000 | $8,456,193 |
Ransomware | Business disruption | 0.01 | $200,000 | $25,000,000 | 0 |
Ransomware | Extortion | 0.03 | $100,000 | $15,000,000 | 0 |
Business email compromise | Fraud | 0.05 | $250,000 | $30,000,000 | 0 |
Cloud compromise | Data breach | 0.1 | $200,000 | $2,000,000 | 0 |
Etc. | Etc. | Etc. | Etc. | Etc. | Etc. |
Total: | $23,345,193 |
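The row-by-row logic of Table 3.4 can be sketched in Python. This is a minimal illustration, not the book's spreadsheet: it assumes each impact follows a lognormal distribution whose 5th and 95th percentiles match the 90% CI bounds (one common choice for loss distributions), and the `risks` list simply restates the table's example rows.

```python
import math
import random

# Risk register rows from Table 3.4:
# (attack vector, loss type, annual probability, CI lower bound, CI upper bound)
risks = [
    ("Ransomware", "Data breach",           0.05, 100_000, 10_000_000),
    ("Ransomware", "Business disruption",   0.01, 200_000, 25_000_000),
    ("Ransomware", "Extortion",             0.03, 100_000, 15_000_000),
    ("Business email compromise", "Fraud",  0.05, 250_000, 30_000_000),
    ("Cloud compromise", "Data breach",     0.10, 200_000,  2_000_000),
]

def one_trial(risks, rng=random):
    """Simulate one year: each event occurs with its annual probability; if it
    occurs, draw a loss from a lognormal whose 5th/95th percentiles match the
    90% CI bounds (an assumed distribution, for illustration). Returns the
    total loss for the simulated year."""
    total = 0.0
    for _, _, p, lower, upper in risks:
        if rng.random() < p:  # did the event happen this year?
            mu = (math.log(lower) + math.log(upper)) / 2
            sigma = (math.log(upper) - math.log(lower)) / 3.29  # 3.29 ≈ 2 × 1.645
            total += rng.lognormvariate(mu, sigma)
    return total

# 10,000 simulated years; each entry is one "Total" like Table 3.4's
losses = [one_trial(risks) for _ in range(10_000)]
```

Each call to `one_trial` plays the role of one press of F9; the `losses` list is the recorded output of the simulation.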
If you want to find out more about data tables in general, the help pages for Excel can take you through the basics, but we do make one modification in the example spreadsheet. Usually, you will need to enter either a “column input cell” or “row input cell” (we would just use “column input cell” in our example) to identify which value the data table will be repeatedly changing to produce different results. In this case, we don't really need to identify an input to change because we already have the rand() function that changes every time we recalculate. So our “input” values are just arbitrary numbers counting from 1 to the number of scenarios we want to run.
There is an important consideration for the number of trials when simulating cybersecurity events. We are often concerned with rare but high‐impact events. If an event likelihood is only 1% each year, then 10,000 trials will produce about 100 of these events, but the exact count varies randomly from simulation to simulation (just as flipping a coin 100 times doesn't necessarily produce exactly 50 heads). In this case the count will fall between 84 and 116 about 90% of the time.
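The 84-to-116 range quoted above comes from the binomial distribution of event counts. It can be checked exactly with a few lines of Python using only the standard library:

```python
import math

def binom_pmf(n, k, p):
    """Exact binomial probability of k successes in n trials,
    computed via log-gamma to avoid overflowing large factorials."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

n, p = 10_000, 0.01                  # 10,000 trials, 1% annual event probability
expected = n * p                     # 100 events on average
sd = math.sqrt(n * p * (1 - p))      # standard deviation, about 9.95

# Probability the simulated event count lands between 84 and 116 inclusive
prob = sum(binom_pmf(n, k, p) for k in range(84, 117))
print(f"mean={expected:.0f}, sd={sd:.2f}, P(84 <= count <= 116) = {prob:.3f}")
```

The printed probability comes out close to 0.90, matching the claim in the text.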
Now, for each of these times the event occurs, we have to generate a loss. If that loss has a long tail, there may be a significant variation each time the Monte Carlo simulation is run. By "long tail," we mean that losses far larger than the average loss are still realistically possible. We could have, for example, a distribution for a loss where the most likely outcome is a $100,000 loss, but there is a 1% chance of a major loss ($50 million or more). The one‐percentile worst‐case scenario in a risk that has only a 1% chance per year of occurrence in the first place is a situation that would happen with a probability of 1/100 × 1/100, or 1 in 10,000, per year. Since 10,000 is our number of trials, we could run a simulation where this worst‐case event occurred one or more times in 10,000 trials, and we could run another 10,000 trials where it never occurred at all. This means that every time you ran a Monte Carlo simulation, you would see the average total loss jump around a bit.
The simplest solution to this problem for the Monte Carlo modeler who doesn't want to work too hard is to throw more trials at it. This simulation‐to‐simulation variation would shrink if we ran 100,000 trials or a million. You might be surprised at how little time this takes in Excel on a decently fast machine. We've run 100,000 trials in a few seconds using Excel, which doesn't sound like a major constraint. We have even run a million scenarios—in plain ol’ native Excel—of models with several large calculations in 15 minutes or less. As events get rarer and bigger, however, there are more efficient methods available than just greatly increasing the number of trials. But for now we will keep it simple and just throw cheap computing power at the problem.
Now we have a method of generating thousands of outcomes in a Monte Carlo simulation using native Excel—no add‐ins or Visual Basic code to run. Given that Excel is so widely used, it is almost certain any cybersecurity analyst has the tools to use this. We can then use this data for another important element of risk analysis—visualizing the risk quantitatively.
The risk matrix familiar to anyone in cybersecurity is widely used because it appears to serve as a simple illustration of likelihood and impact on one chart. In our proposed simple solution, we have simply replaced the likelihood scale with an explicit probability and the impact with a 90% CI representing a range of potential losses.
In our proposed solution, the vertical axis can still be represented by a single point—a probability as opposed to a score. But now the impact is represented by more than a single point. If we say that an event has a 5% chance of occurrence, we can't just say the impact will be exactly $10 million. There is really a 5% chance of losing something, while perhaps there is a 2% chance of losing more than $5 million, a 1% chance of losing more than $15 million, and so on.
This amount of information cannot be plotted with a single point on a risk matrix. Instead, we can represent this with a chart called a "loss exceedance curve," or LEC. In the spirit of not reinventing the wheel (as risk managers in many industries have done many times), this is a concept also used in financial portfolio risk assessment, actuarial science, and what is known as "probabilistic risk assessment" in nuclear power and other areas of engineering. In these other fields, it is also variously referred to as a "probability of exceedance" or even a "complementary cumulative distribution function." Figure 3.2 shows an example of an LEC.
Figure 3.2 shows the chance that a given amount would be lost in a given period of time (e.g., a year) due to a particular category of risks. This curve can be constructed entirely from the data generated in the previous data table example (Table 3.5). A risk curve could be constructed for a particular vulnerability, system, business unit, or enterprise. An LEC can show how a range of losses is possible (not just a point value) and that larger losses are less likely than smaller ones. In the example shown in Figure 3.2 (which, given the scale, would probably be enterprise‐level cybersecurity risks for a large organization), there is a 40% chance per year of losing $10 million or more. There is also about a 15% chance of losing $100 million or more. A logarithmic scale is used on the horizontal axis to better show a wider range of losses (but that is just a matter of preference—a linear scale can be used, too).
We can also create another variation of this chart by adding a couple of additional curves. Figure 3.3 shows three curves: inherent risk, residual risk, and risk tolerance. Inherent versus residual risk is a common distinction made in cybersecurity to represent risks before the application of proposed controls (i.e., methods of mitigating risks) and risks after the application of controls, respectively. Inherent risk, however, doesn't have to mean a complete lack of controls, since this is not a realistically viable alternative. Inherent risk might be defined instead as including only minimal required controls. Those are controls where it would be considered negligent to exclude them, so there really is no dilemma about whether to include them. The difference between inherent and residual risk then consists only of truly discretionary controls—the sorts of controls where excluding them would be considered a reasonable option. Examples of minimal controls might include password protection, firewalls, some required frequency of updating patches, and limiting certain types of access to administrators. The organization can make its own list of minimal controls. If a control is considered a required minimum, then there is no dilemma, and where there is no dilemma, there is no value to decision analysis. So it is helpful to focus our attention on controls where having them or not are both reasonable alternatives.
The LEC provides a simple and useful visual method for comparing a risk to our tolerance for risk. An LEC is compared to another curve, which represents what our management considers to be its maximum bearable appetite for risk. Like the LEC, the curve that represents our tolerance for risk is expressed unambiguously and quantitatively.
We will use the terms “risk appetite” and “risk tolerance” interchangeably. We should note that there is a large body of work in decision science that has some specific mathematical definitions for risk tolerance. Those slightly more advanced methods are very useful for making trade‐offs between risk and return. (Hubbard uses them routinely when he analyzes large, risky investments for clients.) But we will use this risk tolerance curve to express a kind of maximum bearable pain, regardless of potential benefits.
As Figure 3.3 shows, part of the inherent risk curve (shown in the thicker curve) is above the risk tolerance curve (shown as the dashed curve). The part of the inherent risk curve that is over the risk tolerance curve is said to “violate” or “break” the risk tolerance. The residual risk curve, on the other hand, is on or underneath the risk tolerance curve at all points. If this is the case, we say that the risk tolerance curve “stochastically dominates” the residual risk curve. This simply means that the residual risks are acceptable. We will talk about a simple process for defining a risk tolerance curve shortly, but first we will describe how to generate the other curves from the Monte Carlo simulation we just ran.
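The dominance check itself is simple arithmetic: sample each curve at the same loss levels and see whether the risk curve ever rises above the tolerance curve. A sketch with purely illustrative numbers (not taken from Figure 3.3):

```python
# Each curve gives the probability (per year) of losing the given amount
# or more, sampled at the same loss levels. All values are hypothetical.
loss_levels = [1e6, 5e6, 10e6, 50e6, 100e6]
tolerance   = [0.50, 0.25, 0.10, 0.02, 0.010]  # maximum acceptable risk
inherent    = [0.60, 0.35, 0.12, 0.03, 0.010]  # before discretionary controls
residual    = [0.45, 0.20, 0.08, 0.02, 0.005]  # after discretionary controls

def violates(risk_curve, tolerance_curve):
    """True if the risk curve is above the tolerance curve anywhere, i.e.,
    the tolerance curve does not stochastically dominate the risk curve."""
    return any(r > t for r, t in zip(risk_curve, tolerance_curve))

print(violates(inherent, tolerance))   # True: inherent risk breaks tolerance
print(violates(residual, tolerance))   # False: residual risk is acceptable
```

In practice the curves would come from the Monte Carlo output and the elicited tolerance points, but the comparison is exactly this pointwise check.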
Remember, as with the other methods used in this chapter, the downloadable spreadsheet at the previously mentioned website shows all the technical details. As Table 3.6 shows, the histogram has two columns, one showing a series of loss levels. These would be the values shown on the horizontal axis of an LEC. The second column shows the percentage of Monte Carlo results that generated something equal to or higher than the value in the first column. The simplest method involves using a "countif()" function in Excel. If we use "Monte Carlo Results" to stand for the second column of values in Table 3.5 and "Loss" to mean the cell in the spreadsheet in the loss column of the same row as the following formula, we get:

=COUNTIF(Monte Carlo Results, ">" & Loss)/10000
The countif() function does what it sounds like. It counts the number of values in a defined range that meet a stated condition. If countif() returns 8,840 for a given range and if "Loss" is equal to $2 million, then that means there are 8,840 values in the range greater than $2 million. Dividing by 10,000 turns the result into a value between 0 and 1 (0% and 100%) for the 10,000 trials in the Monte Carlo simulation. As this formula is applied to larger and larger values in the loss column, the percentage of values in the simulation that exceed that loss level will decrease.
Now, we simply create an XY scatterplot on these two columns in Excel. If you want to make it look just like the LEC that has been shown, you will want to use the version of the scatterplot that interpolates points with curves and without markers for the point. This is the version set up in the spreadsheet. More curves can be added simply by adding more columns of data. The residual risk curve, for example, is just the same procedure but based on your estimated probabilities and impacts (which would presumably be smaller) after your proposed additional controls are implemented.
TABLE 3.6 Histogram for a Loss Exceedance Curve
| Loss | Probability per Year of Loss or Greater |
|---|---|
| $ − | 99.9% |
| $500,000 | 98.8% |
| $1,000,000 | 95.8% |
| $1,500,000 | 92.6% |
| $2,000,000 | 88.4% |
| $2,500,000 | 83.4% |
| $3,000,000 | 77.5% |
| … | … |
| $24,000,000 | 3.0% |
| $24,500,000 | 2.7% |
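The countif() procedure behind Table 3.6 translates directly to Python. The loss data below are hypothetical placeholders standing in for the Monte Carlo column, not the book's simulation:

```python
import random

def exceedance_curve(losses, loss_levels):
    """For each loss level, the fraction of simulated results exceeding it.
    Excel equivalent: =COUNTIF(Monte Carlo Results, ">" & Loss)/10000."""
    n = len(losses)
    return [(level, sum(1 for x in losses if x > level) / n)
            for level in loss_levels]

# Illustrative stand-in data: 10,000 simulated annual losses, where a loss
# occurs in about 90% of years (assumed parameters, for demonstration only)
random.seed(1)
losses = [random.lognormvariate(14, 1.4) if random.random() < 0.9 else 0.0
          for _ in range(10_000)]

levels = [0, 500_000, 1_000_000, 1_500_000, 2_000_000]
curve = exceedance_curve(losses, levels)
for level, prob in curve:
    print(f"${level:>12,.0f}  {prob:6.1%}")
```

Plotting the resulting pairs as an interpolated XY chart reproduces the LEC described above; by construction the probabilities can only fall as the loss level rises.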
One disadvantage of the LEC chart is that if multiple LECs are shown, it can get very busy‐looking. While in the typical risk matrix, each risk is shown as a single point (although an extremely unrealistic and ambiguous point), an LEC is a curve. This is why one organization that produced a chart with a large number of LECs called it a “spaghetti chart.” However, this complexity was easily managed just by having separate charts for different categories. Also, since the LECs can always be combined in a mathematically proper way, we can have aggregate LEC charts where each curve on that chart could be decomposed into multiple curves shown on a separate, detailed chart for that curve. This is another key advantage of using a tool such as an LEC for communicating risks. We provide a spreadsheet on the book's website to show how this is done.
Now, compare this to popular approaches in cybersecurity risk assessment. The typical low/medium/high approach lacks the specificity to say that “seven lows and two mediums are riskier than one high” or “nine lows add up to one medium,” but this can be done with LECs. Again, we need to point out that the high ambiguity of the low/medium/high method in no way saves the analyst from having to think about these things. With the risk matrix, the analyst is just forced to think about risks in a much more ambiguous way.
What we need to do is create another table like Table 3.5 and roll it up into another table like Table 3.6, but where each value in the table is the total of several simulated categories of risks. Again, we have a downloadable spreadsheet for this at www.howtomeasureanything.com/cybersecurity. We run another 10,000 trials on all the risks we want to add up, and we follow the LEC procedure for the totals. You might think that we could just take separately generated tables like Table 3.6 and add up the number of values that fall within each bin to get an aggregated curve, but that would produce an incorrect answer unless the risks are perfectly correlated. (We'll skip over the details of why this is the case, but a little experimentation would prove it to you if you wanted to see the difference between the two procedures.) Following this method, we can see the risks of a system from several vulnerabilities, the risks of a business unit from several systems, and the risks across the enterprise for all business units.
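The distinction can be seen in a few lines: the correct procedure totals the losses within each trial and then computes exceedance on those totals, rather than combining separately built exceedance tables. A sketch with two hypothetical independent risks:

```python
import random

random.seed(2)
N = 10_000

# Two hypothetical independent risks, simulated trial by trial
# (assumed probabilities and lognormal parameters, for illustration)
risk_a = [random.lognormvariate(13, 1.0) if random.random() < 0.10 else 0.0
          for _ in range(N)]
risk_b = [random.lognormvariate(14, 1.2) if random.random() < 0.05 else 0.0
          for _ in range(N)]

def p_exceed(samples, level):
    return sum(1 for x in samples if x > level) / len(samples)

level = 1_000_000
# Correct: total the two losses within each trial, then measure exceedance
correct = p_exceed([a + b for a, b in zip(risk_a, risk_b)], level)
# Shortcut: combine the two separately computed exceedance probabilities
shortcut = p_exceed(risk_a, level) + p_exceed(risk_b, level)
print(correct, shortcut)
```

The two numbers disagree because the shortcut ignores how losses combine within a year: it double-counts years where both risks exceed the level and misses years where neither does individually but the total does.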
Ideally, the risk tolerance curve is gathered in a meeting with a level of management that is in a position to state, as a matter of policy, how much risk the organization is willing to accept. Hubbard has gathered risk tolerance curves of several types (LEC is one type of risk tolerance quantification) from many organizations, including multiple cybersecurity applications. The required meeting is usually done in about 90 minutes. It involves simply explaining the concept to management and then asking them to establish a few points on the curve. We also need to identify which risk tolerance curve we are capturing (e.g., the per‐year risk for an individual system, the per‐decade risk for the entire enterprise). But once we have laid the groundwork, we could simply start with one arbitrary point and ask the following:
Analyst: Would you accept a 10% chance, per year, of losing more than $5 million due to a cybersecurity risk?

Executive: I prefer not to accept any risk.

Analyst: Me too, but you accept risk right now in many areas. You could always spend more to reduce risks, but obviously there is a limit.

Executive: True. I suppose I would be willing to accept a 10% chance per year of a $5 million loss or greater.

Analyst: How about a 20% chance of losing more than $5 million in a year?

Executive: That feels like pushing it. Let's stick with 10%.

Analyst: Great, 10% then. Now, how much of a chance would you be willing to accept for a much larger loss, like $50 million or more? Would you say even 1%?

Executive: I think I'm more risk averse than that. I might accept a 1% chance per year of a loss of $25 million or more…
And so on. After plotting three or four points, we can interpolate the rest and give it to the executive for final approval. It is not a technically difficult process, but it is important to know how to respond to some potential questions or objections. Some executives may point out that this exercise feels a little abstract. In that case, give them some real‐life examples, from their firm or from other firms, of losses of various sizes and how often those happen.
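Once a few points are elicited, the rest of the curve can be filled in by interpolation. The sketch below interpolates linearly on the logarithm of the loss, which is one reasonable smoothing choice given the wide range of loss levels (the book does not prescribe a specific method), using hypothetical elicited points:

```python
import math

# Hypothetical points elicited from the executive:
# (loss level, maximum acceptable annual probability of that loss or more)
elicited = [(1_000_000, 0.25), (5_000_000, 0.10), (25_000_000, 0.01)]

def tolerance(loss):
    """Risk tolerance at any loss level, interpolated between elicited
    points linearly in log-loss; flat beyond the outermost points."""
    pts = sorted(elicited)
    if loss <= pts[0][0]:
        return pts[0][1]
    if loss >= pts[-1][0]:
        return pts[-1][1]
    for (x0, p0), (x1, p1) in zip(pts, pts[1:]):
        if x0 <= loss <= x1:
            w = (math.log(loss) - math.log(x0)) / (math.log(x1) - math.log(x0))
            return p0 + w * (p1 - p0)

print(tolerance(5_000_000))    # an elicited point: exactly 0.10
print(tolerance(10_000_000))   # interpolated between 10% and 1%
```

The interpolated curve can then be plotted alongside the inherent and residual LECs for the executive's final approval.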
Also, some may prefer to consider such a curve only for a given cybersecurity budget—as in, "That risk is acceptable depending on what it costs to avoid it." This is also a reasonable concern. You could, if the executive were willing to spend more time, state different risk tolerances at different expenditure levels for risk avoidance. There are ways to capture risk‐return trade‐offs (see Hubbard's previous books, The Failure of Risk Management and How to Measure Anything, for details on this method). But most seem willing to consider the idea that there is still a maximum acceptable risk, and this is what we are attempting to capture.
It is also worth noting that the authors have had many opportunities to gather risk tolerance curves from upper management in organizations for problems in cybersecurity as well as other areas. If your concern is that upper management won't understand this, we can say we have not observed this—even when we've been told that management wouldn't understand it. In fact, upper management seems to understand having to determine which risks are acceptable at least as well as anyone in cybersecurity.
We will bring up concerns about adoption by management again in Chapter 5 when we discuss this and other illusory obstacles to adopting quantitative methods. In later chapters, we will also discuss other ways to think about risk tolerance.
There are many models cybersecurity analysts could use to represent their current state of uncertainty. You could simply estimate likelihood and impact directly without further decomposition. You could develop a modeling method that determines how likelihood and impact are modified by specifying types of threat, the capabilities of threats, vulnerabilities, or the characteristics of systems. You can list applications and evaluate risk by application, or you can list controls and assess the risks that individual controls would address.
Ultimately, this book will be agnostic regarding which modeling strategy you use, but we will discuss how various modeling strategies should be evaluated. When enough information is available to justify the industry adopting a single, uniform modeling method, then it should do so. Until then, we should let different organizations adopt different modeling techniques while taking care to measure the relative performance of these methods.
To get started, there are some ready‐made solutions to decompose risks. In addition to the methods Hubbard uses, solutions include the methodology and tools used by the FAIR method, developed by Jack Jones and Jack Freund.3 In the authors' opinion, FAIR, as another Monte Carlo–based solution with its own variation on how to decompose risks into further components, could be a step in the right direction for your firm. We can also build quite a lot on the simple tools we've already provided (with more to come in later chapters). Readers who have had even basic programming, math, or finance courses may be able to add more detail without much trouble, so most readers will be able to extrapolate from this as much as they see fit. The website will also add a few more tools for parts of this, such as R and Python versions, for those who are interested. But since everything we are doing in this book can be handled entirely within Excel, any of these tools would be optional.
So far, this method is still only a very basic solution based on expert judgment—updating this initial model with data using statistical methods comes later. Still, even at this stage there are advantages to a method like this when compared to the risk matrix. It can capture more details about the knowledge of the cybersecurity expert, and it gives us access to much more powerful analytical tools. If we wanted, we could do any or all of the following:
We will introduce more about each of these improvements and others later in the book, but we have demonstrated what a simple rapid risk audit and one‐for‐one substitution would look like. Now we can turn our attention to evaluating alternative methods for assessing risk. Of all the methods we could have started with in this simple model and for all the methods we could add to it, how do we select the best for our purposes? Or, for that matter, how do we know it works at all?