Chapter 1. Navigating Complex Systems

In the first part of this chapter we explore the problems with complex systems. Chaos Engineering was born of necessity in a complex distributed software system. It specifically addresses the needs of operating a complex system, namely that such systems are non-linear,1 which makes them unpredictable, and that unpredictability leads to undesirable outcomes. This is often uncomfortable for us as engineers, because we like to think that we can plan our way around uncertainty. We are often tempted to blame undesirable behaviors on the people who build and operate the systems, but in fact surprises are a natural property of complex systems. Later in the chapter we ask whether we can extricate the complexity from the system, and in doing so extricate the undesirable behaviors along with it. (Spoiler: no, we cannot.)

The sections of this chapter are laid out in the natural order in which experienced engineers and architects learn to navigate complexity: contemplate it, encounter it, confront it, and finally embrace it.

Contemplating Complexity

Before you can decide whether Chaos Engineering makes sense for your system, you need to understand where to draw the line between simple and complex. One way to characterize a system is by how changes to its input correspond to changes in its output. Simple systems are often described as linear: a change to the input of a linear system produces a proportional change in its output. Many natural phenomena are familiar linear systems. The harder you throw a ball, the farther it goes.

Non-linear systems have output that varies wildly based on changes to the constituent parts. The bullwhip effect is an example from Systems Thinking2 that visually captures this interaction: a flick of the wrist (small change in system input) results in the far end of the whip covering enough distance in an instant to break the speed of sound and create the cracking sound that whips are known for (big change in system output).

Non-linear effects can take various forms: changes to system parts can cause exponential changes in output, like social networks that grow faster when they are big than when they are small; or they can cause quantum changes in output, like applying increasing force to a dry stick, which doesn't move until it suddenly breaks; or they can cause seemingly random output, like an upbeat song that might inspire someone during their workout one day and bore them the next.

Linear systems are obviously easier to predict than non-linear systems. It is often relatively easy to intuit the output of a linear system, particularly after interacting with one of the parts and experiencing the linear output. For this reason, we can say that linear systems are simple systems. In contrast, non-linear systems exhibit unpredictable behavior, particularly when several non-linear parts coexist. Overlapping non-linear parts can cause system output to increase up to a point, and then suddenly reverse course, and then just as suddenly stop altogether. We say these non-linear systems are complex.

Another way we can characterize systems is less technical and more subjective, but probably more intuitive. A simple system is one in which a person can comprehend all of the parts, how they work, and how they contribute to the output. A complex system by contrast has so many moving parts, or the parts change so quickly, that no person is capable of holding a mental model of it in their head.

Table 1-1 summarizes these characterizations below.

Simple Systems        | Complex Systems
Linear                | Non-linear
Predictable output    | Unpredictable behavior
Comprehensible        | Impossible to build a complete mental model

Table 1-1: Properties of Simple Systems vs Complex Systems

Looking at the accumulated characteristics of complex systems, it’s easy to see why traditional methods of exploring system safety are inadequate. Non-linear output is difficult to simulate or accurately model. The output is unpredictable. People can’t mentally model them.

In the world of software, it’s not unusual to work with complex systems that exhibit these characteristics. In fact, a consequence of the Law of Requisite Variety3 is that any control system must have at least as much complexity as the system that it controls. Since most software involves writing control systems, the great bulk of building software increases complexity over time. If you work in software and don’t work with complex systems today, it’s increasingly likely that you will at some point.

One consequence of the increase in complex systems is that the traditional role of software architect becomes less relevant over time. In simple systems, one person, usually an experienced engineer, can orchestrate the work of several engineers. The role of the architect evolved because that person could mentally model the entire system and know how all the parts fit together. They can act as a guide and planner for how the functionality is written and how the technology unfolds in a software project over time.

In complex systems, we acknowledge that one person can't hold all of the pieces in their head. This means that software engineers need greater involvement in the design of the system. Historically, engineering has been a bureaucratic profession: some people decide what work needs to be done, others decide how and when it will be done, and still others do the actual work. In complex systems, that division of labor is counterproductive, because the people with the most context are the ones doing the actual work. The role of architects, and the associated bureaucracy, becomes less efficient. Complex systems encourage non-bureaucratic organizational structures in order to effectively build, interact with, and respond to them.

Encountering Complexity

The unpredictable, incomprehensible nature of complex systems presents new challenges. Below we give three examples of outages caused by complex interactions. In each of these cases, we would not expect a reasonable engineering team to anticipate the undesirable interaction in advance.

Example 1: Mismatch between Business-Logic and Application-Logic

Consider the microservice architecture described below and illustrated in Fig 1-1. In this system, we have four components:

  1. Service ‘P’ stores personalized information. An ID represents a person and some metadata associated with that person. For simplicity, the metadata stored is never very large, and people are never removed from the system. P passes data to Q to be persisted.
  2. Service ‘Q’ is a generic storage service used by several upstream services. It stores data in a persistent database for fault-tolerance and recovery, and in a memory-based cache database for speed.
  3. Service ‘S’ is a persistent storage database, perhaps a columnar storage system like Cassandra or DynamoDB.
  4. Service ‘T’ is an in-memory cache, perhaps something like Redis or Memcached.
Figure 1-1. Diagram of microservice components showing flow of requests coming in to P and proceeding through storage.

To add some rational fallbacks to this system, the teams responsible for each component anticipate failures. Service Q writes data to both services, S and T. When retrieving data, it reads from Service T first, since that is quicker. If the cache fails for some reason, it reads from Service S instead. If both Service T and Service S fail, it can return a default database response upstream.
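Concretely, Q's read path might boil down to something like the following sketch. The class names, exception type, and default response here are hypothetical stand-ins for illustration, not the actual implementation.

class Unavailable(Exception):
    """Raised by a backend that cannot serve a request."""

DEFAULT_RESPONSE = {"status": 404, "body": None}  # Q's "nothing found" default

def q_read(key, cache_t, store_s):
    """Try the in-memory cache T, then the persistent store S, then give up gracefully."""
    for backend in (cache_t, store_s):        # T first (fast), then S (durable)
        try:
            return backend.get(key)
        except Unavailable:
            continue                          # this backend failed; try the next one
    return DEFAULT_RESPONSE                   # both failed: degrade with a default

class DownBackend:
    """Toy backend that is always unavailable, to exercise the fallback path."""
    def get(self, key):
        raise Unavailable()

print(q_read("person-42", DownBackend(), DownBackend()))
# -> {'status': 404, 'body': None}: the same default that later confuses P

Each branch here is individually reasonable; the trouble comes from what that final default looks like to the caller.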

Likewise, Service P has rational fallbacks. If Q times out, or returns an error, then P can degrade gracefully by returning a default response. For example, P could return un-personalized metadata for a given person if Q is failing.

Figure 1-2. The in-memory cache T fails, causing the fallback in Q to rely on responses from the persistent storage database S.

One day, T fails (Fig 1-2). Lookups to P start to slow down, because Q notices that T is no longer responding, and so it switches to reading from S. Unfortunately for this setup, it’s common for systems with large caches to have read-heavy workloads. In this case, T was handling the read load quite well because reading directly from memory is fast, but S is not provisioned to handle this sudden workload increase. S slows down and eventually fails. Those requests time out.

Fortunately, Q was prepared for this as well, and so it returns a default response. In the particular version of Cassandra in use, looking up a data object when all three replicas are unavailable returns a 404 [Not Found] response code, so Q emits a 404 to P.

Figure 1-3. With T unresponsive, and S unable to handle the load of the read-heavy workload, Q returns a default response to P.

P knows that the person it is looking up exists because it has an ID. People are never removed from the service. The 404 [Not Found] response that P receives from Q is therefore an impossible condition by virtue of the business logic. (Fig 1-3) P could have handled an error from Q, or even a lack of response, but it has no condition to catch this impossible response. P crashes, taking down the entire system with it. (Fig 1-4)
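From P's side, the gap is easy to miss. A hedged sketch of what P's lookup might look like, again with invented names, shows how a response that is neither an error nor a timeout slips past every branch P thought to write:

class QTimeout(Exception):
    """Raised when Q does not answer in time."""

UNPERSONALIZED_METADATA = {"name": None, "preferences": []}   # P's default

def p_lookup(person_id, q_client):
    """Look up a person in Q, degrading gracefully on errors and timeouts."""
    try:
        response = q_client.get(person_id)
    except QTimeout:
        return UNPERSONALIZED_METADATA        # planned fallback: Q did not answer
    if response["status"] == 200:
        return response["body"]               # the normal, happy path
    if response["status"] >= 500:
        return UNPERSONALIZED_METADATA        # planned fallback: Q errored
    # A 404 for an ID that must exist is "impossible" per the business logic,
    # so no branch handles it -- P blows up instead of degrading.
    raise RuntimeError(f"impossible response for {person_id}: {response}")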

Figure 1-4. The default “404 [Not Found]” response from Q seems logically impossible to P, causing it to fail catastrophically.

What is at fault in this scenario? The entire system going down is obviously undesirable system behavior. This is a complex system, where we allow that no person can hold all of the moving parts in mind. Each of the respective teams that own P, Q, S, and T made reasonable design decisions. They even went an extra step to anticipate failures, catch those cases, and degrade gracefully. So what is to blame?

No one is at fault and no service is at fault. There is nothing to blame. This is a well-built system. It would be unreasonable to expect that the engineers should have anticipated this failure, since the interaction of the components exceeds the capability of any human to hold all of the pieces in their head, and inevitably leads to gaps in assumptions of what other humans on the team may know. The undesirable output from this complex system is an outlier, produced by non-linear contributing factors.

Let’s look at another example.

Example 2: Customer-Induced Retry Storm

Consider the following snippet of a distributed system from a movie streaming service. (Fig 1-5) In this system, we have two main subsystems:

System ‘R’ stores a personalized user interface. Given an ID that represents a person, it will return a user interface customized to the movie preferences of that individual. R calls S for additional information about each person.

System ‘S’ stores a variety of information about people, such as whether they have a valid account and what they are allowed to watch. This is too much data to fit on one instance or virtual machine, so S splits the work of accessing, reading, and writing the data across two subcomponents:

‘S-L’ is a load balancer that uses a consistent hash algorithm to distribute the read-heavy load to the S-D components.

‘S-D’ is a storage unit that has a small sample of the full data set. For example, one instance of S-D might have information about all of the people whose names start with the letter “m” whereas another might store all of the people whose names start with the letter “p.”4
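As a rough illustration of that routing, here is a minimal consistent-hash sketch with made-up node names. Per the footnote, real implementations spread data pseudo-randomly around the ring rather than by first letter; the point is only that the same ID always lands on the same node.

import bisect
import hashlib

def _hash(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def build_ring(nodes, points_per_node=100):
    """Place several points per node on a hash ring and keep them sorted."""
    return sorted((_hash(f"{node}:{i}"), node)
                  for node in nodes
                  for i in range(points_per_node))

def node_for(ring, person_id):
    """Walk clockwise from the ID's hash to the next node point on the ring."""
    index = bisect.bisect(ring, (_hash(person_id), "")) % len(ring)
    return ring[index][1]

ring = build_ring(["S-D-M", "S-D-N", "S-D-O"])
print(node_for(ring, "some-person-id"))   # same ID, same node, every time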

Figure 1-5. Request path from R to S-L to S-D-N for user Louis’ data.

The team that maintains this system has experience with distributed systems and industry norms in cloud deployment. That experience shows up in measures like rational fallbacks. If R can't retrieve information about a person from S, it has a default user interface. Both teams are also conscientious about cost, so they have scaling policies that keep the clusters appropriately sized. If disk I/O drops below a certain threshold on S-D, for example, S-D will hand off data from the least busy node and shut that node down, and S-L will redistribute that workload to the remaining nodes. S-D data is also held in a redundant on-node cache, so if the disk is slow for some reason, a slightly stale result can be returned from the cache. Alerts are set to trigger on increased error ratios; outlier detection will restart instances behaving oddly; and so on.
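That cost-driven scale-down policy becomes important later, so here is a hedged sketch of what it might reduce to. The metric names and thresholds are invented for illustration; the relevant property is that the policy cannot tell a genuinely idle node from one that is busy but serving cheaply from its cache.

DISK_IO_FLOOR = 0.20   # fraction of provisioned IOPS (illustrative threshold)
CPU_FLOOR = 0.30       # fraction of CPU (illustrative threshold)

def should_scale_down(node_metrics):
    """A node that looks under-used is a candidate for shutdown and handoff."""
    return (node_metrics["disk_io"] < DISK_IO_FLOOR
            and node_metrics["cpu"] < CPU_FLOOR)

def rightsize(cluster):
    """cluster: mapping of node name -> metrics dict."""
    least_busy = min(cluster, key=lambda n: cluster[n]["disk_io"])
    if should_scale_down(cluster[least_busy]):
        return f"hand off data from {least_busy} and shut it down"
    return "cluster is right-sized; no action"

print(rightsize({
    "S-D-N": {"disk_io": 0.05, "cpu": 0.10},   # serving from cache: looks idle
    "S-D-M": {"disk_io": 0.60, "cpu": 0.55},
}))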

One day, a customer whom we will call Louis is watching streaming video from this service under non-optimal conditions. Specifically, Louis is accessing the system from a web browser on their laptop on a train. At some point something strange happens in the video and surprises Louis. They drop the laptop on the ground, some keys get pressed, and when they set the laptop up again to continue watching, the video is frozen.

Louis does what any sensible customer would do in this situation and hits the ‘refresh’ button 100 times. The calls are queued in the web browser, but at that moment the train is between cell towers, so a network partition prevents the requests from being delivered. When the connection is reestablished, all 100 requests are delivered at once.

Back on the server side, R receives all 100 requests and initiates 100 identical requests to S-L, which uses the consistent hash of Louis’ ID to forward all of them to a specific node in S-D that we will call S-D-N. One hundred requests at once is a significant spike, since S-D-N is used to a baseline of about 50 requests per second: a threefold increase over baseline. Fortunately, rational fallbacks and degradations are in place.

S-D-N can’t serve 150 requests (baseline plus Louis) in one second from disk, so it starts serving requests from the cache. This is significantly faster. As a result, both disk I/O and CPU utilization drop dramatically. At this point, the scaling policies kick in to keep the system right-sized to cost concerns. Since disk I/O and CPU utilization are so low, S-D decides to shut down S-D-N and hand off its workload to a peer node. Or maybe anomaly detection shut off this node; sometimes it’s difficult to say in complex systems.

Figure 1-6. Request path from R to S-L to S-D-M for user Louis’ data after S-D-N shuts down and hands off data.

S-L returns responses to 99 of Louis’ requests, all served from S-D-N’s cache, but the 100th response is lost because the cluster configuration changes as S-D-N shuts down and hands off its data. For this last request, R gets a timeout error from S-L, so it returns a default user interface rather than the personalized user interface for Louis.

Back on their laptop, Louis’ web browser ignores the 99 proper responses and renders the 100th response, which is the default user interface. To Louis, this appears to be another error, since it is not the personalized user interface to which they are accustomed.

Louis does what any sensible customer would do in this situation and hits the ‘refresh’ button another 100 times. This time the process repeats, but S-L forwards the requests to S-D-M, which took over from S-D-N. Unfortunately, the data handoff has not completed, so the disk on S-D-M is quickly overwhelmed.

S-D-M switches to serving requests from cache. Repeating the procedure that S-D-N followed, this significantly speeds up requests. Disk I/O and CPU utilization drop dramatically. Scaling policies kick in and S-D decides to shut down S-D-M and hand off its workload to a peer node. (Fig 1-7)

Figure 1-7. Request path from R to S-L to S-D for user Louis’ data after S-D-M and S-D-N both shut down and hand off data.

S-D now has data handoffs in flight for two nodes. Those nodes are responsible not just for Louis but for a percentage of all users. R receives more timeout errors from S-L for that percentage of users, so R returns the default user interface rather than their personalized user interfaces.

Back on their client devices, these users now have a similar experience to Louis. To many of them this appears to be another error, since it is not the personalized user interface to which they are accustomed. They do what any sensible customer would do in this situation and hit the ‘refresh’ button 100 times.

We now have a user-induced retry storm.

The cycle accelerates. S-D shrinks and latency spikes as more nodes are overwhelmed by handoff. S-L struggles to keep up as the request rate from client devices increases dramatically, while timeouts keep the requests it has sent to S-D open longer. Eventually R, holding all of these requests to S-L open even though they will eventually time out, exhausts its thread pool and crashes the virtual machine. The entire service falls over. (Fig 1-8)

Figure 1-8. Request path as S-D is scaled down and R is pummeled by user-induced retry storms.

To make matters worse, the outage causes more client-induced retries, which makes it even more difficult to remediate the issue and bring the service back online to a stable state.

Again we can ask: what is at fault in this scenario? Which component was built incorrectly? In a complex system no person can hold all of the moving parts in their mind. Each of the respective teams that built R, S-L, and S-D made reasonable design decisions. They even went an extra step to anticipate failures, catch those cases, and degrade gracefully. So what is to blame?

As with the prior example, no one is at fault here. There is nothing to blame. Of course with the bias of hindsight, we can improve this system to prevent the scenario described above from happening again. Nevertheless, it would be unreasonable to expect that the engineers should have anticipated this failure. Once again non-linear contributing factors conspired to emit an undesirable result from this complex system.

Let’s look at one more example.

Example 3: Holiday Code Freeze

Consider the following infrastructure setup (Fig 1-9):

Component ‘E’ is a load balancer that simply forwards requests, similar to an elastic load balancer (ELB) on the AWS cloud service.

Component ‘F’ is an API Gateway. It parses some information from the headers, cookies, and path. It uses that information to pattern-match an enrichment policy, for example adding additional headers that indicate which features that user is authorized to access. It then pattern-matches a backend and forwards the request to the backend.

Component ‘G’ is a sprawling mess of backend applications running with various levels of criticality, on various platforms, serving countless functions, to an unspecified set of users.

The team maintaining F has some interesting obstacles to manage. They have no control over the stack or other operational properties of G. Their interface has to be flexible enough to handle many different patterns for matching request headers, cookies, and paths, and to deliver requests to the correct place. The performance profile of G runs the full spectrum, from low-latency responses with small payloads to keep-alive connections that stream large files. None of these factors can be planned for, because the components in G and beyond are themselves complex systems with dynamically changing properties.

F is highly flexible, handling a diverse set of workloads. New features are added to F roughly once per day and deployed to satisfy new use cases for G. To provision such a functional component, the team vertically scales the solution over time to match the increase in use cases for G. Larger and larger boxes allow them to allocate more memory, which takes time to initialize. More and more pattern matching for both enrichment and routing results in a hefty ruleset, which is preparsed into a state machine and loaded into memory for faster access. This too takes time. When all is said and done, these large virtual machines running F each take about 40 minutes to provision, from the time the provisioning pipeline kicks off to the time all of the caches are warm and the instance is running at or near baseline performance.
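To make the pattern matching concrete, here is a heavily simplified sketch of the kind of rule evaluation F performs. The rule shapes, header names, and backend names are invented, and a real ruleset would be preparsed into a state machine rather than scanned linearly.

import re

RULES = [
    {
        "path": re.compile(r"^/video/"),
        "requires_header": ("x-account-tier", "premium"),
        "enrich": {"x-feature-4k": "true"},      # authorization hint added for G
        "backend": "g-streaming",
    },
    {
        "path": re.compile(r"^/api/"),
        "requires_header": None,
        "enrich": {},
        "backend": "g-api",
    },
]

def route(path, headers):
    """Match a request against the ruleset, enrich its headers, pick a backend."""
    for rule in RULES:
        if not rule["path"].match(path):
            continue
        required = rule["requires_header"]
        if required and headers.get(required[0]) != required[1]:
            continue
        enriched = {**headers, **rule["enrich"]}
        return rule["backend"], enriched
    return None, headers                          # no backend matched

print(route("/video/123", {"x-account-tier": "premium"}))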

Because F is in the critical path of all access to G, the team operating it understands that it is a potential single point of failure. They don't just deploy one instance; they deploy a cluster. The number of instances at any given time is chosen so that the entire cluster carries an additional 50% capacity: if ten instances could handle the load, fifteen are running. At any given time, a third of the instances could suddenly disappear and everything should still keep working.

Vertically scaled, horizontally scaled, and overprovisioned: F is an expensive component.

To go above and beyond with regard to availability, the team takes several additional precautions. The CI pipeline runs a thorough set of unit and integration tests before baking an image for the virtual machine. Automated canaries test any new code change on a small amount of traffic before proceeding to a blue/green deployment model that runs a good chunk of the cluster in parallel before completely cutting over to the new version. All pull requests that change F's code undergo a two-reviewer policy, and the reviewers can't be working on the feature being changed, which requires the entire team to stay well informed about all aspects of development in motion.

Finally, the entire organization goes into a code freeze from the beginning of November until January. No changes are allowed during this time unless they are absolutely critical to the safety of the system, since the holidays between Black Friday and New Year's Day are the peak traffic season for the company. Introducing a bug for the sake of a non-critical feature could be catastrophic in this timeframe, so the best way to avoid that possibility is to not change the system at all. Since many people take vacation around this time of year, restricting code deployments also works out well from an oversight perspective.

Then one year an interesting phenomenon occurs. Toward the end of the second week of November, two weeks into the code freeze, the team is paged for a sudden increase in errors from one instance. No problem: that instance is shut down and another is booted up. Over the course of the next 40 minutes, before the new instance becomes fully operational, several other machines experience a similar increase in errors. As new instances are booted to replace those, the rest of the cluster experiences the same phenomenon.

Over the course of several hours, the entire cluster is replaced with new instances running the exact same code. Even with the 50% overhead, a significant number of requests go unserved while the entire cluster is rebooted over such a short interval. This partial outage fluctuates in severity for hours until the provisioning process completes and the new cluster stabilizes.

The team faces a dilemma: in order to troubleshoot the issue they would need to deploy a new version with observability measures focused on a new area of the code. But the code freeze is in full swing, and the new cluster appears by all metrics to be stable. The next week they decide to deploy a small number of new instances with the new observability measures.

Two weeks go by without incident, and then the same phenomenon suddenly occurs again. First a few, then eventually all of the instances see a sudden increase in error rates -- all instances, that is, except those instrumented with the new observability measures.

As with the prior incident, the entire cluster is rebooted over several hours and seemingly stabilizes. The outage is more severe this time since the company is now within its busiest season.

A few days later, the instances instrumented with the new observability measures begin to see the same spike in errors. Thanks to the metrics gathered, it is discovered that an imported library causes a predictable memory leak that scales linearly with the number of requests served. Because the instances are so massive, it takes approximately two weeks for the leak to consume enough memory to deprive other libraries of resources.
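The failure mode itself is mundane. A toy sketch of the kind of per-request leak involved (the numbers and names here are invented; the real leak lived in an imported library):

# A structure that grows on every request and is never evicted. On an instance
# recycled every few days it is invisible; on one left up for two weeks it
# slowly starves everything else of memory.

_request_log = []   # grows forever: the leak

def handle_request(request_id, payload):
    _request_log.append((request_id, payload))   # "just some bookkeeping"
    return {"id": request_id, "ok": True}

def leaked_bytes_per_day(requests_per_second, bytes_per_entry=200):
    return requests_per_second * bytes_per_entry * 86_400

# At a modest 500 requests/second, roughly 8.6 GB leaks per day -- weeks to
# exhaust a very large instance, and invisible until it finally matters.
print(leaked_bytes_per_day(500) / 1e9, "GB per day")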

This bug had been introduced into the code base almost nine months earlier. The phenomenon was never seen before because no instance in the cluster had ever run for more than four days. New features caused new code deployment which cycled in new instances. Ironically, it was a procedure meant to increase safety -- the holiday code freeze -- that caused the bug to manifest in an outage.

Again we ask: what is at fault in this scenario? We can identify the bug in the dependent library, but we learn nothing if we point the blame at an external coder who does not even have knowledge of this project. Each of the team members working on F made reasonable design decisions. They even went an extra step to anticipate failures, stage deployment of new features, overprovision, and ‘be careful’ as much as they could think to do. So who is to blame?

As with both prior examples, no one is at fault here. There is nothing to blame. It would be unreasonable to expect that the engineers should have anticipated this failure. Non-linear contributing factors produced an undesirable and expensive output in this complex system.

Confronting Complexity

The three preceding examples illustrate cases where none of the humans in the loop could reasonably be expected to anticipate the interactions that ultimately led to the undesirable outcome. Humans will still be writing software for the foreseeable future, so taking them out of the loop is not an option. What then can be done to reduce systemic failures like these?

One popular idea is to reduce or eliminate the complexity. Take the complexity out of a complex system, and we will not have the problems of complex systems anymore.

Perhaps if we could reduce these systems to simpler, linear ones, we would even be able to identify who is to blame when something goes wrong. In this hypothetical simpler world, we can imagine that a hyper-efficient, impersonal manager could remove all errors simply by removing the bad apples who create them.

To examine this possible solution, it is helpful to understand a few additional characteristics of complexity. Roughly speaking, complexity can be sorted into two buckets, accidental and essential, a distinction made by Frederick Brooks in the 1980s.5

Accidental Complexity

Accidental complexity is a consequence of writing software within a resource-limited setting, namely this universe. In everyday work there are always competing priorities. For software engineers, the explicit priorities might be feature velocity, test coverage, and idiomaticity. The implicit priorities might be economy, workload, and safety. No one has infinite time and resources, so navigating these priorities inevitably results in compromises.

The code we write is imbued with our intentions, assumptions, and priorities at one particular point in time. It cannot stay correct, because the world will change, and what we expect from our software will change with it.

A compromise in software can manifest as a slightly suboptimal snippet of code, a vague intention behind a contract, an equivocating variable name, an emphasis on a later-abandoned code path, and so on. Like dirt on a floor, these compromises accumulate. No one brings dirt into a house and puts it on the floor on purpose; it just happens as a byproduct of living. Likewise, suboptimal code just happens as a byproduct of engineering. At some point these accumulated compromises exceed the ability of a person to intuitively understand them, and at that point we have complexity -- specifically, accidental complexity.

The interesting thing about accidental complexity is that there is no known, sustainable method to reduce it. You can reduce accidental complexity at one point in time by stopping work on new features to reduce the complexity in previously written software. This can work, but there are caveats.

For example: there is no reason to assume that the compromises that were made at the time the code was written were any less informed than the ones that will be made in a refactoring. The world changes, as does our expectation of how software should behave. It is often the case that writing new software to reduce accidental complexity simply creates new forms of accidental complexity. These new forms may be more acceptable than the prior, but that acceptability will expire at roughly the same rate.

Large refactors often suffer from what is known as the second-system effect, a term also introduced by Frederick Brooks, in which the follow-up project is supposed to be better than the original because of the insight gained during development of the first. Instead, these second systems end up bigger and more complex due to unintentional tradeoffs inspired by the success of the first version.

Regardless of the approach taken to reduce accidental complexity, none of these methods are sustainable. They all require a diversion of limited resources like time and attention away from developing new features. In any organization where the intention is to make progress, these diversions conflict with other priorities. Hence, they are not sustainable.

So as a byproduct of writing code, accidental complexity is always accruing.

Essential Complexity

If we cannot sustainably reduce accidental complexity, then perhaps we can reduce the other kind. Essential complexity in software is the complexity we purposefully add, because that is the job: as software engineers we write new features, and new features make things more complex.

Consider the following example: You have the simplest database that you can imagine. It is a key/value datastore, as seen in Figure 1-10: give it a key and a value, it will store the value. Give it a key, it will return the value. To make it absurdly simple, imagine that it runs in-memory on your laptop.
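In code, that starting point is almost nothing. A minimal in-memory sketch might look like this:

class KeyValueStore:
    """The absurdly simple starting point: an in-memory key/value store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value      # give it a key and a value, it stores the value

    def get(self, key):
        return self._data.get(key)   # give it a key, it returns the value

store = KeyValueStore()
store.put("greeting", "hello")
print(store.get("greeting"))   # -> hello

Every availability improvement described next adds machinery on top of this; none of it removes anything.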

Now imagine that you are given the task of making it more available. We can put it into the cloud. That way when we shut the laptop lid the data persists. We can add multiple nodes for redundancy. We can put the keyspace behind a consistent hash and distribute the data to multiple nodes. We can persist the data on those nodes to disk so we can bring them on- and offline for repair or data handoff. We can replicate a cluster to another in different regions so that if one region or datacenter becomes unavailable we can still access the other cluster.

Figure 1-10. Progression of a simple key/value database to a highly available setup.

In one paragraph we can very quickly describe a slew of well-known design principles to make a database more available.

Figure 1-11. Rewind to the simple key/value database.

Now let’s go back to our simple key/value datastore running in-memory on our laptop (Figure 1-11). Imagine that you are given the task of making it more available, and simpler, simultaneously. Do not spend too much time trying to solve this riddle: it cannot be done in any meaningful way.

Adding new features to software (or safety properties like availability and security) requires the addition of complexity.

Taken together, the prospect of trading in our complex systems for simple systems is not encouraging. Accidental complexity will always accrue as a byproduct of work, and essential complexity will be driven by new features. In order to make any progress in software, complexity will increase.

Embracing Complexity

If complexity is causing bad outcomes, and we cannot remove the complexity, then what are we to do? The solution is a two-step process.

The first step is to embrace complexity rather than avoid it. Most of the properties that we desire and optimize for in our software require adding complexity. Trying to optimize for simplicity sets the wrong priority and generally leads to frustration. In the face of inevitable complexity we sometimes hear, “Don’t add any unnecessary complexity.” Sure, but the same could be said of anything: “Don’t add any unnecessary _____.” Accept that complexity is going to increase, even as software improves, and that is not a bad thing.

The second step is to learn to navigate complexity. Find tools to move quickly with confidence. Learn practices to add new features without exposing your system to increased risks of unwanted behavior. Rather than sink into complexity and drown in frustration, surf it like a wave. Chaos Engineering is one method to do just that. It isn’t the only method: increasing reversibility, improving team resilience, and de-bureaucratizing your organization are other methods that also navigate complexity. As an engineer, Chaos Engineering may be the most approachable, efficient way to begin to navigate the complexity of your system.

1 https://en.wikipedia.org/wiki/Nonlinear_system

2 See Peter Senge's book The Fifth Discipline.

3 See commentary on W. Ross Ashby's "Law of Requisite Variety" as outlined in http://pespmc1.vub.ac.be/books/AshbyReqVar.pdf. To oversimplify, a system A that fully controls system B has to be at least as complex as B.

4 It doesn’t work exactly this way, because the consistent hash algorithm distributes data objects pseudo-randomly across all S-D instances.

5 http://faculty.salisbury.edu/~xswang/Research/Papers/SERelated/no-silver-bullet.pdf
