Chapter 2. Managing Complexity

Complexity is a challenge and an opportunity for engineers. You need a team of people skilled and dynamic enough to successfully run a distributed system with many parts and interactions. The opportunity to innovate and optimize within the complex system is immense.

Software engineers typically optimize for three properties: performance, availability, and fault tolerance.

Performance

In this context refers to minimization of latency or capacity costs.

Availability

Refers to the system’s ability to respond and avoid downtime.

Fault tolerance

Refers to the system’s ability to recover from any undesirable state.

An experienced team will optimize for all three of these qualities simultaneously.

At Netflix, engineers also consider a fourth property:

Velocity of feature development

Describes the speed with which engineers can provide new, innovative features to customers.

Netflix explicitly makes engineering decisions based on what encourages feature velocity throughout the system, not just in service to the swift deployment of a local feature. Finding a balance between all four of these properties informs the decision-making process when architectures are planned and chosen.

With these properties in mind, Netflix chose to adopt a microservice architecture. Let us remember Conway’s Law:

Any organization that designs a system (defined broadly) will inevitably produce a design whose structure is a copy of the organization’s communication structure.

Melvin Conway, 1967

With a microservice architecture, teams operate their services independently of each other. This allows each team to decide when to push new code to the production environment. This architectural decision optimizes for feature velocity, at the expense of coordination. It is often easier to think of an engineering organization as many small engineering teams. We like to say that engineering teams are loosely coupled (very little structure designed to enforce coordination between teams) and highly aligned (everyone sees the bigger picture and knows how their work contributes to the greater goal). Communication between teams is key to a successfully implemented microservices architecture. Chaos Engineering comes into play here by supporting high velocity, experimentation, and confidence in teams and systems through resiliency verification.

Understanding Complex Systems

Imagine a distributed system that serves information about products to consumers. In Figure 2-1 this service is depicted as seven microservices, A through G. An example of a microservice might be A, which stores profile information for consumers. Microservice B perhaps stores account information such as when the consumer last logged in and what information was requested. Microservice C understands products and so on. D in this case is an API layer that handles external requests.

Figure 2-1. Microservices architecture

Let’s look at an example request. A consumer requests some information via a mobile app:

  • The request comes in to microservice D, the API.

  • The API does not have all of the information necessary to respond to the request, so it reaches out to microservices C and F.

  • Each of those microservices also needs additional information to satisfy the request, so C reaches out to A, and F reaches out to B and G.

  • A also reaches out to B, which reaches out to E, which is also queried by G. The one request to D fans out across the microservice architecture, and it isn’t until all of the request dependencies have been satisfied or timed out that the API layer responds to the mobile application (see the sketch just after this list).
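
To make the fan-out concrete, here is a minimal sketch of the dependency graph described above. The data structure and the dependencies helper are illustrative assumptions, not an actual implementation of this system.

# Hypothetical dependency graph for the request flow in Figure 2-1.
# Edges follow the narrative: D -> C, F; C -> A; F -> B, G; A -> B; B -> E; G -> E.
CALL_GRAPH = {
    "D": ["C", "F"],
    "C": ["A"],
    "F": ["B", "G"],
    "A": ["B"],
    "B": ["E"],
    "G": ["E"],
    "E": [],
}

def dependencies(service, seen=None):
    """Every downstream service that must answer (or time out) before
    `service` can respond; this is the fan-out the API layer waits on."""
    seen = set() if seen is None else seen
    for dep in CALL_GRAPH.get(service, []):
        if dep not in seen:
            seen.add(dep)
            dependencies(dep, seen)
    return seen

print(sorted(dependencies("D")))  # ['A', 'B', 'C', 'E', 'F', 'G']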

This request pattern is typical, although the number of interactions between services is usually much higher in systems at scale. The interesting thing to note about these types of architectures versus tightly-coupled, monolithic architectures is that the former have a diminished role for architects. If we take an architect’s role as being the person responsible for understanding how all of the pieces in a system fit together and interact, we quickly see that a distributed system of any meaningful size becomes too complex for a human to satisfy that role. There are simply too many parts, changing and innovating too quickly, interacting in too many unplanned and uncoordinated ways for a human to hold those patterns in their head. With a microservice architecture, we have gained velocity and flexibility at the expense of human understandability. This deficit of understandability creates the opportunity for Chaos Engineering.

The same is true in other complex systems, including monoliths (usually with many, often unknown, downstream dependencies) that become so large that no single architect can understand the implications of a new feature on the entire application. Perhaps the most interesting examples of this are systems where comprehensibility is specifically ignored as a design principle. Consider deep learning, neural networks, genetic evolution algorithms, and other machine-intelligence algorithms. If a human peeks under the hood into any of these algorithms, the series of weights and floating-point values of any nontrivial solution is too complex for an individual to make sense of. Only the totality of the system emits a response that can be parsed by a human. The system as a whole should make sense but subsections of the system don’t have to make sense.

In the progression of the request/response, the spaghetti of the call graph fanning out represents the chaos inherent in the system that Chaos Engineering is designed to tame. Classical testing, comprising unit, functional, and integration tests, is insufficient here. Classical testing can only tell us whether an assertion about a property that we know about is true or false. We need to go beyond the known properties of the system; we need to discover new properties. A hypothetical example based on real-world events will help illustrate the deficiency.

Example of Systemic Complexity

Imagine that microservice E contains information that personalizes a consumer’s experience, such as predicted next actions that arrange how options are displayed on the mobile application. A request that needs to present these options might hit microservice A first to find the consumer’s account, which then hits E for this additional personalized information.

Now let’s make some reasonable assumptions about how these microservices are designed and operated. Since the number of consumers is large, rather than have each node of microservice A respond to requests over the entire consumer base, a consistent hashing function balances requests such that any one particular consumer may be served by one node. Out of the hundred or so nodes comprising microservice A, all requests for consumer “CLR” might be routed to node “A42,” for example. If A42 has a problem, the routing logic is smart enough to redistribute A42’s solution space responsibility around to other nodes in the cluster.
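
A rough sketch of that routing follows. It uses a bare-bones consistent hash ring (no virtual nodes or replication); the node names and the choice of MD5 are assumptions for illustration only.

import bisect
import hashlib

def ring_position(key):
    # A stable hash, so every caller agrees on where a key lands on the ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes):
        self._ring = sorted((ring_position(n), n) for n in nodes)

    def node_for(self, consumer_id):
        # The first node clockwise from the consumer's position serves it.
        positions = [pos for pos, _ in self._ring]
        i = bisect.bisect(positions, ring_position(consumer_id)) % len(self._ring)
        return self._ring[i][1]

    def remove(self, node):
        # Terminating a node shifts only its slice of consumers to a neighbor.
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

ring = ConsistentHashRing([f"A{i:02d}" for i in range(1, 101)])
owner = ring.node_for("CLR")      # the node playing the role of "A42" above
ring.remove(owner)                # that node scales away...
print(ring.node_for("CLR"))       # ...and CLR's requests move to a neighbor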

In case downstream dependencies misbehave, microservice A has rational fallbacks in place. If it can’t contact the persistent stateful layer, it serves results from a local cache.
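
A sketch of that fallback, assuming hypothetical persistent_store and local_cache objects (neither name comes from the example):

class StoreUnavailable(Exception):
    """Raised when the persistent stateful layer cannot be reached."""

def get_record(consumer_id, persistent_store, local_cache):
    try:
        record = persistent_store.get(consumer_id)
        local_cache[consumer_id] = record   # keep the cache warm for later
        return record
    except StoreUnavailable:
        # Rational fallback: a possibly stale answer beats no answer at all.
        return local_cache.get(consumer_id)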

Operationally, each microservice balances monitoring, alerting, and capacity concerns to get the performance and insight it needs without being reckless about resource utilization. Scaling rules watch CPU load and I/O and scale up by adding more nodes if those resources are too scarce, and scale down if they are underutilized.
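
The scaling rule might look something like the sketch below; the thresholds are invented for illustration, not taken from any real policy.

from statistics import mean

SCALE_UP_ABOVE = 0.70    # add a node when mean utilization is this high
SCALE_DOWN_BELOW = 0.30  # remove a node when mean utilization is this low

def desired_change(cpu_samples, io_samples):
    """+1 to add a node, -1 to remove one, 0 to hold steady."""
    utilization = max(mean(cpu_samples), mean(io_samples))
    if utilization > SCALE_UP_ABOVE:
        return +1
    if utilization < SCALE_DOWN_BELOW:
        # The branch the example exercises: cache-served traffic looks cheap,
        # the mean drops, and the cluster shrinks.
        return -1
    return 0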

Now that we have the environment, let’s look at a request pattern. Consumer CLR starts the application and makes a request to view the content-rich landing page via a mobile app. Unfortunately, the mobile phone is currently out of connectivity range. Unaware of this, CLR makes repeated requests, all of which are queued by the mobile phone OS until connectivity is reestablished. The app itself also retries the requests, which are also queued within the app irrespective of the OS queue.

Suddenly connectivity is reestablished. The OS fires off several hundred requests simultaneously. Because CLR is starting the app, microservice E is called many times to retrieve essentially the same information regarding a personalized experience. As the requests fan out, each call to microservice E makes a call to microservice A. Microservice A is hit by these requests as well as others related to opening the landing page. Because of A’s architecture, each request is routed to node A42. A42 is suddenly unable to hand off all of these requests to the persistent stateful layer, so it switches to serving requests from the cache instead.
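
The client-side behavior driving that burst can be sketched as a retry queue that neither deduplicates nor spreads out its replays; the class and its names are hypothetical.

class RetryQueue:
    """Queues requests while offline and replays them all on reconnect.
    Note what it does not do: deduplicate or add jitter."""

    def __init__(self, send):
        self.send = send
        self.pending = []

    def request(self, payload, online):
        if online:
            self.send(payload)
        else:
            self.pending.append(payload)   # the OS and the app each keep such a queue

    def on_reconnect(self):
        for payload in self.pending:       # everything fires at once
            self.send(payload)
        self.pending.clear()

sent = []
queue = RetryQueue(sent.append)
for _ in range(300):   # CLR keeps retrying the landing page while offline
    queue.request({"consumer": "CLR", "view": "landing"}, online=False)
queue.on_reconnect()
print(len(sent))       # 300 near-identical requests hit the API together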

Serving responses from the cache drastically reduces the processing and I/O overhead necessary to serve each request. In fact, A42’s CPU and I/O drop so low that they bring the mean below the threshold for the cluster-scaling policy. Respectful of resource utilization, the cluster scales down, terminating A42 and redistributing its work to other members of the cluster. The other members of the cluster now have additional work to do, as they handle the work that was previously assigned to A42. A11 now has responsibility for serving requests involving CLR.

During the handoff of responsibility between A42 and A11, microservice E times out its request to A. Rather than failing its own response, it invokes a rational fallback, returning less personalized content than it normally would, since it doesn’t have the information from A.

CLR finally gets a response, notices that it is less personalized than they are used to, and tries reloading the landing page a few more times for good measure. A11 is working harder than usual at this point, so it too switches to returning slightly stale responses from the cache. The mean CPU and I/O drop, once again prompting the cluster to shrink.

Several other users now notice that their application is showing them less personalized content than they are accustomed to. They also try refreshing their content, which sends more requests to microservice A. The additional pressure causes more nodes in A to flip to the cache, which brings the CPU and I/O lower, which causes the cluster to shrink faster. More consumers notice the problem, causing a consumer-induced retry storm. Finally, the entire cluster is serving from the cache, and the retry storm overwhelms the remaining nodes, bringing microservice A offline. Microservice B has no rational fallback for A, which brings D down, essentially stalling the entire service.

Takeaway from the Example

The scenario above is called the “bullwhip effect” in Systems Theory. A small perturbation in input starts a self-reinforcing cycle that causes a dramatic swing in output. In this case, the swing in output ends up taking down the app.
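
A toy simulation, with entirely made-up numbers, shows the direction of that loop: cache-served work looks cheap to the scaler, the cluster shrinks, stale responses invite more retries, and eventually even the cache path is overwhelmed. This models only the mechanism, not the actual incident.

STORE_CAPACITY = 1.0     # work per tick a node can serve from the persistent layer
CACHE_CAPACITY = 10.0    # work per tick a node can serve from its local cache
SCALE_DOWN_BELOW = 0.3   # the cluster shrinks when utilization falls below this
RETRY_GROWTH = 1.1       # stale responses prompt consumers to refresh

nodes, demand, tick = 100, 120.0, 0
while True:
    tick += 1
    share = demand / nodes                     # work each node must absorb
    cached = share > STORE_CAPACITY            # the node flips to its cache
    capacity = CACHE_CAPACITY if cached else STORE_CAPACITY
    if share > capacity:                       # even cache serving can't keep up
        print(f"tick {tick}: microservice A is offline with {nodes} nodes left")
        break
    utilization = share / capacity             # cache-served work looks cheap
    if utilization < SCALE_DOWN_BELOW and nodes > 1:
        nodes -= 1                             # the scaling policy shrinks the cluster
    if cached:
        demand *= RETRY_GROWTH                 # the consumer-induced retry storm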

The most important feature in the example above is that all of the individual behaviors of the microservices are completely rational. Only taken in combination under very specific circumstances do we end up with the undesirable systemic behavior. This interaction is too complex for any human to predict. Each of those microservices could have complete test coverage and yet we still wouldn’t see this behavior in any test suite or integration environment.

It is unreasonable to expect that any human architect could understand the interaction of these parts well enough to predict this undesirable systemic effect. Chaos Engineering provides the opportunity to surface these effects and gives us confidence in a complex distributed system. With confidence, we can move forward with architectures chosen for feature velocity as well as with systems that are too vast or obfuscated to be comprehensible by a single person.
