Chapter 5. Writing and Running Your First Automated Chaos Experiment

In Chapter 4 you grabbed the Chaos Toolkit; now it’s time to actually use the toolkit to execute your first automated chaos experiment. In this chapter you’ll set up a simple target system to work your chaos against, then write and run your first automated chaos experiment to surface a weakness in that system. You’ll execute a whole cycle in which your automated chaos experiment is first used to uncover evidence of a weakness, and then used again to validate that the weakness has been overcome—see the diagram in Figure 5-1.

An image of a chaos experiment used to surface and validate a weakness.
Figure 5-1. Using a chaos experiment to surface evidence of a weakness, then provide evidence of the weakness being overcome

Setting Up the Sample Target System

You need a system to explore for weaknesses, and everything you need for that is available in the learning-chaos-engineering-book-samples directory in the community-playground repo under the chaostoolkit-incubator organization. Grab the code now by cloning the repository with the git command:

(chaostk) $ git clone https://github.com/chaostoolkit-incubator/community-playground.git

If you’re not comfortable with using git, you can simply grab the repository’s contents as a zip file.

Getting the Example Code

In fact, all the experiments shown in this book are in the chaostoolkit-incubator/community-playground repo. For more information on the other contents of this repository, see Appendix B. For now, grab the Community Playground and keep it handy, as we’re going to be using it throughout the rest of the book.

Once you’ve cloned the repository (or unpacked the zip file), you should see the following directory structure and contents in the learning-chaos-engineering-book-samples directory:

.
├── LICENSE
├── README.md
└── chapter5
    ├── experiment.json
    ├── resilient-service.py
    └── service.py

... further entries omitted ...

As you might expect, you’ll be working from within the chapter5 directory. Change directory in your terminal now so that you’re in this directory (not forgetting to check that you still have your chaostk virtual environment activated).

A Quick Tour of the Sample System

Any chaos engineering experiment needs a system to target, and since this is your first experiment, the sample system has been kept very simple indeed. The key features of the system that is going to be the target of your chaos experiment are shown in Figure 5-2.

An image of the simple system to be targeted with your first chaos experiment
Figure 5-2. The single-service system that you’ll be targeting with your first chaos experiment

The target system is made up of a single Python file, service.py, which contains a single runtime service with the following code:

# -*- coding: utf-8 -*-
from datetime import datetime
import io
import time
import threading
from wsgiref.validate import validator
from wsgiref.simple_server import make_server

EXCHANGE_FILE = "./exchange.dat"


def update_exchange_file():
    """
    Writes the current date and time every 10 seconds into the exchange file.

    The file is created if it does not exist.
    """
    print("Will update to exchange file")
    while True:
        with io.open(EXCHANGE_FILE, "w") as f:
            f.write(datetime.now().isoformat())
        time.sleep(10)


def simple_app(environ, start_response):
    """
    Read the contents of the exchange file and return it.
    """
    start_response('200 OK', [('Content-type', 'text/plain')])
    with io.open(EXCHANGE_FILE) as f:
        return [f.read().encode('utf-8')]


if __name__ == '__main__':
    t = threading.Thread(target=update_exchange_file)
    t.start()

    httpd = make_server('', 8080, simple_app)
    print("Listening on port 8080....")

    try:
        httpd.serve_forever()
    except KeyboardInterrupt:
        httpd.shutdown()
        t.join(timeout=1)

This simple service deliberately does almost nothing of interest. It exposes an HTTP endpoint at its root, /, and serves the contents of a file, exchange.dat, when that URL is hit. To see the service in action, all you need to do is enter the following while in the chapter5 directory:

(chaostk) $ python3 service.py
Will update to exchange file
Listening on port 8080....

With the service running you should be able to visit http://localhost:8080 and see the contents of the file that is served, as shown in Figure 5-3.

An image showing the response from hitting the target system's simple service when all is well.
Figure 5-3. Try not to be too blown away by the incredible output from your target system’s simple service…

To make things more interesting, in addition to serving the contents of the file the service also periodically refreshes the file’s contents, so you should see the contents changing if you repeatedly hit http://localhost:8080.

What could possibly go wrong with such a trivial service? That is the question the team responsible for this service would ask when beginning to consider what chaos experiments might be interesting to explore. Worryingly, even in this trivial piece of code, there is a weakness—one that would result in service failure. And worse, it’s a failure that would cascade directly to the service consumer.

Imagine for a moment that this is not such a trivial piece of code. Imagine that it is a business-critical service, part of a popular API, and that real customers rely on it—but there’s a weakness! You may have already spotted it in such simple code, but for the sake of your first chaos engineering exercise, you’ll now construct an experiment that will surface that weakness.

Exploring and Discovering Evidence of Weaknesses

Following the Chaos Engineering Learning Loop first shown in Figure 4-3, the initial step is to explore the target system to attempt to surface or discover any weaknesses (Figure 5-4).

An image showing your focus at this point is to explore and discover weaknesses using a chaos experiment.
Figure 5-4. Using a chaos experiment to explore and discover weaknesses in the target system

The experiment is already written for you, using the declarative experiment specification format of the Chaos Toolkit.

The experiment is located in the experiment.json file, along with the service.py code. Open the experiment.json file in your favorite text editor, and you’ll see that it starts with a title, a description, and some tags:

{
    "title": "Does our service tolerate the loss of its exchange file?",
    "description": "Our service reads data from an exchange file, can it support that file disappearing?",
    "tags": [
        "tutorial",
        "filesystem"
    ],

Every chaos experiment should have a meaningful title and a description that conveys how you believe the system will survive. In this case, you’ll be exploring how the service performs if, or more likely when, the exchange.dat file disappears for whatever reason. The title indicates that the service should tolerate this loss, but there is doubt. This chaos experiment will empirically prove whether your belief in the resiliency of the service is well founded.

The next section of the experiment file captures the steady-state hypothesis. A steady-state hypothesis is:

[A] model that characterizes the steady state of the system based on expected values of the business metrics.

Chaos Engineering

Remember that the steady-state hypothesis expresses, within certain tolerances, what constitutes normal and healthy for the portion of the target system being subjected to the chaos experiment. With only one service in your target system, the Blast Radius of your experiment—that is, the area anticipated to be impacted by the experiment—is also limited to your single service.

A Chaos Toolkit experiment’s steady-state hypothesis comprises a collection of probes. Each probe inspects some property of the target system and judges whether the property’s value is within an expected tolerance:

    "steady-state-hypothesis": {
        "title": "The exchange file must exist",
        "probes": [
            {
                "type": "probe",
                "name": "service-is-unavailable",
                "tolerance": [200, 503],
                "provider": {
                    "type": "http",
                    "url": "http://localhost:8080/"
                }
            }
        ]
    },

If all of the probes in the steady-state hypothesis are within tolerance, then the system is declared to be in a “normal,” steady state.
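To make the tolerance idea concrete, here is a minimal sketch of how a probe result could be compared against a list-style tolerance such as [200, 503]. The function names within_tolerance and hypothesis_is_met are hypothetical, and this is not the Chaos Toolkit's own implementation; it only illustrates the membership check implied by the experiment above.

```python
def within_tolerance(observed_status, tolerance):
    """Return True if the observed HTTP status is one of the tolerated values.

    Illustrative only: assumes the tolerance is a list of acceptable
    status codes, as in the experiment's [200, 503].
    """
    return observed_status in tolerance


def hypothesis_is_met(observed_statuses, tolerance):
    """The hypothesis holds only if every probe result is within tolerance."""
    return all(within_tolerance(s, tolerance) for s in observed_statuses)
```

Under this reading, a 200 or a 503 response counts as steady, while any other status, such as a 500, counts as a deviation.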

Next comes the active part of an experiment, the experimental method:

    "method": [
        {
            "name": "move-exchange-file",
            "type": "action",
            "provider": {
                "type": "python",
                "module": "os",
                "func": "rename",
                "arguments": {
                    "src": "./exchange.dat",
                    "dst": "./exchange.dat.old"
                }
            }
        }
    ]
}

A Chaos Toolkit experiment’s method defines actions that will affect the system and cause the turbulent conditions, the chaos, that should be applied to the target system. Here the experiment is exploring how resilient the service is to the sudden absence of the exchange.dat file, so all the experiment’s method needs to do is rename that file so that it cannot be found by the service.
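In plain Python terms, the action above amounts to calling os.rename with the given arguments. The sketch below shows one way a python provider could be resolved and invoked. The helper name run_python_action is hypothetical, and the Chaos Toolkit's real loader may well differ. The demo works inside a temporary directory so that it does not touch your sample system.

```python
import importlib
import os
import tempfile


def run_python_action(provider):
    """Import provider["module"], look up provider["func"], and call it
    with provider["arguments"] as keyword arguments (hypothetical helper)."""
    module = importlib.import_module(provider["module"])
    func = getattr(module, provider["func"])
    return func(**provider.get("arguments", {}))


def demo():
    """Rename a stand-in exchange file inside a throwaway directory."""
    with tempfile.TemporaryDirectory() as workdir:
        src = os.path.join(workdir, "exchange.dat")
        dst = os.path.join(workdir, "exchange.dat.old")
        with open(src, "w") as f:
            f.write("2019-04-25T12:44:41")

        run_python_action({
            "type": "python",
            "module": "os",
            "func": "rename",
            "arguments": {"src": src, "dst": dst},
        })
        # After the action the original path is gone and the new one exists.
        return os.path.exists(src), os.path.exists(dst)
```

Calling demo() reports whether the source and destination paths exist after the rename, which is the condition the service then has to cope with.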

As well as actions, the experiment’s method can contain probes similar to those in the experiment’s steady-state hypothesis, except without any tolerances specified. No tolerances are needed, as these probes are not assessing the target system. Rather, probes declared in the experiment’s method enrich the output from an experiment’s execution, capturing data points from the target system as the method is executed and adding those data points to the experiment’s findings.

In this simple experiment definition, the method is the last section. There is one further section that is permitted, and that is the rollbacks section. You’ll come to grips with rollbacks when you create more advanced experiments in Chapter 6.

Running Your Experiment

And now what you’ve been waiting for! It’s time to execute your first chaos experiment and see whether the target system handles the chaos. You’re now entering the discovery and analysis phases of the Chaos Engineering Learning Loop (Figure 5-5).

An image showing your focus at this point is to discover and analyze any weaknesses by running your chaos experiment.
Figure 5-5. Using a chaos experiment to discover and begin to analyze any weaknesses surfaced in the target system

First make sure service.py is running in your terminal; you should see something like the following:

(chaostk) $ python3 service.py
Will update to exchange file
Listening on port 8080....

Now run your chaos experiment using the chaos run command in a new terminal window, making sure you have the chaostk virtual environment activated:

(chaostk) $ chaos run experiment.json
[2019-04-25 12:44:41 INFO] Validating the experiment's syntax
[2019-04-25 12:44:41 INFO] Experiment looks valid
[2019-04-25 12:44:41 INFO] Running experiment: Does our service tolerate the
loss of its exchange file?
[2019-04-25 12:44:41 INFO] Steady state hypothesis: The exchange file must exist
[2019-04-25 12:44:41 INFO] Probe: service-is-unavailable
[2019-04-25 12:44:41 INFO] Steady state hypothesis is met!
[2019-04-25 12:44:41 INFO] Action: move-exchange-file
[2019-04-25 12:44:41 INFO] Steady state hypothesis: The exchange file must exist
[2019-04-25 12:44:41 INFO] Probe: service-is-unavailable
[2019-04-25 12:44:41 CRITICAL] Steady state probe 'service-is-unavailable' is
not in the given tolerance so failing this experiment
[2019-04-25 12:44:41 INFO] Let's rollback...
[2019-04-25 12:44:41 INFO] No declared rollbacks, let's move on.
[2019-04-25 12:44:41 INFO] Experiment ended with status: failed

Congratulations! You’ve run your first automated chaos experiment. Even better, the terminal output indicates, through the CRITICAL entry, that you may have discovered a weakness. But before you start analyzing the potentially complex causes of the weakness, let’s look at what the Chaos Toolkit did when it executed your chaos experiment.

Under the Skin of chaos run

The first job the Chaos Toolkit performs is to ensure that the indicated experiment is valid and executable. You can also verify this yourself without executing an experiment using the chaos validate command.

Assuming the experiment passes as valid, the Chaos Toolkit orchestrates the experiment execution based on your experiment definition, as shown in the diagram in Figure 5-6.

An image showing how the Chaos Toolkit interprets and executes an experiment
Figure 5-6. How the Chaos Toolkit interprets and executes an experiment

A surprising thing you’ll notice is that the steady-state hypothesis is used twice: once at the beginning of the experiment’s execution, and then again when the experiment’s method has completed its execution.

The Chaos Toolkit uses the steady-state hypothesis for two purposes. At the beginning of an experiment’s execution, the steady-state hypothesis is assessed to decide whether the target system is in a recognizably normal state. If the target system is deviating from the expectations of the steady-state hypothesis at this point, the experiment is aborted, as there is no value in executing an experiment’s method when the target system isn’t recognizably “normal” to begin with. In scientific terms, we have a “dirty petri dish” problem.

The second use of the steady-state hypothesis is its main role in an experiment’s execution. After an experiment’s method, with its turbulent condition–inducing actions, has completed, the steady-state hypothesis is again compared against the target system. This is the critical point of an experiment’s execution, because this is when any deviation from the conditions expected by the steady-state hypothesis will indicate that there may be a weakness surfacing under the method’s actions.
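The flow described above can be sketched as a small function. This is a simplified illustration, not the Chaos Toolkit's code: run_experiment, its parameters, and the returned status strings are assumptions made for this example (the "aborted" label in particular just mirrors the wording above).

```python
def run_experiment(steady_state_is_met, method_actions, rollbacks=()):
    """Run a chaos experiment in the order described in the text.

    steady_state_is_met: callable returning True when every probe is
        within tolerance (hypothetical signature).
    method_actions: callables that introduce the turbulent conditions.
    rollbacks: callables that attempt to undo the method's effects.
    """
    # First use: is the system recognizably normal before we start?
    if not steady_state_is_met():
        return "aborted"  # the "dirty petri dish" case

    try:
        for action in method_actions:
            action()
        # Second use: did the turbulent conditions cause a deviation?
        return "completed" if steady_state_is_met() else "failed"
    finally:
        # Mirroring the log output shown earlier, rollbacks are attempted
        # whether or not a deviation was found.
        for rollback in rollbacks:
            rollback()
```

The early return is the important design point: the method never runs unless the first steady-state check passes, so a deviation found afterward can be attributed to the method's actions.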

Steady-State Deviation Might Indicate “Opportunity for Improvement”

When a chaos experiment reports that there has been a deviation from the conditions expected by the steady-state hypothesis, you celebrate! This might sound odd, but any weakness you find in a target system before a user encounters it is not a failure; it’s an opportunity for assessment, learning, and improvements in the resiliency of the system.

A glance at the service.py code will quickly highlight the problem:[1]

# -*- coding: utf-8 -*-
from datetime import datetime
import io
import time
import threading
from wsgiref.validate import validator
from wsgiref.simple_server import make_server

EXCHANGE_FILE = "./exchange.dat"


def update_exchange_file():
    """
    Writes the current date and time every 10 seconds into the exchange file.

    The file is created if it does not exist.
    """
    print("Will update to exchange file")
    while True:
        with io.open(EXCHANGE_FILE, "w") as f:
            f.write(datetime.now().isoformat())
        time.sleep(10)


def simple_app(environ, start_response):
    """
    Read the content of the exchange file and return it.
    """
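    start_response('200 OK', [('Content-type', 'text/plain')])
    # Weakness: nothing checks that the exchange file exists, so a missing
    # file makes io.open raise before any response body is produced.
    with io.open(EXCHANGE_FILE) as f:
        return [f.read().encode('utf-8')]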


if __name__ == '__main__':
    t = threading.Thread(target=update_exchange_file)
    t.start()

    httpd = make_server('', 8080, simple_app)
    print("Listening on port 8080....")

    try:
        httpd.serve_forever()
    except KeyboardInterrupt:
        httpd.shutdown()
        t.join(timeout=1)

The code assumes that the exchange.dat file is always there. If the file disappears for any reason, the service fails when its root URL is accessed, returning a server error. Our experiment’s title and description indicated that the presence of the file was not guaranteed and that the service should be resilient to the condition of the file not being present. The chaos experiment has proved that the service has been implemented without this resiliency in mind and shown that this condition will cause a catastrophic failure in the service that will affect its consumers.
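The failure can be reproduced without starting the server by calling the handler logic directly. The sketch below copies the read logic from service.py into a function that takes the file path as a parameter. The names read_exchange_app and call_without_file are ours, introduced so the example does not touch your real exchange.dat. When the exception escapes the handler, a WSGI server such as wsgiref's simple server is expected to answer with an internal server error rather than 200 or 503, which is why the probe falls outside its tolerance.

```python
import io
import os
import tempfile


def read_exchange_app(exchange_file, start_response):
    """Same read logic as simple_app, with the path passed in explicitly."""
    start_response('200 OK', [('Content-type', 'text/plain')])
    with io.open(exchange_file) as f:  # raises if the file is missing
        return [f.read().encode('utf-8')]


def call_without_file():
    """Call the handler against a path that does not exist."""
    with tempfile.TemporaryDirectory() as workdir:
        missing = os.path.join(workdir, "exchange.dat")
        try:
            read_exchange_app(missing, lambda status, headers: None)
        except FileNotFoundError as error:
            return type(error).__name__
    return "no error"
```

Calling call_without_file() returns the name of the exception raised, confirming that the handler has no path for a missing file.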

Improving the System

When a new weakness is surfaced by a chaos experiment’s execution, it can often lead to a lot of work that needs to be prioritized by the team responsible for the portion of the system where the weakness has been found. Just analyzing the findings can be a big job in itself!

Once you and your teams have conducted an analysis, it’s time to prioritize and apply system improvements to overcome any high-priority weaknesses (Figure 5-7).

An image showing your focus at this point is to prioritize and roll out an improvement to the system.
Figure 5-7. Once the challenge of analysis is done, it’s time to apply an improvement to the system (if needed)

Fortunately, your target system contains only one simple service, and the weakness is relatively obvious in the service’s exchange.dat file handling code.

An improved and more resilient implementation of the service is available in the resilient-service.py file:

# -*- coding: utf-8 -*-
from datetime import datetime
import io
import os.path
import time
import threading
from wsgiref.validate import validator
from wsgiref.simple_server import make_server

EXCHANGE_FILE = "./exchange.dat"


def update_exchange_file():
    """
    Writes the current date and time every 10 seconds into the exchange file.

    The file is created if it does not exist.
    """
    print("Will update the exchange file")
    while True:
        with io.open(EXCHANGE_FILE, "w") as f:
            f.write(datetime.now().isoformat())
        time.sleep(10)


def simple_app(environ, start_response):
    """
    Read the contents of the exchange file and return it.
    """
    if not os.path.exists(EXCHANGE_FILE):
        start_response(
            '503 Service Unavailable',
            [('Content-type', 'text/plain')]
        )
        return [b'Exchange file is not ready']

    start_response('200 OK', [('Content-type', 'text/plain')])
    with io.open(EXCHANGE_FILE) as f:
        return [f.read().encode('utf-8')]


if __name__ == '__main__':
    t = threading.Thread(target=update_exchange_file)
    t.start()

    httpd = make_server('', 8080, simple_app)
    print("Listening on port 8080....")

    try:
        httpd.serve_forever()
    except KeyboardInterrupt:
        httpd.shutdown()
        t.join(timeout=1)

This more resilient service checks whether the exchange.dat file is present and, if not, responds with a more informative 503 Service Unavailable when the root URL of the service is accessed. This is a small and simple change, but it immediately improves the service such that it can gracefully deal with unexpected failure when accessing a file it depends on.
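To see both branches of the improved handler without running the server, you can call the same logic directly and record the status it reports. This is a sketch: resilient_app takes the file path as a parameter, and observed_status is a helper invented for this example. Neither is part of resilient-service.py.

```python
import io
import os
import os.path
import tempfile


def resilient_app(exchange_file, start_response):
    """Same logic as the improved simple_app, with the path passed in."""
    if not os.path.exists(exchange_file):
        start_response(
            '503 Service Unavailable',
            [('Content-type', 'text/plain')]
        )
        return [b'Exchange file is not ready']

    start_response('200 OK', [('Content-type', 'text/plain')])
    with io.open(exchange_file) as f:
        return [f.read().encode('utf-8')]


def observed_status(file_present):
    """Return the status line reported with or without the exchange file."""
    seen = []
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "exchange.dat")
        if file_present:
            with io.open(path, "w") as f:
                f.write("2019-04-25T12:45:38")
        resilient_app(path, lambda status, headers: seen.append(status))
    return seen[0]
```

Both statuses fall inside the experiment's [200, 503] tolerance, so the steady-state hypothesis can still hold after the exchange file has been renamed.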

Validating the Improvement

It’s now time to run your experiment again to validate that the improvement has overcome the discovered and analyzed weakness (Figure 5-8).

An image showing your focus at this point is validation of your system improvement.
Figure 5-8. Your chaos experiment becomes a chaos test to detect whether the weakness has indeed been overcome

First ensure you’ve killed the original service instance that contained the weakness, and then run the new, improved, and more resilient service by entering:

(chaostk) $ python3 resilient-service.py
Will update the exchange file
Listening on port 8080....

Now switch to the terminal window where you previously ran your chaos experiment and run it again:

(chaostk) $ chaos run experiment.json
[2019-04-25 12:45:38 INFO] Validating the experiment's syntax
[2019-04-25 12:45:38 INFO] Experiment looks valid
[2019-04-25 12:45:38 INFO] Running experiment: Does our service tolerate the
loss of its exchange file?
[2019-04-25 12:45:38 INFO] Steady state hypothesis: The exchange file must exist
[2019-04-25 12:45:38 INFO] Probe: service-is-unavailable
[2019-04-25 12:45:38 INFO] Steady state hypothesis is met!
[2019-04-25 12:45:38 INFO] Action: move-exchange-file
[2019-04-25 12:45:38 INFO] Steady state hypothesis: The exchange file must exist
[2019-04-25 12:45:38 INFO] Probe: service-is-unavailable
[2019-04-25 12:45:38 INFO] Steady state hypothesis is met!
[2019-04-25 12:45:38 INFO] Let's rollback...
[2019-04-25 12:45:38 INFO] No declared rollbacks, let's move on.
[2019-04-25 12:45:38 INFO] Experiment ended with status: completed

Weakness overcome! The steady-state hypothesis does not detect a deviation in the target system, so you can celebrate that your chaos test has validated the improvement in system resiliency.

Summary

You’ve come a long way! In this chapter you’ve worked through a complete cycle: from surfacing evidence of a weakness all the way through to validating that the weakness has been overcome using your first automated chaos experiment. You’ve explored, discovered a weakness, analyzed that weakness, improved the system, and validated that the weakness has been overcome.

Now it’s time to dive deeper into the Chaos Toolkit experiment definition vocabulary. In the next chapter you’ll learn the details of the experiment definition format while you build your own chaos experiment from scratch.

[1] If only it were this simple in large, complex systems! Assessing the cause of a failure and the underlying system weakness can by itself take a lot of time when your system is bigger than one simple service.
