With every chaos experiment that you write and run, you increase your chances of finding evidence of dark debt that you can learn from and use to improve your system. Your chaos experiments will start out as explorations of your system; ways to ask yourself, “If this happens, I think the system will survive…or will it?” You’ll gradually build a catalog of experiments for your system that explores a selection of your Hypothesis Backlog, helping you build trust and confidence that you’re proactively exploring and surfacing weaknesses before they affect your users.
Some of your chaos experiments will then graduate into a different phase of their lives. The first phase of an experiment’s life, as just described, is about finding evidence of system weaknesses. It’s about exploring and uncovering that dark debt inherent in all complex sociotechnical systems. Over time, you will choose to overcome some or all of the weaknesses that your automated chaos experiments have surfaced evidence for. At that point, a chaos experiment enters the second phase of its life: it becomes a chaos test.
A chaos experiment is exploration; a chaos test is validation. Whereas a chaos experiment seeks to surface weaknesses and is celebrated when a deviation is found,1 a chaos test validates that previously found weaknesses have been overcome.
There’s more good news: a chaos experiment and a chaos test look exactly the same. Only the interpretation of the results is different. Instead of being a scientific exploration to find evidence of weaknesses, the goal has become to validate that those weaknesses seem to have been overcome. If a chaos engineer celebrates when evidence from a chaos experiment or Game Day shows that a new weakness may have been found, they will celebrate again when no evidence of that weakness is found after that same experiment is run as a chaos test once system improvements have been put into place.
Over time you will build catalogs of hypotheses, chaos experiments (Game Days and automated experiments), and chaos tests (always automated). You’ll share those experiments with others and demonstrate, through the contribution model (see “Specifying a Contribution Model”), what areas you are focusing on to improve trust and confidence in your system…but there is one more thing you could do to really turn those chaos tests into something powerful.
Chaos tests unlock an additional chaos engineering superpower: they make “continuous chaos” possible.
Continuous chaos means that you have regularly scheduled—often frequent—executions of your chaos tests. Usually chaos tests, rather than chaos experiments, are scheduled, because the intent is to validate that a weakness has not returned. The more frequently you schedule your chaos tests to run, the more often you can validate that a transient condition has not caused the weakness to return.
A continuous chaos environment is made up of these three elements:
Scheduler: Responsible for taking control of when a chaos test can and should be executed
Chaos runtime: Responsible for executing the experiment
Chaos tests catalog: The collection of experiments that have graduated into being tests with a high degree of trust and confidence (see “Continuous Chaos Needs Chaos Tests with No Human Intervention” for more on this)
Figure 12-1 shows how these three concepts work together in a continuous chaos environment.
So far in this book you’ve been using the Chaos Toolkit as your chaos runtime, and you’ve been building up a collection of chaos experiments that are ready to be run as chaos tests; now it’s time to slot the final piece into place by adding scheduled, continuous chaos to your toolset.
Since the Chaos Toolkit provides a CLI through the chaos command, you can hook it up to your cron scheduler.2
We won’t go into all the details of how to use cron here,3 but it is one of the simplest ways of scheduling chaos tests to run as part of your own continuous chaos environment. First you need to activate the Python virtual environment in which your Chaos Toolkit and its extensions are installed. To do this, create a runchaos.sh file with the following contents, which activate your chaostk Python virtual environment and then run the chaos --help command to show that everything is working:
#!/bin/bash
source ~/.venvs/chaostk/bin/activate
export LANG="en_US.UTF-8"    # Needed currently for the click library
export LC_ALL="en_US.UTF-8"  # Needed currently for the click library
chaos --help
deactivate
Activate the Python virtual environment where the Chaos Toolkit and any necessary extensions are installed.
Deactivate the Python virtual environment at the end of the run. This is only included to show that you could activate and deactivate different virtual environments with different installations of the Chaos Toolkit and extensions depending on your experiment’s needs.
Save the runchaos.sh file and then make it executable:
$ chmod +x runchaos.sh
Now when you run this script you should see:
$ ./runchaos.sh
Usage: chaos [OPTIONS] COMMAND [ARGS]...

Options:
  --version           Show the version and exit.
  --verbose           Display debug level traces.
  --no-version-check  Do not search for an updated version of the
                      chaostoolkit.
  --change-dir TEXT   Change directory before running experiment.
  --no-log-file       Disable logging to file entirely.
  --log-file TEXT     File path where to write the command's log.
                      [default: chaostoolkit.log]
  --settings TEXT     Path to the settings file.
                      [default: /Users/russellmiles/.chaostoolkit/settings.yaml]
  --help              Show this message and exit.

Commands:
  discover  Discover capabilities and experiments.
  info      Display information about the Chaos Toolkit environment.
  init      Initialize a new experiment from discovered capabilities.
  run       Run the experiment loaded from SOURCE, either a local file or a...
  validate  Validate the experiment at PATH.
You can now add as many chaos run commands to the runchaos.sh script as you need; each of those chaos tests will be executed sequentially when the script is run. For example:
#!/bin/bash
source ~/.venvs/chaostk/bin/activate
export LANG="en_US.UTF-8"    # Needed currently for the click library
export LC_ALL="en_US.UTF-8"  # Needed currently for the click library
chaos run /absolute/path/to/experiment/experiment.json
# Include as many more chaos tests as you like here!
deactivate
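If your catalog of chaos tests grows, listing each chaos run line by hand can become tedious. The following is a minimal sketch, not part of the Chaos Toolkit itself, of a helper that runs every JSON experiment file in a directory and counts the failures. The names run_catalog, CHAOS_CMD, and CATALOG_FAILURES are hypothetical, and the sketch assumes that chaos run exits with a nonzero status when a test fails or deviates; check this against the version you have installed.

```shell
#!/bin/bash
# Sketch: run every chaos test in a catalog directory, one after another.
# CHAOS_CMD, CATALOG_FAILURES, and run_catalog are illustrative names.

# Command used to execute one experiment; override it to dry-run the loop.
CHAOS_CMD="${CHAOS_CMD:-chaos run}"

# Number of chaos tests that failed during the most recent run_catalog call.
CATALOG_FAILURES=0

# run_catalog DIR: runs each *.json file in DIR; succeeds only if all pass.
run_catalog() {
  local dir="$1" experiment
  CATALOG_FAILURES=0
  for experiment in "$dir"/*.json; do
    # Skip the unexpanded pattern when the directory has no JSON files.
    [ -e "$experiment" ] || continue
    echo "Running chaos test: $experiment"
    # CHAOS_CMD is deliberately unquoted so "chaos run" splits into two words.
    # Keep going after a failure so one failing test does not hide the rest.
    if ! $CHAOS_CMD "$experiment"; then
      echo "FAILED: $experiment" >&2
      CATALOG_FAILURES=$((CATALOG_FAILURES + 1))
    fi
  done
  [ "$CATALOG_FAILURES" -eq 0 ]
}
```

In runchaos.sh you could then replace the individual chaos run lines with a single call such as run_catalog /absolute/path/to/experiment, placed between activating and deactivating the virtual environment.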
This script will work well if your experiment files are always available locally. If that is not the case, another option is to direct the Chaos Toolkit to load the experiment from a URL.4 You can do this by amending your runchaos.sh file with URL references in your chaos run commands:
#!/bin/bash
source ~/.venvs/chaostk/bin/activate
export LANG="en_US.UTF-8"    # Needed currently for the click library
export LC_ALL="en_US.UTF-8"  # Needed currently for the click library
# Placeholder URL; substitute the location of your own experiment file
chaos run https://example.com/path/to/experiment.json
# Include as many more chaos tests as you like here!
deactivate
Now you can schedule a task with cron by adding an entry to your system’s crontab (cron table). To open up the crontab file, execute the following:
$ crontab -e
This will open the file in your terminal’s default editor. Add the following line to execute your runchaos.sh script every minute:
*/1 * * * * /absolute/path/to/script/runchaos.sh
Save the file and exit, and you should see the “crontab: installing new crontab” message. Now just wait; if everything is working correctly, your chaos tests will be executed every minute by cron.
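One thing to watch for with a schedule this frequent: if a run of your chaos tests takes longer than a minute, cron will start the next run while the previous one is still going. The following is a minimal sketch of one way to guard against that, using a lock directory. The run_exclusively function, the LOCK_DIR location, and the exit status 75 are all illustrative choices, not anything that cron or the Chaos Toolkit prescribes.

```shell
#!/bin/bash
# Sketch: skip a scheduled run if the previous one is still in progress.
# LOCK_DIR is a hypothetical location; pick one that suits your machine.
LOCK_DIR="${LOCK_DIR:-/tmp/runchaos.lock}"

# run_exclusively CMD [ARGS...]: runs CMD only if no other run holds the lock.
run_exclusively() {
  # mkdir either creates the directory or fails, so only one caller wins.
  if ! mkdir "$LOCK_DIR" 2>/dev/null; then
    echo "Previous chaos test run still in progress; skipping." >&2
    return 75  # arbitrary status chosen here to mean "skipped"
  fi
  "$@"
  local status=$?
  # Release the lock and hand back the status of the wrapped command.
  rmdir "$LOCK_DIR"
  return "$status"
}
```

Inside runchaos.sh you could then write run_exclusively chaos run /absolute/path/to/experiment/experiment.json. Note that this simple version leaves a stale lock behind if the script is killed partway through, in which case you would need to remove the lock directory by hand.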
A very common choice is to schedule your chaos tests to be executed every time there’s been a change to the target system,5 so that’s what you’re going to set up now: you’ll install the popular open source Jenkins continuous integration and delivery pipeline tool and add your chaos tests to that environment as an additional deployment stage.
First you need to get a Jenkins server running, and the simplest way to do that is to download and install it locally for your operating system.6 Once Jenkins has been downloaded, installed, and unlocked and is ready for work, you should see the Jenkins home screen shown in Figure 12-2.
You are now all set to tell Jenkins how to run your chaos tests. From the Jenkins home screen, click “create new jobs” (see Figure 12-2). You’ll then be asked what type of Jenkins job you’d like to create. Select “Freestyle project” and give it a name such as “Run Chaos Tests” (see Figure 12-3).
Once you’ve clicked OK to create your new project, you’ll be presented with a screen where you can configure the job. There’s a lot you could complete here to make the most of Jenkins, but for our purposes you’re going to do the minimum to be able to execute your chaos tests.
Navigate down the page to the “Build” section and click the “Add build step” button (see Figure 12-4), and then select “Execute shell.”
You’ll be asked to specify the shell command that you want Jenkins to execute. You’ll be reusing the runchaos.sh script that you created earlier, so simply enter the full path to your runchaos.sh file and then click “Save” (Figure 12-5).
You’ll now be returned to your new Run Chaos Tests job page. To test that everything is working, click the “Build Now” link; you should see a new build successfully completed in the Build History pane (Figure 12-6).
You can see the output of running your chaos tests by clicking the build execution link (i.e., the job number) and then the “Console Output” link (Figure 12-7).
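Jenkins marks an “Execute shell” build step as failed when the command exits with a nonzero status. Because the runchaos.sh script ends with deactivate, its exit status is that of deactivate rather than of your chaos tests, so a failing chaos test may still show up as a successful build. The following is a minimal sketch of one way to preserve the result; run_then is a hypothetical helper, and the sketch assumes that chaos run exits with a nonzero status when a test fails or deviates, which you should confirm for your installed version.

```shell
#!/bin/bash
# Sketch: report the chaos test's result, not the cleanup's, as the status.
# run_then is an illustrative helper, not part of the Chaos Toolkit.

# run_then CLEANUP CMD [ARGS...]: runs CMD, then CLEANUP, returns CMD's status.
run_then() {
  local cleanup="$1"
  shift
  "$@"
  local status=$?
  # Always run the cleanup step, whether or not CMD succeeded.
  "$cleanup"
  return "$status"
}

# Possible use inside runchaos.sh (commented out so the sketch stands alone):
#   source ~/.venvs/chaostk/bin/activate
#   run_then deactivate chaos run /absolute/path/to/experiment/experiment.json
#   exit $?
```

With this in place, the virtual environment is still deactivated at the end of the run, but the build result in Jenkins reflects whether the chaos test passed.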
Great! You now have Jenkins executing your chaos tests. However, clicking the “Build Now” button yourself is hardly “continuous.” To enable continuous chaos, you need to add an appropriate build trigger.
You can trigger your new Run Chaos Tests Jenkins job in a number of different ways, including triggering on the build success of other projects. For our purposes, you can see some continuous chaos in action by simply triggering the job on a schedule, just as you did earlier with cron. In fact, Jenkins scheduled builds are specified with exactly the same cron pattern, so let’s do that now.
From your Run Chaos Tests job home page, click “Configure” and then go to the “Build Triggers” tab (see Figure 12-8). Select “Build periodically” and then enter the same cron pattern that you used earlier when editing the crontab file, which was:
*/1 * * * *
Figure 12-8 shows what your completed build trigger should look like.
Now when you go back to your job’s home page you should see new executions of your chaos tests being run every minute!
The progression from manual Game Days to automated chaos experiments to chaos tests and continuous chaos is now complete. By building a continuous chaos environment, you can search for and confidently surface weaknesses as often as needed, without long delays between Game Days.
But your journey into chaos engineering is only just beginning.
Chaos engineering never stops; as long as a system is being used, you will find value in exploring and surfacing evidence of weaknesses in it. Chaos engineering is never done, and this is a good thing! As a chaos engineer, you know that the real value of chaos engineering is in gaining evidence of system weaknesses as early as possible, so that you and your team can prepare for them and maybe even overcome them. As a mind-set, a process, a set of techniques, and a set of tools, chaos engineering is a part of your organization’s resilience engineering capability, and you are now ready to play your part in that capability. Through the establishment of the learning loops that chaos engineering supports, everyone can be a chaos engineer and contribute to the reliability of your systems.
Good luck, and happy chaos engineering!
1 Maybe “celebrate” is too strong a term for your reaction to finding potential evidence of system weaknesses, but that is the purpose of a chaos experiment.
2 If you’re running on Windows there are a number of other options, such as Task Scheduler.
3 Check out bash Cookbook by Carl Albing and JP Vossen (O’Reilly) for more on using cron to schedule tasks.
4 The specified URL must be reachable from the machine that the chaos run command will be executed on.
5 Possibly even as part of the choice to roll back during a blue-green deployment.
6 If an instance of Jenkins is already available, please feel free to use that existing installation.