Monitoring for configuration drift

In Chapter 7, Configuration Management with Ansible, we explored the ways that Ansible can be used both to deploy configuration at an enterprise scale and to enforce it. Let us now build on this with something else: monitoring for configuration drift.

As we discussed in Chapter 1, Building a Standard Operating Environment on Linux, manual changes are the enemy of automation. Beyond this, they are also a security risk. Let us work with a specific example here, to demonstrate. As was suggested previously in this book, it would be advisable to manage the Secure Shell (SSH) server configuration with Ansible. SSH is the standard protocol for managing Linux servers and can be used not only for management but also for file transfer. In short, it is one of the key mechanisms through which people will access your servers, and hence it is vital that it is secure.

It is also common, however, for a variety of people to have root access to Linux servers. Whether developers are deploying code, or system administrators are performing routine (or break-fix) work, it is considered perfectly normal for many people to have root access to a server. This is fine if everyone is well behaved, and actively supports the principles of automation in your enterprise. However, what happens if someone makes unauthorized changes?

Through the SSH configuration, they might enable remote root logins. They might turn on password-based authentication when you have disabled this in favor of key-based authentication. Many times, these kinds of changes are made to support laziness—it is easier to copy files around as a root user, for example.

Whatever the intention and root cause, someone manually making these changes to a Linux server you deployed previously is a problem. How do you go about detecting them, though? Certainly, you don't have time to log in to every server and check the files by hand. Ansible, however, can help.

In Chapter 7, Configuration Management with Ansible, we proposed a simple Ansible example that deployed the SSH server configuration from a template and restarted the SSH service if the configuration was changed using a handler.

We can repurpose this code for our configuration drift checks. Without making any code changes, we can run the playbook with Ansible in check mode. Check mode makes no changes to the systems on which it is working; rather, it tries its best to predict any changes that might occur. The reliability of these predictions depends very much on the modules used in the role. For example, the template module can reliably predict changes because it knows whether the file that would be written is different from the file that is in place. Conversely, the shell module cannot tell the difference between a change and an ok result because it is such a general-purpose module (though it can detect failures with a reasonable degree of accuracy), so by default it reports changed whenever the command runs successfully. Thus, I strongly advocate the use of changed_when whenever this module is used.
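To illustrate the point about changed_when, here is a minimal sketch of two shell tasks. The first is a read-only query, so it is told never to report changed. The second assumes a hypothetical script, /usr/local/bin/apply-settings.sh, which is assumed to print CHANGED when it modifies something. The script path, its output convention, and the register variable names are illustrative assumptions, not part of the role from Chapter 7.

```yaml
---
- name: Check whether remote root logins are enabled (read-only query)
  shell: grep -Eiq '^[[:space:]]*PermitRootLogin[[:space:]]+yes' /etc/ssh/sshd_config
  register: root_login_check
  # A read-only command never alters the host, so never report changed
  changed_when: false
  # grep exits 0 on a match and 1 on no match; anything else is a real error
  failed_when: root_login_check.rc not in [0, 1]
  # shell tasks are skipped in check mode by default; force this one to run
  check_mode: false

- name: Run a hypothetical script that prints CHANGED when it modifies anything
  shell: /usr/local/bin/apply-settings.sh
  register: apply_result
  # Report changed only when the script says it altered something
  changed_when: "'CHANGED' in apply_result.stdout"
```

With conditions like these in place, a changed result from a shell task carries real meaning, which is what makes it usable for drift detection.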

Let's see what happens if we rerun the securesshd role from before, this time in check mode. The result can be seen in the following screenshot:

Here, we can see that someone has indeed changed the SSH server configuration—if it matched the template we were providing, the output would look like this instead:

So far, so good—you could run this against a hundred, or even a thousand, servers, and you would know that any changed results came from servers where the SSH server configuration no longer matches the template. You could even run the playbook again to rectify the situation, only this time not in check mode (that is, without the -C flag on the command line).
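As a rough sketch, a playbook that applies the securesshd role might look like the following. The file name site.yml, the inventory file hosts, and the play name are assumptions for illustration; the two invocations are shown as comments.

```yaml
---
# site.yml (hypothetical file name)
#
# Drift check only, no changes are made:
#   ansible-playbook -i hosts site.yml -C
#
# Remediation, changes are applied:
#   ansible-playbook -i hosts site.yml
#
# Adding --diff to either command also prints the lines that differ.
- name: Enforce SSH server configuration
  hosts: all
  become: true
  roles:
    - securesshd
```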

In an environment such as AWX or Ansible Tower, jobs (that is to say, running playbooks) are categorized into two different states—success and failure. Success is categorized as any playbook that runs to completion, producing only changed or ok results. Failure, however, comes about from one or more failed or unreachable states being returned from the playbook run.

Thus, we could enhance our playbook by getting it to issue a failed state if the configuration file is different from the templated version. The bulk of the role remains exactly the same, but, on our template task, we add the following clauses:

  register: template_result
  failed_when: (template_result.changed and ansible_check_mode) or template_result.failed

These have the following effect on the operation of this task:

- The result of the task is registered in the template_result variable.
- We change the failure condition of this task to the following:
  - The template task result was changed, and we are running it in check mode.
  - Or, the template task failed for some other reason. This is a catch-all case, to ensure we still report other failure cases correctly (for example, access denied to a file).
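Put together, the complete template task might look something like the following sketch. The template path, destination, file mode, and handler name are assumptions made for illustration, as the original task from Chapter 7 is not reproduced here; only the last two clauses come from the text above.

```yaml
---
- name: Deploy SSH server configuration from template
  template:
    src: templates/sshd_config.j2   # hypothetical template path
    dest: /etc/ssh/sshd_config
    owner: root
    group: root
    mode: "0600"
  notify: Restart sshd              # hypothetical handler name
  register: template_result
  # ansible_check_mode is a boolean, so it needs no explicit comparison.
  # Fail when drift is detected in check mode, or on any genuine failure.
  failed_when: (template_result.changed and ansible_check_mode) or template_result.failed
```

Outside of check mode, the first condition is never true, so the task still deploys the template and notifies the handler as before.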

You will observe the use of both the logical and operator and the logical or operator in the failed_when clause, a powerful way to expand on the operation of Ansible. Now, when we run the playbook in check mode and the file has changed, we see the following result:

Now, we can very clearly see that there is an issue on our host, and it will be reported as a failure in AWX and Ansible Tower too.

Of course, this works very well for plain text files. What about binary files, though? Ansible is not a complete replacement for a file integrity monitoring tool such as Advanced Intrusion Detection Environment (AIDE) or the venerable Tripwire; however, it can help with checking binary files too. In fact, the process is very simple. Let's suppose you want to ensure the integrity of /bin/bash. This is the default shell on most systems, so the integrity of this file is incredibly important. If you have space to store a copy of the original binary on your Ansible server, then you can use the copy module to copy it across to the target hosts. The copy module uses checksums to determine whether a file needs to be copied, so if it returns a changed result, you can be sure that the target file differs from your original version (in content, ownership, or permissions) and that its integrity may be compromised. The role code for this would look very similar to our template example here:

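---
- name: Copy bash binary to target host
  copy:
    src: files/bash
    dest: /bin/bash
    owner: root
    group: root
    # Quote the mode so that YAML does not interpret it as a number
    mode: "0755"
  register: copy_result
  # In check mode, a changed result means the target differs from our reference copy
  failed_when: (copy_result.changed and ansible_check_mode) or copy_result.failed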

Of course, storing original binaries on your Ansible server is inefficient, and it also means you have to keep them up to date in line with your server patching schedule, which is undesirable when you have a large number of files to check. Fortunately, the Ansible stat module can generate checksums, as well as return lots of other useful data about files, so we could very easily write a playbook to check that our Bash binary has not been tampered with, by running the following code:

---
- name: Get sha256 sum of /bin/bash
  stat:
    path: /bin/bash
    checksum_algorithm: sha256
    get_checksum: true
  register: binstat

# The expected value is the SHA-256 of one known-good build of Bash;
# it must be updated whenever the package is patched
- name: Verify checksum of /bin/bash
  fail:
    msg: "Integrity failure - /bin/bash may have been compromised!"
  when: binstat.stat.checksum != 'da85596376bf384c14525c50ca010e9ab96952cb811b4abe188c9ef1b75bff9a'

This is a very simple example and could be enhanced significantly by making the file path and checksum variables rather than static values. It could also be made to loop over a dictionary of files and their respective checksums. These tasks are left as an exercise for you, and are entirely possible using techniques we have covered throughout this book. Now, if we run this playbook (whether in check mode or not), we will see a failed result if the integrity of Bash has not been maintained, and ok otherwise, as follows:

Checksumming can be used to verify the integrity of configuration files too, so this example role serves as a good basis for any file integrity checking you might wish to undertake.
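As one possible approach to the looping variant, the following sketch checks several files against a dictionary of expected checksums. The variable name integrity_checksums, the play layout, and the second entry are assumptions for illustration. The placeholder value must be replaced with a checksum you have generated from a known-good copy of the file; until then, that entry will always be reported as a failure.

```yaml
---
- name: Verify file integrity against known-good checksums
  hosts: all
  become: true
  vars:
    # Hypothetical mapping of file path to expected SHA-256 checksum
    integrity_checksums:
      "/bin/bash": "da85596376bf384c14525c50ca010e9ab96952cb811b4abe188c9ef1b75bff9a"
      "/etc/ssh/sshd_config": "REPLACE_WITH_KNOWN_GOOD_SHA256"
  tasks:
    - name: Get sha256 sum of each monitored file
      stat:
        path: "{{ item.key }}"
        checksum_algorithm: sha256
        get_checksum: true
      loop: "{{ integrity_checksums | dict2items }}"
      register: filestats

    - name: Verify checksum of each monitored file
      fail:
        msg: "Integrity failure - {{ item.item.key }} may have been compromised!"
      # A missing file has no checksum, so default to an empty string,
      # which can never match and is therefore reported as a failure
      when: item.stat.checksum | default('') != item.item.value
      loop: "{{ filestats.results }}"
      loop_control:
        label: "{{ item.item.key }}"
```

Each entry in filestats.results carries the original loop item alongside the stat data, which is how the second task pairs every file with its expected checksum.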

We have now completed our exploration of file and integrity monitoring with Ansible, and hence, the ability to check for configuration drift. In the next section of this chapter, we'll take a look at how Ansible can be used to manage processes across an Enterprise Linux estate.
