Understanding process management with Ansible

Sooner or later, you will need to manage, and possibly even kill, processes on one or more Linux servers within your enterprise. Obviously, this is not an ideal scenario, and in day-to-day operations, most services should be managed using the Ansible service module, many examples of which we have seen in this book.

What if, however, you need to actually kill a service that has hung? Obviously, a system administrator could SSH into the errant server and issue commands such as the following:

$ ps -ef | grep <processname> | grep -v grep | awk '{print $2}'
$ kill <PID1> <PID2>

If the process stubbornly refuses to terminate, then the following may become necessary:

$ kill -9 <PID1> <PID2>

While this is fairly standard practice, in which most system administrators will be well versed (and indeed, they may have their own favorite tools for the job, such as pkill), it suffers the same problem as most manual interventions on a server: how can you keep track of what happened, and which processes were affected? If numeric process IDs (PIDs) were used, then even with access to the command history, it is impossible to tell which process historically held a given PID.

What we propose here is an unconventional use of Ansible—yet one that, if run through a tool such as AWX or Ansible Tower, would enable us to track all operations that were performed, along with details of who ran them and, if we put the process name in a parameter, what the target was too. This could be useful if, in the future, it becomes necessary to analyze the history of a problem, whereupon it would be easy to check which servers were acted upon, and which processes were targeted, along with precise timestamps.

Let's build up a role to perform exactly this set of tasks. This chapter was originally written against a release of Ansible earlier than 2.8, which did not feature a module for process management, and so the following example uses native shell commands to handle this case:

  1. We start by running the process listing we proposed earlier in this section, but this time, registering the list of PIDs into an Ansible variable, as follows:
---
- name: Get PID's of running processes matching {{ procname }}
  shell: "ps -ef | grep -w {{ procname }} | grep -v grep | grep -v ansible | awk '{print $2\",\"$8}'"
  register: process_ids

Most people familiar with shell scripting should be able to understand this line—we are filtering the system process table for whole-word matches for the Ansible variable procname, and removing any extraneous process names that might come up and confuse the output, such as grep and ansible. Finally, we use awk to process the output into a comma-separated list, containing the PID, in the first column, and the process name itself in the second.
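To see what this filter chain keeps and what it discards, here is a minimal shell sketch that runs the same pipeline over canned ps -ef style output rather than the live process table. The function names filter_pids and sample_ps, and the sample lines themselves, are made up for illustration and are not part of the role.

```shell
#!/bin/sh
# Hypothetical helper: applies the role's filter chain to ps-style lines on stdin.
# $1 is the process name (the role's procname variable).
filter_pids() {
    grep -w "$1" | grep -v grep | grep -v ansible | awk '{print $2","$8}'
}

# Canned sample resembling `ps -ef` output (invented for this illustration).
sample_ps() {
    cat <<'EOF'
UID        PID  PPID  C STIME TTY          TIME CMD
root      1100     1  0 09:59 ?        00:00:00 /bin/sh /usr/bin/mysqld_safe
mysql     1234  1100  0 10:00 ?        00:00:05 /usr/sbin/mysqld
root      2345  2000  0 10:01 pts/0    00:00:00 grep -w mysqld
root      3456  2000  0 10:01 pts/0    00:00:00 ansible-playbook site.yml -e procname=mysqld
EOF
}

# Only the mysqld line survives: mysqld_safe is not a whole-word match,
# and the grep and ansible lines are filtered out explicitly.
sample_ps | filter_pids mysqld
```

With this sample input, the pipeline prints the single line 1234,/usr/sbin/mysqld. Note that the same filters would also hide any genuine process whose command line happens to contain grep or ansible, which is worth keeping in mind when choosing a value for procname.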

  2. Now, we must act on this output. We loop over the process_ids variable populated previously, issuing a kill command against the first column in the output (that is, the numeric PID), as follows:
- name: Attempt to kill processes nicely
  shell: "kill {{ item.split(',')[0] }}"
  loop: "{{ process_ids.stdout_lines }}"
  loop_control:
    label: "{{ item }}"

You will observe the use of a Jinja2 expression here: we call the string split method on each line of the data we created in the previous code block, taking only the first column of output (the numeric PID). However, we use the loop_control label to set the task label so that it contains both the PID and the process name, which could be very useful in an auditing or debugging scenario.

  3. Any experienced system administrator will know that it is not sufficient to just issue a kill command to a process: some processes are hung and must be forcefully killed. Not all processes exit immediately either, so we will use the Ansible wait_for module to check for the PID in the /proc directory; when it becomes absent, we know the process has exited. Run the following code:
- name: Wait for processes to exit
  wait_for:
    path: "/proc/{{ item.split(',')[0] }}"
    timeout: 5
    state: absent
  loop: "{{ process_ids.stdout_lines }}"
  ignore_errors: yes
  register: exit_results

We have set the timeout here to 5 seconds—however, you should set it as appropriate in your environment. Once again, we register the output to a variable—we need to know which processes failed to exit, and hence, try killing them more forcefully. Note that we set ignore_errors here, as the wait_for module produces an error if the desired state (that is, /proc/PID becomes absent) does not occur within the timeout specified. This should not be an error in our role, simply a prompt for further processing.
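For readers who think in shell terms, the check performed in this step is roughly a polling loop on /proc/PID. The sketch below is only an approximation under the assumption of a Linux /proc filesystem; the function name wait_for_exit and the one-second polling interval are choices made for this illustration and do not describe the module's internals.

```shell
#!/bin/sh
# Hypothetical helper: wait until /proc/<pid> disappears.
# $1 = PID, $2 = timeout in seconds (defaults to 5, as in the task above).
# Returns 0 if the process exited in time, 1 if it is still present.
wait_for_exit() {
    pid=$1
    timeout=${2:-5}
    elapsed=0
    while [ -e "/proc/$pid" ]; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Example: start a throwaway process, terminate it, and check that it has gone.
sleep 30 >/dev/null 2>&1 &
victim=$!
kill "$victim"
wait "$victim" 2>/dev/null || true   # reap the child so /proc/<pid> is removed
if wait_for_exit "$victim" 5; then
    echo "exited"
else
    echo "still running"
fi
```

A non-zero return here plays the same part as a failed wait_for result in the role: it is not fatal, merely a signal that the process needs firmer treatment.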

  4. We now loop over the results of the wait_for task, only this time we use the Jinja2 selectattr filter to select only those result dictionaries in which failed is true; we don't want to forcefully terminate non-existent PIDs. Run the following code:
- name: Forcefully kill stuck processes
  shell: "kill -9 {{ item.item.split(',')[0] }}"
  loop: "{{ exit_results.results | selectattr('failed') | list }}"
  loop_control:
    label: "{{ item.item }}"

Now, we attempt to kill the stuck processes with the -9 flag, which is normally sufficient to kill most hung processes. Note again the use of Jinja2 filtering and the tidy labeling of the loop, to ensure we can use the output of this role for auditing and debugging.
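The selection logic in this step can be pictured in plain shell: of the PIDs we signalled, only those still present are sent SIGKILL. The following is a rough sketch, assuming a Linux /proc filesystem. The function name force_kill_survivors is hypothetical, and it re-checks /proc directly, whereas the role relies on the registered wait_for results.

```shell
#!/bin/sh
# Hypothetical helper: send SIGKILL only to PIDs that are still present.
# Arguments: the list of PIDs that were previously sent a plain kill.
force_kill_survivors() {
    for pid in "$@"; do
        if [ -e "/proc/$pid" ]; then
            kill -9 "$pid" && echo "force-killed $pid"
        else
            echo "already gone $pid"
        fi
    done
}

# Example: one process that has already exited, and one that is still running
# (standing in for a process that did not respond to the first kill).
sleep 30 >/dev/null 2>&1 &
gone=$!
sleep 30 >/dev/null 2>&1 &
stuck=$!
kill "$gone"
wait "$gone" 2>/dev/null || true     # reap it so /proc/<pid> is removed
force_kill_survivors "$gone" "$stuck"
wait "$stuck" 2>/dev/null || true    # reap the force-killed process
```

Printing one line per PID mirrors the purpose of the loop label in the role: the record shows exactly which processes needed the forceful kill.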

  5. Now, we run the playbook, specifying a value for procname. There is no default process to be killed, and I would not suggest that setting a default value for this variable is safe. Thus, in the following screenshot, I am setting it using the -e flag when I invoke the ansible-playbook command:

From the preceding screenshot, we can clearly see the playbook killing the mysqld process, and the output of the playbook is tidy and concise, yet contains enough information for debugging, should the need arise.
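A minimal sketch of such an invocation follows, wrapped in a small guard that refuses to run without a process name, in keeping with the advice not to set a default. The inventory file hosts, the playbook site.yml, and the function name run_kill_playbook are hypothetical; only the -e flag and the procname variable come from the text.

```shell
#!/bin/sh
# Hypothetical wrapper: insist on an explicit process name before running.
# The inventory (hosts) and playbook (site.yml) names are assumptions.
run_kill_playbook() {
    if [ -z "${1:-}" ]; then
        echo "usage: run_kill_playbook <process-name>" >&2
        return 2
    fi
    ansible-playbook -i hosts site.yml -e "procname=$1"
}

# Example (requires Ansible and the hypothetical files named above):
# run_kill_playbook mysqld
```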

As an addendum, if you are using Ansible 2.8 or later, there is now a native Ansible module called pids that will return a nice, clean list of PIDs for a given process name, if it is running. Adapting our role for this new functionality, we can, first of all, remove the shell command and replace it with the pids module, which is much easier to read, like this:

---
- name: Get PID's of running processes matching {{ procname }}
  pids:
    name: "{{ procname }}"
  register: process_ids

From this point on, the role is almost identical to before, except that, rather than the comma-separated list we generated from our shell command, we have a simple list that contains just the PID of each running process whose name matches the procname variable. Thus, we no longer need to call split on our variables when executing commands on them. Run the following code:

- name: Attempt to kill processes nicely
  shell: "kill {{ item }}"
  loop: "{{ process_ids.pids }}"
  loop_control:
    label: "{{ item }}"

- name: Wait for processes to exit
  wait_for:
    path: "/proc/{{ item }}"
    timeout: 5
    state: absent
  loop: "{{ process_ids.pids }}"
  ignore_errors: yes
  register: exit_results

- name: Forcefully kill stuck processes
  shell: "kill -9 {{ item.item }}"
  loop: "{{ exit_results.results | selectattr('failed') | list }}"
  loop_control:
    label: "{{ item.item }}"

This block of code performs the same functions as before, only now, it is a little more readable, as we've reduced the number of Jinja2 filters required, and we have removed one shell command, in favor of the pids module. These techniques, combined with the service module discussed earlier, should give you a sound basis to meet all of your process control needs with Ansible.

In the next and final section of this chapter, we'll take a look at how to use Ansible when you have multiple nodes in a cluster, and you don't want to take them all out of service at once.
