It’s been said that specialization is for insects, which although poetic, isn’t exactly true. Nature abounds with examples of specialization in just about every biological kingdom, from mitochondria to clownfish. The most extreme examples are a special kind of specialization, which biologists refer to as symbiosis.
You’ve probably come across some examples of biological symbiosis at one time or another. Some are quite famous, like the clownfish and the anemone. Others, like the fig wasp, are less so, but the general idea is always the same: two organisms, finding that they can rely on each other, buddy up. Buddies have to work less and can focus more on what they’re good at. In this way, symbiosis begets more specialization, and the individual specializations grow to complement each other.
Effective symbiotes are complementary in the sense that there isn’t much functional overlap between them. The beneficial abilities of one buddy stop pretty close to where those of the other begin, and vice versa. They are also complementary in the sense that their individual specializations combine to create a solution that would be impossible otherwise. Together the pair become something more than the sum of their parts.
It would surprise us to learn that you’d never heard of Nagios. It is probably the most popular open source monitoring system in existence today and is generally credited with, if not inventing, then certainly perfecting the centralized polling model employed by myriad monitoring systems, both commercial and free. Nagios has been imitated, forked, reinvented, and commercialized, but in our opinion, it’s never been beaten, and it remains the yardstick by which all monitoring systems are measured.
It is not, however, a valid yardstick by which to measure Ganglia, because the two are not in fact competitors, but symbiotes, and the admin who makes the mistake of choosing one over the other is doing himself a disservice. It is not only possible, but advisable to use them together to achieve the best of both worlds. To that end, we’ve included this chapter to help you understand the best options available for Nagios interoperability.
Under the hood, Nagios is really just a special-purpose scheduling and notification engine. By itself, it can’t monitor anything. All it can do is schedule the execution of little programs referred to as plug-ins and take action based on their output.
Nagios plug-ins return one of four states: 0 for “OK,” 1 for “Warning,” 2 for “Critical,” and 3 for “Unknown.” The Nagios daemon can be configured to react to these return codes, notifying administrators via email or SMS, for example. In addition to the codes, the plug-ins can also return a line of text, which will be captured by the daemon, written to a log, and displayed in the UI. If the daemon finds a pipe character in the text returned by a plug-in, the first part is treated normally, and the second part is treated as performance data.
Performance data doesn’t really mean anything to Nagios; it won’t, for example, enforce any rules on it or interpret it in any way. The text after the pipe might be a chili recipe, for all Nagios knows. The important point is that Nagios can be configured to handle the post-pipe text differently than pre-pipe text, thereby providing a hook from which to obtain metrics from the monitored hosts and pass those metrics to external systems (like Ganglia) without affecting the human-readable summary provided by the pre-pipe text.
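To make the convention concrete, here is a toy plug-in of our own invention (the metric name and warning threshold are arbitrary), written as a shell function purely for illustration; a real plug-in would be a standalone script. It emits a summary, a pipe, and performance data, and finishes with a standard Nagios return code:

```shell
#!/bin/sh
# A toy Nagios plug-in, written as a function for illustration. The text
# before the pipe is the human-readable summary; the text after the pipe
# is free-form performance data.
check_users() {
    users=$(who | wc -l)
    if [ "${users}" -gt 10 ]; then
        echo "USERS WARNING - ${users} users logged in|users=${users}"
        return 1    # Nagios "Warning"
    fi
    echo "USERS OK - ${users} users logged in|users=${users}"
    return 0        # Nagios "OK"
}
check_users
```

Nagios would record the pre-pipe text in its log and UI and hand the post-pipe text to its performance data machinery, as described next.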
Nagios’s performance data handling feature is an important hook. Quite a few Nagios add-ons use it to export metrics from Nagios into local RRDs. These systems typically point the service_perfdata_command attribute in nagios.cfg at a script that uses a series of regular expressions to parse out the metric names and values and then imports them into the proper RRDs. The same methodology can easily be used to push metrics from Nagios to Ganglia by pointing service_perfdata_command at a script that runs gmetric instead of the RRDtool import command.
First, you must enable performance data processing in Nagios by setting process_performance_data=1 in the nagios.cfg file. Then you can specify the name of the command to which Nagios should pass all performance data it encounters using the service_perfdata_command attribute.
Let’s walk through a simple example. Imagine a check_ping plug-in that, when executed by the Nagios scheduler, pings a host and then returns the following output:
PING OK - Packet loss = 0%, RTA = 0.40 ms|0;0.40
We want to capture this plug-in’s performance data, along with details we’ll need to pass to gmetric, including the name of the target host. Once process_performance_data is enabled, we’ll tell Nagios to execute our own shell script every time a plug-in returns performance data by setting service_perfdata_command=pushToGanglia in nagios.cfg. Then we’ll define pushToGanglia in the Nagios object configuration like so:
define command {
    command_name pushToGanglia
    command_line /usr/local/bin/pushToGanglia.sh "$LASTSERVICECHECK$||$HOSTNAME$||$SERVICEDESC$||$SERVICEOUTPUT$||$SERVICEPERFDATA$"
}
With so many Nagios plug-ins written by so many different authors, it’s important to choose your delimiter carefully and avoid any character that a plug-in might itself return. In our example command, we chose double pipes for a delimiter, which can be difficult to parse in some languages. The tilde (~) character is another good choice.
The capitalized words surrounded by dollar signs in the command definition are Nagios macros. Using macros, we can request all sorts of interesting details about the check result from the Nagios daemon, including the non-performance-data section of the output returned by the plug-in. The Nagios daemon replaces these macros with their respective values at runtime, so when Nagios runs our pushToGanglia command, its input will wind up looking something like this:
1338674610||dbaHost14.foo.com||PING||PING OK - Packet loss = 0%, RTA = 0.40 ms||0;0.40
Our pushToGanglia.sh script takes this input and compares it against a series of regular expressions to detect what sort of data it is. When the input matches the PING regex, the script parses out the relevant metrics and pushes them to Ganglia using gmetric. It looks something like this:
#!/bin/bash
while read IN
do
  #check for output from the check_ping plug-in
  if [ "$(awk -F '[|][|]' '$3 ~ /^PING$/' <<<${IN})" ]
  then
    #this looks like check_ping output all right, parse out what we need
    read BOX CMDNAME PERFOUT <<<$(awk -F '[|][|]' '{print $2" "$3" "$5}' <<<${IN})
    read PING_LOSS PING_MS <<<$(tr ';' ' ' <<<${PERFOUT})
    #OK, we have what we need. Send it to Ganglia.
    #(-n is the metric name; -t is the metric's data type)
    gmetric -S ${BOX} -n PING_MS -t float -v ${PING_MS}
    gmetric -S ${BOX} -n PING_LOSS -t float -v ${PING_LOSS}
  #check for output from the check_cpu plug-in
  elif [ "$(awk -F '[|][|]' '$3 ~ /^CPU$/' <<<${IN})" ]
  then
    : #do the same sort of thing but with CPU data
  fi
done
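Before wiring the script into Nagios, it’s worth sanity-checking the field parsing by hand. This snippet of ours replays the sample input through the same awk and tr logic (using portable here-documents):

```shell
#!/bin/sh
# Exercise the pushToGanglia parsing logic against the sample input line.
IN='1338674610||dbaHost14.foo.com||PING||PING OK - Packet loss = 0%, RTA = 0.40 ms||0;0.40'

# Split out the host, service name, and perfdata fields on "||"
read BOX CMDNAME PERFOUT <<EOF
$(echo "$IN" | awk -F '[|][|]' '{print $2" "$3" "$5}')
EOF

# Split the perfdata field on ";" into loss and round-trip time
read PING_LOSS PING_MS <<EOF
$(echo "$PERFOUT" | tr ';' ' ')
EOF

echo "host=${BOX} service=${CMDNAME} loss=${PING_LOSS} rta=${PING_MS}"
```

This should print host=dbaHost14.foo.com service=PING loss=0 rta=0.40, confirming the fields land where the gmetric calls expect them.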
This is a popular solution because it’s self-documenting, keeps all of the metrics collection logic in a single file, detects new hosts without any additional configuration, and works with any kind of Nagios check result, including passive checks. It does, however, add a nontrivial amount of load to the Nagios server. Consider that any time you add a new check, the result of that check for every host must be parsed by the pushToGanglia script. The same is true when you add a new host or even a new regex to the pushToGanglia script. In Nagios, process_performance_data is a global setting, and so are the ramifications that come with enabling it.
It probably makes sense to process performance data globally if you rely heavily on Nagios for metrics collection, but for the reasons we outlined in Chapter 1, we don’t think relying on Nagios for metrics collection is a good idea. If you’re using Ganglia along with Nagios, gmond is the better-evolved symbiote for collecting the normal litany of performance metrics. It’s more likely that you’ll want gmond to collect the majority of your performance metrics, and less likely that you’ll want Nagios churning through the result of every single check in case there might be some metrics you’re interested in sending over to Ganglia.
If you’re interested in metrics from only a few Nagios plug-ins, consider leaving process_performance_data disabled and instead writing “wrappers” for the interesting plug-ins. Here, for example, is what a wrapper for the check_ping plug-in might look like:
#!/bin/bash
ORIG_PLUGIN='/usr/libexec/check_ping_orig'

#get the target host from the -H option
while getopts ":H:" opt
do
  if [ "${opt}" == 'H' ]
  then
    BOX=${OPTARG}
  fi
done

#run the original plug-in with the given options, and capture its output
OOUT=$("${ORIG_PLUGIN}" "$@")
OEXIT=$?

#parse out the perfdata we need
read PING_LOSS PING_MS <<<$(echo "${OOUT}" | cut -d'|' -f2 | tr ';' ' ')

#send the metrics to Ganglia
gmetric -S ${BOX} -n PING_MS -t float -v ${PING_MS}
gmetric -S ${BOX} -n PING_LOSS -t float -v ${PING_LOSS}

#mimic the original plug-in's output back to Nagios
echo "${OOUT}"
exit ${OEXIT}
The wrapper approach takes a huge burden off the Nagios daemon but is more difficult to track. If you don’t carefully document your changes to the plug-ins, you’ll mystify other administrators, and upgrades to the Nagios plug-ins will break your data collection efforts.
The general strategy is to replace the check_ping plug-in with a small shell script that calls the original check_ping, intercepts its output, and sends the interesting metrics to Ganglia. The imposter script then reports back to Nagios with the output and exit code of the original plug-in, and Nagios has no idea that anything extra has transpired. This approach has several advantages, the biggest of which is that you can pick and choose which plug-ins will process performance data.
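Installation of a wrapper is just a rename-and-replace. The sketch below uses a scratch directory and stub files so it can be run anywhere; on a real system the plug-in directory would be something like /usr/local/nagios/libexec, and all paths here are our own invention:

```shell
#!/bin/sh
# Sketch of installing a wrapper: move the real plug-in aside under a new
# name and let the wrapper assume the original name. A scratch directory
# and stub files stand in for the real plug-in tree here.
PLUGDIR=$(mktemp -d)
echo 'echo "PING OK - Packet loss = 0%, RTA = 0.40 ms|0;0.40"' > ${PLUGDIR}/check_ping
echo '# wrapper body from the example above goes here' > ${PLUGDIR}/check_ping_wrapper.sh

mv ${PLUGDIR}/check_ping ${PLUGDIR}/check_ping_orig        # preserve the original
mv ${PLUGDIR}/check_ping_wrapper.sh ${PLUGDIR}/check_ping  # wrapper takes its name
chmod 755 ${PLUGDIR}/check_ping ${PLUGDIR}/check_ping_orig
```

The wrapper’s ORIG_PLUGIN variable must, of course, point at the renamed original.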
Because Nagios has no built-in means of polling data from remote hosts, Nagios users have historically employed various remote execution schemes to collect a litany of metrics with the goal of comparing them against static thresholds. These metrics, such as the available disk space or CPU utilization of a host, are usually collected by services like NRPE, which executes scripts on the monitored systems at the Nagios server’s behest, or NSCA, to which the monitored systems push their results, in either case returning them in the standard Nagios way. The metrics themselves, once returned, are usually discarded or, in some cases, fed into RRDs by the Nagios daemon in the manner described previously.
This arrangement is expensive, especially considering that most of the metrics administrators tend to collect with NRPE and NSCA are collected by gmond out of the box. If you’re using Ganglia, it’s much cheaper to point Nagios at Ganglia to collect these metrics.
To that end, the Ganglia project began including a series of official Nagios plug-ins in gweb versions as of 2.2.0. These plug-ins enable Nagios users to create services that compare metrics stored in Ganglia against alert thresholds defined in Nagios. This is, in our opinion, a huge win for administrators, in many cases enabling them to scrap entirely their Nagios NSCA infrastructure, speed up the execution time of their service checks, and greatly reduce the monitoring burden on both Nagios and the monitored systems themselves.
There are five Ganglia plug-ins currently available:
check_heartbeat: check the heartbeat of a specific host.
check_ganglia_metric: check a single metric on a specific host.
check_multiple_metrics: check multiple metrics on a specific host.
check_host_regex: check multiple metrics across a regex-defined range of hosts.
check_value_same_everywhere: verify that one or more values are the same across a set of hosts.
The plug-ins interact with a series of gweb PHP scripts created expressly for the purpose (see Figure 7-1). The check_host_regex.sh plug-in, for example, interacts with the PHP script at http://your.gweb.box/nagios/check_host_regex.php. Each PHP script takes the arguments passed from the plug-in, parses a cached copy of the XML dump of the grid state obtained from gmetad’s xml_port to retrieve the current metric values for the requested entities, and returns a Nagios-style status code (see gmetad for details on gmetad’s xml_port). You must enable the server-side PHP scripts before they can be used, and also define the location and refresh interval of the XML grid state cache, by setting the following parameters in the gweb conf.php file:
$conf['nagios_cache_enabled'] = 1;
$conf['nagios_cache_file'] = $conf['conf_dir'] . "/nagios_ganglia.cache";
$conf['nagios_cache_time'] = 45;
Consider storing the cache file on a RAMDisk or tmpfs to increase performance.
If you define a service check in Nagios to use hostgroups instead of individual hosts, Nagios will schedule the service check for all hosts in that hostgroup at the same time, which may cause a race condition if gweb’s grid state cache changes before the service checks finish executing. To avoid cache-related race conditions, use the warmup_metric_cache.sh script in the web/nagios subdirectory of the gweb tarball, which will ensure that your cache is always fresh.
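For example, a crontab entry along these lines (the install path and user are our assumptions; adjust to wherever you unpacked the gweb tarball) keeps the cache warm:

```
# Hypothetical /etc/crontab entry: refresh the grid-state cache every minute
# so no Nagios-triggered check ever has to rebuild it mid-flight.
* * * * *  www-data  /var/www/html/ganglia/nagios/warmup_metric_cache.sh >/dev/null 2>&1
```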
Internally, Ganglia uses a heartbeat counter to determine whether a machine is up. This counter is reset every time a new metric packet is received for the host, so you can safely use this plug-in in lieu of the Nagios check_ping plug-in. To use it, first copy the check_heartbeat.sh script from the Nagios subdirectory of the Ganglia Web tarball to your Nagios plug-ins directory. Make sure that the GANGLIA_URL inside the script is correct. By default, it is set to:
GANGLIA_URL="http://localhost/ganglia2/nagios/check_heartbeat.php"
Next, define the check command in Nagios. The threshold is the amount of time since the last reported heartbeat; that is, if the last packet received was 50 seconds ago, you would specify 50 as the threshold:
define command {
    command_name check_ganglia_heartbeat
    command_line $USER1$/check_heartbeat.sh host=$HOSTADDRESS$ threshold=$ARG1$
}
Now, for every host or host group you want monitored, change check_command to:
check_command check_ganglia_heartbeat!50
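For instance, a complete service definition might look like the following sketch (the hostgroup name and the generic-service template are our assumptions about your object layout):

```
define service {
    use                  generic-service
    hostgroup_name       ganglia-hosts
    service_description  Ganglia heartbeat
    check_command        check_ganglia_heartbeat!50
}
```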
The check_ganglia_metric plug-in compares a single metric on a given host against a predefined Nagios threshold. To use it, copy the check_ganglia_metric.sh script from the Nagios subdirectory of the Ganglia Web tarball to your Nagios plug-ins directory. Make sure that the GANGLIA_URL inside the script is correct. By default, it is set to:
GANGLIA_URL="http://localhost/ganglia2/nagios/check_metric.php"
Next, define the check command in Nagios like so:
define command {
    command_name check_ganglia_metric
    command_line $USER1$/check_ganglia_metric.sh host=$HOSTADDRESS$ metric_name=$ARG1$ operator=$ARG2$ critical_value=$ARG3$
}
Next, add the check command to the service checks for any hosts you want monitored. For instance, if you wanted to be alerted when the 1-minute load average for a given host goes above 5, add the following directive:
check_command check_ganglia_metric!load_one!more!5
To be alerted when the disk space for a given host falls below 10 GB, add:
check_command check_ganglia_metric!disk_free!less!10
The check_multiple_metrics plug-in is an alternate implementation of the check_ganglia_metric script that can check multiple metrics on the same host. For example, instead of configuring separate checks for free disk space on /, /tmp, and /var (which could produce three separate alerts), you could set up a single check that alerts any time free space on any of these filesystems falls below a given threshold.
To use it, copy the check_multiple_metrics.sh script from the Nagios subdirectory of the Ganglia Web tarball to your Nagios plug-ins directory. Make sure that the GANGLIA_URL variable in the script is correct. By default, it is set to:
GANGLIA_URL="http://localhost/ganglia2/nagios/check_multiple_metrics.php"
Then define a check command in Nagios:
define command {
    command_name check_ganglia_multiple_metrics
    command_line $USER1$/check_multiple_metrics.sh host=$HOSTADDRESS$ checks='$ARG1$'
}
Then add a colon-delimited list of checks. Each check consists of:
metric_name,operator,critical_value
For example, the following service would monitor the disk utilization for root (/) and /tmp:
check_command check_ganglia_multiple_metrics!disk_free_rootfs,less,10:disk_free_tmp,less,20
Anytime you define a single service to monitor multiple entities in Nagios, you run the risk of losing visibility into “compound” problems. For example, a service configured to monitor both /tmp and /var might only notify you of a problem with /tmp, when in fact both partitions have reached critical capacity.
Use the check_host_regex plug-in to check one or more metrics on a regex-defined range of hosts. This plug-in is useful when you want a single alert if a particular metric is critical across a number of hosts.
To use it, copy the check_host_regex.sh script from the Nagios subdirectory of the Ganglia Web tarball to your Nagios plug-ins directory. Make sure that the GANGLIA_URL inside the script is correct. By default, it is:
GANGLIA_URL="http://localhost/ganglia2/nagios/check_host_regex.php"
Next, define a check command in Nagios:
define command {
    command_name check_ganglia_host_regex
    command_line $USER1$/check_host_regex.sh hreg='$ARG1$' checks='$ARG2$'
}
Then add a colon-delimited list of checks. Each check consists of:
metric_name,operator,critical_value
For example, to check free space on / and /tmp for any machine whose name starts with web- or app-, you would use something like this:
check_command check_ganglia_host_regex!^web-|^app-!disk_free_rootfs,less,10:disk_free_tmp,less,10
Combining multiple hosts into a single service check will prevent Nagios from correctly respecting host-based external commands. For example, Nagios will send notifications if a host listed in this type of service check goes critical, even if the user has placed the host in scheduled downtime. Nagios has no way of knowing that the host has anything to do with this service.
Use the check_value_same_everywhere plug-in to verify that one or more metrics on a range of hosts have the same value. For example, let’s say you wanted to make sure the SVN revision of the deployed program listing was the same across all servers. You could send the SVN revision as a string metric and then list it as a metric that needs to be the same everywhere.
To use the plug-in, copy the check_value_same_everywhere.sh script from the Nagios subdirectory of the Ganglia Web tarball to your Nagios plug-ins directory. Make sure that the GANGLIA_URL variable inside the script is correct. By default, it is:
GANGLIA_URL="http://localhost/ganglia2/nagios/check_value_same_everywhere.php"
Then define a check command in Nagios:
define command {
    command_name check_value_same_everywhere
    command_line $USER1$/check_value_same_everywhere.sh hreg='$ARG1$' checks='$ARG2$'
}
For example:
check_command check_value_same_everywhere!^web-|^app-!svn_revision,num_config_files
In Nagios 3.0, the action_url attribute was added to the host and service object definitions. When specified, the action_url attribute creates a small icon in the Nagios UI next to the host or service name to which it corresponds. If a user clicks this icon, the UI directs them to the URL specified by the action_url attribute for that particular object.
If your host and service names are consistent in both Nagios and Ganglia, it’s pretty simple to point any service’s action_url back at Ganglia’s graph.php using built-in Nagios macros, so that when a user clicks the action_url icon for that service in the Nagios UI, he or she is presented with a graph of that service’s metric data. For example, if we had a host called host1, with a service called load_one representing the one-minute load history, we could ask Ganglia to graph it for us with:
http://my.ganglia.box/graph.php?c=cluster1&h=host1&m=load_one&r=hour&z=large
The hiccup, if you didn’t notice, is that Ganglia’s graph.php requires a c= attribute, which must be set to the name of the cluster to which the given host belongs. Nagios has no concept of Ganglia clusters, but it does provide the ability to create custom variables in any object definition. Custom variables must begin with an underscore and are available as macros in any context in which a built-in macro would be available. Here’s an example of a custom variable in a host object definition defining the Ganglia cluster name to which the host belongs:
define host {
    host_name host1
    address 192.168.1.1
    _ganglia_cluster cluster1
    ...
}
Read more about macros in the Nagios documentation.
You can also use custom variables to correct differences between the Nagios and Ganglia namespaces, creating, for example, a _ganglia_service_name macro in the service definition to map a service called “CPU” in Nagios to a metric called “load_one” in Ganglia.
To enable the action_url attribute, we find it expedient to create a template for the Ganglia action_url, like so:
define service {
    name ganglia-service-graph
    action_url http://my.ganglia.host/ganglia/graph.php?c=$_GANGLIA_CLUSTER$&h=$HOSTNAME$&m=$SERVICEDESC$&r=hour&z=large
    register 0
}
This code makes it easy to toggle the action_url graph for some services but not others by including use ganglia-service-graph in the definition of any service that you want to graph. As you can see, the action_url we’ve specified combines the custom _ganglia_cluster macro we defined in the host object with the built-in HOSTNAME and SERVICEDESC macros. If the Nagios service name were not the same as the Ganglia metric name (which is likely the case in real life), we would have defined our own _ganglia_service_name variable in the service definition and referred to that macro in the action_url instead of the SERVICEDESC built-in.
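Putting the pieces together, a service that opts into the graph template might look like this sketch (the generic-service template and the check command are illustrative; the service description must match the Ganglia metric name for the URL to resolve):

```
define service {
    use                  generic-service,ganglia-service-graph
    host_name            host1
    service_description  load_one
    check_command        check_ganglia_metric!load_one!more!5
}
```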
The Nagios UI also supports custom CGI headers and footers, which make it possible to implement rollover pop-ups on the action_url icon containing graphs from Ganglia’s graph.php. This approach requires some custom development on your part and is outside the scope of this book, but we wanted you to know it’s there; if it sounds useful, the Nagios documentation on custom CGI headers and footers is a good place to start.
While it’s running, Ganglia is a great way to aggregate metrics, but when it breaks, locating the cause can be frustrating. Thankfully, there are a number of points to monitor that can help stave off an inconvenient breakage.
Using check_nrpe (or even check_procs directly), the daemons that support Ganglia can be monitored for failures. It is most useful to monitor gmetad and rrdcached on the aggregation hosts and gmond on all hosts. The pertinent snippets for local monitoring of a gmond process are:
define command {
    command_name check_gmond_local
    command_line $USER1$/check_procs -C gmond -c 1:2
}

define service {
    use generic-service
    host_name localhost
    service_description GMOND
    check_command check_gmond_local
}
A more “functional” type of monitoring is checking for connectivity on the listening TCP ports of the various services: gmetad, for example, listens on ports 8651 and 8652, and gmond listens on port 8649. Checking these ports, with a reasonable timeout, gives a reasonably good idea as to whether the daemons are functioning as expected.
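Such port checks can be built from the stock check_tcp plug-in; for example (the command name and the 5-second timeout are our choices):

```
define command {
    command_name check_ganglia_port
    command_line $USER1$/check_tcp -H $HOSTADDRESS$ -p $ARG1$ -t 5
}
```

A service can then probe gmond with check_ganglia_port!8649 and gmetad with check_ganglia_port!8651.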
Cron collection jobs, run by the cron periodic scheduling daemon, are another way of collecting metrics without using gmond modules. Because these scripts are extremely heterogeneous and share little common structure, failures in them can go unnoticed and lead to fairly serious gaps in collection. Such failures can, for the most part, be avoided by following a few basic suggestions:
Use the logger utility in shell scripts, or any of the variety of syslog submission mechanisms available, so that you can see what your scripts are doing, instead of being bombarded by logwatch emails or simply watching collection for certain metrics stop.
Touch a stamp file to allow other monitoring tools to detect the last run of your script; you can then monitor the stamp file for staleness in a standard way. Be wary of permissions issues: test-running a script as a user other than the one who will run it in production can cause silent failures.
Too many cron jobs are written to collect data but assume that “the network is always available,” “the file I’m monitoring exists,” or “some third-party dependency will never fail.” Such assumptions eventually lead to error conditions that either break collection completely or, worse, submit incorrect metrics.
If you’re using netcat, telnet, or other network-facing methods to gather metrics data, there is a possibility that they will fail to return data before the next polling period, potentially causing a pile-up or other nasty behavior. Use common sense to decide how long you should wait for results, then exit gracefully if you haven’t gotten them.
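The stamp-file suggestion above can be sketched as follows; the stamp path and the 15-minute staleness window are our own choices:

```shell
#!/bin/sh
# Stamp-file pattern for a cron collection job.
STAMP="${TMPDIR:-/tmp}/my_collector.stamp"   # hypothetical path

# ... collect and submit metrics here ...

# Touch the stamp only after a fully successful run:
touch "${STAMP}"

# A separate checker (or Nagios plug-in) can flag the job once the stamp
# is older than 15 minutes:
if [ -n "$(find "${STAMP}" -mmin +15 2>/dev/null)" ]; then
    echo "collector stale"
else
    echo "collector fresh"
fi
```

Run immediately after a successful collection, this prints "collector fresh"; if the job stops updating the stamp, the same test flips to stale.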
It can be useful to collect metrics on the backlog and processing statistics of your rrdcached services (if you are using them to speed up your gmetad host). This can be done by querying the rrdcached stats socket and pushing those metrics into Ganglia using gmetric.
Excessive backlogs can be caused by high IO or CPU load on your rrdcached server, so this can be a useful tool to track down rogue cron jobs or other root causes:
#!/bin/bash
# rrdcache-stats.sh
#
# SHOULD BE RUN AS ROOT; OTHERWISE, SUDO RULES NEED TO BE PUT IN PLACE
# TO ALLOW THIS SCRIPT, SINCE THE SOCKET IS NOT ACCESSIBLE BY NORMAL
# USERS!

GMETRIC="/usr/bin/gmetric"
RRDSOCK="unix:/var/rrdtool/rrdcached/rrdcached.sock"
EXPIRE=300

( echo "STATS"; sleep 1; echo "QUIT" ) | socat - $RRDSOCK | grep ':' | \
while read X; do
  K="$( echo "$X" | cut -d: -f1 )"
  V="$( echo "$X" | cut -d: -f2 | tr -d ' ' )"
  $GMETRIC -g rrdcached -t uint32 -n "rrdcached_stat_${K}" -v ${V} -x ${EXPIRE} -d ${EXPIRE}
done