Chapter 7. Ganglia and Nagios

Vladimir Vuksan

Jeff Buchbinder

Dave Josephsen

It’s been said that specialization is for insects, which although poetic, isn’t exactly true. Nature abounds with examples of specialization in just about every biological kingdom, from mitochondria to clownfish. The most extreme examples are a special kind of specialization, which biologists refer to as symbiosis.

You’ve probably come across some examples of biological symbiosis at one time or another. Some are quite famous, like the clownfish and the anemone. Others, like the fig wasp, are less so, but the general idea is always the same: two organisms, finding that they can rely on each other, buddy up. Buddies have to work less and can focus more on what they’re good at. In this way, symbiosis begets more specialization, and the individual specializations grow to complement each other.

Effective symbiotes are complementary in the sense that there isn’t much functional overlap between them. The beneficial abilities of one buddy stop pretty close to where those of the other begin, and vice versa. They are also complementary in the sense that their individual specializations combine to create a solution that would be impossible otherwise. Together the pair become something more than the sum of their parts.

It would surprise us to learn that you’d never heard of Nagios. It is probably the most popular open source monitoring system in existence today, and is generally credited with, if not inventing, then certainly perfecting the centralized polling model employed by myriad monitoring systems both commercial and free. Nagios has been imitated, forked, reinvented, and commercialized, but in our opinion, it’s never been beaten, and it remains the yardstick by which all monitoring systems are measured.

It is not, however, a valid yardstick by which to measure Ganglia, because the two are not in fact competitors, but symbiotes, and the admin who makes the mistake of choosing one over the other is doing himself a disservice. It is not only possible, but advisable to use them together to achieve the best of both worlds. To that end, we’ve included this chapter to help you understand the best options available for Nagios interoperability.

Sending Nagios Data to Ganglia

Under the hood, Nagios is really just a special-purpose scheduling and notification engine. By itself, it can’t monitor anything. All it can do is schedule the execution of little programs referred to as plug-ins and take action based on their output.

Nagios plug-ins return one of four states: 0 for “OK,” 1 for “Warning,” 2 for “Critical,” and 3 for “Unknown.” The Nagios daemon can be configured to react to these return codes, notifying administrators via email or SMS, for example. In addition to the codes, the plug-ins can also return a line of text, which will be captured by the daemon, written to a log, and displayed in the UI. If the daemon finds a pipe character in the text returned by a plug-in, the first part is treated normally, and the second part is treated as performance data.

Performance data doesn’t really mean anything to Nagios; it won’t, for example, enforce any rules on it or interpret it in any way. The text after the pipe might be a chili recipe, for all Nagios knows. The important point is that Nagios can be configured to handle the post-pipe text differently than pre-pipe text, thereby providing a hook from which to obtain metrics from the monitored hosts and pass those metrics to external systems (like Ganglia) without affecting the human-readable summary provided by the pre-pipe text.
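
As a contrived sketch of that contract (the plug-in, its output text, and the metric name are made up for illustration), a plug-in is just a program that prints a single line and exits with one of the four codes:

#!/bin/bash
#everything before the pipe is the human-readable summary;
#everything after it is performance data
echo "WIDGETS OK - 42 widgets processed|widgets=42"
#0=OK, 1=Warning, 2=Critical, 3=Unknown
exit 0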

Nagios’s performance data handling feature is an important hook. There are quite a few Nagios add-ons that use it to export metrics from Nagios for the purpose of importing them into local RRDs. These systems typically point the service_perfdata_command attribute in nagios.cfg to a script that uses a series of regular expressions to parse out the metric names and values and then import them into the proper RRDs. The same methodology can easily be used to push metrics from Nagios to Ganglia by pointing service_perfdata_command to a script that runs gmetric instead of the RRDtool import command.

First, you must enable performance data processing in Nagios by setting process_performance_data=1 in the nagios.cfg file. Then you can specify the name of the command to which Nagios should pass all performance data it encounters using the service_perfdata_command attribute.
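
In other words, the relevant nagios.cfg fragment ends up looking something like this (the command name matches the pushToGanglia definition we build below):

#nagios.cfg
process_performance_data=1
service_perfdata_command=pushToGanglia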

Let’s walk through a simple example. Imagine a check_ping plug-in that, when executed by the Nagios scheduler, pings a host and then returns the following output:

PING OK - Packet loss = 0%, RTA = 0.40 ms|0;0.40

We want to capture this plug-in’s performance data, along with details we’ll need to pass to gmetric, including the name of the target host. Once process_performance_data is enabled, we’ll tell Nagios to execute our own shell script every time a plug-in returns performance data by setting service_perfdata_command=pushToGanglia in nagios.cfg. Then we’ll define pushToGanglia in the Nagios object configuration like so:

define command{
command_name    pushToGanglia
command_line  /usr/local/bin/pushToGanglia.sh 
"$LASTSERVICECHECK$||$HOSTNAME$||$SERVICEDESC$||$SERVICEOUTPUT$||$SERVICEPERFDATA$"
}

Careful with those delimiters!

With so many Nagios plug-ins, written by so many different authors, it’s important to choose your delimiter carefully and avoid any character that a plug-in might itself return. In our example command, we chose double pipes as the delimiter, which can be awkward to parse in some languages. The tilde (~) character is another good choice.

The capitalized words surrounded by dollar signs in the command definition are Nagios macros. Using macros, we can request all sorts of interesting details about the check result from the Nagios daemon, including the nonperformance data section of the output returned from the plug-in. The Nagios daemon will substitute these macros for their respective values at runtime, so when Nagios runs our pushToGanglia command, our input will wind up looking something like this:

1338674610||dbaHost14.foo.com||PING||PING OK - Packet loss = 0%, RTA = 0.40 ms||0;0.40

Our pushToGanglia.sh script takes this input and compares it against a series of regular expressions to figure out what sort of data it is. When the input matches the PING regex, the script parses out the relevant metrics and pushes them to Ganglia using gmetric. It looks something like this:

#!/bin/bash
while read IN
do
    #check for output from the check_ping plug-in
    if [ "$(awk -F '[|][|]' '$3 ~ /^PING$/' <<<"${IN}")" ]
    then

        #this looks like check_ping output all right, parse out what we need
        read BOX CMDNAME PERFOUT <<<"$(awk -F '[|][|]' '{print $2" "$3" "$5}' <<<"${IN}")"
        read PING_LOSS PING_MS <<<"$(tr ';' ' ' <<<"${PERFOUT}")"

        #Ok, we have what we need. Send it to Ganglia.
        gmetric -S "${BOX}" -n "${CMDNAME}_MS" -t float -v "${PING_MS}"
        gmetric -S "${BOX}" -n "${CMDNAME}_LOSS" -t float -v "${PING_LOSS}"

    #check for output from the check_cpu plug-in
    elif [ "$(awk -F '[|][|]' '$3 ~ /^CPU$/' <<<"${IN}")" ]
    then
        #do the same sort of thing but with CPU data
        :
    fi
#Nagios passes the macro string as this script's first argument
done <<<"$1"

This is a popular solution because it’s self-documenting, keeps all of the metrics collection logic in a single file, detects new hosts without any additional configuration, and works with any kind of Nagios check result, including passive checks. It does, however, add a nontrivial amount of load to the Nagios server. Consider that every time you add a new check, the result of that check for every host must be run through the pushToGanglia script’s regexes. The same is true when you add a new host, or even a new regex to the pushToGanglia script itself. In Nagios, process_performance_data is a global setting, and so are the ramifications that come with enabling it.

It probably makes sense to process performance data globally if you rely heavily on Nagios for metrics collection. However, for the reasons we outlined in Chapter 1, we don’t think that’s a good idea. If you’re using Ganglia along with Nagios, gmond is the better-evolved symbiote for collecting the normal litany of performance metrics. It’s more likely that you’ll want to use gmond to collect the majority of your performance metrics, and less likely that you’ll want Nagios churning through the result of every single check in case there might be some metrics you’re interested in sending over to Ganglia.

If you’re interested in metrics from only a few Nagios plug-ins, consider leaving process_performance_data disabled and instead writing “wrappers” for the interesting plug-ins. Here, for example, is what a wrapper for the check_ping plug-in might look like:

#!/bin/bash

ORIG_PLUGIN='/usr/libexec/check_ping_orig'
CMDNAME='PING'

#get the target host from the -H option
while getopts ":H:" opt
do
    if [ "${opt}" == 'H' ]
    then
        BOX=${OPTARG}
    fi
done

#run the original plug-in with the given options, and capture its output
OOUT=$("${ORIG_PLUGIN}" "$@")
OEXIT=$?

#parse out the perfdata we need
read PING_LOSS PING_MS <<<"$(echo "${OOUT}" | cut -d'|' -f2 | tr ';' ' ')"

#send the metrics to Ganglia
gmetric -S "${BOX}" -n "${CMDNAME}_MS" -t float -v "${PING_MS}"
gmetric -S "${BOX}" -n "${CMDNAME}_LOSS" -t float -v "${PING_LOSS}"

#mimic the original plug-in's output back to Nagios
echo "${OOUT}"
exit ${OEXIT}

Note

The wrapper approach takes a huge burden off the Nagios daemon but is more difficult to track. If you don’t carefully document your changes to the plug-ins, you’ll mystify other administrators, and upgrades to the Nagios plug-ins will break your data collection efforts.

The general strategy is to replace the check_ping plug-in with a small shell script that calls the original check_ping, intercepts its output, and sends the interesting metrics to Ganglia. The imposter script then reports back to Nagios with the output and exit code of the original plug-in, and Nagios has no idea that anything extra has transpired. This approach has several advantages, the biggest of which is that you can pick and choose which plug-ins will process performance data.
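
Installing the wrapper is just a matter of renaming the original plug-in and dropping the script into its place. A minimal sketch, assuming the plug-ins live in /usr/libexec as in the wrapper above (the wrapper’s source filename is hypothetical):

cd /usr/libexec
#keep the original under the name the wrapper expects
mv check_ping check_ping_orig
#the wrapper now answers to check_ping
install -m 755 /usr/local/src/check_ping_wrapper.sh check_ping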

Monitoring Ganglia Metrics with Nagios

Because Nagios has no built-in means of polling data from remote hosts, Nagios users have historically employed various remote execution schemes to collect a litany of metrics with the goal of comparing them against static thresholds. These metrics, such as the available disk space or CPU utilization of a host, are usually collected by add-ons such as NRPE, which executes scripts on the monitored systems at the Nagios server’s behest, or NSCA, which accepts check results pushed from the monitored systems; either way, the results come back in the standard Nagios way. The metrics themselves, once returned, are usually discarded or, in some cases, fed into RRDs by the Nagios daemon in the manner described previously.

This arrangement is expensive, especially considering that most of the metrics administrators tend to collect with NRPE and NSCA are collected by gmond out of the box. If you’re using Ganglia, it’s much cheaper to point Nagios at Ganglia to collect these metrics.

To that end, the Ganglia project began including a series of official Nagios plug-ins in gweb versions as of 2.2.0. These plug-ins enable Nagios users to create services that compare metrics stored in Ganglia against alert thresholds defined in Nagios. This is, in our opinion, a huge win for administrators, in many cases enabling them to scrap entirely their Nagios NSCA infrastructure, speed up the execution time of their service checks, and greatly reduce the monitoring burden on both Nagios and the monitored systems themselves.

There are five Ganglia plug-ins currently available:

  1. Check heartbeat.

  2. Check a single metric on a specific host.

  3. Check multiple metrics on a specific host.

  4. Check multiple metrics across a regex-defined range of hosts.

  5. Verify that one or more metric values are the same across a set of hosts.

Principle of Operation

The plug-ins interact with a series of gweb PHP scripts that were created expressly for this purpose (see Figure 7-1). The check_host_regex.sh plug-in, for example, talks to the PHP script at http://your.gweb.box/nagios/check_host_regex.php. Each PHP script takes the arguments passed from the plug-in, parses a cached copy of the XML dump of the grid state obtained from gmetad’s xml_port (see the discussion of gmetad for details on xml_port), retrieves the current metric values for the requested entities, and returns a Nagios-style status code. Before the server-side PHP scripts can be used, you must enable them and define the location and refresh interval of the XML grid state cache by setting the following parameters in the gweb conf.php file:

$conf['nagios_cache_enabled'] = 1;
$conf['nagios_cache_file'] = $conf['conf_dir'] . "/nagios_ganglia.cache";
$conf['nagios_cache_time'] = 45;
Figure 7-1. Plug-in principle of operation

Consider storing the cache file on a RAMDisk or tmpfs to increase performance.
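
For example, on a Linux gweb host where /dev/shm is a tmpfs mount, you might point the cache there (the path is an assumption; adjust it to your system):

$conf['nagios_cache_file'] = "/dev/shm/nagios_ganglia.cache";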

Beware: Numerous parallel checks

If you define a service check in Nagios to use hostgroups instead of individual hosts, Nagios will schedule that service check for every host in the hostgroup at roughly the same time. If gweb’s grid state cache expires or is being refreshed while that burst of checks is still running, you can end up with a cache-related race condition. To avoid it, use the warmup_metric_cache.sh script in the web/nagios subdirectory of the gweb tarball, which will ensure that your cache is always fresh.
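
A minimal sketch of keeping the cache warm from cron, assuming the gweb nagios scripts were installed under /usr/share/ganglia-webfrontend (adjust the path and user to match your installation):

# /etc/cron.d/ganglia-nagios-cache
* * * * * nagios /usr/share/ganglia-webfrontend/nagios/warmup_metric_cache.sh >/dev/null 2>&1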

Check Heartbeat

Internally, Ganglia uses a heartbeat counter to determine whether a machine is up. This counter is reset every time a new metric packet is received for the host, so you can safely use this plug-in in lieu of the Nagios check_ping plug-in. To use it, first copy the check_heartbeat.sh script from the Nagios subdirectory in the Ganglia Web tarball to your Nagios plug-ins directory. Make sure that the GANGLIA_URL inside the script is correct. By default, it is set to:

GANGLIA_URL="http://localhost/ganglia2/nagios/check_heartbeat.php"

Next, define the check command in Nagios. The threshold is the maximum acceptable age, in seconds, of the last reported heartbeat; a threshold of 50, for example, means the check goes critical if no heartbeat packet has been received in the last 50 seconds:

define command {
  command_name  check_ganglia_heartbeat
  command_line  $USER1$/check_heartbeat.sh host=$HOSTADDRESS$ threshold=$ARG1$
}

Now, for every host or host group you want monitored, change check_command to:

check_command  check_ganglia_heartbeat!50
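
As a sketch, using the heartbeat check as the host check for the dbaHost14.foo.com host from our earlier example might look like this (the address and host template name are assumptions):

define host {
  use            generic-host
  host_name      dbaHost14.foo.com
  address        192.168.1.14
  check_command  check_ganglia_heartbeat!50
}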

Check a Single Metric on a Specific Host

The check_ganglia_metric plug-in compares a single metric on a given host against a predefined Nagios threshold. To use it, copy the check_ganglia_metric.sh script from the Nagios subdirectory in the Ganglia Web tarball to your Nagios plug-ins directory. Make sure that the GANGLIA_URL inside the script is correct. By default, it is set to:

GANGLIA_URL="http://localhost/ganglia2/nagios/check_metric.php"

Next, define the check command in Nagios like so:

define command {
  command_name  check_ganglia_metric
  command_line  $USER1$/check_ganglia_metric.sh host=$HOSTADDRESS$
  metric_name=$ARG1$ operator=$ARG2$ critical_value=$ARG3$
}

Next, add the check command to the service checks for any hosts you want monitored. For instance, if you wanted to be alerted when the 1-minute load average for a given host goes above 5, add the following directive:

check_command			check_ganglia_metric!load_one!more!5

To be alerted when the disk space for a given host falls below 10 GB, add:

check_command			check_ganglia_metric!disk_free!less!10
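
These check_command lines live in ordinary Nagios service definitions. As a sketch, a free disk space service for a hypothetical host1 might look like this:

define service {
  use                  generic-service
  host_name            host1
  service_description  disk_free
  check_command        check_ganglia_metric!disk_free!less!10
}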

Operators denote criticality

The operators specified in the Nagios definitions for the Ganglia plug-ins always describe the critical state. The notequal operator, for example, means the state is critical if the value is not equal to the one specified.

Check Multiple Metrics on a Specific Host

The check_multiple_metrics plug-in is an alternate implementation of the check_ganglia_metric script that can check multiple metrics on the same host. For example, instead of configuring separate checks for free disk space on /, /tmp, and /var, which could produce three separate alerts, you could set up a single check that alerts any time free space on any of those filesystems falls below its threshold.

To use it, copy the check_multiple_metrics.sh script from the Nagios subdirectory of the Ganglia Web tarball to your Nagios plug-ins directory. Make sure that the variable GANGLIA_URL in the script is correct. By default, it is set to:

GANGLIA_URL="http://localhost/ganglia2/nagios/check_multiple_metrics.php"

Then define a check command in Nagios:

define command {
  command_name  check_ganglia_multiple_metrics
  command_line  $USER1$/check_multiple_metrics.sh host=$HOSTADDRESS$ checks='$ARG1$'
}

Then add a list of checks that are delimited with a colon. Each check consists of:

metric_name,operator,critical_value

For example, the following service would monitor free disk space on the root (/) and /tmp filesystems:

check_command check_ganglia_multiple_metrics!disk_free_rootfs,less,10:disk_free_tmp,less,20

Beware: Aggregated services

Anytime you define a single service to monitor multiple entities in Nagios, you run the risk of losing visibility into “compound” problems. For example, a service configured to monitor both /tmp and /var might only notify you of a problem with /tmp, when in fact both partitions have reached critical capacity.

Check Multiple Metrics on a Range of Hosts

Use the check_host_regex plug-in to check one or more metrics on a regex-defined range of hosts. This plug-in is useful when you want to get a single alert if a particular metric is critical across a number of hosts.

To use it, copy the check_host_regex.sh script from the Nagios subdirectory in the Ganglia Web tarball to your Nagios plug-ins directory. Make sure that the GANGLIA_URL inside the script is correct. By default, it is:

GANGLIA_URL="http://localhost/ganglia2/nagios/check_host_regex.php"

Next, define a check command in Nagios:

define command {
  command_name  check_ganglia_host_regex
  command_line  $USER1$/check_host_regex.sh hreg='$ARG1$' checks='$ARG2$'
}

Then add a list of checks that are delimited with a colon. Each check consists of:

metric_name,operator,critical_value

For example, to check free space on / and /tmp for any machine starting with web-* or app-* you would use something like this:

check_command check_ganglia_host_regex!^web-|^app-!disk_free_rootfs,less,10:disk_free_tmp,less,10

Beware: Multiple hosts in a single service

Combining multiple hosts into a single service check will prevent Nagios from correctly respecting host-based external commands. For example, Nagios will send notifications if a host listed in this type of service check goes critical, even if the user has placed the host in scheduled downtime. Nagios has no way of knowing that the host has anything to do with this service.

Verify that a Metric Value Is the Same Across a Set of Hosts

Use the check_value_same_everywhere plug-in to verify that one or more metrics have the same value across a range of hosts. For example, let’s say you wanted to make sure that the SVN revision of the deployed code was the same across all servers. You could send the SVN revision as a string metric and then list it as a metric that needs to be the same everywhere.

To use the plug-in, copy the check_value_same_everywhere.sh script from the Nagios subdirectory of the Ganglia Web tarball to your Nagios plug-ins directory. Make sure that the GANGLIA_URL variable inside the script is correct. By default, it is:

GANGLIA_URL="http://localhost/ganglia2/nagios/check_value_same_everywhere.php"

Then define a check command in Nagios:

define command {
  command_name  check_value_same_everywhere
  command_line  $USER1$/check_value_same_everywhere.sh hreg='$ARG1$' checks='$ARG2$'
}

For example:

check_command check_value_same_everywhere!^web-|^app-!svn_revision,num_config_files

Displaying Ganglia Data in the Nagios UI

In Nagios 3.0, the action_url attribute was added to the host and service object definitions. When specified, the action_url attribute creates a small icon in the Nagios UI next to the host or service name to which it corresponds. If a user clicks this icon, the UI will direct them to the URL specified by the action_url attribute for that particular object.

If your host and service names are consistent in both Nagios and Ganglia, it’s pretty simple to point any service’s action_url back to Ganglia’s graph.php using built-in Nagios macros so that when a user clicks on the action_url icon for that service in the Nagios UI, he or she is presented with a graph of that service’s metric data. For example, if we had a host called host1, with a service called load_one representing the one-minute load history, we could ask Ganglia to graph it for us with:

http://my.ganglia.box/graph.php?c=cluster1&h=host1&m=load_one&r=hour&z=large

The hiccup, if you didn’t notice, is that Ganglia’s graph.php requires a c= attribute, which must be set to the name of the cluster to which the given host belongs. Nagios has no concept of Ganglia clusters, but it does provide the ability to create custom variables in any object definition. Custom variables must begin with an underscore and are available as macros in any context where a built-in macro would be available; Nagios exposes them by prefixing the object type, so a host variable named _ganglia_cluster becomes the macro $_HOSTGANGLIA_CLUSTER$. Here’s an example of a custom variable in a host object definition specifying the Ganglia cluster to which the host belongs:

define host{
	host_name		host1
	address		192.168.1.1
	_ganglia_cluster	cluster1
	...
}

Note

Read more about Nagios macros in the official Nagios documentation.

You can also use custom variables to correct differences between the Nagios and Ganglia namespaces, creating, for example, a _ganglia_service_name macro in the service definition to map a service called “CPU” in Nagios to a metric called “load_one” in Ganglia.

To enable the action_url attribute, we find it expedient to create a template for the Ganglia action_url, like so:

define service {
   name       ganglia-service-graph
   action_url http://my.ganglia.host/ganglia/graph.php?c=$_HOSTGANGLIA_CLUSTER$&
              h=$HOSTNAME$&m=$SERVICEDESC$&r=hour&z=large
   register   0
}

This template makes it easy to toggle the action_url graph for some services but not others: simply include use ganglia-service-graph in the definition of any service that you want to graph. As you can see, the action_url we’ve specified combines the custom _ganglia_cluster variable we defined in the host object (referenced as $_HOSTGANGLIA_CLUSTER$) with the hostname and servicedesc built-in macros. If the Nagios service name were not the same as the Ganglia metric name (which is likely the case in real life), we would define our own _ganglia_service_name variable in the service definition and refer to that macro ($_SERVICEGANGLIA_SERVICE_NAME$) in the action_url instead of the servicedesc built-in.
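
Putting the pieces together, a service that checks load_one through the Ganglia plug-in and links back to the corresponding graph might look like the following sketch (the host name and templates are carried over from the earlier examples):

define service {
  use                  generic-service,ganglia-service-graph
  host_name            host1
  service_description  load_one
  check_command        check_ganglia_metric!load_one!more!5
}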

The Nagios UI also supports custom CGI headers and footers, which make it possible to display rollover pop ups of Ganglia graph.php graphs when a user hovers over the action_url icon. This approach requires some custom development on your part and is outside the scope of this book, but we wanted you to know it’s there. If that sounds like a useful feature to you, we suggest reading up on Nagios’s custom CGI headers and footers.

Monitoring Ganglia with Nagios

When Ganglia is running, it’s a great way to aggregate metrics; when it breaks, tracking down the cause of the breakage can be frustrating. Thankfully, there are a number of points you can monitor to help stave off an inconvenient breakage.

Monitoring Processes

Using check_nrpe (or check_procs directly, on the Nagios server itself), you can monitor the daemons that support Ganglia for failures. It is most useful to monitor gmetad and rrdcached on the aggregation hosts and gmond on all hosts. The pertinent snippets for local monitoring of a gmond process are:

define command {
  command_name    check_gmond_local
  command_line    $USER1$/check_procs -C gmond -c 1:2
  }

define service {
  use                       generic-service
  host_name                 localhost
  service_description       GMOND
  check_command             check_gmond_local
  }
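
For hosts other than the Nagios server itself, the same process check can be run remotely through check_nrpe. A minimal sketch, assuming NRPE is already deployed and that the plug-in path matches your distribution:

#on the Nagios server:
define command {
  command_name    check_gmond_remote
  command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_gmond
  }

#in nrpe.cfg on the monitored host:
command[check_gmond]=/usr/lib/nagios/plugins/check_procs -C gmond -c 1:2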

Monitoring Connectivity

A more “functional” type of monitoring is checking connectivity on the TCP ports where the various Ganglia services listen: gmetad, for example, listens on ports 8651 and 8652, and gmond listens on port 8649. Checking these ports with a reasonable timeout gives a good indication of whether the daemons are functioning as expected.
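
Nagios ships with a check_tcp plug-in that works well for this. A sketch of port checks for gmond and for gmetad’s xml_port might look like the following (the command names are our own):

define command {
  command_name    check_gmond_port
  command_line    $USER1$/check_tcp -H $HOSTADDRESS$ -p 8649 -t 5
  }

define command {
  command_name    check_gmetad_xml_port
  command_line    $USER1$/check_tcp -H $HOSTADDRESS$ -p 8651 -t 5
  }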

Monitoring cron Collection Jobs

cron collection jobs, which are run by the cron periodic scheduling daemon, are another way of collecting metrics without using gmond modules. Because these scripts are extremely heterogeneous and share little common structure, failures in them can slip by unnoticed and turn into fairly serious gaps in collection. Most of these failures can be avoided by following a few basic suggestions.

Log, but not too much.

Using the logger utility in shell scripts, or any of the variety of syslog submission facilities available, lets you see what your scripts are doing, rather than being bombarded by logwatch emails or simply noticing that collection for certain metrics has stopped.

Use “last run” files.

Touch a stamp file so that other monitoring tools can detect the last run of your script; you can then watch the stamp file for staleness in a standard way (see the sketches following this list). Be wary of permissions issues: test-running a script as a user other than the one who will run it in production can cause silent failures.

Expect bad data.

Too many cron collection jobs assume things like “the network is always available,” “the file I’m monitoring exists,” or “some third-party dependency will never fail.” These assumptions eventually lead to error conditions that either break collection completely or, worse, submit incorrect metrics.

Use timeouts.

If you’re using netcat, telnet, or other network-facing tools to gather metric data, there is a possibility that they will fail to return before the next polling period, potentially causing a pile-up or other nasty behavior. Use common sense to decide how long you should wait for results, then exit gracefully if you haven’t gotten them, as in the sketch below.
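
Two minimal sketches of these last two suggestions, with hypothetical paths, hostnames, and thresholds: a stamp file watched by the stock check_file_age plug-in, and a timeout-guarded network fetch using the coreutils timeout utility.

#at the end of a cron collection script, record a successful run:
touch /var/run/my_collector.stamp

#a Nagios command that goes critical when the stamp goes stale:
define command {
  command_name    check_collector_stamp
  command_line    $USER1$/check_file_age -w 600 -c 1200 -f /var/run/my_collector.stamp
  }

#inside a collection script, give the remote gmond 10 seconds to answer,
#then give up gracefully:
RAW=$(timeout 10 nc some.metrics.host 8649 2>/dev/null)
if [ -z "${RAW}" ]; then
    logger -t my_collector "timed out polling some.metrics.host; skipping this run"
    exit 0
fi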

Collecting rrdcached Metrics

It can be useful to collect backlog and processing statistics from your rrdcached services (if you are using rrdcached to speed up your gmetad host). This can be done by querying the rrdcached stats socket and pushing those values into Ganglia using gmetric.

Excessive backlogs can be caused by high I/O or CPU load on your rrdcached server, so this can be a useful tool for tracking down rogue cron jobs or other root causes:

#!/bin/bash
# rrdcache-stats.sh
#
# SHOULD BE RUN AS ROOT, OTHERWISE SUDO RULES NEED TO BE PUT IN PLACE
# TO ALLOW THIS SCRIPT, SINCE THE SOCKET IS NOT ACCESSIBLE BY NORMAL
# USERS!

GMETRIC="/usr/bin/gmetric"
RRDSOCK="unix:/var/rrdtool/rrdcached/rrdcached.sock"
EXPIRE=300

( echo "STATS"; sleep 1; echo "QUIT" ) | 
  socat - $RRDSOCK | 
  grep ':' | 
  while read X; do
    K="$( echo "$X" | cut -d: -f1 )"
    V="$( echo "$X" | cut -d: -f2 )"
    $GMETRIC -g rrdcached -t uint32 -n "rrdcached_stat_${K}" -v ${V} -x ${EXPIRE} 
    -d ${EXPIRE} | 
  done