Appendix A. Advanced Metric Configuration and Debugging

Module Metric Definitions

The following tables describe the metric modules that are included in the Ganglia distribution. A number of additional modules are available through the public Ganglia module git repository, where developers share new modules with the community; these modules are free to download and use. Among them are modules for monitoring an Apache web server, a MySQL database, and Xen virtual machines, as well as Tomcat and Jetty servlet monitoring through JMX.

Mod_MultiCPU

Prior to the introduction of Mod_MultiCPU, gmond was able to produce only a single CPU-related value for each of the various CPU metrics that it reported. If the hardware architecture supported multiple CPUs, gmond reported only the overall usage rather than the usage for each individual CPU. The Mod_MultiCPU module detects how many CPUs exist on the system and constructs a series of metric definitions for each one (Table A-1). Once Mod_MultiCPU is configured, all CPU-related metrics can be reported for each CPU detected on the system.

Table A-1. Mod_MultiCPU: monitor individual CPU metrics
Metric Name | Description
multicpu_user | Percentage of CPU utilization that occurred while executing at the user level
multicpu_nice | Percentage of CPU utilization that occurred while executing at the nice level
multicpu_system | Percentage of CPU utilization that occurred while executing at the system level
multicpu_idle | Percentage of CPU utilization that occurred while executing at the idle level
multicpu_wio | Percentage of CPU utilization that occurred while executing at the wio level
multicpu_intr | Percentage of CPU utilization that occurred while executing at the intr level
multicpu_sintr | Percentage of CPU utilization that occurred while executing at the sintr level

Mod_GStatus

The Mod_GStatus module (Table A-2) started out as a debugging tool for the metric module interface when it was first introduced into gmond. Its purpose was to detect and report all metric gathering and all data packet sends and receives while gmond was running. In other words, since gmond is capable of monitoring every aspect of the system, why shouldn’t gmond also monitor itself? Among the metrics that Mod_GStatus reports are the number of metadata packets sent and received and the number of value packets sent and received, as well as overall totals. It also tracks failures: if gmond is unable to send or receive a packet of any kind, Mod_GStatus reports those failures as well.

Table A-2. Mod_GStatus: monitor gmond metrics
Metric Name | Description
gmond_pkts_recvd_value | Number of metric value packets received
gmond_pkts_recvd_metadata | Number of metric metadata packets received
gmond_pkts_sent_value | Number of metric value packets sent
gmond_pkts_recvd_failed | Number of metric packets failed to receive
gmond_pkts_sent_metadata | Number of metric metadata packets sent
gmond_pkts_recvd_request | Number of metric metadata packet requests received (multicast only)
gmond_pkts_sent_request | Number of metric metadata packet requests sent (multicast only)
gmond_version | gmond version
gmond_pkts_recvd_all | Total number of metric packets received
gmond_pkts_sent_all | Total number of metric packets sent
gmond_version_full | gmond full version
gmond_pkts_recvd_ignored | Number of metric packets received that were ignored

Multidisk

The Multidisk module (Table A-3) was introduced as one of the new metric gathering modules for many of the same reasons as Mod_MultiCPU. In previous versions of gmond, the disk space metrics added up the totals for all disks and reported single values for total and used disk space. The Multidisk module reports disk usage metrics for each individual disk rather than a total across all disks on the system.

Table A-3. Multidisk (Python module): report disk available and disk used space for each individual disk device
Metric Name | Description
<device name>_disk_total | Available disk space for each disk device
<device name>_disk_used | Amount of disk space used for each disk device

memcached

The memcached module (Table A-4) provides a way to take a closer look at what is actually happening under the hood of a memcached caching server. The standard memory metrics report only overall memory usage and totals for the system and do not provide any further detail about how memcached is managing its memory. This module dives a little deeper into how that memory is being used and can help point out memory inefficiencies.

Table A-4. memcached (Python module)
Metric Name | Description
<metric prefix>_curr_items | Current number of items stored
<metric prefix>_cmd_get | Cumulative number of retrieval reqs
<metric prefix>_cmd_set | Cumulative number of storage reqs
<metric prefix>_bytes_read | Total number of bytes read by this server from network
<metric prefix>_bytes_written | Total number of bytes sent by this server to network
<metric prefix>_bytes | Current number of bytes used to store items
<metric prefix>_limit_maxbytes | Number of bytes this server is allowed to use for storage
<metric prefix>_curr_connections | Number of open connections
<metric prefix>_evictions | Number of valid items removed from cache to free memory for new items
<metric prefix>_get_hits | Number of keys that have been requested and found present
<metric prefix>_get_misses | Number of items that have been requested and not found
<metric prefix>_get_hits_rate | Hits per second
<metric prefix>_get_misses_rate | Misses per second
<metric prefix>_cmd_get_rate | Gets per second
<metric prefix>_cmd_set_rate | Sets per second
<metric prefix>_cmd_set_hits | Number of keys that have been stored and found present
<metric prefix>_cmd_set_misses | Number of items that have been stored and not found
<metric prefix>_cmd_delete | Cumulative number of delete reqs
<metric prefix>_cmd_delete_hits | Number of keys that have been deleted and found present
<metric prefix>_cmd_delete_misses | Number of items that have been deleted and not found

TcpConn

The TcpConn metric module (Table A-5) provides a way to look at TCP network connections in an effort to detect problems or misconfiguration. By monitoring TCP connection activity on the system, the module can help point out issues that affect network latency or prevent data from being sent or received efficiently. This module also introduced a new pattern for writing Python metric modules that involve threading and caching. TcpConn relies heavily on the Linux netstat utility to acquire TCP metric data, and because gmond itself is not a multithreaded daemon, invoking an external utility directly could delay gmond’s gathering cycle. To avoid that latency, the TcpConn module starts its own gathering thread, which is free to invoke netstat as required. By invoking netstat from a thread, the module gathers the TCP connection values without delaying the gmond gathering process. As the metrics are gathered within the thread, the values are stored in a shared cache that can be read quickly whenever gmond asks the module for its metric values. Introducing threads through Python is a convenient way to make the single-threaded gmond daemon act as if it were multithreaded.
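To make the pattern concrete, here is a minimal sketch of a threaded, caching Python metric module. It is not the actual TcpConn source; it assumes the standard gmond Python module interface (metric_init/metric_cleanup plus a per-metric callback), and the refresh interval, metric names, and netstat parsing shown are illustrative only.

# Sketch of the thread-plus-cache pattern described above (not the TcpConn source).
import subprocess
import threading

_cache = {'tcp_established': 0, 'tcp_timewait': 0}
_cache_lock = threading.Lock()
_stop = threading.Event()

def _gather_loop(interval=20):
    """Background thread: run netstat periodically and refresh the shared cache."""
    while not _stop.is_set():
        counts = dict.fromkeys(_cache, 0)
        output = subprocess.Popen(['netstat', '-tn'],
                                  stdout=subprocess.PIPE).communicate()[0]
        for line in output.splitlines():
            if isinstance(line, bytes):
                line = line.decode('ascii', 'replace')
            if line.endswith('ESTABLISHED'):
                counts['tcp_established'] += 1
            elif line.endswith('TIME_WAIT'):
                counts['tcp_timewait'] += 1
        with _cache_lock:
            _cache.update(counts)
        _stop.wait(interval)

def _handler(name):
    """gmond callback: return the cached value immediately, never blocking on netstat."""
    with _cache_lock:
        return _cache.get(name, 0)

def metric_init(params):
    worker = threading.Thread(target=_gather_loop)
    worker.daemon = True              # do not keep gmond alive during shutdown
    worker.start()
    template = {'call_back': _handler, 'time_max': 90, 'value_type': 'uint',
                'units': 'connections', 'slope': 'both', 'format': '%u',
                'groups': 'network'}
    return [dict(template, name=name, description=name) for name in _cache]

def metric_cleanup():
    _stop.set()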

Table A-5. TcpConn (Python module): monitor TCP connection states
Metric NameDescription
tcp_establishedTotal number of established TCP connections
tcp_listenTotal number of listening TCP connections
tcp_timewaitTotal number of time_wait TCP connections
tcp_closewaitTotal number of close_wait TCP connections
tcp_synsentTotal number of syn_sent TCP connections
tcp_synrecvTotal number of syn_recv TCP connections
tcp_synwaitTotal number of syn_wait TCP connections
tcp_finwait1Total number of fin_wait1 TCP connections
tcp_finwait2Total number of fin_wait2 TCP connections
tcp_closedTotal number of closed TCP connections
tcp_lastackTotal number of last_ack TCP connections
tcp_closingTotal number of closing TCP connections
tcp_unknownTotal number of unknown TCP connections

Advanced Metrics Aggregation and You

There are certain types of metrics aggregation that can’t easily be accomplished using only gmond and gmetric submission. The most notable of these are “derivative” values and “counters”: derivative values require collection and aggregation over time, and counters are collected less than optimally through plain gmetric submission.

It is worth pointing out that Ganglia does have a way of dealing with some derivative values: metrics submitted using a “positive” slope generate RRDs that are created as COUNTERs. However, this mechanism is not ideal for situations involving incrementing values that are submitted on each iteration (e.g., Apache httpd page-serving counts without log scraping).
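For reference, the slope is chosen at submission time; a gmetric invocation along the following lines (the metric name and value are purely illustrative) submits a positive-slope metric, which gmetad will back with a COUNTER RRD:

$ gmetric --name=httpd_accesses --value=14302 --type=uint32 --units=requests --slope=positive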

One solution for dealing with counter values is statsd, created by the folks at Etsy. It is written in Node.js, though quite a few ports and clones are available; Table A-6 lists some of those available at the time of writing.

Table A-6. statsd implementations
Software | Language | Description
statsd | Node.js | https://github.com/etsy/statsd. Note that the original statsd implementation does not support Ganglia/gmetric submission without an additional module, available as the statsd-ganglia-backend npm module (see “statsd” under “Configuring statsd”).
statsd-go | Go | https://github.com/jbuchbinder/statsd-go. A fork of the gographite port of statsd, which did not have Ganglia/gmetric submission support at the time of writing.
py-statsd | Python | https://github.com/sivy/py-statsd
Ruby statsd | Ruby | https://github.com/fetep/ruby-statsdserver
statsd-c | C | https://github.com/jbuchbinder/statsd-c

The protocol for statsd is relatively simple, and most of the statsd servers come with an example client for submitting metrics.
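For reference, a statsd metric is a single line of the form metricname:value|type sent over UDP, so a counter can be incremented with nothing more than nc (this assumes statsd is listening on its default UDP port, 8125; the metric name is illustrative):

$ echo "pages.served:1|c" | nc -u -w 1 127.0.0.1 8125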

In addition, there is another piece of software called VDED, which can be used to track ever-increasing values.

Configuring statsd

Most statsd instances are configured similarly for submitting metrics to Ganglia. The important consideration is how that data will be represented in your Ganglia instance. For example, statsd and its clones have no particular notion of “host,” so each statsd instance submits metrics that will all be associated with a single Ganglia host.

statsd

The original statsd implementation requires the additional Ganglia npm module, installed with npm install statsd-ganglia-backend. It can then be configured by adding statsd-ganglia-backend to the array of backends and setting the ganglia config key in your statsd configuration file:

{
  ganglia: {
      host: "127.0.0.1"              // hostname/IP of gmond instance
    , port: 8649                     // port of gmond instance
    , spoof: "10.0.0.1:myhost.mynet" // ganglia spoof string
    , useHost: "myhost.mynet"        // hostname to present to ganglia
  }
}
          

statsd-c

statsd-c can be configured to submit values to Ganglia by specifying:

-G (ganglia host) -g (ganglia port) -S (spoof string)

as part of the starting command line for statsd-c.
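A hypothetical invocation might therefore look like the following (the host, port, and spoof string are illustrative, and any statsd-c listener options are omitted):

$ statsd-c -G 127.0.0.1 -g 8649 -S "10.0.0.1:myhost.mynet"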

py-statsd

py-statsd can be configured to submit values to Ganglia by specifying:

--transport="ganglia" --ganglia_host="localhost" --ganglia_port=8649
--ganglia_spoof_host="statd:statd"

as part of the starting command line for py-statsd.

Configuring VDED

VDED has some of the same constraints as statsd, except that it is not limited to submitting all values as if they belonged to a single host: the optional “spoof” parameter allows a different spoof string to be associated with each tracked metric. Keep in mind that VDED aggregation is limited to the cluster of which the receiving gmond instance is a member.

The command-line arguments for VDED, which on RHEL installations are managed in /etc/vded/config, include the following switches for Ganglia submission:

--ghost=(ganglia host) --gport=(ganglia port) --gspoof=(default spoof)
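A concrete invocation might look like the following (the vded binary name and the values shown are assumptions for illustration):

$ vded --ghost=127.0.0.1 --gport=8649 --gspoof="10.0.0.1:myhost.mynet"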

rrdcached

rrdcached is a high-performance RRD caching daemon, which allows a larger number of RRD files to be maintained by a gmetad instance without the higher IO load associated with reading/writing those files to and from disk. It can be controlled via a command socket and is distributed with the standard RRDtool packages for most Linux distributions.

Note that rrdcached may be unnecessary if you’re using a ramdisk to store your RRD files.

Installing

The rrdcached package can be installed on Debian-based distributions (Debian, Ubuntu, Linux Mint, etc.) by using apt:

 $ sudo apt-get -y install rrdcached

For Red Hat/RHEL-based distributions (Red Hat/RHEL, Fedora, CentOS, etc.), rrdcached can be installed via the rrdtool package, which was probably installed already for gmetad to function properly:

 $ sudo yum install -y rrdtool

Configuring gmetad for rrdcached

gmetad can be configured to use rrdcached by setting the RRDCACHED_ADDRESS variable in the configuration file included by gmetad’s init script. For Red Hat distributions, this is /etc/sysconfig/gmetad, and for Debian distributions, it is /etc/default/gmetad. For local sockets, the format unix:/PATH/TO/SOCKET should be used to specify the address parameter.

Along with the gmetad configuration change (which will require a restart of any running gmetad processes), it is also recommended that a change be made to the Ganglia web frontend, which will force the frontend to also use rrdcached for forming graphs. Enabling rrdcached support in the web frontend is done by setting the configuration variable $conf['rrdcached_socket'] to the value of gmetad’s RRDCACHED_ADDRESS.
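Assuming rrdcached is listening on the Unix socket used in the socat example later in this appendix, the two settings might look like the following (the web frontend setting goes in the frontend's local configuration file, e.g., conf.php; the paths are illustrative):

# /etc/sysconfig/gmetad (Red Hat) or /etc/default/gmetad (Debian)
RRDCACHED_ADDRESS="unix:/var/rrdtool/rrdcached/rrdcached.sock"

# Ganglia web frontend configuration (e.g., conf.php)
$conf['rrdcached_socket'] = "unix:/var/rrdtool/rrdcached/rrdcached.sock";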

Controlling rrdcached

There are a number of useful operations that can be performed by using telnet, netcat, or socat (depending on whether you have a network or Unix socket set up as the control socket). For example, a FLUSHALL command forces the rrdcached daemon to flush all RRD data to disk as soon as it can:

$ echo "FLUSHALL" | sudo socat - unix:/var/rrdtool/rrdcached/rrdcached.sock
$ echo "FLUSHALL" | nc -v -w 3 localhost 42217

Troubleshooting

There are several things that can go awry with an rrdcached Ganglia installation, primarily because an extra layer of complexity has been added.

Permissions

Make sure that the permissions on the rrdcached socket file are permissive enough to allow both the gmetad service user and the web server user to be able to read and write. Failures to communicate via the socket will be visible in gmetad’s log.

Delays in metrics

rrdcached uses a series of event logs to cache changes to RRD files before it writes them to disk. Heavy load on the server hosting the rrdcached instance may result in a backlog of metrics that have not yet been written to disk. (Note that this does not mean the metrics have been dropped, only that rrdcached has not yet written them to their final location.)

Individual metrics can be flushed to disk by using the rrdcached socket and issuing a FLUSH command followed by the full pathname to the target RRD file. This will bring the specified RRD file to the top of the rrdcached job queue. Alternatively, a full flush to disk of all queued RRD updates can be initiated by sending a FLUSHALL instead.
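For example, to flush a single metric's RRD through the same Unix socket used earlier (the RRD path shown is illustrative; use the actual path under gmetad's rrd_rootdir):

$ echo "FLUSH /var/lib/ganglia/rrds/MyCluster/myhost/load_one.rrd" | sudo socat - unix:/var/rrdtool/rrdcached/rrdcached.sock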

Debugging with gmond-debug

gmond-debug is a useful tool for debugging inbound gmetric-formatted traffic. It can be used to debug any of the third-party gmetric libraries or to track down unusual gmetric behavior.

To install gmond-debug, run the following commands:

$ git clone git://github.com/ganglia/ganglia_contrib.git          
Cloning into 'ganglia_contrib'...
remote: Counting objects: 479, done.
remote: Compressing objects: 100% (302/302), done.
remote: Total 479 (delta 200), reused 434 (delta 156)
Receiving objects: 100% (479/479), 1.10 MiB | 908 KiB/s, done.
Resolving deltas: 100% (200/200), done.
$ cd ganglia_contrib/gmond-debug
$ . source.env
$ for i in gems/cache/*.gem; do gem install $i; done
Successfully installed dante-0.1.3
1 gem installed
Installing ri documentation for dante-0.1.3...
Installing RDoc documentation for dante-0.1.3...
Successfully installed diff-lcs-1.1.3
...
Successfully installed uuid-2.3.5
1 gem installed
Installing ri documentation for uuid-2.3.5...
Installing RDoc documentation for uuid-2.3.5...
$
      

Now that you have gmond-debug installed, starting the service is straightforward.

$ . source.env
$ ./bin/gmond-debug
Starting gmond-zmq service...
With the following options:
{:zmq_port=>7777,
:host=>"127.0.0.1",
:verbose=>false,
:pid_path=>"/var/run/gmond-zmq.pid",
:gmond_host=>"127.0.0.1",
:test_zmq=>false,
:log_path=>false,
:gmond_port=>8649,
:debug=>true,
:gmond_interval=>0,
:zmq_host=>"127.0.0.1",
:port=>1234}
Now accepting gmond udp connections on address 127.0.0.1, port 1234...

To test gmond-debug, point your gmetric submission software at this machine on port 1234. As UDP packets arrive on port 1234, gmond-debug will attempt to decode them and print a serialized version of the information they contain.
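One way to do this with the stock gmetric client is to point it at a gmond.conf whose send channel targets the debug listener; a minimal sketch, with the file path and metric purely illustrative:

# /tmp/gmond-debug.conf -- send channel only, pointed at gmond-debug
udp_send_channel {
  host = 127.0.0.1
  port = 1234
}

Then submit a test metric against that configuration:

$ gmetric --conf=/tmp/gmond-debug.conf --name=test_metric --value=42 --type=uint32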
