The following tables describe the metric modules that are part of the distribution of the Ganglia monitoring system. In addition to these metric modules, there are also a number of other modules that are available through the Ganglia module git repository. As new modules are developed, many developers share them with the Ganglia community through the Ganglia module repository. The Ganglia module git repository is open to the public, and the modules are free to download and use. Some of the additional modules that are available from this repository include modules for monitoring an Apache Web Server, MySQL database, and Xen virtual machine, as well as Tomcat and Jetty servlet monitoring through JMX.
Prior to the introduction of Mod_MultiCPU, gmond was able to produce only a single CPU-related value for each of the various CPU metrics that it reported. If the hardware architecture supported multiple CPUs, gmond reported only the overall usage rather than the usage for each individual CPU. The Mod_MultiCPU module is capable of detecting how many CPUs exist on the system and constructs the series of metric definitions for each one (Table A-1). Through the configuration of Mod_MultiCPU, all CPU-related metrics can be reported for each CPU detected on the system.
Metric Name | Description |
multicpu_user | Percentage of CPU utilization that occurred while executing at the user level |
multicpu_nice | Percentage of CPU utilization that occurred while executing at the nice level |
multicpu_system | Percentage of CPU utilization that occurred while executing at the system level |
multicpu_idle | Percentage of CPU utilization that occurred while executing at the idle level |
multicpu_wio | Percentage of CPU utilization that occurred while executing at the wio level |
multicpu_intr | Percentage of CPU utilization that occurred while executing at the intr level |
multicpu_sintr | Percentage of CPU utilization that occurred while executing at the sintr level |
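In gmond.conf, loading the module and collecting its per-CPU metrics follows the usual modules/collection_group pattern. A sketch along these lines (the module name, shared-object path, and thresholds shown here are typical values and assumptions, not guaranteed to match every build):

```
# gmond.conf sketch: load Mod_MultiCPU and collect its per-CPU metrics
modules {
  module {
    name = "multicpu_module"
    path = "modmulticpu.so"
  }
}

collection_group {
  collect_every = 20
  time_threshold = 90
  # name_match picks up the per-CPU series: multicpu_user0, multicpu_user1, ...
  metric {
    name_match = "multicpu_([a-z]+)([0-9]+)"
    value_threshold = 1.0
  }
}
```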
The Mod_GStatus module (Table A-2) started out as a metric module debugging tool when the modular interface was first introduced into gmond. The purpose of this module was to detect and report all of the metric gathering, data packet sends, and receives as gmond was running. In other words, as gmond is capable of monitoring every aspect of the system, why shouldn’t gmond also monitor itself? Some of the metrics that Mod_GStatus reports are the number of metadata packets sent and received and the number of value packets sent and received, as well as the overall totals. It also tracks any failures in the system. If gmond is incapable of sending or receiving a packet of any kind, Mod_GStatus will report these failures as well.
Metric Name | Description |
gmond_pkts_recvd_value | Number of metric value packets received |
gmond_pkts_recvd_metadata | Number of metric metadata packets received |
gmond_pkts_sent_value | Number of metric value packets sent |
gmond_pkts_recvd_failed | Number of metric packets failed to receive |
gmond_pkts_sent_metadata | Number of metric metadata packets sent |
gmond_pkts_recvd_request | Number of metric metadata packet requests received (multicast only) |
gmond_pkts_sent_request | Number of metric metadata packet requests sent (multicast only) |
gmond_version | gmond version |
gmond_pkts_recvd_all | Total number of metric packets received |
gmond_pkts_sent_all | Total number of metric packets sent |
gmond_version_full | gmond full version |
gmond_pkts_recvd_ignored | Number of metric packets received that were ignored |
The Multidisk module (Table A-3) was introduced as one of the new metric gathering modules for many of the same reasons as Mod_MultiCPU. In previous versions of gmond, the metrics that reported disk space added up the totals for all disks and reported this value for total disk space and used space. The Multidisk module provided a way to report disk usage metrics for each individual disk rather than a total of all disks on the system.
The memcached module (Table A-4) introduced a way to take a closer look at what is actually happening under the hood of a memcached caching server. The standard memory metrics report only overall memory usage and totals for the system and provide no detail about how a cache such as memcached is behaving. This module dives a little deeper into how the cache memory is being used and can help to point out caching inefficiencies.
Metric Name | Description |
&lt;metric prefix&gt;_curr_items | Current number of items stored |
&lt;metric prefix&gt;_cmd_get | Cumulative number of retrieval reqs |
&lt;metric prefix&gt;_cmd_set | Cumulative number of storage reqs |
&lt;metric prefix&gt;_bytes_read | Total number of bytes read by this server from network |
&lt;metric prefix&gt;_bytes_written | Total number of bytes sent by this server to network |
&lt;metric prefix&gt;_bytes | Current number of bytes used to store items |
&lt;metric prefix&gt;_limit_maxbytes | Number of bytes this server is allowed to use for storage |
&lt;metric prefix&gt;_curr_connections | Number of open connections |
&lt;metric prefix&gt;_evictions | Number of valid items removed from cache to free memory for new items |
&lt;metric prefix&gt;_get_hits | Number of keys that have been requested and found present |
&lt;metric prefix&gt;_get_misses | Number of items that have been requested and not found |
&lt;metric prefix&gt;_get_hits_rate | Hits per second |
&lt;metric prefix&gt;_get_misses_rate | Misses per second |
&lt;metric prefix&gt;_cmd_get_rate | Gets per second |
&lt;metric prefix&gt;_cmd_set_rate | Sets per second |
&lt;metric prefix&gt;_cmd_set_hits | Number of keys that have been stored and found present |
&lt;metric prefix&gt;_cmd_set_misses | Number of items that have been stored and not found |
&lt;metric prefix&gt;_cmd_delete | Cumulative number of delete reqs |
&lt;metric prefix&gt;_cmd_delete_hits | Number of keys that have been deleted and found present |
&lt;metric prefix&gt;_cmd_delete_misses | Number of items that have been deleted and not found |
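These values come from memcached's own stats command, whose reply is a series of STAT &lt;name&gt; &lt;value&gt; lines terminated by END. A minimal, hypothetical sketch of that parsing step in Python (the actual module's internals may differ):

```python
def parse_memcached_stats(raw):
    """Parse the text reply of memcached's `stats` command into a dict.

    Each line looks like `STAT curr_items 42`; the reply ends with `END`.
    """
    stats = {}
    for line in raw.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "STAT":
            key, value = parts[1], parts[2]
            # Most counters are integers; leave anything else as a string.
            try:
                stats[key] = int(value)
            except ValueError:
                stats[key] = value
    return stats

# Example reply fragment, as memcached would return it over its socket:
reply = "STAT curr_items 42\r\nSTAT cmd_get 1005\r\nSTAT version 1.4.15\r\nEND\r\n"
print(parse_memcached_stats(reply)["cmd_get"])  # -> 1005
```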
The TcpConn metric module (Table A-5) provides a way to look at TCP network connections in an effort to detect problems or misconfiguration. By monitoring the TCP connection activity on the system, this module can help point out issues that may affect network latency or prevent data from being sent or received efficiently. This module also introduced a new pattern for writing Python metric modules that involve threading and caching. The TcpConn module relies heavily on the Linux netstat utility to acquire TCP metric data, and because gmond is not a multithreaded daemon, the module must not delay the gmond gathering process with the latency of calling an external utility. To avoid any such latency, the TcpConn module starts up its own gathering thread, which is then free to invoke the netstat utility as required. By invoking netstat from within a thread, the module is able to gather the TCP connection-related values without delaying the gmond gathering process. As the metrics are gathered within the thread, the values are stored in a shared cache that can be read quickly whenever gmond asks the module for its metric values. Introducing threads through Python is a very convenient way to make the single-threaded gmond daemon act as if it were multithreaded.
Metric Name | Description |
tcp_established | Total number of established TCP connections |
tcp_listen | Total number of listening TCP connections |
tcp_timewait | Total number of time_wait TCP connections |
tcp_closewait | Total number of close_wait TCP connections |
tcp_synsent | Total number of syn_sent TCP connections |
tcp_synrecv | Total number of syn_recv TCP connections |
tcp_synwait | Total number of syn_wait TCP connections |
tcp_finwait1 | Total number of fin_wait1 TCP connections |
tcp_finwait2 | Total number of fin_wait2 TCP connections |
tcp_closed | Total number of closed TCP connections |
tcp_lastack | Total number of last_ack TCP connections |
tcp_closing | Total number of closing TCP connections |
tcp_unknown | Total number of unknown TCP connections |
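The threading-and-caching pattern described above can be sketched in Python roughly as follows. The class and method names here are illustrative, not the TcpConn module's actual API: a background thread performs the slow external call on its own schedule, while the metric callback only reads a lock-protected cache.

```python
import threading
import time


class TcpConnGatherer(threading.Thread):
    """Illustrative sketch of a gmond-style background gatherer.

    The thread polls a slow source (e.g., a wrapper around netstat) and
    stores results in a shared cache, so the single-threaded gmond
    callback never blocks waiting on an external process.
    """

    def __init__(self, poll_fn, interval=20):
        super().__init__(daemon=True)
        self.poll_fn = poll_fn      # slow external call, e.g. netstat parsing
        self.interval = interval    # seconds between polls
        self._lock = threading.Lock()
        self._cache = {}
        self._stop = threading.Event()

    def run(self):
        while not self._stop.is_set():
            values = self.poll_fn()          # the slow call happens here
            with self._lock:
                self._cache.update(values)
            self._stop.wait(self.interval)   # sleep, but wake early on stop()

    def metric_handler(self, name):
        # Called on gmond's thread; returns instantly from the cache.
        with self._lock:
            return self._cache.get(name, 0)

    def stop(self):
        self._stop.set()


# Usage sketch with a stub poll function standing in for netstat:
gatherer = TcpConnGatherer(lambda: {"tcp_established": 12}, interval=1)
gatherer.start()
time.sleep(0.2)
print(gatherer.metric_handler("tcp_established"))
gatherer.stop()
```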
There are certain types of metrics aggregation that can’t easily be accomplished by using only gmond and gmetric submission. The most notable of these are “derivative” values and “counters”: derivative values require collection and aggregation over time, and counters are less than optimally collected by gmond on its own.
It is worth pointing out that Ganglia does have a way of dealing with some derivative values. Metrics submitted using a “positive” slope generate RRDs that are created as COUNTERs; however, this mechanism is not ideal for situations involving incrementing values that submit on each iteration (e.g., Apache httpd page-serving counts without log-scraping).
One of the solutions for dealing with counter values is statsd, created by the folks at Etsy. It is written in Node.js, though quite a few ports and clones are available. Table A-6 lists some of those available at the time of writing.
Software | Language | Description |
statsd | Node.js | https://github.com/etsy/statsd. It should be noted that the original statsd implementation does not have Ganglia/gmetric submission support without an additional module (the statsd-ganglia-backend npm module described below). |
statsd-go | Go | https://github.com/jbuchbinder/statsd-go. This implementation is a fork of the gographite port of statsd, which did not have Ganglia/gmetric submission support at the time of writing. |
py-statsd | Python | https://github.com/sivy/py-statsd |
Ruby statsd | Ruby | https://github.com/fetep/ruby-statsdserver |
statsd-c | C | https://github.com/jbuchbinder/statsd-c |
The statsd protocol is relatively simple, and most statsd servers ship with an example client for submitting stats.
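For illustration, each statsd datagram is a plain-text UDP packet of the form &lt;bucket&gt;:&lt;value&gt;|&lt;type&gt; (with c for counters, ms for timers, g for gauges, and an optional |@&lt;rate&gt; sample-rate suffix). A minimal client sketch in Python (the default port 8125 is statsd's conventional listening port):

```python
import socket


def statsd_packet(bucket, value, metric_type="c", sample_rate=None):
    """Format one statsd datagram: <bucket>:<value>|<type>[|@<rate>]."""
    packet = "%s:%s|%s" % (bucket, value, metric_type)
    if sample_rate is not None:
        packet += "|@%s" % sample_rate
    return packet


def send_stat(bucket, value, metric_type="c", host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send to a statsd server."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(statsd_packet(bucket, value, metric_type).encode(),
                (host, port))
    sock.close()


print(statsd_packet("pages.served", 1))              # -> pages.served:1|c
print(statsd_packet("render.time", 320, "ms", 0.1))  # -> render.time:320|ms|@0.1
```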
In addition, there is another piece of software called VDED, which can be used to track ever-increasing values.
Most statsd instances are configured similarly for submitting metrics to Ganglia. The important consideration is how that data is represented in your Ganglia instance: statsd and its clones have no particular notion of “host,” so each statsd instance submits metrics that will all be associated with a single Ganglia host.
The original statsd instance requires the additional Ganglia npm module to be installed, using npm install statsd-ganglia-backend. It can then be configured by adding statsd-ganglia-backend to the array of backends and configuring the ganglia config key in your statsd configuration file:
{
  ganglia: {
      host: "127.0.0.1"              // hostname/IP of gmond instance
    , port: 8649                     // port of gmond instance
    , spoof: "10.0.0.1:myhost.mynet" // ganglia spoof string
    , useHost: "myhost.mynet"        // hostname to present to ganglia
  }
}
VDED has some of the same constraints as statsd, except that it is not constrained to submit all values as if they belonged to a single host. The optional “spoof” parameter for submitting metrics to VDED allows different spoof arguments to be associated with different tracked metrics. It is a good idea to remember that VDED aggregation is limited to the cluster of which the receiving gmond instance is a member.
The command-line arguments for VDED, which are managed in /etc/vded/config on RHEL installations, are the following switches:
--ghost=(ganglia host) --gport=(ganglia port)
--gspoof=(default spoof)
rrdcached is a high-performance RRD caching daemon that allows a large number of RRD files to be maintained by a gmetad instance without the high I/O load associated with reading and writing those files to and from disk. It can be controlled via a command socket and is distributed with the standard RRDtool packages for most Linux distributions.
Note that rrdcached may be unnecessary if you’re using a ramdisk to store your RRD files.
The rrdcached package can be installed on Debian-based distributions (Debian, Ubuntu, Linux Mint, etc.) by using apt:
$ sudo apt-get -y install rrdcached
For Red Hat/RHEL-based distributions (Red Hat/RHEL, Fedora, CentOS, etc.), rrdcached can be installed via the rrdtool package, which was probably installed already for gmetad to function properly:
$ sudo yum install -y rrdtool
gmetad can be configured to use rrdcached by setting the RRDCACHED_ADDRESS variable in the configuration file included by gmetad’s init script. For Red Hat distributions, this is /etc/sysconfig/gmetad, and for Debian distributions, it is /etc/default/gmetad. For local sockets, the format unix:/PATH/TO/SOCKET should be used to specify the address parameter.
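For example, on a Red Hat system the setting might look like this (the socket path shown is the one used in the socat example later in this section; adjust it to your installation):

```
# /etc/sysconfig/gmetad (Red Hat) or /etc/default/gmetad (Debian)
RRDCACHED_ADDRESS="unix:/var/rrdtool/rrdcached/rrdcached.sock"
```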
Along with the gmetad configuration change (which will require a restart of any running gmetad processes), it is also recommended that a change be made to the Ganglia web frontend, which will force the frontend to also use rrdcached for forming graphs. Enabling rrdcached support in the web frontend is done by setting the configuration variable $conf['rrdcached_socket'] to the value of gmetad’s RRDCACHED_ADDRESS.
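Assuming the same local socket as above, the frontend setting would look something like this (the configuration file name varies by installation):

```php
// in the Ganglia web frontend configuration (e.g., conf.php)
$conf['rrdcached_socket'] = "unix:/var/rrdtool/rrdcached/rrdcached.sock";
```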
There are a number of useful operations that can be performed by using telnet, netcat, or socat (depending on whether you have a network or Unix socket set up as the control socket). For example, a FLUSHALL command forces the rrdcached daemon to flush all RRD data to disk as soon as it can:
$ echo "FLUSHALL" | sudo socat - unix:/var/rrdtool/rrdcached/rrdcached.sock
$ echo "FLUSHALL" | nc -v -w 3 localhost 42217
There are several things that can go awry with an rrdcached Ganglia installation, primarily because an extra layer of complexity has been added.
Make sure that the permissions on the rrdcached socket file are permissive enough to allow both the gmetad service user and the web server user to be able to read and write. Failures to communicate via the socket will be visible in gmetad’s log.
rrdcached uses a series of journal files to cache changes to RRD files before it writes them to disk. Heavy load on the server hosting the rrdcached instance may result in a backlog of metrics that have not yet been written to disk. (Note that this does not indicate that the metrics have been dropped, but rather that rrdcached has not yet written them to their final location.)
Individual metrics can be flushed to disk by using the rrdcached socket and issuing a FLUSH command followed by the full pathname of the target RRD file. This will bring the specified RRD file to the top of the rrdcached job queue. Alternatively, a full flush to disk of all queued RRD updates can be initiated by sending FLUSHALL instead.
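For example, assuming the socket path from earlier and a typical gmetad RRD layout of /var/lib/ganglia/rrds/&lt;cluster&gt;/&lt;host&gt;/&lt;metric&gt;.rrd (the cluster, host, and metric names below are placeholders):

```
$ echo "FLUSH /var/lib/ganglia/rrds/MyCluster/myhost/cpu_user.rrd" | \
    sudo socat - unix:/var/rrdtool/rrdcached/rrdcached.sock
```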
gmond-debug is a useful tool for debugging inbound gmetric-formatted traffic. It can be used to debug any of the third-party gmetric libraries or to track down unusual gmetric behavior.
To install gmond-debug, run the following commands:
$ git clone git://github.com/ganglia/ganglia_contrib.git
Cloning into 'ganglia_contrib'...
remote: Counting objects: 479, done.
remote: Compressing objects: 100% (302/302), done.
remote: Total 479 (delta 200), reused 434 (delta 156)
Receiving objects: 100% (479/479), 1.10 MiB | 908 KiB/s, done.
Resolving deltas: 100% (200/200), done.
$ cd ganglia_contrib/gmond-debug
$ . source.env
$ for i in gems/cache/*.gem; do gem install $i; done
Successfully installed dante-0.1.3
1 gem installed
Installing ri documentation for dante-0.1.3...
Installing RDoc documentation for dante-0.1.3...
Successfully installed diff-lcs-1.1.3
...
Successfully installed uuid-2.3.5
1 gem installed
Installing ri documentation for uuid-2.3.5...
Installing RDoc documentation for uuid-2.3.5...
$
Now that you have gmond-debug installed, starting the service is straightforward.
$ . source.env
$ ./bin/gmond-debug
Starting gmond-zmq service...
With the following options:
{:zmq_port=>7777,
:host=>"127.0.0.1",
:verbose=>false,
:pid_path=>"/var/run/gmond-zmq.pid",
:gmond_host=>"127.0.0.1",
:test_zmq=>false,
:log_path=>false,
:gmond_port=>8649,
:debug=>true,
:gmond_interval=>0,
:zmq_host=>"127.0.0.1",
:port=>1234}
Now accepting gmond udp connections on address 127.0.0.1, port 1234...
To test using gmond-debug, point your gmetric submission software at this machine on port 1234. As UDP packets arrive on port 1234, gmond-debug will attempt to decode them and print a serialized version of the information they contain.