You’ve got data—lots and lots of data that’s just too valuable to delete or take offline for even a minute. Your data is likely made up of a number of different formats, and you know that your data will only grow larger and more complex over time. Don’t fret. The growing pains you’re facing have been faced by other people and there are systems to handle it: Hadoop and HBase.
If you want to use Ganglia to monitor a Hadoop or HBase cluster, I have good news—Ganglia support is built in.
Hadoop is an Apache-licensed open source system modeled after Google’s MapReduce and Google File System (GFS) systems. Hadoop was created by Doug Cutting, who now works as an architect at Cloudera and serves as chair of the Apache Software Foundation. He named Hadoop after his son’s yellow stuffed toy elephant.
With Hadoop, you can grow the size of your filesystem by adding more machines to your cluster. This feature allows you to grow storage incrementally, regardless of whether you need terabytes or petabytes of space. Hadoop also ensures your data is safe by automatically replicating your data to multiple machines. You could remove a machine from your cluster and take it out to a grassy field with a baseball bat to reenact the printer scene from Office Space—and not lose a single byte of data.
The Hadoop MapReduce engine breaks data processing up into smaller units of work and intelligently distributes them across your cluster. The MapReduce APIs allow developers to focus on the question they’re trying to answer instead of worrying about how to handle machine failures—you’re regretting what you did with that bat now, aren’t you? Because data in the Hadoop filesystem is replicated, Hadoop can automatically handle failures by rerunning the computation on a replica, often without the user even noticing.
HBase is an Apache-licensed open source system modeled after Google’s Bigtable. HBase sits on top of the Hadoop filesystem and provides users random, real-time read/write access to their data. You can think of HBase, at a high level, as a flexible and extremely scalable database.
Hadoop and HBase are dynamic systems that are easier to manage when you have the metric visibility that Ganglia can provide. If you’re interested in learning more about Hadoop and HBase, I highly recommend the following books:
White, Tom. Hadoop: The Definitive Guide. O’Reilly Media, 2009.
Sammer, Eric. Hadoop Operations. O’Reilly Media, 2012.
George, Lars. HBase: The Definitive Guide. O’Reilly Media, 2011.
Ganglia’s monitoring daemon (gmond) publishes metrics in a well-defined format. You can configure the Hadoop metric subsystem to publish metrics directly to Ganglia in the format it understands.
The Ganglia wire format changed incompatibly at version 3.1.0. All Ganglia releases from 3.1.0 onward use a new message format; agents prior to 3.1.0 use the original format. Old agents can’t communicate with new agents, and vice versa.
In order to turn on Ganglia monitoring, you must update the hadoop-metrics.properties file in your Hadoop configuration directory. This file is organized into different contexts: jvm, rpc, dfs, mapred, and hbase. You can turn on Ganglia monitoring for one or all contexts. It's up to you.
Example B-1 through Example B-5 are example configuration snippets from each metric context. Your hadoop-metrics.properties file will likely already have these Ganglia configuration snippets in it, although they are commented out by default.
I'm sure you noticed some obvious patterns in these snippets. The prefix of each configuration key is the name of the context (e.g., mapred), and each context has class, period, and servers properties.
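Putting that pattern together, here is a generic template for any context; the <context> placeholder is illustrative (it is not literal syntax) and should be replaced with jvm, rpc, dfs, mapred, or hbase:

```properties
# Hypothetical template -- replace <context> with jvm, rpc, dfs, mapred, or hbase.
# Use GangliaContext instead of GangliaContext31 for Ganglia releases older than 3.1.0.
<context>.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
<context>.period=10
<context>.servers=<gmond-host-or-multicast-address>[:port]
```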
The class specifies what format the metric should be published in. If you are running Ganglia 3.1.0 or newer, this class should be set to org.apache.hadoop.metrics.ganglia.GangliaContext31; otherwise, set the class to org.apache.hadoop.metrics.ganglia.GangliaContext. Some people find the name GangliaContext31 a bit confusing, as it seems to imply that it works only with Ganglia 3.1. Now you know that isn't the case.
The period defines the number of seconds between metric updates. Ten seconds is a good value here.
The servers property is a comma-separated list of unicast or multicast addresses to publish metrics to. If you don't explicitly provide a port number, Hadoop will assume you want the default gmond port: 8649.
If you bind the multicast address using the bind option in gmond.conf, you cannot also send a metric message to the unicast address of the host running gmond. This is a common source of confusion. If you are not receiving Hadoop metrics after setting the servers property, double-check your udp_recv_channel settings in gmond.conf.
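For reference, here is a minimal sketch of the udp_recv_channel section in gmond.conf, assuming the stock multicast defaults; your site's addresses and deployment mode (multicast versus unicast) may differ:

```
/* Sketch of a gmond.conf receive channel using the stock multicast defaults.
   With bind set to the multicast address, gmond will NOT receive metrics
   sent to the host's unicast address. For a unicast deployment, drop the
   mcast_join and bind lines entirely. */
udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
  bind = 239.2.11.71
}
```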
Let’s work through a few samples.
If you wanted to set up Hadoop to publish jvm metrics to three Ganglia 3.0.7 gmond instances running on hosts 10.10.10.1, 10.10.10.2, and 10.10.10.3, with the gmond on 10.10.10.3 running on a nondefault port, 9999, you would drop the following snippet into your hadoop-metrics.properties file:

# Configuration of the "jvm" context for ganglia
jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
jvm.period=10
jvm.servers=10.10.10.1,10.10.10.2,10.10.10.3:9999
If you want to set up Hadoop to publish mapred metrics to the default multicast channel for a Ganglia 3.4.0 gmond, drop the following snippet into your hadoop-metrics.properties file:

# Configuration of the "mapred" context for ganglia
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
mapred.period=10
mapred.servers=239.2.11.71
If you want to set up HBase to publish hbase metrics to a single host with the hostname bigdata.dev.oreilly.com running a Ganglia 3.3.7 gmond, use the following snippet:

# Configuration of the "hbase" context for ganglia
hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
hbase.period=10
hbase.servers=bigdata.dev.oreilly.com
For your metric configuration changes to take effect, you must restart your Hadoop and HBase services.
Once you have Hadoop/HBase properly configured to publish Ganglia metrics (Table B-1), you will see the metrics in your gmond XML and graphs will appear in your Ganglia web console for each metric:
$ telnet 10.10.10.1 8649
<METRIC NAME="jvm.DataNode.metrics.gcCount" VAL="4" ...
<METRIC NAME="jvm.NameNode.metrics.logError" VAL="0" ...
<METRIC NAME="jvm.NameNode.metrics.maxMemoryM" VAL="888.9375" ...
<METRIC NAME="jvm.DataNode.metrics.logWarn" VAL="9" ...
<METRIC NAME="jvm.DataNode.metrics.memHeapUsedM" VAL="2.4761734" ...
<METRIC NAME="jvm.DataNode.metrics.threadsWaiting" VAL="8" ...
...
At this point, you know how to turn on monitoring for any Hadoop context.
The Hadoop JVM metrics cover garbage collection, memory use, thread states, and the number of logging events for each service: DataNode, NameNode, SecondaryNameNode, JobTracker, and TaskTracker.
The Hadoop RPC metrics will give you information about the number of open connections, processing times, number of operations, and authentication successes and failures.
The Hadoop DFS metrics provide information about data block operations (read, removed, replicated, verify, written), verification failures, bytes read and written, volume failures, and local/remote client reads and writes.
The Hadoop MapReduce metrics provide information about the number of map and reduce slots used, shuffle failures, and tasks completed.