Appendix B. Ganglia and Hadoop/HBase

You’ve got data—lots and lots of data that’s just too valuable to delete or take offline for even a minute. Your data is likely made up of a number of different formats, and you know that your data will only grow larger and more complex over time. Don’t fret. The growing pains you’re facing have been faced by other people, and there are systems built to handle them: Hadoop and HBase.

If you want to use Ganglia to monitor a Hadoop or HBase cluster, I have good news—Ganglia support is built in.

Introducing Hadoop and HBase

Hadoop is an Apache-licensed open source system modeled after Google’s MapReduce and Google File System (GFS) systems. Hadoop was created by Doug Cutting, who now works as an architect at Cloudera and serves as chair of the Apache Software Foundation. He named Hadoop after his son’s yellow stuffed toy elephant.

With Hadoop, you can grow the size of your filesystem by adding more machines to your cluster. This feature allows you to grow storage incrementally, regardless of whether you need terabytes or petabytes of space. Hadoop also ensures your data is safe by automatically replicating your data to multiple machines. You could remove a machine from your cluster and take it out to a grassy field with a baseball bat to reenact the printer scene from Office Space—and not lose a single byte of data.

The Hadoop MapReduce engine breaks data processing up into smaller units of work and intelligently distributes them across your cluster. The MapReduce APIs allow developers to focus on the question they’re trying to answer instead of worrying about how to handle machine failures—you’re regretting what you did with that bat now, aren’t you? Because data in the Hadoop filesystem is replicated, Hadoop can automatically handle failures by rerunning the computation on a replica, often without the user even noticing.

HBase is an Apache-licensed open source system modeled after Google’s Bigtable. HBase sits on top of the Hadoop filesystem and provides users random, real-time read/write access to their data. You can think of HBase, at a high level, as a flexible and extremely scalable database.

Hadoop and HBase are dynamic systems that are easier to manage when you have the metric visibility that Ganglia can provide. If you’re interested in learning more about Hadoop and HBase, both projects have excellent dedicated books and official documentation.

Configuring Hadoop and HBase to Publish Metrics to Ganglia

Ganglia’s monitoring daemon (gmond) publishes metrics in a well-defined format. You can configure the Hadoop metric subsystem to publish metrics directly to Ganglia in the format it understands.

Warning

The Ganglia wire format changed incompatibly at version 3.1.0. All Ganglia releases from 3.1.0 onward use a new message format; agents prior to 3.1.0 use the original format. Old agents can’t communicate with new agents, and vice versa.

To turn on Ganglia monitoring, update the hadoop-metrics.properties file in your Hadoop configuration directory. This file is organized into different contexts: jvm, rpc, dfs, mapred, and hbase. You can turn on Ganglia monitoring for one or all contexts. It’s up to you.

Example B-1 through Example B-5 are example configuration snippets from each metric context. Your hadoop-metrics.properties file will likely already have these Ganglia configuration snippets in it, although they are commented out by default.

Example B-1. Hadoop Java Virtual Machine (jvm) context

# Configuration of the "jvm" context for ganglia
jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
jvm.period=10
jvm.servers=localhost:8649
Example B-2. Hadoop Remote Procedure Call (rpc) context

# Configuration of the "rpc" context for ganglia
rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext
rpc.period=10
rpc.servers=localhost:8649
Example B-3. Hadoop Distributed File System (dfs) context

# Configuration of the "dfs" context for ganglia
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=localhost:8649
Example B-4. Hadoop MapReduce (mapred) context

# Configuration of the "mapred" context for ganglia
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period=10
mapred.servers=localhost:8649
Example B-5. HBase (hbase) context

# Configuration of the "hbase" context for ganglia
hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext
hbase.period=10
hbase.servers=localhost:8649

I’m sure you noticed some obvious patterns from these snippets. The prefix of the configuration keys is the name of the context (e.g., mapred), and each context has class, period, and servers properties.

class

The class property specifies the wire format in which the context’s metrics are published. If you are running Ganglia 3.1.0 or newer, set the class to org.apache.hadoop.metrics.ganglia.GangliaContext31; otherwise, set it to org.apache.hadoop.metrics.ganglia.GangliaContext. Some people find the name GangliaContext31 a bit confusing because it seems to imply that the class works only with Ganglia 3.1. Now you know that isn’t the case.

period

The period defines the number of seconds between metric updates. Ten seconds is a good value here.

servers

The servers property is a comma-separated list of unicast or multicast addresses to which metrics are published. If you don’t explicitly provide a port number, Hadoop assumes you want the default gmond port: 8649.
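To make the port-defaulting behavior concrete, here is a minimal Python sketch of how a servers value can be split into host/port pairs, falling back to 8649 when no port is given. This is illustrative only, not Hadoop’s actual parsing code:

```python
DEFAULT_GMOND_PORT = 8649  # gmond's default listen port

def parse_servers(servers):
    """Split a Hadoop-style servers value into (host, port) pairs.

    Entries without an explicit port fall back to the default gmond
    port, mirroring the behavior described above.
    """
    targets = []
    for entry in servers.split(","):
        host, sep, port = entry.strip().partition(":")
        targets.append((host, int(port) if sep else DEFAULT_GMOND_PORT))
    return targets

print(parse_servers("10.10.10.1,10.10.10.2,10.10.10.3:9999"))
# [('10.10.10.1', 8649), ('10.10.10.2', 8649), ('10.10.10.3', 9999)]
```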

Note

If you bind the multicast address using the bind option in gmond.conf, you cannot also send a metric message to the unicast address of the host running gmond. This is a common source of confusion. If you are not receiving Hadoop metrics after setting the servers property, double-check your gmond udp_recv_channel setting in gmond.conf.
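For reference, a matching receive channel in gmond.conf might look like the following sketch. The multicast address and port shown are the Ganglia defaults; your deployment may differ:

```
/* gmond.conf: receive channel for the default multicast group.
   If you instead use the bind option with a unicast setup, the
   Hadoop servers property must point at that same bound address. */
udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
}
```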

Let’s work through a few samples.

If you wanted to set up Hadoop to publish jvm metrics to three Ganglia 3.0.7 gmond instances running on hosts 10.10.10.1, 10.10.10.2, and 10.10.10.3 with the gmond on 10.10.10.3 running on a nondefault port, 9999, you would drop the following snippet into your hadoop-metrics.properties file:

# Configuration of the "jvm" context for ganglia
jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
jvm.period=10
jvm.servers=10.10.10.1,10.10.10.2,10.10.10.3:9999

If you want to set up Hadoop to publish mapred metrics to the default multicast channel for Ganglia 3.4.0 gmond, drop the following snippet into your hadoop-metrics.properties file:

# Configuration of the "mapred" context for ganglia
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
mapred.period=10
mapred.servers=239.2.11.71

If you want to set up HBase to publish hbase metrics to a single host with the hostname bigdata.dev.oreilly.com running Ganglia 3.3.7 gmond, use the following snippet:

# Configuration of the "hbase" context for ganglia
hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
hbase.period=10
hbase.servers=bigdata.dev.oreilly.com

Note

For your metric configuration changes to take effect, you must restart your Hadoop and HBase services.

Once you have Hadoop/HBase properly configured to publish Ganglia metrics, you will see the metrics in your gmond XML, and graphs will appear in your Ganglia web console for each metric (Table B-1 lists the HBase RegionServer metrics):

$ telnet 10.10.10.1 8649
<METRIC NAME="jvm.DataNode.metrics.gcCount" VAL="4" ...
<METRIC NAME="jvm.NameNode.metrics.logError" VAL="0" ...
<METRIC NAME="jvm.NameNode.metrics.maxMemoryM" VAL="888.9375" ...
<METRIC NAME="jvm.DataNode.metrics.logWarn" VAL="9" ...
<METRIC NAME="jvm.DataNode.metrics.memHeapUsedM" VAL="2.4761734" ...
<METRIC NAME="jvm.DataNode.metrics.threadsWaiting" VAL="8" ...
...
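To sanity-check that a given Hadoop context is arriving, you can pull gmond’s XML (as in the telnet session above) and filter metric names by the context prefix. A minimal Python sketch, using a hard-coded sample fragment in place of a live socket connection:

```python
import xml.etree.ElementTree as ET

# Abridged sample of gmond XML output; a real dump wraps METRIC
# elements in GANGLIA_XML/CLUSTER/HOST elements and more attributes.
sample = """<HOST NAME="10.10.10.1">
  <METRIC NAME="jvm.DataNode.metrics.gcCount" VAL="4"/>
  <METRIC NAME="jvm.NameNode.metrics.logError" VAL="0"/>
  <METRIC NAME="load_one" VAL="0.12"/>
</HOST>"""

root = ET.fromstring(sample)
# Keep only metrics published by the Hadoop "jvm" context.
jvm_metrics = {m.get("NAME"): m.get("VAL")
               for m in root.iter("METRIC")
               if m.get("NAME").startswith("jvm.")}
print(jvm_metrics)
# {'jvm.DataNode.metrics.gcCount': '4', 'jvm.NameNode.metrics.logError': '0'}
```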

At this point, you know how to turn on monitoring for any Hadoop context.

The Hadoop JVM metrics cover garbage collection, memory use, thread states, and the number of logging events for each service: DataNode, NameNode, SecondaryNameNode, JobTracker, and TaskTracker.

The Hadoop RPC metrics will give you information about the number of open connections, processing times, number of operations, and authentication successes and failures.

The Hadoop DFS metrics provide information about data block operations (read, removed, replicated, verify, written), verification failures, bytes read and written, volume failures, and local/remote client reads and writes.

The Hadoop MapReduce metrics provide information about the number of map and reduce slots used, shuffle failures, and tasks completed.

Table B-1. List of HBase metrics
hbase.regionserver.blockCacheCount
Block cache item count in memory. This is the number of blocks of StoreFiles (HFiles) in the cache.

hbase.regionserver.blockCacheEvictedCount
Number of blocks that had to be evicted from the block cache due to heap size constraints.

hbase.regionserver.blockCacheFree
Block cache memory available (bytes).

hbase.regionserver.blockCacheHitCachingRatio
Block cache hit caching ratio (0 to 100). The cache-hit ratio for reads configured to look in the cache (i.e., cacheBlocks=true).

hbase.regionserver.blockCacheHitCount
Number of blocks of StoreFiles (HFiles) read from the cache.

hbase.regionserver.blockCacheHitRatio
Block cache hit ratio (0 to 100). Includes all read requests, although those with cacheBlocks=false will always read from disk and be counted as a “cache miss.”

hbase.regionserver.blockCacheMissCount
Number of blocks of StoreFiles (HFiles) requested but not read from the cache.

hbase.regionserver.blockCacheSize
Block cache size in memory (bytes), that is, memory in use by the BlockCache.

hbase.regionserver.compactionQueueSize
Size of the compaction queue. This is the number of Stores in the RegionServer that have been targeted for compaction.

hbase.regionserver.flushQueueSize
Number of enqueued regions in the MemStore awaiting flush.

hbase.regionserver.fsReadLatency_avg_time
Filesystem read latency (ms). This is the average time to read from HDFS.

hbase.regionserver.fsReadLatency_num_ops
Filesystem read operations.

hbase.regionserver.memstoreSizeMB
Sum of all the memstore sizes in this RegionServer (MB).

hbase.regionserver.regions
Number of regions served by the RegionServer.

hbase.regionserver.requests
Total number of read and write requests. Requests correspond to RegionServer RPC calls; thus, a single Get will result in 1 request, but a Scan with caching set to 1,000 will result in 1 request for each “next” call (i.e., not each row). A bulk-load request will constitute 1 request per HFile.

hbase.regionserver.storeFileIndexSizeMB
Sum of all the StoreFile index sizes in this RegionServer (MB).

hbase.regionserver.stores
Number of Stores open on the RegionServer. A Store corresponds to a ColumnFamily. For example, if a table (which contains the column family) has three regions on a RegionServer, there will be three stores open for that column family.

hbase.regionserver.storeFiles
Number of StoreFiles open on the RegionServer. A store may have more than one StoreFile (HFile).