Hadoop Metrics and Their Relevance to Security
In Chapter 7’s “Hadoop Metrics” section, you reviewed what Hadoop metrics are, how you can apply filters to metrics, and how you can direct them to a file or monitoring software such as Ganglia. As you will soon learn, you can use these metrics for security, as well.
As you will remember, you can use Hadoop metrics to set alerts to capture sudden changes in system resources. In addition, you can set up your Hadoop cluster to monitor NameNode resources and generate alerts when any specified resources deviate from desired parameters. For example, I will show you how to generate alerts when deviation for the following resources exceed the monthly average by 50% or more:
FilesCreated
FilesDeleted
Transactions_avg_time
GcCount
GcTimeMillis
LogFatal
MemHeapUsedM
ThreadsWaiting
First, I direct output of the NameNode metrics to a file. To do so, I add the following lines to the file hadoop-metrics2.properties in the directory $HADOOP_INSTALL/hadoop/conf:
*.sink.tfile.class=org.apache.hadoop.metrics2.sink.FileSink
namenode.sink.tfile.filename = namenode-metrics.log
Next, I set filters to include only the necessary metrics:
*.source.filter.class=org.apache.hadoop.metrics2.filter.GlobFilter
*.record.filter.class=${*.source.filter.class}
*.metric.filter.class=${*.source.filter.class}
namenode.sink.file.metric.filter.include=FilesCreated
namenode.sink.file.metric.filter.include=FilesDeleted
namenode.sink.file.metric.filter.include=Transactions_avg_time
namenode.sink.file.metric.filter.include=GcCount
namenode.sink.file.metric.filter.include=GcTimeMillis
namenode.sink.file.metric.filter.include=LogFatal
namenode.sink.file.metric.filter.include=MemHeapUsedM
namenode.sink.file.metric.filter.include=ThreadsWaiting
My filtered list of metrics is now being written to the output file namenode-metrics.log.
Next, I develop a script to load this file daily to HDFS and add it to a Hive table as a new partition. I then recompute the 30-day average, taking into account the new values as well, and compare the average values with the newly loaded daily values.
If the deviation is more than 50% for any of these values, I can send a message to my Hadoop system administrator with the name of the node and the metric that deviated. The system administrator can then check appropriate logs to determine if there are any security breaches. For example, if the ThreadsWaiting metric is deviating more than 50%, then the system administrator will need to check the audit logs to see who was accessing the cluster and who was executing jobs at that time, and then check relevant jobs as indicated by the audit logs. For example, a suspicious job may require a check of the JobTracker and appropriate TaskTracker logs.
Alternately, you can direct the outputs of these jobs to Ganglia and then use Nagios to generate alerts if any of the metric values deviate.
Tables D-1 through D-4 list some commonly used Hadoop metrics. The JVM and RPC context metrics are listed first, because they are generated by all Hadoop daemons.
Table D-1. JVM and RPC Context Metrics
Metric Group |
Metric Name |
Description |
---|---|---|
JVM |
GcCount |
Number of garbage collections of the enterprise console JVM |
GcTimeMillis |
Calculates the total time all garbage collections have taken in milliseconds | |
LogError |
Number of log lines with Log4j level ERROR | |
LogFatal |
Number of log lines with Log4j level FATAL | |
LogWarn |
Number of log lines with Log4j level WARN | |
LogInfo |
Number of log lines with Log4j level INFO | |
MemHeapCommittedM |
Calculates the heap memory committed by the enterprise console JVM | |
MemHeapUsedM |
Calculates the heap memory committed by the enterprise console JVM | |
ThreadsBlocked |
Number of threads in a BLOCKED state, which means they are waiting for a lock | |
ThreadsWaiting |
Number of threads in a WAITING state, which means they are waiting for another thread to perform an action | |
ThreadsRunnable |
Number of threads in a RUNNABLE state that are executing in the JVM but may be waiting for system resources like CPU | |
ThreadsTerminated |
Number of threads in a TERMINATED state, which means they finished executing. This value should be around zero since the metric only collects information over live threads. | |
ThreadsNew |
Number of threads in a NEW state, which means they have not been started | |
RPC |
ReceivedBytes |
Number of RPC received bytes |
SentBytes |
Number of RPC sent bytes | |
RpcProcessingTimeAvgTime |
Average time for processing RPC requests | |
RpcProcessingTimeNumOps |
Number of processed RPC requests | |
RpcQueueTimeAvgTime |
Average time spent by an RPC request in the queue | |
RpcQueueTimeNumOps |
Number of RPC requests that were queued | |
RpcAuthorizationSuccesses |
Number of successful RPC authorization calls | |
RpcAuthorizationFailures |
Number of failed RPC authorization calls | |
RpcAuthenticationSuccesses |
Number of successful RPC authentication calls | |
RpcAuthenticationFailures |
Number of failed RPC authentication calls |
Table D-2. NameNode and DataNode Metrics
Metric Group |
Metric Name |
Description |
---|---|---|
Hadoop.HDFS.NameNode |
AddBlockOps |
Number of add block operations for a cluster |
CapacityRemaining |
Total capacity remaining in HDFS | |
CapacityTotal |
Total capacity remaining in HDFS and other distributed file systems | |
CapacityUsed |
Total capacity used in HDFS | |
CreateFileOps |
Number of create file operations for a cluster | |
DeadNodes |
Number of dead nodes that exist in a cluster | |
DecomNodes |
Number of decommissioned nodes that exist in a cluster | |
DeleteFileOps |
Number of “delete” file operations occurring in HDFS | |
FSState |
State of the NameNode, which can be in safe mode or operational mode | |
FileInfoOps |
Number of file access operations occurring in the cluster | |
FilesAppended |
Number of files appended in a cluster | |
FilesCreated |
Number of files created in a cluster | |
FilesDeleted |
Number of files deleted in a cluster | |
FilesInGetListingOps |
Number of get listing operations occurring in a cluster | |
FilesRenamed |
Number of files renamed in a cluster | |
LiveNodes |
Number of live nodes in a cluster | |
NonDfsUsedSpace |
Calculates the non-HDFS space used in the cluster | |
PercentRemaining |
Percentage of remaining HDFS capacity | |
PercentUsed |
Percentage of used HDFS capacity | |
Safemode |
Calculates the safe mode state: 1 indicates safe mode is on; 0 indicates safe mode is off | |
SafemodeTime |
Displays the time spent by NameNode in safe mode | |
Syncs_avg_time |
Average time for the sync operation | |
Syncs_num_ops |
Number of sync operations | |
TotalBlocks |
Total number of blocks in a cluster | |
TotalFiles |
Total number of files in a cluster | |
Transactions_avg_time |
Average time for a transaction | |
Transactions_num_ops |
Number of transaction operations | |
UpgradeFinalized |
Indicates if the upgrade is finalized as true or false | |
addBlock_avg_time |
Average time to create a new block in a cluster | |
addBlock_num_ops |
Number of operations to add data blocks in a cluster | |
blockReceived_avg_time |
Average time to receive a block operation | |
blockReceived_num_ops |
Number of block received operations | |
blockReport_num_ops |
Number of block report operations | |
blockReport_avg_time |
Average time for block report operation | |
TimeSinceLastCheckpoint |
Calculates the amount of time since the last checkpoint | |
Hadoop.HDFS.DataNode |
BlocksRead |
Number of times that a block is read from the hard disk, including copying a block |
BlocksRemoved |
Number of removed or invalidated blocks on the DataNode | |
BlocksReplicated |
Number of blocks transferred or replicated from one DataNode to another | |
BlocksVerified |
Number of block verifications, including successful or failed attempts | |
BlocksWritten |
Number of blocks written to disk | |
BytesRead |
Number of bytes read when reading and copying a block | |
BytesWritten |
Number of bytes written to disk in response to a received block | |
HeartbeatsAvgTime |
Average time to send a heartbeat from DataNode to NameNode | |
BlocksRemoved |
Number of removed or invalidated blocks on the DataNode | |
BlocksReplicated |
Number of blocks transferred or replicated from one DataNode to another | |
HeartbeatsNumOps |
Number of heartbeat operations occurring in a cluster |
Table D-3. MapReduce Metrics Generated by JobTracker
Metric Group |
Metric Name |
Description |
---|---|---|
Hadoop.Mapreduce.Jobtracker |
blacklisted_maps |
Number of blacklisted map slots in each TaskTracker |
Heartbeats |
Total Number of JobTracker heartbeats | |
blacklisted_reduces |
Number of blacklisted reduce slots in each TaskTracker | |
callQueueLen |
Calculates the RPC call queue length | |
HeartbeatAvgTime |
Average time for a heartbeat | |
jobs_completed |
Number of completed job | |
jobs_failed |
Number of failed jobs | |
jobs_killed |
Number of killed jobs | |
jobs_running |
Number of running jobs | |
jobs_submitted |
Number of submitted jobs | |
maps_completed |
Number of completed maps | |
maps_failed |
Number of failed maps | |
maps_killed |
Number of killed maps | |
maps_launched |
Number of launched maps | |
memNonHeapCommittedM |
Non-heap committed memory (MB) | |
memNonHeapUsedM |
Non-heap used memory (MB) | |
occupied_map_slots |
Number of occupied map slots | |
map_slots |
Number of map slots | |
occupied_reduce_slots |
Number of occupied reduce slots | |
reduce_slots |
Number of reduce slots | |
reduces_completed |
Number of reducers completed | |
reduces_failed |
Number of failed reducers | |
reduces_killed |
Number of killed reduces | |
reduces_launched |
Number of launched reducers | |
reserved_map_slots |
Number of reserved map slots | |
reserved_reduce_slots |
Number of reserved reduce slots | |
running_0 |
Number of running jobs | |
running_1440 |
Number of jobs running for more than 24 hours | |
running_300 |
Number of jobs running for more than five hours | |
running_60 |
Number of jobs running for more than one hour | |
running_maps |
Number of running maps | |
running_reduces |
Number of running reduces | |
Trackers |
Number of TaskTrackers | |
trackers_blacklisted |
Number of blacklisted TaskTrackers | |
trackers_decommissioned |
Number of decommissioned TaskTrackers | |
trackers_graylisted |
Number of gray-listed TaskTrackers | |
waiting_reduces |
Number of waiting reduces | |
waiting_maps |
Number of waiting maps |
Table D-4. HBase Metrics
Metric Group |
Metric Name |
Description |
---|---|---|
hbase.master |
MemHeapUsedM |
Heap memory used in MB |
MemHeapCommittedM |
Heap memory committed in MB | |
averageLoad |
Average number of regions served by each region server | |
numDeadRegionServers |
Number of dead region servers | |
numRegionServers |
Number of online region servers | |
ritCount |
Number of regions in transition | |
ritCountOverThreshold |
Number of regions in transition that exceed the threshold as defined by the property rit.metrics.threshold.time | |
clusterRequests |
Total number of requests from all region servers to a cluster | |
HlogSplitTime_mean |
Average time to split the total size of a write-ahead log file | |
HlogSplitTime_min |
Minimum time to split the total size of a write-ahead log file | |
HlogSplitTime_max |
Maximum time to split the write-ahead log file after a restart | |
HlogSplitTime_num_ops |
Time to split write-ahead log files | |
HlogSplitSize_mean |
Average time to split the total size of an Hlog file | |
HlogSplitSize_min |
Minimum time to split the total size of an Hlog file | |
HlogSplitSize_max |
Maximum time to split the total size of an Hlog file | |
HlogSplitSize_num_ops |
Size of write-ahead log files that were split | |
hbase.regionserver |
appendCount |
Number of WAL appends |
blockCacheCount |
Number of StoreFiles cached in the block cache | |
blockCacheEvictionCount |
Total Number of blocks that have been evicted from the block cache | |
blockCacheFreeSize |
Number of bytes that are free in the block cache | |
blockCacheExpressHitPercent |
Calculates the block cache hit percent for requests where caching was turned on | |
blockCacheHitCount |
Total number of block cache hits for requests, regardless of caching setting | |
blockCountHitPercent |
Block cache hit percent for all requests regardless of the caching setting | |
blockCacheMissCount |
Total Number of block cache misses for requests, regardless of caching setting | |
blockCacheSize |
Number of bytes used by cached blocks | |
compactionQueueLength |
Number of HRegions on the CompactionQueue. These regions call compact on all stores, and then find out if a compaction is needed along with the type of compaction. | |
MemMaxM |
Calculates the total heap memory used in MB | |
MemHeapUsedM |
Calculates the heap memory used in MB | |
MemHeapCommittedM |
Calculates the heap memory committed in MB | |
GcCount |
Number of total garbage collections | |
updatesBlockedTime |
Number of memstore updates that have been blocked so that memstore can be flushed | |
memstoreSize |
Calculates the size of all memstores in all regions in MB | |
readRequestCount |
Number of region server read requests | |
regionCount |
Number of online regions served by a region server | |
slowAppendCount |
Number of appends that took more than 1000 ms to complete | |
slowGetCount |
Number of gets that took more than 1000 ms to complete | |
slowPutCount |
Number of puts that took more than 1000 ms to complete | |
slowIncrementCount |
Number of increments that took more than 1000 ms to complete | |
slowDeleteCount |
Number of deletes that took more than 1000 ms to complete | |
storeFileIndexSize |
Calculates the size of all StoreFile indexes in MB. These are not necessarily in memory because they are stored in the block cache as well and might have been evicted. | |
storeFileCount |
Number of StoreFiles in all stores and regions | |
storeCount |
Number of stores in all Regions | |
staticBloomSize |
Calculates the total size of all bloom filters, which are not necessarily loaded in memory | |
writeRequestCount |
Number of write requests to a region server | |
staticIndexSize |
Calculates the total static index size for all region server entities |