APPENDIX D

image

Hadoop Metrics and Their Relevance to Security

In Chapter 7’s “Hadoop Metrics” section, you reviewed what Hadoop metrics are, how you can apply filters to metrics, and how you can direct them to a file or monitoring software such as Ganglia. As you will soon learn, you can use these metrics for security, as well.

As you will remember, you can use Hadoop metrics to set alerts to capture sudden changes in system resources. In addition, you can set up your Hadoop cluster to monitor NameNode resources and generate alerts when any specified resources deviate from desired parameters. For example, I will show you how to generate alerts when deviation for the following resources exceed the monthly average by 50% or more:

FilesCreated
FilesDeleted
Transactions_avg_time
GcCount
GcTimeMillis
LogFatal
MemHeapUsedM
ThreadsWaiting

First, I direct output of the NameNode metrics to a file. To do so, I add the following lines to the file hadoop-metrics2.properties in the directory $HADOOP_INSTALL/hadoop/conf:

*.sink.tfile.class=org.apache.hadoop.metrics2.sink.FileSink
namenode.sink.tfile.filename = namenode-metrics.log

Next, I set filters to include only the necessary metrics:

*.source.filter.class=org.apache.hadoop.metrics2.filter.GlobFilter
*.record.filter.class=${*.source.filter.class}
*.metric.filter.class=${*.source.filter.class}
namenode.sink.file.metric.filter.include=FilesCreated
namenode.sink.file.metric.filter.include=FilesDeleted
namenode.sink.file.metric.filter.include=Transactions_avg_time
namenode.sink.file.metric.filter.include=GcCount
namenode.sink.file.metric.filter.include=GcTimeMillis
namenode.sink.file.metric.filter.include=LogFatal
namenode.sink.file.metric.filter.include=MemHeapUsedM
namenode.sink.file.metric.filter.include=ThreadsWaiting

My filtered list of metrics is now being written to the output file namenode-metrics.log.

Next, I develop a script to load this file daily to HDFS and add it to a Hive table as a new partition. I then recompute the 30-day average, taking into account the new values as well, and compare the average values with the newly loaded daily values.

If the deviation is more than 50% for any of these values, I can send a message to my Hadoop system administrator with the name of the node and the metric that deviated. The system administrator can then check appropriate logs to determine if there are any security breaches. For example, if the ThreadsWaiting metric is deviating more than 50%, then the system administrator will need to check the audit logs to see who was accessing the cluster and who was executing jobs at that time, and then check relevant jobs as indicated by the audit logs. For example, a suspicious job may require a check of the JobTracker and appropriate TaskTracker logs.

Alternately, you can direct the outputs of these jobs to Ganglia and then use Nagios to generate alerts if any of the metric values deviate.

Tables D-1 through D-4 list some commonly used Hadoop metrics. The JVM and RPC context metrics are listed first, because they are generated by all Hadoop daemons.

Table D-1. JVM and RPC Context Metrics

Metric Group

Metric Name

Description

JVM

GcCount

Number of garbage collections of the enterprise console JVM

 

GcTimeMillis

Calculates the total time all garbage collections have taken in milliseconds

 

LogError

Number of log lines with Log4j level  ERROR

 

LogFatal

Number of log lines with Log4j level FATAL

 

LogWarn

Number of log lines with Log4j level WARN

 

LogInfo

Number of log lines with Log4j level INFO

 

MemHeapCommittedM

Calculates the heap memory committed by the enterprise console JVM

 

MemHeapUsedM

Calculates the heap memory committed by the enterprise console JVM

 

ThreadsBlocked

Number of threads in a BLOCKED state, which means they are waiting for a lock

 

ThreadsWaiting

Number of threads in a WAITING state, which means they are waiting for another thread to perform an action

 

ThreadsRunnable

Number of threads in a RUNNABLE state that are executing in the JVM but may be waiting for system resources like CPU

 

ThreadsTerminated

Number of threads in a TERMINATED state, which means they finished executing. This value should be around zero since the metric only collects information over live threads.

 

ThreadsNew

Number of threads in a NEW state, which means they have not been started

RPC

ReceivedBytes

Number of RPC received bytes

 

SentBytes

Number of RPC sent bytes

 

RpcProcessingTimeAvgTime

Average time for processing RPC requests

 

RpcProcessingTimeNumOps

Number of processed RPC requests

 

RpcQueueTimeAvgTime

Average time spent by an RPC request in the queue

 

RpcQueueTimeNumOps

Number of RPC requests that were queued

 

RpcAuthorizationSuccesses

Number of successful RPC authorization calls

 

RpcAuthorizationFailures

Number of failed RPC authorization calls

 

RpcAuthenticationSuccesses

Number of successful RPC authentication calls

 

RpcAuthenticationFailures

Number of failed RPC authentication calls

Table D-2. NameNode and DataNode Metrics

Metric Group

Metric Name

Description

Hadoop.HDFS.NameNode

AddBlockOps

Number of add block operations for a cluster

 

CapacityRemaining

Total capacity remaining in HDFS

 

CapacityTotal

Total capacity remaining in HDFS and other distributed file systems

 

CapacityUsed

Total capacity used in HDFS

 

CreateFileOps

Number of create file operations for a cluster

 

DeadNodes

Number of dead nodes that exist in a cluster

 

DecomNodes

Number of decommissioned nodes that exist in a cluster

 

DeleteFileOps

Number of “delete” file operations occurring in HDFS

 

FSState

State of the NameNode, which can be in safe mode or operational mode

 

FileInfoOps

Number of file access operations occurring in the cluster

 

FilesAppended

Number of files appended in a cluster

 

FilesCreated

Number of files created in a cluster

 

FilesDeleted

Number of files deleted in a cluster

 

FilesInGetListingOps

Number of get listing operations occurring in a cluster

 

FilesRenamed

Number of files renamed in a cluster

 

LiveNodes

Number of live nodes in a cluster

 

NonDfsUsedSpace

Calculates the non-HDFS space used in the cluster

 

PercentRemaining

Percentage of remaining HDFS capacity

 

PercentUsed

Percentage of used HDFS capacity

 

Safemode

Calculates the safe mode state: 1 indicates safe mode is on; 0 indicates safe mode is off

 

SafemodeTime

Displays the time spent by NameNode in safe mode

 

Syncs_avg_time

Average time for the sync operation

 

Syncs_num_ops

Number of sync operations

 

TotalBlocks

Total number of blocks in a cluster

 

TotalFiles

Total number of files in a cluster

 

Transactions_avg_time

Average time for a transaction

 

Transactions_num_ops

Number of transaction operations

 

UpgradeFinalized

Indicates if the upgrade is finalized as true or false

 

addBlock_avg_time

Average time to create a new block in a cluster

 

addBlock_num_ops

Number of operations to add data blocks in a cluster

 

blockReceived_avg_time

Average time to receive a block operation

 

blockReceived_num_ops

Number of block received operations

 

blockReport_num_ops

Number of block report operations

 

blockReport_avg_time

Average time for block report operation

 

TimeSinceLastCheckpoint

Calculates the amount of time since the last checkpoint

Hadoop.HDFS.DataNode

BlocksRead

Number of times that a block is read from the hard disk, including copying a block

 

BlocksRemoved

Number of removed or invalidated blocks on the DataNode

 

BlocksReplicated

Number of blocks transferred or replicated from one DataNode to another

 

BlocksVerified

Number of block verifications, including successful or failed attempts

 

BlocksWritten

Number of blocks written to disk

 

BytesRead

Number of bytes read when reading and copying a block

 

BytesWritten

Number of bytes written to disk in response to a received block

 

HeartbeatsAvgTime

Average time to send a heartbeat from DataNode to NameNode

 

BlocksRemoved

Number of removed or invalidated blocks on the DataNode

 

BlocksReplicated

Number of blocks transferred or replicated from one DataNode to another

 

HeartbeatsNumOps

Number of heartbeat operations occurring in a cluster

Table D-3. MapReduce Metrics Generated by JobTracker

Metric Group

Metric Name

Description

Hadoop.Mapreduce.Jobtracker

blacklisted_maps

Number of blacklisted map slots in each TaskTracker

 

Heartbeats

Total Number of JobTracker heartbeats

 

blacklisted_reduces

Number of blacklisted reduce slots in each TaskTracker

 

callQueueLen

Calculates the RPC call queue length

 

HeartbeatAvgTime

Average time for a heartbeat

 

jobs_completed

Number of completed job

 

jobs_failed

Number of failed jobs

 

jobs_killed

Number of killed jobs

 

jobs_running

Number of running jobs

 

jobs_submitted

Number of submitted jobs

 

maps_completed

Number of completed maps

 

maps_failed

Number of failed maps

 

maps_killed

Number of killed maps

 

maps_launched

Number of launched maps

 

memNonHeapCommittedM

Non-heap committed memory (MB)

 

memNonHeapUsedM

Non-heap used memory (MB)

 

occupied_map_slots

Number of occupied map slots

 

map_slots

Number of map slots

 

occupied_reduce_slots

Number of occupied reduce slots

 

reduce_slots

Number of reduce slots

 

reduces_completed

Number of reducers completed

 

reduces_failed

Number of failed reducers

 

reduces_killed

Number of killed reduces

 

reduces_launched

Number of launched reducers

 

reserved_map_slots

Number of reserved map slots

 

reserved_reduce_slots

Number of reserved reduce slots

 

running_0

Number of running jobs

 

running_1440

Number of jobs running for more than 24 hours

 

running_300

Number of jobs running for more than five hours

 

running_60

Number of jobs running for more than one hour

 

running_maps

Number of running maps

 

running_reduces

Number of running reduces

 

Trackers

Number of TaskTrackers

 

trackers_blacklisted

Number of blacklisted TaskTrackers

 

trackers_decommissioned

Number of decommissioned TaskTrackers

 

trackers_graylisted

Number of gray-listed TaskTrackers

 

waiting_reduces

Number of waiting reduces

 

waiting_maps

Number of waiting maps

Table D-4. HBase Metrics

Metric Group

Metric Name

Description

hbase.master

MemHeapUsedM

Heap memory used in MB

 

MemHeapCommittedM

Heap memory committed in MB

 

averageLoad

Average number of regions served by each region server

 

numDeadRegionServers

Number of dead region servers

 

numRegionServers

Number of online region servers

 

ritCount

Number of regions in transition

 

ritCountOverThreshold

Number of regions in transition that exceed the threshold as defined by the property rit.metrics.threshold.time

 

clusterRequests

Total number of requests from all region servers to a cluster

 

HlogSplitTime_mean

Average time to split the total size of a write-ahead log file

 

HlogSplitTime_min

Minimum time to split the total size of a write-ahead log file

 

HlogSplitTime_max

Maximum time to split the write-ahead log file after a restart

 

HlogSplitTime_num_ops

Time to split write-ahead log files

 

HlogSplitSize_mean

Average time to split the total size of an Hlog file

 

HlogSplitSize_min

Minimum time to split the total size of an Hlog file

 

HlogSplitSize_max

Maximum time to split the total size of an Hlog file

 

HlogSplitSize_num_ops

Size of write-ahead log files that were split

hbase.regionserver

appendCount

Number of WAL appends

 

blockCacheCount

Number of StoreFiles cached in the block cache

 

blockCacheEvictionCount

Total Number of blocks that have been evicted from the block cache

 

blockCacheFreeSize

Number of bytes that are free in the block cache

 

blockCacheExpressHitPercent

Calculates the block cache hit percent for requests where caching was turned on

 

blockCacheHitCount

Total number of block cache hits for requests, regardless of caching setting

 

blockCountHitPercent

Block cache hit percent for all requests regardless of the caching setting

 

blockCacheMissCount

Total Number of block cache misses for requests, regardless of caching setting

 

blockCacheSize

Number of bytes used by cached blocks

 

compactionQueueLength

Number of HRegions on the CompactionQueue. These regions call compact on all stores, and then find out if a compaction is needed along with the type of compaction.

 

MemMaxM

Calculates the total heap memory used in MB

 

MemHeapUsedM

Calculates the heap memory used in MB

 

MemHeapCommittedM

Calculates the heap memory committed in MB

 

GcCount

Number of total garbage collections

 

updatesBlockedTime

Number of memstore updates that have been blocked so that memstore can be flushed

 

memstoreSize

Calculates the size of all memstores in all regions in MB

 

readRequestCount

Number of region server read requests

 

regionCount

Number of online regions served by a region server

 

slowAppendCount

Number of appends that took more than 1000 ms to complete

 

slowGetCount

Number of gets that took more than 1000 ms to complete

 

slowPutCount

Number of puts that took more than 1000 ms to complete

 

slowIncrementCount

Number of increments that took more than 1000 ms to complete

 

slowDeleteCount

Number of deletes that took more than 1000 ms to complete

 

storeFileIndexSize

Calculates the size of all StoreFile indexes in MB. These are not necessarily in memory because they are stored in the block cache as well and might have been evicted.

 

storeFileCount

Number of StoreFiles in all stores and regions

 

storeCount

Number of stores in all Regions

 

staticBloomSize

Calculates the total size of all bloom filters, which are not necessarily loaded in memory

 

writeRequestCount

Number of write requests to a region server

 

staticIndexSize

Calculates the total static index size for all region server entities

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset