Appendix D: Hadoop Metrics and Their Relevance to Security

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

APPENDIX D

Hadoop Metrics and Their Relevance to Security

In Chapter 7’s “Hadoop Metrics” section, you reviewed what Hadoop metrics are, how you can apply filters to metrics, and how you can direct them to a file or monitoring software such as Ganglia. As you will soon learn, you can use these metrics for security, as well.

As you will remember, you can use Hadoop metrics to set alerts to capture sudden changes in system resources. In addition, you can set up your Hadoop cluster to monitor NameNode resources and generate alerts when any specified resources deviate from desired parameters. For example, I will show you how to generate alerts when deviation for the following resources exceed the monthly average by 50% or more:

FilesCreated
FilesDeleted
Transactions_avg_time
GcCount
GcTimeMillis
LogFatal
MemHeapUsedM
ThreadsWaiting

First, I direct output of the NameNode metrics to a file. To do so, I add the following lines to the file hadoop-metrics2.properties in the directory $HADOOP_INSTALL/hadoop/conf:

*.sink.tfile.class=org.apache.hadoop.metrics2.sink.FileSink
namenode.sink.tfile.filename = namenode-metrics.log

Next, I set filters to include only the necessary metrics:

*.source.filter.class=org.apache.hadoop.metrics2.filter.GlobFilter
*.record.filter.class=${*.source.filter.class}
*.metric.filter.class=${*.source.filter.class}
namenode.sink.file.metric.filter.include=FilesCreated
namenode.sink.file.metric.filter.include=FilesDeleted
namenode.sink.file.metric.filter.include=Transactions_avg_time
namenode.sink.file.metric.filter.include=GcCount
namenode.sink.file.metric.filter.include=GcTimeMillis
namenode.sink.file.metric.filter.include=LogFatal
namenode.sink.file.metric.filter.include=MemHeapUsedM
namenode.sink.file.metric.filter.include=ThreadsWaiting

My filtered list of metrics is now being written to the output file namenode-metrics.log.

Next, I develop a script to load this file daily to HDFS and add it to a Hive table as a new partition. I then recompute the 30-day average, taking into account the new values as well, and compare the average values with the newly loaded daily values.

If the deviation is more than 50% for any of these values, I can send a message to my Hadoop system administrator with the name of the node and the metric that deviated. The system administrator can then check appropriate logs to determine if there are any security breaches. For example, if the ThreadsWaiting metric is deviating more than 50%, then the system administrator will need to check the audit logs to see who was accessing the cluster and who was executing jobs at that time, and then check relevant jobs as indicated by the audit logs. For example, a suspicious job may require a check of the JobTracker and appropriate TaskTracker logs.

Alternately, you can direct the outputs of these jobs to Ganglia and then use Nagios to generate alerts if any of the metric values deviate.

Tables D-1 through D-4 list some commonly used Hadoop metrics. The JVM and RPC context metrics are listed first, because they are generated by all Hadoop daemons.

Table D-1. JVM and RPC Context Metrics

Metric Group	Metric Name	Description
JVM	GcCount	Number of garbage collections of the enterprise console JVM
	GcTimeMillis	Calculates the total time all garbage collections have taken in milliseconds
	LogError	Number of log lines with Log4j level ERROR
	LogFatal	Number of log lines with Log4j level FATAL
	LogWarn	Number of log lines with Log4j level WARN
	LogInfo	Number of log lines with Log4j level INFO
	MemHeapCommittedM	Calculates the heap memory committed by the enterprise console JVM
	MemHeapUsedM	Calculates the heap memory committed by the enterprise console JVM
	ThreadsBlocked	Number of threads in a BLOCKED state, which means they are waiting for a lock
	ThreadsWaiting	Number of threads in a WAITING state, which means they are waiting for another thread to perform an action
	ThreadsRunnable	Number of threads in a RUNNABLE state that are executing in the JVM but may be waiting for system resources like CPU
	ThreadsTerminated	Number of threads in a TERMINATED state, which means they finished executing. This value should be around zero since the metric only collects information over live threads.
	ThreadsNew	Number of threads in a NEW state, which means they have not been started
RPC	ReceivedBytes	Number of RPC received bytes
	SentBytes	Number of RPC sent bytes
	RpcProcessingTimeAvgTime	Average time for processing RPC requests
	RpcProcessingTimeNumOps	Number of processed RPC requests
	RpcQueueTimeAvgTime	Average time spent by an RPC request in the queue
	RpcQueueTimeNumOps	Number of RPC requests that were queued
	RpcAuthorizationSuccesses	Number of successful RPC authorization calls
	RpcAuthorizationFailures	Number of failed RPC authorization calls
	RpcAuthenticationSuccesses	Number of successful RPC authentication calls
	RpcAuthenticationFailures	Number of failed RPC authentication calls

Table D-2. NameNode and DataNode Metrics

Metric Group	Metric Name	Description
Hadoop.HDFS.NameNode	AddBlockOps	Number of add block operations for a cluster
	CapacityRemaining	Total capacity remaining in HDFS
	CapacityTotal	Total capacity remaining in HDFS and other distributed file systems
	CapacityUsed	Total capacity used in HDFS
	CreateFileOps	Number of create file operations for a cluster
	DeadNodes	Number of dead nodes that exist in a cluster
	DecomNodes	Number of decommissioned nodes that exist in a cluster
	DeleteFileOps	Number of “delete” file operations occurring in HDFS
	FSState	State of the NameNode, which can be in safe mode or operational mode
	FileInfoOps	Number of file access operations occurring in the cluster
	FilesAppended	Number of files appended in a cluster
	FilesCreated	Number of files created in a cluster
	FilesDeleted	Number of files deleted in a cluster
	FilesInGetListingOps	Number of get listing operations occurring in a cluster
	FilesRenamed	Number of files renamed in a cluster
	LiveNodes	Number of live nodes in a cluster
	NonDfsUsedSpace	Calculates the non-HDFS space used in the cluster
	PercentRemaining	Percentage of remaining HDFS capacity
	PercentUsed	Percentage of used HDFS capacity
	Safemode	Calculates the safe mode state: 1 indicates safe mode is on; 0 indicates safe mode is off
	SafemodeTime	Displays the time spent by NameNode in safe mode
	Syncs_avg_time	Average time for the sync operation
	Syncs_num_ops	Number of sync operations
	TotalBlocks	Total number of blocks in a cluster
	TotalFiles	Total number of files in a cluster
	Transactions_avg_time	Average time for a transaction
	Transactions_num_ops	Number of transaction operations
	UpgradeFinalized	Indicates if the upgrade is finalized as true or false
	addBlock_avg_time	Average time to create a new block in a cluster
	addBlock_num_ops	Number of operations to add data blocks in a cluster
	blockReceived_avg_time	Average time to receive a block operation
	blockReceived_num_ops	Number of block received operations
	blockReport_num_ops	Number of block report operations
	blockReport_avg_time	Average time for block report operation
	TimeSinceLastCheckpoint	Calculates the amount of time since the last checkpoint
Hadoop.HDFS.DataNode	BlocksRead	Number of times that a block is read from the hard disk, including copying a block
	BlocksRemoved	Number of removed or invalidated blocks on the DataNode
	BlocksReplicated	Number of blocks transferred or replicated from one DataNode to another
	BlocksVerified	Number of block verifications, including successful or failed attempts
	BlocksWritten	Number of blocks written to disk
	BytesRead	Number of bytes read when reading and copying a block
	BytesWritten	Number of bytes written to disk in response to a received block
	HeartbeatsAvgTime	Average time to send a heartbeat from DataNode to NameNode
	BlocksRemoved	Number of removed or invalidated blocks on the DataNode
	BlocksReplicated	Number of blocks transferred or replicated from one DataNode to another
	HeartbeatsNumOps	Number of heartbeat operations occurring in a cluster

Table D-3. MapReduce Metrics Generated by JobTracker

Metric Group	Metric Name	Description
Hadoop.Mapreduce.Jobtracker	blacklisted_maps	Number of blacklisted map slots in each TaskTracker
	Heartbeats	Total Number of JobTracker heartbeats
	blacklisted_reduces	Number of blacklisted reduce slots in each TaskTracker
	callQueueLen	Calculates the RPC call queue length
	HeartbeatAvgTime	Average time for a heartbeat
	jobs_completed	Number of completed job
	jobs_failed	Number of failed jobs
	jobs_killed	Number of killed jobs
	jobs_running	Number of running jobs
	jobs_submitted	Number of submitted jobs
	maps_completed	Number of completed maps
	maps_failed	Number of failed maps
	maps_killed	Number of killed maps
	maps_launched	Number of launched maps
	memNonHeapCommittedM	Non-heap committed memory (MB)
	memNonHeapUsedM	Non-heap used memory (MB)
	occupied_map_slots	Number of occupied map slots
	map_slots	Number of map slots
	occupied_reduce_slots	Number of occupied reduce slots
	reduce_slots	Number of reduce slots
	reduces_completed	Number of reducers completed
	reduces_failed	Number of failed reducers
	reduces_killed	Number of killed reduces
	reduces_launched	Number of launched reducers
	reserved_map_slots	Number of reserved map slots
	reserved_reduce_slots	Number of reserved reduce slots
	running_0	Number of running jobs
	running_1440	Number of jobs running for more than 24 hours
	running_300	Number of jobs running for more than five hours
	running_60	Number of jobs running for more than one hour
	running_maps	Number of running maps
	running_reduces	Number of running reduces
	Trackers	Number of TaskTrackers
	trackers_blacklisted	Number of blacklisted TaskTrackers
	trackers_decommissioned	Number of decommissioned TaskTrackers
	trackers_graylisted	Number of gray-listed TaskTrackers
	waiting_reduces	Number of waiting reduces
	waiting_maps	Number of waiting maps

Table D-4. HBase Metrics

Metric Group	Metric Name	Description
hbase.master	MemHeapUsedM	Heap memory used in MB
	MemHeapCommittedM	Heap memory committed in MB
	averageLoad	Average number of regions served by each region server
	numDeadRegionServers	Number of dead region servers
	numRegionServers	Number of online region servers
	ritCount	Number of regions in transition
	ritCountOverThreshold	Number of regions in transition that exceed the threshold as defined by the property rit.metrics.threshold.time
	clusterRequests	Total number of requests from all region servers to a cluster
	HlogSplitTime_mean	Average time to split the total size of a write-ahead log file
	HlogSplitTime_min	Minimum time to split the total size of a write-ahead log file
	HlogSplitTime_max	Maximum time to split the write-ahead log file after a restart
	HlogSplitTime_num_ops	Time to split write-ahead log files
	HlogSplitSize_mean	Average time to split the total size of an Hlog file
	HlogSplitSize_min	Minimum time to split the total size of an Hlog file
	HlogSplitSize_max	Maximum time to split the total size of an Hlog file
	HlogSplitSize_num_ops	Size of write-ahead log files that were split
hbase.regionserver	appendCount	Number of WAL appends
	blockCacheCount	Number of StoreFiles cached in the block cache
	blockCacheEvictionCount	Total Number of blocks that have been evicted from the block cache
	blockCacheFreeSize	Number of bytes that are free in the block cache
	blockCacheExpressHitPercent	Calculates the block cache hit percent for requests where caching was turned on
	blockCacheHitCount	Total number of block cache hits for requests, regardless of caching setting
	blockCountHitPercent	Block cache hit percent for all requests regardless of the caching setting
	blockCacheMissCount	Total Number of block cache misses for requests, regardless of caching setting
	blockCacheSize	Number of bytes used by cached blocks
	compactionQueueLength	Number of HRegions on the CompactionQueue. These regions call compact on all stores, and then find out if a compaction is needed along with the type of compaction.
	MemMaxM	Calculates the total heap memory used in MB
	MemHeapUsedM	Calculates the heap memory used in MB
	MemHeapCommittedM	Calculates the heap memory committed in MB
	GcCount	Number of total garbage collections
	updatesBlockedTime	Number of memstore updates that have been blocked so that memstore can be flushed
	memstoreSize	Calculates the size of all memstores in all regions in MB
	readRequestCount	Number of region server read requests
	regionCount	Number of online regions served by a region server
	slowAppendCount	Number of appends that took more than 1000 ms to complete
	slowGetCount	Number of gets that took more than 1000 ms to complete
	slowPutCount	Number of puts that took more than 1000 ms to complete
	slowIncrementCount	Number of increments that took more than 1000 ms to complete
	slowDeleteCount	Number of deletes that took more than 1000 ms to complete
	storeFileIndexSize	Calculates the size of all StoreFile indexes in MB. These are not necessarily in memory because they are stored in the block cache as well and might have been evicted.
	storeFileCount	Number of StoreFiles in all stores and regions
	storeCount	Number of stores in all Regions
	staticBloomSize	Calculates the total size of all bloom filters, which are not necessarily loaded in memory
	writeRequestCount	Number of write requests to a region server
	staticIndexSize	Calculates the total static index size for all region server entities

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Appendix D: Hadoop Metrics and Their Relevance to Security

Create new playlist

Sign In

Sign Up

Table of Contents for
Appendix D: Hadoop Metrics and Their Relevance to Security