All the metrics required to monitor MapReduce status in the current Hadoop implementation can be obtained at the JobTracker level. There is no need to monitor individual TaskTrackers, at least not at an alert level. Instead, a periodic report on the number of alive and dead TaskTrackers should be sent out to track overall framework health.
The following is the list of host-level resources to monitor on a JobTracker:
The following checks are specific to the JobTracker process:
- Monitor the JobTracker's JVM heap usage by comparing the HeapMemoryUsage.used and HeapMemoryUsage.max variables. Type: critical
- The SummaryJson.nodes and SummaryJson.alive status variables will give you an idea of what portion of the TaskTrackers is available at any given moment. There is no strict threshold for this metric: your jobs will run even if only one TaskTracker is available, but performance will obviously deteriorate significantly. Choose a threshold based on your cluster size, and adjust it over time according to the failure trend. Type: critical
- The JobTracker can blacklist worker nodes that consistently report slow performance or fail too often. Monitor the total number of blacklisted TaskTrackers by looking at the SummaryJson.blacklisted metric. Type: critical
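As a sketch of the heap check above: Hadoop daemons, including the JobTracker, expose their JMX metrics as JSON through an HTTP /jmx servlet, and the standard java.lang:type=Memory bean carries the HeapMemoryUsage.used and HeapMemoryUsage.max values. The sample payload and the 90% alert threshold below are illustrative assumptions, not values from the text.

```python
import json

# Illustrative /jmx-style payload; real output comes from
# http://<jobtracker-host>:<port>/jmx (assumption: servlet enabled).
SAMPLE_JMX = json.dumps({
    "beans": [
        {
            "name": "java.lang:type=Memory",
            "HeapMemoryUsage": {"used": 850_000_000, "max": 1_000_000_000},
        }
    ]
})

def heap_usage_ratio(jmx_payload: str) -> float:
    """Return the used/max heap ratio parsed from a /jmx JSON payload."""
    beans = json.loads(jmx_payload)["beans"]
    mem = next(b for b in beans if b["name"] == "java.lang:type=Memory")
    heap = mem["HeapMemoryUsage"]
    return heap["used"] / heap["max"]

def heap_alert(jmx_payload: str, threshold: float = 0.9) -> bool:
    """Flag a critical alert when heap usage crosses the threshold.

    The 0.9 default is an assumed starting point; tune per cluster.
    """
    return heap_usage_ratio(jmx_payload) >= threshold

ratio = heap_usage_ratio(SAMPLE_JMX)
print(f"heap usage: {ratio:.0%}, critical: {heap_alert(SAMPLE_JMX)}")
# → heap usage: 85%, critical: False
```

In a real check, the payload would be fetched periodically from the JobTracker's /jmx endpoint and the result fed into your alerting system.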
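The availability and blacklist checks can be combined into one evaluation over the SummaryJson.nodes, SummaryJson.alive, and SummaryJson.blacklisted values. A minimal sketch, assuming those three counters have already been collected; the 80% alive threshold is an assumed example, since the text leaves it cluster-specific:

```python
def tasktracker_alerts(nodes: int, alive: int, blacklisted: int,
                       min_alive_ratio: float = 0.8) -> list:
    """Return critical alert messages for TaskTracker health.

    nodes/alive/blacklisted mirror the JobTracker's SummaryJson
    status variables; min_alive_ratio is an assumed threshold to be
    tuned to the cluster size and observed failure trend.
    """
    alerts = []
    if nodes > 0 and alive / nodes < min_alive_ratio:
        alerts.append(f"only {alive}/{nodes} TaskTrackers alive")
    if blacklisted > 0:
        alerts.append(f"{blacklisted} TaskTrackers blacklisted")
    return alerts

# Example: a 100-node cluster with 75 alive and 2 blacklisted trackers.
print(tasktracker_alerts(nodes=100, alive=75, blacklisted=2))
# → ['only 75/100 TaskTrackers alive', '2 TaskTrackers blacklisted']
```

An empty list means the cluster passes both checks; any non-empty result would be raised at the critical level, matching the alert types above.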