Chapter 5. Monitoring Hadoop Cluster

Every production system requires a well-planned monitoring strategy, and a Hadoop cluster is no exception. This is not a simple task, given the number of components involved and the number of machines comprising the cluster. Hadoop exposes a wide variety of metrics about the internal state of its components, but it ships with no ready-to-use tools to monitor and alert on these metrics. In this chapter, we will provide an overview of a monitoring strategy, as well as the tools that you can use to implement it.

Monitoring strategy overview

A Hadoop monitoring strategy is different from what you may use for traditional databases. When you have a cluster of hundreds of servers, the failure of individual components becomes the norm. If you treat the failure of a single DataNode as an emergency, there is a good chance that your monitoring system will be overloaded with false alerts.

Instead, it is important to outline which components are critical and which failures can be tolerated (up to a certain point). For critical components, you will need to define rules that alert on-call personnel right away; for non-critical components, regular reports on the overall system status should be enough.
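Alerting systems such as Nagios implement such rules as small check scripts and read their exit codes: by convention, 0 means OK, 1 means WARNING, and 2 means CRITICAL. The following is a minimal sketch of such a check in Python; the URL is a placeholder for whichever critical component you decide to probe:

#!/usr/bin/env python
# Minimal Nagios-style check: probe an HTTP endpoint and translate
# the result into the exit codes Nagios understands.
# Exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL.
import sys
import urllib2  # Python 2; use urllib.request on Python 3

URL = "http://namenode.example.com:50070/"  # hypothetical address

try:
    urllib2.urlopen(URL, timeout=10)
    print("OK - critical service is responding")
    sys.exit(0)
except Exception as e:
    print("CRITICAL - service unreachable: %s" % e)
    sys.exit(2)

A check like this would be wired to page someone immediately; checks for non-critical components would instead feed a daily status report.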

You should already have an idea of which Hadoop components' failures should be treated as an emergency. A failure of the NameNode or the JobTracker will make the cluster unusable and should be investigated right away. Even if you have High Availability configured for these components, it is still important to find the root cause of the problem; this will help you prevent similar problems in the future. If you have followed our instructions to set up High Availability for the NameNode with automatic failover, it is important to provide proper monitoring for all of the involved components.
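How you detect such a failure depends on your setup, but most Hadoop daemons expose their internal metrics as JSON over HTTP through a /jmx servlet on their web interface. Here is a minimal sketch of a NameNode liveness check built on that; the host name is a placeholder, 50070 is the default NameNode web port, and the servlet path may vary between Hadoop versions:

#!/usr/bin/env python
# Probe the NameNode's /jmx servlet and fail loudly if it does not
# answer, so that an on-call alert can fire.
import json
import sys
import urllib2  # Python 2; use urllib.request on Python 3

NAMENODE_JMX = "http://namenode.example.com:50070/jmx"  # hypothetical host

try:
    beans = json.load(urllib2.urlopen(NAMENODE_JMX, timeout=10))["beans"]
except Exception as e:
    print("CRITICAL - NameNode JMX not reachable: %s" % e)
    sys.exit(2)

print("OK - NameNode is up, %d MBeans reported" % len(beans))
sys.exit(0)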

You need to make sure that enough JournalNodes are up and running to provide a quorum for the NameNode edit log, and you should also monitor the status of the ZooKeeper ensemble. Besides complete failures of a given service, you will also need to watch health metrics so that you can prevent disasters before they happen. Available disk space on the NameNode and JournalNodes, as well as total cluster capacity and current usage, are among the most critical metrics to monitor.
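ZooKeeper itself is easy to probe: each server answers simple four-letter commands on its client port, and a healthy server replies imok when sent ruok. The following sketch checks whether enough servers are healthy to form a quorum; the host names are placeholders for your ensemble:

#!/usr/bin/env python
# Ask each ZooKeeper server "ruok" on its client port; a healthy
# server replies "imok". Alert if fewer than a majority answer.
import socket
import sys

ZK_SERVERS = ["zk1.example.com", "zk2.example.com", "zk3.example.com"]
ZK_PORT = 2181

def is_ok(host):
    try:
        s = socket.create_connection((host, ZK_PORT), timeout=5)
        s.sendall(b"ruok")
        reply = s.recv(4)
        s.close()
        return reply == b"imok"
    except socket.error:
        return False

alive = sum(1 for h in ZK_SERVERS if is_ok(h))
quorum = len(ZK_SERVERS) // 2 + 1  # majority of the ensemble
if alive < quorum:
    print("CRITICAL - only %d of %d ZooKeeper servers healthy" % (alive, len(ZK_SERVERS)))
    sys.exit(2)
print("OK - %d of %d ZooKeeper servers healthy" % (alive, len(ZK_SERVERS)))
sys.exit(0)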

Worker nodes are the non-critical part of the cluster from a monitoring perspective. Hadoop can tolerate the failure of several worker nodes and still keep the cluster available. It is important to monitor what portion of the DataNodes and TaskTrackers is available, and to configure alerting rules based on this metric. For example, the failure of one or two worker nodes may not need immediate attention from the operations team. The failure of 30 percent of the worker nodes, on the other hand, compromises cluster availability and is probably a sign of a larger problem, such as faulty hardware or a network outage.
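The NameNode already tracks how many DataNodes it considers alive or dead and publishes the counts through its metrics interface, so a percentage-based rule can be built on top of the same /jmx servlet. In this sketch, the bean and attribute names (Hadoop:service=NameNode,name=FSNamesystemState with NumLiveDataNodes and NumDeadDataNodes) are taken from the NameNode's MBeans and may vary between versions, and the 30 percent threshold is just the example from the preceding paragraph:

#!/usr/bin/env python
# Compare live vs. dead DataNode counts reported by the NameNode and
# page only when the dead fraction crosses a threshold.
import json
import sys
import urllib2  # Python 2; use urllib.request on Python 3

NAMENODE_JMX = "http://namenode.example.com:50070/jmx"  # hypothetical host
DEAD_CRITICAL = 0.30  # page when 30% or more of the workers are gone

beans = json.load(urllib2.urlopen(NAMENODE_JMX, timeout=10))["beans"]
matches = [b for b in beans
           if b.get("name") == "Hadoop:service=NameNode,name=FSNamesystemState"]
if not matches:
    print("UNKNOWN - FSNamesystemState bean not found")
    sys.exit(3)
state = matches[0]

live = state["NumLiveDataNodes"]
dead = state["NumDeadDataNodes"]
dead_ratio = float(dead) / max(live + dead, 1)

if dead_ratio >= DEAD_CRITICAL:
    print("CRITICAL - %d of %d DataNodes dead" % (dead, live + dead))
    sys.exit(2)
elif dead > 0:
    print("WARNING - %d DataNode(s) dead" % dead)  # goes into the daily report
    sys.exit(1)
print("OK - all %d DataNodes alive" % live)
sys.exit(0)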

Hadoop doesn't come with a built-in monitoring system. The most common practice is to use open source monitoring systems such as Nagios for alerting, and tools such as Ganglia for trending and historical information. In the following sections, we will review the metrics that Hadoop services expose and how to access them. We will also look at how to integrate these metrics with existing monitoring systems.
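On the trending side, any value a script can compute can also be pushed into Ganglia with the gmetric command-line tool that ships with it, so custom checks and Hadoop's built-in metrics end up on the same graphs. A small sketch follows; the metric name is arbitrary, and the exact gmetric options may differ between Ganglia versions:

#!/usr/bin/env python
# Push a scalar metric into Ganglia via the gmetric CLI so that it
# appears on the cluster's trend graphs.
import subprocess

def push_to_ganglia(name, value, units):
    subprocess.check_call([
        "gmetric",
        "--name=%s" % name,
        "--value=%s" % value,
        "--type=uint32",
        "--units=%s" % units,
    ])

# Example: record a dead DataNode count obtained elsewhere.
push_to_ganglia("dead_datanodes", 2, "nodes")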
