Chapter 1. Setting Up Hadoop Cluster – from Hardware to Distribution

Hadoop is a free and open source distributed storage and computational platform. It was created to allow storing and processing large amounts of data using clusters of commodity hardware. In the last couple of years, Hadoop became a de facto standard for the big data projects. In this chapter, we will cover the following topics:

  • Choosing Hadoop cluster hardware
  • Hadoop distributions
  • Choosing OS for the Hadoop cluster

This chapter will give an overview of the Hadoop philosophy when it comes to choosing and configuring hardware for the cluster. We will also review the different Hadoop distributions, the number of which is growing every year. This chapter will explain the similarities and differences between those distributions.

For you, as a Hadoop administrator or an architect, the practical part of cluster implementation starts with making decisions on what kind of hardware to use and how much of it you will need, but there are some essential questions that need to be asked before you can place your hardware order, roll up your sleeves, and start setting things up. Among such questions are those related to cluster design, such as how much data will the cluster need to store, what are the projections of data growth rate, what would be the main data access pattern, will the cluster be used mostly for predefined scheduled tasks, or will it be a multitenant environment used for exploratory data analysis? Hadoop's architecture and data access model allows great flexibility. It can accommodate different types of workload, such as batch processing huge amounts of data or supporting real-time analytics with projects like Impala.

At the same time, some clusters are better suited for specific types of work and hence it is important to arrive at the hardware specification phase with the ideas about cluster design and purpose in mind. When dealing with clusters of hundreds of servers, initial decisions about hardware and general layout will have a significant influence on a cluster's performance, stability, and associated costs.

Choosing Hadoop cluster hardware

Hadoop is a scalable clustered non-shared system for massively parallel data processing. The whole concept of Hadoop is that a single node doesn't play a significant role in the overall cluster reliability and performance. This design assumption leads to choosing hardware that can efficiently process small (relative to total data size) amounts of data on a single node and doesn't require lots of reliability and redundancy on a hardware level. As you may already know, there are several types of servers that comprise the Hadoop cluster. There are master nodes, such as NameNode, Secondary NameNode, and JobTracker and worker nodes that are called DataNodes. In addition to the core Hadoop components, it is a common practice to deploy several auxiliary servers, such as Gateways, Hue server, and Hive Metastore. A typical Hadoop cluster can look like the following diagram:

Choosing Hadoop cluster hardware

Typical Hadoop cluster layout

The roles that those types of servers play in a cluster are different, so are the requirements for hardware specifications and reliability of these nodes. We will first discuss different hardware configurations for DataNodes and then will talk about typical setups for NameNode and JobTracker.

Choosing the DataNode hardware

DataNode is the main worker node in a Hadoop cluster and it plays two main roles: it stores pieces of HDFS data and executes MapReduce tasks. DataNode is Hadoop's primary storage and computational resource. One may think that since DataNodes play such an important role in a cluster, you should use the best hardware available for them. This is not entirely true. Hadoop was designed with an idea that DataNodes are "disposable workers", servers that are fast enough to do useful work as a part of the cluster, but cheap enough to be easily replaced if they fail. Frequency of hardware failures in large clusters is probably one of the most important considerations that core Hadoop developers had in mind. Hadoop addresses this issue by moving the redundancy implementation from the cluster hardware to the cluster software itself.

Note

Hadoop provides redundancy on many levels. Each DataNode stores only some blocks for the HDFS files and those blocks are replicated multiple times to different nodes, so in the event of a single server failure, data remains accessible. The cluster can even tolerate multiple nodes' failure, depending on the configuration you choose. Hadoop goes beyond that and allows you to specify which servers reside on which racks and tries to store copies of data on separate racks, thus, significantly increasing probability that your data remains accessible even if the whole rack goes down (though this is not a strict guarantee). This design means that there is no reason to invest into the RAID controller for Hadoop DataNodes.

Instead of using RAID for local disks, a setup that is known as JBOD (Just a Bunch of Disks) is a preferred choice. It provides better performance for Hadoop workload and reduces hardware costs. You don't have to worry about individual disk failure since redundancy is provided by HDFS.

Storing data is the first role that DataNode plays. The second role is to serve as a data processing node and execute custom MapReduce code. MapReduce jobs are split into lots of separate tasks, which are executed in parallel on multiple DataNodes and for a job to produce logically consistent results, all subtasks must be completed.

This means that Hadoop has to provide redundancy not only on storage, but also on a computational layer. Hadoop achieves this by retrying failed tasks on different nodes, without interrupting the whole job. It also keeps track of nodes that have abnormally high rate of failures or have been responding slower than others and eventually such nodes can be blacklisted and excluded from the cluster.

So, what should the hardware for a typical DataNode look like? Ideally, DataNode should be a balanced system with a reasonable amount of disk storage and processing power. Defining "balanced system" and "reasonable amount of storage" is not as simple a task as it may sound. There are many factors that come into play when you are trying to spec out an optimal and scalable Hadoop cluster. One of the most important considerations is total cluster storage capacity and cluster storage density. These parameters are tightly related. Total cluster storage capacity is relatively simple to estimate. It basically answers questions such as how much data we can put into the cluster. The following is a list of steps that you can take to estimate the required capacity for your cluster:

  1. Identify data sources: List out all known data sources and decide whether full or partial initial data import will be required. You should reserve 15-20 percent of your total cluster capacity, or even more to accommodate any new data sources or unplanned data size growth.
  2. Estimate data growth rate: Each identified data source will have a data ingestion rate associated with it. For example, if you are planning to do daily exports from your OLTP database, you can easily estimate how much data this source will produce over the course of the week, month, year, and so on. You will need to do some test exports to get an accurate number.
  3. Multiply your estimated storage requirements by a replication factor: So far, we talked about the usable storage capacity. Hadoop achieves redundancy on the HDFS level by copying data blocks several times and placing them on different nodes in the cluster. By default, each block is replicated three times. You can adjust this parameter, both by increasing or decreasing the replication factor. Setting the replication factor to 1 completely diminishes a cluster's reliability and should not be used. So, to get raw cluster storage capacity, you need to multiply your estimates by a replication factor. If you estimated that you need 300 TB of usable storage this year and you are planning to use a replication factor of 3, your raw capacity will be 900 TB.
  4. Factoring in MapReduce temporary files and system data: MapReduce tasks produce intermediate data that is being passed from the map execution phase to the reduce phase. This temporary data doesn't reside on HDFS, but you need to allocate about 25-30 percent of total server disk capacity for temporary files. Additionally, you will need separate disk volumes for an operating system, but storage requirements for OS are usually insignificant.

Identifying total usable and raw cluster storage capacity is the first step in nailing down hardware specifications for the DataNode. For further discussions, we will mean raw capacity when referring to cluster's total available storage, since this is what's important from the hardware perspective. Another important metric is storage density, which is the total cluster storage capacity divided by the number of DataNodes in the cluster. Generally, you have two choices: either deploy lots of servers with low storage density, or use less servers with higher storage density. We will review both the options and outline the pros and cons for each.

Low storage density cluster

Historically, Hadoop clusters were deployed on reasonably low storage density servers. This allowed scaling clusters to petabytes of storage capacity using low capacity hard drives available on the market at that time. While the hard drive capacity increased significantly over the last several years, using a large low-density cluster is still a valid option for many. Cost is the main reason you will want to go this route. Individual Hadoop node performance is driven not only by storage capacity, but rather by a balance that you have between RAM/CPU and disks. Having lots of storage on every DataNode, but not having enough RAM and CPU resources to process all the data, will not be beneficial in most cases.

It is always hard to give specific recommendations about the Hadoop cluster hardware. A balanced setup will depend on the cluster workload, as well as the allocated budget. New hardware appears on the market all the time, so any considerations should be adjusted accordingly. To illustrate hardware selection logic for a low density cluster, we will use the following example:

Let's assume we have picked up a server with 6 HDD slots. If we choose reasonably priced 2 TB hard drives, it will give us 12 TB of raw capacity per server.

Note

There is little reason to choose faster 15000 rpm drives for your cluster. Sequential read/write performance matters much more for Hadoop cluster, than random access speed. 7200 rpm drives are a preferred choice in most cases.

For a low density server, our main aim is to keep the cost low to be able to afford a large number of machines. 2 x 4 core CPUs match this requirement and will give reasonable processing power. Each map or reduce task will utilize a single CPU core, but since some time will be spent waiting on IO, it is OK to oversubscribe the CPU core. With 8 cores available, we can configure about 12 map/reduce slots per node.

Each task will require from 2 to 4 GB of RAM. 36 GB of RAM is a reasonable choice for this type of server, but going with 48 GB is ideal. Note that we are trying to balance different components. It's of little use to significantly increase the amount of RAM for this configuration, because you will not be able to schedule enough tasks on one node to properly utilize it.

Let's say you are planning to store 500 TB of data in your cluster. With the default replication factor of 3, this will result in 1500 TB of raw capacity. If you use low density DataNode configuration, you will need 63 servers to satisfy this requirement. If you double the required capacity, you will need more than 100 servers in your cluster. Managing a large number of servers has lots of challenges of its own. You will need to think if there is enough physical room in your data center to host additional racks. Additional power consumption and air conditioning also present significant challenges when the number of servers grows. To address these problems, you can increase the storage capacity of an individual server, as well as tune up other hardware specs.

High storage density cluster

Many companies are looking into building smaller Hadoop clusters, but with more storage and computational power per server. Besides addressing issues mentioned above, such clusters can be a better fit for workload where huge amounts of storage are not a priority. Such workload is computationally intensive and includes machine learning, exploratory analytics, and other problems.

The logic behind choosing and balancing hardware components for a high density cluster is the same as for a lower density one. As an example of such a configuration, we will choose a server with 16 x 2 TB hard drives or 24 x 1 TB hard drives. Having more lower capacity disks per server is preferable, because it will provide better IO throughput and better fault tolerance. To increase the computational power of the individual machine, we will use 16 CPU cores and 96 GB of RAM.

NameNode and JobTracker hardware configuration

Hadoop implements a centralized coordination model, where there is a node (or a group of nodes) whose role is to coordinate tasks among servers that comprise the cluster. The server that is responsible for HDFS coordination is called NameNode and the server responsible for MapReduce jobs dispatching is called JobTracker. Actually NameNode and JobTracker are just separate Hadoop processes, but due to their critical role in almost all cases, these services run on dedicated machines.

The NameNode hardware

NameNode is critical to HDFS availability. It stores all the filesystem metadata: which blocks comprise which files, on which DataNodes these blocks can be found, how many free blocks are available, and which servers can host them. Without NameNode, data in HDFS is almost completely useless. The data is actually still there, but without NameNode you will not be able to reconstruct files from data blocks, nor will you be able to upload new data. For a long time, NameNode was a single point of failure, which was less than ideal for a system that advertises high fault tolerance and redundancy of most components and processes. This was addressed with the introduction of the NameNode High Availability setup in Apache Hadoop 2.0.0, but still hardware requirements for NameNode are very different from what was outlined for DataNode in the previous section. Let's start with the memory estimates for NameNode. NameNode has to store all HDFS metadata info, including files, directories' structures, and blocks allocation in memory. This may sound like a wasteful usage of RAM, but NameNode has to guarantee fast access to files on hundreds or thousands of machines, so using hard drives for accessing this information would be too slow. According to the Apache Hadoop documentation, each HDFS block will occupy approximately 250 bytes of RAM on NameNode, plus an additional 250 bytes will be required for each file and directory. Let's say you have 5,000 files with an average of 20 GB per file. If you use the default HDFS block file size of 64 MB and a replication factor of 3, your NameNode will need to hold information about 50 million blocks, which will require 50 million x 250 bytes plus filesystem overhead equals 1.5 GB of RAM. This is not as much as you may have imagined, but in most cases a Hadoop cluster has many more files in total and since each file will consist of at least one block, memory usage on NameNode will be much higher. There is no penalty for having more RAM on the NameNode than your cluster requires at the moment, so overprovisioning is fine. Systems with 64-96 GB of RAM are a good choice for the NameNode server.

To guarantee persistency of filesystem metadata, NameNode has to keep a copy of its memory structures on disk as well. For this, NameNode maintains a file called editlog, which captures all changes that are happening to the HDFS, such as new files and directories creation and replication factor changes. This is very similar to the redo logfiles that most relational databases use. In addition to editlog, NameNode maintains a full snapshot of the current HDFS metadata state in an fsimage file. In case of a restart, or server crash, NameNode will use the latest fsimage and apply all the changes from the editlog file that needs to be applied to restore a valid point-in-time state of the filesystem.

Unlike traditional database systems, NameNode delegates the task of periodically applying changes from editlog to fsimage to a separate server called Secondary NameNode. This is done to keep the editlog file size under control, because changes that are already applied to fsimage are no longer required in the logfile and also to minimize the recovery time. Since these files are mirroring data structures that NameNode keeps in memory, disk space requirements for them are normally pretty low. fsimage will not grow bigger than the amount of RAM you allocated for NameNode and editlog will be rotated once it has reached 64 MB by default. This means that you can keep the disk space requirements for NameNode in the 500 GB range. Using RAID on the NameNode makes a lot of sense, because it provides protection of critical data from individual disk crashes. Besides serving filesystem requests from HDFS clients, NameNode also has to process heartbeat messages from all DataNodes in the cluster. This type of workload requires significant CPU resources, so it's a good idea to provision 8-16 CPU cores for NameNode, depending on the planned cluster size.

In this book, we will focus on setting up NameNode HA, which will require Primary and Standby NameNodes to be identical in terms of hardware. More details on how to achieve high availability for NameNode will be provided in Chapter 2, Installing and Configuring Hadoop.

The JobTracker hardware

Besides NameNode and Secondary NameNode, there is another master server in the Hadoop cluster called the JobTracker. Conceptually, it plays a similar role for the MapReduce framework as NameNode does for HDFS. JobTracker is responsible for submitting user jobs to TaskTrackers, which are services running on each DataNode. TaskTrackers send periodic heartbeat messages to JobTracker, reporting current status of running jobs, available map/reduce slots, and so on. Additionally, JobTracker keeps a history of the last executed jobs (number is configurable) in memory and provides access to Hadoop-specific or user-defined counters associated with the jobs. While RAM availability is critical to JobTracker, its memory footprint is normally smaller than that of NameNode. Having 24-48 GB of RAM for mid- and large-size clusters is a reasonable estimate. You can review this number if your cluster will be a multitenant environment with thousands of users. By default, JobTracker doesn't save any state information to the disk and uses persistent storage only for logging purpose. This means that total disk requirements for this service are minimal. Just like NameNode, JobTracker will need to be able to process huge amounts of heartbeat information from TaskTrackers, accept and dispatch incoming user jobs, and also apply job scheduling algorithms to be able to utilize a cluster most efficiently. These are highly CPU-intensive tasks, so make sure you invest in fast multi-core processors, similar to what you would pick up for NameNode.

All three types of master nodes are critical to Hadoop cluster availability. If you lose a NameNode server, you will lose access to HDFS data. Issues with Secondary NameNode will not cause an immediate outage, but will delay the filesystem checkpointing process. Similarly, a crash of JobTracker will cause all running MapReduce jobs to abort and no new jobs will be able to run. All these consequences require a different approach to the master's hardware selection than what we have discussed for DataNode. Using RAID arrays for critical data volumes, redundant network and power supplies, and potentially higher-grade enterprise level hardware components is a preferred choice.

Gateway and other auxiliary services

Gateway servers are a client's access points to the Hadoop cluster. Interaction with data in HDFS requires having connectivity between the client program and all nodes inside the cluster. This is not always practical from a network design and security perspective. Gateways are usually deployed outside of the primary cluster subnet and are used for data imports and other user programs. Additional infrastructure components and different shells can be deployed on standalone servers, or combined with other services. Hardware requirements to these optional services are obviously much lower than those for cluster nodes and often you can deploy gateways on virtual machines. 4-8 CPU cores and 16-24 GB of RAM is a reasonable configuration for a Gateway node.

Network considerations

In Hadoop cluster, network is a component that is as important as a CPU, disk, or RAM. HDFS relies on network communication to update NameNode on a current filesystem status, as well as to receive and send data blocks to the client. MapReduce jobs also use the network for status messages, but additionally uses bandwidth when a file block has to be read from a DataNode that is not local to the current TaskTracker, and to send intermediate data from mappers to the reducers. In short, there is a lot of network activity going on in a Hadoop cluster. As of now, there are two main choices when it comes to the network hardware. A 1 GbE network is cheap, but is rather limited in throughput, while a 10 GbE network can significantly increase the costs of a large Hadoop deployment. Like every other component of the cluster, the network choice will depend on the intended cluster layout.

For larger clusters, we came up with generally lower spec machines, with less disks, RAM, and CPU per node, assuming that a large volume of such servers will provide enough capacity. For the smaller cluster, we have chosen high-end servers. We can use the same arguments when it comes to choosing which network architecture to apply.

For clusters with multiple less powerful nodes, installing 10 GbE makes little sense for two reasons. First of all, it will increase the total cost of building the cluster significantly and you may not be able to utilize all the available network capacity. For example, with six disks per DataNode, you should be able to achieve about 420 MB/sec of local write throughput, which is less than the network bandwidth. This means that the cluster bottleneck will shift from the network to the disks' IO capacity. On the other hand, a smaller cluster of fast servers with lots of storage will most probably choke on a 1 GbE network and most of the server's available resources will be wasted. Since such clusters are normally smaller, a 10 GbE network hardware will not have as big of an impact on the budget as for a larger setup.

Tip

Most of the modern servers come with several network controllers. You can use bonding to increase network throughput.

Hadoop hardware summary

Let's summarize the possible Hadoop hardware configurations required for different types of clusters.

DataNode for low storage density cluster:

Component

Specification

Storage

6-8 2 TB hard drives per server, JBOD setup, no RAID

CPU

8 CPU cores

RAM

32-48 GB per node

Network

1 GbE interfaces, bonding of several NICs for higher throughput is possible

DataNode for high storage density cluster

Component

Specification

Storage

16-24 1 TB hard drives per server, JBOD setup, no RAID

CPU

16 CPU cores

RAM

64-96 GB per node

Network

10 GbE network interface

NameNode and Standby NameNode

Component

Specification

Storage

Low disk space requirements: 500 GB should be enough in most cases. RAID 10 or RAID 5 for fsimage and editlog. Network attached storage to place a copy of these files

CPU

8-16 CPU cores, depending on the cluster size

RAM

64-96 GB

Network

1 GbE or 10 GbE interfaces, bonding of several NICs for higher throughput is possible

JobTracker

Component

Specification

Storage

Low disk space requirements: 500 GB should be enough in most cases for logfiles and the job's state information

CPU

8-16 CPU cores, depending on the cluster size

RAM

64-96 GB.

Network

1 GbE or 10 GbE interfaces, bonding of several NICs for higher throughput is possible

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset