Chapter 2. Installing and Configuring Hadoop

After you have decided on your cluster layout and size, it is time to install Hadoop and get the cluster operational. We will walk you through the installation and configuration steps for three core Hadoop components: NameNode, DataNode, and JobTracker. We will also review different options for configuring NameNode High Availability and ways of quickly assessing the cluster's health and performance. By the end of this chapter, you should have your Hadoop cluster up and running. We will keep the structure of the cluster similar to what was outlined in Chapter 1, Setting Up Hadoop Cluster – from Hardware to Distribution.

Configuring OS for Hadoop cluster

As mentioned earlier, Hadoop can run on almost any modern flavor of Linux. The instructions in this and the following chapters focus on CentOS 6.x, since CentOS and Red Hat are the most popular choices for production Hadoop installations. It shouldn't be too hard to adapt these instructions for, say, Debian: everything directly related to configuring Hadoop components stays the same, and you should be able to substitute the commands for your distribution's package manager easily.

Choosing and setting up the filesystem

Modern Linux distributions support several filesystems: EXT3, EXT4, XFS, and BTRFS, among others. These filesystems have slightly different performance characteristics for different workloads.

If you favor stability over performance and advanced features, you might want to use EXT3, which is battle-tested on some of the largest Hadoop clusters; more details on disk setup for Hadoop can be found at http://wiki.apache.org/hadoop/DiskSetup. We will use EXT4 for our cluster setup, since it provides better performance on large files, which makes it a good candidate for Hadoop.

To format a volume using EXT4 filesystem, run the following command as a root user in your shell:

# mkfs -t ext4 -m 0 -O extent,sparse_super,flex_bg /dev/sdb1

In this example, the first partition on drive /dev/sdb will be formatted. There are several options in this format command that need to be explained:

  • -m 0: This option reduces the space reserved for the super-user from the default value of 5 percent to 0 percent. This can save a significant amount of disk space on large filesystems; if you have 16 TB per server, you will save about 800 GB.
  • -O extent,sparse_super,flex_bg: This option enables extent-based allocation, which increases performance on large sequential IO requests. The sparse_super option is another disk space saver: it allocates fewer backup copies of the superblock on large filesystems. The flex_bg option forces the filesystem to pack metadata blocks close together, providing some performance improvement. You can verify that these features took effect, as shown next.
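To confirm that these features were actually enabled, you can read them back from the superblock with tune2fs (part of the standard e2fsprogs package; the device name matches the formatting example above):

# tune2fs -l /dev/sdb1 | grep 'Filesystem features'

The output should include extent, sparse_super, and flex_bg among the listed features.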

There are a couple of important options you need to know about when mounting the filesystem: noatime and nodiratime. By default, the filesystem keeps track of every access, including reading a file or accessing a directory, by updating a metadata timestamp field. This can cause significant overhead on a busy system and should be disabled. Here is an example of how to disable this feature in /etc/fstab:

/dev/sda1 /disk1 ext4 noatime,nodiratime 1 2
/dev/sdb1 /disk2 ext4 noatime,nodiratime 1 2
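If the volumes are already mounted, you can apply the new options without rebooting by asking mount to reread them from /etc/fstab:

# mount -o remount /disk1
# mount -o remount /disk2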

Note

Keep in mind that these disk configuration options apply only to the DataNode data disks. It is recommended to configure RAID for NameNode volumes; the exact RAID configuration is specific to your controller manufacturer.

Setting up Java Development Kit

Since Hadoop is written in Java, you need to make sure that a proper version of the JDK is installed on all Hadoop nodes. It is absolutely critical that the version and distribution of the JDK is the same on all nodes. Currently, the only officially supported distribution of the JVM is the Oracle JVM. There are reports that Hadoop can be built and runs fine on OpenJDK, but we will stick to the Oracle JDK. At the time of writing this book, Hadoop was tested to work on Java Version 6, while the current Oracle Java version is 7, and Java 6 actually reached its end of life in February 2013. You can see the list of all the Java versions Hadoop has been tested against at http://wiki.apache.org/hadoop/HadoopJavaVersions.

CentOS doesn't include the Oracle JDK in its repositories, so you will need to download and install it separately. Download the archived rpms from http://www.oracle.com/technetwork/java/javase/downloads/jdk6downloads-1902814.html (or Google "Oracle Java 6 download" in case the link changes). It is OK to choose the latest 6.x version, since new updates and security patches are released quite often. Make sure you go for the rpm install: we will use Cloudera's Distribution Including Apache Hadoop (CDH) packages to install Hadoop in later sections, and those rely on the Oracle Java rpms. Here is how you install the 64-bit Oracle Java Version 1.6.0_45:

# chmod 755 jdk-6u45-linux-x64-rpm.bin
# ./jdk-6u45-linux-x64-rpm.bin

Make sure you repeat this step on all Hadoop nodes, including Gateway servers.
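Once the installation finishes, it is worth confirming which Java the shell actually picks up on each node; running the standard version check should report 1.6.0_45:

# java -version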

Other OS settings

There are several other operating system settings that you need to change to ensure proper operation of the Hadoop cluster. First of all, you need to make sure that hostname/IP resolution works properly across the cluster. When Hadoop master nodes, such as the NameNode or JobTracker, receive a heartbeat message from a new DataNode for the first time, they record its IP address and use it for further communication. So it is important to configure proper hostnames for all nodes in the cluster and to make sure they resolve to the correct IP addresses using the /etc/hosts file. To verify that a host reports the correct IP address, use the ping command and check the IP address returned. Here is an example of what /etc/hosts may look like:

127.0.0.1   localhost.localdomain localhost
::1         localhost.localdomain localhost
192.168.0.100 nn1.hadoop.test.com nn1
192.168.0.101 sn1.hadoop.test.com sn1
192.168.0.102 jt1.hadoop.test.com jt1
192.168.0.40  dn1.hadoop.test.com dn1
192.168.0.41  dn2.hadoop.test.com dn2
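For example, to verify that the first DataNode's name resolves to the address listed above, ping it from another node and check the IP in the reply:

# ping -c 1 dn1.hadoop.test.com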

Tip

It's a good practice to give meaningful names to the nodes in the cluster, so that each name reflects the role the host plays. Such an approach makes it easy to generate the hosts/IP list with a script and propagate it to all the servers, as sketched below.
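As a minimal sketch of such propagation, assuming passwordless SSH for root and an illustrative nodes.list file with one hostname per line (neither is part of Hadoop itself):

# for host in $(cat nodes.list); do scp /etc/hosts root@$host:/etc/hosts; done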

Setting up the CDH repositories

There are many ways to install Hadoop, depending on which distribution you choose. Even within one distribution, you can take different routes. CDH provides several assisted modes of installing Hadoop packages on your cluster: you can use the Cloudera Manager web interface to perform autodiscovery of the nodes in your cluster and install and preconfigure the appropriate packages for you, or you can set up the CDH repository and install the components manually. In this book, we will go with the manual install, because it helps to better understand Hadoop mechanics and how the different components interact with each other. We will still use the yum package management utility to take care of copying files to the correct locations, setting up services, and so on. This will allow us to focus more on configuring the components.

The first thing you need to do is add a new yum repository. The repository you need depends on your OS version; the full list can be found at http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_4_4.html. All of the examples in this book will use the latest version available at the time of writing, CDH 4.2 on CentOS 6 64-bit. Make sure you adjust the instructions accordingly, since newer CDH versions might be available when you are reading this book. To add the repository, download the file http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/cloudera-cdh4.repo and place it into /etc/yum.repos.d/ on your server:
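One way to do this in a single step (assuming wget is available, as it is on most CentOS installs):

# wget http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/cloudera-cdh4.repo -O /etc/yum.repos.d/cloudera-cdh4.repo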

You will also need to add a repository GPG key:

# rpm --import http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera

After this is done, you can check what Hadoop packages are available by running:

# yum search hadoop
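The search should return the list of Hadoop-related packages available from the CDH repository; we will install and configure the individual components in the following sections.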