Chapter 6. Deploying Hadoop to the Cloud

Previously, we focused on deploying Hadoop on physical servers. This is still the most common and generally recommended approach: a Hadoop cluster performs best when it can fully utilize the available hardware resources. For a long time, running Hadoop on virtualized servers was not considered good practice, but this has been changing over the past few years. More and more people have realized that the main benefit of running Hadoop in the cloud is not performance, but rather the great flexibility it offers when provisioning resources. With the cloud, you can create a cluster of hundreds of compute nodes in no time, perform the required task, and then destroy the cluster when you no longer need it. In this chapter, we will describe several options you have when deploying Hadoop in the cloud. We will focus on Amazon Web Services (AWS), since it is the most popular public cloud at the moment and also provides some advanced features for Hadoop.

Amazon Elastic MapReduce

AWS is probably one of the most popular public clouds at the moment. It allows users to quickly provision virtual servers on demand and discard them when they are no longer required. While Hadoop was not originally designed to run in such environments, the ability to create large clusters for specific tasks is very appealing in many use cases.

Imagine you need to process application logfiles and prepare data to be loaded into a relational database. If this task takes a couple of hours and runs only once a day, there is little reason to keep a Hadoop cluster running all the time, as it would be idle most of the day. In this case, a more practical solution is to provision a virtual cluster using Elastic MapReduce (EMR) and destroy it after the work is done.

EMR clusters don't have to be destroyed and recreated from scratch every time. You can choose to keep the cluster running and use it for interactive Hive queries, and so on.

We will now take you through the steps to provision a Hadoop cluster using EMR. We will assume that you are familiar with the main AWS concepts such as EC2, instance types, regions, and so on. You can always refer to the AWS documentation for more details at the following website:

https://aws.amazon.com/documentation/

Installing the EMR command-line interface

You can interact with EMR in different ways. AWS provides a web console interface for all its services, which means you don't have to use a command line to create, launch, and monitor clusters. This may be fine for testing and exploring, but when it's time to implement an EMR job as part of a production ETL process, the command line comes in handy.

Though this is a high-level overview of EMR capabilities, we will focus on using the command line instead of the web interface, because this is what you will use in a production environment.

To install the EMR Command-line Interface (CLI), you need to perform the following steps:

  1. Download the CLI tools from http://aws.amazon.com/developertools/2264 and unpack them.

  2. Depending on your platform, you may need to install Ruby.
  3. Create an S3 bucket to store the logfiles produced by Hadoop. Since, in many cases, temporary EMR clusters are running unattended, you need a persistent location for the logfiles to be able to review the status or debug issues.
  4. To be able to use EMR CLI, you need to have your AWS Access Key ID ready and also generate a key pair. You should refer to the AWS documentation for details on how to obtain your Access Key at http://docs.aws.amazon.com/general/latest/gr/getting-aws-sec-creds.html.
  5. You need to change the permissions on the key pair .pem file to be only readable by the owner: chmod og-rwx emr-keys.pem.
  6. Now, you can create a configuration for EMR CLI. Go to the directory where you have placed the CLI files and edit the credentials.json file to look like this:
    {
      "access_id": "Your Access Key ID",
      "private_key": "Your AWS Secret Access Key",
      "keypair": "emr-keys",
      "key-pair-file": "/path/to/key-file/emr-keys.pem",
      "log_uri": "s3n://emr-logs-x123/",
      "region": "us-east-1"
    }

This sample configuration file has all the options you need to launch a test EMR cluster. To verify that the CLI works properly, just run it with the --version option:

# ./elastic-mapreduce --version
Version 2013-07-08

You can refer to the EMR CLI documentation for more details at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-cli-install.html.

Choosing the Hadoop version

EMR makes launching Hadoop clusters easy by taking care of deploying and configuring the Hadoop components. You don't have to manually download, install, and configure packages. At the same time, EMR provides a decent amount of flexibility: you can control different aspects of your cluster configuration by passing parameters to the elastic-mapreduce CLI. One of the options you can specify is which Hadoop version to use. To do this, pass the --hadoop-version option to the CLI. Currently, EMR supports the following Hadoop versions: 1.0.3, 0.20.205, 0.20, and 0.18. If you don't specify the Hadoop version explicitly, the EMR cluster will be launched using the latest Hadoop version.
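As a sketch of how version pinning might be scripted, the snippet below assembles (and only prints, as a dry run) a --create command with an explicit --hadoop-version. It assumes the elastic-mapreduce CLI is installed and configured as described earlier; the cluster name is a made-up example.

```shell
#!/bin/sh
# Sketch: pin the Hadoop version when creating an EMR cluster.
# The command is assembled and printed (dry run) rather than executed,
# so it can be reviewed before being pasted into a real session.
HADOOP_VERSION="1.0.3"   # supported values: 1.0.3, 0.20.205, 0.20, 0.18

CMD="./elastic-mapreduce --create --alive --name 'EMR Version Test' \
  --hadoop-version $HADOOP_VERSION \
  --num-instances 3 --instance-type m1.small"

echo "$CMD"
```

Removing the echo (and running the assembled command directly) would launch the cluster with the pinned version.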

Launching the EMR cluster

Once you have installed and configured CLI tools, launching a new EMR cluster is very easy. The following command will start a 3-node Hadoop cluster:

# ./elastic-mapreduce --create --alive --name "EMR Test" \
  --num-instances 3 --instance-type m1.small
Created job flow j-2RNNYC3TUCZIO

There are several important options to this command. The --alive option tells EMR to launch the cluster and make it available for the user to connect to servers via SSH and perform any tasks that they may need to. An alternative to this approach is to launch a cluster to execute one specified job and automatically terminate the cluster when it is completed. We will explore this option in more detail later. We have also specified the name of the cluster, how many servers to launch, and what type of EC2 instances to use.

This command will be executed almost instantly, but it doesn't mean your cluster is ready to use right away. It may take EMR several minutes to launch the cluster. EMR uses the term job flow to describe clusters. The idea behind this is that you can set up a number of steps such as launch the cluster, run the Hive script, save data, and terminate the cluster. These steps form a job flow and can be executed automatically. In this case, CLI prints out the ID of the cluster that we have started. Since it takes some time to launch the cluster, you can use the --describe option to check the current status:

# ./elastic-mapreduce --jobflow j-2RNNYC3TUCZIO --describe

The preceding command provides a lot of useful information about your cluster. The output is a JSON document, which makes it easy to be consumed by various automation scripts. We will take a look at the first section that shows the current status of the cluster:

"ExecutionStatusDetail": {
        "CreationDateTime": 1375748997.0,
        "ReadyDateTime": null,
        "EndDateTime": null,
        "StartDateTime": null,
        "State": "STARTING",
        "LastStateChangeReason": "Starting instances"
      }

You can see that the State field has the STARTING status, which means that the cluster is not yet ready to be used. If you re-run this command in a couple of minutes, you should see State changing to WAITING. This means you can connect to your cluster and start executing jobs.
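Because the --describe output is JSON, the STARTING-to-WAITING transition is easy to watch from a script. The sketch below shows one way to scrape the State field; the polling loop itself is commented out since it assumes a configured elastic-mapreduce CLI and a live job flow, and the job flow ID is the example from above.

```shell
#!/bin/sh
# Sketch: poll an EMR job flow until it reaches the WAITING state.
# extract_state scrapes the "State" field out of the --describe JSON.
extract_state() {
    # crude JSON scrape; good enough for the flat "State" field
    sed -n 's/.*"State": "\([A-Z_]*\)".*/\1/p' | head -n 1
}

JOBFLOW_ID="j-2RNNYC3TUCZIO"    # substitute your own job flow ID

# The polling loop (commented out so the sketch runs standalone):
# while true; do
#     STATE=$(./elastic-mapreduce --jobflow "$JOBFLOW_ID" --describe | extract_state)
#     echo "Cluster state: $STATE"
#     [ "$STATE" = "WAITING" ] && break
#     sleep 60
# done

# Demonstrate extract_state on a sample --describe fragment:
SAMPLE='"ExecutionStatusDetail": { "State": "STARTING" }'
echo "$SAMPLE" | extract_state    # prints STARTING
```

A proper JSON parser would be more robust, but a sed scrape keeps the sketch dependency-free.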

Once the EMR cluster is in the WAITING state, you can identify your master instance by looking at the MasterPublicDnsName field:

"Instances": {
        "SlaveInstanceType": "m1.small",
        "HadoopVersion": "1.0.3",
        "Ec2KeyName": "emr-keys",
        "MasterPublicDnsName": "ec2-107-20-83-146.compute-1.amazonaws.com",
        "TerminationProtected": false,
        "NormalizedInstanceHours": 3,
        "MasterInstanceType": "m1.small",
        "KeepJobFlowAliveWhenNoSteps": true,
        "Ec2SubnetId": null,
        "Placement": {
          "AvailabilityZone": "us-east-1a"
        }

The master instance in an EMR cluster is the instance that hosts the Hadoop NameNode and JobTracker. It is also your gateway for accessing the cluster. You can log in to the master instance using the key pair that you generated earlier when configuring the CLI tools.

Note

Another way to check the cluster status and get the address of the master instance is by using the EMR Web console available at

https://console.aws.amazon.com/elasticmapreduce/home?region=us-east-1

You can terminate the EMR cluster once you no longer need it:

# ./elastic-mapreduce --jobflow j-2RNNYC3TUCZIO --terminate
Terminated job flow j-2RNNYC3TUCZIO

Temporary EMR clusters

Another way to use EMR is to launch a temporary cluster. In this case, you need to prepare the input and output data locations, as well as a .jar file with a MapReduce job or a Hadoop streaming script. Once this is done, you can launch an EMR cluster that will process the data with your MapReduce job, write the output into the specified location, and terminate when the work is done. This approach doesn't require any human interaction (once you have prepared the input data and the MapReduce job) and can easily be scheduled to run on a regular basis.

Preparing input and output locations

One thing to keep in mind with temporary EMR clusters is that the HDFS storage is temporary as well. This means that you can use HDFS while the cluster is running, but the data will be deleted when the EMR cluster is terminated. EMR solves this problem by relying on the Amazon S3 storage service. You can specify S3 buckets as input and output sources for EMR clusters.

Before you can launch your EMR cluster, you need to create the S3 buckets that will be used to store the input/output data for your cluster. Depending on your preferences, you can use the AWS web console to create buckets or use AWS command line tools. You can refer to the S3 documentation for more details at

http://aws.amazon.com/documentation/s3/

For our test, we will create two buckets: emr-logs-x123 and emr-data-x123.

Note

S3 bucket names must be globally unique across all AWS accounts. You may need to adjust your bucket names to satisfy this rule.
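One simple way to satisfy the global-uniqueness rule is to append a suffix to a base name, which is what the x123-style names in this chapter are doing. The helper below is a small sketch of that idea, using a time-derived suffix; the base names are the ones used in this chapter.

```shell
#!/bin/sh
# Sketch: derive likely-unique S3 bucket names by appending a
# time-derived numeric suffix, since bucket names are global
# across all AWS accounts.
unique_bucket() {
    # $1 = base name; the suffix keeps the name lowercase and DNS-safe
    printf '%s-%04d\n' "$1" "$(( $(date +%s) % 10000 ))"
}

unique_bucket "emr-data"    # e.g. emr-data-4821
unique_bucket "emr-logs"    # e.g. emr-logs-4821
```

This is only likely to be unique, not guaranteed; the bucket-creation call will still fail if the name is taken, so be prepared to retry with a different suffix.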

We will use the emr-data-x123 bucket to store the input and output data, as well as the .jar file for our MapReduce job. You will need to create the input directory for the input data and the mapreduce-jobs directory for the .jar files. The simplest way to create directories in S3 buckets is via the S3 web interface. You can also upload a sample text file into the input directory using the same interface.

EMR relies on S3, not only to store input and output data, but also to keep MapReduce programs. You can place the .jar files or streaming scripts on S3 and point your EMR cluster to it. In this example, we will use the WordCount program that comes with the Hadoop distribution. We will create the mapreduce-jobs directory in the same emr-data-x123 bucket and place the hadoop-examples-1.0.3.jar file there.

Note

The hadoop-examples.jar file is supplied with all Hadoop distributions. You need to make sure the version of this file matches the version of the EMR cluster you are planning to launch.

We will use the emr-logs-x123 S3 bucket to keep the logfiles of the EMR job attempts. You can always refer to these files to get additional information about your job.

Once you have completed all the preparation steps, you can use EMR CLI to launch the cluster and execute the job:

# ./elastic-mapreduce --create --name "Word Count JAR Test" \
  --jar s3n://emr-data-x123/mapreduce-jobs/hadoop-examples-1.0.3.jar \
  --arg wordcount \
  --arg s3n://emr-data-x123/input/cont-test.txt \
  --arg s3n://emr-data-x123/output/

The arguments to the preceding command are very similar to what you would use for running a MapReduce job on a standard Hadoop cluster. We specify the location of the .jar file on S3, the name of the class to execute, as well as the locations of the input and output data directories. The location of the S3 bucket for logfiles is not specified here; it is part of the credentials.json configuration file described previously in the Installing the EMR command-line interface section.

Once the job is completed, the EMR cluster will be terminated automatically, but S3 directories will still be available.
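Since the temporary-cluster workflow needs no human interaction, it is a natural candidate for a cron job. The sketch below wraps the WordCount invocation in a script with a date-stamped output directory so repeated runs don't collide; the final command is printed rather than executed so the script can be reviewed (remove the echo to actually run it). Bucket names and paths follow the examples in this section.

```shell
#!/bin/sh
# Sketch: a cron-friendly wrapper that launches a temporary EMR
# cluster for the WordCount job. The invocation is printed (dry run);
# drop the echo to execute it with a configured elastic-mapreduce CLI.
JAR="s3n://emr-data-x123/mapreduce-jobs/hadoop-examples-1.0.3.jar"
INPUT="s3n://emr-data-x123/input/cont-test.txt"
# date-stamped output directory so scheduled runs don't collide
OUTPUT="s3n://emr-data-x123/output/$(date +%Y-%m-%d)/"

echo ./elastic-mapreduce --create --name "Word Count JAR Test" \
  --jar "$JAR" \
  --arg wordcount \
  --arg "$INPUT" \
  --arg "$OUTPUT"
```

A crontab entry pointing at this script (for example, once a day after the logs land in the input directory) would complete the unattended pipeline.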
