Chapter 21. Hive and Amazon Web Services (AWS)

Mark Grover

One of the services that Amazon provides as a part of Amazon Web Services (AWS) is Elastic MapReduce (EMR). With EMR comes the ability to spin up a cluster of nodes on demand. These clusters come with Hadoop and Hive installed and configured. (You can also configure the clusters with Pig and other tools.) You can then run your Hive queries and terminate the cluster when you are done, paying only for the time you used the cluster. This chapter describes how to use Elastic MapReduce, covers some best practices, and wraps up with the pros and cons of using EMR versus other options.

You may wish to refer to the online AWS documentation available at http://aws.amazon.com/elasticmapreduce/ while reading this chapter. This chapter won’t cover all the details of using Amazon EMR with Hive. It is designed to provide an overview and discuss some practical details.

Why Elastic MapReduce?

Small teams and start-ups often don’t have the resources to set up their own cluster. An in-house cluster requires a fixed up-front investment: you need to procure servers and switches, set them up, and then maintain the Hadoop and Hive installation.

On the other hand, Elastic MapReduce comes with a variable cost, plus the installation and maintenance is Amazon’s responsibility. This is a huge benefit for teams that can’t or don’t want to invest in their own clusters, and even for larger teams that need a test bed to try out new tools and ideas without affecting their production clusters.

Instances

An Amazon cluster is composed of one or more instances. Instances come in various sizes, with different amounts of RAM, compute power, disk capacity, platform, and I/O performance. It can be hard to determine which size would work best for your use case. With EMR, it’s easy to start with small instance sizes, monitor performance with tools like Ganglia, and then experiment with different instance sizes to find the best balance of cost versus performance.

Before You Start

Before using Amazon EMR, you need to set up an Amazon Web Services (AWS) account. The Amazon EMR Getting Started Guide provides instructions on how to sign up for an AWS account.

You will also need to create an Amazon S3 bucket for storing your input data and retrieving the output results of your Hive processing.

When you set up your AWS account, make sure that all your Amazon EC2 instances, key pairs, security groups, and EMR jobflows are located in the same region to avoid cross-region transfer costs. Locating your Amazon S3 buckets and EMR jobflows in the same region also gives better performance.

Although Amazon EMR supports several versions of Hadoop and Hive, only some combinations of versions of Hadoop and Hive are supported. See the Amazon EMR documentation to find out the supported version combinations of Hadoop and Hive.

Managing Your EMR Hive Cluster

Amazon provides multiple ways to bring up, terminate, and modify a Hive cluster. Currently, there are three ways you can manage your EMR Hive cluster:

EMR AWS Management Console (web-based frontend)

This is the easiest way to bring up a cluster and requires no setup. However, as you start to scale, it is best to move to one of the other methods.

EMR Command-Line Interface

This allows users to manage a cluster using a simple Ruby-based CLI, named elastic-mapreduce. The Amazon EMR online documentation describes how to install and use this CLI.

EMR API

This allows users to manage an EMR cluster by using a language-specific SDK to call EMR APIs. Details on downloading and using the SDK are available in the Amazon EMR documentation. SDKs are available for Android, iOS, Java, PHP, Python, Ruby, Windows, and .NET. A drawback of an SDK is that sometimes particular SDK wrapper implementations lag behind the latest version of the AWS API.

It is common to use more than one way to manage Hive clusters.

Here is an example that uses the Ruby elastic-mapreduce CLI to start up a single-node Amazon EMR cluster with Hive configured. It also sets up the cluster for interactive use, rather than for running a job and exiting. This cluster would be ideal for learning Hive:

elastic-mapreduce --create --alive --name "Test Hive" --hive-interactive

If you also want Pig available, add the --pig-interactive option.

Next you would log in to this cluster as described in the Amazon EMR documentation.
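Once the cluster is up, the CLI can also be used to find it, open an SSH session to the master node (assuming the CLI is configured with the private key of the EC2 key pair used at creation), and terminate it when you are done. The jobflow ID below is a placeholder, and the flags reflect the Ruby CLI’s documented usage at the time of this writing; verify them against your installed version:

# List jobflows (clusters) that are currently active
elastic-mapreduce --list --active

# SSH into the master node of a running jobflow (the ID is a placeholder)
elastic-mapreduce --jobflow j-XXXXXXXXXXXXX --ssh

# Terminate the jobflow when you are done
elastic-mapreduce --jobflow j-XXXXXXXXXXXXX --terminate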

Thrift Server on EMR Hive

Typically, the Hive Thrift server (see Chapter 16) listens for connections on port 10000. However, in the Amazon Hive installation, this port number depends on the version of Hive being used. This change was implemented in order to allow users to install and support concurrent versions of Hive. Consequently, Hive v0.5.X operates on port 10000, Hive v0.7.X on 10001, and Hive v0.7.1 on 10002. These port numbers are expected to change as newer versions of Hive get ported to Amazon EMR.

Instance Groups on EMR

Each Amazon cluster has one or more nodes. Each of these nodes can fit into one of the following three instance groups:

Master Instance Group

This instance group contains exactly one node, which is called the master node. The master node performs the same duties as the conventional Hadoop master node. It runs the namenode and jobtracker daemons, but it also has Hive installed on it. In addition, it has a MySQL server installed, which is configured to serve as the metastore for the EMR Hive installation. (The embedded Derby metastore that is used as the default metastore in Apache Hive installations is not used.) There is also an instance controller that runs on the master node. It is responsible for launching and managing other instances from the other two instance groups. Note that this instance controller also uses the MySQL server on the master node. If the MySQL server becomes unavailable, the instance controller will be unable to launch and manage instances.

Core Instance Group

The nodes in the core instance group have the same function as Hadoop slave nodes that run both the datanode and tasktracker daemons. These nodes run MapReduce tasks, and their ephemeral storage makes up the cluster’s HDFS. Once a cluster has been started, the number of nodes in this instance group can be increased but not decreased. It is important to note that the ephemeral storage, and therefore everything in HDFS, is lost if the cluster is terminated.

Task Instance Group

This is an optional instance group. The nodes in this group also function as Hadoop slave nodes. However, they only run the tasktracker processes. Hence, these nodes are used for MapReduce tasks, but not for storing HDFS blocks. Once the cluster has been started, the number of nodes in the task instance group can be increased or decreased.

The task instance group is convenient when you want to increase cluster capacity during hours of peak demand and bring it back to normal afterwards. It is also useful when using spot instances (discussed below) for lower costs without risking the loss of data when a node gets removed from the cluster.
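For example, assuming the Ruby CLI’s documented options for managing instance groups, adding and later resizing a task group on a running cluster might look like the following sketch (the jobflow ID is a placeholder):

# Add a task instance group with two small nodes to a running cluster
elastic-mapreduce --jobflow j-XXXXXXXXXXXXX \
  --add-instance-group task --instance-type m1.small --instance-count 2

# Later, resize the task group (core groups can only grow; task groups can grow or shrink)
elastic-mapreduce --jobflow j-XXXXXXXXXXXXX \
  --modify-instance-group task --instance-count 5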

If you are running a cluster with just a single node, the node would be a master node and a core node at the same time.

Configuring Your EMR Cluster

You will often want to deploy your own configuration files when launching an EMR cluster. The most common files to customize are hive-site.xml, .hiverc, and hadoop-env.sh. Amazon provides a way to override these configuration files.

Deploying hive-site.xml

For overriding hive-site.xml, upload your custom hive-site.xml to S3. Let’s assume it has been uploaded to s3n://example.hive.oreilly.com/conf/hive_site.xml.

Note

It is recommended to use the newer s3n “scheme” for accessing S3, which has better performance than the original s3 scheme.

If you are starting your cluster via the elastic-mapreduce Ruby client, use a command like the following to spin up your cluster with your custom hive-site.xml:

elastic-mapreduce --create --alive --name "Test Hive" --hive-interactive \
  --hive-site=s3n://example.hive.oreilly.com/conf/hive_site.xml

If you are using the SDK to spin up a cluster, use the appropriate method to override the hive-site.xml file. After the bootstrap actions, you will need two configuration steps: one to install Hive and another to deploy hive-site.xml. The first step, installing Hive, calls --install-hive along with the --hive-versions flag, followed by a comma-separated list of the Hive versions you would like to install on your EMR cluster.

The second step, installing the Hive site configuration, calls --install-hive-site with an additional parameter like --hive-site=s3n://example.hive.oreilly.com/conf/hive_site.xml pointing to the location of the hive-site.xml file to use.
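If you prefer the Ruby CLI over the SDK, the same two pieces of information can be supplied when the cluster is created. The sketch below assumes the --hive-versions flag described in the CLI documentation and the hive_site.xml uploaded earlier:

elastic-mapreduce --create --alive --name "Test Hive" \
  --hive-versions 0.7.1 \
  --hive-interactive \
  --hive-site=s3n://example.hive.oreilly.com/conf/hive_site.xml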

Deploying a .hiverc Script

For .hiverc, you must first upload to S3 the file you want to install. Then you can either use a config step or a bootstrap action to deploy the file to your cluster. Note that .hiverc can be placed in the user’s home directory or in the bin directory of the Hive installation.

Deploying .hiverc using a config step

At the time of this writing, the functionality to override the .hiverc file is not available in the Amazon-provided Ruby script, named hive-script, which is available at s3n://us-east-1.elasticmapreduce/libs/hive/hive-script.

Consequently, .hiverc cannot be installed as easily as hive-site.xml. However, it is fairly straightforward to extend the Amazon-provided hive-script to enable installation of .hiverc, if you are comfortable modifying Ruby code. After implementing this change to hive-script, upload it to S3 and use that version instead of the original Amazon version. Have your modified script install .hiverc to the user’s home directory or to the bin directory of the Hive installation.

Deploying .hiverc using a bootstrap action

Alternatively, you can create a custom bootstrap script that transfers .hiverc from S3 to the user’s home directory or Hive’s bin directory of the master node. In this script, you should first configure s3cmd on the cluster with your S3 access key so you can use it to download the .hiverc file from S3. Then, simply use a command such as the following to download the file from S3 and deploy it in the home directory:

s3cmd get s3://example.hive.oreilly.com/conf/.hiverc ~/.hiverc

Then use a bootstrap action to call this script during the cluster creation process, just like you would any other bootstrap action.
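A minimal sketch of such a bootstrap script is shown below. The access key, secret key, and bucket path are placeholders, and it assumes s3cmd is already present on the AMI (otherwise the script would have to install it first):

#!/bin/bash
# Bootstrap action: fetch .hiverc from S3 into the hadoop user's home directory.
set -e

# Configure s3cmd non-interactively with placeholder credentials
cat > ~/.s3cfg <<EOF
[default]
access_key = YOUR_ACCESS_KEY
secret_key = YOUR_SECRET_KEY
EOF

# Download .hiverc from S3 into the home directory
s3cmd get s3://example.hive.oreilly.com/conf/.hiverc ~/.hiverc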

Setting Up a Memory-Intensive Configuration

If you are running a memory-intensive job, Amazon provides some predefined bootstrap actions that can be used to fine tune the Hadoop configuration parameters. For example, to use the memory-intensive bootstrap action when spinning up your cluster, use the following flag in your elastic-mapreduce --create command (wrapped for space):

--bootstrap-action
  s3n://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive
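Put together with the earlier interactive example, the full cluster creation command might look like this sketch:

elastic-mapreduce --create --alive --name "Test Hive" --hive-interactive \
  --bootstrap-action \
  s3n://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive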

Persistence and the Metastore on EMR

An EMR cluster comes with a MySQL server installed on the master node of the cluster. By default, EMR Hive uses this MySQL server as its metastore. However, all data stored on the cluster nodes is deleted once you terminate your cluster. This includes the data stored in the master node metastore, as well! This is usually unacceptable because you would like to retain your table schemas, etc., in a persistent metastore.

You can use one of the following methods to work around this limitation:

Use a persistent metastore external to your EMR cluster

The details on how to configure your Hive installation to use an external metastore are in Metastore Using JDBC. You can use Amazon RDS (Relational Database Service), which offers hosted MySQL, or another, in-house database server as a metastore. This is the best choice if you want to use the same metastore for multiple EMR clusters or for the same EMR cluster running more than one version of Hive.

Leverage a start-up script

If you don’t intend to use an external database server for your metastore, you can still use the master node metastore in conjunction with your start-up script. You can place your create table statements in startup.q, as follows:

CREATE EXTERNAL TABLE IF NOT EXISTS emr_table(id INT, value STRING)
PARTITIONED BY (dt STRING)
LOCATION 's3n://example.hive.oreilly.com/tables/emr_table';

It is important to include the IF NOT EXISTS clause in your CREATE statement so that the script doesn’t try to re-create the table in the master node metastore if it was already created by a prior invocation of startup.q.

At this point, we have our table definitions in the master node metastore but we haven’t yet imported the partitioning metadata. To do so, include a line like the following in your startup.q file after the create table statement:

ALTER TABLE emr_table RECOVER PARTITIONS;

This will populate all the partitioning-related metadata in the metastore. Instead of a custom start-up script, you could use .hiverc, which is sourced automatically when the Hive CLI starts up. (We’ll discuss this feature again in EMR Versus EC2 and Apache Hive.)

The benefit of using .hiverc is that it provides automatic invocation. The disadvantage is that it gets executed on every invocation of the Hive CLI, which leads to unnecessary overhead on subsequent invocations.

The advantage of using your custom start-up script is that you can more strictly control when it gets executed in the lifecycle of your workflow. However, you will have to manage this invocation yourself. In any case, a side benefit of using a file to store Hive queries for initialization is that you can track the changes to your DDL via version control.
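For example, after logging in to the master node (or from whatever process drives your workflow), you might run the script explicitly before your real queries; the file location is simply wherever you chose to deploy it:

# Run the initialization script once, before the queries that depend on it
hive -f ~/startup.q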

Note

As your metadata grows with more tables and more partitions, start-up using this approach will take longer and longer. This solution is not recommended if you have more than a few tables or partitions.

MySQL dump on S3

Another, albeit cumbersome, alternative is to back up your metastore before you terminate the cluster and restore it at the beginning of the next workflow. S3 is a good place to persist the backup while the cluster is not in use.
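A rough sketch of such a backup, run on the master node before termination, follows. The database name, credentials, and bucket path are placeholders, and the restore on the next cluster would mirror these steps with s3cmd get and mysql:

# Dump the metastore database and push the dump to S3 (names and credentials are placeholders)
mysqldump -u hive -p'hivepassword' metastore > /tmp/metastore_backup.sql
s3cmd put /tmp/metastore_backup.sql s3://example.hive.oreilly.com/backups/metastore_backup.sql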

Note that this metastore is not shared amongst different versions of Hive running on your EMR cluster. Suppose you spin up a cluster with both Hive v0.5 and v0.7.1 installed. When you create a table using Hive v0.5, you won’t be able to access this table using Hive v0.7.1. If you would like to share the metadata between different Hive versions, you will have to use an external persistent metastore.

HDFS and S3 on EMR Cluster

HDFS and S3 have their own distinct roles in an EMR cluster. All the data stored on the cluster nodes is deleted once the cluster is terminated. Since HDFS is formed from the ephemeral storage of the nodes in the core instance group, the data stored on HDFS is lost after cluster termination.

S3, on the other hand, provides persistent storage for data associated with the EMR cluster. Therefore, the input data to the cluster should be stored on S3, and the final results of Hive processing should be persisted to S3 as well.

However, S3 is an expensive storage alternative to HDFS. Therefore, intermediate results of processing should be stored in HDFS, and only the final results that need to persist should be saved to S3.

Please note that as a side effect of using S3 as a source for input data, you lose the Hadoop data locality optimization, which may be significant. If this optimization is crucial for your analysis, you should consider importing “hot” data from S3 onto HDFS before processing it. This initial overhead will allow you to make use of Hadoop’s data locality optimization in your subsequent processing.
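For example, Hadoop’s distcp tool can copy a “hot” data set from S3 into HDFS before you query it (both paths below are placeholders):

# Copy input data from S3 into HDFS to regain data locality for subsequent jobs
hadoop distcp s3n://example.hive.oreilly.com/data/input /user/hive/staging/input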

Putting Resources, Configs, and Bootstrap Scripts on S3

You should upload all your bootstrap scripts, configuration files (e.g., hive-site.xml and .hiverc), and resources (e.g., files that need to go in the distributed cache, UDF or streaming JARs) onto S3. Since EMR Hive and Hadoop installations natively understand S3 paths, it is straightforward to work with these files in subsequent Hadoop jobs.

For example, you can add the following lines in .hiverc without any errors:

ADD FILE s3n://example.hive.oreilly.com/files/my_file.txt;
ADD JAR s3n://example.hive.oreilly.com/jars/udfs.jar;
CREATE TEMPORARY FUNCTION my_count AS 'com.oreilly.hive.example.MyCount';

Logs on S3

Amazon EMR saves the log files to the S3 location pointed to by the log-uri field. These include logs from bootstrap actions of the cluster and the logs from running daemon processes on the various cluster nodes. The log-uri field can be set in the credentials.json file found in the installation directory of the elastic-mapreduce Ruby client. It can also be specified or overridden explicitly when spinning up the cluster using elastic-mapreduce by using the --log-uri flag. However, if this field is not set, those logs will not be available on S3.
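For example, when creating a cluster with the Ruby CLI, you might point the logs at a bucket of your own (the path below is a placeholder):

elastic-mapreduce --create --alive --name "Test Hive" --hive-interactive \
  --log-uri s3n://example.hive.oreilly.com/logs/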

If your workflow is configured to terminate if your job encounters an error, any logs on the cluster will be lost after the cluster termination. If your log-uri field is set, these logs will be available at the specified location on S3 even after the cluster has been terminated. They can be an essential aid in debugging the issues that caused the failure.

However, if you store logs on S3, remember to purge unwanted logs on a frequent basis to save yourself from unnecessary storage costs!

Spot Instances

Spot instances allow users to bid on unused Amazon capacity and get instances at lower rates than on-demand prices. Amazon’s online documentation describes them in more detail.

Depending on your use case, you might want instances in all three instance groups to be spot instances. In this case, your entire cluster could terminate at any stage during the workflow, resulting in a loss of intermediate data. If it’s “cheap” to repeat the calculation, this might not be a serious issue. An alternative is to persist intermediate data periodically to S3, as long as your jobs can start again from those snapshots.

Another option is to only include the nodes in the task instance group as spot nodes. If these spot nodes get taken out of the cluster because of unavailability or because the spot prices increased, the workflow will continue with the master and core nodes, but with no data loss. When spot nodes get added to the cluster again, MapReduce tasks can be delegated to them, speeding up the workflow.

Using the elastic-mapreduce Ruby client, spot instances can be requested by using the --bid-price option along with a bid price. The following example shows a command to create a cluster with one master node, two core nodes, and two spot nodes (in the task instance group) with a bid price of 10 cents:

elastic-mapreduce --create --alive --hive-interactive \
  --name "Test Spot Instances" \
  --instance-group master --instance-type m1.large --instance-count 1 \
  --instance-group core --instance-type m1.small --instance-count 2 \
  --instance-group task --instance-type m1.small --instance-count 2 \
  --bid-price 0.10

If you are spinning up a similar cluster using the Java SDK, use the following InstanceGroupConfig variables for master, core, and task instance groups:

InstanceGroupConfig masterConfig = new InstanceGroupConfig()
.withInstanceCount(1)
.withInstanceRole("MASTER")
.withInstanceType("m1.large");
InstanceGroupConfig coreConfig = new InstanceGroupConfig()
.withInstanceCount(2)
.withInstanceRole("CORE")
.withInstanceType("m1.small");
InstanceGroupConfig taskConfig = new InstanceGroupConfig()
.withInstanceCount(2)
.withInstanceRole("TASK")
.withInstanceType("m1.small")
.withMarket("SPOT")
.withBidPrice("0.05");

Note

If a map or reduce task fails, Hadoop restarts it from the beginning. If the same task fails four times (configurable via the MapReduce properties mapred.map.max.attempts for map tasks and mapred.reduce.max.attempts for reduce tasks), the entire job fails. If you rely on too many spot instances, your job completion times may become unpredictable, or the job may fail entirely because too many TaskTrackers are removed from the cluster.
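If retries are cheap for your jobs, one mitigation is to raise these limits when the cluster is created. Amazon’s configure-hadoop bootstrap action can set mapred-site.xml properties; the sketch below assumes its documented -m flag for mapred-site key/value pairs:

elastic-mapreduce --create --alive --name "Spot-Heavy Cluster" --hive-interactive \
  --bootstrap-action s3n://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-m,mapred.map.max.attempts=8,-m,mapred.reduce.max.attempts=8"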

Security Groups

The Hadoop JobTracker and NameNode user interfaces are accessible on ports 9100 and 9101, respectively, on the EMR master node. You can use ssh tunneling or a dynamic SOCKS proxy to view them.
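A sketch of both approaches is shown below; the key file and the master node’s public DNS name are placeholders:

# Forward local ports 9100/9101 to the JobTracker and NameNode UIs on the master node
ssh -i ~/mykeypair.pem \
    -L 9100:localhost:9100 -L 9101:localhost:9101 \
    hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

# Or open a dynamic SOCKS proxy on local port 8157 and point your browser's proxy settings at it
ssh -i ~/mykeypair.pem -N -D 8157 hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com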

Alternatively, to view these interfaces directly from a browser on your client machine (outside the Amazon network) without a tunnel, you need to modify the Elastic MapReduce master security group via your AWS Web Console. Add a new custom TCP rule to allow inbound connections from your client machine’s IP address on ports 9100 and 9101.

EMR Versus EC2 and Apache Hive

An elastic alternative to EMR is to bring up several Amazon EC2 nodes and install Hadoop and Hive on a custom Amazon Machine Image (AMI). This approach gives you more control over the version and configuration of Hive and Hadoop. For example, you can experiment with new releases of tools before they are made available through EMR.

The drawback of this approach is that customizations available through EMR may not be available in the Apache Hive release. As an example, the S3 filesystem is not fully supported on Apache Hive [see JIRA HIVE-2318]. There is also an optimization for reducing start-up time for Amazon S3 queries, which is only available in EMR Hive. This optimization is enabled by adding the following snippet in your hive-site.xml:

<property>
  <name>hive.optimize.s3.query</name>
  <value>true</value>
  <description> Improves Hive query performance for Amazon S3 queries
  by reducing their start up time </description>
</property>

Alternatively, you can run the following command on your Hive CLI:

set hive.optimize.s3.query=true;

Another example is a command that allows the user to recover partitions if they exist in the correct directory structure on HDFS or S3. This is convenient when an external process is populating the contents of the Hive table in appropriate partitions. In order to track these partitions in the metastore, one could run the following command, where emr_table is the name of the table:

ALTER TABLE emr_table RECOVER PARTITIONS;

Here is the statement that creates the table, for your reference:

CREATE EXTERNAL TABLE emr_table(id INT, value STRING)
PARTITIONED BY (dt STRING)
LOCATION 's3n://example.hive.oreilly.com/tables/emr_table';

Wrapping Up

Amazon EMR provides an elastic, scalable, easy-to-set-up way to bring up a cluster with Hadoop and Hive ready to run queries as soon as it boots. It works well with data stored on S3. While much of the configuration is done for you, it is flexible enough to allow users to have their own custom configurations.
