—Mark Grover
One of the services that Amazon provides as a part of Amazon Web Services (AWS) is Elastic MapReduce (EMR). With EMR comes the ability to spin up a cluster of nodes on demand. These clusters come with Hadoop and Hive installed and configured. (You can also configure the clusters with Pig and other tools.) You can then run your Hive queries and terminate the cluster when you are done, only paying for the time you used the cluster. This section describes how to use Elastic MapReduce, some best practices, and wraps up with pros and cons of using EMR versus other options.
You may wish to refer to the online AWS documentation available at http://aws.amazon.com/elasticmapreduce/ while reading this chapter. This chapter won’t cover all the details of using Amazon EMR with Hive. It is designed to provide an overview and discuss some practical details.
Small teams and start-ups often don’t have the resources to set up their own cluster. An in-house cluster requires a fixed up-front investment, plus the effort of setting up servers and switches and maintaining a Hadoop and Hive installation.
On the other hand, Elastic MapReduce comes with a variable cost, plus the installation and maintenance is Amazon’s responsibility. This is a huge benefit for teams that can’t or don’t want to invest in their own clusters, and even for larger teams that need a test bed to try out new tools and ideas without affecting their production clusters.
An Amazon cluster consists of one or more instances. Instances come in various sizes, with different RAM, compute power, disk capacity, platforms, and I/O performance. It can be hard to determine what size would work best for your use case. With EMR, it’s easy to start with small instance sizes, monitor performance with tools like Ganglia, and then experiment with different instance sizes to find the best balance of cost versus performance.
Before using Amazon EMR, you need to set up an Amazon Web Services (AWS) account. The Amazon EMR Getting Started Guide provides instructions on how to sign up for an AWS account.
You will also need to create an Amazon S3 bucket for storing your input data and retrieving the output results of your Hive processing.
When you set up your AWS account, make sure that all your Amazon EC2 instances, key pairs, security groups, and EMR jobflows are located in the same region to avoid cross-region transfer costs. Try to locate your Amazon S3 buckets and EMR jobflows in the same availability zone for better performance.
Although Amazon EMR supports several versions of Hadoop and Hive, only some combinations of versions of Hadoop and Hive are supported. See the Amazon EMR documentation to find out the supported version combinations of Hadoop and Hive.
Amazon provides multiple ways to bring up, terminate, and modify a Hive cluster. Currently, there are three ways you can manage your EMR Hive cluster:
The AWS Management Console, a web-based frontend, is the easiest way to bring up a cluster and requires no setup. However, as you start to scale, it is best to move to one of the other methods.
This allows users to manage a cluster using a simple Ruby-based CLI, named elastic-mapreduce. The Amazon EMR online documentation describes how to install and use this CLI.
This allows users to manage an EMR cluster by using a language-specific SDK to call EMR APIs. Details on downloading and using the SDK are available in the Amazon EMR documentation. SDKs are available for Android, iOS, Java, PHP, Python, Ruby, Windows, and .NET. A drawback of an SDK is that sometimes particular SDK wrapper implementations lag behind the latest version of the AWS API.
It is common to use more than one way to manage Hive clusters.
Here is an example that uses the Ruby elastic-mapreduce CLI to start up a single-node Amazon EMR cluster with Hive configured. It also sets up the cluster for interactive use, rather than for running a job and exiting. This cluster would be ideal for learning Hive:

elastic-mapreduce --create --alive --name "Test Hive" \
  --hive-interactive

If you also want Pig available, add the --pig-interactive option.
Next you would log in to this cluster as described in the Amazon EMR documentation.
Typically, the Hive Thrift server (see Chapter 16) listens for connections on port 10000. However, in the Amazon Hive installation, this port number depends on the version of Hive being used. This change was implemented in order to allow users to install and support concurrent versions of Hive. Consequently, Hive v0.5.X operates on port 10000, Hive v0.7.X on 10001, and Hive v0.7.1 on 10002. These port numbers are expected to change as newer versions of Hive get ported to Amazon EMR.
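The version-to-port mapping described above can be captured in a small helper function. This is only a sketch of the mapping as documented at the time of writing; it will need updating as newer Hive versions are ported to EMR:

```shell
# Return the EMR Hive Thrift server port for a given Hive version,
# per the mapping described above (subject to change in newer EMR releases).
hive_thrift_port() {
  case "$1" in
    0.5*)   echo 10000 ;;   # Hive v0.5.X
    0.7.1*) echo 10002 ;;   # Hive v0.7.1 (checked before the 0.7.X pattern)
    0.7*)   echo 10001 ;;   # Hive v0.7.X
    *)      echo "unknown" ;;
  esac
}
```

For example, `hive_thrift_port 0.7.1` prints 10002.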
Each Amazon cluster has one or more nodes. Each of these nodes can fit into one of the following three instance groups:
This instance group contains exactly one node, which is called the master node. The master node performs the same duties as the conventional Hadoop master node: it runs the namenode and jobtracker daemons, but it also has Hive installed on it. In addition, it has a MySQL server installed, which is configured to serve as the metastore for the EMR Hive installation. (The embedded Derby metastore that is used as the default metastore in Apache Hive installations is not used.) There is also an instance controller that runs on the master node; it is responsible for launching and managing instances from the other two instance groups. Note that this instance controller also uses the MySQL server on the master node. If the MySQL server becomes unavailable, the instance controller will be unable to launch and manage instances.
The nodes in the core instance group have the same function as Hadoop slave nodes that run both the datanode and tasktracker daemons. These nodes are used for MapReduce jobs and provide the ephemeral storage that backs HDFS. Once a cluster has been started, the number of nodes in this instance group can only be increased, not decreased. It is important to note that ephemeral storage will be lost if the cluster is terminated.
This is an optional instance group. The nodes in this group also function as Hadoop slave nodes. However, they only run the tasktracker processes. Hence, these nodes are used for MapReduce tasks, but not for storing HDFS blocks. Once the cluster has been started, the number of nodes in the task instance group can be increased or decreased.
The task instance group is convenient when you want to increase cluster capacity during hours of peak demand and bring it back to normal afterwards. It is also useful when using spot instances (discussed below) for lower costs without risking the loss of data when a node gets removed from the cluster.
If you are running a cluster with just a single node, the node would be a master node and a core node at the same time.
You will often want to deploy your own configuration files when launching an EMR cluster. The most common files to customize are hive-site.xml, .hiverc, and hadoop-env.sh. Amazon provides a way to override these configuration files.
For overriding hive-site.xml, upload your custom hive-site.xml to S3. Let’s assume it has been uploaded to s3n://example.hive.oreilly.com/conf/hive_site.xml.
It is recommended to use the newer s3n “scheme” for accessing S3, which has better performance than the original s3 scheme.
If you are starting your cluster via the elastic-mapreduce Ruby client, use a command like the following to spin up your cluster with your custom hive-site.xml:

elastic-mapreduce --create --alive --name "Test Hive" \
  --hive-interactive \
  --hive-site=s3n://example.hive.oreilly.com/conf/hive_site.xml
If you are using the SDK to spin up a cluster, use the appropriate method to override the hive-site.xml file. After the bootstrap actions, you would need two config steps: one for installing Hive and another for deploying hive-site.xml. The first step, installing Hive, calls --install-hive along with a --hive-versions flag followed by a comma-separated list of Hive versions you would like to install on your EMR cluster. The second step, installing the Hive site configuration, calls --install-hive-site with an additional parameter like --hive-site=s3n://example.hive.oreilly.com/conf/hive_site.xml pointing to the location of the hive-site.xml file to use.
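Putting those pieces together with the elastic-mapreduce CLI, a cluster with a pinned Hive version and a custom hive-site.xml might be launched as follows. This is a sketch: the version number and bucket path are illustrative, so check the EMR CLI documentation for the exact flags supported by your client version:

```shell
elastic-mapreduce --create --alive --name "Custom Hive Config" \
  --hive-interactive \
  --hive-versions 0.7.1 \
  --hive-site=s3n://example.hive.oreilly.com/conf/hive_site.xml
```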
For .hiverc, you must first upload to S3 the file you want to install. Then you can either use a config step or a bootstrap action to deploy the file to your cluster. Note that .hiverc can be placed in the user’s home directory or in the bin directory of the Hive installation.
At the time of this writing, the functionality to override the .hiverc file is not available in the Amazon-provided Ruby script, named hive-script, which is available at s3n://us-east-1.elasticmapreduce/libs/hive/hive-script. Consequently, .hiverc cannot be installed as easily as hive-site.xml. However, it is fairly straightforward to extend the Amazon-provided hive-script to enable installation of .hiverc, if you are comfortable modifying Ruby code. After implementing this change, upload the modified hive-script to S3 and use that version instead of the original Amazon version. Have your modified script install .hiverc to the user’s home directory or to the bin directory of the Hive installation.
Alternatively, you can create a custom bootstrap script that transfers .hiverc from S3 to the user’s home directory or Hive’s bin directory on the master node. In this script, you should first configure s3cmd on the cluster with your S3 access key so you can use it to download the .hiverc file from S3. Then, simply use a command such as the following to download the file from S3 and deploy it in the home directory:
s3cmd get s3://example.hive.oreilly.com/conf/.hiverc ~/.hiverc
Then use a bootstrap action to call this script during the cluster creation process, just like you would any other bootstrap action.
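As a concrete illustration, such a bootstrap script might look like the following. This is a hypothetical sketch: the bucket path is illustrative, and the credential placeholders stand in for whatever mechanism you use to get keys onto the node (hardcoding real keys in a script stored on S3 is not advisable):

```shell
#!/bin/bash
# Hypothetical bootstrap action: deploy .hiverc to the user's home directory.

# Write a minimal s3cmd configuration with your S3 credentials
# (placeholders shown; supply these securely in practice).
cat > ~/.s3cfg <<EOF
[default]
access_key = YOUR_ACCESS_KEY
secret_key = YOUR_SECRET_KEY
EOF

# Pull .hiverc from S3 into the home directory.
s3cmd get s3://example.hive.oreilly.com/conf/.hiverc ~/.hiverc
```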
If you are running a memory-intensive job, Amazon provides some predefined bootstrap actions that can be used to fine-tune the Hadoop configuration parameters. For example, to use the memory-intensive bootstrap action when spinning up your cluster, use the following flag in your elastic-mapreduce --create command (wrapped for space):

--bootstrap-action \
  s3n://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive
An EMR cluster comes with a MySQL server installed on the master node of the cluster. By default, EMR Hive uses this MySQL server as its metastore. However, all data stored on the cluster nodes is deleted once you terminate your cluster. This includes the data stored on the master node metastore, as well! This is usually unacceptable because you would like to retain your table schemas, etc., in a persistent metastore.
You can use one of the following methods to work around this limitation:
The details on how to configure your Hive installation to use an external metastore are in Metastore Using JDBC. You can use the Amazon RDS (Relational Data Service), which is based on MySQL, or another, in-house database server as a metastore. This is the best choice if you want to use the same metastore for multiple EMR clusters or the same EMR cluster running more than one version of Hive.
If you don’t intend to use an external database server for your metastore, you can still use the master node metastore in conjunction with your start-up script. You can place your create table statements in startup.q, as follows:
CREATE EXTERNAL TABLE IF NOT EXISTS emr_table(id INT, value STRING)
PARTITIONED BY (dt STRING)
LOCATION 's3n://example.hive.oreilly.com/tables/emr_table';
It is important to include the IF NOT EXISTS clause in your create statement to ensure that the script doesn’t try to re-create the table on the master node metastore if it was previously created by a prior invocation of startup.q.
At this point, we have our table definitions in the master node metastore but we haven’t yet imported the partitioning metadata. To do so, include a line like the following in your startup.q file after the create table statement:
ALTER TABLE emr_table RECOVER PARTITIONS;
This will populate all the partitioning related metadata in the metastore. Instead of your custom start-up script, you could use .hiverc, which will be sourced automatically when Hive CLI starts up. (We’ll discuss this feature again in EMR Versus EC2 and Apache Hive).
The benefit of using .hiverc is that it provides automatic invocation. The disadvantage is that it gets executed on every invocation of the Hive CLI, which leads to unnecessary overhead on subsequent invocations.
The advantage of using your custom start-up script is that you can more strictly control when it gets executed in the lifecycle of your workflow. However, you will have to manage this invocation yourself. In any case, a side benefit of using a file to store Hive queries for initialization is that you can track the changes to your DDL via version control.
As your metadata grows, with more tables and more partitions, start-up using this approach will take longer and longer. It is not recommended if you have more than a few tables or partitions.
Another, albeit cumbersome, alternative is to back up your metastore before you terminate the cluster and restore it at the beginning of the next workflow. S3 is a good place to persist the backup while the cluster is not in use.
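A sketch of that backup-and-restore cycle follows. The metastore database name, credentials, and bucket path are all assumptions for illustration; check your cluster’s hive-site.xml for the actual connection details:

```shell
# Before terminating the cluster: dump the metastore and push it to S3.
mysqldump -u hive -phivepassword hive_metastore > metastore_backup.sql
s3cmd put metastore_backup.sql \
  s3://example.hive.oreilly.com/backups/metastore_backup.sql

# At the start of the next workflow: pull the dump and restore it.
s3cmd get s3://example.hive.oreilly.com/backups/metastore_backup.sql \
  metastore_backup.sql
mysql -u hive -phivepassword hive_metastore < metastore_backup.sql
```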
Note that this metastore is not shared amongst different versions of Hive running on your EMR cluster. Suppose you spin up a cluster with both Hive v0.5 and v0.7.1 installed. When you create a table using Hive v0.5, you won’t be able to access this table using Hive v0.7.1. If you would like to share the metadata between different Hive versions, you will have to use an external persistent metastore.
HDFS and S3 have their own distinct roles in an EMR cluster. All the data stored on the cluster nodes is deleted once the cluster is terminated. Since HDFS is formed by ephemeral storage of the nodes in the core instance group, the data stored on HDFS is lost after cluster termination.
S3, on the other hand, provides a persistent storage for data associated with the EMR cluster. Therefore, the input data to the cluster should be stored on S3 and the final results obtained from Hive processing should be persisted to S3, as well.
However, S3 is a more expensive storage alternative than HDFS. Therefore, intermediate results of processing should be stored in HDFS, with only the final results that need to persist saved to S3.
Please note that as a side effect of using S3 as a source for input data, you lose the Hadoop data locality optimization, which may be significant. If this optimization is crucial for your analysis, you should consider importing “hot” data from S3 onto HDFS before processing it. This initial overhead will allow you to make use of Hadoop’s data locality optimization in your subsequent processing.
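One way to do that import is with Hadoop’s distcp tool, which copies data in parallel across filesystems. The source and destination paths below are illustrative:

```shell
# Copy "hot" input data from S3 onto the cluster's HDFS before processing,
# so subsequent jobs benefit from data locality.
hadoop distcp s3n://example.hive.oreilly.com/tables/emr_table \
  hdfs:///data/emr_table
```

An external Hive table pointed at the HDFS location can then be queried with full locality.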
You should upload all your bootstrap scripts, configuration scripts (e.g., hive-site.xml and .hiverc), resources (e.g., files that need to go in the distributed cache, UDF or streaming JARs), etc., onto S3. Since EMR Hive and Hadoop installations natively understand S3 paths, it is straightforward to work with these files in subsequent Hadoop jobs.
For example, you can add the following lines in .hiverc without any errors:
ADD FILE s3n://example.hive.oreilly.com/files/my_file.txt;
ADD JAR s3n://example.hive.oreilly.com/jars/udfs.jar;
CREATE TEMPORARY FUNCTION my_count AS 'com.oreilly.hive.example.MyCount';
Amazon EMR saves the log files to the S3 location pointed to by the log-uri field. These include logs from the cluster’s bootstrap actions and the logs from the daemon processes running on the various cluster nodes. The log-uri field can be set in the credentials.json file found in the installation directory of the elastic-mapreduce Ruby client. It can also be specified or overridden explicitly when spinning up the cluster with elastic-mapreduce by using the --log-uri flag. However, if this field is not set, those logs will not be available on S3.
If your workflow is configured to terminate when a job encounters an error, any logs on the cluster will be lost after the cluster termination. If your log-uri field is set, these logs will be available at the specified location on S3 even after the cluster has been terminated. They can be an essential aid in debugging the issues that caused the failure.
However, if you store logs on S3, remember to purge unwanted logs on a frequent basis to save yourself from unnecessary storage costs!
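Assuming s3cmd is configured and the log path is illustrative, old jobflow logs can be purged with a command along these lines:

```shell
# Recursively delete the logs of an old jobflow (path is a placeholder).
s3cmd del --recursive s3://example.hive.oreilly.com/logs/j-OLDJOBFLOWID/
```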
Spot instances allow users to bid on unused Amazon capacity to get instances at cheaper rates compared to on-demand prices. Amazon’s online documentation describes them in more detail.
Depending on your use case, you might want instances in all three instance groups to be spot instances. In this case, your entire cluster could terminate at any stage during the workflow, resulting in a loss of intermediate data. If it’s “cheap” to repeat the calculation, this might not be a serious issue. An alternative is to persist intermediate data periodically to S3, as long as your jobs can start again from those snapshots.
Another option is to only include the nodes in the task instance group as spot nodes. If these spot nodes get taken out of the cluster because of unavailability or because the spot prices increased, the workflow will continue with the master and core nodes, but with no data loss. When spot nodes get added to the cluster again, MapReduce tasks can be delegated to them, speeding up the workflow.
Using the elastic-mapreduce Ruby client, spot instances can be ordered by using the --bid-price option along with a bid price. The following example shows a command to create a cluster with one master node, two core nodes, and two spot nodes (in the task instance group) with a bid price of 10 cents:

elastic-mapreduce --create --alive --hive-interactive \
  --name "Test Spot Instances" \
  --instance-group master --instance-type m1.large --instance-count 1 \
  --instance-group core --instance-type m1.small --instance-count 2 \
  --instance-group task --instance-type m1.small --instance-count 2 \
  --bid-price 0.10
If you are spinning up a similar cluster using the Java SDK, use the following InstanceGroupConfig variables for the master, core, and task instance groups:

InstanceGroupConfig masterConfig = new InstanceGroupConfig()
    .withInstanceCount(1)
    .withInstanceRole("MASTER")
    .withInstanceType("m1.large");

InstanceGroupConfig coreConfig = new InstanceGroupConfig()
    .withInstanceCount(2)
    .withInstanceRole("CORE")
    .withInstanceType("m1.small");

InstanceGroupConfig taskConfig = new InstanceGroupConfig()
    .withInstanceCount(2)
    .withInstanceRole("TASK")
    .withInstanceType("m1.small")
    .withMarket("SPOT")
    .withBidPrice("0.05");
If a map or reduce task fails, Hadoop will start it again from the beginning. If the same task fails four times (configurable by setting the MapReduce properties mapred.map.max.attempts for map tasks and mapred.reduce.max.attempts for reduce tasks), the entire job will fail. If you rely on too many spot instances, your job times may be unpredictable, or the job may fail entirely because TaskTrackers get removed from the cluster.
The Hadoop JobTracker and NameNode user interfaces are accessible on ports 9100 and 9101, respectively, on the EMR master node. You can use SSH tunneling or a dynamic SOCKS proxy to view them.
In order to be able to view these from a browser on your client machine (outside of the Amazon network), you need to modify the Elastic MapReduce master security group via your AWS Web Console. Add a new custom TCP rule to allow inbound connections from your client machine’s IP address on ports 9100 and 9101.
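As an illustration of the SOCKS approach, a tunnel to the master node might be opened like this. The key pair file and hostname are placeholders; the hadoop login user is what EMR clusters conventionally use:

```shell
# Open a dynamic SOCKS proxy on local port 8157 through the EMR master node.
ssh -i ~/mykeypair.pem -N -D 8157 \
  hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
```

Then configure your browser to use a SOCKS proxy on localhost:8157 and browse to ports 9100 and 9101 on the master node.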
An elastic alternative to EMR is to bring up several Amazon EC2 nodes and install Hadoop and Hive on a custom Amazon Machine Image (AMI). This approach gives you more control over the version and configuration of Hive and Hadoop. For example, you can experiment with new releases of tools before they are made available through EMR.
The drawback of this approach is that customizations available through EMR may not be available in the Apache Hive release. As an example, the S3 filesystem is not fully supported on Apache Hive [see JIRA HIVE-2318]. There is also an optimization for reducing start-up time for Amazon S3 queries, which is only available in EMR Hive. This optimization is enabled by adding the following snippet in your hive-site.xml:
<property>
  <name>hive.optimize.s3.query</name>
  <value>true</value>
  <description>Improves Hive query performance for Amazon S3 queries
    by reducing their start up time</description>
</property>
Alternatively, you can run the following command on your Hive CLI:
set hive.optimize.s3.query=true;
Another example is a command that allows the user to recover partitions if they exist in the correct directory structure on HDFS or S3. This is convenient when an external process is populating the contents of the Hive table in appropriate partitions. In order to track these partitions in the metastore, one could run the following command, where emr_table is the name of the table:
ALTER TABLE emr_table RECOVER PARTITIONS;
Here is the statement that creates the table, for your reference:
CREATE EXTERNAL TABLE emr_table(id INT, value STRING)
PARTITIONED BY (dt STRING)
LOCATION 's3n://example.hive.oreilly.com/tables/emr_table';
Amazon EMR provides an elastic, scalable, easy-to-set-up way to bring up a cluster with Hadoop and Hive ready to run queries as soon as it boots. It works well with data stored on S3. While much of the configuration is done for you, it is flexible enough to allow users to have their own custom configurations.