Let’s install Hadoop and Hive on our personal workstation. This is a convenient way to learn and experiment with Hadoop. Then we’ll discuss how to configure Hive for use on Hadoop clusters.
If you already use Amazon Web Services, the fastest path to setting up Hive for learning is to run a Hive-configured job flow on Amazon Elastic MapReduce (EMR). We discuss this option in Chapter 21.
If you have access to a Hadoop cluster with Hive already installed, we encourage you to skim the first part of this chapter and pick up again at What Is Inside Hive?.
There are several ways you can install Hadoop and Hive. An easy way to install a complete Hadoop system, including Hive, is to download a preconfigured virtual machine (VM) that runs in VMWare[11] or VirtualBox[12]. For VMWare, either VMWare Player for Windows and Linux (free) or VMWare Fusion for Mac OS X (inexpensive) can be used. VirtualBox is free for all these platforms, and also Solaris.
The virtual machines use Linux as the operating system, which is currently the only recommended operating system for running Hadoop in production.[13]
Using a virtual machine is currently the only way to run Hadoop on Windows systems, even when Cygwin or similar Unix-like software is installed.
Most of the preconfigured virtual machines (VMs) available are only designed for VMWare, but if you prefer VirtualBox you may find instructions on the Web that explain how to import a particular VM into VirtualBox.
You can download preconfigured virtual machines from one of the websites given in Table 2-1.[14] Follow the instructions on these websites for loading the VM into VMWare.
Table 2-1. Preconfigured Hadoop virtual machines for VMWare
Provider | URL | Notes
---|---|---
Cloudera, Inc. | https://ccp.cloudera.com/display/SUPPORT/Cloudera’s+Hadoop+Demo+VM | Uses Cloudera’s own distribution of Hadoop, CDH3 or CDH4.
MapR, Inc. | http://www.mapr.com/doc/display/MapR/Quick+Start+-+Test+Drive+MapR+on+a+Virtual+Machine | MapR’s Hadoop distribution, which replaces HDFS with the MapR Filesystem (MapR-FS).
Hortonworks, Inc. | http://docs.hortonworks.com/HDP-1.0.4-PREVIEW-6/Using_HDP_Single_Box_VM/HDP_Single_Box_VM.htm | Based on the latest, stable Apache releases.
Think Big Analytics, Inc. | http://thinkbigacademy.s3-website-us-east-1.amazonaws.com/vm/README.html | Based on the latest, stable Apache releases.
Next, go to What Is Inside Hive?.
While using a preconfigured virtual machine may be an easy way to run Hive, installing Hadoop and Hive yourself will give you valuable insights into how these tools work, especially if you are a developer.
The instructions that follow describe the minimum necessary Hadoop and Hive installation steps for your personal Linux or Mac OS X workstation. For production installations, consult the recommended installation procedures for your Hadoop distributor.
Hive requires Hadoop and Hadoop requires Java. Ensure your system has a recent v1.6.X or v1.7.X JVM (Java Virtual Machine). Although the JRE (Java Runtime Environment) is all you need to run Hive, you will need the full JDK (Java Development Kit) to build examples in this book that demonstrate how to extend Hive with Java code. However, if you are not a programmer, the companion source code distribution for this book (see the Preface) contains prebuilt examples.
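A quick way to verify that you have the full JDK, and not just the JRE, is to check for the javac compiler. This is a minimal sanity check; the exact version string will vary with your installation:

$ javac -version
javac 1.6.0_23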
After the installation is complete, you’ll need to ensure that Java is in your path and the JAVA_HOME environment variable is set.

On Linux systems, the following instructions set up a bash file in the /etc/profile.d/ directory that defines JAVA_HOME for all users. Changing environment settings in this directory requires root access and affects all users of the system. (We’re using $ as the bash shell prompt.) The Oracle JVM installer typically installs the software in /usr/java/jdk-1.6.X (for v1.6) and creates symlinks from /usr/java/default and /usr/java/latest to the installation:
$ /usr/java/latest/bin/java -version
java version "1.6.0_23"
Java(TM) SE Runtime Environment (build 1.6.0_23-b05)
Java HotSpot(TM) 64-Bit Server VM (build 19.0-b09, mixed mode)
$ echo 'export JAVA_HOME=/usr/java/latest' | sudo tee /etc/profile.d/java.sh
$ echo 'export PATH=$PATH:$JAVA_HOME/bin' | sudo tee -a /etc/profile.d/java.sh
$ . /etc/profile
$ echo $JAVA_HOME
/usr/java/latest

(We pipe through sudo tee rather than using a plain > redirection, because the shell performs the redirection as the unprivileged user, before sudo takes effect. The single quotes defer expansion of $PATH and $JAVA_HOME until the file is sourced.)
If you’ve never used sudo (“superuser do something”) before to run a command as a “privileged” user, as in two of the commands above, just type your normal password when you’re asked for it. If you’re on a personal machine, your user account probably has “sudo rights.” If not, ask your administrator to run those commands.
However, if you don’t want to make permanent changes that affect all users of the system, an alternative is to put the definitions shown for PATH and JAVA_HOME in your $HOME/.bashrc file:
export JAVA_HOME=/usr/java/latest
export PATH=$PATH:$JAVA_HOME/bin
Mac OS X systems don’t have the /etc/profile.d directory and they are typically single-user systems, so it’s best to put the environment variable definitions in your $HOME/.bashrc. The Java paths are different, too, and they may be in one of several places.[15]
Here are a few examples. You’ll need to determine where Java is installed on your Mac and adjust the definitions accordingly. Here is a Java 1.6 example for Mac OS X:
$ export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
$ export PATH=$PATH:$JAVA_HOME/bin
Here is a Java 1.7 example for Mac OS X:
$ export JAVA_HOME=/Library/Java/JavaVirtualMachines/1.7.0.jdk/Contents/Home
$ export PATH=$PATH:$JAVA_HOME/bin
OpenJDK 1.7 releases also install under /Library/Java/JavaVirtualMachines.
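If you’re not sure which JVM is the default on your Mac, the /usr/libexec/java_home utility (present on recent Mac OS X releases) prints the path of the default JVM, so you can set JAVA_HOME without hardcoding a version. A minimal sketch, assuming the utility is available on your release:

$ /usr/libexec/java_home
/Library/Java/JavaVirtualMachines/1.7.0.jdk/Contents/Home
$ export JAVA_HOME=$(/usr/libexec/java_home)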
Hive runs on top of Hadoop. Hadoop is an active open source project with many releases and branches. Also, many commercial software companies are now producing their own distributions of Hadoop, sometimes with custom enhancements or replacements for some components. This situation promotes innovation, but also potential confusion and compatibility issues.
Keeping software up to date lets you exploit the latest performance enhancements and bug fixes. However, new releases sometimes introduce new bugs and compatibility issues. So, for this book, we’ll show you how to install the Apache Hadoop release v0.20.2. This release is not the most recent stable one, but it has been the reliable gold standard for some time for performance and compatibility.

That said, you should be able to choose a different version, distribution, or release without problems for learning and using Hive, such as the Apache Hadoop v0.20.205 or 1.0.X releases, Cloudera CDH3 or CDH4, MapR M3 or M5, and the forthcoming Hortonworks distribution. Note that the bundled Cloudera, MapR, and planned Hortonworks distributions all include a Hive release.

We don’t recommend installing the new, alpha-quality, “Next Generation” Hadoop v2.0 (also known as v0.23), at least for the purposes of this book. While this release will bring significant enhancements to the Hadoop ecosystem, it is too new for our purposes.
To install Hadoop on a Linux system, run the following commands. Note that we wrapped the long line for the wget command:

$ cd ~   # or use another directory of your choice.
$ wget \
  http://www.us.apache.org/dist/hadoop/common/hadoop-0.20.2/hadoop-0.20.2.tar.gz
$ tar -xzf hadoop-0.20.2.tar.gz
$ echo "export HADOOP_HOME=$PWD/hadoop-0.20.2" | sudo tee /etc/profile.d/hadoop.sh
$ echo 'export PATH=$PATH:$HADOOP_HOME/bin' | sudo tee -a /etc/profile.d/hadoop.sh
$ . /etc/profile
To install Hadoop on a Mac OS X system, run the following commands. Note that we wrapped the long line for the curl command:

$ cd ~   # or use another directory of your choice.
$ curl -o hadoop-0.20.2.tar.gz \
  http://www.us.apache.org/dist/hadoop/common/hadoop-0.20.2/hadoop-0.20.2.tar.gz
$ tar -xzf hadoop-0.20.2.tar.gz
$ echo "export HADOOP_HOME=$PWD/hadoop-0.20.2" >> $HOME/.bashrc
$ echo 'export PATH=$PATH:$HADOOP_HOME/bin' >> $HOME/.bashrc
$ . $HOME/.bashrc
In what follows, we will assume that you added $HADOOP_HOME/bin to your path, as in the previous commands. This will allow you to simply type the hadoop command without the path prefix.
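As a quick sanity check that the hadoop command is on your path and points at the release you just installed, print its version. The exact output varies by release; here is roughly what to expect:

$ hadoop version
Hadoop 0.20.2
...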
Before we proceed, let’s clarify the different runtime modes for Hadoop. We mentioned above that the default mode is local mode, where filesystem references use the local filesystem. Also in local mode, when Hadoop jobs are executed (including most Hive queries), the Map and Reduce tasks are run as part of the same process.
Actual clusters are configured in distributed mode, where all filesystem references that aren’t full URIs default to the distributed filesystem (usually HDFS) and jobs are managed by the JobTracker service, with individual tasks executed in separate processes.
A dilemma for developers working on personal machines is the fact that local mode doesn’t closely resemble the behavior of a real cluster, which is important to remember when testing applications. To address this need, a single machine can be configured to run in pseudodistributed mode, where the behavior is identical to distributed mode, namely filesystem references default to the distributed filesystem and jobs are managed by the JobTracker service, but there is just a single machine. Hence, for example, HDFS file block replication is limited to one copy. In other words, the behavior is like a single-node “cluster.” We’ll discuss these configuration options in Configuring Your Hadoop Environment.
Because Hive uses Hadoop jobs for most of its work, its behavior reflects the Hadoop mode you’re using. However, even when running in distributed mode, Hive can decide on a per-query basis whether or not it can perform the query using just local mode, where it reads the data files and manages the MapReduce tasks itself, providing faster turnaround. Hence, the distinction between the different modes is more of an execution style for Hive than a deployment style, as it is for Hadoop.
For most of the book, it won’t matter which mode you’re using. We’ll assume you’re working on a personal machine in local mode and we’ll discuss the cases where the mode matters.
When working with small data sets, local mode execution will make Hive queries much faster. Setting the property hive.exec.mode.local.auto=true will cause Hive to use this mode more aggressively, even when you are running Hadoop in distributed or pseudodistributed mode. To always use this setting, add the command to your $HOME/.hiverc file (see The .hiverc File).
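For example, you can turn the property on and then inspect it interactively with the set command (covered in detail later in this chapter):

hive> set hive.exec.mode.local.auto=true;

hive> set hive.exec.mode.local.auto;
hive.exec.mode.local.auto=true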
Assuming you’re using local mode, let’s look at the local filesystem two different ways. The following output of the Linux ls command shows the typical contents of the “root” directory of a Linux system:
$ ls /
bin   cgroup  etc   lib    lost+found  mnt   opt   root  selinux  sys  user  var
boot  dev     home  lib64  media       null  proc  sbin  srv      tmp  usr
Hadoop provides a dfs tool that offers basic filesystem functionality like ls for the default filesystem. Since we’re using local mode, the default filesystem is the local filesystem:[16]
$ hadoop dfs -ls /
Found 26 items
drwxrwxrwx   - root root  24576 2012-06-03 14:28 /tmp
drwxr-xr-x   - root root   4096 2012-01-25 22:43 /opt
drwx------   - root root  16384 2010-12-30 14:56 /lost+found
drwxr-xr-x   - root root      0 2012-05-11 16:44 /selinux
dr-xr-x---   - root root   4096 2012-05-23 22:32 /root
...
If instead you get an error message that hadoop isn’t found, either invoke the command with the full path (e.g., $HOME/hadoop-0.20.2/bin/hadoop) or add the bin directory to your PATH variable, as discussed in Installing Hadoop above.

If you find yourself using the hadoop dfs command frequently, it’s convenient to define an alias for it (e.g., alias hdfs="hadoop dfs").
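With that alias defined, the listing above becomes simply the following (output elided):

$ alias hdfs="hadoop dfs"
$ hdfs -ls /
Found 26 items
...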
Hadoop offers a framework for MapReduce. The Hadoop distribution contains an implementation of the Word Count algorithm we discussed in Chapter 1. Let’s run it!
Start by creating an input directory (inside your current working directory) with files to be processed by Hadoop:
$ mkdir wc-in
$ echo "bla bla" > wc-in/a.txt
$ echo "bla wa wa " > wc-in/b.txt
Use the hadoop command to launch the Word Count application on the input directory we just created. Note that it’s conventional to always specify directories for input and output, not individual files, since there will often be multiple input and/or output files per directory, a consequence of the parallelism of the system.
If you are running these commands on your local installation that was configured to use local mode, the hadoop command will launch the MapReduce components in the same process. If you are running on a cluster or on a single machine using pseudodistributed mode, the hadoop command will launch one or more separate processes using the JobTracker service (and the output below will be slightly different). Also, if you are running with a different version of Hadoop, change the name of the examples JAR as needed:
$ hadoop jar $HADOOP_HOME/hadoop-0.20.2-examples.jar wordcount wc-in wc-out
12/06/03 15:40:26 INFO input.FileInputFormat: Total input paths to process : 2
...
12/06/03 15:40:27 INFO mapred.JobClient: Running job: job_local_0001
12/06/03 15:40:30 INFO mapred.JobClient:  map 100% reduce 0%
12/06/03 15:40:41 INFO mapred.JobClient:  map 100% reduce 100%
12/06/03 15:40:41 INFO mapred.JobClient: Job complete: job_local_0001
The results of the Word Count application can be viewed through local filesystem commands:

$ ls wc-out/*
part-r-00000

$ cat wc-out/*
bla 3
wa  2
They can also be viewed with the equivalent dfs command (again, because we assume you are running in local mode):
$ hadoop dfs -cat wc-out/*
bla 3
wa  2
For very big files, if you want to view just the first or last parts, there is no -more, -head, nor -tail subcommand. Instead, just pipe the output of the -cat command through the shell’s more, head, or tail. For example: hadoop dfs -cat wc-out/* | more.
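For instance, to see just the first line of the Word Count output from the example above:

$ hadoop dfs -cat wc-out/* | head -1
bla 3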
Now that we have installed and tested an installation of Hadoop, we can install Hive.
Installing Hive is similar to installing Hadoop. We will download and extract a tarball for Hive, which does not include an embedded version of Hadoop. A single Hive binary is designed to work with multiple versions of Hadoop. This means it’s often easier and less risky to upgrade to newer Hive releases than it is to upgrade to newer Hadoop releases.
Hive uses the environment variable HADOOP_HOME to locate the Hadoop JARs and configuration files. So, make sure you set that variable as discussed above before proceeding. The following commands work for both Linux and Mac OS X:
$ cd ~   # or use another directory of your choice.
$ curl -o hive-0.9.0-bin.tar.gz \
  http://archive.apache.org/dist/hive/hive-0.9.0/hive-0.9.0-bin.tar.gz
$ tar -xzf hive-0.9.0-bin.tar.gz
$ sudo mkdir -p /user/hive/warehouse
$ sudo chmod a+rwx /user/hive/warehouse
As you can infer from these commands, we are using the latest stable release of Hive at the time of this writing, v0.9.0. However, most of the material in this book works with Hive v0.7.X and v0.8.X. We’ll call out the differences as we come to them.
You’ll want to add the hive command to your path, like we did for the hadoop command. We’ll follow the same approach, by first defining a HIVE_HOME variable, but unlike HADOOP_HOME, this variable isn’t really essential. We’ll assume it’s defined for some examples later in the book.
For Linux, run these commands:
$ echo "export HIVE_HOME=$PWD/hive-0.9.0-bin" | sudo tee /etc/profile.d/hive.sh
$ echo 'export PATH=$PATH:$HIVE_HOME/bin' | sudo tee -a /etc/profile.d/hive.sh
$ . /etc/profile
For Mac OS X, run these commands:
$ echo "export HIVE_HOME=$PWD/hive-0.9.0-bin" >> $HOME/.bashrc
$ echo 'export PATH=$PATH:$HIVE_HOME/bin' >> $HOME/.bashrc
$ . $HOME/.bashrc
The core of a Hive binary distribution contains three parts. The main part is the Java code itself. Multiple JAR (Java archive) files such as hive-exec*.jar and hive-metastore*.jar are found under the $HIVE_HOME/lib directory. Each JAR file implements a particular subset of Hive’s functionality, but the details don’t concern us now.
The $HIVE_HOME/bin directory contains executable scripts that launch various Hive services, including the hive command-line interface (CLI). The CLI is the most popular way to use Hive. We will use hive (in lowercase, with a fixed-width font) to refer to the CLI, except where noted. The CLI can be used interactively to type in statements one at a time or it can be used to run “scripts” of Hive statements, as we’ll see.
Hive also has other components. A Thrift service provides remote access from other processes. Access via JDBC and ODBC is provided as well; both are implemented on top of the Thrift service. We’ll describe these features in later chapters.
All Hive installations require a metastore service, which Hive uses to store table schemas and other metadata. It is typically implemented using tables in a relational database. By default, Hive uses a built-in Derby SQL server, which provides limited, single-process storage. For example, when using Derby, you can’t run two simultaneous instances of the Hive CLI. However, this is fine for learning Hive on a personal machine and some developer tasks. For clusters, MySQL or a similar relational database is required. We will discuss the details in Metastore Using JDBC.
Finally, a simple web interface, called Hive Web Interface (HWI), provides remote access to Hive.
The conf directory contains the files that configure Hive. Hive has a number of configuration properties that we will discuss as needed. These properties control features such as the metastore (where metadata is stored), various optimizations, and “safety controls.”
Let’s finally start the Hive command-line interface (CLI) and run a few commands! We’ll briefly comment on what’s happening, but save the details for discussion later.
In the following session, we’ll use the $HIVE_HOME/bin/hive command, which is a bash shell script, to start the CLI. Substitute the directory where Hive is installed on your system whenever $HIVE_HOME is listed in the following script. Or, if you added $HIVE_HOME/bin to your PATH, you can just type hive to run the command. We’ll make that assumption for the rest of the book.
As before, $ is the bash prompt. In the Hive CLI, the hive> string is the hive prompt, and the indented > is the secondary prompt. Here is a sample session, where we have added a blank line after the output of each command, for clarity:
$ cd $HIVE_HOME
$ bin/hive
Hive history file=/tmp/myname/hive_job_log_myname_201201271126_1992326118.txt
hive> CREATE TABLE x (a INT);
OK
Time taken: 3.543 seconds

hive> SELECT * FROM x;
OK
Time taken: 0.231 seconds

hive> SELECT *
    > FROM x;
OK
Time taken: 0.072 seconds

hive> DROP TABLE x;
OK
Time taken: 0.834 seconds

hive> exit;
$
The first line printed by the CLI is the local filesystem location where the CLI writes log data about the commands and queries you execute. If a command or query is successful, the first line of output will be OK, followed by the output, and finished by the line showing the amount of time taken to run the command or query.
Throughout the book, we will follow the SQL convention of showing Hive keywords in uppercase (e.g., CREATE, TABLE, SELECT, and FROM), even though Hive ignores case.
Going forward, we’ll usually add the blank line after the command output for all sessions. Also, when starting a session, we’ll omit the line about the logfile. For individual commands and queries, we’ll omit the OK and Time taken: ... lines, too, except in special cases, such as when we want to emphasize that a command or query was successful but had no other output.
At the successive prompts, we create a simple table named x with a single INT (4-byte integer) column named a, then query it twice, the second time showing how queries and commands can spread across multiple lines. Finally, we drop the table.
If you are running with the default Derby database for the metastore, you’ll notice that your current working directory now contains a new subdirectory called metastore_db that was created by Derby during the short hive session you just executed. If you are running one of the VMs, it’s possible it has configured different behavior, as we’ll discuss later.
Creating a metastore_db subdirectory under whatever working directory you happen to be in is not convenient, as Derby “forgets” about previous metastores when you change to a new working directory! In the next section, we’ll see how to configure a permanent location for the metastore database, as well as make other changes.
Let’s dive a little deeper into the different Hadoop modes and discuss more configuration issues relevant to Hive.
You can skip this section if you’re using Hadoop on an existing cluster or you are using a virtual machine instance. If you are a developer or you installed Hadoop and Hive yourself, you’ll want to understand the rest of this section. However, we won’t provide a complete discussion. See Appendix A of Hadoop: The Definitive Guide by Tom White for the full details on configuring the different modes.
Recall that in local mode, all references to files go to your local filesystem, not the distributed filesystem. There are no services running. Instead, your jobs run all tasks in a single JVM instance.
Figure 2-1 illustrates a Hadoop job running in local mode.
If you plan to use the local mode regularly, it’s worth configuring a standard location for the Derby metastore_db, where Hive stores metadata about your tables, etc.
You can also configure a different directory for Hive to store table data, if you don’t want to use the default location, which is file:///user/hive/warehouse for local mode and hdfs://namenode_server/user/hive/warehouse for the other modes discussed next.
First, go to the $HIVE_HOME/conf directory. The curious may want to peek at the large hive-default.xml.template file, which shows the different configuration properties supported by Hive and their default values. Most of these properties you can safely ignore. Changes to your configuration are done by editing the hive-site.xml file. Create one if it doesn’t already exist.
Here is an example configuration file where we set several properties for local mode execution (Example 2-1).
Example 2-1. Local-mode hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/home/me/hive/warehouse</value>
    <description>Local or HDFS directory where Hive keeps table contents.</description>
  </property>
  <property>
    <name>hive.metastore.local</name>
    <value>true</value>
    <description>Use false if a production metastore server is used.</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/home/me/hive/metastore_db;create=true</value>
    <description>The JDBC connection URL.</description>
  </property>
</configuration>
You can remove any of these <property>...</property> tags you don’t want to change.
As the <description> tags indicate, hive.metastore.warehouse.dir tells Hive where in your local filesystem to keep the data contents for Hive’s tables. (This value is appended to the value of fs.default.name defined in the Hadoop configuration, which defaults to file:///.) You can use any directory path you want for the value. Note that this directory will not be used to store the table metadata, which goes in the separate metastore.
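After editing hive-site.xml, you can confirm from within the CLI that Hive picked up your value; with the settings of Example 2-1 you would see:

hive> set hive.metastore.warehouse.dir;
hive.metastore.warehouse.dir=/home/me/hive/warehouse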
The hive.metastore.local property defaults to true, so we don’t really need to show it in Example 2-1. It’s there more for documentation purposes. This property controls whether to connect to a remote metastore server or open a new metastore server as part of the Hive Client JVM. This setting is almost always set to true and JDBC is used to communicate directly with a relational database. When it is set to false, Hive will communicate through a metastore server, which we’ll discuss in Metastore Methods.
The value for the javax.jdo.option.ConnectionURL property makes one small but convenient change to the default value for this property. This property tells Hive how to connect to the metastore server. By default, it uses the current working directory for the databaseName part of the value string. As shown in Example 2-1, we use the absolute path databaseName=/home/me/hive/metastore_db instead, which is the location where the metastore_db directory will always be located. This change eliminates the problem of Hive depositing a metastore_db directory in whatever working directory we happen to start a new Hive session from. Now, we’ll always have access to all our metadata, no matter what directory we are working in.
In distributed mode, several services run in the cluster. The JobTracker manages jobs and the NameNode is the HDFS master. Worker nodes run individual job tasks, managed by a TaskTracker service on each node, and they hold blocks for files in the distributed filesystem, managed by DataNode services.
Figure 2-2 shows a typical distributed mode configuration for a Hadoop cluster.
We use *.domain.pvt as our DNS naming convention for the cluster’s private, internal network.
Pseudodistributed mode is nearly identical; it’s effectively a one-node cluster.
We’ll assume that your administrator has already configured Hadoop, including your distributed filesystem (e.g., HDFS, or see Appendix A of Hadoop: The Definitive Guide by Tom White). Here, we’ll focus on the unique configuration steps required by Hive.
One Hive property you might want to configure is the top-level directory for table storage, specified by the property hive.metastore.warehouse.dir, which we also discussed in Local Mode Configuration.
The default value for this property is /user/hive/warehouse in the Apache Hadoop and MapR distributions, which will be interpreted as a distributed filesystem path when Hadoop is configured for distributed or pseudodistributed mode. For Amazon Elastic MapReduce (EMR), the default value is /mnt/hive_0M_N/warehouse when using Hive v0.M.N (e.g., /mnt/hive_08_1/warehouse).
Specifying a different value here allows each user to define their own warehouse directory, so they don’t affect other system users. Hence, each user might use the following statement to define their own warehouse directory:

set hive.metastore.warehouse.dir=/user/myname/hive/warehouse;
It’s tedious to type this each time you start the Hive CLI or to remember to add it to every Hive script. Of course, it’s also easy to forget to define this property. Instead, it’s best to put commands like this in the $HOME/.hiverc file, which will be processed when Hive starts. See The .hiverc File for more details.
We’ll assume the value is /user/hive/warehouse from here on.
Hive requires only one extra component that Hadoop does not already have: the metastore component. The metastore stores metadata such as table schema and partition information that you specify when you run commands such as create table x... or alter table y.... Because multiple users and systems are likely to need concurrent access to the metastore, the default embedded database is not suitable for production.
If you are using a single node in pseudodistributed mode, you may not find it useful to set up a full relational database for the metastore. Rather, you may wish to continue using the default Derby store, but configure it to use a central location for its data, as described in Local Mode Configuration.
Any JDBC-compliant database can be used for the metastore. In practice, most installations of Hive use MySQL. We’ll discuss how to use MySQL. It is straightforward to adapt this information to other JDBC-compliant databases.
The information required for table schema, partition information, etc., is small, typically much smaller than the large quantity of data stored in Hive. As a result, you typically don’t need a powerful dedicated database server for the metastore. However, because it represents a single point of failure (SPOF), we strongly recommend that you replicate and back up this database using the standard techniques you would normally use with other relational database instances. We won’t discuss those techniques here.
For our MySQL configuration, we need to know the host and port the service is running on. We will assume db1.mydomain.pvt and port 3306, which is the standard MySQL port. Finally, we will assume that hive_db is the name of our catalog. We define these properties in Example 2-2.
Example 2-2. Metastore database configuration in hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://db1.mydomain.pvt/hive_db?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>database_user</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>database_pass</value>
  </property>
</configuration>
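Adapting Example 2-2 to another JDBC-compliant database usually means changing only the connection URL and driver class. Here is a sketch of the two relevant properties for PostgreSQL, reusing the same hypothetical host and catalog names; consult your database’s JDBC documentation for the exact URL syntax:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:postgresql://db1.mydomain.pvt/hive_db</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.postgresql.Driver</value>
</property>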
You may have noticed that the ConnectionURL property starts with the prefix jdbc:mysql. For Hive to be able to connect to MySQL, we need to place the JDBC driver on our classpath. Download the MySQL JDBC driver (Connector/J) from http://www.mysql.com/downloads/connector/j/. The driver can be placed in the Hive library path, $HIVE_HOME/lib. Some teams put all such support libraries in their Hadoop lib directory.
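For example, after downloading and unpacking the Connector/J archive, copying the driver JAR into place might look like this; the exact JAR file name depends on the driver version you downloaded:

$ cp mysql-connector-java-5.1.15-bin.jar $HIVE_HOME/lib/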
With the driver and the configuration settings in place, Hive will store its metastore information in MySQL.
The $HIVE_HOME/bin/hive shell command, which we’ll simply refer to as hive from now on, is the gateway to Hive services, including the command-line interface or CLI.

We’ll also assume that you have added $HIVE_HOME/bin to your environment’s PATH so you can type hive at the shell prompt and your shell environment (e.g., bash) will find the command.
If you run the following command, you’ll see a brief list of the options for the hive command. Here is the output for Hive v0.8.X and v0.9.X:

$ bin/hive --help
Usage ./hive <parameters> --service serviceName <service parameters>
Service List: cli help hiveserver hwi jar lineage metastore rcfilecat
Parameters parsed:
  --auxpath : Auxiliary jars
  --config : Hive configuration directory
  --service : Starts specific service/component. cli is default
Parameters used:
  HADOOP_HOME : Hadoop install directory
  HIVE_OPT : Hive options
For help on a particular service:
  ./hive --service serviceName --help
Debug help: ./hive --debug --help
Note the Service List. There are several services available, including the CLI that we will spend most of our time using. You can invoke a service using the --service name option, although there are shorthand invocations for some of the services, as well. Table 2-2 describes the most useful services.
Table 2-2. Hive services
Option | Name | Description
---|---|---
cli | Command-line interface | Used to define tables, run queries, etc. It is the default service if no other service is specified. See The Command-Line Interface.
hiveserver | Hive Server | A daemon that listens for Thrift connections from other processes. See Chapter 16 for more details.
hwi | Hive Web Interface | A simple web interface for running queries and other commands without logging into a cluster machine and using the CLI.
jar | | An extension of the hadoop jar command for running an application that also requires the Hive environment.
metastore | | Start an external Hive metastore service to support multiple clients (see also Metastore Using JDBC).
rcfilecat | | A tool for printing the contents of an RCFile.
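For example, each of the following launches one of the services in Table 2-2; the first is equivalent to typing hive by itself, since cli is the default service:

$ hive --service cli
$ hive --service hwi
$ hive --service metastore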
The --auxpath option lets you specify a colon-separated list of “auxiliary” Java archive (JAR) files that contain custom extensions, etc., that you might require.

The --config directory option is mostly useful if you keep your configuration files in a directory other than the default $HIVE_HOME/conf.
The command-line interface or CLI is the most common way to interact with Hive. Using the CLI, you can create tables, inspect schema and query tables, etc.
The following command shows a brief list of the options for the CLI. Here we show the output for Hive v0.8.X and v0.9.X:

$ hive --help --service cli
usage: hive
 -d,--define <key=value>       Variable substitution to apply to hive
                               commands. e.g. -d A=B or --define A=B
 -e <quoted-query-string>      SQL from command line
 -f <filename>                 SQL from files
 -H,--help                     Print help information
 -h <hostname>                 connecting to Hive Server on remote host
 --hiveconf <property=value>   Use value for given property
 --hivevar <key=value>         Variable substitution to apply to hive
                               commands. e.g. --hivevar A=B
 -i <filename>                 Initialization SQL file
 -p <port>                     connecting to Hive Server on port number
 -S,--silent                   Silent mode in interactive shell
 -v,--verbose                  Verbose mode (echo executed SQL to the console)
A shorter version of this command is hive -h. However, that’s technically an unsupported option; it produces the help output, but with an additional line that complains about Missing argument for option: h.

For Hive v0.7.X, the -d, --hivevar, and -p options are not supported.
Let’s explore these options in more detail.
The --define key=value option is effectively equivalent to the --hivevar key=value option. Both let you define custom variables on the command line that you can reference in Hive scripts to customize execution. This feature is only supported in Hive v0.8.0 and later versions.
When you use this feature, Hive puts the key-value pair in the hivevar “namespace” to distinguish these definitions from three other built-in namespaces: hiveconf, system, and env.
The terms variable or property are used in different contexts, but they function the same way in most cases.
The namespace options are described in Table 2-3.
Hive’s variables are internally stored as Java Strings. You can reference variables in queries; Hive replaces the reference with the variable’s value before sending the query to the query processor.
Inside the CLI, variables are displayed and changed using the SET command. For example, the following session shows the value for one variable, in the env namespace, and then all variable definitions! Here is a Hive session where some output has been omitted and we have added a blank line after the output of each command for clarity:

$ hive

hive> set env:HOME;
env:HOME=/home/thisuser

hive> set;
... lots of output including these variables:
hive.stats.retries.wait=3000
env:TERM=xterm
system:user.timezone=America/New_York
...

hive> set -v;
... even more output!...
Without the -v flag, set prints all the variables in the namespaces hivevar, hiveconf, system, and env. With the -v option, it also prints all the properties defined by Hadoop, such as properties controlling HDFS and MapReduce.
The set command is also used to set new values for variables. Let’s look specifically at the hivevar namespace and a variable that is defined for it on the command line:
$ hive --define foo=bar

hive> set foo;
foo=bar;

hive> set hivevar:foo;
hivevar:foo=bar;

hive> set hivevar:foo=bar2;

hive> set foo;
foo=bar2

hive> set hivevar:foo;
hivevar:foo=bar2
As we can see, the hivevar: prefix is optional. The --hivevar flag is the same as the --define flag.

Variable references in queries are replaced in the CLI before the query is sent to the query processor. Consider the following hive CLI session (v0.8.X only):
hive> create table toss1(i int, ${hivevar:foo} string);

hive> describe toss1;
i       int
bar2    string

hive> create table toss2(i2 int, ${foo} string);

hive> describe toss2;
i2      int
bar2    string

hive> drop table toss1;

hive> drop table toss2;
Let’s look at the --hiveconf option, which is supported even in Hive v0.7.X. It is used for all properties that configure Hive behavior. We’ll use it with the property hive.cli.print.current.db, which was added in Hive v0.8.0. It turns on printing of the current working database name in the CLI prompt. (See Databases in Hive for more on Hive databases.) The default database is named default. This property is false by default:
$ hive --hiveconf hive.cli.print.current.db=true

hive (default)> set hive.cli.print.current.db;
hive.cli.print.current.db=true

hive (default)> set hiveconf:hive.cli.print.current.db;
hiveconf:hive.cli.print.current.db=true

hive (default)> set hiveconf:hive.cli.print.current.db=false;

hive> set hiveconf:hive.cli.print.current.db=true;

hive (default)> ...
We can even add new hiveconf entries, which is the only supported option for Hive versions earlier than v0.8.0:
$ hive --hiveconf y=5

hive> set y;
y=5

hive> CREATE TABLE whatsit(i int);

hive> ... load data into whatsit ...

hive> SELECT * FROM whatsit WHERE i = ${hiveconf:y};
...
It’s also useful to know about the system namespace, which provides read-write access to Java system properties, and the env namespace, which provides read-only access to environment variables:
hive> set system:user.name;
system:user.name=myusername

hive> set system:user.name=yourusername;

hive> set system:user.name;
system:user.name=yourusername

hive> set env:HOME;
env:HOME=/home/yourusername

hive> set env:HOME=/tmp;
env:* variables can not be set.
Unlike hivevar variables, you have to use the system: or env: prefix with system properties and environment variables.
The env namespace is useful as an alternative way to pass variable definitions to Hive, especially for Hive v0.7.X. Consider the following example; note the single quotes, which prevent the shell from attempting its own expansion of ${env:YEAR} and leave the substitution to Hive:

$ YEAR=2012 hive -e 'SELECT * FROM mytable WHERE year = ${env:YEAR}';
The query processor will see the literal number 2012 in the WHERE clause.
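If you are running Hive v0.8.0 or later, a --hivevar definition achieves the same effect; here is a sketch using the optional hivevar: prefix discussed earlier:

$ hive --hivevar YEAR=2012 -e 'SELECT * FROM mytable WHERE year = ${hivevar:YEAR}';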
If you are using Hive v0.7.X, some of the examples in this book that use parameters and variables may not work as written. If so, replace the variable reference with the corresponding value.
The user may wish to run one or more queries (semicolon separated) and then have the hive CLI exit immediately after completion. The CLI accepts a -e command argument that enables this feature.
If mytable has a string column and an integer column, we might see the following output:

$ hive -e "SELECT * FROM mytable LIMIT 3";
OK
name1   10
name2   20
name3   30
Time taken: 4.955 seconds
$
A quick and dirty technique is to use this feature to output the query results to a file. Adding the -S for silent mode removes the OK and Time taken ... lines, as well as other inessential output, as in this example:

$ hive -S -e "SELECT * FROM mytable LIMIT 3" > /tmp/myquery
$ cat /tmp/myquery
name1   10
name2   20
name3   30
Note that hive wrote the output to the standard output and the shell command redirected that output to the local filesystem, not to HDFS.
Finally, here is a useful trick for finding a property name that you can’t quite remember, without having to scroll through the list of the set output. Suppose you can’t remember the name of the property that specifies the “warehouse” location for managed tables:

$ hive -S -e "set" | grep warehouse
hive.metastore.warehouse.dir=/user/hive/warehouse
hive.warehouse.subdir.inherit.perms=false
Hive can execute one or more queries that were saved to a file using the -f file argument. By convention, saved Hive query files use the .q or .hql extension.

$ hive -f /path/to/file/withqueries.hql
If you are already inside the Hive shell you can use the SOURCE command to execute a script file. Here is an example:

$ cat /path/to/file/withqueries.hql
SELECT x.* FROM src x;

$ hive
hive> source /path/to/file/withqueries.hql;
...
By the way, we’ll occasionally use the name src (“source”) for tables in queries when the name of the table is irrelevant for the example. This convention is taken from the unit tests in Hive’s source code, which create a src table before running the tests.
For example, when experimenting with a built-in function, it’s convenient to write a “query” that passes literal arguments to the function, as in the following example taken from later in the book, XPath-Related Functions:
hive> SELECT xpath('<a><b id="foo">b1</b><b id="bar">b2</b></a>', '//@id')
    > FROM src LIMIT 1;
["foo","bar"]
The details for xpath don’t concern us here, but note that we pass string literals to the xpath function and use FROM src LIMIT 1 to specify the required FROM clause and to limit the output. Substitute src with the name of a table you have already created or create a dummy table named src:
CREATE TABLE src(s STRING);
Also, the source table must have at least one row of content in it:

$ echo "one row" > /tmp/myfile
$ hive -e "LOAD DATA LOCAL INPATH '/tmp/myfile' INTO TABLE src;"
The last CLI option we’ll discuss is the -i file option, which lets you specify a file of commands for the CLI to run as it starts, before showing you the prompt. Hive automatically looks for a file named .hiverc in your HOME directory and runs the commands it contains, if any.
These files are convenient for commands that you run frequently, such as setting system properties (see Variables and Properties) or adding Java archives (JAR files) of custom Hive extensions to Hadoop’s distributed cache (as discussed in Chapter 15).
The following shows an example of a typical $HOME/.hiverc file:
ADD JAR /path/to/custom_hive_extensions.jar;
set hive.cli.print.current.db=true;
set hive.exec.mode.local.auto=true;
The first line adds a JAR file to the Hadoop distributed cache. The second line modifies the CLI prompt to show the current working Hive database, as we described earlier in Variables and Properties. The last line “encourages” Hive to be more aggressive about using local-mode execution when possible, even when Hadoop is running in distributed or pseudodistributed mode, which speeds up queries for small data sets.
The CLI supports a number of other useful features.
If you start typing and hit the Tab key, the CLI will autocomplete possible keywords and function names. For example, if you type SELE and then the Tab key, the CLI will complete the word SELECT.
If you type the Tab key at the prompt, you’ll get this reply:
hive> Display all 407 possibilities? (y or n)

If you enter y, you’ll get a long list of all the keywords and built-in functions.
A common source of error and confusion when pasting statements into the CLI occurs where some lines begin with a tab. You’ll get the prompt about displaying all possibilities, and subsequent characters in the stream will get misinterpreted as answers to the prompt, causing the command to fail.
You can use the up and down arrow keys to scroll through previous commands. Actually, each previous line of input is shown separately; the CLI does not combine multiline commands and queries into a single history entry. Hive saves the last 100,000 lines in the file $HOME/.hivehistory.
If you want to repeat a previous command, scroll to it and hit Enter. If you want to edit the line before entering it, use the left and right arrow keys to navigate to the point where changes are required and edit the line. You can hit Return to submit it without returning to the end of the line.
Most navigation keystrokes using the Control key work as they do for the bash shell (e.g., Control-A goes to the beginning of the line and Control-E goes to the end of the line). However, similar “meta,” Option, or Escape keys don’t work (e.g., Option-F to move forward a word at a time). Similarly, the Delete key will delete the character to the left of the cursor, but the Forward Delete key doesn’t delete the character under the cursor.
You don’t need to leave the hive CLI to run simple bash shell commands. Simply type ! followed by the command and terminate the line with a semicolon (;):

hive> ! /bin/echo "what up dog";
"what up dog"

hive> ! pwd;
/home/me/hiveplay
Don’t invoke interactive commands that require user input. Shell “pipes” don’t work and neither do file “globs.” For example, ! ls *.hql; will look for a file named *.hql;, rather than all files that end with the .hql extension.
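For example, a glob that would match files in bash fails inside the CLI; the exact error text comes from ls and may vary by platform:

hive> ! ls *.hql;
ls: cannot access *.hql: No such file or directory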
You can run the hadoop dfs ... commands from within the hive CLI; just drop the hadoop word from the command and add the semicolon at the end:

hive> dfs -ls / ;
Found 3 items
drwxr-xr-x   - root   supergroup          0 2011-08-17 16:27 /etl
drwxr-xr-x   - edward supergroup          0 2012-01-18 15:51 /flag
drwxrwxr-x   - hadoop supergroup          0 2010-02-03 17:50 /users
This method of accessing hadoop commands is actually more efficient than using the hadoop dfs ... equivalent at the bash shell, because the latter starts up a new JVM instance each time, whereas Hive just runs the same code in its current process.
You can see a full listing of help on the options supported by dfs using this command:

hive> dfs -help;
See also http://hadoop.apache.org/common/docs/r0.20.205.0/file_system_shell.html or similar documentation for your Hadoop distribution.
As of Hive v0.8.0, you can embed lines of comments that start with the string --, for example:

-- Copyright (c) 2012 Megacorp, LLC.
-- This is the best Hive script evar!!

SELECT * FROM massive_table;
...
As a final example that pulls together a few things we’ve learned, let’s tell the CLI to print column headers, which is disabled by default. We can enable this feature by setting the hiveconf property hive.cli.print.header to true:

hive> set hive.cli.print.header=true;

hive> SELECT * FROM system_logs LIMIT 3;
tstamp             severity  server   message
1335667117.337715  ERROR     server1  Hard drive hd1 is 90% full!
1335667117.338012  WARN      server1  Slow response from server2.
1335667117.339234  WARN      server2  Uh, Dude, I'm kinda busy right now...
If you always prefer seeing the headers, put the first line in your $HOME/.hiverc file.
[13] However, some vendors are starting to support Hadoop on other systems. Hadoop has been used in production on various Unix systems and it works fine on Mac OS X for development use.
[14] These are the current URLs at the time of this writing.
[15] At least that’s the current situation on Dean’s Mac. This discrepancy may actually reflect the fact that stewardship of the Mac OS X Java port is transitioning from Apple to Oracle as of Java 1.7.
[16] Unfortunately, the dfs -ls command only provides a “long listing” format. There is no short format, like the default for the Linux ls command.