Chapter 2. Getting Started

Let’s install Hadoop and Hive on our personal workstation. This is a convenient way to learn and experiment with Hadoop. Then we’ll discuss how to configure Hive for use on Hadoop clusters.

If you already use Amazon Web Services, the fastest path to setting up Hive for learning is to run a Hive-configured job flow on Amazon Elastic MapReduce (EMR). We discuss this option in Chapter 21.

If you have access to a Hadoop cluster with Hive already installed, we encourage you to skim the first part of this chapter and pick up again at What Is Inside Hive?.

Installing a Preconfigured Virtual Machine

There are several ways you can install Hadoop and Hive. An easy way to install a complete Hadoop system, including Hive, is to download a preconfigured virtual machine (VM) that runs in VMWare[11] or VirtualBox[12]. For VMWare, either VMWare Player for Windows and Linux (free) or VMWare Fusion for Mac OS X (inexpensive) can be used. VirtualBox is free for all these platforms, and also Solaris.

The virtual machines use Linux as the operating system, which is currently the only recommended operating system for running Hadoop in production.[13]

Note

Using a virtual machine is currently the only way to run Hadoop on Windows systems, even when Cygwin or similar Unix-like software is installed.

Most of the preconfigured virtual machines (VMs) available are only designed for VMWare, but if you prefer VirtualBox you may find instructions on the Web that explain how to import a particular VM into VirtualBox.

You can download preconfigured virtual machines from one of the websites given in Table 2-1.[14] Follow the instructions on these web sites for loading the VM into VMWare.

Table 2-1. Preconfigured Hadoop virtual machines for VMWare

Provider: Cloudera, Inc.
URL: https://ccp.cloudera.com/display/SUPPORT/Cloudera’s+Hadoop+Demo+VM
Notes: Uses Cloudera’s own distribution of Hadoop, CDH3 or CDH4.

Provider: MapR, Inc.
URL: http://www.mapr.com/doc/display/MapR/Quick+Start+-+Test+Drive+MapR+on+a+Virtual+Machine
Notes: MapR’s Hadoop distribution, which replaces HDFS with the MapR Filesystem (MapR-FS).

Provider: Hortonworks, Inc.
URL: http://docs.hortonworks.com/HDP-1.0.4-PREVIEW-6/Using_HDP_Single_Box_VM/HDP_Single_Box_VM.htm
Notes: Based on the latest, stable Apache releases.

Provider: Think Big Analytics, Inc.
URL: http://thinkbigacademy.s3-website-us-east-1.amazonaws.com/vm/README.html
Notes: Based on the latest, stable Apache releases.

Next, go to What Is Inside Hive?.

Detailed Installation

While using a preconfigured virtual machine may be an easy way to run Hive, installing Hadoop and Hive yourself will give you valuable insights into how these tools work, especially if you are a developer.

The instructions that follow describe the minimum necessary Hadoop and Hive installation steps for your personal Linux or Mac OS X workstation. For production installations, consult the recommended installation procedures for your Hadoop distributor.

Installing Java

Hive requires Hadoop and Hadoop requires Java. Ensure your system has a recent v1.6.X or v1.7.X JVM (Java Virtual Machine). Although the JRE (Java Runtime Environment) is all you need to run Hive, you will need the full JDK (Java Development Kit) to build examples in this book that demonstrate how to extend Hive with Java code. However, if you are not a programmer, the companion source code distribution for this book (see the Preface) contains prebuilt examples.
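If you’re not sure whether your machine has the full JDK or just the JRE, a quick check is to ask for the Java compiler’s version; the compiler ships only with the JDK (the version shown here is just an example):

$ javac -version
javac 1.6.0_23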

After the installation is complete, you’ll need to ensure that Java is in your path and the JAVA_HOME environment variable is set.

Linux-specific Java steps

On Linux systems, the following instructions set up a bash file in the /etc/profile.d/ directory that defines JAVA_HOME for all users. Changing environment settings in this directory requires root access and affects all users of the system. (We’re using $ as the bash shell prompt.) The Oracle JVM installer typically installs the software in /usr/java/jdk-1.6.X (for v1.6) and creates symlinks from /usr/java/default and /usr/java/latest to the installation:

$ /usr/java/latest/bin/java -version
java version "1.6.0_23"
Java(TM) SE Runtime Environment (build 1.6.0_23-b05)
Java HotSpot(TM) 64-Bit Server VM (build 19.0-b09, mixed mode)
$ sudo echo "export JAVA_HOME=/usr/java/latest" > /etc/profile.d/java.sh
$ sudo echo "PATH=$PATH:$JAVA_HOME/bin" >> /etc/profile.d/java.sh
$ . /etc/profile
$ echo $JAVA_HOME
/usr/java/latest

Note

If you’ve never used sudo (“super user do something”) before to run a command as a “privileged” user, as in two of the commands, just type your normal password when you’re asked for it. If you’re on a personal machine, your user account probably has “sudo rights.” If not, ask your administrator to run those commands.

However, if you don’t want to make permanent changes that affect all users of the system, an alternative is to put the definitions shown for PATH and JAVA_HOME in your $HOME/.bashrc file:

export JAVA_HOME=/usr/java/latest
export PATH=$PATH:$JAVA_HOME/bin
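After editing $HOME/.bashrc, open a new terminal window or source the file in your current shell, then confirm that the settings took effect:

$ . $HOME/.bashrc
$ echo $JAVA_HOME
/usr/java/latest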

Mac OS X-specific Java steps

Mac OS X systems don’t have the /etc/profile.d directory and they are typically single-user systems, so it’s best to put the environment variable definitions in your $HOME/.bashrc. The Java paths are different, too, and they may be in one of several places.[15]

Here are a few examples. You’ll need to determine where Java is installed on your Mac and adjust the definitions accordingly. Here is a Java 1.6 example for Mac OS X:

$ export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
$ export PATH=$PATH:$JAVA_HOME/bin

Here is a Java 1.7 example for Mac OS X:

$ export JAVA_HOME=/Library/Java/JavaVirtualMachines/1.7.0.jdk/Contents/Home
$ export PATH=$PATH:$JAVA_HOME/bin

OpenJDK 1.7 releases also install under /Library/Java/JavaVirtualMachines.

Installing Hadoop

Hive runs on top of Hadoop. Hadoop is an active open source project with many releases and branches. Also, many commercial software companies are now producing their own distributions of Hadoop, sometimes with custom enhancements or replacements for some components. This situation promotes innovation, but also potential confusion and compatibility issues.

Keeping software up to date lets you exploit the latest performance enhancements and bug fixes. However, sometimes you introduce new bugs and compatibility issues. So, for this book, we’ll show you how to install the Apache Hadoop release v0.20.2. This edition is not the most recent stable release, but it has been the reliable gold standard for some time for performance and compatibility.

However, you should be able to choose a different version, distribution, or release without problems for learning and using Hive, such as the Apache Hadoop v0.20.205 or 1.0.X releases, Cloudera CDH3 or CDH4, MapR M3 or M5, and the forthcoming Hortonworks distribution. Note that the bundled Cloudera, MapR, and planned Hortonworks distributions all include a Hive release.

However, we don’t recommend installing the new, alpha-quality, “Next Generation” Hadoop v2.0 (also known as v0.23), at least for the purposes of this book. While this release will bring significant enhancements to the Hadoop ecosystem, it is too new for our purposes.

To install Hadoop on a Linux system, run the following commands. Note that we wrapped the long line for the wget command:

$ cd ~                 # or use another directory of your choice.
$ wget \
 http://www.us.apache.org/dist/hadoop/common/hadoop-0.20.2/hadoop-0.20.2.tar.gz
$ tar -xzf hadoop-0.20.2.tar.gz
$ sudo echo "export HADOOP_HOME=$PWD/hadoop-0.20.2" > /etc/profile.d/hadoop.sh
$ sudo echo "PATH=$PATH:$HADOOP_HOME/bin" >> /etc/profile.d/hadoop.sh
$ . /etc/profile

To install Hadoop on a Mac OS X system, run the following commands. Note that we wrapped the long line for the curl command:

$ cd ~                 # or use another directory of your choice.
$ curl -O \
 http://www.us.apache.org/dist/hadoop/common/hadoop-0.20.2/hadoop-0.20.2.tar.gz
$ tar -xzf hadoop-0.20.2.tar.gz
$ echo "export HADOOP_HOME=$PWD/hadoop-0.20.2" >> $HOME/.bashrc
$ echo "PATH=$PATH:$HADOOP_HOME/bin" >> $HOME/.bashrc
$ . $HOME/.bashrc

In what follows, we will assume that you added $HADOOP_HOME/bin to your path, as in the previous commands. This will allow you to simply type the hadoop command without the path prefix.
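As a quick sanity check that the hadoop command is on your path and that you installed the release you intended, ask Hadoop for its version (the lines after the first vary by release and are omitted here):

$ hadoop version
Hadoop 0.20.2
...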

Local Mode, Pseudodistributed Mode, and Distributed Mode

Before we proceed, let’s clarify the different runtime modes for Hadoop. We mentioned above that the default mode is local mode, where filesystem references use the local filesystem. Also in local mode, when Hadoop jobs are executed (including most Hive queries), the Map and Reduce tasks are run as part of the same process.

Actual clusters are configured in distributed mode, where all filesystem references that aren’t full URIs default to the distributed filesystem (usually HDFS) and jobs are managed by the JobTracker service, with individual tasks executed in separate processes.

A dilemma for developers working on personal machines is the fact that local mode doesn’t closely resemble the behavior of a real cluster, which is important to remember when testing applications. To address this need, a single machine can be configured to run in pseudodistributed mode, where the behavior is identical to distributed mode, namely filesystem references default to the distributed filesystem and jobs are managed by the JobTracker service, but there is just a single machine. Hence, for example, HDFS file block replication is limited to one copy. In other words, the behavior is like a single-node “cluster.” We’ll discuss these configuration options in Configuring Your Hadoop Environment.

Because Hive uses Hadoop jobs for most of its work, its behavior reflects the Hadoop mode you’re using. However, even when running in distributed mode, Hive can decide on a per-query basis whether or not it can perform the query using just local mode, where it reads the data files and manages the MapReduce tasks itself, providing faster turnaround. Hence, the distinction between the different modes is more of an execution style for Hive than a deployment style, as it is for Hadoop.

For most of the book, it won’t matter which mode you’re using. We’ll assume you’re working on a personal machine in local mode and we’ll discuss the cases where the mode matters.

Tip

When working with small data sets, using local mode execution will make Hive queries much faster. Setting the property set hive.exec.mode.local.auto=true; will cause Hive to use this mode more aggressively, even when you are running Hadoop in distributed or pseudodistributed mode. To always use this setting, add the command to your $HOME/.hiverc file (see The .hiverc File).

Testing Hadoop

Assuming you’re using local mode, let’s look at the local filesystem two different ways. The following output of the Linux ls command shows the typical contents of the “root” directory of a Linux system:

$ ls /
bin   cgroup  etc   lib    lost+found  mnt   opt   root  selinux  sys  user  var
boot  dev     home  lib64  media       null  proc  sbin  srv      tmp  usr

Hadoop provides a dfs tool that offers basic filesystem functionality like ls for the default filesystem. Since we’re using local mode, the default filesystem is the local filesystem:[16]

$ hadoop dfs -ls /
Found 26 items
drwxrwxrwx   - root root      24576 2012-06-03 14:28 /tmp
drwxr-xr-x   - root root       4096 2012-01-25 22:43 /opt
drwx------   - root root      16384 2010-12-30 14:56 /lost+found
drwxr-xr-x   - root root          0 2012-05-11 16:44 /selinux
dr-xr-x---   - root root       4096 2012-05-23 22:32 /root
...

If instead you get an error message that hadoop isn’t found, either invoke the command with the full path (e.g., $HOME/hadoop-0.20.2/bin/hadoop) or add the bin directory to your PATH variable, as discussed in Installing Hadoop above.

Note

If you find yourself using the hadoop dfs command frequently, it’s convenient to define an alias for it (e.g., alias hdfs="hadoop dfs").

Hadoop offers a framework for MapReduce. The Hadoop distribution contains an implementation of the Word Count algorithm we discussed in Chapter 1. Let’s run it!

Start by creating an input directory (inside your current working directory) with files to be processed by Hadoop:

$ mkdir wc-in
$ echo "bla bla" > wc-in/a.txt
$ echo "bla wa wa " > wc-in/b.txt

Use the hadoop command to launch the Word Count application on the input directory we just created. Note that it’s conventional to always specify directories for input and output, not individual files, since there will often be multiple input and/or output files per directory, a consequence of the parallelism of the system.

If you are running these commands on your local installation that was configured to use local mode, the hadoop command will launch the MapReduce components in the same process. If you are running on a cluster or on a single machine using pseudodistributed mode, the hadoop command will launch one or more separate processes using the JobTracker service (and the output below will be slightly different). Also, if you are running with a different version of Hadoop, change the name of the examples.jar as needed:

$ hadoop jar $HADOOP_HOME/hadoop-0.20.2-examples.jar wordcount wc-in wc-out
12/06/03 15:40:26 INFO input.FileInputFormat: Total input paths to process : 2
...
12/06/03 15:40:27 INFO mapred.JobClient: Running job: job_local_0001
12/06/03 15:40:30 INFO mapred.JobClient:  map 100% reduce 0%
12/06/03 15:40:41 INFO mapred.JobClient:  map 100% reduce 100%
12/06/03 15:40:41 INFO mapred.JobClient: Job complete: job_local_0001

The results of the Word Count application can be viewed through local filesystem commands:

$ ls wc-out/*
part-r-00000
$ cat wc-out/*
bla     3
wa      2

They can also be viewed by the equivalent dfs command (again, because we assume you are running in local mode):

$ hadoop dfs -cat wc-out/*
bla     3
wa      2

Note

For very big files, if you want to view just the first or last parts, there is no -more, -head, nor -tail subcommand. Instead, just pipe the output of the -cat command through the shell’s more, head, or tail. For example: hadoop dfs -cat wc-out/* | more.

Now that we have installed and tested an installation of Hadoop, we can install Hive.

Installing Hive

Installing Hive is similar to installing Hadoop. We will download and extract a tarball for Hive, which does not include an embedded version of Hadoop. A single Hive binary is designed to work with multiple versions of Hadoop. This means it’s often easier and less risky to upgrade to newer Hive releases than it is to upgrade to newer Hadoop releases.

Hive uses the environment variable HADOOP_HOME to locate the Hadoop JARs and configuration files. So, make sure you set that variable as discussed above before proceeding. The following commands work for both Linux and Mac OS X:

$ cd ~                 # or use another directory of your choice.
$ curl -O http://archive.apache.org/dist/hive/hive-0.9.0/hive-0.9.0-bin.tar.gz
$ tar -xzf hive-0.9.0-bin.tar.gz
$ sudo mkdir -p /user/hive/warehouse
$ sudo chmod a+rwx /user/hive/warehouse

As you can infer from these commands, we are using the latest stable release of Hive at the time of this writing, v0.9.0. However, most of the material in this book works with Hive v0.7.X and v0.8.X. We’ll call out the differences as we come to them.

You’ll want to add the hive command to your path, like we did for the hadoop command. We’ll follow the same approach, by first defining a HIVE_HOME variable, but unlike HADOOP_HOME, this variable isn’t really essential. We’ll assume it’s defined for some examples later in the book.

For Linux, run these commands:

$ sudo echo "export HIVE_HOME=$PWD/hive-0.9.0" > /etc/profile.d/hive.sh
$ sudo echo "PATH=$PATH:$HIVE_HOME/bin >> /etc/profile.d/hive.sh
$ . /etc/profile

For Mac OS X, run these commands:

$ echo "export HIVE_HOME=$PWD/hive-0.9.0" >> $HOME/.bashrc
$ echo "PATH=$PATH:$HIVE_HOME/bin" >> $HOME/.bashrc
$ . $HOME/.bashrc

What Is Inside Hive?

The core of a Hive binary distribution contains three parts. The main part is the Java code itself. Multiple JAR (Java archive) files such as hive-exec*.jar and hive-metastore*.jar are found under the $HIVE_HOME/lib directory. Each JAR file implements a particular subset of Hive’s functionality, but the details don’t concern us now.

The $HIVE_HOME/bin directory contains executable scripts that launch various Hive services, including the hive command-line interface (CLI). The CLI is the most popular way to use Hive. We will use hive (in lowercase, with a fixed-width font) to refer to the CLI, except where noted. The CLI can be used interactively to type in statements one at a time or it can be used to run “scripts” of Hive statements, as we’ll see.

Hive also has other components. A Thrift service provides remote access from other processes. Access using JDBC and ODBC are provided, too. They are implemented on top of the Thrift service. We’ll describe these features in later chapters.

All Hive installations require a metastore service, which Hive uses to store table schemas and other metadata. It is typically implemented using tables in a relational database. By default, Hive uses a built-in Derby SQL server, which provides limited, single-process storage. For example, when using Derby, you can’t run two simultaneous instances of the Hive CLI. However, this is fine for learning Hive on a personal machine and some developer tasks. For clusters, MySQL or a similar relational database is required. We will discuss the details in Metastore Using JDBC.

Finally, a simple web interface, called Hive Web Interface (HWI), provides remote access to Hive.

The conf directory contains the files that configure Hive. Hive has a number of configuration properties that we will discuss as needed. These properties control features such as the metastore (where metadata is stored), various optimizations, and “safety controls,” etc.

Starting Hive

Let’s finally start the Hive command-line interface (CLI) and run a few commands! We’ll briefly comment on what’s happening, but save the details for discussion later.

In the following session, we’ll use the $HIVE_HOME/bin/hive command, which is a bash shell script, to start the CLI. Substitute the directory where Hive is installed on your system whenever $HIVE_HOME is listed in the following script. Or, if you added $HIVE_HOME/bin to your PATH, you can just type hive to run the command. We’ll make that assumption for the rest of the book.

As before, $ is the bash prompt. In the Hive CLI, the hive> string is the hive prompt, and the indented > is the secondary prompt. Here is a sample session, where we have added a blank line after the output of each command, for clarity:

$ cd $HIVE_HOME
$ bin/hive
Hive history file=/tmp/myname/hive_job_log_myname_201201271126_1992326118.txt
hive> CREATE TABLE x (a INT);
OK
Time taken: 3.543 seconds

hive> SELECT * FROM x;
OK
Time taken: 0.231 seconds

hive> SELECT *
    > FROM x;
OK
Time taken: 0.072 seconds

hive> DROP TABLE x;
OK
Time taken: 0.834 seconds

hive> exit;
$

The first line printed by the CLI is the local filesystem location where the CLI writes log data about the commands and queries you execute. If a command or query is successful, the first line of output will be OK, followed by the output, and finished by the line showing the amount of time taken to run the command or query.

Note

Throughout the book, we will follow the SQL convention of showing Hive keywords in uppercase (e.g., CREATE, TABLE, SELECT, and FROM), even though Hive ignores the case of keywords.

Going forward, we’ll usually add the blank line after the command output for all sessions. Also, when starting a session, we’ll omit the line about the logfile. For individual commands and queries, we’ll omit the OK and Time taken:... lines, too, except in special cases, such as when we want to emphasize that a command or query was successful, but it had no other output.

At the successive prompts, we create a simple table named x with a single INT (4-byte integer) column named a, then query it twice, the second time showing how queries and commands can spread across multiple lines. Finally, we drop the table.

If you are running with the default Derby database for the metastore, you’ll notice that your current working directory now contains a new subdirectory called metastore_db that was created by Derby during the short hive session you just executed. If you are running one of the VMs, it’s possible it has configured different behavior, as we’ll discuss later.

Creating a metastore_db subdirectory under whatever working directory you happen to be in is not convenient, as Derby “forgets” about previous metastores when you change to a new working directory! In the next section, we’ll see how to configure a permanent location for the metastore database, as well as make other changes.

Configuring Your Hadoop Environment

Let’s dive a little deeper into the different Hadoop modes and discuss more configuration issues relevant to Hive.

You can skip this section if you’re using Hadoop on an existing cluster or you are using a virtual machine instance. If you are a developer or you installed Hadoop and Hive yourself, you’ll want to understand the rest of this section. However, we won’t provide a complete discussion. See Appendix A of Hadoop: The Definitive Guide by Tom White for the full details on configuring the different modes.

Local Mode Configuration

Recall that in local mode, all references to files go to your local filesystem, not the distributed filesystem. There are no services running. Instead, your jobs run all tasks in a single JVM instance.

Figure 2-1 illustrates a Hadoop job running in local mode.

Hadoop in local mode

Figure 2-1. Hadoop in local mode

If you plan to use the local mode regularly, it’s worth configuring a standard location for the Derby metastore_db, where Hive stores metadata about your tables, etc.

You can also configure a different directory for Hive to store table data, if you don’t want to use the default location, which is file:///user/hive/warehouse, for local mode, and hdfs://namenode_server/user/hive/warehouse for the other modes discussed next.

First, go to the $HIVE_HOME/conf directory. The curious may want to peek at the large hive-default.xml.template file, which shows the different configuration properties supported by Hive and their default values. Most of these properties you can safely ignore. Changes to your configuration are done by editing the hive-site.xml file. Create one if it doesn’t already exist.

Here is an example configuration file where we set several properties for local mode execution (Example 2-1).

Example 2-1. Local-mode hive-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/home/me/hive/warehouse</value>
    <description>
      Local or HDFS directory where Hive keeps table contents.
    </description>
  </property>
  <property>
    <name>hive.metastore.local</name>
    <value>true</value>
    <description>
      Use false if a production metastore server is used.
    </description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/home/me/hive/metastore_db;create=true</value>
    <description>
      The JDBC connection URL.
    </description>
  </property>
</configuration>

You can remove any of these <property>...</property> tags you don’t want to change.

As the <description> tags indicate, the hive.metastore.warehouse.dir tells Hive where in your local filesystem to keep the data contents for Hive’s tables. (This value is appended to the value of fs.default.name defined in the Hadoop configuration and defaults to file:///.) You can use any directory path you want for the value. Note that this directory will not be used to store the table metadata, which goes in the separate metastore.

The hive.metastore.local property defaults to true, so we don’t really need to show it in Example 2-1. It’s there more for documentation purposes. This property controls whether to connect to a remote metastore server or open a new metastore server as part of the Hive Client JVM. This setting is almost always set to true and JDBC is used to communicate directly to a relational database. When it is set to false, Hive will communicate through a metastore server, which we’ll discuss in Metastore Methods.

The value for the javax.jdo.option.ConnectionURL property makes one small but convenient change to the default value for this property. This property tells Hive how to connect to the metastore server. By default, it uses the current working directory for the databaseName part of the value string. As shown in Example 2-1, we use databaseName=/home/me/hive/metastore_db as the absolute path instead, which is the location where the metastore_db directory will always be located. This change eliminates the problem of Hive dropping the metastore_db directory in the current working directory every time we start a new Hive session. Now, we’ll always have access to all our metadata, no matter what directory we are working in.
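With the absolute databaseName in place, you can convince yourself that the metastore is now shared by starting Hive from two different working directories. Here is a quick check; the table name derby_check is just a throwaway example:

$ cd /tmp
$ hive -e "CREATE TABLE derby_check (i INT);"
$ cd ~
$ hive -S -e "SHOW TABLES;"
derby_check

Both sessions use the metastore in /home/me/hive/metastore_db, and no new metastore_db subdirectory appears in /tmp or your home directory.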

Distributed and Pseudodistributed Mode Configuration

In distributed mode, several services run in the cluster. The JobTracker manages jobs and the NameNode is the HDFS master. Worker nodes run individual job tasks, managed by a TaskTracker service on each node, and hold blocks for files in the distributed filesystem, managed by DataNode services.

Figure 2-2 shows a typical distributed mode configuration for a Hadoop cluster.

Hadoop in distributed mode

Figure 2-2. Hadoop in distributed mode

We’re using the convention that *.domain.pvt is our DNS naming convention for the cluster’s private, internal network.

Pseudodistributed mode is nearly identical; it’s effectively a one-node cluster.
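If you are setting up pseudodistributed mode yourself, the essential Hadoop changes are small. The following is only a sketch for an Apache Hadoop v0.20.X installation, using the conventional localhost ports; consult your Hadoop documentation (or Appendix A of Hadoop: The Definitive Guide) for the complete procedure, including formatting HDFS and starting the daemons:

<!-- conf/core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>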

We’ll assume that your administrator has already configured Hadoop, including your distributed filesystem (e.g., HDFS, or see Appendix A of Hadoop: The Definitive Guide by Tom White). Here, we’ll focus on the unique configuration steps required by Hive.

One Hive property you might want to configure is the top-level directory for table storage, which is specified by the property hive.metastore.warehouse.dir, which we also discussed in Local Mode Configuration.

The default value for this property is /user/hive/warehouse in the Apache Hadoop and MapR distributions, which will be interpreted as a distributed filesystem path when Hadoop is configured for distributed or pseudodistributed mode. For Amazon Elastic MapReduce (EMR), the default value is /mnt/hive_0M_N/warehouse when using Hive v0.M.N (e.g., /mnt/hive_08_1/warehouse).

Specifying a different value here allows each user to define their own warehouse directory, so they don’t affect other system users. Hence, each user might use the following statement to define their own warehouse directory:

set hive.metastore.warehouse.dir=/user/myname/hive/warehouse;

It’s tedious to type this each time you start the Hive CLI or to remember to add it to every Hive script. Of course, it’s also easy to forget to define this property. Instead, it’s best to put commands like this in the $HOME/.hiverc file, which will be processed when Hive starts. See The .hiverc File for more details.

We’ll assume the value is /user/hive/warehouse from here on.

Metastore Using JDBC

Hive requires only one extra component that Hadoop does not already have: the metastore component. The metastore stores metadata such as table schema and partition information that you specify when you run commands such as create table x..., or alter table y..., etc. Because multiple users and systems are likely to need concurrent access to the metastore, the default embedded database is not suitable for production.

Note

If you are using a single node in pseudodistributed mode, you may not find it useful to set up a full relational database for the metastore. Rather, you may wish to continue using the default Derby store, but configure it to use a central location for its data, as described in Local Mode Configuration.

Any JDBC-compliant database can be used for the metastore. In practice, most installations of Hive use MySQL. We’ll discuss how to use MySQL. It is straightforward to adapt this information to other JDBC-compliant databases.

Warning

The information required for table schema, partition information, etc., is small, typically much smaller than the large quantity of data stored in Hive. As a result, you typically don’t need a powerful dedicated database server for the metastore. However, because it represents a single point of failure (SPOF), it is strongly recommended that you replicate and back up this database using the standard techniques you would normally use with other relational database instances. We won’t discuss those techniques here.

For our MySQL configuration, we need to know the host and port the service is running on. We will assume db1.mydomain.pvt and port 3306, which is the standard MySQL port. Finally, we will assume that hive_db is the name of our catalog. We define these properties in Example 2-2.

Example 2-2. Metastore database configuration in hive-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://db1.mydomain.pvt/hive_db?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>database_user</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>database_pass</value>
  </property>
</configuration>
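Example 2-2 assumes that the database_user account already exists on the MySQL server and has privileges on the hive_db catalog. A minimal sketch of that setup, run by a MySQL administrator (substitute real credentials for the placeholders), might look like this:

CREATE USER 'database_user'@'%' IDENTIFIED BY 'database_pass';
GRANT ALL PRIVILEGES ON hive_db.* TO 'database_user'@'%';
FLUSH PRIVILEGES;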

You may have noticed the ConnectionURL property starts with a prefix of jdbc:mysql. For Hive to be able to connect to MySQL, we need to place the JDBC driver in our classpath. Download the MySQL JDBC driver (Connector/J) from http://www.mysql.com/downloads/connector/j/. The driver can be placed in the Hive library path, $HIVE_HOME/lib. Some teams put all such support libraries in their Hadoop lib directory.
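For example, after downloading and unpacking the driver, copy the JAR into Hive’s lib directory. The version number shown here is only illustrative; use whatever version you downloaded:

$ cp mysql-connector-java-5.1.22-bin.jar $HIVE_HOME/lib/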

With the driver and the configuration settings in place, Hive will store its metastore information in MySQL.
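One way to verify the setup is to create a table through Hive and then confirm that its metadata landed in MySQL. The metastore schema records table names in a table named TBLS; the Hive table metastore_check below is just an example:

$ hive -e "CREATE TABLE metastore_check (i INT);"
$ mysql -h db1.mydomain.pvt -u database_user -p hive_db -e "SELECT TBL_NAME FROM TBLS;"

You should see metastore_check listed in the output.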

The Hive Command

The $HIVE_HOME/bin/hive shell command, which we’ll simply refer to as hive from now on, is the gateway to Hive services, including the command-line interface or CLI.

We’ll also assume that you have added $HIVE_HOME/bin to your environment’s PATH so you can type hive at the shell prompt and your shell environment (e.g., bash) will find the command.

Command Options

If you run the following command, you’ll see a brief list of the options for the hive command. Here is the output for Hive v0.8.X and v0.9.X:

$ bin/hive --help
Usage ./hive <parameters> --service serviceName <service parameters>
Service List: cli help hiveserver hwi jar lineage metastore rcfilecat
Parameters parsed:
  --auxpath : Auxiliary jars
  --config : Hive configuration directory
  --service : Starts specific service/component. cli is default
Parameters used:
  HADOOP_HOME : Hadoop install directory
  HIVE_OPT : Hive options
For help on a particular service:
  ./hive --service serviceName --help
Debug help:  ./hive --debug --help

Note the Service List. There are several services available, including the CLI that we will spend most of our time using. You can invoke a service using the --service name option, although there are shorthand invocations for some of the services, as well. Table 2-2 describes the most useful services.

Table 2-2. Hive services

Option: cli
Name: Command-line interface
Description: Used to define tables, run queries, etc. It is the default service if no other service is specified. See The Command-Line Interface.

Option: hiveserver
Name: Hive Server
Description: A daemon that listens for Thrift connections from other processes. See Chapter 16 for more details.

Option: hwi
Name: Hive Web Interface
Description: A simple web interface for running queries and other commands without logging into a cluster machine and using the CLI.

Option: jar
Description: An extension of the hadoop jar command for running an application that also requires the Hive environment.

Option: metastore
Description: Start an external Hive metastore service to support multiple clients (see also Metastore Using JDBC).

Option: rcfilecat
Description: A tool for printing the contents of an RCFile (see RCFile).
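For example, to run the Thrift server or the web interface instead of the CLI, pass the corresponding service name; the trailing & simply runs the service in the background of your shell:

$ hive --service hiveserver &
$ hive --service hwi &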

The --auxpath option lets you specify a colon-separated list of “auxiliary” Java archive (JAR) files that contain custom extensions, etc., that you might require.

The --config directory is mostly useful if you have to override the default configuration properties in $HIVE_HOME/conf in a new directory.

The Command-Line Interface

The command-line interface or CLI is the most common way to interact with Hive. Using the CLI, you can create tables, inspect schema and query tables, etc.

CLI Options

The following command shows a brief list of the options for the CLI. Here we show the output for Hive v0.8.X and v0.9.X:

$ hive --help --service cli
usage: hive
 -d,--define <key=value>          Variable substitution to apply to hive
                                  commands. e.g. -d A=B or --define A=B
 -e <quoted-query-string>         SQL from command line
 -f <filename>                    SQL from files
 -H,--help                        Print help information
 -h <hostname>                    connecting to Hive Server on remote host
    --hiveconf <property=value>   Use value for given property
    --hivevar <key=value>         Variable substitution to apply to hive
                                  commands. e.g. --hivevar A=B
 -i <filename>                    Initialization SQL file
 -p <port>                        connecting to Hive Server on port number
 -S,--silent                      Silent mode in interactive shell
 -v,--verbose                     Verbose mode (echo executed SQL to the
                                  console)

A shorter version of this command is hive -h. However, it’s technically an unsupported option; it produces the same help output, along with an extra line complaining about Missing argument for option: h.

For Hive v0.7.X, the -d, --hivevar, and -p options are not supported.

Let’s explore these options in more detail.

Variables and Properties

The --define key=value option is effectively equivalent to the --hivevar key=value option. Both let you define on the command line custom variables that you can reference in Hive scripts to customize execution. This feature is only supported in Hive v0.8.0 and later versions.

When you use this feature, Hive puts the key-value pair in the hivevar “namespace” to distinguish these definitions from three other built-in namespaces, hiveconf, system, and env.

Note

The terms variable or property are used in different contexts, but they function the same way in most cases.

The namespace options are described in Table 2-3.

Table 2-3. Hive namespaces for variables and properties

Namespace: hivevar
Access: Read/Write
Description: (v0.8.0 and later) User-defined custom variables.

Namespace: hiveconf
Access: Read/Write
Description: Hive-specific configuration properties.

Namespace: system
Access: Read/Write
Description: Configuration properties defined by Java.

Namespace: env
Access: Read only
Description: Environment variables defined by the shell environment (e.g., bash).

Hive’s variables are internally stored as Java Strings. You can reference variables in queries; Hive replaces the reference with the variable’s value before sending the query to the query processor.

Inside the CLI, variables are displayed and changed using the SET command. For example, the following session shows the value for one variable, in the env namespace, and then all variable definitions! Here is a Hive session where some output has been omitted and we have added a blank line after the output of each command for clarity:

$ hive
hive> set env:HOME;
env:HOME=/home/thisuser

hive> set;
... lots of output including these variables:
hive.stats.retries.wait=3000
env:TERM=xterm
system:user.timezone=America/New_York
...

hive> set -v;
... even more output!...

Without the -v flag, set prints all the variables in the namespaces hivevar, hiveconf, system, and env. With the -v option, it also prints all the properties defined by Hadoop, such as properties controlling HDFS and MapReduce.

The set command is also used to set new values for variables. Let’s look specifically at the hivevar namespace and a variable that is defined for it on the command line:

$ hive --define foo=bar
hive> set foo;
foo=bar

hive> set hivevar:foo;
hivevar:foo=bar

hive> set hivevar:foo=bar2;

hive> set foo;
foo=bar2

hive> set hivevar:foo;
hivevar:foo=bar2

As we can see, the hivevar: prefix is optional. The --hivevar flag is the same as the --define flag.
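For example, starting the CLI with --hivevar instead of --define behaves identically:

$ hive --hivevar foo=bar
hive> set foo;
foo=bar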

Variable references in queries are replaced in the CLI before the query is sent to the query processor. Consider the following hive CLI session (v0.8.X only):

hive> create table toss1(i int, ${hivevar:foo} string);

hive> describe toss1;
i       int
bar2    string

hive> create table toss2(i2 int, ${foo} string);

hive> describe toss2;
i2      int
bar2    string

hive> drop table toss1;
hive> drop table toss2;

Let’s look at the --hiveconf option, which is supported in Hive v0.7.X. It is used for all properties that configure Hive behavior. We’ll use it with a property hive.cli.print.current.db that was added in Hive v0.8.0. It turns on printing of the current working database name in the CLI prompt. (See Databases in Hive for more on Hive databases.) The default database is named default. This property is false by default:

$ hive --hiveconf hive.cli.print.current.db=true
hive (default)> set hive.cli.print.current.db;
hive.cli.print.current.db=true

hive (default)> set hiveconf:hive.cli.print.current.db;
hiveconf:hive.cli.print.current.db=true

hive (default)> set hiveconf:hive.cli.print.current.db=false;

hive> set hiveconf:hive.cli.print.current.db=true;

hive (default)> ...

We can even add new hiveconf entries, which is the only supported option for Hive versions earlier than v0.8.0:

$ hive --hiveconf y=5
hive> set y;
y=5

hive> CREATE TABLE whatsit(i int);

hive> ... load data into whatsit ...

hive> SELECT * FROM whatsit WHERE i = ${hiveconf:y};
...

It’s also useful to know about the system namespace, which provides read-write access to Java system properties, and the env namespace, which provides read-only access to environment variables:

hive> set system:user.name;
system:user.name=myusername

hive> set system:user.name=yourusername;

hive> set system:user.name;
system:user.name=yourusername

hive> set env:HOME;
env:HOME=/home/yourusername

hive> set env:HOME=/home/yourusername;
env:* variables can not be set.

Unlike hivevar variables, you have to use the system: or env: prefix with system properties and environment variables.

The env namespace is useful as an alternative way to pass variable definitions to Hive, especially for Hive v0.7.X. Consider the following example:

$ YEAR=2012 hive -e 'SELECT * FROM mytable WHERE year = ${env:YEAR}';

The query processor will see the literal number 2012 in the WHERE clause.

Warning

If you are using Hive v0.7.X, some of the examples in this book that use parameters and variables may not work as written. If so, replace the variable reference with the corresponding value.

Note

All of Hive’s built-in properties are listed in $HIVE_HOME/conf/hive-default.xml.template, the “sample” configuration file. It also shows the default values for each property.

Hive “One Shot” Commands

The user may wish to run one or more queries (semicolon separated) and then have the hive CLI exit immediately after completion. The CLI accepts a -e command argument that enables this feature. If mytable has a string and integer column, we might see the following output:

$ hive -e "SELECT * FROM mytable LIMIT 3";
OK
name1 10
name2 20
name3 30
Time taken: 4.955 seconds
$

A quick and dirty technique is to use this feature to output the query results to a file. Adding the -S for silent mode removes the OK and Time taken ... lines, as well as other inessential output, as in this example:

$ hive -S -e "select * FROM mytable LIMIT 3" > /tmp/myquery
$ cat /tmp/myquery
name1 10
name2 20
name3 30

Note that hive wrote the output to the standard output and the shell command redirected that output to the local filesystem, not to HDFS.

Finally, here is a useful trick for finding a property name that you can’t quite remember, without having to scroll through the list of the set output. Suppose you can’t remember the name of the property that specifies the “warehouse” location for managed tables:

$ hive -S -e "set" | grep warehouse
hive.metastore.warehouse.dir=/user/hive/warehouse
hive.warehouse.subdir.inherit.perms=false

It’s the first one.

Executing Hive Queries from Files

Hive can execute one or more queries that were saved to a file using the -f file argument. By convention, saved Hive query files use the .q or .hql extension.

$ hive -f /path/to/file/withqueries.hql

If you are already inside the Hive shell you can use the SOURCE command to execute a script file. Here is an example:

$ cat /path/to/file/withqueries.hql
SELECT x.* FROM src x;
$ hive
hive> source /path/to/file/withqueries.hql;
...

By the way, we’ll occasionally use the name src (“source”) for tables in queries when the name of the table is irrelevant for the example. This convention is taken from the unit tests in Hive’s source code, which create a src table before running the tests.

For example, when experimenting with a built-in function, it’s convenient to write a “query” that passes literal arguments to the function, as in the following example taken from later in the book, XPath-Related Functions:

hive> SELECT xpath('<a><b id="foo">b1</b><b id="bar">b2</b></a>','//@id')
    > FROM src LIMIT 1;
[foo","bar]

The details for xpath don’t concern us here, but note that we pass string literals to the xpath function and use FROM src LIMIT 1 to specify the required FROM clause and to limit the output. Substitute src with the name of a table you have already created or create a dummy table named src:

CREATE TABLE src(s STRING);

Also the source table must have at least one row of content in it:

$ echo "one row" > /tmp/myfile
$ hive -e "LOAD DATA LOCAL INPATH '/tmp/myfile' INTO TABLE src;

The .hiverc File

The last CLI option we’ll discuss is the -i file option, which lets you specify a file of commands for the CLI to run as it starts, before showing you the prompt. Hive automatically looks for a file named .hiverc in your HOME directory and runs the commands it contains, if any.

These files are convenient for commands that you run frequently, such as setting system properties (see Variables and Properties) or adding Java archives (JAR files) of custom Hive extensions to Hadoop’s distributed cache (as discussed in Chapter 15).

The following shows an example of a typical $HOME/.hiverc file:

ADD JAR /path/to/custom_hive_extensions.jar;
set hive.cli.print.current.db=true;
set hive.exec.mode.local.auto=true;

The first line adds a JAR file to the Hadoop distributed cache. The second line modifies the CLI prompt to show the current working Hive database, as we described earlier in Variables and Properties. The last line “encourages” Hive to be more aggressive about using local-mode execution when possible, even when Hadoop is running in distributed or pseudo-distributed mode, which speeds up queries for small data sets.

Warning

An easy mistake to make is to forget the semicolon at the end of lines like this. When you make this mistake, the definition of the property will include all the text from all the subsequent lines in the file until the next semicolon.

More on Using the Hive CLI

The CLI supports a number of other useful features.

Autocomplete

If you start typing and hit the Tab key, the CLI will autocomplete possible keywords and function names. For example, if you type SELE and then the Tab key, the CLI will complete the word SELECT.

If you type the Tab key at the prompt, you’ll get this reply:

hive>
Display all 407 possibilities? (y or n)

If you enter y, you’ll get a long list of all the keywords and built-in functions.

Warning

A common source of error and confusion when pasting statements into the CLI occurs where some lines begin with a tab. You’ll get the prompt about displaying all possibilities, and subsequent characters in the stream will get misinterpreted as answers to the prompt, causing the command to fail.

Command History

You can use the up and down arrow keys to scroll through previous commands. Actually, each previous line of input is shown separately; the CLI does not combine multiline commands and queries into a single history entry. Hive saves the last 100,000 lines into a file $HOME/.hivehistory.

If you want to repeat a previous command, scroll to it and hit Enter. If you want to edit the line before entering it, use the left and right arrow keys to navigate to the point where changes are required and edit the line. You can hit Return to submit it without returning to the end of the line.

Note

Most navigation keystrokes using the Control key work as they do for the bash shell (e.g., Control-A goes to the beginning of the line and Control-E goes to the end of the line). However, similar “meta,” Option, or Escape keys don’t work (e.g., Option-F to move forward a word at a time). Similarly, the Delete key will delete the character to the left of the cursor, but the Forward Delete key doesn’t delete the character under the cursor.

Shell Execution

You don’t need to leave the hive CLI to run simple bash shell commands. Simply type ! followed by the command and terminate the line with a semicolon (;):

hive> ! /bin/echo "what up dog";
"what up dog"
hive> ! pwd;
/home/me/hiveplay

Don’t invoke interactive commands that require user input. Shell “pipes” don’t work and neither do file “globs.” For example, ! ls *.hql; will look for a file named *.hql;, rather than all files that end with the .hql extension.

Hadoop dfs Commands from Inside Hive

You can run the hadoop dfs ... commands from within the hive CLI; just drop the hadoop word from the command and add the semicolon at the end:

hive> dfs -ls / ;
Found 3 items
drwxr-xr-x   - root   supergroup          0 2011-08-17 16:27 /etl
drwxr-xr-x   - edward supergroup          0 2012-01-18 15:51 /flag
drwxrwxr-x   - hadoop supergroup          0 2010-02-03 17:50 /users

This method of accessing hadoop commands is actually more efficient than using the hadoop dfs ... equivalent at the bash shell, because the latter starts up a new JVM instance each time, whereas Hive just runs the same code in its current process.

You can see a full listing of help on the options supported by dfs using this command:

hive> dfs -help;

See also http://hadoop.apache.org/common/docs/r0.20.205.0/file_system_shell.html or similar documentation for your Hadoop distribution.

Comments in Hive Scripts

As of Hive v0.8.0, you can embed lines of comments that start with the string --, for example:

-- Copyright (c) 2012 Megacorp, LLC.
-- This is the best Hive script evar!!

SELECT * FROM massive_table;
...

Warning

The CLI does not parse these comment lines. If you paste them into the CLI, you’ll get errors. They only work when used in scripts executed with hive -f script_name.

Query Column Headers

As a final example that pulls together a few things we’ve learned, let’s tell the CLI to print column headers, which is disabled by default. We can enable this feature by setting the hiveconf property hive.cli.print.header to true:

hive> set hive.cli.print.header=true;

hive> SELECT * FROM system_logs LIMIT 3;
tstamp severity server message
1335667117.337715 ERROR server1 Hard drive hd1 is 90% full!
1335667117.338012 WARN  server1 Slow response from server2.
1335667117.339234 WARN  server2 Uh, Dude, I'm kinda busy right now...

If you always prefer seeing the headers, put the first line in your $HOME/.hiverc file.



[13] However, some vendors are starting to support Hadoop on other systems. Hadoop has been used in production on various Unix systems and it works fine on Mac OS X for development use.

[14] These are the current URLs at the time of this writing.

[15] At least that’s the current situation on Dean’s Mac. This discrepancy may actually reflect the fact that stewardship of the Mac OS X Java port is transitioning from Apple to Oracle as of Java 1.7.

[16] Unfortunately, the dfs -ls command only provides a “long listing” format. There is no short format, like the default for the Linux ls command.
