Chapter 12. Developing

Hive won’t provide everything you could possibly need. Sometimes a third-party library will fill a gap. At other times, you or someone else who is a Java developer will need to write user-defined functions (UDFs; see Chapter 13), SerDes (see Record Formats: SerDes), input and/or output formats (see Chapter 15), or other enhancements.

This chapter explores working with the Hive source code itself, including the new Plugin Developer Kit introduced in Hive v0.8.0.

Changing Log4J Properties

Hive can be configured with two separate Log4J configuration files found in $HIVE_HOME/conf. The hive-log4j.properties file controls the logging of the CLI or other locally launched components. The hive-exec-log4j.properties file controls the logging inside the MapReduce tasks. These files do not need to be present inside the Hive installation because the default properties come built inside the Hive JARs. In fact, the actual files in the conf directory have the .template extension, so they are ignored by default. To use either of them, copy it with a name that removes the .template extension and edit it to taste:

$ cp conf/hive-log4j.properties.template conf/hive-log4j.properties
$ ... edit file ...
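For example, to make the CLI quieter overall while tracing one package in detail, you could edit the copied file along these lines. This is a minimal sketch: the hive.root.logger line follows the stock template, and the package-level override uses standard Log4J conventions, so verify both against your copy:

# Only warnings and above by default
hive.root.logger=WARN,DRFA
# But trace query execution in detail
log4j.logger.org.apache.hadoop.hive.ql.exec=DEBUG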

It is also possible to change the logging configuration of Hive temporarily without copying and editing the Log4J files. The hiveconf switch can be specified on start-up with definitions of any properties from the log4j.properties files. For example, here we set the default logger to the DEBUG level and send output to the console appender:

$ bin/hive -hiveconf hive.root.logger=DEBUG,console
12/03/27 08:46:01 WARN conf.HiveConf: hive-site.xml not found on CLASSPATH
12/03/27 08:46:01 DEBUG conf.Configuration: java.io.IOException: config()

Connecting a Java Debugger to Hive

When enabling more verbose output does not help find the solution to the problem you are troubleshooting, attaching a Java debugger will give you the ability to step through the Hive code and hopefully find the problem.

Remote debugging is a feature of Java that is manually enabled by setting specific command-line properties for the JVM. The Hive shell script provides a switch and a help screen that make it easy to set these properties (some output truncated for space):

$ bin/hive --help --debug
Allows to debug Hive by connecting to it via JDI API
Usage: hive --debug[:comma-separated parameters list]

Parameters:

recursive=<y|n>     Should child JVMs also be started in debug mode. Default: y
port=<port_number>  Port on which main JVM listens for debug connection. Defaul...
mainSuspend=<y|n>   Should main JVM wait with execution for the debugger to con...
childSuspend=<y|n>  Should child JVMs wait with execution for the debugger to c...
swapSuspend         Swaps suspend options between main and child JVMs
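For example, the following hypothetical session starts the CLI listening on port 8000 with the main JVM suspended, then attaches the command-line jdb debugger from a second terminal. Any JDI-capable debugger, such as Eclipse, can attach the same way:

$ bin/hive --debug:port=8000,mainSuspend=y,childSuspend=n
$ jdb -attach 8000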

Building Hive from Source

Running Apache releases is usually a good idea; however, you may wish to use features that are not yet part of a release, or you may have an internal branch with nonpublic customizations.

Hence, you’ll need to build Hive from source. The minimum requirements for building Hive are a recent Java JDK, Subversion, and Apache Ant. Hive also contains components, such as Thrift-generated classes, that are not built by default; rebuilding those components requires a Thrift compiler as well.

The following commands check out the Hive trunk and build it, producing output in the hive-trunk/build/dist directory:

$ svn co http://svn.apache.org/repos/asf/hive/trunk hive-trunk
$ cd hive-trunk
$ ant package
$ ls build/dist/
bin   examples  LICENSE  README.txt         scripts
conf  lib       NOTICE   RELEASE_NOTES.txt
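If you want a release branch rather than trunk, check out that branch instead. We are assuming here that branches follow the repository's branch-<version> naming, so verify the exact path against the repository first:

$ svn co http://svn.apache.org/repos/asf/hive/branches/branch-0.9 hive-0.9
$ cd hive-0.9
$ ant package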

Running Hive Test Cases

Hive has a unique built-in infrastructure for testing. Hive does have traditional JUnit tests; however, the majority of the testing happens by running queries saved in .q files and comparing the results with the expected output saved from a previous run in the Hive source.[20] The query files live in multiple directories inside the Hive source tree. “Positive” tests are those that should pass, while “negative” tests should fail.

An example of a positive test is a well-formed query. An example of a negative test is a query that is malformed or tries doing something that is not allowed by HiveQL:

$ ls -lah ql/src/test/queries/
total 76K
drwxrwxr-x. 7 edward edward 4.0K May 28  2011 .
drwxrwxr-x. 8 edward edward 4.0K May 28  2011 ..
drwxrwxr-x. 3 edward edward  20K Feb 21 20:08 clientnegative
drwxrwxr-x. 3 edward edward  36K Mar  8 09:17 clientpositive
drwxrwxr-x. 3 edward edward 4.0K May 28  2011 negative
drwxrwxr-x. 3 edward edward 4.0K Mar 12 09:25 positive

Take a look at ql/src/test/queries/clientpositive/cast1.q. The first thing you should know is that a table named src is automatically created early in the test process. It is a table with two columns, key and value, where key is an INT and value is a STRING. Because Hive does not currently have the ability to do a SELECT without a FROM clause, selecting a single row from the src table is the trick used to test functions that don’t really need to retrieve table data; inputs can be “hard-coded” instead.

As you can see in the following example queries, the src table is never referenced in the SELECT clauses:

hive> CREATE TABLE dest1(c1 INT, c2 DOUBLE, c3 DOUBLE,
    > c4 DOUBLE, c5 INT, c6 STRING, c7 INT) STORED AS TEXTFILE;

hive> EXPLAIN
    > FROM src INSERT OVERWRITE TABLE dest1
    > SELECT 3 + 2, 3.0 + 2, 3 + 2.0, 3.0 + 2.0,
    > 3 + CAST(2.0 AS INT) + CAST(CAST(0 AS SMALLINT) AS INT),
    > CAST(1 AS BOOLEAN), CAST(TRUE AS INT) WHERE src.key = 86;

hive> FROM src INSERT OVERWRITE TABLE dest1
    > SELECT 3 + 2, 3.0 + 2, 3 + 2.0, 3.0 + 2.0,
    > 3 + CAST(2.0 AS INT) + CAST(CAST(0 AS SMALLINT) AS INT),
    > CAST(1 AS BOOLEAN), CAST(TRUE AS INT) WHERE src.key = 86;

hive> SELECT dest1.* FROM dest1;

The results of the script are found here: ql/src/test/results/clientpositive/cast1.q.out. The result file is large and printing the complete results inline will kill too many trees. However, portions of the file are worth noting.

These commands invoke a positive and a negative test case for the Hive client:

ant test -Dtestcase=TestCliDriver -Dqfile=mapreduce1.q
ant test -Dtestcase=TestNegativeCliDriver -Dqfile=script_broken_pipe1.q

The tests in the positive and negative directories, in contrast, only parse queries; they do not actually run the client. They are now deprecated in favor of clientpositive and clientnegative.

You can also run multiple tests in one ant invocation to save time (the last -Dqfile=… string was wrapped for space; it’s all one string):

ant test -Dtestcase=TestCliDriver -Dqfile=avro_change_schema.q,avro_joins.q,
avro_schema_error_message.q,avro_evolved_schemas.q,avro_sanity_test.q,
avro_schema_literal.q
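If you have intentionally changed Hive's behavior, the saved .q.out files need to be regenerated to match the new output. The test harness supports an overwrite flag for this purpose; the invocation below reflects our understanding and should be checked against the testing documentation for your version:

ant test -Dtestcase=TestCliDriver -Dqfile=cast1.q -Doverwrite=true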

Execution Hooks

PreHooks and PostHooks are utilities that allow user code to hook into parts of Hive and execute custom code. Hive’s testing framework uses hooks to echo commands that produce no output, so that the results show up inside tests:

PREHOOK: query: CREATE TABLE dest1(c1 INT, c2 DOUBLE, c3 DOUBLE,
c4 DOUBLE, c5 INT, c6 STRING, c7 INT) STORED AS TEXTFILE
PREHOOK: type: CREATETABLE
POSTHOOK: query: CREATE TABLE dest1(c1 INT, c2 DOUBLE, c3 DOUBLE,
c4 DOUBLE, c5 INT, c6 STRING, c7 INT) STORED AS TEXTFILE
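Outside the test framework, you can register your own hooks through properties such as hive.exec.pre.hooks. Here is a minimal sketch of a pre-execution hook. The class name com.example.QueryLogHook is hypothetical, and the accessor names reflect our reading of the Hive source for this era, so treat the code as illustrative rather than definitive:

package com.example;

import org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext;
import org.apache.hadoop.hive.ql.hooks.HookContext;

/* A pre-execution hook that prints each query string before it runs.
   Enable it with: set hive.exec.pre.hooks=com.example.QueryLogHook; */
public class QueryLogHook implements ExecuteWithHookContext {
  public void run(HookContext hookContext) throws Exception {
    // getQueryStr() is our reading of QueryPlan's accessor; verify it
    // against the Hive version you build against
    System.err.println("About to execute: "
        + hookContext.getQueryPlan().getQueryStr());
  }
}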

Setting Up Hive and Eclipse

Eclipse is an open source IDE (Integrated Development Environment). The following steps allow you to use Eclipse to work with the Hive source code:

$ ant clean package eclipse-files
$ cd metastore
$ ant model-jar
$ cd ../ql
$ ant gen-test

Once built, you can import the project into Eclipse and use it as you normally would.

Create a workspace in Eclipse, as normal. Then invoke File → Import and select General → Existing Projects into Workspace. Select the directory containing the Hive source.

When the list of available projects is shown in the wizard, you’ll see one named hive-trunk, which you should select and click Finish.

Figure 12-1 shows how to start the Hive Command CLI Driver from within Eclipse.

Figure 12-1. Starting the Hive Command CLI Driver from within Eclipse

Hive in a Maven Project

You can set up Hive as a dependency in Maven builds. The Maven repository http://mvnrepository.com/artifact/org.apache.hive/hive-service contains the most recent releases. This page also lists the dependencies hive-service requires.

Here is the top-level dependency definition for Hive v0.9.0, not including the tree of transitive dependencies, which is quite deep:

<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-service</artifactId>
  <version>0.9.0</version>
</dependency>
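If you only need the execution-side classes, for example to compile a UDF, the smaller hive-exec artifact may suffice; we are assuming it is published under the same version, so confirm on the repository page:

<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>0.9.0</version>
</dependency>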

The pom.xml file for hive_test, which we discuss next, provides a complete example of the transitive dependencies for Hive v0.9.0. You can find that file at https://github.com/edwardcapriolo/hive_test/blob/master/pom.xml.

Unit Testing in Hive with hive_test

The optimal way to write applications to work with Hive is to access Hive with Thrift through the HiveService. However, the Thrift service was traditionally difficult to bring up in an embedded environment due to Hive’s many JAR dependencies and the metastore component.

Hive_test fetches all the Hive dependencies from Maven, sets up the metastore and Thrift service locally, and provides test classes that make unit testing easier. It is also very lightweight, and its unit tests run quickly, in contrast to the elaborate test targets inside Hive, which must rebuild the entire project to execute any unit test.

Hive_test is ideal for testing code such as UDFs, input formats, SerDes, or any component that only adds a pluggable feature for the language. It is not useful for internal Hive development because all the Hive components are pulled from Maven and are external to the project.

In your Maven project, create a pom.xml and include hive_test as a dependency, as shown here:

<dependency>
  <groupId>com.jointhegrid</groupId>
  <artifactId>hive_test</artifactId>
  <version>3.0.1-SNAPSHOT</version>
</dependency>

Then create a version of hive-site.xml:

$ cp $HIVE_HOME/conf/* src/test/resources/
$ vi src/test/resources/hive-site.xml

Unlike a normal hive-site.xml, this version should not save any data to a permanent place. This is because unit tests are not supposed to create or preserve any permanent state. javax.jdo.option.ConnectionURL is set to use a feature in Derby that only stores the database in main memory. The warehouse directory hive.metastore.warehouse.dir is set to a location inside /tmp that will be deleted on each run of the unit test:

<configuration>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:memory:metastore_db;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>

  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/tmp/warehouse</value>
    <description>location of default database for the warehouse</description>
  </property>

</configuration>

Hive_test provides several classes that extend JUnit test cases. HiveTestService sets up the environment, clears out the warehouse directory, and launches a metastore and HiveService in-process. This is typically the component to extend for testing. However, other components, such as HiveTestEmbedded, are also available:

package com.jointhegrid.hive_test;

import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;

/* Extending HiveTestService creates and initializes
the metastore and thrift service in an embedded mode */
public class ServiceHiveTest extends HiveTestService {

  public ServiceHiveTest() throws IOException {
    super();
  }

  public void testExecute() throws Exception {

    /* Use the Hadoop filesystem API to create a
    data file */
    Path p = new Path(this.ROOT_DIR, "afile");
    FSDataOutputStream o = this.getFileSystem().create(p);
    BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(o));
    bw.write("1\n");
    bw.write("2\n");
    bw.close();

    /* ServiceHive is a component that connects
    to an embedded or network HiveService based
    on the constructor used */
    ServiceHive sh = new ServiceHive();

    /* We can now interact through the HiveService
    and assert on results */
    sh.client.execute("create table atest (num int)");
    sh.client.execute("load data local inpath '"
      + p.toString() + "' into table atest");
    sh.client.execute("select count(1) as cnt from atest");
    String row = sh.client.fetchOne();
    assertEquals("2", row);
    sh.client.execute("drop table atest");

  }
}
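With the dependency and hive-site.xml in place, the class runs like any other JUnit test under Maven's Surefire plug-in. For example, to run just this test:

$ mvn test -Dtest=ServiceHiveTest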

The New Plugin Developer Kit

Hive v0.8.0 introduced a Plugin Developer Kit (PDK). Its intent is to allow developers to build and test plug-ins without the Hive source. Only Hive binary code is required.

The PDK is relatively new and has some subtle bugs of its own that can make it difficult to use. If you want to try using the PDK anyway, consult the wiki page, https://cwiki.apache.org/Hive/plugindeveloperkit.html, but note that this page has a few errors, at least at the time of this writing.



[20] That is, they are more like feature or acceptance tests.
