Hive won’t provide everything you could possibly need. Sometimes a third-party library will fill a gap. At other times, you or someone else who is a Java developer will need to write user-defined functions (UDFs; see Chapter 13), SerDes (see Record Formats: SerDes), input and/or output formats (see Chapter 15), or other enhancements.
This chapter explores working with the Hive source code itself, including the new Plugin Developer Kit introduced in Hive v0.8.0.
Hive can be configured with two separate Log4J configuration files, found in $HIVE_HOME/conf. The hive-log4j.properties file controls logging for the CLI and other locally launched components, while hive-exec-log4j.properties controls logging inside the MapReduce tasks. These files do not need to be present in the Hive installation, because default properties are built into the Hive JARs. In fact, the files that ship in the conf directory carry a .template extension, so they are ignored by default. To use either of them, copy it to a name without the .template extension and edit it to taste:
$ cp conf/hive-log4j.properties.template conf/hive-log4j.properties
$ ... edit file ...
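For example, to make the CLI log more verbosely, you can raise the root logger level near the top of the copied file. The property names shown here match the template shipped with Hive at the time of this writing:

# hive-log4j.properties (excerpt): raise CLI logging from WARN to DEBUG
hive.root.logger=DEBUG,DRFA
hive.log.dir=/tmp/${user.name}
hive.log.file=hive.log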
It is also possible to change the logging configuration of Hive temporarily, without copying and editing the Log4J files. The -hiveconf switch can be specified on start-up with definitions of any properties in the Log4J properties files. For example, here we set the default logger to the DEBUG level and send output to the console appender:
$ bin/hive -hiveconf hive.root.logger=DEBUG,console
12/03/27 08:46:01 WARN conf.HiveConf: hive-site.xml not found on CLASSPATH
12/03/27 08:46:01 DEBUG conf.Configuration: java.io.IOException: config()
When enabling more verbose output does not help you find the solution to the problem you are troubleshooting, attaching a Java debugger will let you step through the Hive code and, hopefully, find the problem.
Remote debugging is a feature of Java that is manually enabled by setting specific command-line properties for the JVM. The Hive shell script provides a switch and help screen that make it easy to set these properties (some output truncated for space):
$ bin/hive --help --debug
Allows to debug Hive by connecting to it via JDI API
Usage: hive --debug[:comma-separated parameters list]

Parameters:

recursive=<y|n>      Should child JVMs also be started in debug mode. Default: y
port=<port_number>   Port on which main JVM listens for debug connection. Defaul...
mainSuspend=<y|n>    Should main JVM wait with execution for the debugger to con...
childSuspend=<y|n>   Should child JVMs wait with execution for the debugger to c...
swapSuspend          Swaps suspend options between main and child JVMs
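Putting these parameters together, the following invocation starts the CLI with the main JVM suspended until a debugger attaches (the port number here is just an illustration; the parameter syntax follows the help screen above):

$ bin/hive --debug:port=8000,mainSuspend=y,childSuspend=n

You can then attach from an IDE's remote-debugging configuration, or with jdb -attach 8000.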
Running official Apache releases is usually a good idea; however, you may wish to use features that are not yet part of a release, or you may have an internal branch with nonpublic customizations. In those cases, you'll need to build Hive from source. The minimum requirements for building Hive are a recent Java JDK, Subversion, and Ant. Hive also contains components, such as Thrift-generated classes, that are not built by default. Rebuilding Hive requires a Thrift compiler, too.
The following commands check out Hive trunk and build it, producing output in the hive-trunk/build/dist directory:
$ svn co http://svn.apache.org/repos/asf/hive/trunk hive-trunk
$ cd hive-trunk
$ ant package
$ ls build/dist/
bin   examples  LICENSE  README.txt         scripts
conf  lib       NOTICE   RELEASE_NOTES.txt
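If you modify any of the Thrift IDL files, the generated classes must be rebuilt before packaging. Assuming a Thrift compiler installed under /usr/local, the trunk build of this era exposed a target along these lines (check build.xml for the exact target name in your checkout):

$ ant thriftif -Dthrift.home=/usr/local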
Hive has a unique built-in infrastructure for testing. Hive does have traditional JUnit tests, but the majority of the testing happens by running queries saved in .q files and then comparing the results with those of a previous run saved in the Hive source.[20] There are multiple test directories inside the Hive source folder. “Positive” tests are those that should pass, while “negative” tests should fail.
An example of a positive test is a well-formed query. An example of a negative test is a query that is malformed or tries doing something that is not allowed by HiveQL:
$ ls -lah ql/src/test/queries/
total 76K
drwxrwxr-x. 7 edward edward 4.0K May 28 2011 .
drwxrwxr-x. 8 edward edward 4.0K May 28 2011 ..
drwxrwxr-x. 3 edward edward 20K Feb 21 20:08 clientnegative
drwxrwxr-x. 3 edward edward 36K Mar 8 09:17 clientpositive
drwxrwxr-x. 3 edward edward 4.0K May 28 2011 negative
drwxrwxr-x. 3 edward edward 4.0K Mar 12 09:25 positive
Take a look at ql/src/test/queries/clientpositive/cast1.q. The first thing you should know is that a src table is the first table automatically created in the test process. It is a table with two columns, key and value, where key is an INT and value is a STRING. Because Hive does not currently have the ability to do a SELECT without a FROM clause, selecting a single row from the src table is the trick used to test functions that don't really need to retrieve table data; inputs can be “hard-coded” instead. As you can see in the following example queries, the src table is never referenced in the SELECT clauses:
hive> CREATE TABLE dest1(c1 INT, c2 DOUBLE, c3 DOUBLE,
    > c4 DOUBLE, c5 INT, c6 STRING, c7 INT) STORED AS TEXTFILE;

hive> EXPLAIN
    > FROM src INSERT OVERWRITE TABLE dest1
    > SELECT 3 + 2, 3.0 + 2, 3 + 2.0, 3.0 + 2.0,
    > 3 + CAST(2.0 AS INT) + CAST(CAST(0 AS SMALLINT) AS INT),
    > CAST(1 AS BOOLEAN), CAST(TRUE AS INT) WHERE src.key = 86;

hive> FROM src INSERT OVERWRITE TABLE dest1
    > SELECT 3 + 2, 3.0 + 2, 3 + 2.0, 3.0 + 2.0,
    > 3 + CAST(2.0 AS INT) + CAST(CAST(0 AS SMALLINT) AS INT),
    > CAST(1 AS BOOLEAN), CAST(TRUE AS INT) WHERE src.key = 86;

hive> SELECT dest1.* FROM dest1;
The results of the script are found here: ql/src/test/results/clientpositive/cast1.q.out. The result file is large and printing the complete results inline will kill too many trees. However, portions of the file are worth noting.
These commands invoke a positive and a negative test case for the Hive client:
ant test -Dtestcase=TestCliDriver -Dqfile=mapreduce1.q
ant test -Dtestcase=TestNegativeCliDriver -Dqfile=script_broken_pipe1.q
The tests in the positive and negative directories, by contrast, only parse queries; they do not actually run the client. They are now deprecated in favor of clientpositive and clientnegative.
You can also run multiple tests in one ant invocation to save time (the -Dqfile=… string below was wrapped for space; it's all one string):
ant test -Dtestcase=TestCliDriver -Dqfile=avro_change_schema.q,avro_joins.q,
avro_schema_error_message.q,avro_evolved_schemas.q,avro_sanity_test.q,
avro_schema_literal.q
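If you intentionally change Hive's output, you can regenerate the expected .q.out results rather than editing them by hand. The overwrite flag shown here reflects the trunk test harness at the time of this writing:

ant test -Dtestcase=TestCliDriver -Dqfile=cast1.q -Doverwrite=true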
PreHooks and PostHooks are utilities that allow user code to hook into parts of Hive and execute custom code. Hive’s testing framework uses hooks to echo commands that produce no output, so that the results show up inside tests:
PREHOOK: query: CREATE TABLE dest1(c1 INT, c2 DOUBLE, c3 DOUBLE, c4 DOUBLE,
c5 INT, c6 STRING, c7 INT) STORED AS TEXTFILE
PREHOOK: type: CREATETABLE
POSTHOOK: query: CREATE TABLE dest1(c1 INT, c2 DOUBLE, c3 DOUBLE, c4 DOUBLE,
c5 INT, c6 STRING, c7 INT) STORED AS TEXTFILE
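Outside the test framework, you can wire in your own hooks through configuration. Here is a minimal sketch of a pre-execution hook that echoes each query; the class and package names are hypothetical, while ExecuteWithHookContext and the hive.exec.pre.hooks property are the actual extension points in this era of Hive:

package com.example.hooks;   // hypothetical package

import org.apache.hadoop.hive.ql.QueryPlan;
import org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext;
import org.apache.hadoop.hive.ql.hooks.HookContext;

/* Hive instantiates the class named in hive.exec.pre.hooks and
   calls run() before each query is executed. */
public class EchoQueryHook implements ExecuteWithHookContext {
  public void run(HookContext hookContext) throws Exception {
    QueryPlan plan = hookContext.getQueryPlan();
    // Echo the query text, much as the test framework does
    System.err.println("About to run: " + plan.getQueryString());
  }
}

Enable the hook by setting hive.exec.pre.hooks=com.example.hooks.EchoQueryHook, either in hive-site.xml or with the set command.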
Eclipse is an open source IDE (Integrated Development Environment). The following steps allow you to use Eclipse to work with the Hive source code:
$ ant clean package eclipse-files
$ cd metastore
$ ant model-jar
$ cd ../ql
$ ant gen-test
Once built, you can import the project into Eclipse and use it as you normally would.
Create a workspace in Eclipse, as normal. Then use File → Import and select General → Existing Projects into Workspace. Select the directory where Hive is installed.
When the list of available projects is shown in the wizard, you’ll see one named hive-trunk, which you should select and click Finish.
Figure 12-1 shows how to start the Hive Command CLI Driver from within Eclipse.
You can set up Hive as a dependency in Maven builds. The Maven repository http://mvnrepository.com/artifact/org.apache.hive/hive-service contains the most recent releases. This page also lists the dependencies hive-service requires.
Here is the top-level dependency definition for Hive v0.9.0, not including the tree of transitive dependencies, which is quite deep:
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-service</artifactId>
  <version>0.9.0</version>
</dependency>
The pom.xml file for hive_test, which we discuss next, provides a complete example of the transitive dependencies for Hive v0.9.0. You can find that file at https://github.com/edwardcapriolo/hive_test/blob/master/pom.xml.
The optimal way to write applications that work with Hive is to access Hive with Thrift through the HiveService. However, the Thrift service was traditionally difficult to bring up in an embedded environment due to Hive’s many JAR dependencies and the metastore component.
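To make that concrete, here is a minimal sketch of a standalone Thrift client for this era of Hive. It assumes a HiveServer is already listening on localhost:10000; the HiveClient class ships in the hive-service artifact discussed above:

import java.util.List;

import org.apache.hadoop.hive.service.HiveClient;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class SimpleHiveClient {
  public static void main(String[] args) throws Exception {
    // Connect to a HiveServer assumed to be running on localhost:10000
    TTransport transport = new TSocket("localhost", 10000);
    transport.open();
    HiveClient client = new HiveClient(new TBinaryProtocol(transport));

    // Issue a statement and print each result row
    client.execute("SHOW TABLES");
    List<String> rows = client.fetchAll();
    for (String row : rows) {
      System.out.println(row);
    }
    transport.close();
  }
}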
Hive_test fetches all the Hive dependencies from Maven, sets up the metastore and Thrift service locally, and provides test classes to make unit testing easier. Because it is very lightweight, unit tests run quickly. This is in contrast to the elaborate test targets inside Hive, which have to rebuild the entire project to execute any unit test.
Hive_test is ideal for testing code such as UDFs, input formats, SerDes, or any component that only adds a pluggable feature for the language. It is not useful for internal Hive development, because all the Hive components are pulled from Maven and are external to the project.
In your Maven project, create a pom.xml and include hive_test as a dependency, as shown here:
<dependency>
  <groupId>com.jointhegrid</groupId>
  <artifactId>hive_test</artifactId>
  <version>3.0.1-SNAPSHOT</version>
</dependency>
Then create a version of hive-site.xml:
$ cp $HIVE_HOME/conf/* src/test/resources/
$ vi src/test/resources/hive-site.xml
Unlike a normal hive-site.xml, this version should not save any data to a permanent place, because unit tests are not supposed to create or preserve any permanent state. javax.jdo.option.ConnectionURL is set to use a feature in Derby that stores the database only in main memory. The warehouse directory, hive.metastore.warehouse.dir, is set to a location inside /tmp that will be deleted on each run of the unit test:
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:memory:metastore_db;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/tmp/warehouse</value>
    <description>location of default database for the warehouse</description>
  </property>
</configuration>
Hive_test provides several classes that extend JUnit test cases. HiveTestService sets up the environment, clears out the warehouse directory, and launches a metastore and HiveService in-process. This is typically the component to extend for testing. However, other components, such as HiveTestEmbedded, are also available:
package com.jointhegrid.hive_test;

import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;

/* Extending HiveTestService creates and initializes
   the metastore and thrift service in an embedded mode */
public class ServiceHiveTest extends HiveTestService {

  public ServiceHiveTest() throws IOException {
    super();
  }

  public void testExecute() throws Exception {
    /* Use the Hadoop filesystem API to create a data file */
    Path p = new Path(this.ROOT_DIR, "afile");
    FSDataOutputStream o = this.getFileSystem().create(p);
    BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(o));
    bw.write("1\n");
    bw.write("2\n");
    bw.close();

    /* ServiceHive is a component that connects to an embedded
       or network HiveService based on the constructor used */
    ServiceHive sh = new ServiceHive();

    /* We can now interact through the HiveService and assert on results */
    sh.client.execute("create table atest (num int)");
    sh.client.execute("load data local inpath '" + p.toString()
        + "' into table atest");
    sh.client.execute("select count(1) as cnt from atest");
    String row = sh.client.fetchOne();
    assertEquals("2", row);
    sh.client.execute("drop table atest");
  }
}
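With the hive_test dependency and the hive-site.xml above in place, this class runs like any other JUnit test. For example, with Maven's Surefire plug-in you can run it by itself:

$ mvn test -Dtest=ServiceHiveTest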
Hive v0.8.0 introduced a Plugin Developer Kit (PDK). Its intent is to allow developers to build and test plug-ins without the Hive source. Only Hive binary code is required.
The PDK is relatively new and has some subtle bugs of its own that can make it difficult to use. If you want to try using the PDK anyway, consult the wiki page, https://cwiki.apache.org/Hive/plugindeveloperkit.html, but note that this page has a few errors, at least at the time of this writing.