Hive has an optional component known as HiveServer or HiveThrift that allows access to Hive over a single port. Thrift is a software framework for scalable cross-language services development. See http://thrift.apache.org/ for more details. Thrift allows clients using languages including Java, C++, Ruby, and many others, to programmatically access Hive remotely.
The CLI is the most common way to access Hive. However, the design of the CLI can make it difficult to use programmatically. The CLI is a fat client; it requires a local copy of all the Hive components and configuration as well as a copy of a Hadoop client and its configuration. Additionally, it works as an HDFS client, a MapReduce client, and a JDBC client (to access the metastore). Even with the proper client installation, having all of the correct network access can be difficult, especially across subnets or datacenters.
To get started with the HiveServer, start it in the background using the service knob for hive:

$ cd $HIVE_HOME
$ bin/hive --service hiveserver &
Starting Hive Thrift Server
A quick way to ensure the HiveServer is running is to use the netstat command to determine if port 10000 is open and listening for connections:

$ netstat -nl | grep 10000
tcp 0 0 :::10000 :::* LISTEN
(Some whitespace removed.) As mentioned, the HiveServer uses Thrift. Thrift provides an interface definition language; from an interface definition, the Thrift compiler generates network RPC clients for many languages. Because Hive is written in Java, and Java bytecode is cross-platform, the Thrift clients for the server are included in the Hive release. One way to use these clients is to start a Java project with an IDE and include these libraries, or to fetch them through Maven.
For this example we will use Groovy. Groovy is an agile and dynamic language for the Java Virtual Machine. Groovy is ideal for prototyping because it integrates with Java and provides a read-eval-print-loop (REPL) for writing code on the fly:
$ curl -O http://dist.groovy.codehaus.org/distributions/groovy-binary-1.8.6.zip
$ unzip groovy-binary-1.8.6.zip
Next, add all Hive JARs to Groovy’s classpath by editing groovy-starter.conf. This will allow Groovy to communicate with Hive without having to manually load JAR files each session:
# load required libraries
load !{groovy.home}/lib/*.jar
# load user specific libraries
load !{user.home}/.groovy/lib/*.jar
# tools.jar for ant tasks
load ${tools.jar}

load /home/edward/hadoop/hadoop-0.20.2_local/*.jar
load /home/edward/hadoop/hadoop-0.20.2_local/lib/*.jar
load /home/edward/hive-0.9.0/lib/*.jar
Groovy has a @Grab annotation that can fetch JAR files from Maven web repositories, but currently some packaging issues with Hive prevent this from working correctly.
Groovy provides a shell, found inside the distribution at bin/groovysh. Groovysh provides a REPL for interactive programming. Groovy code is similar to Java code, although it has other forms, including closures. For the most part, you can write Groovy as you would write Java.
From the REPL, import Hive- and Thrift-related classes. These classes are used to connect to Hive and create an instance of HiveClient. HiveClient has the methods users will typically use to interact with Hive:
$ $HOME/groovy/groovy-1.8.0/bin/groovysh
Groovy Shell (1.8.0, JVM: 1.6.0_23)
Type 'help' or 'h' for help.
groovy:000> import org.apache.hadoop.hive.service.*;
groovy:000> import org.apache.thrift.protocol.*;
groovy:000> import org.apache.thrift.transport.*;
groovy:000> transport = new TSocket("localhost", 10000);
groovy:000> protocol = new TBinaryProtocol(transport);
groovy:000> client = new HiveClient(protocol);
groovy:000> transport.open();
groovy:000> client.execute("show tables");
The getClusterStatus
method retrieves information from the Hadoop JobTracker. This can be used
to collect performance metrics and can also be used to wait for a lull to
launch a job:
groovy:000> client.getClusterStatus()
===> HiveClusterStatus(taskTrackers:50, mapTasks:52, reduceTasks:40,
maxMapTasks:480, maxReduceTasks:240, state:RUNNING)
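The "wait for a lull" idea can be reduced to a simple predicate over these status numbers: compare the running map tasks against total map capacity before submitting work. The class, method name, and threshold below are illustrative, not part of the Hive API; the numbers come from the getClusterStatus() output shown here.

```java
public class ClusterLull {
    // Treat the cluster as "quiet" when running map tasks use less than
    // the given fraction of total map slots (the threshold is arbitrary).
    public static boolean isQuiet(int mapTasks, int maxMapTasks, double threshold) {
        return maxMapTasks > 0 && (double) mapTasks / maxMapTasks < threshold;
    }

    public static void main(String[] args) {
        // 52 of 480 map slots busy, as in the status output above.
        System.out.println(isQuiet(52, 480, 0.25)); // prints true
    }
}
```

A client could poll getClusterStatus() in a loop and launch its job only once this predicate holds.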
After executing a query, you can get the schema of the result set using the getSchema() method. If you call this method before a query, it may return a null schema:
groovy:000> client.getSchema()
===> Schema(fieldSchemas:null, properties:null)
groovy:000> client.execute("show tables");
===> null
groovy:000> client.getSchema()
===> Schema(fieldSchemas:[FieldSchema(name:tab_name, type:string,
comment:from deserializer)], properties:null)
After a query is run, you can fetch results with the fetchOne() method. Retrieving large result sets through the Thrift interface is not recommended; however, it does offer several methods to retrieve data using a one-way cursor. The fetchOne() method retrieves an entire row:
groovy:000> client.fetchOne()
===> cookjar_small
Instead of retrieving rows one at a time, the entire result set can
be retrieved as a string array using the fetchAll()
method:
groovy:000> client.fetchAll()
===> [macetest, missing_final, one, time_to_serve, two]
Also available is fetchN
, which fetches N rows at
a time.
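The usual pattern with fetchN is to drain the result set in batches, processing each batch without holding everything in memory. Since a live HiveClient is needed for the real calls, the sketch below models the one-way cursor with a plain list; the class and method names are illustrative, but the loop shape mirrors how fetchN would be used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FetchLoop {
    private final List<String> rows;
    private int pos = 0;

    public FetchLoop(List<String> rows) { this.rows = rows; }

    // Stand-in for HiveClient.fetchN(n): returns up to n rows and
    // advances a one-way cursor; an empty list signals exhaustion.
    public List<String> fetchN(int n) {
        List<String> batch = new ArrayList<String>(
            rows.subList(pos, Math.min(pos + n, rows.size())));
        pos += batch.size();
        return batch;
    }

    // Drain the cursor batch by batch, returning the total row count.
    public static int drain(FetchLoop client, int batchSize) {
        int total = 0;
        List<String> batch;
        while (!(batch = client.fetchN(batchSize)).isEmpty()) {
            total += batch.size();   // process the batch here
        }
        return total;
    }

    public static void main(String[] args) {
        FetchLoop c = new FetchLoop(Arrays.asList("a", "b", "c", "d", "e"));
        System.out.println(drain(c, 2)); // prints 5
    }
}
```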
After a query is started, the getQueryPlan() method is used to retrieve status information about the query. This information includes counters and the state of the job:
groovy:000> client.execute("SELECT * FROM time_to_serve");
===> null
groovy:000> client.getQueryPlan()
===> QueryPlan(queries:[Query(queryId:hadoop_20120218180808_...-aedf367ea2f3,
queryType:null, queryAttributes:{queryString=SELECT * FROM time_to_serve},
queryCounters:null, stageGraph:Graph(nodeType:STAGE, roots:null,
adjacencyList:null), stageList:null, done:true, started:true)],
done:false, started:false)
(A long number was elided.)
The Hive service also connects to the Hive metastore via Thrift. Generally, users should not directly call metastore methods that modify state; they should interact with Hive only through the HiveQL language. Users should utilize the read-only methods that provide meta-information about tables. For example, the get_partition_names(String,String,short) method can be used to determine which partitions are available to a query:
groovy:000> client.get_partition_names("default", "fracture_act", (short) 0)
[hit_date=20120218/mid=001839, hit_date=20120218/mid=001842,
hit_date=20120218/mid=001846]
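Each partition name is encoded as slash-separated key=value pairs. If a client needs the individual partition values, a small helper can decode them. This is an illustrative sketch, not part of the Hive API; the sample name comes from the output above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PartitionNames {
    // Decode a partition name like "hit_date=20120218/mid=001839"
    // into an ordered map of partition-key -> value.
    public static Map<String, String> decode(String partitionName) {
        Map<String, String> spec = new LinkedHashMap<String, String>();
        for (String component : partitionName.split("/")) {
            int eq = component.indexOf('=');
            spec.put(component.substring(0, eq), component.substring(eq + 1));
        }
        return spec;
    }

    public static void main(String[] args) {
        Map<String, String> spec = decode("hit_date=20120218/mid=001839");
        System.out.println(spec.get("hit_date")); // prints 20120218
        System.out.println(spec.get("mid"));      // prints 001839
    }
}
```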
It is important to remember that while the metastore API is relatively stable in terms of changes, the methods inside, including their signatures and purpose, can change between releases. Hive tries to maintain compatibility in the HiveQL language, which masks changes at these levels.
The ability to access the metastore programmatically provides the capacity to monitor and enforce conditions across your deployment. For example, a check can be written to ensure that all tables use compression, or that tables with names that start with zz should not exist longer than 10 days. These small “Hive-lets” can be written quickly and executed remotely, if necessary.
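The zz-prefix retention rule mentioned above reduces to a simple predicate. The sketch below shows only that decision logic over a plain table name and creation time; in a real "Hive-let" these values would come from the metastore's Table objects, and the class and method names here are made up for illustration.

```java
public class RetentionCheck {
    static final long TEN_DAYS_MS = 10L * 24 * 60 * 60 * 1000;

    // A table whose name starts with "zz" should not outlive ten days.
    public static boolean violatesRetention(String tableName,
                                            long createTimeMs, long nowMs) {
        return tableName.startsWith("zz")
            && (nowMs - createTimeMs) > TEN_DAYS_MS;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long elevenDaysAgo = now - 11L * 24 * 60 * 60 * 1000;
        System.out.println(violatesRetention("zz_scratch", elevenDaysAgo, now)); // prints true
        System.out.println(violatesRetention("orders", elevenDaysAgo, now));     // prints false
    }
}
```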
By default, managed tables store their data inside the warehouse directory, which is /user/hive/warehouse by default. Usually, external tables do not use this directory, but there is nothing that prevents you from putting them there. Enforcing a rule that managed tables should only be inside the warehouse directory will keep the environment sane.
In the following application, the outer loop iterates through the list returned from get_all_databases(). The inner loop iterates through the list returned from get_all_tables(database). The Table object returned from get_table(database,table) has all the information about the table in the metastore. We determine the location of the table and check that the type matches the string MANAGED_TABLE. External tables have the type EXTERNAL_TABLE. A list of “bad” table names is returned:
public List<String> check() {
  List<String> bad = new ArrayList<String>();
  for (String database : client.get_all_databases()) {
    for (String table : client.get_all_tables(database)) {
      try {
        Table t = client.get_table(database, table);
        URI u = new URI(t.getSd().getLocation());
        if (t.getTableType().equals("MANAGED_TABLE") &&
            !u.getPath().contains("/user/hive/warehouse")) {
          System.out.println(t.getTableName()
              + " is a managed table not inside /user/hive/warehouse");
          bad.add(t.getTableName());
        }
      } catch (Exception ex) {
        System.err.println("Had exception but will continue " + ex);
      }
    }
  }
  return bad;
}
The Hive CLI creates local artifacts like the
.hivehistory file along with entries in
/tmp and hadoop.tmp.dir
. Because the HiveService becomes
the place where Hadoop jobs launch from, there are some considerations
when deploying it.
HiveService is a good alternative to having the entire Hive client install local to the machine that launches the job. Using it in production does bring up some added issues that need to be addressed. The work that used to be done on the client machine, in planning and managing the tasks, now happens on the server. If you are launching many clients simultaneously, this could cause too much load for a single HiveService. A simple solution is to use a TCP load balancer or proxy to alternate connections between a pool of backend servers.
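The "least connections" strategy used below simply routes each new connection to the backend currently holding the fewest open connections. A minimal model of that choice (this is illustrative Java, not haproxy code; the hostnames echo the inventory table):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LeastConn {
    // Active connection counts per backend HiveServer.
    private final Map<String, Integer> active = new LinkedHashMap<String, Integer>();

    public void addBackend(String hostPort) { active.put(hostPort, 0); }

    // Pick the backend with the fewest active connections, as a
    // leastconn balancer would, and account for the new connection.
    public String pick() {
        String best = null;
        for (Map.Entry<String, Integer> e : active.entrySet()) {
            if (best == null || e.getValue() < active.get(best)) best = e.getKey();
        }
        active.put(best, active.get(best) + 1);
        return best;
    }

    public static void main(String[] args) {
        LeastConn lb = new LeastConn();
        lb.addBackend("hiveservice1.example.pvt:10000");
        lb.addBackend("hiveservice2.example.pvt:10000");
        System.out.println(lb.pick()); // prints hiveservice1.example.pvt:10000
        System.out.println(lb.pick()); // prints hiveservice2.example.pvt:10000
    }
}
```

With equal backend capacity this behaves much like round-robin, but it adapts when one HiveService holds long-running sessions.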
There are several ways to do TCP load balancing and you
should consult your network administrator for the best solution. We
suggest a simple solution that uses the haproxy
tool
to balance connections between backend ThriftServers.
First, inventory your physical ThriftServers and document the virtual server that will be your proxy (Tables 16-1 and 16-2).
Table 16-1. Physical server inventory

| Short name   | Hostname and port              |
|--------------|--------------------------------|
| HiveService1 | hiveservice1.example.pvt:10000 |
| HiveService2 | hiveservice2.example.pvt:10000 |
Install haproxy. Depending on your operating system and distribution, these steps may be different. This example assumes a RHEL/CentOS distribution:

$ sudo yum install haproxy
Use the inventory prepared above to build the configuration file:

$ more /etc/haproxy/haproxy.cfg
listen hiveprimary 10.10.10.100:10000
  balance leastconn
  mode tcp
  server hivethrift1 hiveservice1.example.pvt:10000 check
  server hivethrift2 hiveservice2.example.pvt:10000 check
Start haproxy via the system init script. After you have confirmed it is working, add it to the default system start-up with chkconfig:

$ sudo /etc/init.d/haproxy start
$ sudo chkconfig haproxy on
Hive offers the configuration variable hive.start.cleanup.scratchdir, which defaults to false. Setting it to true will cause the service to clean up its scratch directory on restart:

<property>
  <name>hive.start.cleanup.scratchdir</name>
  <value>true</value>
  <description>To clean up the Hive scratchdir while starting
  the Hive server</description>
</property>
Typically, a Hive session connects directly to a JDBC database, which it uses as a metastore. Hive provides an optional component known as the ThriftMetastore. In this setup, the Hive client connects to the ThriftMetastore, which in turn communicates with the JDBC metastore. Most deployments will not require this component, but it is useful for deployments that have non-Java clients that need access to information in the metastore. Using the ThriftMetastore requires two separate configurations.
The ThriftMetastore should be set up to communicate with the actual metastore using JDBC. Then start up the metastore in the following manner:
$ cd ~
$ bin/hive --service metastore &
[1] 17096
Starting Hive Metastore Server
Confirm the metastore is running using the netstat command:

$ netstat -an | grep 9083
tcp 0 0 :::9083 :::* LISTEN
Clients like the CLI should communicate with the metastore directly:

<property>
  <name>hive.metastore.local</name>
  <value>false</value>
  <description>controls whether to connect to remote metastore server
  or open a new metastore server in Hive Client JVM</description>
</property>

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore_server:9083</value>
  <description>controls whether to connect to remote metastore server
  or open a new metastore server in Hive Client JVM</description>
</property>
This change should be seamless from the user's perspective, although there are some nuances with Hadoop security, since the metastore must now do work on behalf of the connecting user.