Chapter 16. Hive Thrift Service

Hive has an optional component known as HiveServer or HiveThrift that allows access to Hive over a single port. Thrift is a software framework for scalable cross-language services development. See http://thrift.apache.org/ for more details. Thrift allows clients written in languages such as Java, C++, Ruby, and many others to programmatically access Hive remotely.

The CLI is the most common way to access Hive. However, the design of the CLI can make it difficult to use programmatically. The CLI is a fat client; it requires a local copy of all the Hive components and configuration as well as a copy of a Hadoop client and its configuration. Additionally, it works as an HDFS client, a MapReduce client, and a JDBC client (to access the metastore). Even with the proper client installation, having all of the correct network access can be difficult, especially across subnets or datacenters.

Starting the Thrift Server

To get started with HiveServer, start it in the background using the --service option of the hive command:

$ cd $HIVE_HOME
$ bin/hive --service hiveserver &
Starting Hive Thrift Server

A quick way to ensure the HiveServer is running is to use the netstat command to determine if port 10000 is open and listening for connections:

$ netstat -nl | grep 10000
tcp  0  0 :::10000         :::*          LISTEN

(Some whitespace removed.) As mentioned, HiveServer uses Thrift. Thrift provides an interface definition language; from an interface definition, the Thrift compiler generates code for network RPC clients in many languages. Because Hive is written in Java, and Java bytecode is cross-platform, the Thrift clients for HiveServer are included in the Hive release. One way to use these clients is to start a Java project in an IDE and include these libraries, or to fetch them through Maven.

Setting Up Groovy to Connect to HiveService

For this example we will use Groovy. Groovy is an agile and dynamic language for the Java Virtual Machine. Groovy is ideal for prototyping because it integrates tightly with Java and provides a read-eval-print loop (REPL) for writing code on the fly. Download and unpack the Groovy distribution:

$ curl -O http://dist.groovy.codehaus.org/distributions/groovy-binary-1.8.6.zip
$ unzip groovy-binary-1.8.6.zip

Next, add all the Hive JARs to Groovy’s classpath by editing groovy-starter.conf, found in the conf directory of the Groovy distribution. This allows Groovy to communicate with Hive without manually loading the JAR files in each session:

# load required libraries
load !{groovy.home}/lib/*.jar

# load user specific libraries
load !{user.home}/.groovy/lib/*.jar

# tools.jar for ant tasks
load ${tools.jar}

load /home/edward/hadoop/hadoop-0.20.2_local/*.jar
load /home/edward/hadoop/hadoop-0.20.2_local/lib/*.jar
load /home/edward/hive-0.9.0/lib/*.jar

Note

Groovy has a @Grab annotation that can fetch JAR files from Maven web repositories, but currently some packaging issues with Hive prevent this from working correctly.

Groovy provides a shell, found inside the distribution at bin/groovysh. Groovysh provides a REPL for interactive programming. Groovy code is similar to Java code, although it also has other forms, including closures. For the most part, you can write Groovy as you would write Java.
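
For example, a Java-style loop and its closure-based equivalent can both be entered at the groovysh prompt:

groovy:000> for (int i = 0; i < 3; i++) { println(i) }
groovy:000> (0..2).each { println(it) }

Both print the numbers 0 through 2; the second form passes a closure to the each method.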

Connecting to HiveServer

From the REPL, import Hive- and Thrift-related classes. These classes are used to connect to Hive and create an instance of HiveClient. HiveClient has the methods users will typically use to interact with Hive:

$ $HOME/groovy/groovy-1.8.6/bin/groovysh
Groovy Shell (1.8.6, JVM: 1.6.0_23)
Type 'help' or 'h' for help.
groovy:000> import org.apache.hadoop.hive.service.*;
groovy:000> import org.apache.thrift.protocol.*;
groovy:000> import org.apache.thrift.transport.*;
groovy:000> transport = new TSocket("localhost", 10000);
groovy:000> protocol = new TBinaryProtocol(transport);
groovy:000> client = new HiveClient(protocol);
groovy:000> transport.open();
groovy:000> client.execute("show tables");

Getting Cluster Status

The getClusterStatus method retrieves information from the Hadoop JobTracker. This information can be used to collect performance metrics, or to wait for a lull in cluster activity before launching a job:

groovy:000> client.getClusterStatus()
===> HiveClusterStatus(taskTrackers:50, mapTasks:52, reduceTasks:40,
maxMapTasks:480, maxReduceTasks:240, state:RUNNING)
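
As a sketch of the latter use, a client could poll getClusterStatus() and hold a job until the cluster quiets down. The 50% map-slot threshold, the 30-second polling interval, and the final INSERT statement below are all arbitrary, illustrative choices:

// Sketch: wait until fewer than half of the map slots are in use
HiveClusterStatus status = client.getClusterStatus()
while (status.mapTasks > status.maxMapTasks * 0.5) {
  Thread.sleep(30 * 1000)              // poll every 30 seconds
  status = client.getClusterStatus()
}
client.execute("INSERT OVERWRITE TABLE target SELECT * FROM source")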

Result Set Schema

After executing a query, you can get the schema of the result set using the getSchema() method. If you call this method before a query, it may return a null schema:

groovy:000> client.getSchema()
===> Schema(fieldSchemas:null, properties:null)
groovy:000> client.execute("show tables");
===> null
groovy:000> client.getSchema()
===> Schema(fieldSchemas:[FieldSchema(name:tab_name, type:string,
comment:from deserializer)], properties:null)
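
The Schema object can also be consumed programmatically. For example, this one-liner prints the name and type of each column in the last result set:

groovy:000> client.getSchema().fieldSchemas.each { fs -> println fs.name + " " + fs.type }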

Fetching Results

After a query is run, you can fetch its results. Retrieving large result sets through the Thrift interface is not recommended; however, it offers several methods for retrieving data with a one-way cursor. The fetchOne() method retrieves a single row:

groovy:000> client.fetchOne()
===> cookjar_small

Instead of retrieving rows one at a time, the entire result set can be retrieved as a string array using the fetchAll() method:

groovy:000> client.fetchAll()
===> [macetest, missing_final, one, time_to_serve, two]

Also available is fetchN, which fetches N rows at a time.
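
For example, a result set can be drained in batches using fetchN(). This is only a sketch; the batch size of 100 is an arbitrary choice:

client.execute("SELECT * FROM time_to_serve")
while (true) {
  List<String> rows = client.fetchN(100)   // fetch up to 100 rows
  if (rows.isEmpty()) break                // an empty batch means the cursor is exhausted
  rows.each { row -> println(row) }
}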

Retrieving Query Plan

After a query is started, the getQueryPlan() method retrieves status information about the query, including its counters and the state of the job:

groovy:000> client.execute("SELECT * FROM time_to_serve");
===> null
groovy:000> client.getQueryPlan()
===> QueryPlan(queries:[Query(queryId:hadoop_20120218180808_...-aedf367ea2f3,
queryType:null, queryAttributes:{queryString=SELECT * FROM time_to_serve},
queryCounters:null, stageGraph:Graph(nodeType:STAGE, roots:null,
adjacencyList:null), stageList:null, done:true, started:true)],
done:false, started:false)

(A long number was elided.)

Metastore Methods

The Hive service also connects to the Hive metastore via Thrift. Generally, users should not directly call metastore methods that modify state; they should manipulate the metastore only through the HiveQL language, and restrict themselves to the read-only methods that provide meta-information about tables. For example, the get_partition_names(String,String,short) method can be used to determine which partitions are available to a query:

groovy:000> client.get_partition_names("default", "fracture_act", (short)0)
[ hit_date=20120218/mid=001839,hit_date=20120218/mid=001842,
hit_date=20120218/mid=001846 ]

It is important to remember that while the metastore API is relatively stable, the methods inside it, including their signatures and purposes, can change between releases. Hive tries to maintain compatibility in the HiveQL language, which masks changes at these lower levels.

Example Table Checker

The ability to access the metastore programmatically provides the capacity to monitor and enforce conditions across your deployment. For example, a check can be written to ensure that all tables use compression, or that tables with names starting with zz do not exist longer than 10 days. These small “Hive-lets” can be written quickly and executed remotely, if necessary.

Finding tables not marked as external

By default, managed tables store their data inside the warehouse directory, which is /user/hive/warehouse by default. Usually, external tables do not use this directory, but there is nothing that prevents you from putting them there. Enforcing a rule that managed tables should only be inside the warehouse directory will keep the environment sane.

In the following application, the outer loop iterates through the list returned from get_all_databases(). The inner loop iterates through the list returned from get_all_tables(database). The Table object returned from get_table(database,table) has all the information about the table in the metastore. We determine the location of the table and check that its type matches the string MANAGED_TABLE; external tables have the type EXTERNAL_TABLE. A list of “bad” table names is returned:

  public List<String> check(){
    List<String> bad = new ArrayList<String>();
    for (String database : client.get_all_databases()) {
      for (String table : client.get_all_tables(database)) {
        try {
          Table t = client.get_table(database, table);
          URI u = new URI(t.getSd().getLocation());
          // Flag managed tables whose data lives outside the warehouse directory
          if (t.getTableType().equals("MANAGED_TABLE") &&
              !u.getPath().contains("/user/hive/warehouse")) {
            System.out.println(t.getTableName()
                + " is a managed table stored outside /user/hive/warehouse");
            bad.add(t.getTableName());
          }
        } catch (Exception ex) {
          System.err.println("Had exception but will continue " + ex);
        }
      }
    }
    return bad;
  }
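
A minimal sketch of wiring this method up, reusing the HiveClient connection code from earlier in the chapter (the hostname and the surrounding class that holds the client field are assumptions):

  // Sketch: connect, run the audit, report, and disconnect
  TTransport transport = new TSocket("localhost", 10000);
  transport.open();
  client = new HiveClient(new TBinaryProtocol(transport));
  List<String> bad = check();
  System.out.println("Found " + bad.size() + " misplaced managed tables");
  transport.close();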

Administering HiveServer

The Hive CLI creates local artifacts, such as the .hivehistory file, along with entries in /tmp and hadoop.tmp.dir. Because the HiveServer becomes the machine from which Hadoop jobs are launched, there are some considerations when deploying it.

Productionizing HiveService

HiveService is a good alternative to having the entire Hive client installed locally on the machine that launches the job. Using it in production does bring up some added issues that need to be addressed. The work that used to be done on the client machine, planning and managing the tasks, now happens on the server. If you are launching many clients simultaneously, this could cause too much load for a single HiveService. A simple solution is to use a TCP load balancer or proxy to alternate connections among a pool of backend servers.

There are several ways to do TCP load balancing, and you should consult your network administrator for the best solution. Here we suggest a simple solution that uses the haproxy tool to balance connections between backend Thrift servers.

First, inventory your physical ThriftServers and document the virtual server that will be your proxy (Tables 16-1 and 16-2).

Table 16-1. Physical server inventory

Short name      Hostname and port
HiveService1    hiveservice1.example.pvt:10000
HiveService2    hiveservice2.example.pvt:10000

Table 16-2. Proxy configuration

Hostname                   IP
hiveprimary.example.pvt    10.10.10.100

Install haproxy. Depending on your operating system and distribution, these steps may differ. This example assumes a RHEL/CentOS distribution:

$ sudo yum install haproxy

Use the inventory prepared above to build the configuration file:

$ more /etc/haproxy/haproxy.cfg
listen hiveprimary 10.10.10.100:10000
balance leastconn
mode tcp
server hivethrift1 hiveservice1.example.pvt:10000 check
server hivethrift2 hiveservice2.example.pvt:10000 check

Start haproxy via the system init script. After you have confirmed it is working, add it to the default system start-up with chkconfig:

$ sudo /etc/init.d/haproxy start
$ sudo chkconfig haproxy on

Cleanup

Hive offers the configuration variable hive.start.cleanup.scratchdir, which defaults to false. Setting it to true causes the service to clean up its scratch directory on restart:

<property>
  <name>hive.start.cleanup.scratchdir</name>
  <value>true</value>
  <description>To clean up the Hive scratchdir while
  starting the Hive server</description>
</property>

Hive ThriftMetastore

Typically, a Hive session connects directly, via JDBC, to a relational database that it uses as a metastore. Hive provides an optional component known as the ThriftMetastore. In this setup, the Hive client connects to the ThriftMetastore, which in turn communicates with the JDBC metastore. Most deployments will not require this component, but it is useful for deployments that have non-Java clients that need access to information in the metastore. Using the ThriftMetastore requires two separate configurations: one for the ThriftMetastore itself and one for its clients.

ThriftMetastore Configuration

The ThriftMetastore should be set up to communicate with the actual metastore using JDBC. Then start up the metastore in the following manner:

$ cd $HIVE_HOME
$ bin/hive --service metastore &
[1] 17096
Starting Hive Metastore Server

Confirm the metastore is running using the netstat command:

$ netstat -an | grep 9083
tcp  0  0 :::9083     :::*         LISTEN
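
With the metastore server running, any Thrift client can talk to port 9083 directly. For example, from groovysh, assuming the Hive JARs are on the classpath as set up earlier, and substituting your metastore hostname for metastore_server:

groovy:000> import org.apache.hadoop.hive.metastore.api.*;
groovy:000> import org.apache.thrift.protocol.*;
groovy:000> import org.apache.thrift.transport.*;
groovy:000> transport = new TSocket("metastore_server", 9083);
groovy:000> transport.open();
groovy:000> client = new ThriftHiveMetastore.Client(new TBinaryProtocol(transport));
groovy:000> client.get_all_databases();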

Client Configuration

Clients such as the CLI should be configured to communicate with the ThriftMetastore rather than connecting directly to the JDBC metastore:

<property>
  <name>hive.metastore.local</name>
  <value>false</value>
  <description>Controls whether to connect to a remote metastore server
  or open a new metastore server in the Hive Client JVM</description>
</property>

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore_server:9083</value>
  <description>URI for the remote metastore. Used by the metastore
  client to connect to the remote metastore server</description>
</property>

This change should be seamless from the user's perspective, although there are some nuances with Hadoop security, because the metastore must now do work on behalf of the user.
