Impala

Impala is a new member of the Hadoop ecosystem. Its beta version first became available in 2012 and the first stable release arrived in June 2013. Even though Impala is a young project with plenty of room for improvement, the significance of the goal it is trying to achieve makes it worth mentioning in this book. Impala's goal is very ambitious: bringing real-time queries to Hadoop. Hive made it possible to use a SQL-like language to query data in Hadoop, but its performance was still limited by the MapReduce framework. It is worth mentioning that projects such as Stinger (http://hortonworks.com/labs/stinger/) are dedicated to significantly improving Hive performance, but Stinger is still in development.

Impala bypasses MapReduce and operates on the data directly in HDFS to achieve significant performance improvements. Impala is written mostly in C++. It uses RAM buffers to cache data and generally operates more like a parallel relational database. Impala was designed to be compatible with Hive. It uses the same HiveQL language and even utilizes the existing Hive Metastore to get information about tables and columns. If you have existing Hive tables in your cluster, you will be able to query them with Impala without any changes.

Impala architecture

Impala comprises several components. There is an Impala server process, which should be running on each DataNode in the cluster. Its responsibility is to process queries from clients, read and write data in HDFS, perform join and aggregation tasks, and so on. Since queries are executed in parallel on many servers, the daemon that receives a user request becomes the query coordinator and synchronizes the work of all the other nodes for that particular query.

As DataNodes in a Hadoop cluster or individual Impala daemons can go down from time to time, there is a need to keep constant track of the state of all daemons in the cluster. For this purpose, Impala uses the Impala state store. If some of the Impala servers do not report back to the state store, they are marked as dead and are excluded from further query attempts. Impala servers can operate without a state store, but responses may be delayed if there are failed nodes in the cluster.

Additionally, Impala utilizes the same Metastore service that we configured for Hive in the Installing Hive Metastore section. Impala provides both a command-line interface and JDBC/ODBC access.

There are no additional hardware requirements for the Impala services. Impala servers always run on DataNodes because they rely on local data access. The Impala state store can be co-located with existing Hadoop services, such as the JobTracker or the standby NameNode.

Installing Impala state store

For our cluster, we will install the state store on the JobTracker node. Impala is not included in the standard CDH repository, so you will need to add a new repository by copying the http://archive.cloudera.com/impala/redhat/6/x86_64/impala/cloudera-impala.repo file into the /etc/yum.repos.d/ directory on all Hadoop nodes, except the NameNodes.
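One way to do this is to download the repository file directly into place with wget (assuming wget is available on the nodes and the URL is still current):

# wget -O /etc/yum.repos.d/cloudera-impala.repo http://archive.cloudera.com/impala/redhat/6/x86_64/impala/cloudera-impala.repo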

Note

The link to the Impala repository may change. Please refer to the Cloudera documentation at the following website for up-to-date information: http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_prereqs.html

Once this is done, you can install Impala state store by running the following yum command:

# yum install impala-state-store

The state store package contains only the startup scripts; the Impala binaries are installed as part of the dependent impala packages.

The state store doesn't require much configuration. Impala relies on HDFS for data access, but it doesn't pick up the existing Hadoop configuration files automatically. This means that you will need to manually copy the core-site.xml and hdfs-site.xml files from your Hadoop configuration directory into the /etc/impala/conf directory.
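Assuming your Hadoop configuration lives in the default /etc/hadoop/conf directory, the copy could look like this:

# cp /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml /etc/impala/conf/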

You can adjust the state store port and logging directory by editing the /etc/default/impala file. By default, the state store uses port 24000; we will keep this default.
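For reference, the relevant variables in that file look similar to the following (the exact contents of /etc/default/impala may differ between Impala releases):

IMPALA_STATE_STORE_PORT=24000
IMPALA_LOG_DIR=/var/log/impala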

To start the state store, run the following command:

# service impala-state-store start

Check the log files in /var/log/impala to make sure the service has started properly.
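For example, the state store daemon typically writes its main log to a file named statestored.INFO (the exact file name is an assumption and may vary between releases), so a quick check could be:

# tail /var/log/impala/statestored.INFO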

Installing the Impala server

To install the Impala server on DataNodes, run the following command:

# yum install impala-server

This will install the Impala binaries, as well as the server start/stop script. Before we can start the Impala server, some additional tuning needs to be applied to the DataNode configuration.

Impala uses an HDFS feature called short-circuit reads. It allows Impala to bypass the standard DataNode layer and read file blocks directly from local storage. To enable this feature, you need to add the following properties to the hdfs-site.xml file:

<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>

<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/run/hadoop-hdfs/dn._PORT</value>
</property>

Local HDFS clients such as Impala use this socket path to communicate with the DataNode. Leave the _PORT placeholder in the value of dfs.domain.socket.path exactly as shown here; Hadoop will substitute it with the DataNode port number.

Additionally, Impala tracks the location of file blocks at a more detailed level. Previously, the NameNode service was only able to provide information about which block resides on which server. Impala takes this one step further and keeps track of which file blocks are stored on which disk of a given DataNode. To support this, a new API was introduced in CDH 4.1. To enable it, you need to add the following properties to the hdfs-site.xml file:

<property>
  <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
  <value>true</value>
</property>

<property>
  <name>dfs.client.file-block-storage-locations.timeout</name>
  <value>3000</value>
</property>

You will need to apply the preceding changes on all DataNodes and restart the DataNode daemons before you can proceed with the Impala configuration.
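On CDH, assuming the DataNode service is registered under its usual hadoop-hdfs-datanode name, the restart could be performed on each DataNode as follows:

# service hadoop-hdfs-datanode restart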

The next thing that we need to do is point the Impala server to the state store that we configured on our JobTracker node earlier. To do that, change the following variables in the /etc/default/impala configuration file:

IMPALA_STATE_STORE_HOST=jt1.hadoop.test.com
IMPALA_STATE_STORE_PORT=24000

Impala is a very memory-intensive process. Currently, it has to perform operations such as table joins completely in memory. If there is not enough memory available, the query will be aborted. Additionally, Impala needs to cache metadata about existing tables, files, and block locations. To control the upper limit of Impala's memory allocation, add the -mem_limit=X option to the IMPALA_SERVER_ARGS variable in /etc/default/impala, where X is the percentage of the physical RAM available on the server.

To properly allocate resources between MapReduce tasks and Impala queries, you need to know your cluster's workload profile. If you know that MapReduce jobs will run concurrently with Impala queries, you may need to limit Impala's memory to 40 percent of the available RAM, or even less. Be prepared for some Impala queries to fail to complete in this case. If you are planning to use Impala as your primary data-processing tool, you can allocate as much as 70 to 80 percent of RAM to it.
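As an illustration, limiting Impala to roughly 70 percent of RAM could look like the following in /etc/default/impala. The other flags shown are assumptions based on the default contents of that file and may differ in your version; also check your release's documentation for the exact value format that -mem_limit accepts:

IMPALA_SERVER_ARGS=" \
    -log_dir=${IMPALA_LOG_DIR} \
    -state_store_port=${IMPALA_STATE_STORE_PORT} \
    -state_store_host=${IMPALA_STATE_STORE_HOST} \
    -mem_limit=70%"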

The Impala server, like the Impala state store, does not reuse the existing Hadoop configuration files. Before you start the Impala server, you need to copy the core-site.xml and hdfs-site.xml files into the /etc/impala/conf directory.

To start the Impala server, run the following command:

# service impala-server start

To use Impala from the command line, you will need to install the Impala command-line interface:

# yum install impala-shell

To launch the Impala shell, execute the following command:

# impala-shell

To connect to the Impala server, run the following command in the Impala shell:

> connect dn1.hadoop.test.com;
Connected to dn1.hadoop.test.com:21000
Server version: impalad version 1.0.1 RELEASE (build df844fb967cec8740f08dfb8b21962bc053527ef)
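Once connected, you can run regular HiveQL statements against tables registered in the Hive Metastore. For example, assuming a Hive table named logs already exists (the table name is only an illustration):

> SELECT COUNT(*) FROM logs;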

Note

Keep in mind that Impala currently doesn't handle metadata updates automatically. For example, if you have created a new table using Hive, you will need to run the REFRESH command in the Impala shell to see the changes.
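As a sketch, if a table named new_table (a hypothetical name) was just created in Hive, you could refresh Impala's view of the metadata with something like the following; depending on your Impala version, REFRESH may accept an optional table name or reload the whole catalog:

> REFRESH new_table;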

You now have a fully functional Impala server running on one of your DataNodes. To complete the installation, you need to propagate this setup across all your DataNodes. Each Impala server can accept client requests from the shell or from applications. The server that receives a request becomes the query coordinator for it, so it is a good idea to distribute requests evenly among the nodes in the cluster.

Impala is still in an active development phase. Make sure you check the documentation for every new release, because many aspects of Impala's configuration and general behavior may change in the future.
