Sqoop

Sqoop is a tool that connects two worlds: relational databases and Hadoop. Importing data from relational databases into Hadoop, and exporting it back, is a very common task. For many organizations, Hadoop is a data integration platform. Let's imagine you have lots of application logfiles produced by your website. These logs contain very valuable information about how users interact with your application, but the only user information you can capture in logs is the username. On the other hand, you have very detailed user profile data in your relational database. Being able to easily bring this data into Hadoop and analyze how different classes of users interact with the website is extremely important. Conversely, the results produced by complex, long-running MapReduce jobs can be relatively small in size. Very often this data is loaded into an RDBMS for real-time data access or integration with BI tools.

Installing and configuring Sqoop

Sqoop is a command-line tool, though future versions will provide a client-server type of access. Sqoop is almost always deployed on servers outside the Hadoop cluster.

Sqoop needs to be able to access all the Hadoop nodes, as well as relational databases that you plan to use. It is easier to provide such a level of access only to a few selected Sqoop nodes.

It is easy to install Sqoop using the CDH repository:

# yum install sqoop

If you are installing it on a client server where no CDH packages have been installed yet, you will need to install several dependencies, including the hadoop-client and hadoop-mapreduce packages.
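If you prefer to pull these in explicitly, something like the following should work; the exact package names can differ between CDH releases, so treat this as a sketch rather than an exact recipe:

# yum install hadoop-client hadoop-mapreduce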

The Sqoop package comes with several configuration files:

# ls -lh /etc/sqoop/conf/
-rwxr-xr-x 1 root root 1.4K Apr 22 20:38 sqoop-env-template.sh
-rwxr-xr-x 1 root root 5.0K Apr 22 20:38 sqoop-site-template.xml
-rwxr-xr-x 1 root root 5.0K Apr 22 20:38 sqoop-site.xml

The template files provide a list of options that you can use to tune Sqoop. Fortunately, if you are using the CDH distribution, you will not need to change much there. The main Sqoop configuration file, sqoop-site.xml, comes with a good set of defaults, and Sqoop can work right out of the box. There are several options related to the Metastore configuration that you may want to change. We will discuss them later.

Sqoop relies on JDBC drivers to connect to different database servers. These drivers are not shipped with Sqoop due to licensing differences and have to be installed separately. To demonstrate the basic import/export functionality of Sqoop, we will use a MySQL database as a source. You can download the MySQL JDBC driver from mysql.com. The driver comes in the form of a .jar file and needs to be copied into /usr/lib/sqoop/lib to be picked up by Sqoop automatically.
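For example, assuming you have downloaded and unpacked the Connector/J archive into the current directory (the exact .jar name depends on the driver version and is used here only as an illustration), installing it is a simple copy:

# cp mysql-connector-java-5.1.25-bin.jar /usr/lib/sqoop/lib/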

Sqoop import example

For the following examples, we will use the MySQL "world" database, which can be found at http://dev.mysql.com/doc/index-other.html. To execute the examples, you need to have the MySQL server up and running, load the world database into it, create a user for Sqoop, and make sure that the machine running Sqoop can access the remote (or local) MySQL server.
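The exact preparation steps depend on your MySQL setup, but a rough sketch might look like the following; the password, the location of the unpacked world.sql dump, and the decision to allow connections from any host are all assumptions that you should adapt to your environment:

# mysql -u root -p -e "CREATE DATABASE IF NOT EXISTS world"
# mysql -u root -p world < world.sql
# mysql -u root -p -e "GRANT SELECT ON world.* TO 'sqoop_user'@'%' IDENTIFIED BY 'secret'"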

To import a table from MySQL into Hadoop, you can run the following command:

# sqoop import --connect jdbc:mysql://mysql-db.server.example.com/world --table Country --username sqoop_user -P
Enter password:

The options that are passed to Sqoop in this case are pretty straightforward. We need to specify the JDBC connection string, the table that we want to import (Country in this case), and the MySQL user we have created for this purpose. You can also specify which JDBC driver to use with the --driver option. Keep in mind that JDBC connection string formats vary between databases, so always refer to the database documentation.
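To illustrate, a hypothetical import from a PostgreSQL server might be specified as follows; the host and database names are made up, and the PostgreSQL JDBC .jar would also need to be placed under /usr/lib/sqoop/lib:

# sqoop import --connect jdbc:postgresql://pg-db.server.example.com/world --driver org.postgresql.Driver --table country --username sqoop_user -P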

It is important to understand how Sqoop performs the import. Sqoop analyzes the table that you are trying to import and generates Java MapReduce code to be executed on the Hadoop cluster. Each map task will connect to the MySQL database, read a portion of the data using SELECT statements, and write it into HDFS. Sqoop tries to utilize the parallelism that Hadoop provides to speed up the import process. This is great for performance, but you need to keep in mind some implications of such an approach. One of the most important ones is that the import tasks run on several cluster servers and are completely independent from the database's perspective. This means that no transactional consistency is guaranteed. You can control the number of parallel tasks with the --num-mappers option, which defaults to 4.

Note

Be careful when using the --num-mappers option. It is easy to overload your database with too many concurrent connections coming from multiple Hadoop nodes.
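For example, to throttle the previous import to two concurrent connections and tell Sqoop explicitly which column to split the work on (Code is the primary key of the Country table), you could run a variation like the following; remember to remove the previous output directory in HDFS first, since Sqoop will refuse to overwrite it:

# hdfs dfs -rm -r /user/alice/Country
# sqoop import --connect jdbc:mysql://mysql-db.server.example.com/world --table Country --username sqoop_user -P --num-mappers 2 --split-by Code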

Let's take a look at the files that were imported. By default, Sqoop creates a new directory in the HDFS home directory of the user running the job, and the directory name matches the name of the table being imported. You can change the parent directory under which these table directories are created with the --warehouse-dir option (or point Sqoop at an exact destination with the --target-dir option). Here is what you will see if you check your HDFS home directory after the import (output stripped for brevity):

# hdfs dfs -ls /user/alice/Country
Found 6 items
0 2013-06-05 20:56 /user/alice/Country/_SUCCESS
0 2013-06-05 20:54 /user/alice/Country/_logs
9919 2013-06-05 20:56 /user/alice/Country/part-m-00000
6990 2013-06-05 20:56 /user/alice/Country/part-m-00001
7069 2013-06-05 20:56 /user/alice/Country/part-m-00002
7512 2013-06-05 20:56 /user/alice/Country/part-m-00003

The _SUCCESS and _logs files should be familiar to you from the WordCount MapReduce program that we ran in Chapter 2, Installing and Configuring Hadoop. This shows very clearly that each Sqoop import is, in fact, a MapReduce job. You will also notice that there are four result files, because four separate map tasks were used for this import and each of them generated a separate plain text file. You can now inspect these files using the hdfs command, run MapReduce jobs over them, and so on.
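For example, a quick way to eyeball a few of the imported rows (output omitted here) is to cat one of the part files:

# hdfs dfs -cat /user/alice/Country/part-m-00000 | head -5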

Sqoop export example

To demonstrate the reverse process of loading data into MySQL (or any other relational database), create a sample CSV file and save it into the /user/alice/tomysql directory in HDFS:

# hdfs dfs -cat /user/alice/tomysql/result
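If you want to follow along, the sample file and the target table might be prepared roughly as follows. The column layout and values are made up purely for illustration, and the hadoop_results table must already exist in MySQL before the export is run (the Sqoop user will also need INSERT privileges on it):

# printf "1,Argentina,56\n2,Brazil,98\n" > result
# hdfs dfs -mkdir /user/alice/tomysql
# hdfs dfs -put result /user/alice/tomysql/
# mysql -u root -p -e "CREATE TABLE test.hadoop_results (id INT, country VARCHAR(52), cnt INT)"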

To export data into a MySQL table, run the following command:

# sqoop export --connect jdbc:mysql://mysql-db.server.example.com/test --table hadoop_results --username sqoop_user -P --export-dir /user/alice/tomysql/

This is not very different from the import command. Here, we need to specify which MySQL table will be used for the data load with the --table option, and where Sqoop will read the data from with the --export-dir option. After the job is finished, you can verify that the data was properly loaded using a simple SELECT statement.
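For example, assuming the sqoop_user account can read the test database, a quick row count from the shell might look like this:

# mysql -u sqoop_user -p test -e "SELECT COUNT(*) FROM hadoop_results"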

These examples show only very basic operations that Sqoop can perform. Sqoop is much more flexible and powerful than what was demonstrated. It allows you to control which data will be imported from relational databases using custom conditions. Sqoop can perform incremental imports, keeping track of which rows were already imported. You can specify which file formats and compression types to use, and much more. Detailed documentation on Sqoop can be found at http://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html.
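As a taste of these options, the following sketches show a filtered, compressed import and an incremental one against the City table of the world database; the filter condition, check column, and last value are arbitrary examples, and each import assumes its target directory does not clash with earlier runs:

# sqoop import --connect jdbc:mysql://mysql-db.server.example.com/world --table City --username sqoop_user -P --where "Population > 1000000" --as-sequencefile --compress
# sqoop import --connect jdbc:mysql://mysql-db.server.example.com/world --table City --username sqoop_user -P --incremental append --check-column ID --last-value 4000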

Note

Sqoop 2, the new version, will address several limitations of the current version. This includes running in a client-server mode, providing reusable database connections, having a proper permissions model, and so on. As of now, Sqoop 2 still doesn't have a stable release.
