Chapter 3. Configuring the Hadoop Ecosystem

Hadoop is a powerful distributed data processing system. The cluster that we configured in the previous chapter is a ready-to-use system, but if you start using Hadoop in this configuration for any real-life applications, you will very soon discover that MapReduce provides only a very low-level way to access and process data. You will need to figure out lots of things on your own: how to extract data from external sources and load it into Hadoop efficiently, what format to store the data in, and how to write the Java code that implements the processing in the MapReduce paradigm. The Hadoop ecosystem includes a number of side projects that have been created to address different aspects of loading, processing, and extracting data. In this chapter, we will go over setting up and configuring several popular and important Hadoop ecosystem projects:

  • Sqoop for extracting data from external data sources
  • Hive for high-level, SQL-like access to data
  • Impala for real-time data processing

There are many more Hadoop-related projects, but we will focus on those that will immediately improve the usability of the Hadoop cluster for end users.
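To give a sense of how these tools raise the level of abstraction compared to writing MapReduce jobs in Java, here is a minimal sketch of loading a relational table into HDFS with Sqoop and querying it with Hive. The connection string, user, table, and directory names are hypothetical placeholders, and the example assumes a matching table has been defined in the Hive metastore; the exact setup is covered later in this chapter.

    # Hypothetical example: pull a MySQL table into HDFS with Sqoop
    sqoop import --connect jdbc:mysql://dbhost.example.com/sales \
        --username etl_user -P --table orders --target-dir /user/etl/orders

    # Query the data with a SQL-like statement in Hive,
    # instead of writing Java MapReduce code
    hive -e "SELECT order_date, COUNT(*) FROM orders GROUP BY order_date"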

Hosting the Hadoop ecosystem

Often, additional Hadoop components are not hosted within the cluster itself. Most of these projects act as clients to HDFS and MapReduce and can be executed on separate servers. Such servers were marked as Hadoop clients in the cluster diagram from Chapter 1, Setting Up Hadoop Cluster – from Hardware to Distribution. The main reason for separating Hadoop nodes from clients, both physically and on the network level, is security. Hadoop client servers are supposed to be accessed by different people within your organization. If you decide to run clients on the same servers as Hadoop, you will have to put in a lot of effort to provide a proper level of access to every user. Separating these instances logically and physically simplifies the task. Very often, Hadoop clients are deployed on virtual machines, since their resource requirements are modest.
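As a rough sketch of what such a client-only node needs, the following core-site.xml fragment points the Hadoop client tools at the cluster's NameNode. This assumes a Hadoop 2-style property name, and the hostname and port are hypothetical; they must match your cluster's actual configuration. A similar entry in mapred-site.xml or yarn-site.xml is needed if the client will also submit MapReduce jobs.

    <configuration>
      <property>
        <!-- Hypothetical NameNode address; no Hadoop daemons run on the client itself -->
        <name>fs.defaultFS</name>
        <value>hdfs://nn1.hadoop.example.com:8020</value>
      </property>
    </configuration>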

Note

Keep in mind that clients that read and write HDFS data need to be able to access the NameNodes, as well as all the DataNodes in the cluster.

If you are running a small cluster or have a limited number of users and are not too concerned with security issues, you can host most of the client programs on the same nodes as NameNodes or DataNodes.
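A quick way to confirm that a client node has the access described previously is to run a simple HDFS command from it; if the NameNode or any required DataNode is unreachable, the command will fail or hang. The hostname, port, and file path below are hypothetical, and the RPC port depends on your configuration (8020 is a common default).

    # Check that the NameNode RPC port is reachable from the client
    nc -z nn1.hadoop.example.com 8020 && echo "NameNode reachable"

    # Listing a directory and reading a file exercises both the NameNode
    # and the DataNodes that hold the file's blocks
    hdfs dfs -ls /
    hdfs dfs -cat /some/existing/file | head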
