What is Hadoop?

Hadoop is an open-source framework for working with large quantities of data distributed across anywhere from a single computer to thousands of machines. Hadoop is composed of four modules:

  • Hadoop Common (sometimes called Hadoop Core)
  • Hadoop Distributed File System (HDFS)
  • Yet Another Resource Negotiator (YARN)
  • MapReduce

Hadoop Common provides the shared libraries and utilities needed to run the other three modules. HDFS is a Java-based, distributed file system designed to store very large files, on the order of terabytes, across many machines. YARN manages resource allocation and job scheduling in your Hadoop cluster. The MapReduce engine allows you to process data in parallel.
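To make the MapReduce model concrete, here is a minimal sketch of a word count, the classic MapReduce example, simulated locally in plain Python. On a real cluster the map and reduce functions would run as distributed tasks (for example, as Hadoop Streaming scripts); the function names and the in-memory shuffle step below are illustrative assumptions, not Hadoop APIs.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    """Reduce phase: sum all the counts emitted for a single word."""
    return (word, sum(counts))

def run_job(lines):
    """Simulate map -> shuffle/sort -> reduce on a list of input lines."""
    pairs = [kv for line in lines for kv in mapper(line)]
    pairs.sort(key=itemgetter(0))  # the shuffle/sort groups equal keys together
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    )

counts = run_job(["big data big cluster", "data everywhere"])
# counts == {'big': 2, 'cluster': 1, 'data': 2, 'everywhere': 1}
```

The appeal of this model is that the mapper runs independently on each chunk of input, so Hadoop can scatter the map work across many machines and only needs to coordinate the shuffle and reduce steps.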

There are several other projects that can be installed to work with the Hadoop framework. In this chapter, you will use Hive and Ambari. Hive allows you to read and write data using SQL; you will use it to run spatial queries against your data at the end of the chapter. Ambari provides a web user interface to Hadoop and Hive; in this chapter, you will use it to upload files and to enter your queries.

Now that you have an overview of Hadoop, the next section will show you how to set up your environment.
