Chapter 1 – The Concepts of Hadoop

“Let me once again explain the rules. Hadoop Rules!”

- Tera-Tom Coffing

What is Hadoop All About?

image

Hadoop is all about lower costs and better value! Hadoop leverages inexpensive commodity hardware servers and inexpensive disk storage. In traditional systems the servers were kept in the same location, but Hadoop allows the servers to be scattered around the world. The disks are called JBOD (Just a Bunch of Disks) because they are just unsophisticated disks attached to the commodity hardware. This approach enables incredible capabilities while keeping costs down.

There is a Named Node and Up to 4000 Data Nodes

image

Hadoop is all about parallel processing, full table scans, unstructured data and commodity hardware. There is a single server called the “Named Node”. Its job is to keep track of all of the data files on the “Data Nodes”. The data nodes send regular heartbeats to the named node; any data node that stops reporting is deemed dead. The Named Node holds a master directory of all databases created, delegates which tables reside on which data nodes, and directs where each data block of a table resides.

The Named Node's Directory Tree

image

The named node keeps the directory tree (seen above) of all files in the Hadoop Distributed File System (HDFS), and tracks where across the cluster the file data is kept. It monitors the health of the data nodes through their heartbeats. It also helps clients with reads and writes by receiving their requests and redirecting them to the appropriate data nodes. The named node acts as the host, and the data nodes read and write the data as requested.

The Data Nodes

image

The data nodes send regular heartbeats to the named node; any data node that stops reporting is deemed dead. The data nodes read and write the data they are assigned. They also copy each block of data they hold to two other nodes in the cluster as a backup in case they are deemed dead or have a disk failure. There are three copies of every block in a Hadoop cluster as a failsafe mechanism. The data nodes also send a block report to the named node.

Hive MetaStore

Hadoop places data in files on commodity hardware; the data can be structured or unstructured, and it does not have to be defined before it is stored.

image

The Hive MetaStore stores table definitions and metadata. This allows users to define table structures on data as applications need them.

Hive keeps all table definitions and related metadata in the Hive MetaStore. Hive uses an Object-Relational Mapping (ORM) layer to access the relational database that backs the metastore. The list of valid Hive metastore databases is growing and currently includes MySQL, Oracle, PostgreSQL and Derby.
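To make this concrete, here is a minimal HiveQL sketch; the Sales_Table columns, types and delimiter are assumptions for illustration. Only the definition is recorded in the Hive MetaStore, while the rows themselves remain ordinary files in the table's HDFS directory.

-- Only this definition goes into the metastore;
-- the data lives as plain files under the table's HDFS directory.
CREATE TABLE Sales_Table
  (Product_ID   INT,
   Sale_Date    STRING,
   Daily_Sales  DOUBLE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;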

Data Layout and Protection – Step 1

image

The Named Node holds a master directory of all databases created, delegates which tables reside on which data nodes, and directs where each data block of a table resides. Watch exactly what happens when the Sales_Table and others are built. The first step is that the named node determines that the Sales_Table has one block of data and that it will be written to data node 1. The block is written.

Data Layout and Protection – Step 2

image

Data node 1 has written a block of the Sales_Table to its disk. Data node 1 will now communicate directly with two other data nodes in order to back up its Sales_Table block in case of a disaster. The block is copied to two other data nodes.

Data Layout and Protection – Step 3

image

At a timed interval, all of the data nodes will provide a current block report to the named node. The named node will place this in its directory tree. The Sales_Table block is now stored in triplicate, just in case there is a disaster or a disk failure.

Data Layout and Protection – Step 4

image

When the heartbeats were checked, data node 1 had failed to report and was deemed dead. The named node sends a message to data nodes 2 and 4, and one of them will copy the block to another node. The block reports are sent back to the named node, and the named node updates its directory tree.

How are Blocks Distributed Across the Cluster?

image

The table above is 1 GB. With the default block size of 64 MB, the system puts 16 blocks across the cluster, which equals 1 GB.

Size of data matters. If you have a table with less than 64 MB of data, then it will only be stored in one block (replicated twice for disaster recovery). If the default block size were set to less than 64 MB, there would be a huge number of blocks throughout the cluster, which would force the named node to manage an enormous amount of metadata. That is why Apache Hadoop defaults the block size to 64 MB, and in Cloudera Hadoop the default is 128 MB. Large blocks are distributed across the data nodes, are managed by the named node easily, and are parallel processed at just the right level of efficiency.

What is Parallel Processing?

“After enlightenment, the laundry”

- Zen Proverb

image

“After parallel processing the laundry, enlightenment!”

- Hadoop Zen Proverb

Two guys were having fun one Saturday night when one said, “I’ve got to go and do my laundry.” The other asked, “What?!” The man explained that if he went to the laundromat the next morning, he would be lucky to get one machine and then he would be there all day. But, if he went on Saturday night he could get all the machines. Then, he could do all his washing and drying in two hours. Now that’s parallel processing mixed in with a little dry humor!

The Basics of a Single Computer

image

“When you are courting a nice girl, an hour seems like a second. When you sit on a red-hot cinder, a second seems like an hour. That’s relativity.”

– Albert Einstein

Data on disk does absolutely nothing. When data is requested, the computer moves the data one block at a time from disk into memory. Once the data is in memory, it is processed by the CPU at lightning speed. All computers work this way. The "Achilles Heel" of every computer is the slow process of moving data from disk to memory. The real theory of relativity is to find out how to get blocks of data from the disk into memory faster!

Data in Memory is Fast as Lightning

image

“You can observe a lot by watching.”

– Yogi Berra

Once the data block is moved off of the disk and into memory, the processing of that block happens as fast as lightning. It is the movement of the block from disk into memory that slows down every computer. Data being processed in memory is so fast that even Yogi Berra couldn't catch it!

Parallel Processing of Data

image

"If the facts don't fit the theory, change the facts."

-Albert Einstein

Big Data is all about parallel processing. Parallel processing is all about taking the rows of a table and spreading them among many parallel processing units. In Hadoop, these parallel processing units are referred to as nodes. Above, we can see a table called Orders. There are 16 rows in the table. Each node holds four rows. Now they can process the data in parallel and it will be four times as fast. What Albert Einstein meant to say was, “If the theory doesn't fit the dimension table, change it to a fact." Each node shares nothing and holds a portion of the rows of a table.

Introduction to Hive

Hadoop uses MapReduce to query and return a data set from files that
are stored on commodity hardware.

Hive is a system that provides an SQL interface and a relational model on top of Hadoop. Hive supports databases, tables and SQL and, as you will see from this book, it is surprisingly robust.

Hive takes SQL, converts it to a MapReduce program, and the MapReduce program runs in the YARN framework. Hive communicates with the YARN (Yet Another Resource Negotiator) Resource Manager to start the MapReduce job.

Additional Hive capabilities and features:

Allows queries, inserts and appends, but not updates or deletes.

Data can be separated into partitions and buckets.

Supports cubes, dimensions and star schemas.

Hive is designed for batch processing, but it can handle interactive queries in a reasonable time frame. It is not as fast as your Teradata, Oracle, DB2 or SQL Server system. Hive utilizes a query language called HiveQL. Hive allows data stored in the Hadoop Distributed File System (HDFS) to be accessed from within Hadoop or from outside data warehouses through connectors and via the Nexus Chameleon tool. Hadoop uses parallel processing and a distributed file system to generate results on a highly scalable and highly available infrastructure. Hive does not support transactions, updates or deletes of data.
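As a hedged sketch of the partitions and buckets listed above, the HiveQL below creates a hypothetical Orders table partitioned by order date and bucketed by customer number; the column names and bucket count are assumptions for illustration.

-- Each distinct Order_Date becomes its own partition directory in HDFS,
-- and within a partition the rows are hashed into 8 buckets on Customer_Number.
CREATE TABLE Orders
  (Order_Number    INT,
   Customer_Number INT,
   Order_Total     DOUBLE)
PARTITIONED BY (Order_Date STRING)
CLUSTERED BY (Customer_Number) INTO 8 BUCKETS
STORED AS TEXTFILE;

-- A query that filters on the partition column reads only the matching partitions.
SELECT Order_Number, Order_Total
FROM   Orders
WHERE  Order_Date = '2016-01-01';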

Commodity Hardware Servers are Configured for Hadoop

image

Hadoop allows you to utilize commodity hardware servers called Data Nodes. Hadoop also allows you to configure your parallel processes, called Reducers. The number of Reducers per node is usually determined by the number of CPUs the Data Node contains. The Data Nodes are connected together to create a Massively Parallel Processing (MPP) system from individual SMP servers.

Hadoop allows commodity hardware to be utilized to create one giant Hadoop cluster.

Commodity Hardware Allows Nodes to Scale Forever (Linear)

image

Hadoop provides incredible speeds at inexpensive cost by allowing customers to purchase commodity hardware. Double the nodes and double your speeds forever!

The Named Node

image

When a user logs into Hadoop, the named node will log them in and be
responsible for their session.

The named node gathers heartbeats and block reports from the data nodes. Any data node that stops reporting is deemed dead.

The host uses system statistics and statistics from the ANALYZE TABLE command to build the best plan.

The host isn't designed to hold data, but instead holds the Global System Catalog and the name space (directory tree – always in memory).

The host is the boss and the nodes are the workers. Who doesn't love their boss? Users log in to the host and never communicate directly with the nodes. The host builds a plan for the nodes to follow that is delivered in plan slices. Each slice instructs the nodes what to do. When the nodes have done their work, they return the results to the host.
This is the way the pros do it.
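Because the plan is built from statistics, here is a minimal HiveQL sketch of gathering them with the ANALYZE TABLE command; the Sales_Table name is reused from the earlier examples, and the column-statistics form is only available in later Hive releases.

-- Gather table-level statistics (row count, file count, raw data size).
ANALYZE TABLE Sales_Table COMPUTE STATISTICS;

-- Gather column-level statistics as well (later Hive releases).
ANALYZE TABLE Sales_Table COMPUTE STATISTICS FOR COLUMNS;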

The Data Node's Responsibilities

image

Data nodes are responsible for storing and retrieving
rows from their assigned disk (Virtual disk).

Data nodes lock the tables and rows.

Data nodes sort rows and do all aggregation.

Data nodes handle all the join processing.

Data nodes handle all space management and space
accounting.

Each data node has parallel processes called reducers to maximize parallel processing. Data nodes read, write and process the data they are responsible for. They also communicate among themselves to ensure each of their blocks is copied to two other nodes in the cluster. The data nodes send heartbeats to the named node and present it with block reports. This allows the named node to keep track of the blocks in the name space (directory tree), which always resides in memory on the named node.

All Reducers, Some Reducers or a Single Reducer

image

On most queries the Named Node will broadcast the plan to each node simultaneously, but if you use the cluster key column in the WHERE clause of your SQL with an equality statement, then only a single node will be contacted.

A Table has Columns and Rows

image

The table above has 9 rows. Our small system above has three parallel processing units called nodes. Each node holds three rows. Double your nodes and double your speed and power. The idea of parallel processing is to take the rows of a table and spread them across the nodes so each node can process its portion of the data in parallel.

Hadoop has Linear Scalability

image

"A Journey of a thousand miles begins with a single step."

- Lao Tzu

Hadoop was born to be parallel. With each query, a single step is performed in parallel by each node. A Hadoop system consists of a series of nodes that will work in parallel to store and process your data. This design allows you to start small and grow infinitely. If your node system provides you with an excellent Return On Investment (ROI), then continue to invest by purchasing more nodes. Most companies start small, but after seeing what Hadoop can do, they continue to grow their ROI from the single step of implementing a Hadoop system to millions of dollars in profits. Double your nodes and double your speeds. . . . Forever. The Hadoop Data Warehouse actually provides a journey of a thousand smiles!

The Architecture of a Hadoop Data Warehouse

image

“Be the change that you want to see in the world.”

- Mahatma Gandhi

The Named Node (host) is the brains behind the entire operation. The user logs into the host, and for each SQL query, the host will come up with a plan to retrieve the data. It passes that plan to each Data Node, and each of the reducers processes its portion of the data. If the data is spread evenly, parallel processing works perfectly. This technology is relatively inexpensive. It might not "be the change", but it will help your company "keep the change" because costs are low. Each Data Node is a Symmetric Multi-Processing (SMP) server, and many Data Nodes are linked together to become one big MPP system. The commodity hardware being used and the number of processors it contains determine the number of reducers per Data Node. Above, we can see 8 reducers per Data Node.

How to Find All Databases in the System?

image

Use the SHOW DATABASES command to find all the databases in your system.
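For example, a minimal sketch (the pattern form is optional and uses Hive's '*' wildcard; the name SQL* is just an illustration):

SHOW DATABASES;

-- Narrow the list to databases whose names match a pattern.
SHOW DATABASES LIKE 'SQL*';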

Setting Your Default Database with the USE Command

image

Hive allows you to set your default database. Above, we have set our default database to SQL_Class. If we run a query without specifying the database, then Hive will assume the database is SQL_Class.
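For example, a minimal sketch (SQL_Class and Sales_Table are the sample names used in this book; adjust to your own schema):

-- Make SQL_Class the default database for this session.
USE SQL_Class;

-- The unqualified table name now resolves inside SQL_Class.
SELECT * FROM Sales_Table;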

List the Tables in a Database with the Show Tables Command

image

Use the SHOW TABLES command to list all of the tables in a database.
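For example, a minimal sketch (the IN clause is optional and lets you list another database's tables without switching to it):

-- List the tables in the current default database.
SHOW TABLES;

-- List the tables in a specific database.
SHOW TABLES IN SQL_Class;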

Show Basic Table Information with the Describe Command

image

Use the DESCRIBE command to get basic information about a table.
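For example, a minimal sketch using the Sales_Table name from earlier in the chapter; DESCRIBE returns the column names, data types and comments:

DESCRIBE Sales_Table;

-- The table can also be qualified with its database name.
DESCRIBE SQL_Class.Sales_Table;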

Show Detailed Table Information Using Describe Extended

image

Use DESCRIBE EXTENDED to get additional detailed information about a table (not everything listed above).
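For example, a minimal sketch; DESCRIBE FORMATTED is a common alternative that presents much of the same detail in a more readable layout:

DESCRIBE EXTENDED Sales_Table;

-- Similar information, laid out in labeled sections.
DESCRIBE FORMATTED Sales_Table;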

The Show Functions Command Lists all System Functions

image

The Show Functions command lists all system functions in the system (not everything listed above).
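For example, a minimal sketch; the full list runs to hundreds of built-in functions:

SHOW FUNCTIONS;

-- Many Hive releases also accept a pattern to narrow the list,
-- e.g. SHOW FUNCTIONS "xpath.*";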

Describe Function Command Provides Function Information

image

The Describe Function command provides function syntax and basic information about a function.
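For example, a minimal sketch using the built-in concat function:

-- Returns a one-line summary of the function's syntax.
DESCRIBE FUNCTION concat;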

Describe Function Extended Command Provides Details

image

The Describe Function Extended command provides function syntax and detailed information about a function.
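For example, a minimal sketch, again using the built-in concat function:

-- Returns the syntax plus extended detail, such as example usage.
DESCRIBE FUNCTION EXTENDED concat;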
