Hadoop for big data

Apache Hadoop is a 100 percent open source software framework used for two fundamental tasks: storing and processing big data. It has been the leading big data tool for distributed parallel processing of data stored across multiple servers, and it can scale from a single machine to clusters of thousands of nodes. Because of its scalability, flexibility, fault tolerance, and low cost, many cloud-based solution vendors, financial institutions, and enterprises use Hadoop for their big data needs.

The Hadoop framework contains modules that are critical to its functions: the Hadoop Distributed File System (HDFS), Yet Another Resource Negotiator (YARN), and MapReduce.

HDFS

HDFS is the file system native to Hadoop. It is designed to be scalable and portable, and it can store files spanning gigabytes or terabytes of data across the many nodes of a Hadoop cluster. Files stored in the cluster are split into blocks, typically 128 MB each, which are distributed throughout the cluster. The MapReduce data processing functions are then performed on these smaller subsets of the larger dataset, providing the scalability needed for big data processing. HDFS uses TCP/IP socket communication to serve the data over the network as one large file system.
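
As a rough illustration, the following sketch uses the Hadoop Java client API (org.apache.hadoop.fs.FileSystem) to write a file to HDFS, read it back, and print the block size recorded for it. The NameNode address and paths are hypothetical placeholders; in a real deployment they come from the cluster's configuration files.

// Minimal sketch, assuming a Hadoop 3.x client on the classpath.
// The NameNode URI and file path below are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally supplied by core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);

        // Write a small file. Larger files are split into blocks (typically 128 MB)
        // that HDFS distributes across the DataNodes of the cluster.
        Path path = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("Hello, HDFS");
        }

        // Read the file back and inspect the block size used for it.
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
        FileStatus status = fs.getFileStatus(path);
        System.out.println("Block size: " + status.getBlockSize() + " bytes");

        fs.close();
    }
}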

YARN

YARN is a resource management and scheduling platform that manages the CPU, memory, and storage for applications running on a Hadoop cluster. It contains the components responsible for allocating resources among applications running within the same cluster while obeying constraints such as queue capacities and user limits; scheduling tasks based on the resource requirements of each application; negotiating appropriate resources from the scheduler; and tracking and monitoring the progress of running applications and their resource usage.
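
As a hedged sketch of how these pieces are exposed to client code, the following Java snippet (assuming the Hadoop 3.x YarnClient API) connects to the ResourceManager, lists the memory and CPU that each running node offers to the scheduler, and reads the configured capacity of a queue. The queue name "default" is only an assumption for illustration.

// Minimal sketch, assuming the Hadoop 3.x YARN client API and a reachable ResourceManager.
import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.api.records.QueueInfo;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnInspect {
    public static void main(String[] args) throws Exception {
        YarnClient client = YarnClient.createYarnClient();
        client.init(new YarnConfiguration()); // reads yarn-site.xml from the classpath
        client.start();

        // Resources (memory, vcores) each running NodeManager offers to the scheduler.
        List<NodeReport> nodes = client.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId() + " capacity: "
                    + node.getCapability().getMemorySize() + " MB, "
                    + node.getCapability().getVirtualCores() + " vcores");
        }

        // Queue capacity constraints the scheduler enforces when allocating resources.
        // The queue name "default" is an assumption for this example.
        QueueInfo queue = client.getQueueInfo("default");
        System.out.println("Queue capacity: " + queue.getCapacity()
                + ", currently used: " + queue.getCurrentCapacity());

        client.stop();
    }
}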

MapReduce

MapReduce is the software programming paradigm of Hadoop used for processing and generating large datasets. For a developer, MapReduce is probably the most important programming component in Hadoop. It is made up of two functions: map and reduce. The map function processes a key-value pair to generate a set of intermediate key-value pairs. The reduce function merges all intermediate values that share the same intermediate key and produces a result. Rather than moving data over the network to the processing software, MapReduce brings the processing software to the data.

MapReduce jobs are predominantly written in Java. Other languages, such as Python, can also be used through the Hadoop Streaming utility, while SQL-style queries are typically supported by higher-level tools that compile down to MapReduce jobs.
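
The canonical illustration of the paradigm is a word count. The sketch below, assuming the standard org.apache.hadoop.mapreduce Java API, maps each input line to intermediate (word, 1) pairs and reduces all pairs sharing the same word into a total count; the input and output paths are supplied on the command line.

// Minimal word-count sketch, assuming the org.apache.hadoop.mapreduce API (Hadoop 2.x/3.x).
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: for each input line, emit an intermediate (word, 1) pair per word.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: merge all intermediate values sharing the same key into a total count.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that the combiner reuses the reducer class, so partial sums are merged locally on each node before intermediate pairs cross the network, which reflects the same principle of bringing the processing to the data described above.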
