Apache Hadoop

The first, and probably still the most widely used, framework for big data processing is Apache Hadoop. Its foundation is the Hadoop Distributed File System (HDFS). Developed at Yahoo! in the 2000s, HDFS originally served as an open source alternative to the Google File System (GFS), the distributed filesystem Google built for distributed storage of its search index.

Hadoop also includes its own MapReduce implementation, Hadoop MapReduce, an open source alternative to Google's proprietary system. Together with HDFS, it constitutes a framework for distributed storage and computation. Written in Java, with bindings for most programming languages, and surrounded by many projects that expose simpler, higher-level abstractions (some based on SQL querying), it is a system that can reliably store and process terabytes, or even petabytes, of data.
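
To make the programming model concrete, the following is a minimal sketch of the canonical word-count job written against Hadoop's Java MapReduce API. The class names (WordCount, TokenizerMapper, IntSumReducer) are illustrative choices, and the input and output paths are taken from the command line; treat it as a sketch rather than a production job.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the counts gathered for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // The reducer doubles as a combiner to pre-aggregate on the map side.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a JAR, such a job is typically launched with the hadoop jar command, with its input and output paths pointing at directories on HDFS.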

In later versions, Hadoop became more modular with the introduction of Yet Another Resource Negotiator (YARN), which provides an abstraction layer on top of which applications can be developed. This has enabled several frameworks to be deployed on top of Hadoop, such as Storm, Tez, OpenMPI, Giraph, and, of course, Apache Spark, as we will see in the following sections.
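
To give a flavor of that abstraction, the sketch below uses YARN's client API to list the applications known to a cluster's ResourceManager. It assumes a running cluster whose configuration (yarn-site.xml) is on the classpath; the class name ListYarnApps is an illustrative choice, not part of Hadoop itself.

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
  public static void main(String[] args) throws Exception {
    // Assumes yarn-site.xml on the classpath points at a running ResourceManager.
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();
    try {
      // Each report describes one application submitted to the cluster:
      // a MapReduce job, a Spark application, and so on.
      for (ApplicationReport report : yarnClient.getApplications()) {
        System.out.printf("%s\t%s\t%s%n",
            report.getApplicationId(),
            report.getName(),
            report.getYarnApplicationState());
      }
    } finally {
      yarnClient.stop();
    }
  }
}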

Hadoop MapReduce is a batch-oriented system: each job reads a complete dataset, processes it, and writes the results back to disk, so it is not designed for real-time or low-latency use cases.
