Hadoop distributions

Hadoop comes in many different flavors: there are multiple versions and multiple distributions available from a number of companies. This section covers the key players in this area today and the options they provide.

Hadoop versions

Hadoop's release versioning system is, to say the least, confusing. There are several branches with different stable versions available, and it is important to understand what features each branch provides (or excludes). At the time of writing, the following Hadoop versions are available: 0.23, 1.0, and 2.0. Surprisingly, higher versions do not always include all the features of the lower versions. For example, 0.23 includes NameNode High Availability and NameNode Federation, but drops support for the traditional MapReduce framework (MRv1) in favor of the new YARN framework (MRv2).

MRv2 is compatible with MRv1 on the API level, but the daemon setup, configuration, and concepts are different. Version 1.0 still includes MRv1, but lacks the NameNode HA and Federation features, which many consider critical for production usage. Version 2.0 is actually based on 0.23 and has the same feature set, but will be used for future development and releases.

One of the reasons that Hadoop release versions do not seem to follow straightforward logic is that Hadoop is still a relatively new technology. Many features that are highly desirable to some users can introduce instability, and sometimes they require significant changes in code and approach, as was the case with YARN. This leads to many different code branches with different stable release versions, and to a lot of confusion for the end user.

Since the purpose of this book is to guide you through planning and implementing a production Hadoop cluster, we will focus on stable Hadoop versions that provide proven solutions such as MRv1, but also include important availability features for the NameNode. As you can see, this narrows down the choice of a Hadoop release version right away.
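To make the narrowing-down concrete, the following minimal sketch encodes the feature sets of the three Apache releases exactly as described above and filters them by a set of required features. The dictionary layout and the function name releases_with are illustrative assumptions, not part of any Hadoop tool.

```python
# Feature sets of the Apache releases as described in the text.
# The data layout and names below are illustrative only.
APACHE_RELEASES = {
    "0.23": {"NameNode HA", "NameNode Federation", "MRv2"},
    "1.0": {"MRv1"},
    "2.0": {"NameNode HA", "NameNode Federation", "MRv2"},
}


def releases_with(required, releases=APACHE_RELEASES):
    """Return the sorted names of releases that provide every required feature."""
    return sorted(
        name for name, features in releases.items() if required <= features
    )


# Releases that offer NameNode HA.
print(releases_with({"NameNode HA"}))
# Releases that offer both the proven MRv1 framework and NameNode HA.
print(releases_with({"MRv1", "NameNode HA"}))
```

The second query returns an empty list: under the feature sets listed here, no single Apache release combines MRv1 with NameNode HA.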

Choosing a Hadoop distribution

Apache Hadoop is not the only distribution available. Several companies maintain their own forks of the project, both free and proprietary. You have probably already begun to see why this makes sense: streamlining the release process for Hadoop and combining features from several Hadoop branches makes it much easier for the end user to implement a cluster. One of the most popular non-Apache distributions of Hadoop is the Cloudera Hadoop Distribution, or CDH.

Cloudera Hadoop distribution

Cloudera is a company that provides commercial support, professional services, and advanced tools for Hadoop. Their CDH distribution is free and open source under the same Apache 2.0 license. What makes CDH appealing to the end user is that there are fewer code branches, version numbers are aligned, and critical bug fixes are backported to older versions. At this time, the latest major CDH release is CDH4, which combines features from the Apache 2.0 and 1.0 releases. It includes NameNode HA and Federation and supports both MRv1 and MRv2, a combination that none of the Apache releases offers at the moment. Another valuable feature that CDH provides is the integration of different Hadoop ecosystem projects. HDFS and MapReduce are the core components of Hadoop, but over time many new projects have been built on top of them. These projects make Hadoop more user-friendly, speed up development cycles, make it easier to build multitier MapReduce jobs, and so on.

One of the projects available in CDH that is gaining a lot of attention is Impala, which allows running real-time queries on Hadoop, bypassing the MapReduce layer completely and accessing data directly from HDFS. With dozens of ecosystem components, each with its own compatibility requirements, and a variety of Apache Hadoop branches, integration is not an easy task. CDH solves this problem for you by providing core Hadoop and most of the popular ecosystem projects, tested for compatibility with each other, in one distribution. This is a big advantage for the user, and it has made CDH the most popular Hadoop distribution at the moment (according to Google Trends). In addition to CDH, Cloudera also distributes Cloudera Manager, a web-based management tool to provision, configure, and monitor your Hadoop cluster. Cloudera Manager comes in both free and paid enterprise versions.

Hortonworks Hadoop distribution

Another popular Hadoop distribution is the Hortonworks Data Platform (HDP), by Hortonworks. Like Cloudera, Hortonworks provides a pre-packaged distribution of core and ecosystem Hadoop projects, as well as commercial support and services for it. At the time of writing, the latest stable version is HDP 1.2, and HDP 2.0 is in the alpha stage; they are based on Apache Hadoop 1.0 and 2.0, respectively.

HDP 1.2 provides several features that are not included in the CDH or Apache distributions. Hortonworks implemented NameNode HA on Hadoop 1.0, not by backporting JournalNodes and Quorum-based storage from Apache Hadoop 2.0, but rather by implementing cold cluster failover based on Linux HA solutions. HDP also includes HCatalog, a service that provides an integration point for projects like Pig and Hive. Hortonworks is betting on integrating Hadoop with traditional BI tools, an area that attracts a lot of interest from existing and potential Hadoop users. HDP includes an ODBC driver for Hive, which is claimed to be compatible with most existing BI tools.

Another unique HDP feature is its availability on the Windows platform. Bringing Hadoop to the Windows world will have a big impact on the platform's adoption rates and can make HDP a leading distribution for this operating system. Unfortunately, the Windows version is still in alpha and can't be recommended for production usage at the moment. When it comes to cluster management and monitoring, HDP includes Apache Ambari, a web-based tool similar to Cloudera Manager, but 100 percent free and open source, with no distinction between free and enterprise versions.

MapR

While Cloudera and Hortonworks provide the most popular Hadoop distributions, they are not the only companies that use Hadoop as a foundation for their products. MapR is another company that provides a Hadoop-based platform. There are several versions of their product: M3 is a free version with limited features, while M5 and M7 are enterprise-level commercial editions. MapR takes a different approach than Cloudera or Hortonworks. Apart from M3, their software is not free, but it has some features that can be appealing to enterprise users. The major difference between the MapR platform and Apache Hadoop is that, instead of HDFS, it uses a proprietary filesystem called MapR-FS. MapR-FS is implemented in C++ and provides lower latency and higher concurrency of access than the Java-based HDFS. It is compatible with Hadoop on the API level, but it is a completely different implementation. Other MapR-FS features include the ability to mount the Hadoop cluster as an NFS volume, cluster-wide snapshots, and cluster mirroring. Obviously, all these features rely on the MapR-FS implementation.

As you can see, the modern Hadoop landscape is far from simple, and there are many options to choose from. The list becomes much shorter, however, once you consider the requirements for a production cluster. A production Hadoop version needs to be stable and well tested. It needs to include important components, such as NameNode HA and the proven MRv1 framework. For you, as a Hadoop administrator, it is important to be able to install Hadoop easily on multiple nodes, without having to handpick the required components and worry about their compatibility. These requirements will quickly draw your attention to distributions like CDH or HDP. The rest of this book focuses on the CDH distribution, as it is the most popular choice for production installations right now. CDH also provides a rich feature set and good stability. It is worth mentioning that Hadoop 2 got its first GA release while this book was in progress. Hadoop 2 brings in many new features such as NameNode High Availability, which were previously available only in CDH.
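The selection reasoning above can be sketched as a small check that reports which required components each candidate lacks. This is only an illustration under stated assumptions: the feature sets are taken from the descriptions in this chapter, MRv1 support in HDP 1.2 is assumed from the fact that it is based on Apache Hadoop 1.0, and stability and ease of installation are not modeled at all. The names REQUIRED, CANDIDATES, and missing_features are hypothetical.

```python
# Components that the text names as necessary for a production cluster.
REQUIRED = {"MRv1", "NameNode HA"}

# Feature sets as described in this chapter (illustrative, not exhaustive).
CANDIDATES = {
    "Apache Hadoop 1.0": {"MRv1"},
    "Apache Hadoop 2.0": {"NameNode HA", "NameNode Federation", "MRv2"},
    "CDH4": {"NameNode HA", "NameNode Federation", "MRv1", "MRv2"},
    # Assumption: HDP 1.2 is based on Hadoop 1.0, so MRv1 is taken as included.
    # Its NameNode HA is the cold failover built on Linux HA solutions.
    "HDP 1.2": {"MRv1", "NameNode HA"},
}


def missing_features(candidates=CANDIDATES, required=REQUIRED):
    """Map each candidate name to the sorted list of required features it lacks."""
    return {
        name: sorted(required - features)
        for name, features in candidates.items()
    }


for name, missing in missing_features().items():
    if missing:
        print(f"{name}: lacks {', '.join(missing)}")
    else:
        print(f"{name}: provides all required components")
```

Under these assumptions, only CDH4 and HDP 1.2 come out with nothing missing, which matches the conclusion that the requirements point toward distributions like CDH or HDP.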
