Introduction

RHadoop is a collection of R packages that enables users to process and analyze big data with Hadoop. Before learning how to set up RHadoop and put it into practice, we have to understand why we need to scale machine learning to big data.

In the previous chapters, we have mentioned how useful R is for data analysis and machine learning. In traditional statistical analysis, the focus is on analyzing historical samples (small data), which may ignore rarely occurring but valuable events and lead to uncertain conclusions.

The emergence of cloud technology has made real-time interaction between customers and businesses much more frequent; the focus of machine learning has therefore shifted to developing accurate predictions for individual customers. For example, businesses can provide real-time personalized recommendations or online advertisements based on personal behavior via a real-time prediction model.

However, if the data (for example, the behavior of all online users) is too large to fit in the memory of a single machine, you have no choice but to use a supercomputer or some other scalable solution. The most popular scalable big-data solution is Hadoop, an open source framework that can store data and perform parallel computation across clusters of machines. As a result, you can use RHadoop, which allows R to leverage the scalability of Hadoop to process and analyze big data. RHadoop consists of five main packages:

  • rmr: This is an interface between R and Hadoop MapReduce, which calls the Hadoop streaming MapReduce API to perform MapReduce jobs across Hadoop clusters. To develop an R MapReduce program, you only need to focus on the design of the map and reduce functions; the remaining scalability issues are taken care of by Hadoop itself (a minimal sketch follows this list).
  • rhdfs: This is an interface between R and HDFS, which calls the HDFS API to access the data stored in HDFS. The use of rhdfs is very similar to that of the Hadoop shell, allowing users to manipulate HDFS easily from the R console (see the rhdfs sketch after this list).
  • rhbase: This is an interface between R and HBase, which accesses the data stored in HBase across clusters through a Thrift server. You can use rhbase to read/write data and manipulate tables stored within HBase.
  • plyrmr: This is a higher-level abstraction of MapReduce that allows users to perform common data manipulation in a plyr-like syntax. This package greatly lowers the learning curve of big data manipulation (see the plyrmr sketch after this list).
  • ravro: This allows users to read and write Avro files in R, enabling R to exchange data with HDFS.
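
To get a feel for how rmr works, the following is a minimal sketch of an R MapReduce job, assuming the package is installed under its current name, rmr2, and that a working Hadoop environment is already configured (the actual setup is covered later in this chapter):

    library(rmr2)

    # Write a small sample vector to HDFS so that MapReduce can read it
    small.ints <- to.dfs(1:10)

    # A MapReduce job that squares each input value; the map function
    # emits key-value pairs, and Hadoop handles the distribution for us
    result <- mapreduce(
      input = small.ints,
      map   = function(k, v) keyval(v, v^2)
    )

    # Pull the results back from HDFS into the R session
    from.dfs(result)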
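
Likewise, here is a sketch of manipulating HDFS through rhdfs, assuming the HADOOP_CMD environment variable points to your hadoop binary; the file paths used are purely illustrative:

    library(rhdfs)

    # Initialize the connection to HDFS (requires HADOOP_CMD to be set)
    hdfs.init()

    # List the contents of the HDFS root directory, much like hadoop fs -ls /
    hdfs.ls("/")

    # Copy a local file into HDFS; both paths here are hypothetical
    hdfs.put("/tmp/local_file.txt", "/user/hduser/remote_file.txt")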
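
Finally, a sketch of plyrmr's plyr-like syntax, here filtering the built-in mtcars data frame; the same expression applied to an input stored in HDFS would be translated into a MapReduce job by the rmr backend:

    library(plyrmr)

    # Filter rows with a plyr-like verb; run locally here on a plain
    # data frame, but the identical syntax works on data kept in HDFS
    where(mtcars, cyl > 4)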

In this chapter, we will start by preparing the Hadoop environment so that you can install RHadoop. We will then cover the installation of three main packages: rmr, rhdfs, and plyrmr. Next, we will introduce how to use rmr to perform MapReduce from R, operate on HDFS files through rhdfs, and perform common data operations using plyrmr. Further, we will explore how to perform machine learning using RHadoop. Lastly, we will look at how to deploy multiple RHadoop clusters on Amazon EC2.
