Appendix A: Install and Configure Giraph and Hadoop

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

APPENDIX A

Install and Configure Giraph and Hadoop

This chapter covers

System requirements for running Giraph and Hadoop
Installation methods for Giraph and Hadoop
Different modes for running Hadoop for Giraph applications
Configuring Hadoop for all three different running modes

Throughout this book, it was assumed that you had a working Hadoop installation available to run examples and experiment with Giraph. Since Giraph runs on top of Hadoop, having a working Hadoop environment is a fundamental prerequisite. This appendix begins by looking into system dependencies for Hadoop and Giraph deployments. It proceeds to describe various methods of installing Hadoop and Giraph, showing their relative strengths and weaknesses. Next, it looks into the different types of Hadoop deployments and how Giraph deals with the different versions of Hadoop ecosystem projects that it needs to leverage. Finally, this appendix outlines the basics of configuration management for both Hadoop and Giraph and discusses which Hadoop configuration is required for running it in different execution modes.

System Requirements

Both Giraph and Hadoop are implemented in the Java programming language and can run on Unix-based, Max OS X, and Windows systems. Giraph requires Oracle JDK version 7 or higher. Since Giraph jobs are executed by the same JVM that runs the Hadoop framework, this puts a lower bound on the JDK version that can be used for Hadoop deployment. What this means is that if you have an existing Hadoop cluster that you need to use for running Giraph applications, you have to make sure that it was deployed using JDK 7 or above.

You also need JDK installed on the host where Giraph applications will be launched and on the host that is going to be used for Giraph application development. Both Giraph and Hadoop have been wildly tested on Oracle’s JDK, with OpenJDK (an open source, community-driven version of Oracle’s JDK) a close second choice for deployment. If you don’t have JDK installed and you want to use Oracle’s version, go to www.java.com/jdk and follow the installation procedures for your operating system. If, on the other hand, you decide to go the OpenJDK route, you may find it bundled for you operating system by a vendor.

Regardless of how you install JDK, make sure that the location of installation tree is available to all your applications via environment variable JAVA_HOME. If you want that location to also supply the binaries for all Java command-line utilities (including launching JMV itself), you may want to update your PATH with that setting. On Unix platforms, it is often convenient to set up those values in a global shell startup file such as /etc/profile, ~/.bashrc or ~/.bash_profile, similar to what is shown in Listing A-1.

Listing A-1. Setting up JAVA_HOME and Adding Java Binaries to Your PATH

JAVA_HOME=/usr/lib/jvm/default-java
export JAVA_HOME

PATH=$JAVA_HOME/bin:$PATH

Note When it comes to the operating systems, various flavors of Linux are the most widely tested deployment platforms. While it is possible to run Giraph and Hadoop on Mac OS X and even Windows, these platforms are mostly used for development.

Hadoop Installation

Unless you already have a Hadoop cluster available to run Giraph applications on (either on-premises or as a cloud-based service offering), your first step is to decide which major version of Hadoop you would like to use. Currently, there are two major versions available: Hadoop 1.x and Hadoop 2.x. Both of these are considered stable and can be safely used in production. The difference between the two boils down to Hadoop 1.x slowly transitioning into a maintenance mode with very little development activity going on. In contrast to that, Hadoop 2.x development activity remains very high, with bug fixes and new feature development progressing at a brisk pace. Another fundamental difference between these two versions is the architecture of the MapReduce framework. While Hadoop 1.x offers a faithful implementation of MapReduce framework as it is described in the original Google paper, Hadoop 2.x takes it one step further, essentially providing a MapReduce v2 implementation as an application sitting on top of a general-purpose resource scheduler called YARN.

What was wrong with MapReduce v1? The original implementation of MapReduce (now known as MapReduce v1) made an architectural decision of conflating MapReduce-specific logic with lower-level cluster resource management and scheduling. While this provided the fastest route to a functional implementation (and paved the way for Hadoop’s world domination), it also suffered from a number of technical limitations: scalability concerns, failure tolerance, and difficulty in running non-MapReduce frameworks on the same clusters among the top issues. Indeed, as you saw in Chapter 6, Giraph implementation has to trick MapReduce v1 into thinking that it is running as a generic map-reduce application, while in reality, Giraph applications don’t map at all into the generic MapReduce model.

Hadoop 2.x tries to fix these limitations by providing low-level cluster resource management and scheduling capabilities as an independent layer called YARN (Yet Another Resource Negotiator) and running the MapReduce framework on top of YARN. This architecture makes it possible for other distributed frameworks to run side by side with MapReduce applications, without the need to pretend to map into the MapReduce model. It is worth repeating that this is purely an implementation change; all of the existing MapReduce v1 applications remain compatible with MapReduce v2 APIs. Even though MapReduce workloads still dominate Hadoop 2.x clusters today, there’s a robust interest in porting other distributed computation frameworks to run on top of YARN: Giraph, Apache Spark (in-memory engine for data processing), and Hamster (OpenMPI) are just a few examples.

Throughout this book, you used Hadoop 1.x implementation to run examples via map-only MapReduce jobs. The very same workflow still applies to running on top of Hadoop 2.x MapReduce implementation. Even though you haven’t explored running on YARN (since it is still considered somewhat experimental), if you decide on the Hadoop 2.x installation, you can experiment with Giraph as a YARN client in addition to running Giraph via MapReduce. Regardless of which version of Hadoop you choose, your next step is installing the binary Hadoop distribution on your host(s).

Unless you have to worry about installing Hadoop on a large cluster of Linux hosts, the easiest way to get the binary distribution of Hadoop is to download the stable release packaged as a gzipped tar file from the Apache Software Foundation Hadoop release page at http://hadoop.apache.org/releases.html.

Unpack the resulting file in a subdirectory somewhere in your filesystem (you will use that same subdirectory later for installing Giraph), as shown in Listing A-2.

Listing A-2. Installing Hadoop 1.2.1 on a Unix or Max OS X Workstation

$ mkdir ~/dist
$ cd ~/dist
$ tar xzf hadoop-1.2.1.tar.gz

At this point, you have a binary installation of Hadoop and you need to expose the location of that installation to all the other command-line utilities that you are going to run. This is achieved by setting a few environment variables, as shown in Listing A-3 (don’t forget that you can add them to your profile files the same way you could have added JAVA_HOME earlier).

Listing A-3. Setting up an Environment to Run Hadoop Applications

$ HADOOP_HOME=~/dist/hadoop-1.2.1
$ export HADOOP_HOME
$ PATH=$HADOOP_HOME/bin:$PATH
$ hadoop version
Hadoop 1.2.1
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2
Compiled by mattf on Mon Jul 22 15:23:09 PDT 2013
From source with checksum 6923c86528809c4e7e6f493b6b413a9a
This command was run using /Users/shapor/dist/hadoop-1.2.1/hadoop-core-1.2.1.jar

The last command in Listing A-3 proves that the Hadoop installation was successful by running the hadoop command-line utility and seeing the expected output. This is the easiest way to make sure that your Hadoop was installed correctly and can find Java on your system.

Note Install Hadoop as described on all machines in the cluster.

Even though the preceding method of installing Hadoop is extremely easy and it should work on most operating systems, you may want to install from a binary distribution using a package manager. Doing so will guarantee that both Giraph and Hadoop binaries are coming from the same distribution and that they were integrated and tested together to work side by side. The following sections provide additional information on what it takes to go this route.

Giraph Installation

Now that you’ve completed your Hadoop installation, it may be tempting to think that installing Giraph on your systems should be even simpler. The good news is that since Giraph happens to be a client-side-only application, it doesn’t need to be installed on all the hosts of your cluster. It is sufficient to make Giraph installation available on a machine in a cluster from which MapReduce or YARN jobs are typically submitted (these machines are known as gateway or edge nodes). The bad news is that unlike with Hadoop, choosing the ready-made Giraph binary distribution that is compatible with the rest of the Hadoop ecosystem deployed on your cluster may be non-trivial.

The thorny issue here is dependencies. At the very minimum, Giraph has a fundamental dependency on Hadoop; but realistically, its set of dependencies is at least as big as the number of Hadoop ecosystem projects Giraph I/O formats have to interface with. Apache Hadoop ecosystem projects are fortunate enough to have many contributions to their code base from different stakeholders in the open source and business communities. The high pace of project development resulted in quite a few major refactoring cycles of the code base and also produced quite a few commercial offerings based on different points in the Hadoop evolution history.

In general, the Hadoop ecosystem development community needs to be praised for paying a lot of attention to backward compatibility at the public API level. Just like the Linux kernel made it taboo to break user-land application by introducing changes to the public APIs, Hadoop has had a good track record in not overly upsetting the writers of the applications. Modulo bugs, a MapReduce application written against Hadoop 1.x (and not using any private or evolving APIs) should be able to run on Hadoop 2.x unmodified. The catch, however, is that it requires a recompilation of application Java code.

What this means for binary releases of Giraph is that at build-time they need to target the exact same version of Hadoop (and Hadoop ecosystem projects) that will be deployed on the cluster at runtime. At the very minimum, Giraph has to publish two binary releases: one built against Hadoop 1.x and another build against Hadoop 2.x. That, however, still doesn’t account for differences in other dependencies, such as Hive, HBase, and so forth.

This particular issue is not specific to Giraph; it is known as combinatorial explosion of dependencies: the number of permutations of various versions of dependencies grow exponentially with the number of dependencies.

So far, the software engineering community has developed two different strategies for dealing with the combinatorial explosion of dependencies:

Building binaries from source code as part of the software installation process
Releasing complete stacks (or binary software distributions) of tightly integrated components instead of providing independent binary artifacts of individual components and expecting any combination of versions to work with each other

Before you deep-dive into the detailed descriptions of these strategies, let’s consider the fact that the first strategy is really a subset of the second one. The decision of building your Giraph installation from source effectively means that you’re embarking on a mission of producing your own binary software distribution with exact versions of components deployed to your cluster.

Software projects vs. software stacks It is interesting to note that, in general, Apache Software Foundation tries to stay away from binary distributions of its projects, leaving this responsibility to downstream packagers. For the established projects (e.g., Apache HTTP server), these packagers are typically distributors of the various operating systems (Linux, etc.). OS distributors make sure that all of the various ASF software projects that end up in their particular version of an OS distribution work smoothly with each other. Hadoop and its ecosystem projects, however, haven’t been on the agenda of operating system packagers.

To fill this void, various commercial vendors of Hadoop distributions began offering fully integrated sets of packages that don’t require recompilation and are known to work well with each other. Most of these commercial distributions are based on the work done at Apache Bigtop: a 100% open source, community-driven Big Data management distribution of Apache Hadoop. For most of the users of Hadoop and its ecosystem projects, installing a fully integrated distribution from either Apache Bigtop or a commercial vendor is the easiest way to get started. For those with existing clusters, compiling Giraph from source code to match the exact versions of dependencies deployed on the cluster may be the only option.

Installing the Binary Release of Giraph

As mentioned, installing a binary release is the easiest option, but also the most limiting one. If you decide to go this route, you are essentially making the Giraph binary release dictate which versions of Hadoop and the Hadoop ecosystem components you’ll be using. While it is unlikely to be useful for practical work—outside of quickly setting up an environment for running examples in this book, it is simple and very similar to the Hadoop installation process described at the beginning of this appendix.

As with Hadoop, there are two binary releases of Giraph available: one built against the latest version of Hadoop 1.x and the other against the latest version of Hadoop 2.x. And just like a binary release of Hadoop, binary releases of Giraph are packaged as gzipped tar files. Each released version of Giraph includes two binary artifacts: a Giraph binary release that is targeting Hadoop 1.x (available for download as giraph-dist-X.Y.Z-bin.tar.gz) and a Giraph binary release targeting Hadoop 2.x (available for download as giraph-dist-X.Y.Z-hadoop2-bin.tar.gz). Both of these gzipped tar files are available for download from the Giraph project website at http://giraph.apache.org/releases.html.

Because this book has been using examples of Giraph 1.1.0 running on Hadoop 1.2.1, the binary you need to grab is giraph-dist-1.1.0-bin.tar.gz. Once you download the binary, make sure to unpack it on the same workstation that you previously installed Hadoop. Note how the top-level folder that is created after you unpack the archive has an exact version of Hadoop 1.x embedded in its name, which looks like giraph-1.1.0-for-hadoop-1.2.1. Also, remember that on a real cluster, this is a gateway node. If you don’t know where your gateway node is, ask your Hadoop administrator. Regardless of whether you are installing on your laptop or a gateway node, you have to go through the series of steps shown in Listing A-4.

Listing A-4. Installing Giraph 1.1.0 Built to Run on Hadoop 1.2.1

$ cd ~/dist
$ tar xzf giraph-dist-1.1.0-bin.tar.gz
$ GIRAPH_HOME=~/dist/giraph-1.1.0-for-hadoop-1.2.1
$ export GIRAPH_HOME
$ PATH=$GIRAPH_HOME/bin:$PATH
$ giraph
   Usage: giraph [-D<Hadoop property>] <jar containing vertex>
                 <parameters to jar>
   At a minimum one must provide a path to the jar containing the vertex to be executed.

As you can see, Listing A-4 is very similar to what you have done with Hadoop (see Listing A-3), and as with Hadoop, the last line in the listing proves that Giraph is installed and ready to go.

Installing Giraph As Part of a Packaged Hadoop Distribution

Almost all real-world deployments of Hadoop clusters happen on the Linux OS in a form of native Linux packages (DEB or RPM) that come from a binary software distribution of Hadoop. There are two sources for those packages: commercial vendors of Hadoop distributions and Apache Bigtop. Given that almost all commercial Hadoop distributions are derived from Apache Bigtop, you will focus on it in this section. The difference between Bigtop and commercial distributions boils down to support (you cannot buy support for Bigtop) and how quickly new versions of Hadoop ecosystem components are incorporated into the packaging (Bigtop tends to run ahead of the commercial distributions).

Installing Hadoop and its ecosystem projects as part of a packaged Hadoop distribution on Linux guarantees that every component has been integrated and is known to work with every other component coming from the same distribution. Not only that, but the packages are also guaranteed to be well-integrated with the underlying Linux OS by following the packaging guidelines of a given flavor of Linux. The end result is that working with Hadoop installed this way is no different from working with any piece of system software that came bundled with the Linux OS. The only downside to this method of installation is the fact that it is limited to Linux (although the Max OS X brew port is in the works) and that it requires elevated superuser privileges. Make sure to talk to your system administrator (or consult your Linux OS documentation) so that your account can run commands under elevated privileges using sudo(8).

The first step in enabling Hadoop installation from Apache Bigtop distribution consists of telling your Linux repository manager the URL where the packages can be found. Navigate to http://archive.apache.org/dist/bigtop/stable/repos/. Make sure that you find the folder corresponding to your Linux OS flavor and download the repository definition file named bigtop.XXX (where XXX is an extension specific to the Linux flavor that you are using). Once you get the file, make sure to copy it to the location where repository manager of your Linux OS looks for definitions of external repositories (don’t forget to use sudo(8) so that you can copy into the system location). Table A-1 summarizes the location of repository definition files for various Linux flavors.

Table A-1. Locations of Repository Definitions

Linux Flavor	Folder Where Repo File Needs to Be Copied
Debian and Ubuntu	/etc/apt/sources.list.d
CentOS, RHEL and Fedora	/etc/yum/repos.d
SUSE and OpenSUSE	/etc/zypp/repos.d

Once you add the Bigtop repository definition file to one of the folders listed in Table A-1, the next step is to import a repository key. Importing the key allows the package manager to make sure it is installing genuine packages. Since the repository key establishes trust, always make sure to download it from the secure https://dist.apache.org/repos/dist/release/bigtop/KEYS and store it in the current directory. Once that is done, all that is left is adding the key and refreshing the repository definition cached locally. After that, you can install Hadoop and Giraph using the usual means of package installation on Linux. The required steps are summarized in the Table A-2, with the rows again corresponding to different flavors of Linux.

Table A-2. Installing Giraph via Linux Binary Packages

Debian and Ubuntu	CentOS, RHEL and Fedora	SUSE and OpenSUSE
$ sudo apt-key add < KEYS $ sudo apt-get update $ sudo apt-get install giraph	$ sudo rpm --import KEYS $ sudo yum clean metadata $ sudo yum install giraph	$ sudo rpm --import KEYS $ sudo zypper update $ sudo install giraph

$ sudo apt-key add < KEYS

$ sudo apt-get update

$ sudo apt-get install giraph

$ sudo rpm --import KEYS

$ sudo yum clean metadata

$ sudo yum install giraph

$ sudo rpm --import KEYS

$ sudo zypper update

$ sudo install giraph

An interesting side effect of running that last command is that the dependency between Giraph and Hadoop is properly recognized and the right Hadoop package is implicitly installed on your system. In a way, once you decide to install Giraph via Linux packages, your best (and easiest!) option for installing Hadoop is to install it in a packaged form as well. In fact, you don’t even have to execute that step explicitly: the correct package of Hadoop is fetched.

In case you are wondering what you need to do to set environment variables HADOOP_HOME and GIRAPH_HOME to after installing Giraph via Linux packages, the good news is that you don’t have to worry about those anymore. You can simply run the hadoop and giraph command-line utilities the same way you would run any executable on your Linux system: just type its name.

There is no denying that installing from Linux packages is by far the easiest way to install both Giraph and Hadoop. If you choose one of the commercial vendors, you can sign up for professional support. The downside, however, is that by installing the prepackaged bits, you are giving up your right to have a precise combination of Giraph and Hadoop versions that are unique to your environment. If you find yourself needing to match Giraph to the custom versions of Hadoop and Hadoop ecosystems components, installing Giraph by building from source code may be your only option.

Installing Giraph by Building from Source Code

The source code of Giraph is packaged as a gzipped tar file and available for at the same project web site that you used for downloading binaries: http://giraph.apache.org/releases.html. For the Giraph version 1.1.0, get the file named giraph-dist-1.1.0-src.tar.gz and unpack it somewhere on your development workstation.

The Giraph build infrastructure is managed by Apache Maven. If you decide to install Giraph by building it from the source code, you have to make sure that Maven version 3 or higher is available in your environment. If you don’t have Maven available, make sure to follow installation instructions provided by the project’s web site at http://maven.apache.org. The rest of this appendix assumes that you have set up Maven and that you can successfully run the mvn command-line utility.

Note You will install Giraph by building it from source code in situations where you have a preexisting Hadoop cluster with the versions of dependencies (including Hadoop itself) not matching those of packaged Giraph binaries. Although building Giraph against arbitrary versions of dependencies is possible, be warned that it may very well be that such a combination has never been tested.

Giraph’s build infrastructure provides a convenient way for specifying exact version of dependencies that Giraph needs to be built for. You don’t need to download any of these dependencies or otherwise make them available in your environment—Maven does it for you. You are expected to use Maven build profiles for specifying the major version of Hadoop and when you want to build Giraph as a YARN client rather than a map-only MapReduce application. The most commonly used build profiles are summarized in Table A-3.

Table A-3. Commonly Used Maven Build Profiles

Profile Name	Profile Effect
hadoop_1	Produces a build that is compatible with Hadoop 1.x
hadoop_2	Produces a build that is compatible with Hadoop 2.x
hadoop_yarn	Produces a YARN-based Giraph build (assumes Hadoop 2.x)

Using a Maven build profile lets you preset the majority of properties to the desired values, while still allowing surgical overrides to match minor and micro versions of dependencies. Whereas there are dozens of properties that you can tweak while building Giraph from source code, the most commonly used ones are summarized in Table A-4.

Table A-4. Commonly Used Properties for Specifying Exact Versions of Giraph Dependencies

Property Name	Property Effect
hadoop.version	Used for specifying the exact version of Hadoop dependency
dep.accumulo.version	Used for specifying the exact version of Accumulo dependency
dep.hbase.version	Used for specifying the exact version of HBase dependency
dep.hcatalog.version	Used for specifying the exact version of HCatalog dependency
dep.hive.version	Used for specifying the exact version of Hive dependency
dep.zookeeper.version	Used for specifying the exact version of ZooKeeper dependency

Starting with specifying a profile and then all the desired version properties allows you to have very fine-grained control over the resulting Giraph binary. For example, suppose your cluster was running Hadoop 2.7.1 and HBase 0.98. The way to build YARN-aware Giraph compatible with your cluster dependencies is outlined in Listing A-5. Before running this command, however, keep in mind that a particular set of dependencies that you are requesting may not have been tested. The implications of this could range from build failure, unit test failures, or runtime failures. Once you deviate from the beaten path, you are on your own, although it may just work. Thus, don’t be dismayed if unit tests fail. Some of those unit tests happen to be sensitive to the versions of dependencies they have been developed against; they can be turned off by passing –DskipTests to the Maven build.

Listing A-5. An Example of Building Giraph Targeting Hadoop 2.7.1 and HBase 0.98

$ mvn –Phadoop_yarn –Dhadoop.version=2.7.1 –Ddep.hbase.version=0.98 -DskipTests

After the build is done, you find the binary artifact very similar to the one you downloaded in previous chapters under the giraph-dist/target folder. Look for the gzipped tar file there and follow the steps described in the previous section to untar and install that custom version of Giraph on your gateway node. Don’t forget to set up GIRAPH_HOME and to add Giraph binaries to your $PATH variable the same way that you did in Listing A-4.

As you have seen, so far there are a number of different ways to install Hadoop and Giraph on a system. Of course, after you install them, you then have to configure them. You can’t really use the bits for much of anything before you configure. This is the subject of the next section.

Fundamentals of Hadoop and Hadoop Ecosystem Projects Configuration

Almost all members of the Hadoop ecosystem (including Hadoop itself) share the common configuration management system that is based on flat XML files encoding a set of key-value property settings. The folders where these configuration files are located happen to be project and installation method specific. For any project installed via Linux packages (DEB or RPM), the location of configuration files is under /etc/<project name>/conf. If, however, the installation was manual, the same configuration files are located under the project’s HOME folder in the conf subfolder. Where you will find configuration files for Hadoop and Giraph depend on the installation method, which is summarized in Table A-5.

Table A-5. Location of Configuration Files by Installation Method

Installation Method	Hadoop Configuration Folder	Giraph Configuration Folder
Linux packages	/etc/hadoop/conf	/etc/giraph/conf
Manual	$HADOOP_HOME/conf	$GIRAPH_HOME/conf

Inside of these folders you will find one or more XML files, similar to the example outlined in Listing A-6. Note that whereas Giraph requires just a single configuration file called giraph-site.xml, Hadoop spreads its configuration over at least three different files—core-site.xml, hdfs-site.xml, and mapred-site.xml—with yarn-site.xml being an additional file for configuring YARN as part of Hadoop 2.x.

Listing A-6. Example Hadoop Configuration File

<configuration>
   <property>
    <name>prop.name</name>
    <value>prop.value</value>
    <description>Free form description of what this property does</description>
  </property>
</configuration>

Configuring Giraph

Strictly speaking, by default, Giraph doesn’t actually require any settings in its XML configuration file. Anything that may be a required configuration option (such as specifying the number of workers) can be passed to Giraph via its command-line utility. Pretty soon, though, constantly typing all of the required options on the command line gets frustrating, at which point you can start leveraging giraph-site.xml to store the ones specific to your environment (just make sure to put it in the folder listed in Table A-5). For example, if you find yourself running Giraph applications on top of Hadoop configured in local mode, you may find it useful to have a Giraph configuration file similar to the one in Listing A-7.

Listing A-7. Example Giraph Configuration File giraph-site.xml

<configuration>
   <property>
      <name>giraph.SplitMasterWorker</name>
      <value>false</value>
      <description>This lets Giraph app run in a single task</description>
   </property>
</configuration>

Even though Giraph has its own configuration file, it still depends on the Hadoop client to be correctly configured via Hadoop-specific configuration files. Configuring Hadoop and matching the Hadoop client is the subject of the next section.

Configuring Hadoop

Fundamentally, Hadoop jobs can run in one of three modes:

On a fully distributed Hadoop cluster with a client configured on the gateway node
On a pseudo-distributed Hadoop cluster with a client co-located
In a local Hadoop configuration with the client mocking the cluster-specific Hadoop machinery

The vast majority of examples in this book assumed you were running the Giraph application on top of Hadoop configured in local mode. Local mode is the simplest one to set up. It doesn’t require any active processes running anywhere and it makes sure that the Giraph job runs within a single JVM with everything else. Instead of using HDFS, local mode reads and writes files to a local filesystem, which makes it even easier to use. You don’t even have to configure anything to enable local mode, since it also happens to be the default (thus no XML configuration files or empty files would enable it).

Local mode is great for development and debugging your Giraph application, but it also has a few limitations. First of all, it can only run one task at a time, which requires you to limit the number of Giraph workers to one, and to combine master and worker into the same task. Perhaps an even bigger limitation is that local mode execution doesn’t hit the same code path that is normally triggered when running on a real Hadoop cluster. Keep this in mind when debugging Giraph applications—you may see differences in behavior. Finally, even though local mode is supposed to be the fastest (at least on tiny datasets), there’s still quite a bit of work that Hadoop has to do to bootstrap the MapReduce framework. This can lead up to a 20-second delay before Giraph code starts to execute.

The total opposite of a local mode is, of course, a fully distributed Hadoop cluster. This is the usual way of running a Giraph application in production; it assumes that a fully distributed Hadoop cluster is available. Giraph acts as a pure client and you need to submit Giraph jobs from a cluster gateway node. The Hadoop configuration available on a gateway node needs to match the configuration that is being used by Hadoop running on all the other nodes of the cluster. This is typically done by a Hadoop cluster administrator; it doesn’t require any explicit configuration by the users of the cluster. All the administrator needs to make sure of is that the contents of the Hadoop configuration folder is kept in sync on all the nodes in the cluster, including the gateway node.

Configuring Hadoop in Pseudo-Distributed Mode

A really nice compromise between a somewhat limiting local mode and a heavyweight, fully distributed mode requiring a lot of setup is Hadoop’s pseudo-distributed configuration. In this mode, you can run your jobs in an environment as close to the real cluster as possible, but without the hassle of setting up a multinode cluster (although as you’ve seen in Chapter 12, that hassle can be mitigated by using Hadoop cloud services). Pseudo-distributed mode runs all the processes (daemons) that a real cluster would have on the same host. These processes run in different JVMs and communicate over the loopback network interface. Other than that, the configuration is identical to what you would see on a real cluster. Pseudo-distributed mode is a nice middle ground between local and fully distributed modes.

The easiest way to enable a pseudo-distributed mode is by installing Hadoop on Linux from packages. There is a special package called hadoop-conf-pseudo that pulls all the right dependencies and presets the configuration parameters needed for pseudo distributed mode in /etc/hadoop/conf. Simply installing that package is all that is needed.

In the situations where installing Hadoop from Linux packages is not an option, an alternative is to manually install the Hadoop binary under the $HADOOP_HOME folder (the way that it was described earlier in this appendix), and make sure that the properties summarized in Table A-6 are set in the appropriate configuration files.

Table A-6. Hadoop Configuration Required for Pseudo-Distributed Mode

TabA-6

Since pseudo-distributed mode required a few processes to run, the final step after configuration is to launch the necessary services and initialize their state. First, you need to start HDFS. Use the instructions provided in Table A-7, depending on whether you installed from Linux packages or from a binary gzipped tar file. Of course, if you didn’t install Hadoop from packages, don’t forget to add $HADOOP_HOME to your $PATH, as was shown in Listing A-3.

Table A-7. Starting HDFS in Pseudo-Distributed Mode

Hadoop Installed from Linux Packages	Hadoop Installed from Binary gzipped tar File
$ sudo service hadoop-hdfs-namenode format $ sudo service hadoop-hdfs-namenode start $ sudo service hadoop-hdfs-datanode start $ sudo –u hdfs hadoop fs –chmod 777 /	$ hadoop namenode –format $ hadoop-daemon.sh start namenode $ hadoop-daemon.sh start datanode $ hadoop fs –chmod 777 /

$ sudo service hadoop-hdfs-namenode format

$ sudo service hadoop-hdfs-namenode start

$ sudo service hadoop-hdfs-datanode start

$ sudo –u hdfs hadoop fs –chmod 777 /

$ hadoop namenode –format

$ hadoop-daemon.sh start namenode

$ hadoop-daemon.sh start datanode

$ hadoop fs –chmod 777 /

Note The last command makes the root subdirectory of HDFS readable and writable by anyone. This would be a huge security issue on a real, fully distributed, multitenant cluster. On a pseudo-distributed setup, however, it allows you to shortcut a few setup steps without compromising the functionality of HDFS.

At this point, you should have a fully functional HDFS. You can verify that it is up and running by issuing the commands shown in Listing A-8.

Listing A-8. Verifying That You Can Copy Files into and out of Pseudo-Distributed HDFS

$ hadoop fs –put /etc/hosts /test.file
$ hadoop fs –cat /test.file
$ hadoop fs –rm /test.file

As long as all the commands ran without an error, you can be sure your pseudo-distributed HDFS setup is fine. The final step in setting up your pseudo-distributed Hadoop cluster is starting MapReduce services (for Hadoop 1.x) or YARN services (for Hadoop 2.x). The commands needed are summarized in Tables A-8 and A-9, respectively.

Table A-8. Starting MapReduce in Pseudo-Distributed Mode

Hadoop Installed from Linux packages	Hadoop Installed from Binary gzipped tar File
$ sudo service hadoop-mapred-jobtracker start $ sudo service hadoop-mapred-tasktracker start $ cd /usr/lib/hadoop	$ hadoop-daemon.sh start jobtracker $ hadoop-daemon.sh start tasktracker $ cd $HADOOP_HOME

$ sudo service hadoop-mapred-jobtracker start

$ sudo service hadoop-mapred-tasktracker start

$ cd /usr/lib/hadoop

$ hadoop-daemon.sh start jobtracker

$ hadoop-daemon.sh start tasktracker

$ cd $HADOOP_HOME

Table A-9. Starting YARN in Pseudo-Distributed Mode

Hadoop Installed from Linux Packages	Hadoop Installed from Binary gzipped tar File
$ sudo service hadoop-yarn-resourcemanager start $ sudo service hadoop-yarn-nodemanager start $ cd /usr/lib/hadoop-mapreduce	$ hadoop-daemon.sh start resourcemanager $ hadoop-daemon.sh start nodemanager $ cd $HADOOP_HOME/share/hadoop/mapreduce

$ sudo service hadoop-yarn-resourcemanager start

$ sudo service hadoop-yarn-nodemanager start

$ cd /usr/lib/hadoop-mapreduce

$ hadoop-daemon.sh start resourcemanager

$ hadoop-daemon.sh start nodemanager

$ cd $HADOOP_HOME/share/hadoop/mapreduce

Regardless of whether you are setting up Hadoop 1.x or Hadoop 2.x, once you are done with these commands, make sure to run a test MapReduce job while you are still located in the current working directory (where the last cd command put you). If the command shown in Listing A-9 runs successfully to completion, it means your pseudo-distributed Hadoop cluster is fully set and ready for your Giraph applications.

Listing A-9. Verifying That You Run MapReduce Jobs on Pseudo-Distributed Hadoop Cluster

$ hadoop jar *examples*jar pi 10 1000
....
Job Finished in 35.513 seconds
Estimated value of Pi is 3.14080000000000000000

Summary

You can install Hadoop and Giraph in a number of different ways. The key issue to keep in mind is that whichever version of Hadoop you install, it has to match the version Giraph binary that you will be using. If you plan to use I/O formats connecting Giraph to other members of the Hadoop ecosystem (such as Hive, HBase, etc.), those versions have to match as well. After installation is done, the last step is to configure both. Giraph doesn’t require any configuration by default. Hadoop has to be configured to run in one of the three modes: local, fully distributed, or pseudo-distributed. In this appendix you looked at the following topics:

System requirements for running Hadoop and Giraph: JDK 7+ on Linux or Mac OS X
The steps involved in installing Hadoop using two main methods: a binary gzipped tar file or Linux binary packages
The steps involved in installing Giraph using three different methods of installation: a binary gzipped tar file, Linux binary packages, or building from source code
Configuring Hadoop and Giraph using local flat XML configuration files
Executing simple command lines to make sure that your Giraph and Hadoop deployments are functioning correctly

Although this appendix offered a number of alternative options for accomplishing two basic tasks— installation and configuration, if you’re installing everything from scratch, you can simply choose whichever is the easiest for your environment. If, however, you are dealing with existing Hadoop clusters, you may find some of the alternative methods helpful.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Appendix A: Install and Configure Giraph and Hadoop

Create new playlist

Sign In

Sign Up

Table of Contents for
Appendix A: Install and Configure Giraph and Hadoop