CHAPTER 12


Giraph in the Cloud

This chapter contains

  • A high-level architectural overview of cloud computing
  • An analysis of specific requirements graph processing presents in the cloud
  • A deep dive into using Giraph on a public cloud provided by Amazon

As you learned in previous chapters, Giraph leverages Hadoop clusters as its main execution framework and can fetch data from multiple different sources, not just HDFS. This makes Giraph applications ideally suited for running on a shared infrastructure provided by various public and private cloud computing platforms. Utilizing cloud computing services lets you focus on graph processing, while the vendors take care of things like deploying and operating scalable Hadoop clusters. This chapter begins by providing an overview of cloud computing services, outlining the ones ideally suited for making Giraph applications run as seamlessly as possible.

Since not all cloud computing vendors go as far as providing Hadoop clusters as a service, we will also briefly review the more basic, but still useful, infrastructure virtualization services. These types of services also come in handy in cases where precise control over the version of Hadoop available to Giraph is required.

Even though there are plenty of public cloud providers available today and the ecosystem of private cloud implementations is as robust as ever, this chapter mostly focuses on Amazon Web Services (AWS). We are focusing on AWS for two reasons. First of all, it is one of the most popular public clouds available. On top of that, having experience in running graph processing jobs on AWS makes using other public and private clouds a much more natural experience. Within AWS, we are going to focus on a set of APIs known as Elastic MapReduce, which lets you dynamically spin up an arbitrary number of virtual servers that appear to your Giraph application as a bona fide Hadoop cluster. Even though Giraph doesn’t come preinstalled, it is still the easiest way to get your application from a single-node execution to as many nodes as you wish (or are willing to pay for). All the complexities of installing, configuring, and managing Apache Hadoop are hidden from you.

A Quick Introduction to Cloud Computing

Consider for a moment the amount of effort involved in putting any graph processing application into production use at your company. At the very minimum, you would have to buy and install as much storage and computer hardware as required to accommodate your potential datasets and processing needs. Before you can turn that hardware into a fully functioning Hadoop cluster, you would have to spend time and money on managing power, cooling, and networking. And after that, you would still have to install and configure operating systems and Hadoop on every single node in your cluster. Only then could you start collecting data and running Giraph applications. Of course, to keep everything up and running, you must monitor anything (be it hardware or software) that could go wrong and maintain your cluster accordingly. In short, it means a lot of upfront investment and not much scalability. In other words, when your requirements for the size of datasets change, you are still stuck with the same cluster.

One way to avoid this initial investment and complexity is to run your Giraph applications in a cloud computing environment where someone else is taking care of setting up and managing all the infrastructure.

Simply put, cloud computing is a set of services built around the idea of providing various computing, software, and informational resources in a scalable and elastic fashion over the network. Another key aspect of cloud computing is self-service: any customer can request as many resources as an application requires without any human intervention. No longer do you have to wait for IT personnel to procure and deploy physical hardware in company datacenters: you can simply request resources online via well-defined API calls. A few API calls, for example, can give you a fully functioning Hadoop cluster accessible over the network.

WHAT KIND OF SERVICES CAN I EXPECT OUT OF CLOUD COMPUTING?

All the services provided by cloud computing vendors fall into three broad categories, each of which builds off the previous one:

  • Infrastructure as a Service (IaaS): A set of services delivering virtualized instances of the physical infrastructure, such as servers, storage, load balancers, network equipment, and so forth.
  • Platform as a Service (PaaS): A set of services delivering fundamental APIs on which applications can be built. Examples include execution runtimes, database instances, web and application servers, and so forth.
  • Software as a Service (SaaS): A set of services delivering end-user software applications over the network. The range of applications could be as vast as what is available on a desktop, but the most common examples are e-mail, communications, games, and virtual desktops.

Most cloud providers give you a combination of services from all three categories. For example, Amazon Web Services (AWS) can be utilized either as an IaaS, letting you create virtual datacenters of any size, or as a PaaS, giving you not just the bare servers, but things like scalable databases or Hadoop clusters. Running Giraph in the cloud requires IaaS capabilities, but not necessarily PaaS or SaaS. However, the presence of either PaaS or SaaS capabilities may provide additional sources of data to be analyzed by Giraph.

Cloud computing is typically associated with public clouds: services available to anybody over the public Internet. However, the agility and self-service nature of cloud computing is making it appear in enterprise companies’ private datacenters at a rapid pace. An enterprise datacenter used to be thought of as racks of servers running various operating systems. Today, chances are that it is run as a cloud computing service providing at least IaaS-level APIs. This is what is known as a private cloud.

It may be tempting to think that the APIs offered by different cloud computing solutions are compatible, and anything that you do on one cloud you could do on a different one. The good news is that conceptually this is more or less the case (and this is why this chapter only focuses on AWS). The bad news is that the set of APIs needed to achieve the same goal differs between clouds. Every cloud computing solution (public or private) typically comes with software libraries and tools that let you create and manage various resources.

Giraph on the Amazon Web Services Cloud

One of the primary services that the Amazon Web Services (AWS) cloud provides is the Elastic Compute Cloud (EC2). EC2 lets anyone instantiate any number of virtual servers from static virtual machine (VM) images; it can be considered the fundamental building block of the AWS IaaS layer. The VM images are called Amazon Machine Images (AMIs). You can instantiate any number of virtual servers from a single AMI by associating virtual resources (CPU, memory, I/O bandwidth, etc.) with it. Those servers are self-contained, virtualized operating systems that run somewhere in one of the AWS datacenters. They are called instances. Your account is charged based on the number of instances that are running each hour and the amount of resources (CPU, memory, local storage, network bandwidth, etc.) that each instance utilizes. The only way to interact with these instances is over the Internet. For example, you can decide to log in to them directly using SSH or run any kind of networked service on them, such as HTTP and so forth. All you need to know is the public IP address of each instance.

EC2 is a very simple, yet extremely scalable, service that you can use to deploy Hadoop clusters in the cloud manually. After all, once EC2 returns a set of IP addresses corresponding to each of the instances you created, you can deploy a Hadoop cluster on them exactly the same way you would do with physical hosts (refer to the Appendix for Hadoop deployment tips). In fact, when AWS first came out in 2006, doing it yourself was the only way of organizing instances into any kind of compute cluster.

Things changed when Amazon introduced a higher-level service called Elastic MapReduce (EMR) into its portfolio of AWS services. EMR was one of the first services that operated more like PaaS, rather than an IaaS-level service. With EMR, the details of managing individual hosts and configuring them into a Hadoop cluster are hidden from you. Still, an EMR cluster is composed of one or more instances and thus can be considered an IaaS-level offering. Instances that are part of an EMR cluster are derived from specific AMIs maintained by Amazon. Think of it as Amazon’s own Hadoop distribution packaged in the form of AMIs. There are multiple versions of that distribution (built on top of different versions of Hadoop) available for use. The entire cluster is instantiated and destroyed as a single action, without the need to track and manage all the nodes. In fact, you can even add nodes to the cluster on the fly to speed up computation or take advantage of Amazon’s pricing model.

Before you try to spin up the first EMR cluster, however, you need to do a bit of upfront configuration of the environment that would let you interact with AWS.

Before You Begin

Interacting with AWS requires that you set up an account with Amazon. Detailed instructions on how to do that are available from Amazon in their “Getting Started with AWS” tutorial. Once you have your account set up, you are able to log in to the AWS management console (a web application), use the command-line tools, and start making direct AWS API calls from within your business applications. The AWS management console is the easiest way to get started, since it offers an intuitive Web UI for common operations. That said, once you start running more and more jobs in the Amazon cloud, you will find yourself wanting a greater degree of automation and programmatic control. One option is to use a language-specific SDK to embed AWS cluster and infrastructure management logic directly into your business application. This offers the most precise level of control over cloud infrastructure usage, but comes at the price of investing in orchestrating all the interactions with AWS. The command-line (CLI) tools provide a convenient middle ground between the ad hoc nature of the Web UI and the precise control of a language-specific SDK. Even though there are a few different versions of CLI tools available, for the remainder of this chapter, you will be using Amazon’s own AWS CLI implementation.

All programmatic interactions with AWS require user authentication. This is done by providing two pieces of information: an access key ID (a string of characters similar to AKIAIOSFODNN7EXAMPLE) and a secret access key (a string of characters similar to wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY). Both of these can be requested from the Amazon AWS web console during (or after) the registration process. You can provide both of these to the AWS CLI tools either via environment variables or by running the aws configure command and entering the required information at the prompt.
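For example, assuming the AWS CLI is already installed, either approach looks roughly like this (the key values shown are the placeholder examples from the AWS documentation, not real credentials):

  # Option 1: environment variables picked up by the AWS CLI
  $ export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
  $ export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

  # Option 2: interactive configuration (values are stored under ~/.aws/)
  $ aws configure
  AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
  AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
  Default region name [None]: us-east-1
  Default output format [None]: json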

If you haven’t installed AWS CLI tools on your desktop, make sure to read and follow the instructions provided at http://docs.aws.amazon.com/cli/latest/userguide/. Keep in mind that, unlike everything else in this book, AWS CLI is implemented in the Python programming language, which may require you to set up a Python environment. The Amazon user guide provides instructions on how to do that. Once the Python environment is available, installing AWS CLI boils down to a few steps, which are outlined in Listing 12-1.
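On a system where Python and pip are already available, a minimal sketch of the installation looks like this:

  # install the AWS CLI from the Python Package Index
  $ sudo pip install awscli

  # verify that the aws command is available and working
  $ aws --version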

After giving aws configure your access and secret keys, and setting the default AWS region to US East, you’ll have almost everything set up and configured to instantiate your first cluster on the Amazon cloud. However, to remotely log in to individual nodes of the cluster, you have to set up a Secure Shell (SSH) key pair. This is the last bit of authentication information that you need to set up. Think of it this way: while the access key ID and secret access key are used to request services from AWS, once those services instantiate virtual infrastructure for you, an SSH key pair is needed to remotely log in to that infrastructure. SSH key pairs need unique names. You will call this one giraph-keys. If you have an SSH key pair by that name already, you can skip the steps in Listing 12-2.
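A sketch of those three commands:

  # create the key pair and save the private key material in the home directory
  $ aws ec2 create-key-pair --key-name giraph-keys \
        --query 'KeyMaterial' --output text > ~/giraph-keys.pem

  # restrict the permissions so that only the owning user can read the key
  $ chmod 400 ~/giraph-keys.pem

  # create the default IAM roles needed to work with EMR (a one-time action)
  $ aws emr create-default-roles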

The first command in Listing 12-2 requests creation of the SSH key pair and outputs the private key material into the file called giraph-keys.pem in the user’s home directory. The second command is required to make sure that the permissions on the file allow reading only for the user and no one else on the system. These are the default permissions that SSH expects. Finally, the last command makes sure that the user has rights to work with EMR; this needs to be executed only once for each user account.

One thing that you have probably already noticed is the common pattern of how AWS CLI tools are executed from the command line. First, you specify the subset of Amazon cloud APIs that you would like to operate on (e.g., EC2, EMR, etc.). That becomes your main command. Then you select an action (verb) that you need to perform on that subsystem (e.g., creating a key pair or listing the regions). That becomes your subcommand. Finally, you can specify flags that affect the semantics of the action via --flag options (e.g., give your key pair a name). The result of this command-line invocation is always one or a few AWS API calls that the AWS CLI makes on your behalf, returning the JSON response to you. There is no magic in how the AWS CLI calls the APIs; your own application can issue the very same API calls by using various language-specific SDKs.

With that in mind, let’s proceed with creating your very first functional Hadoop cluster on the Amazon cloud. Amazingly enough, all it takes is a single execution of the AWS CLI tools.

Creating Your First Cluster on the Amazon Cloud

Most of the actions required to manipulate Hadoop clusters on the Amazon cloud are provided by the emr command. You can run aws emr help to see which subcommands are available. You are using quite a few of them in the following sections, but for now, let’s create the most basic Hadoop cluster using the create-cluster subcommand, as shown in Listing 12-3. Note that, once again, you are using backslash characters to break lengthy single command lines for readability purposes.
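A sketch of such an invocation follows; the m1.medium instance type is only an example, and any instance type supported by EMR will do:

  $ aws emr create-cluster \
        --ami-version 2.4.11 \
        --no-auto-terminate \
        --ec2-attributes KeyName=giraph-keys \
        --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium \
                          InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.medium
  {
      "ClusterId": "j-DEADBEAF"
  }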

Congratulations! You have just instantiated your first cluster on the Amazon cloud. The cluster consists of one MASTER node and two CORE nodes (three nodes total); it is capable of running MapReduce jobs. The invocation you used to create the cluster is long, but pretty self-explanatory. You had to invoke the emr subcommand of aws and give it a verb asking to create a cluster: create-cluster. That action required you to specify the version of Hadoop that you wanted to use (--ami-version 2.4.11), and then the topology of your cluster (--instance-groups), describing the role of each node, the overall number of nodes assigned to a given role, and an AWS instance type associated with each node. You also asked not to terminate the cluster immediately (--no-auto-terminate) and provided a key pair so that you can log in to the cluster later on (--ec2-attributes KeyName=giraph-keys). This is the minimum amount of information that you need to give to AWS EMR to create a cluster. The result of running this command was a cluster ID (j-DEADBEAF) output formatted as JSON. Keep an eye on that cluster ID. You will be using it for a few other things in this chapter.

Now that the Hadoop cluster is up and running in the Amazon cloud, what can you do with it? It would be awesome if at this point you could run Giraph applications without doing much else. Unfortunately, Amazon EMR clusters don’t come with Giraph bits preinstalled the same way that they bundle Apache Hive, Pig, and some of the other Hadoop ecosystem projects. Before proceeding to making Giraph available on the cluster, let’s first make sure that you can at least run a simple MapReduce job on it.

The way you are going to launch your test MapReduce job on the freshly minted cluster is simply by logging into its gateway node and using a preinstalled hadoop command-line utility to launch a Pi job (Hadoop’s answer to HelloWorld in MapReduce), as shown in Listing 12-4. Note that setting up a cluster can take quite a bit of time. Hence, the first command may not give you a login on the cluster for several minutes while waiting for the cluster to be fully online.
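Such a session might look roughly like the following; the name and location of the Hadoop examples JAR is an assumption about the EMR AMI layout:

  # open an SSH session to the gateway (MASTER) node of the cluster
  $ aws emr ssh --cluster-id j-DEADBEAF --key-pair-file ~/giraph-keys.pem

  # the commands below execute on the remote gateway node
  % hostname
  % hadoop jar hadoop-examples.jar pi 10 1000000
  % exit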

What you did here was use the ssh action from the EMR collection of APIs. You supplied the cluster ID (--cluster-id) from the output given in Listing 12-3, and also the same key pair that you used to create the cluster in the first place (--key-pair-file). The result of running this command is a regular SSH session that gives you a shell on the gateway node of the cluster. All the commands prefixed with % are at that point executing on the remote gateway node. The first command you ran on the gateway printed its hostname to make sure that you’re really in the cloud. After that, you used the hadoop command-line utility to run the Pi calculation job, and you received an expected (if somewhat imprecise) result. The cluster definitely looks and feels like a real Hadoop cluster. The last thing you did was exit (exit command) the gateway node back to your workstation.

Please keep in mind that at this point, you have three nodes running on the Amazon cloud. Even though they are not doing any useful work, Amazon still charges you for them. Once you are done experimenting, make sure to destroy your cluster(s) and check that there are no leftover nodes running, as shown in Listing 12-5.
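Roughly, the cleanup boils down to two commands:

  # list all the clusters ever created under this account, along with their states
  $ aws emr list-clusters

  # terminate any cluster that is not yet in the TERMINATED state
  $ aws emr terminate-clusters --cluster-ids j-DEADBEAF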

The first command gives you a very detailed JSON output that describes all the clusters that Amazon is or has been maintaining on your behalf. You need to look for the ones that are not listed as TERMINATED and terminate them with the terminate-clusters action.

The Building Blocks of an EMR Cluster

Most of the actions that happen on the Amazon cloud can be decomposed into operations on virtual servers. Those virtual servers are called instances, and to a user of the Amazon cloud, they are indistinguishable from racks of real servers. Just like any real server, instances have a copy of an operating system running, and you can either get a remote shell into that operating system (via SSH) or interact with other software services (such as a web or a database service) running on these virtual servers. Unlike real servers, though, you can create as many instances as you are willing to pay for.

Each instance running on the Amazon cloud is defined by

  • An image of a fully configured operating system. This is known as an Amazon Machine Image (AMI) and essentially consists of the content of a disk drive from which an operating system boots.
  • The amount of physical resources (e.g., the number of CPUs, RAM, and disk size).

There are thousands of AMIs with all sorts of operating systems available on the Amazon cloud today. Some are maintained by operating system vendors and some are maintained by volunteers. Some are free (like the ones based on Linux) and some cost money to use (like the ones based on Microsoft Windows). And if you don’t find an AMI that works for you, Amazon gives you all the tools to maintain your own AMIs. If you are looking for your favorite operating system image, a good place to do it is on the Cloud Market web site at http://thecloudmarket.com. Keep in mind that the decision on which AMI to run is orthogonal to the amount of resources given to it.

While the Amazon cloud doesn’t allow you to specify an arbitrary number of resources to give to an instance, you can choose from a preselected set of templates. These templates are called instance types. Amazon provides a detailed overview of the different types of resources assigned to each instance type at http://aws.amazon.com/ec2/instance-types/. This is also a good place to visit if you want to understand the billing implications of running different instance types.

In order to facilitate management operations, instances can be grouped into chunks called reservations (or instance groups). All instances in a group are identical in terms of what image they use (AMI specification) and the number of resources given to each instance in a group (instance type). This makes it easy to tell the Amazon cloud to always maintain a number of identical servers. Regardless of whether you ask for one instance in a reservation or a hundred, you’re still creating a group.

These are the basic building blocks that Amazon’s IaaS layer provides you. By using them, you can either roll your own Hadoop cluster or you can ask Amazon’s services to do it for you.

The Composition of an EMR Cluster: Instance Groups

Building off the fundamental notion of an instance group, an Amazon EMR cluster consists of a number of nodes belonging to three different instance groups.

  • The MASTER instance group. This instance group always contains just one node that is configured to run all the centralized services of a Hadoop cluster: the HDFS NameNode and either the YARN ResourceManager or the MapReduce JobTracker. It is configured as a gateway so that you can manually launch ad hoc MapReduce or YARN jobs by logging into it using the aws emr ssh action. Amazon also makes sure that client-side bundled software, such as Hive, Pig, and so forth, is available by default. Unfortunately, Giraph is not part of this list and needs to be installed on this node separately. If you are running a cluster with a single node, that node must belong to this group. It will also run all the services from a CORE instance group.
  • The CORE instance group. This instance group contains one or more nodes that function as Hadoop slave nodes. The two fundamental services that each node in this group runs are HDFS DataNode (for storing data) and YARN NodeManager/MapReduce TaskTracker. The nodes in this group do all the data processing. Whenever you want to speed up your computation, you can dynamically add nodes to this group, but you cannot dynamically remove them.
  • The TASK instance group. This instance group is optional and contains nodes that function almost exactly as the nodes in the CORE instance group, with one exception: they do not run HDFS DataNode services and thus are not capable of processing data locally (the data always needs to be fetched from an external location). Because the nodes in this instance group are completely ephemeral (they don’t store HDFS blocks), you can increase and decrease the number of nodes in this group. This instance group is convenient when processing data that tends to reside in external repositories, such as Amazon’s Simple Storage Service (S3), or when you want to use spot instances to reduce operational costs.

Putting it all together, you can now make sense of the first two options given to the aws emr create-cluster subcommands (repeated here in Listing 12-6).

The first option (--ami-version) specifies which AMI to use for spinning up instances in all three groups. This is an indirect specification, since instead of referring to an AMI ID you are actually referring to the version of the EMR stack maintained by Amazon. Think of it as any other Hadoop distribution: you can only select the version of the distribution itself. You don’t get to choose versions of Apache Hadoop and its ecosystem components within the distribution. If you are curious about which versions of Amazon EMR stack map to which components, you can visit the web page at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/ami-versions-supported.html. The version that you used in the example corresponds to the Hadoop 1.0.x based stack. It must be noted that there is absolutely nothing mysterious about the AMI that Amazon maintains for you as part of a given Hadoop stack. You can create an instance of that AMI the same way you can create an instance of any other AMI with an arbitrary OS.

The second option that you are giving to the create-cluster command (--instance-groups) is an exact specification of each instance group’s size and type. It consists of one, two, or three lists of key-value pairs describing each of the three instance groups that you’re adding to the cluster (MASTER is the only mandatory group). The following keys are required.

  • InstanceGroupType defines the instance group. The values here are MASTER, CORE, or TASK.
  • InstanceCount defines the number of instances (nodes) in the group.
  • InstanceType specifies the instance type (the amount of resources each node is given) that every node in the group gets. The values here are any valid instance type specification recognizable by AWS.

The graphical representation of the cluster that you created is shown in Figure 12-1.


Figure 12-1. Cluster composed of two instance groups and three nodes

The white box in Figure 12-1 represents an AMI template from which actual instances are derived and placed into requested instance groups: one into a MASTER instance group and two into a CORE instance group.

Deploying Giraph Applications onto an EMR Cluster

As mentioned, Amazon doesn’t bundle Giraph as part of its Hadoop stack. It would be nice if that changed in the future (Amazon is known to listen to its customers’ requests), but given that Giraph is primarily a framework, rather than an application, it is not that big of a deal.

From a Hadoop cluster’s point of view, any graph processing application built on top of Giraph looks like a custom MapReduce or YARN job. With that in mind, let’s create a jar that contains one of the example applications from Chapter 5 and also bundles Giraph and all of its dependencies. This is pretty easy to do with Maven. All you need to do is tell Maven to create a jar-with-dependencies assembly, as outlined in Listing 12-7  (just make sure to go back to the folder you created for the example application in Listing 5-4).
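Assuming the example project’s pom.xml already declares the maven-assembly-plugin with the jar-with-dependencies descriptor, and assuming the resulting JAR ends up with the hypothetical name used below, the two commands might look like this:

  # build a self-contained JAR bundling the application, Giraph, and its dependencies
  $ mvn clean package

  # copy the resulting JAR to the gateway node of the EMR cluster
  $ aws emr put --cluster-id j-DEADBEAF --key-pair-file ~/giraph-keys.pem \
        --src target/GiraphHelloWorld-1.0-jar-with-dependencies.jar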

Your first command instructed Maven to build a self-contained JAR file that you then copied to the gateway node of the EMR cluster using the put action. As usual, you had to specify the cluster ID (--cluster-id) and key pair (--key-pair-file) to allow access to a given cluster. An extra argument that you had to give to the put action was the location of the JAR file to copy (--src).

At this point, you could have logged into the EMR cluster using the aws emr ssh command and executed the graph processing example the same way you executed the Pi estimation job in Listing 12-4. Instead, you will look into scheduling it remotely as one of the steps in the EMR cluster’s processing specification.

EMR Cluster Data Processing Steps

The most common use case for EMR clusters is to create them on demand, keep them running for as long as there are data processing steps to be executed, and then let the Amazon cloud tear the cluster down. Of course, you can request the cluster to stick around and wait for an explicit tear-down API call. This is exactly what you did with the --no-auto-terminate option. Having a cluster available for interactive use is convenient for exploring EMR. However, once you start moving more into full automation of your data pipelines, it becomes more natural to let the Amazon cloud manage the cluster life cycle.

If you want to rely on the Amazon cloud for end-to-end cluster life cycle management, there are a few additional bits of information you need to communicate to AWS when creating your cluster:

  • The set of additional applications to be installed on an EMR cluster. Currently, you can install Hive, Pig, HBase, Ganglia, Impala, and MapR distribution that way.
  • The set of bootstrapping actions you would like to perform after the instances are created but before the cluster gets started. This could include things like installing additional software and tweaking cluster or operating system configuration. Very simply put, bootstrap actions are scripts that reside on externally visible storage (S3, HTTP, etc.) and are run on the cluster nodes.
  • The set of data processing steps you would like to perform once the cluster is up and running. Currently these steps can include the following:
    • Running a custom MapReduce or YARN job from a self-contained jar
    • Running a custom script on the MASTER node
    • Running a Hadoop streaming job
    • Running a Hive, Pig, or Impala query

Using both bootstrapping and data processing actions follows a very similar pattern. Since you don’t need any special tweaks on the cluster, we will hold off showing example bootstrapping actions until later and will focus on data processing steps for now. An interesting side note is that while it is not possible to request additional bootstrapping actions for a cluster that is already up and running, it is possible to request additional data processing steps.

When using AWS CLI tools, both bootstrapping actions and data processing steps need to be communicated to the aws emr create-cluster command via either a JSON specification or a shorthand notation. Listing 12-8 uses a JSON specification to add a Giraph data processing step to an already running cluster.
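As a sketch, such an invocation might look like the following; the JAR location, entry point class, and application arguments (an input path, an output path, and a worker count) are purely illustrative, so substitute whatever your Chapter 5 application actually expects:

  $ aws emr add-steps --cluster-id j-DEADBEAF --steps '[
      {
        "Type": "CUSTOM_JAR",
        "Name": "GiraphHelloWorld",
        "ActionOnFailure": "CONTINUE",
        "Jar": "/home/hadoop/GiraphHelloWorld-1.0-jar-with-dependencies.jar",
        "MainClass": "GiraphHelloWorld",
        "Args": ["/user/hadoop/input/graph.txt", "/user/hadoop/output", "1"]
      }
    ]'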

The preceding command added a processing step for immediate execution on an already available cluster identified by --cluster-id. The description of processing steps is a well-formatted JSON array with one entry per processing step that you wish to run on the cluster. Each entry consists of the following key-value pairs:

  • Type: Defines the type of a processing step. In this case, you requested a custom jar execution via Java. There are other types of steps available, including the execution of Pig or Hive scripts, streaming jobs, and other actions. The CUSTOM_JAR type is the most flexible.
  • Name: A symbolic name identifying the processing step in all future outputs; it can be any string.
  • MainClass: For a CUSTOM_JAR type of step, identifies the name of the entry point class.
  • ActionOnFailure: Specifies what to do if the step fails. Valid values include CONTINUE, TERMINATE_CLUSTER, and CANCEL_AND_WAIT.
  • Jar: For a CUSTOM_JAR type of step, specifies the location of the JAR file.
  • Args: For a CUSTOM_JAR type of step, specifies the arguments to be given to the main method.

The output of the command gives you a step ID to refer to if you want to find out more about the status or result of the step(s). Finally, if you are wondering whether you could’ve avoided the lengthy JSON specification typed on the command line, the answer is yes. The --steps flag can read JSON from files, provided you give it a file URL instead of an actual JSON like this: --steps file://./step.json.

In general, adding data processing steps to an EMR cluster is pretty trivial. In fact, you can specify a bunch of steps when you request the cluster to be created by providing the very same --steps JSON specification to the create-cluster action. That way, you can have the Amazon cloud spin up a cluster, run a series of data processing steps, and at the end, tear down the cluster, thus minimizing the amount of time you have to pay for EC2 resources. You will try that later in this chapter, but for now, let’s see whether your processing step was successful or not. You can do it by running the query command shown in Listing 12-9 and looking for the State key in the JSON output.
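For example, assuming the step ID returned by the previous command was s-EXAMPLESTEP, one way to query its status is:

  # look for the Step.Status.State key (PENDING, RUNNING, COMPLETED, FAILED, ...)
  $ aws emr describe-step --cluster-id j-DEADBEAF --step-id s-EXAMPLESTEP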

Believe it or not, the data processing step failed. You now have to find out why this happened.

When Things Go Wrong: Debugging EMR Clusters

When anything goes wrong with your EMR cluster or a job that it was supposed to process, you need to be able to track what went wrong. Typically, your best option is to analyze the log files and try to deduce the source of failure. There are two sources of log files available for EMR clusters: aggregated logs on S3 and local log files on cluster nodes. On a cluster that is still up and running, you can ssh into a gateway node and go directly to the log files collected under /mnt/var/log/. For example, to figure out why the data processing step from the previous section failed, let’s take a look at /mnt/var/log/hadoop/steps. That folder contains subfolders corresponding to each data processing step submitted to a cluster. Underneath these subfolders, you see three log files:

  • controller: A log file containing the exact invocation of a data processing step. In this case, it has a custom jar Hadoop invocation command.
  • stdout: A log file capturing whatever the controller’s command produced on its standard output stream.
  • stderr: A log file capturing whatever the controller’s command produced on its error output stream.

First, let’s take a look at what the controller invoked when you submitted the Giraph data processing step via a custom JAR. You see something very similar to Listing 12-10.

This looks like exactly the same command that you would’ve submitted manually to run a Giraph workflow via the 'hadoop jar ...' command. Perhaps if you take a look at the error output that command produced, you will have a clue as to why it failed. What you see there is very similar to Listing 12-11.

Of course, this makes sense—you tried running a Giraph application on a nonexistent input data set. How to make datasets available to the EMR clusters and get the results is the subject of the next section.

Where’s My Stuff? Data Migration to and from EMR Clusters

EMR clusters are an ephemeral collection of nodes. They are created on demand to accomplish a certain sequence of data processing steps; they are typically destroyed immediately afterward. This is a convenient, cloud-friendly model of utilizing resources, but it does require that the initial dataset comes from some permanent storage location and a resulting dataset is somehow captured before the cluster and its HDFS storage layer are gone.

In general, the Amazon cloud relies on S3 as its permanent storage layer. Almost all AWS services support S3 as a data source or data sink. For example, AMIs are stored in S3; it can also be used as a destination for the logs of most AWS services. S3 stores all of its objects in buckets. You can think of buckets as top-level file system folders that you manage under your AWS account. Once you create a bucket, you can use it to create file-like objects holding any kind of data. Each object is referenced by an arbitrary, unique key that is no more than 1024 bytes long. It is convenient (but in no way enforced) to name your objects as though they were files in a filesystem with a traditional Unix path name convention of folders delimited by the / character. Following this convention makes it natural to reference S3 objects using either S3 or HTTP URIs. For example, if you create a bucket with the name giraph.examples and put an object named datasets/1/data.txt in it, you can then reference it as either s3://giraph.examples/datasets/1/data.txt or https://s3.amazonaws.com/giraph.examples/datasets/1/data.txt.

Keep in mind that bucket names share a global namespace on AWS. This means that you may need to get creative. Chances are, any simple name like mybucket is already taken. With that in mind, the first steps of supplying the Giraph application with data are to create a unique bucket and copy the example input data you used in Chapter 5 under it, as shown in Listing 12-12.
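A sketch of those two steps, assuming a bucket named giraph.examples is still available and the Chapter 5 input file is called graph.txt:

  # create a bucket (bucket names are globally unique, so yours may need to differ)
  $ aws s3 mb s3://giraph.examples

  # copy the example graph description from the local filesystem into the bucket
  $ aws s3 cp graph.txt s3://giraph.examples/datasets/1/graph.txt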

Now that the example graph description is available in S3, you are halfway to making it available for Giraph processing. Amazon EMR conveniently offers a custom JAR file that implements efficient data transfer from S3 buckets into HDFS, where Giraph can pick it up. The way that you’re going to trigger this transfer is by adding yet another step to the cluster, as shown in Listing 12-13. Note that this time you’re using a shorthand notation to specify all the required key-value pairs instead of using a properly formatted JSON object. Either way of doing it is fine.
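Using the shorthand notation, such a step might look roughly like this; the location of the S3DistCp JAR on the cluster nodes is an assumption about the EMR AMI layout:

  $ aws emr add-steps --cluster-id j-DEADBEAF --steps \
        'Type=CUSTOM_JAR,Name=CopyInputFromS3,ActionOnFailure=CONTINUE,Jar=/home/hadoop/lib/emr-s3distcp-1.0.jar,Args=[--src,s3://giraph.examples/datasets/1/,--dest,hdfs:///user/hadoop/input]'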

The arguments that you are giving to the custom JAR specify the source S3 bucket via --src and the destination in HDFS via --dest. Of course, this step can work in the other direction as well. If you have any datasets that you need to keep after the cluster is destroyed, you can simply add a step that copies a bunch of files from HDFS into an S3 bucket.

A word of caution: S3 is expensive and it holds your data until you explicitly delete it; use it only for final results. Any data that remains in S3 is charged to your account on a monthly basis based on its size.

At this point you could just rerun the graph processing step from Listing 12-8 to see that this time it runs to completion with an exit status of COMPLETED. Instead of doing this, however, let’s tie it all together to see how a true ephemeral cluster can be used for data processing.

Putting It All Together: Ephemeral Graph Processing EMR Clusters

At this point, you have everything you need to consider a graph processing job running on the Amazon cloud that is as close as possible to what you would actually use in production. You are sticking with your good, old friend the “Hello World” graph processing example, but you will make sure that the data and custom JAR file are uploaded to the ephemeral EMR cluster from S3 and the resulting data set is recorded in one of the S3 buckets. You will also make sure that all the logs are transferred to S3, in case you need to do any kind of diagnostics after the cluster disappears. Given what you need to do, the definition of the data processing steps gets to be pretty long. Instead of specifying it all on the command line, you’re going to use a JSON file, as shown in Listing 12-14.
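The contents of steps.json could look roughly like the following; the bucket name, paths, class name, and application arguments are the same hypothetical placeholders used earlier in this chapter, and the application is assumed to write its results to hdfs:///user/hadoop/output:

  [
    {
      "Type": "CUSTOM_JAR",
      "Name": "CopyInputFromS3",
      "ActionOnFailure": "TERMINATE_CLUSTER",
      "Jar": "/home/hadoop/lib/emr-s3distcp-1.0.jar",
      "Args": ["--src", "s3://giraph.examples/datasets/1/", "--dest", "hdfs:///user/hadoop/input"]
    },
    {
      "Type": "CUSTOM_JAR",
      "Name": "GiraphHelloWorld",
      "ActionOnFailure": "TERMINATE_CLUSTER",
      "Jar": "s3://giraph.examples/jars/GiraphHelloWorld-1.0-jar-with-dependencies.jar",
      "MainClass": "GiraphHelloWorld",
      "Args": ["hdfs:///user/hadoop/input/graph.txt", "hdfs:///user/hadoop/output", "1"]
    },
    {
      "Type": "CUSTOM_JAR",
      "Name": "CopyOutputToS3",
      "ActionOnFailure": "TERMINATE_CLUSTER",
      "Jar": "/home/hadoop/lib/emr-s3distcp-1.0.jar",
      "Args": ["--src", "hdfs:///user/hadoop/output", "--dest", "s3://giraph.examples/output/"]
    }
  ]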

An interesting side note is that since EMR clusters allow custom jars to come from S3, you don’t actually need to invoke an extra put action. All you need to do is to make the GiraphHelloWorld JAR available in S3, as shown in Listing 12-15.
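For example, reusing the same hypothetical JAR name:

  # make the self-contained application JAR available in S3
  $ aws s3 cp target/GiraphHelloWorld-1.0-jar-with-dependencies.jar \
        s3://giraph.examples/jars/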

Now that you have all the required files available to you in S3, the only thing left to do is start up an ephemeral EMR cluster and make it run through the data processing steps specified in steps.json, as shown in Listing 12-16, using the now-familiar create-cluster invocation. But you also need to add a location for where to store logs (--log-uri) and request that the cluster be terminated once the processing steps are done (--auto-terminate).
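Put together, the invocation might look roughly like this, with the logs going to the same hypothetical giraph.examples bucket:

  $ aws emr create-cluster \
        --ami-version 2.4.11 \
        --auto-terminate \
        --log-uri s3://giraph.examples/logs/ \
        --ec2-attributes KeyName=giraph-keys \
        --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium \
                          InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.medium \
        --steps file://./steps.json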

As before, the command outputs the cluster ID and you need to wait for this cluster to exit before you can inspect the overall execution logs. It is convenient to monitor the status of the cluster using the command shown in Listing 12-17. This command prints the status of the cluster based on the JSON output of the describe-cluster action processed by a custom JSON query, specified via the --query flag. This is a convenient way to cut down on the amount of JSON output you are getting. Refer to aws help for information on how JSON queries can be specified.
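One possible form of that command, using the cluster ID returned by the create-cluster invocation above (shown here as a placeholder):

  # print just the current state of the cluster (STARTING, RUNNING, TERMINATED, ...)
  $ aws emr describe-cluster --cluster-id j-NEWCLUSTERID \
        --query 'Cluster.Status.State'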

Once the cluster is terminated, all that you need to do is inspect the logs left in S3, make sure that the processing steps completed successfully, and retrieve the final output of the job. This can be done with a series of s3 commands, as shown in Listing 12-18. The first command copies the output of the Giraph application run to your workstation, under the file name output.txt, and the second is an example of how you can list the contents of a part of the bucket (as though it were a folder) looking for log files that may be of interest.
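A sketch of those two commands; the name of the output object is a guess at what the job produced, so list the output/ prefix first if you are unsure:

  # copy the job output from S3 to the local workstation
  $ aws s3 cp s3://giraph.examples/output/part-m-00001 output.txt

  # list everything under the logs prefix, looking for files of interest
  $ aws s3 ls s3://giraph.examples/logs/ --recursive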

At this point, all of your output data (including the logs) is available to you in S3. This is great for centralized storage, but it could end up being quite expensive if you don’t purge the logs that you no longer need and prune the output datasets.

Reducing cost while running on EMR is one of the top concerns, and storage costs can contribute greatly to the final bill. Amazon offers various attractive pricing models that make the usage of its EMR clusters cost-effective, however. One such model, Spot Instances, is reviewed in the next section.

Getting the Most Bang for the Buck: Amazon EMR Spot Instances

The pricing model of the Amazon cloud that you have seen so far assumed a fixed billing rate associated with every instance type. It is not, however, the only model. A different way to pay for using compute resources is to bid a certain amount of money on unused capacity within the Amazon cloud. This is applicable to all instance types, and the price per hour typically ends up being less than the flat billing rate of an instance type of the same kind. The downside to spot instances is that Amazon shuts them down when demand increases and your bidding price falls below the going rate. This is not a big deal for traditional MapReduce workloads (after all, Hadoop was built to withstand node failures), and it gives you a very convenient way to utilize cheap compute resources to speed up data processing.

A typical use for spot instances with EMR clusters is to make them part of the TASK instance group so that nodes can be added (and removed) at any time without disrupting the cluster operations. In order to indicate that an instance group needs to have spot instance billing enabled, all you need to do is add the BidPrice property to the instance group specification, as shown in Listing 12-19.
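A sketch of such an invocation, matching the cluster described next (one MASTER node, two CORE nodes, and three TASK nodes bid at a hypothetical $0.10 per instance hour):

  $ aws emr create-cluster \
        --ami-version 2.4.11 \
        --no-auto-terminate \
        --ec2-attributes KeyName=giraph-keys \
        --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium \
                          InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.medium \
                          InstanceGroupType=TASK,InstanceCount=3,InstanceType=m1.medium,BidPrice=0.10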

If you run the preceding command, the Amazon cloud immediately creates three nodes for you: one in the MASTER instance group and two in the CORE instance group. These nodes are used to manage storage (HDFS) on the cluster and provide a minimum guaranteed amount of compute resources available to the graph processing application. You may also expect an additional three nodes from the TASK instance group to be added to your cluster at any time (and also taken away from you at any time). For a traditional MapReduce application, these additional nodes give a temporary performance advantage. Giraph applications, however, present a slight complication to this model.

Most traditional MapReduce workloads don’t have hard requirements on the overall number of compute nodes available to mappers. If some of the nodes fail, most likely, it will result in a slower overall execution, but your job will finish anyway. Giraph uses mappers as parallel workers and expects a certain number of them (the value specified via the --w option) to be available at all times. If a node fails, Giraph expects Hadoop to reschedule a mapper on a different node, which is only possible if the overall number of compute resources available to a Giraph application is more than the number of workers it expects. This complicates spot instance usage with Giraph applications. After all, the whole point of spot instances is to provide a temporary boost to the performance of your cluster when the price is right and to scale back when the price increases.

Using spot instances with Giraph is considered an experimental feature; it is recommended only for advanced users. If you are interested in playing with it, make sure to pay attention to the following configuration properties, which you would have to explicitly specify in your configuration (either via the --ca command-line option or giraph-site.xml); a sketch of such an invocation follows the list.

  • giraph.maxWorkers: The total number of workers that Giraph can utilize on a cluster. Experiment with setting it according to the maximum number of nodes you expect Amazon to give your cluster (CORE + TASK).
  • giraph.minWorkers: The minimum number of workers that Giraph needs to proceed to the next superstep. Experiment with setting it according to the guaranteed number of nodes you expect Amazon to give to your cluster (CORE).
  • giraph.minPercentResponded: The minimum percentage of healthy workers needed to proceed to the next superstep. Experiment with setting it to about 100 × giraph.minWorkers ÷ giraph.maxWorkers.
  • giraph.checkpointFrequency: The number of consecutive supersteps between checkpoints of the worker state. Experiment with setting it to 1 and increasing according to the performance characteristics of your application.
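As a sketch, and assuming a GiraphRunner-style driver that accepts repeated -ca key=value flags, the overrides for the spot-enabled cluster above (two guaranteed CORE nodes plus up to three TASK nodes) might be passed like this:

  # hypothetical invocation; the JAR name, class name, and trailing application
  # arguments are placeholders, and 40 = 100 x 2 / 5 per the formula above
  % hadoop jar GiraphHelloWorld-1.0-jar-with-dependencies.jar GiraphHelloWorld \
        -ca giraph.maxWorkers=5 \
        -ca giraph.minWorkers=2 \
        -ca giraph.minPercentResponded=40 \
        -ca giraph.checkpointFrequency=1 \
        <application arguments>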

At this point, you have seen all the different ways to run vanilla EMR clusters. But what if you need to optimize price, performance, or both to best suit your graph processing application?

One Size Doesn’t Fit All: Fine-Tuning Your EMR Clusters

So far you have been using a vanilla configuration of an EMR cluster without paying too much attention to how well it is optimized to run your graph processing jobs. This is a fine first step (and these simple examples don’t require much else anyway), but for any real workload, some level of tuning is required. There are two related reasons why you want to tune your EMR cluster: performance and cost.

Every Giraph application is different, but given its flexible architecture, any application utilizing Giraph can stress all three basic resources: CPU, memory, and I/O. All the performance tuning advice that you saw in the previous chapter still applies to running Giraph in the cloud. Of course, you need to view it through the lens of actually being charged for all resource usage. Thus, identifying the minimum amount of resources that still gives you adequate performance becomes the key.

The most rewarding choice you can make on EMR is the right instance size for your job. As mentioned, EMR offers dozens of different instance sizes, and guessing the right one can be intimidating. The best advice here is to start on the small side, deploy performance monitoring tools such as Ganglia, and then experiment with different instance sizes to find the best trade-off of cost vs. performance. Keep the instance type reference table at http://aws.amazon.com/ec2/instance-types/ handy and do a few experiments to see whether you’re on the right track.

Some workloads exhibit the best performance when run on instances with a lot of RAM. If you fall in that category, make sure to tell EMR to configure your cluster for such an environment. In general, any kind of cluster-specific tweak can be achieved as part of the bootstrapping actions for the cluster. Bootstrapping actions are nothing but scripts that are executed as part of cluster bring-up and can affect Hadoop configuration files. For example, the easiest way to configure Hadoop for a memory-intensive workload is to add a bootstrap action referencing Amazon’s own script, as shown in Listing 12-20.
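For example, such a bootstrap action might be added like this; the S3 path of Amazon’s memory-intensive configuration script is an assumption and may have changed, so check the EMR documentation for the current location:

  $ aws emr create-cluster \
        --ami-version 2.4.11 \
        --no-auto-terminate \
        --ec2-attributes KeyName=giraph-keys \
        --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium \
                          InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.medium \
        --bootstrap-actions Path=s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive,Name=MemoryIntensive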

Since bootstrapping actions are the main tool for tweaking Hadoop configuration, it is recommended that you download the memory-intensive script mentioned in Listing 12-20 and use it as a reference for creating your own bootstrapping actions.

Another fundamental variable you need to consider is the number of instances in CORE and TASK instance groups. Once again, EMR makes it easy to start small and dynamically adjust the size of the TASK instance group to see if it helps speed up the processing.

Finally, when working with any AWS service, make sure that everything that you do is confined to the same AWS region and, better yet, the same availability zone. Cross-region data transfers are costly and provide much less bandwidth compared to the I/O within the same region. The best-case scenario is for all the elements of your data processing cluster to be part of the same availability zone.

Summary

Running Giraph in the cloud provides a scalable, elastic, and easy-to-use way of executing graph processing applications without worrying about managing the infrastructure. All the cluster configuration is done for you, while still leaving enough flexibility to make any needed tweaks. Not all clouds are created equal. While this chapter only talked about the Amazon cloud, the same core principles are applicable to other public and private clouds. If your cloud provider is not Amazon, at least you now know what services to look for.
