Chapter 7: Graph IO Formats

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

CHAPTER 7

Graph IO Formats

This chapter covers

Graph representations
Reading input graphs in different formats
Saving the result of your analysis

So far, you have familiarized yourself with the concept of graphs and their constituents—vertices and edges—through a variety of use cases. In Chapter 5, you learned how Giraph uses the corresponding Vertex and Edge objects to represent a graph and to programmatically manipulate Vertex and Edge objects and compute useful graph metrics. But before Giraph creates Vertex and Edge objects for you to use, a graph is stored on a storage system in different formats: for instance, plain text files or a binary format. Similarly, you may require the result of a Giraph computation to be stored in different output formats.

In this chapter, you learn about the tools Giraph provides for reading input graphs and writing output, called input formats and output formats, respectively. Giraph uses input formats to convert an input graph into Vertex and Edge objects that you can then manipulate programmatically; it uses output formats to save the result of your analysis. When you launched your first Giraph job in Chapter 5, you were already using specific implementations of the input and output formats. Here, you get to know in detail how input and output formats work.

More important, you learn how to write your own input and output format implementations. Because graphs may be stored in different ways, it is difficult to provide an input format suitable for every possible storage system or graph format. Giraph exports a programming interface that you can implement to build support customized to your specific scenario. This chapter first discusses the different ways graphs can be represented and then teaches you how to implement input and output format interfaces to read graphs and output results in various formats. At the end of this chapter, you will have completed your knowledge of the basic features of the Giraph API and will be ready to move to more advanced features: reading graphs, computing, and outputting the result of your computation.

Graph Representations

Before diving into the details of the Giraph API, you first need to understand the ways you can represent graphs to store them in a storage medium such as a disk. Until now, this book has discussed graphs abstractly and with the help of two-dimensional figures. Let’s borrow the example of the Twitter social graph from Chapter 3, shown here in Figure 7-1, and enrich it with extra information to make it a bit more interesting. This example graph is used throughout the chapter.

Figure 7-1. An example Twitter graph. Vertices represent users, and an edge from vertex A to vertex B denotes that user A follows user B. The vertex value represents the age of a user, and the edge value represents the number of mentions from user A to user B

In the example graph, every vertex corresponds to a Twitter user and is labeled with a name: the vertex ID. An edge from user A to a user B signifies, in this case, that A follows B on Twitter. Although vertices and edges are the basic constituents of a graph, in most cases you want to associate interesting information, or metadata, with vertices and edges. In the case of a social network, a vertex may be associated with user profile information. Here, the vertices have an associated user age. As you may have already observed, this corresponds to the Vertex value you learned about in Chapter 5. Every edge also has an associated number, which denotes how many times a user mentions another user; this corresponds to an Edge value.

It is easy to conceptualize a graph for this image; but to store this graph on a disk, you must break it into pieces of information in such a way that you can later reconstruct the original graph. The typical way to store a graph is by storing information about the vertices—a vertex-based representation—or by storing information about the relations among the vertices—an edge-based representation.

In a vertex-based representation, a graph is defined by a collection of per-vertex information. The adjacency list is one of the most common vertex-based graph representations. In its simplest form, an adjacency list is a collection of lists, where each list corresponds to a vertex and contains the vertex’s neighbors. For example, Figure 7-2 shows the adjacency list that represents the Twitter graph from Figure 7-1. Each list corresponds to user, and the vertices in the list correspond to the Twitter accounts the user follows.

Figure 7-2. A vertex-based representation of the Twitter graph

Notice that the adjacency list provides exactly the same information as Figure 7-1, but in a different form. With this form, you break down the graph into per-vertex information that you can easily store.

Often, a graph is described not by its individual vertices, as in the previous example, but by the relations that exist between vertices. In this edge-based representation, a graph is defined by the set of edges, possibly accompanied by edge metadata. Figure 7-3 shows how you can represent the same Twitter graph by breaking it into per-edge information. Notice that an edge-based representation does not contain any information about the vertices themselves. If you try to reconstruct the Twitter graph from the edge information, in the end you will be missing the vertex information. In general, an edge-based representation is more suitable when you want to describe the structure of the graph through the relations among vertices.

Figure 7-3. An edge-based representation of the Twitter graph

You are likely to come across various representations in your encounters with graph data. In some cases, you choose the representation to use to store the graph data, and in other cases, you have to access graph data that’s organized for you. For instance, a vertex-based representation may be preferred over an edge-based one because it takes up less space on disk. Other times, you want to store vertex-specific information, such as user profile information in a social network, in which case a vertex-based format fits more naturally than an edge-based format.

Different representations must be handled in different ways when it comes to creating Vertex and Edge objects from them. For example, in a vertex-based description, you can reconstruct a Vertex object just by looking at an individual piece of information such as a table row in a database describing a user in social network. In contrast, in an edge-based representation, you need to look for and put together various pieces of information to reconstruct a Vertex object. This seems easy in pictures, but it becomes challenging when you have to deal with terabytes of distributed graph data—imagine trying to do this manually. The following sections examine how Giraph makes it easier to handle graph representations without worrying too much about such details.

Input Formats

Regardless of the representation, a graph may be stored in different storage systems, and possibly distributed in different formats. Giraph provides tools called input formats that simplify the task of reading an input graph. An implementation of an input format is essentially a way to specify how to read data from a storage system and then how to convert the data to the familiar Vertex and Edge objects. Giraph exports a simple API that you can implement to support new types of storage systems and formats. Giraph, in turn, is responsible for doing the hard work for you.

The Giraph API for vertex-based graph representations is VertexInputFormat, and the API for edge-based graph representations is EdgeInputFormat. These are the most basic abstract classes you extend to implement new input formats. Giraph also provides more specialized APIs that support common storage systems and formats; for instance, one common format is adjacency lists stored as text files on the Hadoop Distributed File System (HDFS) or on HBase tables.

In general, you can either extend the basic API to write your own input format implementation that is customized to your application scenario, or you can extend an existing specialized API. In many cases, you do not have to implement an input format from scratch; it is very likely that one of the implementations provided by Giraph already serves your needs.

Tip You’re encouraged to look into the available input and output format implementations before creating your own. The Giraph code base is constantly being enriched, and you may find what you need there and save some time.

In other cases, you may have to extend one of the specialized text-based input formats that already implements most of the functionality for you. For instance, the TextVertexInputFormat handles details like locating files on HDFS and reading them line by line. Implementing this specialized API only requires you to specify how to parse a text line into a Vertex object.

Figure 7-4 shows a small sample of the vertex-based input format classes in the Giraph code base. It also shows how you can build functionality starting from the basic API. White boxes represent abstract classes that you can extend, and shaded boxes represent concrete implementations that already exist in the Giraph source code.

Figure 7-4. Input format class hierarchy. Starting from the basic VertexInputFormat, every child class adds functionality and adds a specific type of format

For instance, TextVertexInputFormat is an extension of VertexInputFormat specifically for input graphs stored as text files on HDFS. The AdjacencyListTextVertexInputFormat abstract class, in turn, can handle text formats where the graph is in the form of an adjacency list. You can find all these under the org.apache.giraph.io.formats package in the Giraph code.

In the rest of this section, you learn how to create input formats from scratch by implementing the basic APIs. This will give you the necessary knowledge to create your own input format when necessary. This chapter uses text-based formats as examples; in Chapter 9, you learn how to use the same API to implement support for more advanced systems and formats, such as HBase and Hive.

Vertex-Based Input Formats

You start by seeing how to implement input formats for vertex-based graph representations. VertexInputFormat is the most basic API you can implement. Let’s look at how to extend this API step by step to implement an input format that can be used to read graphs in the form of an adjacency list stored as plain text files on HDFS, one of the most common input formats. Listing 7-1 shows what the adjacency list representing the Twitter graph would look like in a text file.

Listing 7-1. Example Text File Describing the Twitter Graph

John 22 Peter 10
Mark 40 John 23 Anne 11
Jack 19 Mark 1 Anne 9
Julia 29 Mark 32 Natalie 4
Anne 31 Julia 20 Peter 35
Natalie 45 Peter 2 Anne 50 John 10

In this text input, every line contains a user, the user’s age, the people they follow, and how many times they have mentioned those followers. This implementation is called SimpleTextVertexInputFormat.

Giraph uses VertexInputFormat to split the input data into parts and then process each part to generate Vertex objects. Listing 7-2 shows the VertexInputFormat API.

Listing 7-2. VertexInputFormat Abstract Class

public abstract class VertexInputFormat<
    I extends WritableComparable,
    V extends Writable,
    E extends Writable> extends GiraphInputFormat<I, V, E> {

  public abstract void checkInputSpecs(Configuration conf);

  public abstract List<InputSplit> getSplits(JobContext context,
      int minSplitCountHint) throws IOException, InterruptedException;

  public abstract VertexReader<I, V, E> createVertexReader(
      InputSplit split,
      TaskAttemptContext context) throws IOException;
}

Notice first that VertexInputFormat uses Java generics to declare as parameters type I of the vertex ID, type V of the vertex value, and type E of the edge value. Any implementation of this API, unless abstract, must specify these three types. Recall from Chapter 5 that Vertex objects and Computation implementations also require the same parameters. Before loading the graph, Giraph uses the information about the types to ensure that the types of the vertices your VertexInputFormat implementation creates match the types of the Computation.

In the example in Listing 7-3, the vertex IDs are names, so you use type Text to represent them. Vertex and edge values are the age and number of mentions, respectively, so you use IntWritable to represent them. Next, let’s get into the details of the API method-by-method through our example.

Listing 7-3. An Example Text-Based VertexInputFormat Implementation

public class SimpleTextVertexInputFormat
    extends VertexInputFormat<Text, IntWritable, IntWritable> {

  public void checkInputSpecs(Configuration conf) { ... }      #1

  @Override
  public List<InputSplit> getSplits(JobContext context,
                                    int minSplitCountHint)
    throws IOException, InterruptedException {

    List<FileStatus> files = getFileList(context);           #2
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (FileStatus file : files) {                          #3
      Path path = file.getPath();                            #3
      long length = file.getLen();                           #3
      splits.add(new FileSplit(path, 0, length,              #3
          new JobConf(context.getConfiguration())));         #3
    }
    return splits;                                           #4
  }

  public VertexReader<I, V, E> createVertexReader(
      InputSplit split, TaskAttemptContext context) throws IOException {
    return new SimpleTextVertexReader();                      #5
  }
}

#1 Check the validity of the input.
#2 Get the list of files in the input path.
#3 For every input file, create a FileSplit object and add it to a list.
#4 Return the list of FileSplit objects created.
#5 Return an implementation of the VertexReader class.

The first method you have to implement is checkInputSpecs(). Giraph calls this method before it starts using the input format. Typically, the role of this method is to check whether the job configuration contains all the necessary information for the job to read the input graph. For instance, in this method you may check whether an input directory has been set in the configuration or whether the input directory exists.

The next method you must implement is getSplits(). It takes as input a JobContext object, which contains configuration information about the job such as the input directory, and returns a list of InputSplit objects. An InputSplit is a logical representation of a part of the input. For instance, a FileSplit implements the InputSplit interface to represents a file on HDFS; the InputSplit contains information about the file’s path, size, and location. In this example, the getSplits() method lists all the files in the input directory passed as a parameter in the job configuration and creates a FileSplit for every file.

TYPES INHERITED FROM THE HADOOP API

Just as in Chapter 5, notice that certain types, such as InputSplit, JobContext, and TaskAttemptContext, are inherited from the Hadoop API. But in most cases you don’t need to know the Hadoop details.

The most important method of the class is createVertexReader(), which returns an object extending the VertexReader abstract class. VertexReader is responsible for performing the actual processing of the data described by InputSplit. For every InputSplit returned by the previous method, Giraph creates a VertexReader object that processes the corresponding data and creates the Vertex objects that eventually form your graph. Figure 7-5 illustrates this entire process.

Figure 7-5. Putting all the pieces together. VertexInputFormat splits the input into parts. VertexReader processes a part and creates Vertex objects

Now let’s look at the details of VertexReader and how to implement one. Listing 7-4 shows the methods defined in the VertexReader abstract class.

Listing 7-4. The VertexReader Abstract Class

public abstract class VertexReader<I extends WritableComparable,
    V extends Writable, E extends Writable> extends
    DefaultImmutableClassesGiraphConfigurable<I, V, E>
    implements WorkerAggregatorUsage {

  public abstract void initialize(InputSplit inputSplit,
                                  TaskAttemptContext context)
    throws IOException, InterruptedException;

  public abstract boolean nextVertex() throws IOException,
      InterruptedException;

  public abstract Vertex<I, V, E> getCurrentVertex()
    throws IOException, InterruptedException;

  public abstract void close() throws IOException;
}

In Listing 7-5, you implement SimpleTextVertexReader, which is responsible for reading the lines in a text file one by one and creating a Vertex object for each line. This example shows how to implement each of the VertexReader methods.

Listing 7-5. An Example Text-Based VertexReader

public class SimpleTextVertexReader
 extends VertexReader<Text, IntWritable, IntWritable> {

  private RecordReader<LongWritable, Text> lineRecordReader;
  private TaskAttemptContext context;

  @Override
  public void initialize(InputSplit inputSplit, TaskAttemptContext context)
     throws IOException, InterruptedException {
     this.context = context;
    lineRecordReader = new LineRecordReader();                            #1
    lineRecordReader.initialize(inputSplit, context);                     #1
  }

  @Override
  public final boolean nextVertex()
        throws IOException, InterruptedException {
    return lineRecordReader.nextKeyValue();                               #2
  }

  @Override
  public final Vertex<Text, IntWritable, IntWritable> getCurrentVertex()
     throws IOException, InterruptedException {

    Text line = lineRecordReader.getCurrentValue();                       #3
    Vertex<Text, IntWritable, IntWritable> vertex =
         getConf().createVertex();                                        #4
    String[] words = line.toString().split(' ');                          #5
    Text id = new Text(words[0]);                                         #6
    IntWritable age = new IntWritable(Integer.parseInt(words[1]));        #7

    List<Edge<TextWritable, IntWritable>> edges = new ArrrayList();
    for (int n = 2; n < tokens.length; n=n+2) {
      Text destId = new Text(tokens[n]);                                  #8
      IntWritable numMentions =                                           #8
         new IntWritable(Integer.parseInt(tokens[n+1]);                   #8
      edges.add(new DefaultEdge(dstId, numMentions));                     #9
    }
    vertex.initialize(id, age, edges);                                   #10
    return vertex;
  }

  @Override
  public void close() throws IOException {
    lineRecordReader.close();                                            #11
  }
}

#1 Create a line record reader to read text files.
#2 Check whether there are more lines to read.
#3 Get the current line.
#4 Create an empty vertex object.
#5 Split the line into tokens.
#6 The first token is the vertex ID.
#7 The second token is the vertex value.
#8 Parse the destination IDs and number of mentions.
#9 Create edge objects.
#10 Initialize the vertex object.
#11 Close the input file.

The first method you implement is initialize(), which Giraph calls immediately after creating but before it starts using VertexReader. Giraph passes as input to this method the InputSplit that the vertex reader will process and a TaskAttemptContext object that contains configuration information such as the HDFS input directory. In general, you use this method to set up VertexReader before for execution. In the example, you use this to initialize the local variables and create a LineRecordReader. LineRecordReader is a helper class that you borrow from the Hadoop API and use to read a text file on HDFS line by line. To keep the example simple, it omits the LineRecordReader details; you can always use it as is by passing it the InputSplit and the context information.

The next two methods, nextVertex() and getCurrentVertex(), resemble an iterator interface. The nextVertex() method must return true if there are more vertices to read and advance the iterator to the next vertex. The getCurrentVertex() method returns the current vertex the iterator points to. Giraph calls these two methods repeatedly as a pair until there is no more data in the input to read.

Now let’s see how you implement these methods. In the example, nextVertex() only has to check whether there are more lines in the file and, if so, read and buffer the next line. This is the operation LineRecordReader performs for you automatically, so you do not need to worry about the details.

The getCurrentVertex() method, in turn, must create a Vertex object from a text line. First it gets the current text line read. Then it creates an empty Vertex object that you need to fill with information: the ID, the value of the class, and its edges. Recall that the input graph is in the form of an adjacency list, as follows:

John 22 Peter 10
Mark 40 John 23 Anne 11
Jack 19 Mark 1 Anne 9
Julia 29 Mark 32 Natalie 4
...

After you split the text line into words, the first word in the line describes the vertex ID of a user. The next word describe the vertex value—that is, the age of the user—which you parse into an IntWritable object. The next words describe the IDs this user follows on Twitter and how many times the user mentions them. You use these to create the vertex edges: you parse every such pair in the line and create Edge objects. After you have parsed the entire line, you use the initialize() method of the Vertex class to fill the empty Vertex object with the necessary information.

Finally, after reading an InputSplit, Giraph calls the close() method of VertexReader. In the example, you use this to close the file opened upon initialization. Again, LineRecordReader takes care of this for you.

This concludes the operation of VertexInputFormat. Using this simple yet very useful example, you can now begin writing your own vertex-based input formats.

Edge-Based Input Formats

As described in the previous section, edge-based graph representations are common as well. Recall that an edge-based representation looks like Figure 7-6 for the example Twitter graph.

Figure 7-6. Edge-based representation of the Twitter graph. Edge values represent number of mentions

Listing 7-6 shows how you store such a representation in a text file. Each line in the file represents an edge in the graph. Recall that an edge from A to B denotes that user A follows B and is accompanied by the number of times A mentions B. Each line in the file includes this information as well, next to each edge.

Listing 7-6. An Example Edge-Based Input File for the Twitter Graph

John Peter 10
Mark John 23
Mark Anne 11
Jack Mark 1
Jack Anne 9
Julia Mark 32
Julia Natalie 4
Anne Julia 20
Anne Peter 10
Natalie Peter 2
Natalie Anne 50
Natalie John 10

Giraph provides an easy way to read such representations, called EdgeInputFormat. Similar to VertexInputFormat, Giraph uses EdgeInputFormat to split the input data into parts and then processes the data in each split. The basic difference is that EdgeInputFormat does not create Vertex objects, but Edge objects. At this point you may wonder how Vertex objects are created; after all, when you implement a Computation, Giraph passes a Vertex object for you to manipulate. The answer is that Giraph handles the creation of the Vertex objects for you. This is one of the conveniences that EdgeInputFormat provides.

Let’s get into the details of the API, illustrated in Listing 7-7. Notice the similarities to VertexInputFormat. The first two methods serve exactly the same purpose: checking the validity of the configuration and returning a list of InputSplit objects describing the input parts. One difference to note is that EdgeInputFormat does not require you to specify the vertex value V, but only the vertex ID type I and the edge value type E. Keep in mind that the vertex ID type and edge value type still have to match those of your Computation.

Listing 7-7. EdgeInputFormat Abstract Class

public abstract class EdgeInputFormat<I extends WritableComparable,
    E extends Writable> extends GiraphInputFormat<I, Writable, E> {

  public abstract void checkInputSpecs(Configuration conf);

  public abstract List<InputSplit> getSplits(JobContext context,
      int minSplitCountHint) throws IOException, InterruptedException;

  public abstract EdgeReader<I, E> createEdgeReader(
      InputSplit split,
      TaskAttemptContext context) throws IOException;
}

The most important difference in the API is the createEdgeReader() method, which returns an object that extends the EdgeReader abstract class, shown in Listing 7-8. EdgeReader is responsible for reading the data described by InputSplit and creating the Edge objects.

Listing 7-8. EdgeReader Abstract Class

public abstract class EdgeReader<I extends WritableComparable,
    E extends Writable> {

  public abstract void initialize(InputSplit inputSplit,
                                  TaskAttemptContext context)
    throws IOException, InterruptedException;

  public abstract boolean nextEdge()
    throws IOException, InterruptedException;

  public abstract I getCurrentSourceId()
   throws IOException, InterruptedException;

  public abstract Edge<I, E> getCurrentEdge()
    throws IOException, InterruptedException;

  public abstract void close() throws IOException;
}

Let’s look at the methods one by one and how to implement them to read the example file in Listing 7-9. This implementation is called SimpleTextEdgeReader.

Listing 7-9. Example EdgeReader for Text-Based Formats

public class SimpleTextEdgeReader extends EdgeReader<Text, IntWritable> {
  private RecordReader<LongWritable, Text> lineRecordReader;

  @Override
  public void initialize(InputSplit inputSplit,                    #1
       TaskAttemptContext context)                                 #1
         throws IOException, InterruptedException {                #1
    lineRecordReader = new LineRecordReader();                     #1
    lineRecordReader.initialize(inputSplit, context);              #1
  }

  @Override
  public final boolean nextEdge()                                  #2
    throws IOException, InterruptedException {                     #2
      return lineRecordReader.nextKeyValue();                      #2
  }

  @Override
  public final Text getCurrentSourceId() throws IOException,
        InterruptedException {
    Text line = getRecordReader().getCurrentValue();               #3
    String[] words = line.toString().split(' ');                   #3
    return new Text(words[0]);                                     #3
  }

  @Override
  public final Edge<Text, IntWritable> getCurrentEdge()
    throws IOException, InterruptedException {
    Text line = getRecordReader().getCurrentValue();
    String[] words = line.toString().split(' ');
    Text targetVertexId = new Text(words[1]);                      #4
    IntWritable edgeValue = Integer.parseInt(words[2]);            #5
    return new DefaultEdge(targetVertexId, edgeValue);             #6
  }

  @Override
  public void close() throws IOException {                         #7
    lineRecordReader.close();                                      #7
  }
}

#1 This method is called at the very beginning to set up EdgeReader.
#2 Check whether there are text lines.
#3 Get the current text line, and parse the source vertex ID.
#4 Parse the target vertex.
#5 Parse the edge value.
#6 Create an Edge object with the parsed values.
#7 This method is called at the end to close the opened files.

First, notice that you have to implement initialize() and close() methods, as with VertexReader. These are called right before Giraph starts using the reader and at the end of execution. Here you use them to open and close the file described by InputSplit the same way you did previously. Their implementation in this case is exactly the same, so we won’t elaborate. Instead, let’s get into the most interesting part of EdgeReader.

EdgeReader implements an iterator-like interface similar to VertexReader. The nextEdge() method returns true if there are more edges to read in the input file and advances the iterator to the next edge. Although the details aren’t shown, if a line is available, LineRecordReader reads and buffers it. After Giraph calls this method, it can use the following two methods. First it calls the getCurrentSourceId() method. This method must return the source ID of the edge the iterator points to currently. Recall that Giraph creates Vertex objects automatically for you; this is how Giraph knows which Vertex to attach the edge to. In this implementation, all you have to do is read the first word in a line and return it as a Text object. Finally, Giraph calls the getCurrentEdge() method to create an Edge object and add it to the vertex. An edge is defined by a target vertex ID and an edge value. To obtain these, the implementation parses the second and third words of each line and then creates and returns an Edge object.

Note the creation of the Edge objects. In Chapter 5, you were only concerned with accessing Edge objects through a Vertex; you never had to create one. In reality, Edge is an interface; under the hood, Giraph uses specific types that implement the Edge interface. DefaultEdge is one of these Edge implementations provided by Giraph. Most of the time, the default type will serve your needs, but you are free to write your own Edge implementations.

USING THE BUILT-IN EDGE IMPLEMENTATIONS

Giraph provides different implementations of the Edge interface that serve different purposes. You can find these in the org.apache.giraph.edge package. For instance, the EdgeNoValue edge type is suitable when edges have no associated value, because this implementation handles this scenario in a more memory-efficient way than the default implementation. As a convenience, Giraph provides the EdgeFactory utility class with methods to simplify the creation of Edge objects.

Combining Input Formats

You have learned that graphs can be stored in either vertex-based or edge-based representations. But in several scenarios, a graph may be stored in multiple representations at the same time. One common case is to store a graph by separating the graph structure: that is, by separating the connecting edges from the per-vertex information. There are various practical reasons for doing so. For instance, different analytical applications may require different data; you can separate the data for performance reasons so that no application needs to read all the data every time. In the Twitter graph scenario, you may want to keep the who-follows-whom relations stored in text files on HDFS in an edge-based format, and the user profile information for each user in a vertex-based format in a key-value store. Figure 7-7 shows an example of this separation.

Figure 7-7. Graph information split into two separate representations

This separation may be common practice, but it is also common for an application to combine data sources to analyze a graph in a meaningful way. An application must be able to combine the two sources on the right in Figure 7-7 to reconstruct the original graph.

You may wonder whether you need a vertex-based or an edge-based input format to handle this scenario. But you will quickly realize that no one type of input format alone is sufficient for this. For instance, in the previous subsection, the EdgeInputFormat does not provide a means to specify a vertex value, so you cannot use it to read profile information. Similarly, you cannot use a VertexInputFormat because information about a single vertex is scattered across different places in the input. As a workaround, you could preprocess the data to merge the two data sources into one input data set that is in a vertex-based format before reading. But this approach can be cumbersome and inefficient, because you have to read the same data twice every time you want to do your analysis.

Fortunately, Giraph provides a more convenient and efficient way to handle this scenario by allowing you to combine two different graph representations. Specifically, you can use an EdgeInputFormat that is responsible for creating the edges of a vertex by reading the edge representation, and at the same time a VertexInputFormat that is responsible for setting the value of a vertex by reading the profile information.

Recall that in the example in Chapter 5, when you launched the job you specified a vertex input format using the –vif command-line parameter to define the input format class name and a vertex input path using the –vip parameter. Similarly, you can use an edge input format by replacing this with the –eif parameter and specifying the edge input format class name and the –eip parameter to indicate the edge input path. Combining two types of input formats is as simple as including all four parameters in your command line. Giraph takes care of the rest.

Input Filters

Filtering a graph is a common operation in graph analysis. Filtering means removing some of the vertices or some of the edges from the original graph and performing your analysis on the remaining graph. There are various reasons to do this. Certain graph-mining algorithms give good approximations even if executed on a random sample of the graph. This is often preferred as a way to speed up the computation.

In other cases, filtering is an implicit requirement of an application. A common scenario is when the weight of the edges in the graph signifies the strength of the connection between vertices, and you want to perform the analysis only on users with strong connections. Consider the Twitter scenario, and assume you want to recommend new people for Mark to follow by looking at who the users he follows—Anne and John—follow. For instance, John follows Peter, so Peter may be a good recommendation for Mark too. But to make the recommendation more relevant, perhaps you want to include in this analysis only neighbors with whom Mark has a strong connection. In this case, you need to filter out edges with low weights. For instance, you could filter out neighbors that a user mentions fewer than 20 times, excluding Anne in this case, indicating that Julia may not be a relevant recommendation for Mark.

You could even perform this kind of filtering based on per-vertex information. For instance, you may want to perform this analysis on Twitter users in a specific age range. This way, you ensure that you do not recommend friends to a youngster based on who older people follow. To address this, you could filter from the graph all vertices that have an age above an age threshold. These are only a few examples where filtering might be necessary.

Giraph provides an elegant way for you to integrate this filtering process into your analysis when necessary. It allows you to specify vertex filters and edge filters to be used along with a VertexInputFormat and an EdgeInputFormat. Vertex and edge filters provide a way for you to specify whether a vertex or an edge should be added in the final graph during the loading of the input data. You can specify filters by implementing the VertexInputFilter and EdgeInputFilter interfaces, shown in Listing 7-10 and Listing 7-11, respectively.

Listing 7-10. VertexInputFilter

public interface VertexInputFilter<I extends WritableComparable,
    V extends Writable, E extends Writable> {

  boolean dropVertex(Vertex<I, V, E> vertex);                 #1
}

#1 Decide whether to drop a vertex.

Listing 7-11. EdgeInputFilter

public interface EdgeInputFilter<I extends WritableComparable,
  E extends Writable> {

boolean dropEdge(I sourceId, Edge<I, E> edge);
}

In the case of a vertex-based input format, during the loading of the graph, Giraph uses a VertexInputFilter object to decide whether a vertex should be filtered. The VertexInputFilter interface has a single method named dropVertex() with a boolean return value. Recall from the previous sections that Giraph uses VertexReader to create Vertex objects. For every Vertex object created by VertexReader, Giraph calls dropVertex() and passes it the Vertex object as an input parameter to decide whether to keep the corresponding vertex. If the method returns true, then Giraph drops the vertex.

Similarly, for edge-based input formats, Giraph uses an EdgeInputFilter to decide whether an edge that is read should be included in the graph. For every Edge object created, Giraph calls the dropEdge() method and passes it the ID of the edge source vertex and the edge object itself. If the method returns true, Giraph drops the edge. Listing 7-12 shows an example EdgeInputFilter implementation that decides whether to include an edge based on the edge weight. In the example of the Twitter graph, you could use such an edge filter to filter out weak connections identified by a low number of mentions.

Listing 7-12. An Example Edge Input Filter

public class WeightInputFilter extends EdgeInputFilter<Text,IntWritable> {

  public static int THRESHOLD = 10;

  boolean dropEdge(Text sourceId, Edge<Text,IntWritable> edge) {
    if (edge.getValue().get()<THRESHOLD) {
      return true;
    }
    return false;
  }
}

As you may have expected, the vertex ID and edge value types must match those of the input format implementation you use.

Once you have implemented your filter, you can specify that you want to use it through the command line. You can do this by adding the giraph.vertexInputFilterClass custom argument in the execution command and setting your filter implementation class name as its value.

Note that you could implement the filtering logic in your application if you wished, using the mutation API you learned about in Chapter 5. This way, you would dedicate the first superstep of the computation to filtering. However, filters offer a couple of benefits. First, they allow you to decouple the filtering logic from your main application logic, leading to clean, reusable code. Second, using input filters, Giraph has an opportunity to do the filtering at the early stage of reading the graph, leading to more efficient graph loading. It is faster and requires less memory.

Alternatively, you could preprocess your graph with separate scripts. However, this can be a tedious task, because you have to write code to read data from different storage systems and formats. This functionality is already provided by Giraph and its input formats. It can also be inefficient, because you have to read the graph twice every time you want to analyze it: once for preprocessing and once for the actual analysis. Input filters simplify these tasks.

Output Formats

You have learned how to read an input graph and how to perform useful analysis on it. What you are missing to complete the picture is a way to output the result of your analysis. After all, the result is no good if you cannot somehow save it to use later. Output formats are the tools that Giraph provides for you to achieve this. At the end of a computation, the Vertex and Edge objects in the graph contain all the useful information you want to store. For instance, in the shortest-paths application you wrote in Chapter 5, at the end of the computation the vertices contained the value of the shortest distance. Giraph uses an output format to save the contents of Vertex and Edge objects to a storage system.

As with input formats, there are both vertex-based and edge-based output formats. As you may have guessed, a vertex-based output format is used to save information about individual vertices, and an edged-based output format is used to save information stored in Edge objects. In the examples in Chapter 5, you primarily used vertex-based output formats. The type of output format you need will depend on your particular scenario. For instance, when you compute a metric per vertex, such as a distance or a rank, you naturally need a vertex-based output format. But suppose you read an input graph in an edge-based input format and you simply want to transform it. (You saw an example of such a transformation in Chapter 4, where you converted an undirected graph to a directed one.) In this case, you may want to save the transformed graph in an edge-based format as well.

Similar to when reading an input graph, you may wish to store the result information to different storage systems and in different formats. To accommodate this, Giraph exports two basic abstract classes VertexOutputFormat and EdgeOutputFormat that you can extend. For instance, the GraphvizOutputFormat that you used in Chapter 5 is an implementation of VertexOutputFormat provided for you. In this section, you learn how to write your own output formats. This chapter focuses on text-based formats, but in Chapter 9 you discover how to output data to any storage system and in any format.

Vertex-Based Output Formats

First you learn how to implement a vertex-based output format. Let’s assume that you run the shortest-distances application, and you want to save the distance computed for every vertex in text files and store them to HDFS—the typical scenario. To do this, you extend the VertexOutputFormat abstract class as shown in Listing 7-13.

Listing 7-13. VertexOutputFormat Abstract Class

public abstract class VertexOutputFormat<
    I extends WritableComparable, V extends Writable,
    E extends Writable> {

  public abstract void checkOutputSpecs(JobContext context)
    throws IOException, InterruptedException;

  public abstract VertexWriter<I, V, E> createVertexWriter(
    TaskAttemptContext context) throws IOException, InterruptedException;

  public abstract OutputCommitter getOutputCommitter(
    TaskAttemptContext context) throws IOException, InterruptedException;

}

The first thing to notice in the VertexOutputFormat is that the vertex ID, vertex value, and edge value types are parameters of the class. This means your implementation, unless abstract, must specify these types, and they must match the corresponding types in your Computation.

Next, Listing 7-14 describes the VertexOutputFormat class methods and implements them. This implementation is called SimpleTextVertexOutputFormat.

Listing 7-14. An Example Text-Based VertexOutputFormat

public class SimpleTextVertexOutputFormat
    extends VertexOutputFormat<Text, IntWritable, IntWritable> {

  @Override
  public void checkOutputSpecs(JobContext context)
    throws IOException, InterruptedException {
    textOutputFormat.checkOutputSpecs(context);
  }

  @Override
  public VertexWriter<Text, IntWritable, E> createVertexWriter(
    TaskAttemptContext context)
      throws IOException, InterruptedException {

    return new SimpleTextVertexWriter();                              #1
  }

  @Override
  public OutputCommitter getOutputCommitter(TaskAttemptContext context)
    throws IOException, InterruptedException {
    return textOutputFormat.getOutputCommitter(context);
  }
}

#1 This returns an object extending the VertexWriter class.

Giraph calls checkOutputSpecs() at the beginning of a job and passes it a JobContext object that contains configuration information. Its role is to perform checks similar to the one you learned for input formats. For instance, you can check whether the output directory has been set in the job configuration and, if so, whether it already exists and contains data that cannot be overwritten. In such a case, this method may throw an exception.

Like VertexReader objects, Giraph uses VertexWriter objects to do most of the work of writing Vertex information in objects to output. Specifically, after your computation finishes, Giraph creates a VertexWriter object on every worker machine in your cluster and uses these objects to write information about every vertex to the output. It does so by calling the createVertexWriter() method of the VertexOutputFormat class, which returns an object extending the VertexWriter abstract class. Every such object is responsible for processing a set of vertices in the graph. Figure 7-8 shows this process.

Figure 7-8. Saving a graph using a VertexOutputFormat and VertexWriter objects

Notice that Giraph follows a process that is the inverse of the one it performed when it was reading the graph. Giraph started by reconstructing a graph from text files, and now it must save the graph back to the same type of format. But let’s look what happens in a VertexWriter in more detail. Listing 7-15 shows the VertexWriter abstract class methods you implement.

Listing 7-15. VertexWriter Abstract Class

public abstract class VertexWriter<I extends WritableComparable,
    V extends Writable, E extends Writable> {

  public abstract void initialize(TaskAttemptContext context)
    throws IOException, InterruptedException;

  public abstract void writeVertex(Vertex<I, V, E> vertex)
       throws IOException, InterruptedException;

  public abstract void close(TaskAttemptContext context)
    throws IOException, InterruptedException;

}

The example output format returns a VertexWriter called SimpleTextVertexWriter, which is designed to write vertex information to text files. Listing 7-16 shows the implementation of SimpleTextVertexWriter and its methods.

Listing 7-16. An Example Text-Based VertexWriter

public class SimpleTextVertexWriter
      extends VertexWriter<Text, IntWritable, IntWritable> {

  private RecordWriter<Text, Text> lineRecordWriter;

  @Override
  public void initialize(TaskAttemptContext context)
    throws IOException, InterruptedException {
    lineRecordWriter = createLineRecordWriter(...);             #1
  }

  @Override
  public final void writeVertex(Vertex vertex) throws
      IOException, InterruptedException {
    String line = vertex.getId()+” “+vertex.getValue();         #2
    lineRecordWriter.write(new Text(line), null);               #3
  }

  @Override
  public void close(TaskAttemptContext context)
     throws IOException, InterruptedException {
    lineRecordWriter.close(context);                            #4
  }
}

#1 Create a line record writer object.
#2 Make a string containing the vertex information to write.
#3 Use a record writer to write the string.
#4 Close the record writer.

Before it begins using the VertexWriter, Giraph calls initialize(), passing it a TaskAttemptContext as input. The initialize() method typically sets up the output files using configuration information from the TaskAttemptContext object. This involves, for instance, creating the output text file and obtaining a handle on the file. To write output to HDFS, you use the LineRecordWriter class, which takes care of these tasks for you. After the initialization of a VertexWriter, Giraph calls the writeVertex() method for every vertex in the graph. In this method, you must specify what information to output and how to write it. In this case, you write the vertex value to the output text files. The implementation first makes a string that contains the vertex ID and the value of the vertex; it then uses the LineRecordWriter to append the string to the output files.

USING THE LINERECORDWRITER

Notice that you use the write() method of LineRecordWriter to append text to the file. The write() methods accepts two input parameters: a key and a value. The LineRecordWriter then writes both the key and value parameters to the output file. Here you only need to write a single string, so you only need to pass it as the key parameter and leave the second parameter as null. The LineRecordWriter knows to ignore the null parameter.

After writing every vertex, Giraph calls the close() method, which in this case closes the files you opened upon initialization. Again, the LineRecordWriter takes care of this task for you.

One new feature here is OutputCommitter. An output committer is a component borrowed from Hadoop; its primary job is to set up the output directory of a job. This involves setting up a directory where temporary output files can be written before the job finishes and moving the files from the temporary directory to a final one upon successful completion of the job. We do not go into the details of the OutputCommitter methods here. For HDFS output, you can simply reuse the standard OutputCommitter provided by the Hadoop API, which handles all the tasks involved in setting up the output directories for you. Chapter 9 revisits OutputCommitter, when it discusses storage systems.

Edge-Based Output Formats

Edge-based output formats are very similar to their vertex-based counterparts. You can immediately see the similarities in Listing 7-17, which shows the EdgeOutputFormat abstract class.

Listing 7-17. EdgeOutputFormat Abstract Class

public abstract class EdgeOutputFormat<
    I extends WritableComparable, V extends Writable,
    E extends Writable> {

  public abstract void checkOutputSpecs(JobContext context)
    throws IOException, InterruptedException;

  public abstract OutputCommitter getOutputCommitter(
    TaskAttemptContext context) throws IOException, InterruptedException;

  public abstract EdgeWriter<I, V, E> createEdgeWriter(
    TaskAttemptContext context) throws IOException, InterruptedException;
}

The basic difference is that now you have to implement an EdgeWriter instead of a VertexWriter. Listing 7-18 shows the EdgeWriter abstract class. The most important method in this class is writeEdge(); Giraph calls this method for every edge in the graph.

Listing 7-18. EdgeWriter Abstract Class

public abstract class EdgeWriter<
    I extends WritableComparable, V extends Writable,
    E extends Writable>

  public abstract void initialize(TaskAttemptContext context)
    throws IOException, InterruptedException;

  public abstract void writeEdge(I sourceId, V sourceValue, Edge<I,E> edge)
      throws IOException, InterruptedException;

  public abstract void close(TaskAttemptContext context)
    throws IOException, InterruptedException;
}

Let’s look at an example of how to implement an EdgeWriter that saves the edges along with their values into text files stored on HDFS. The implementation is called SimpleTextEdgeWriter and is shown in Listing 7-19.

Listing 7-19. An Example Text-Based EdgeWriter

public class SimlpeTextEdgeWriter<
      extends EdgeWriter<Text, NullWritable, IntWritable> {

  private RecordWriter<Text, Text> lineRecordWriter;

  @Override
  public void initialize(TaskAttemptContext context)             #1
    throws IOException, InterruptedException {                   #1
    lineRecordWriter = createLineRecordWriter(...);              #1
  }

  @Override
  public final void writeEdge(Text sourceId,
      IntWritable sourceValue, Edge<Text, IntWritable> edge)
      throws IOException, InterruptedException {

    String line = sourceId+” “+                                  #2
                edge.getTargetVerexId()+” “+                     #2
                edge.getValue();                                 #2
    lineRecordWriter.write(new Text(line), null);                #3
  }

  @Override
  public void close(TaskAttemptContext context)
    throws IOException, InterruptedException {
    lineRecordWriter.close(context);                             #4
  }
}

#1 Initialize the EdgeWriter by creating a line record writer.
#2 Construct the string to write to the output.
#3 Use the line record writer to write the string.
#4 Close the output files at the end.

By now, most of the methods you need to implement for the EdgeWriter should not surprise you. You use initialize() to set up the EdgeWriter by creating the familiar LineRecordWriter. Similar to VertexWriter, Giraph calls writeEdge() for every edge in the graph. Giraph passes to this method the ID of the source vertex of the edge and its vertex value as well as the edge object itself. In this implementation, you construct a string that contains the source and destination vertex IDs and the edge value. You then use the LineRecordWriter to write the string to the output file.

Output formats provide you with all the tools you need to output information stored on your graph: the vertex and edge values. But recall that a Giraph program allows you to keep information in different objects, as well. These are the most useful aggregators.

Aggregator Writers

Aggregators are used to hold information, such as counters and aggregate statistics, about the entire graph. For example, in Chapter 5, you saw an example of how to use an aggregator to keep track of the Twitter user with the most followers. Often, at the end of a program, aggregators hold useful information—in this case, the most popular Twitter user—that you want to output to a file in the same way you output information held in the graph’s vertices and edges.

To achieve this, Giraph provides a tool called an aggregator writer. Giraph uses aggregator writers to save the values of aggregators to a storage medium and in a format of your choosing: for instance, text files on HDFS. In Chapter 5, you used an aggregator writer implementation provided in the Giraph code base that saves aggregator values in a file. Here, you learn how to implement your own aggregator writer.

To specify how to save the values of aggregators, you must implement the AggregatorWriter interface shown in Listing 7-20. Giraph uses an AggregatorWriter at the end of every superstep to save the values stored in aggregators.

Listing 7-20. The AggregatorWriter Interface

public interface AggregatorWriter
    extends ImmutableClassesGiraphConfigurable {

  void initialize(Context context, long applicationAttempt)
      throws IOException;

  void writeAggregator(Iterable<Entry<String, Writable>> aggregatorMap,
      long superstep) throws IOException;

  void close() throws IOException;
}

Let’s look at an example of how to implement these methods to customize the saving of aggregator information. Listing 7-21 shows a simple aggregator writer called SimpleAggregatorWriter that exports aggregator values to a text file on HDFS.

Listing 7-21. Example Aggregator Writer

public class SimpleAggregatorWriter implements AggregatorWriter {
private FSDataOutputStream output;

  @Override
  public void initialize(Context context, long applicationAttempt)
    throws IOException {
    Path p = new Path("aggregatedValues_" + applicationAttempt);  #1
    FileSystem fs = FileSystem.get(context.getConfiguration());
    output = fs.create(p, true);                                  #2
  }

  @Override
  public void writeAggregator(
    Iterable<Entry<String, Writable>> aggregatorMap,
    long superstep) throws IOException {
    for (Entry<String, Writable> entry : aggregatorMap) {         #3
      String text = entry.getKey()+” “+entry.getValue();
      output.writeChars(text);                                    #4
    }
    output.flush();
  }

  @Override
  public void close() throws IOException {
    output.close();                                               #5
  }
}

#1 Name the output file based on the application attempt.
#2 Create a file on HDFS.
#3 Iterate over all aggregators.
#4 Append the aggregator name and value to the file.
#5 Close the output file.

The first method you implement is initialize(). Giraph calls this method right before it starts using the aggregator and typically uses it to set up the aggregator writer. The first argument it passes is a Context object that contains configuration information about the job. The second argument indicates the number of the application attempt. An application attempt occurs whenever Giraph elects a new master worker; in most cases this happens only once, but sometimes, such as after a master worker failure, a new master must be elected. The example uses this method to create the HDFS file where you output the aggregator values. Notice how you name the output file based on the application attempt so that you can distinguish between attempts.

Next, you implement the writeAggregator() method. Giraph calls this method at the end of every superstep and passes two input arguments. The first is a Map that contains a key-value pair for each aggregator registered, where the key is a String representing the name of an aggregator and the value is the aggregator’s associated value. The second parameter is the number of the superstep that just finished. Note that Giraph uses the special superstep number -1 to denote that this is the last superstep. You can use this information to decide whether you want to perform some action at every superstep or only at the end. This example iterates over all aggregators; for every aggregator, you write to the output file a line containing the name and the value of the aggregator.

In this method, you can implement arbitrary logic. For instance, you may want to select only specific aggregators to export to the file, or you may decide not to export an aggregator value unless it is the last superstep. You may also want to output different aggregators to different file names or even output the values to a different storage system or format. You have the ability to accommodate your particular application scenario.

Finally, you implement the close() method. Giraph calls this method at the end of a successful execution. Here, you use this method to close the file that you opened upon initialization.

Summary

Reading an input graph is the first step you need to perform to use Giraph, and saving the output of your analysis is equally necessary. These sound like simple tasks, but they can be challenging once you start thinking about the various types of formats and storage systems where your graph may be stored. Giraph provides the necessary tools to simplify this process.

In this chapter, you learned the following:

Understanding how your input graph is represented, the details of the storage system where your graph is stored, and its format is the first step in managing your input and output data.
You can use VertexInputFormat and EdgeInputFormat implementations to read input data in vertex-based and edge-based representations, respectively. Their role is to convert raw input data into Vertex and Edge objects for you to use.
VertexOutputFormat and EdgeOutputFormat implementations, in turn, allow you to save output information in a vertex- or edge-based format.
Giraph provides a library of input and output formats that support common cases and may cover your needs. Be sure to check the existing code base, because it may save you time.
As you start using Giraph in a more advanced way and integrating it into your particular data-analytics architecture, you may need more than the basics. Extending the input and output format APIs allows you to add more functionality—for instance, supporting new storage systems.
You may frequently need to combine different graph representations. The ability to combine a VertexInputFormat with an EdgeInputFormat becomes very handy in these cases.

In the last few chapters, you have learned how to use the basic Giraph APIs. Although this gives you the ability to perform sophisticated analysis on graphs, the Giraph API is even more powerful. The following chapters explore advanced features provided by Giraph; you learn how to add functionality to your applications, make them more efficient, and write cleaner and more reusable code.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Chapter 7: Graph IO Formats

Create new playlist

Sign In

Sign Up

Table of Contents for
Chapter 7: Graph IO Formats