Chapter 4: Giraph Algorithmic Building Blocks

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

CHAPTER 4

Giraph Algorithmic Building Blocks

This chapter covers

Principles and patterns behind scalable graph algorithms
Graph connectivity, paths, and connected components
Ranking vertices with PageRank
Predicting ratings for user-item recommendations
Identifying communities with label propagation
Graph types and how to characterize them

This chapter focuses on algorithmic building blocks for graph algorithms, with a particular emphasis on their scalability. Graph problems are commonly solved in Giraph using a number of patterns. Due to Giraph’s vertex-centric paradigm based on message-passing, patterns use a type of value propagation. The chapter presents the general pattern and looks at a number of typical problems that can be solved with this pattern, such as finding paths, ranking vertices, identifying components and communities, and predicting ratings. You also look at different types of graphs and how they can be characterized by some of the algorithms described in this chapter.

Designing Graph Algorithms That Scale

In the previous chapter, you saw that Giraph provides a programming model that lets you express graph algorithms through a simple paradigm. Vertices have values and can exchange data through messages in a number of iterations. Under the hood, the system takes care of executing the algorithm in parallel in a distributed environment. Although more restrictive, the paradigm has been designed specifically to put you in a position to produce scalable algorithms. An algorithm expressed following this model is inherently decentralized. Each vertex makes independent decisions (such as whether and how to update its value, send messages, or vote to halt) based on local information, such as its current vertex value and the messages it has received. Because each vertex makes decisions during every iteration based on this set of its own data, the execution of the algorithm can be massively parallelized. In fact, the user-defined function can be executed independently on all vertices in parallel across the available processing units. A model based on message-passing avoids expensive concurrency primitives—in particular in a distributed environment—such as locks, semaphores, and atomic operations, which are required by a model based on shared memory.

Using a restrictive model allows you to focus on the semantics of the algorithm and ignore the execution model, which comes with the framework. This way, little has to be reinvented each time, and only problem-specific code needs to be developed. Although the process is simple, you still have to consider a few important decisions when designing a new algorithm. It takes some time to get acquainted with the Giraph model, but once your brain clicks with the vertex-centric perspective, you’ll begin looking at graph problems in a totally different way.

Chapter 3 used the example of a social network in which “a person is as good as her friends.” That example made the point that the value of a vertex depends on the value of its neighbors; the values of the neighbors depend in turn on the values of their neighbors, and so on. This means to compute these values, information needs to be propagated through the structure of the graph, iteration after iteration. Information propagation is the basic and fundamental pattern behind the Giraph model. An algorithm in Giraph defines how information is propagated through the graph and how this information is used by each vertex to make independent decisions, such as whether and how to update its value. You’ll see how this pattern is used throughout all the algorithms presented in this chapter.

Note In Giraph, the global state of a computation is distributed across the vertex values, and it is shared through the messages that vertices send to each other.

According to this definition, if a vertex needs a certain piece of information owned by another vertex to compute its value, this piece of information must be delivered to the vertex through messages. To design a graph algorithm in Giraph and to build on the information-propagation pattern, you have to decide a number of things that define how the data associated with each vertex is initialized, accessed, exchanged, and updated—basically, how it is used.

In particular, you must define the following elements of the algorithm that are specific to a decentralized, vertex-centric approach:

Independent vertex decisions: The decisions made by each vertex, such as whether it should send a message or vote to halt, should be based on information owned by the vertex itself (local), and only on that (independent/decentralized).
Initialization of vertex values: Although it is obvious that vertex values need to be initialized correctly, value initialization is much more relevant for decentralized vertex-centric algorithms. Because vertices make decisions based on the current vertex value and the incoming messages, the path the computation takes depends on the initial values assigned to the vertices as much as on the graph structure.
Halting condition: Vertices make decisions independently, so it is important to design a halting condition that is consistent, is well understood, and, most important, can be decided on collaboratively by the vertices.
Aggregators and combiners: Sometimes global computations can simplify an algorithm or even be unnecessary. On these occasions, aggregators can prove very handy and do not undermine the scalability of an algorithm. When possible, an algorithm should use combiners to minimize the amount of resources—such as network and memory—used for messages.

The bottom line is that you should use a vertex-centric approach focusing on decentralized decisions based on local information, which are often more practical to parallelize.

The remainder of this chapter looks concretely at how existing algorithms define each of these points. The following sections have a two-fold function: they help you better understand the Giraph programming model and the concepts presented in this section, and they present solutions that often act as building blocks for more complex solutions. You may want to reuse some of the principles behind these solutions in your own applications.

Exploring Connectivity

A graph is nothing more than a bunch of vertices connected to other vertices, as you have learned. At this point, you consider two vertices to be connected if they are direct neighbors: if they are the two endpoints of a single edge. This is, however, a simplified view of the structure of a graph. Connectivity can be seen as a broader relationship between vertices. In fact, connectivity is a transitive relationship. In other words, if vertex A is connected to vertex B, and vertex B is connected to vertex C, then vertex A is connected to vertex C. The edge connecting A to B and the edge connecting B to C constitute a path. A path is a sequence of edges that connect vertices. If traversed, these edges “bring” from one vertex to another. Through paths, connectivity can be extended to vertices that are not neighbors; they are one of the fundamental tools to study the structure of a graph. This section looks at how you can compute shortest paths and use them to identify components in the graph that are subgraphs.

Computing Shortest Paths

Imagine that you have a social graph, and you want to find out whom you should ask to introduce you to tennis superstar Roger Federer. If none of your acquaintances knows Roger, chances are they may know somebody who knows him. Or they may know somebody who knows somebody who knows Roger, and so on. Basically, you are looking for a path of “friend of a friend” relationships that allows you to reach Roger. You are probably familiar with the theory of six degrees of separation. According to this theory, in our social relationships we are all separated on average by six steps. For example, each of us is on average six steps away from Roger Federer (or anybody else on Earth), in a long chain of “friend of a friend” relationships.

At the end of the 1960s, Stanley Milgram executed a simple experiment. He sent a letter to a number of randomly selected individuals in the United States; these people were called sources, and each letter contained the full name of a target individual (in Milgram’s experiment, living in Boston) and a roster. The recipient was asked whether they knew the target on a first-name basis. If that was the case, the recipient was asked to forward the letter directly to the target. If the recipient did not know the target, they were asked to forward the letter to a friend who they thought was more likely to know the target. Each recipient of the letter put their name on the roster, to keep track of the chain. When the letter was received by the target, the roster contained the entire chain of “friends of friends” connecting the initial randomly chosen source and the target individual. This procedure was performed for more than 600 individuals, and the result was that, for those letters that actually reached the target, the rosters contained on average between five and six names. The list of names contained in each roster was a path. Figure 4-1 illustrates the possible path of one of the letters; for this example, the target is Maria, and the randomly chosen source is Sam.

Figure 4-1. A fictitious path for one of the letters from Milgram’s experiment

In practice, what Milgram did with his experiment was to compute an approximation of the average path length of the social graph in the United States—and this number turned out to be about six. Often, multiple paths exist between two vertices, and you usually care about the shortest one. In an unweighted graph, the shortest path between two vertices is nothing more than the path between two vertices with the smallest number of edges. There are many ways to compute the distance between two vertices in a graph and the path with such length. Most of them can be thought of as variations of the technique used by Milgram in his experiment.

An algorithm to search for the shortest path(s) between two vertices—or the shortest path between a vertex and any other (reachable) vertex—is called a single-source shortest paths (SSSP) algorithm. A very intuitive and simple algorithm that can be used to this end is the so-called breadth-first search (BFS).¹ Starting from a source vertex, the algorithm visits all its neighbors. These neighbors are considered to be at distance 1, because they are separated from the source vertex by one edge. The next vertices to be visited are the neighbors of these neighbors, except those that have been already visited. These vertices are considered to be at distance 2 from the source, or one hop more distant than the vertices they were visited from. The algorithm continues like this until all vertices have been visited and all distances have been computed. Intuitively, the way BFS visits the vertices in a graph starting from a source vertex follows a wave-like pattern, just like the waves produced on a flat surface by a stone thrown into water. The paths are explored in breadth, all of them one hop at a time.

In the case of a weighted graph, edge weights represent distances between neighbors. Hence, the length of a path is computed as the sum of the weights of the edges that constitute the path. To support weights, the algorithm has to be slightly modified. Figure 4-2 shows the execution of SSSP on a weighted graph starting from the leftmost vertex, Mark. Any time vertices are visited, instead of adding one hop to the distance of the vertices they were visited from, the weight of the traversed edge is added. Look, for example, at how the distance of John is defined as the weight of the edge that connects him to Mark, and how the distance of Peter is this value plus the distance from John. If a vertex is visited multiple times, either in the same iteration or in a future one, its distance is updated only if the distance has improved. For example, notice how Maria decides her distance (6) based on the path that goes through Julia, because it is shorter than the distance that goes through Peter (8). Also, notice how Sophia updates her distance in two steps, as soon as the shorter distance is determined.

Figure 4-2. Example of execution of SSSP on a weighted graph

You can consider SSSP on an unweighted graph to be like SSSP on a weighted graph where all the edge weights are set to 1. In Figure 4-2, the current distance from the source is written inside each vertex, and the currently visited vertices are contained in the bag-like shape. Distances are initialized to infinity, except for the source vertex, which is initialized to 0. Initializing the distances to infinity guarantees that vertices use a new distance as soon as it is discovered (any distance is smaller than Infinity).

Implementing SSSP becomes very natural in the vertex-centric programming model provided by Giraph. First, all vertices store their distance from the source vertex in their vertex value. Edge values can store the edge weights. Messages can be used to propagate distances from neighbors to neighbors. Hence, distance is the information propagated through the graph. Listing 4-1 presents the pseudocode for this algorithm, and Figure 4-3 shows the flow of the computation in Giraph.

Listing 4-1. Weighted Single Source Shortest Paths (SSSP)

function compute(vertex, messages):
  if getSuperstep() == 0:                #1
    if isSource(vertex.getId()) is True: #1
      vertex.setValue(0)                 #1
      propagateDistance(vertex)          #1
    else:                                #1
      vertex.setValue(Inf)               #1
  else:
    minDistance = min(messages)          #2
    if minDistance < vertex.getValue():  #2
      vertex.setValue(minValue)          #2
      propagateDistance(vertex)          #2
  vertex.voteToHalt()

function propagateDistance(vertex):
  minDistance = vertex.getValue()
  for edge in vertex.getEdges():
    endpoint = edge.getTargetVertexId()
    weight = edge.getValue()
    sendMessage(endpoint, minDistance + weight) #3

#1 Vertex distances are initialized.
#2 Vertices compute the shortest distances.
#3 Vertices propagate the distances, considering edge weights.

Figure 4-3. Example of execution of SSSP in Giraph

According to the points outlined earlier, for the SSSP algorithm you make the following decisions:

Independent vertex decisions: Each vertex computes its distance based on its current value and the incoming messages. It selects the smallest distance contained in the messages and compares that distance to the shortest distance discovered so far. The vertex also decides whether to send new messages depending on the outcome of this comparison only, hence independently.
Initialization of vertex values: Each vertex makes decisions based on the comparison between incoming distances and its currently stored shortest distance, so you initialize the value of each destination vertex (thus excluding the source vertex) to infinity. This guarantees that as soon as the first shortest distance is discovered, the vertex will store this distance as its vertex value. On the other hand, the source vertex initializes its vertex value to 0. In this sense, the vertex-centric algorithm behaves like the general BFS algorithm outlined earlier.
Halting condition: The algorithm halts when all shortest distances have been discovered. Messages are sent only when a vertex discovers a new shortest distance, so no messages are produced when a distance is not improved. For this reason, it is sufficient for each vertex to vote to halt every time after it has evaluated a set of messages. If new candidate distances need to be evaluated, the vertex is woken up. Otherwise, the vertex remains inactive while storing the shortest distance discovered so far. Eventually, all vertices discover their shortest distance and stop sending messages. This heuristic allows a fully decentralized halting condition.
Aggregators and combiners: Aggregators are not necessary to compute shortest distances, unless general statistics need to be computed, such as the maximum or minimum shortest distance from the source. Instead, a combiner can play an important role in BFS, minimizing the number of messages that need to be transmitted.

Listing 4-2 shows the pseudocode for the combiner.

Listing 4-2. MinValue combiner for Single Source Shortest Paths

function ssspCombiner(messages):
  return min(messages)

The semantics of the combiner are simple. It produces a message that contains the minimum value contained in the messages it combines. Basically, it corresponds to line minDistance = min(messages) in Listing 4-1. This correspondence allows you to use (or not use) combiners transparently, without breaking the semantics of the algorithm. The effect of using the combiner is that each vertex may need to evaluate only a subset of the original messages sent to it (from the spectrum of all the original messages down to a single message). In other words, line 9 will not change its functioning as a result of a combiner being used. The only result is that unnecessary messages are discarded early by the combiner, and you avoid transmitting them.

Paths are very important in the study of graph connectivity. They measure the reachability of vertices. In a social network scenario like Facebook, reachability helps clarify the extent to which information can propagate in the network, or, in other words, to what extent content can go viral via the share functionality. A graph is said to be connected if all the vertices in the graph are connected: if for each pair of vertices, at least one path connects them. Intuitively, a social graph is connected if for each person you can find a chain of “friend of a friend” relationships to all the others. In principle, in a connected graph, the content shared by a user can potentially reach any other user.

This is no different from a more traditional question applied to computer networks: to what extent can computers transmit data to the other computers in a network? Computers are connected through routers and other networking infrastructure that allows them to communicate even if they are not directly connected to each other. But does the topology of the network allow any computer to exchange data with any other computer in the network? You can easily see that this is exactly the same as the dissemination of information in a social network. If you think of a computer network as a graph, the traceroute program does nothing more than compute a path between two computers.

Computing Connected Components

So far, you have looked mostly at connected graphs. Another related concept is that of a connected component. A connected component, in an undirected graph, is a connected subgraph of the graph. A graph can be composed of multiple subgraphs. Think of a social network composed of two groups of friends, where none of the individuals in the first group know any of the second (and vice versa, of course). Each of these groups of friends is a different component, because the members of each group are disconnected from the members of the other. You can think of connected components as graphs composing other graphs (hence the term subgraph).

For an undirected graph, the definition of a connected component is pretty straightforward. For directed graphs, the definition is trickier. The nature of directed graphs imposes that, for example, in a Twitter-like network, the existence of a path from user Mark to user Julia does not imply that the inverse also exists (paths depend on follower edge directions). Lady Gaga has millions of followers on Twitter, and if she shares content, this content may reach users very far away in the network. However, regardless of her millions of followers, it is less likely that content shared by the “average John” reaches Lady Gaga, because she probably does not follow him back.

For this reason, a directed graph can either be weakly connected or strongly connected. In the former case, if you substitute each directed edge with an undirected edge, you obtain a connected graph: there is a path that connects each pair of vertices in both directions. Note that certain vertices may be connected only because of the conversion to undirected; this is why the graph is considered weakly connected. In the latter case, a graph is already connected without the need to convert the graph. Figure 4-4 shows an example of a weakly connected directed graph and a strongly connected graph. To obtain the second graph, we added to the first a bunch of edges that help connect unreachable pairs. Weakly and strongly connected components are respectively weakly and strongly connected subgraphs of a graph.

Figure 4-4. Example of a weakly connected graph and a strongly connected graph

What makes connected components interesting is that they allow you to study the initial problem about dissemination of information in a social network. You can identify disconnected islands where information can propagate; but propagation eventually encounters a barrier that does not allow information to reach the rest of the graph because such a path does not exist. Many real-world graphs are characterized by multiple connected components, with one component comprising most of the vertices. This component is also called the largest connected component. This component is usually the focus of further analysis, and the remaining components are ignored. This filtering has two main causes. First, the smaller connected components may be considered noise, such as people who try Facebook once and never connect to anybody. Second, some algorithms assume a connected graph, and hence the largest connected component is usually selected as the most representative for the graph.

Computing weakly and strongly connected components in Giraph is straightforward. Let’s look at a approach that computes connected components in undirected graphs and weakly connected components in directed ones. The algorithm builds on the idea that in a connected component, information can propagate to all the vertices. In Giraph terms, this means if a vertex sends a message to its neighbors, and each neighbor forwards this message to its neighbors, the message eventually reaches all the vertices. In particular, if vertices have comparable IDs such that you can find the maximum or minimum one (for example, integers or strings), each connected component can be characterized by the maximum (or minimum) ID of a vertex that belongs to that component. If each vertex initially uses its ID as its value and sends that to its neighbors, and later the vertex assigns and propagates another ID only if it is larger than the currently assigned ID, eventually all vertices receive the largest ID in the component. Because the components are disconnected from each other, this information stays within the boundaries of each component, and at the end of the computation vertices belonging to different components are assigned a different value (the largest ID of a vertex in that component).

Listing 4-3 presents the pseudocode that implements this algorithm. For directed graphs, the graph must be first converted to a logically undirected graph—for example, using the algorithm presented in Chapter 3. Figure 4-5 shows the flow of the algorithm in Giraph.

Listing 4-3. Weakly Connected Components (WCC)

function compute(vertex, messages):
  if getSuperstep() == 0:                             #1
    vertex.setValue(vertex.getId())                   #1
    sendMessageToAllEdges(vertex, vertex.getId())     #1
  else:
    maxID = max(messages)                             #2
    if maxID > vertex.getValue():                     #2
      vertex.setValue(maxID)                          #2
      sendMessageToAllEdges(vertex, maxID)            #2
  vertex.voteToHalt()

#1 Vertices initialize their value to their ID and propagate it.
#2 Vertices update their value, if necessary, and propagate it.

Figure 4-5. Execution of the Weakly Connected Components algorithm in Giraph

Analyzing the algorithm with respect to the points presented earlier, you can outline the following decisions:

Independent vertex decisions: Each vertex computes its component ID based on its current value and the incoming messages. It selects the maximum ID contained in the messages and compares that to the current ID. It also decides whether to send new messages depending on the outcome of this comparison only, hence independently.
Initialization of vertex values: Each vertex makes decisions based on the comparison between incoming IDs and its currently stored ID, so you initialize the value of a vertex to the vertex ID. This guarantees that as soon as a larger ID is discovered, the vertex stores it as its vertex value.
Halting condition: The algorithm halts when the maximum ID has been propagated to all the vertices. Messages are sent only when a vertex discovers a new maximum ID, so no messages are produced when a new maximum ID is not discovered. For this reason, it is sufficient for each vertex to vote to halt every time after it has evaluated a set of messages. If new candidate IDs need to be evaluated, the vertex is woken up. Otherwise, the vertex remains inactive while storing the maximum ID discovered so far. Eventually, all vertices discover the maximum ID and stop sending messages. This heuristic allows a fully decentralized halting condition.
Aggregators and combiners: Aggregators are not necessary to compute connected components. Instead, a combiner can play an important role in this algorithm, minimizing the number of messages that need to be transmitted.

Listing 4-4 presents the pseudocode for the combiner.

Listing 4-4. MaxValue combiner for the Weakly Connected Component Algorithm

function weaklyConnectedComponentsCombiner(messages):
  return max(messages)

The semantics of the combiner are simple. It produces a message that contains the maximum value contained in the messages it combines. Basically, it corresponds to line maxID = max(messages) in Listing 4-2. Again, this correspondence allows you to use (or not use) combiners transparently, without breaking the semantics of the algorithm. As with BFS, the effect of using the combiner is that each vertex may need to evaluate only a subset of the original messages sent to it (from the spectrum of all the original messages down to a single message).

POPULAR DISTANCE-BASED GRAPH METRICS

SSSP and WCC are just two examples of algorithms that can be used to explore graph connectivity. Many metrics that are used to study graphs are based on paths and distances. Here are some of these metrics:

Eccentricity: The distance between that vertex and the furthest vertex. In other words, the largest distance between that vertex and any other vertex. Eccentricity is a metric used by the following metrics.
Diameter: The largest eccentricity of any vertex in the graph. The diameter is the longest shortest path in the graph. It can also define the maximum number of iterations needed for an algorithm to complete. For example, computing SSSP will not take more iterations than the length of the diameter of the graph (because it is the longest of the shortest paths to be discovered).
Radius: The smallest eccentricity of any vertex in the graph. It represents the shortest, among the longest shortest paths for any vertex. The radius gives a minimum number of iterations used by SSSP to complete. This the case when you compute SSSP starting from the center of the graph.
Center: Any vertex whose eccentricity equals the radius of the graph. The center of the graph are vertices that can reach any other vertex with few steps. These are central vertices to the graph because they can connect other vertices and shorten their paths. The information starting from these vertices can quickly reach any other vertex.
Peripheral: Any vertex whose eccentricity equals the diameter of the graph. As opposed to the center of the graph, peripheral vertices need more steps to spread their information. To reach the furthest of the reachable vertices, a peripheral vertex requires a number of steps equal to the diameter.

Centrality measures are also interesting ways to study a graph. Like the metrics just listed, they are based on paths and distances between vertices. Unfortunately, a thorough presentation of all the metrics goes beyond what this book can cover; you can learn more at the related Wikipedia pages, such as http://en.wikipedia.org/wiki/Centrality and http://en.wikipedia.org/wiki/Distance_(graph_theory).

You now understand how to compute paths and use them to measure different characteristics of a graph. Let’s move on to see how to rank vertices according to their importance in the graph.

Ranking Important Vertices with PageRank

A graph can have many vertices, but are they are equally important? Are all web pages equally important? Are all Twitter users equally important? Can you define a notion of importance in a graph? Important in this context is by no means a correct term, but it can often be useful to define metrics to rank vertices. Some of these metrics are domain-dependent, capturing the definition that is relevant for a particular problem—for example, by counting the number of Olympic medals as a way to rank athletes. Others look solely at the structure of the graph, trying to capture a more general picture. An example of such an algorithm to rank vertices in a graph is PageRank. As you saw in Chapter 2, PageRank was designed by Brin and Page at Google to identify important web pages and rank them in search results. You have also seen that PageRank is a graph algorithm that looks at the Web as a graph, with pages being vertices and hyperlinks being edges.

This section looks at how PageRank works and how to implement it in Giraph. There is some math involved, but we break it down and show you how it is translated into actual code, so you don’t need to remember it after you understand the underlying idea. First you look at the general model of travelling through a graph. Then you use this model to define the ranking of vertices through PageRank. After the basics are set, you learn how to implement PageRank in Giraph.

Ranking Web Pages

Although it was initially applied to the Web, PageRank can be applied to many other scenarios. To see how it works, let’s look at the Web as an example. Imagine that you were to open your browser and point it to a random URL (assuming you could pick a valid URL at random). You would then click a random link on the page and jump to that page. You would follow this procedure forever: an infinitely long sequence of random clicks from page to page. The question is, would you end up on certain pages more often than on others? After all, if a page is important on the Web, there should be many links pointing to that page. More links means more probability of clicking such links and more probability of ending up on certain pages. Moreover, it should be safe to assume that important pages often contain links to other important pages. This increases the likelihood of clicking a link to an important page. Figure 4-6 shows an example of a web graph where vertices have been ranked with PageRank; the vertex size reflects the PageRank.

Figure 4-6. A web graph where a page is represented by a vertex whose size depends on its PageRank

This is the model of the so-called random surfer. The model can be extended with teleportation: the random surfer sometimes jumps to a new random page, without following a link. This is a realistic scenario, if you think about it—you sometimes open a new tab and start surfing the Web from another page, either because you are interested in something new, or because the last page contained no links. Intuitively, performing random surfing on the Web for a long time, and counting the number of times you end up on each page, should give you a pretty good idea of how important pages are and let you rank them accordingly.

PageRank

The idea of infinitely surfing the Web is the idea behind PageRank. This concept is expressed formally like this:

Don’t be scared by the math, and bear with us for a moment. What this formula says is as simple as the following. The PageRank of a page is the sum of two components: the probability of landing on that page due to chance, and the probably of arriving there from another page. The dumping factor d allows you to control, as a weight, how much emphasis you want to give the first and the second components. The first component is obtained by dividing 1 by the total number of pages in the graph and is hence a uniform distribution. The second component is obtained by summing the partial PageRanks coming from the pages that have a link to the page for which you are computing the PageRank. The partial PageRank of a page is its PageRank divided by the number of links appearing on that page. It’s as simple as that!

One thing to notice in the definition of the PageRank is that it is recursive. The PageRank of each page depends on the PageRank of the incoming neighbors. How can you then compute the PageRank values? Every vertex PageRank depends on that of its neighbors. You need a starting point. Well, you can assign an initial value to each page and then iteratively compute the new PageRank value for each page with the formula. Although each vertex starts with the same initial value—typically 1 divided by the number of vertices in the graph—to mimic the initial random choice of page to start the random surf, iteration after iteration certain pages collect a higher PageRank due to the underlying structure of the graph. After some iterations, the PageRank values converge to stability, meaning the values change only slightly between the previous value and the next one. The pseudocode for PageRank is presented in Listing 4-5. Figure 4-7 shows the flow of the algorithm in Giraph when using a dumping factor of 0.85.

Listing 4-5. PageRank

function compute(vertex, messages):
  if getSuperstep() == 0:
    vertex.setValue(1 / getTotalNumVertices())        #1
  else:
    prSum = sum(messages)                             #2
    pr = (1 - D) / getTotalNumVertices() + D * prSum  #3
    vertex.setValue(pr)
  msg = vertex.getValue() / vertex.getNumEdges()      #4
  sendMessageToAllEdges(vertex, msg)
  if getSuperstep() == NUMBER_OF_ITERATIONS:
    vertex.voteToHalt()

#1 Initialization of the vertex value
#2 The sum of the partial PageRank values
#3 Computation of PageRank following the formula
#4 Current partial PageRank value to be sent

Figure 4-7. Three iterations of PageRank in Giraph

Analyzing the algorithm with respect to the points presented at the beginning of the chapter, you can outline the following decisions:

Independent vertex decisions: Each vertex computes its PageRank based on the incoming messages and the number of vertices (and the constant d). The computation is thus completely independent for each vertex.
Initialization of vertex values: Each vertex is initialized based on the number of vertices in the graph. This value is used to compute the first iteration.
Halting condition: Ideallys the algorithm would complete at convergence. Because convergence is reached only asymptotically, PageRank is usually computed for a fixed number of iterations (often fewer than 10 to 15 iterations). The halting condition is based on this number. An alternative approach requires you to compute how much the PageRank values of the vertices have changed between two iterations and use a threshold to decide to halt the computation.
Aggregators and combiners: An aggregator could be used to implement the second halting condition just mentioned. This would require having an aggregator collect the PageRank values between two iterations to perform the thresholding. A combiner can also play an important role in this algorithm, minimizing the number of messages that need to be transmitted.

Listing 4-6 shows the pseudocode for the combiner.

Listing 4-6. SumValues PageRank Combiner

function pageRankCombiner(messages):
  return sum(messages)

The combiner sums the values sent to a particular vertex; its semantics correspond to the line prSum = sum(messages) from Listing 4-5. Being the sum of an associative operation and a commutative operation, the combiner works transparently to the normal execution of the algorithm.

PageRank, with its model of the random surfer (or random walker), has been used for a number of applications in addition to ranking web pages. For this reason, understanding this algorithm and how to implement it in Giraph may prove useful to you; you may find an application or adaptation that can help you with a problem. This example is more than just the implementation of the PageRank algorithm; it shows a pattern in the design of an algorithm in Giraph. In particular, the algorithm implements a (stationary) iterative method that can be used to compute an approximate solution to a system by iteratively improving the solution until convergence. In this case, the vertex values represent the variables of the system represented by the graph and the PageRank formula. The algorithm discussed in the next section is similar in this respect.

Predicting Ratings to Compute Recommendations

Chapter 2 presented a use-case scenario that used a graph to model the relationships between users and items rated by those users. You saw a high-level way to build a recommender system on top of such a graph. This section presents an algorithm that follows the collaborative filtering (CF) approach to predict the ratings a user will give unrated items, based on the user’s past ratings. You first learn how to model users, items, and ratings as a graph, and then you see the design of an algorithm in Giraph to predict ratings. This section contains some math, but it’s broken down for you, and you see how to implement it in the code. You do not have to remember the precise mathematical formulation—just the idea behind it.

Modeling Ratings with Graphs and Latent Vectors

Think of all the sites where users can rate items like movies, books, and so on using, for example, stars on a scale from 1 to 5. An interesting question is whether you can predict, based on past ratings only, the ratings a user will give unrated items. If you can, then you can recommend to each user items with the highest predicted ratings. Many techniques try to achieve this goal, and discussing their differences goes beyond the scope of this book. This section examines the stochastic gradient descent (SGD) algorithm and how to implement it in Giraph. You learn how this algorithm requires minimal modification to implement another popular algorithm in this class: the alternating least squares (ALS) algorithm.

Figure 4-8 shows a conceptual version of a ratings database. Each column represents a user, and each row represents an item. Hence, each cell contains a user’s rating for an item. A question mark appears in each cell where no rating was provided; these are the ratings you want to predict. The basic assumption is that a user has assigned ratings based on a profile that describes the user’s taste. The other assumption is that the same kind of profile can be used to describe the items. Conceptually, you can think of this profiling data as a vector of variables, where each element of the vector represents a dimension of a profile space. You can think of each dimension as something like a genre or other feature describing the user/item space, but in practice this space is abstract and cannot be mapped to a human-friendly representation of particular genres or features. The elements of the vectors represent variables that are not directly observed by the system about the user’s taste or the item. For this reason, they are usually called latent variables, and the vector is called the latent vector of a user and of an item. Figure 4-9 shows such a project space of two dimensions, where items and users that are close together have similar latent vectors.

Figure 4-8. An example of a database of ratings. A question mark (?) indicates ratings that you want to predict

Figure 4-9. An example of a two-dimensional space of features of latent vectors. Users and items that are close together should have a respectively high predicted rating

Definition The dot product between vector [1, 2, 3] and vector [4, 5, 6] is the sum of the products of the corresponding elements: (1 * 4) + (2 * 5) + (3 * 6) = 32.

The interesting thing about these latent vectors is that if you compute a dot product between the latent vector of a user and the latent vector of an item, you can interpret the result as a prediction of a rating that the user would give to that particular item. The extent to which this prediction will be correct depends on the values that compose the vectors. With your algorithm, you want to find for each user and each item the latent vectors that will produce the most accurate predictions of ratings. But how can you do that? You can use the ratings you know already. To start, you initialize the latent variables of each user and item at random, and you compute the predicted rating for each user-item pair for which you know the correct rating. You can then compare the predictions with the known ratings, and you can try to update the latent variables to minimize the error. In principle, if you can get your set of variables to predict correctly the past ratings, they should be able to predict the future ones correctly. In other words, you want to train your system, and you want to use the known ratings as a training set.

Minimizing Prediction Error

The question now is, how do you modify the latent variables to minimize the prediction error? In the answer to this question lies the difference between SGD and ALS (and other optimization methods of this kind). First, the procedure is the same from the perspective of both a user and an item. The principle is that for a given user and their latent vector, you take the latent vectors of the items that user has rated, and you compute the predicted ratings for those items through a dot product. Conversely, for a given item and its latent vector, you take the latent vectors of the users who have rated that item, and you compute the predicted ratings of those users to that item through a dot product. You then modify the latent variables accordingly to minimize this error. Before digging in to how you perform this in parallel in Giraph, let’s look at the update function that modifies the latent variables to minimize the prediction error.

According to SGD, the update function of a latent vector for a user, given another latent vector of an item and the prediction error, can be used like this:

This formula gives you a new latent vector that produces a more accurate prediction. Keep in mind that the same formula can be used to compute the new latent vector for an item, given the latent vector of a user and the prediction error. Alpha represents the learning rate and can be used to speed up convergence at the cost of suboptimal convergence, and lambda is the regularization constant that is used to avoid over-fitting to the training data and improve prediction accuracy. As mentioned, the error can be computed via the dot product between the two vectors.

To this point, the update to a user latent vector or item latent vector has been a one-time operation. Chapter 2 introduced the concept of a bipartite graph: a graph that has two types of vertices, with the vertices of the first type connected only to the vertices of the second type, and vice versa. In this case, one type is users and the other type is items. You connect two vertices of different types with an edge if a rating was issued. The edge weight is used to store the rating. You keep the edge logically undirected by having two specular edges, one for each direction. Now comes the tricky part: you organize the computation so that at a given iteration, only one class of vertices is updated. During an iteration, you let only the vertices of one type execute their update function; during the following iteration, you let only the second type of vertices execute it. You continue like this, alternating which vertex type is updated during each iteration. For example, during an iteration, you let only users update their latent vectors based on the latent vectors of the items, and during the following iteration you let only the items update their latent vectors based on the latent vectors of the users (computed during the previous iteration). You divide the iterations according to their number and let one type of vertex execute during odd iterations and the other during even iterations.

This mechanism allows you to update the latent vectors of the users based on the freshly computed latent vectors of the items, and vice versa. This creates a “ping pong” computation, where a group of vertices updates their values based on the values of the other group. This computation continues for a number of iterations or until the latent vectors converge to a stable state: that is, they do not change more than a certain threshold between two iterations. Letting both types of vertices update their latent vectors during the same iteration would yield inaccurate results.

It is important to understand what is happening in the graph during the computation. You can see from the formula that the latent vectors are recursively interleaved. The latent vector of a user is updated based on the latent vectors of the items the user has rated, and these items’ latent vectors are updated as well, based on the other users who have rated them. And these users may have rated different items whose latent vectors have been influenced by yet another set of users. The latent vectors are influenced not only by direct neighbors but also throughout the entire graph, iteration after iteration. In practice, the user vertices act as bridges between the item vertices (because users are not directly connected), and vice versa. The computation of the latent vectors of each vertex in the graph is in the end influenced by the latent vectors of all the other vertices.

In the Giraph model, SGD is implemented following this pattern. At each superstep, only one of the two types of vertices is active and able to update the latent vectors; the other type is inactive. The latent vectors from the neighbors used to update a single latent vector are transmitted from each vertex to its neighbors through messages. Edge values are used to store the ratings.

Listing 4-7 presents the code for SGD, and Figure 4-10 shows the computation of the algorithm in Giraph. In particular, note in the figure how only one type of vertex is active at each iteration.

Listing 4-7. Stochastic Gradient Descent (SGD)

function compute(vertex, messages):
  if getSuperstep() == 0:
    value = generateRandomLatentVector()                 #1
    vertex.setValue(value)                               #1
    if vertexType(vertex.getId()) == getSuperstep() % 2: #2
      propagateValue(vertex, value)
  elif getSuperstep() < NUMBER_OF_ITERATIONS:
    errorRMSE = 0
    for message in messages:
      rating = vertex.getEdgeValue(message.getSourceId())
      msgValue = message.getValue()
      oldValue = vertex.getValue()
      newValue = computeValue(oldValue, msgValue, rating)  #3
      error = computeError(newValue, msgValue, rating)     #4
      vertex.setValue(newValue)
      errorRMSE += pow(error, 2)
    aggregate(errorRMSE)                                   #5
    propagateValue(vertex, vertex.getValue())
  vertex.voteToHalt()

function computeError(value, msgValue, rating):
  prediction = value · msgValue                            #6
  return max(min(prediction, 5), 1) – rating

function computeValue(value, msgValue, rating):
  error = computeError(value, msgValue, rating)
  return value - ALPHA * (LAMBDA * value - error * msgValue) #7

function propagateValue(vertex, value):
  message = Message(vertex.getId(), value)
  sendMessageToAllEdges(vertex, message)

#1 Latent vectors are initialized at random.
#2 You divide the vertices according to their type.
#3 Compute the new latent vector.
#4 Compute the error with the new latent vector.
#5 Aggregate the error.
#6 Use a dot product to compute the prediction error.
#7 This is the actual SGD formula in the code.

Figure 4-10. The flow of the computation of SGD in Giraph

Note that the operations on the vectors are actually vector operations, so you have to implement the dot and scalar products yourself. Also, this code assumes that given a vertex ID, you can identify the type of vertex (user or item)—for example, via a prefix in the ID. This is what vertexType() does: it returns a value of 0 or 1 depending on the type of vertex. The code assumes a class called Message that is a wrapper for the normal message and also stores the ID of the sender vertex, because you need to identify in which edge to store the respective latent vector.

Analyzing the algorithm with respect to the points presented at the beginning of the chapter, you can outline the following decisions:

Independent vertex decisions: Each vertex updates its latent vector based on the incoming latent vectors and its current latent vector. This way, each vertex can independently compute its latent vector in parallel.
Initialization of vertex values: Each latent vector is initialized randomly during the first superstep so that during the following ones, the compute() method can update them according to the SGD formula.
Halting condition: Ideally, the algorithm would complete at convergence. Because a convergence is reached only asymptotically, usually SGD is computed for a number of iterations. The halting condition is based on this number. An alternative approach requires computing how much the overall mean prediction error has been improved between two iterations. When this error is not improved overall a certain threshold, the computation can halt.
Aggregators and combiners: An aggregator could be used to implement the second halting condition mentioned. This would require having an aggregator collect the RMSE values between two iterations to perform the thresholding. Combiners have no use in this application, because you need to keep the latent vectors distinct to compute the SGD formula.

Listing 4-8 shows the pseudocode for the aggregate() method for the aggregator used in SGD.

Listing 4-8. SGDAggregator

function aggregate(value):
  setAggregatedValue(getAggregatedValue() + value)

The aggregator is initialized to 0 at the beginning of each superstep, and it sums the values sent by each vertex. At the end of the computation, the aggregator will contain the sum of all the errors at the last superstep. By dividing this value by the number of ratings, you can compute the average error. This aggregator is just summing values, so there is nothing specific to SGD; it is usually called SumAggregator, which can be used as is in different algorithms.

Keep in mind that this algorithm only computes the latent vectors. The latent vectors are then used by another application to compute recommendations. The naive approach would be to perform a dot product between each user and all the items that user has not rated yet, and select the top-K. With millions of users and items, this approach most likely is not feasible. Instead, you can use the topology of the graph and compute predictions only between a user and the items that have been rated by other users who have some ratings in common with that user.

Another interesting aspect of this algorithm is that with a little modification, you can also implement another recommendation algorithm such as ALS. The only part you need to modify is the logic behind computeValue(), which is responsible for the method-specific math.

Now that you have learned how to rank vertices in a graph, you are ready to move to the next problem: how to identify communities in a social network. The next section presents a very popular algorithm called label propagation.

Identifying Communities with Label Propagation

Social networks often share properties in their graphs. Look, for example, at Figure 4-11, which shows a social network of individuals. It is a simplified example, and the structure is overemphasized, but it should make the point. What do you see in this figure? The graph is divided in communities, or clusters. Communities are groups of vertices that tend to be connected to each other more than they are connected to vertices outside the community. Two good friends tend to have many friends in common—not all of them, but many. And these friends they have in common also tend to be friends with each other—again, not all of them, but this is the characteristic of groups of friends (and other groups, such as colleagues). In graph terms, vertices that are embedded in a community tend to have a high clustering coefficient, which measures the extent to which the neighbors of a vertex are also connected with each other.

Figure 4-11. Example of a community structure in a graph

Identifying communities is an interesting problem that may yield useful information. For instance, you could recommend other users to connect to. Another option is to use the identified community structure to study other dimensions, perhaps for analytics. Does information exit the boundaries of communities (such as shared pictures or posts)? Do people who belong to the same community share characteristics such as age, hobbies, or type of work?

The basic idea of identifying communities is to assign labels to vertices, where each label represents a community. Typically, community-detection algorithms do not require specifying the number of communities in advance, as clustering algorithms tend to require (for example, k-means). The idea is that you should find those that are in the graph (you do not know how many). Keep in mind that in the real world, individuals belong to multiple communities at the same time: they have different groups of friends, co-workers, family, and so on. Algorithms that assign multiple labels (multiple communities) to vertices are also said to identify overlapping communities. This section, however, focuses on a simple algorithm called the label propagation algorithm (LPA) that does not detect overlapping communities, but that can be (and has been) adapted to accomplish that goal.

How do you assign labels to vertices? Well, the idea is simple and the intuitions behind it are as follows. First, at the end of the computation, each vertex in the same community should have the same label. If two vertices belong to the same community, then this should be reflected by the fact that they have the same label. Second, a vertex should have the same label as its neighbors, or at least most of them (remember that a user might have friends who belong to a different community, whose members are closer friends with that user).

The algorithm works as follows. Each vertex has a label, initially its own ID, that it sends to its direct neighbors through messages. The simple heuristic that each vertex applies is to acquire the label that is occurring most frequently among its neighbors, breaking ties randomly (that is, if two labels occur most frequently across neighbors and have the same frequency, choose randomly). When a vertex acquires a new label, it sends the label to its neighbors. This is why the algorithm is called the label propagation algorithm. For example, Lady Gaga initially propagates the string “Lady Gaga”, which is her ID. At each iteration, each vertex acquires a new label if it finds a label that occurs more frequently across its neighbors than its current one. In the case of Lady Gaga, after some iterations, the label appears more frequently in the neighborhood of her fans should be “Lady Gaga”. In that case, the vertex propagates the label to its neighbors through a message. The algorithm halts when no vertex changes label. At the end of the computation, each vertex holds the label representing its community. This label corresponds to the ID of a vertex in its community—a vertex that is more “central” to the community it represents (in this example, “Lady Gaga”).

The algorithm is as simple as that. At the first iteration, there are as many labels as vertices. When each vertex receives IDs from its neighbors (with each label occurring exactly once), it must choose randomly among them. IDs of vertices with many outgoing edges are more likely to be chosen by the random tie-breaking heuristic. In the following iteration, more vertices in the neighborhoods of these vertices have that same label, hence increasing the likelihood of that label being assigned to more vertices. Iteration after iteration, labels spread in the neighborhoods and compete with each other, until a point of equilibrium is reached and no more label changes occur. How and when this point of equilibrium is reached depends on the topology of the graph and on its community structure.

Warning Running LPA in a synchronous system like Giraph requires some ad hoc arrangements, because it can result in unstable states in which the algorithm oscillates between two or more solutions and never halts.

To run LPA in Giraph, you must modify the algorithm slightly. The reason is that LPA does not play well with synchronous systems like Giraph. Running LPA in synchronous systems can cause unstable states where some vertices keep oscillating between two or more labels, avoiding convergence. Consider this simple example. Imagine a graph with two vertices A and B, connected only to each other. At the first superstep, vertex A receives label B and vertex B receives label A. At this point, each vertex decides to acquire the received label, because it is the one occurring most frequently across their neighbors, and each vertex sends its label again through a message. At the next superstep, the same situation occurs, but with inverse labels. The problem is that the algorithm continues, with the two vertices trying to “catch” each other forever. This happens not only with the trivial case of two vertices connected by one edge, but also with more complex topologies. The good news is that this can be fixed. First you need to ensure that a vertex has a deterministic way to break ties—for example, by acquiring the smallest label between the (equally) most frequently occurring labels. Second, a vertex needs to avoid re-acquiring the label it had before the current label. This guarantees that vertices will reach a non-oscillating state.

Listing 4-9 presents the pseudocode of LPA. LPA works best on undirected graphs, so you assume the graph has been converted to a logically undirected graph. Figure 4-12 shows the flow of LPA in Giraph. Note that for the sake of presentation, the figure shows the flow of the algorithm without the ad hoc modifications to work in Giraph. The code uses an LPAValue object as a vertex value that contains both current and previous values, to allow you to implement the technique mentioned to guarantee convergence. Also, note that you store labels in edge values. This is because vertices send new labels only when they change. When a vertex computes label frequencies, it needs the labels of all the neighbors, including labels sent a few supersteps earlier. This is why you store labels in the corresponding edge. This comes in handy because labels are easily serialized for checkpointing by Giraph, and you don’t need additional data structures for labels.

Listing 4-9. Label Propagation Algorithm (LPA)

function compute(vertex, messages):
  if getSuperstep() == 0:                                       #1
    vertex.setValue(LPAValue(vertex.getId(), vertex.getId()))   #1
    propagateValue(vertex, vertex.getId())                      #1
  else:
    for message in messages:                                    #2
      label = message.getValue()                                #2
      source = message.getSourceId()                            #2
      vertex.setEdgeValue(source, label)                        #2
    labels = DefaultDictionary(0)
    for edge in vertex.getEdges():                              #3
      labels[edge.getValue()]++                                 #3
    mfLabel = computeMostFrequentLabel(labels)
    cLabel = vertex.getValue().getCurrent()
    lLabel = vertex.getValue().getLast()
    if labels[cLabel] < labels[mfLabel] and mfLabel != lLabel:  #4
      vertex.setValue(LPAValue(cLabel, mfLabel))                #4
      propagateValue(vertex, mfLabel)                           #4
  vertex.voteToHalt()

function computeMostFrequentValue(labels):
  maxFreq, mfLabel = -INFINITY, None
  for label, freq in labels:
    if freq > maxFreq or (freq == maxFreq and label < mfLabel): #5
      maxFreq, mfLabel = freq, label
  return mfLabel

function propagateValue(vertex, value):
  message = Message(vertex.getId(), value)
  sendMessageToAllEdges(vertex, message)

class LPAValue:
  function LPAValue(lastLabel, currentLabel):
    this.lastLabel = lastLabel
    this.currentLabel = currentLabel
  function getCurrent():
    return this.currentLabel
  function getLast():
    return this.lastLabel

#1 Vertices use their own ID as the initial label.
#2 Vertices store the new labels as edge values.
#3 Occurrences of labels are computed.
#4 The current label is updated if necessary.
#5 Find the most frequently occurring label with the smallest ID.

Figure 4-12. The flow of the computation of LPA in Giraph

Analyzing the algorithm with respect to the points presented at the beginning of the chapter, you can outline the following decisions:

Independent vertex decisions: Each vertex chooses its label according to the label of the neighbors and its current label. Because the neighbors’ values also depend on their respective neighborhood, a vertex can only consider its direct neighbors for its own label. Hence it has all the necessary information in these elements. The tie-breaking strategy is also important, because due to its deterministic nature (it is based on an ordering of labels), it is consistent across different vertices.
Initialization of vertex values: Each vertex uses its own ID as the initial value. This value is by definition unique, which makes it possible to uniquely identify communities at the end of the computation.
Halting condition: Each vertex propagates its value only when the vertex changes the value. The vertex votes to halt at the end of each superstep. Hence, when no vertex changes its label, no message is produced, and all the vertices are inactive. Because of the synchronous model implemented by Giraph, you have to introduce the tie-breaking heuristic described earlier to guarantee that the job will halt.
Aggregators and combiners: Combiners are not usable given the way the algorithm is implemented. You want to keep the messages uncombined because you need to keep track of senders. An aggregator could be used to keep track of the number of communities found and their size.

You have seen how to detect communities with an algorithm that propagates membership to communities following the Giraph information-propagation pattern. The algorithm is very scalable and simple to implement. The algorithms presented so far should give you a taste of how to design an algorithm for Giraph and what to pay attention to. Hopefully you’ll be inspired to get your hands dirty with the Giraph API.

By now you should understand what is relevant and what to avoid when designing an algorithm that scales. For convenience, the following checklist summarizes the items you should consider when designing an algorithm for Giraph.

GIRAPH ALGORITHM DESIGN CHECKLIST

Keep he following things in mind when you design an algorithm for Giraph. Don’t consider this a definitive and complete selection of items that will guarantee scalability; but this list should bootstrap your analysis in the right direction.

The decisions a vertex makes are based on local information, such as its current vertex and edge values; the incoming messages; and the current superstep.
A vertex sends messages mostly to known vertices, such as its neighbors, and it does not have to look up vertex IDs.
The graph and the algorithm do not imply patterns in which vertices receive messages from all other vertices.
The halting condition is clear and consistent.
Vertex values have a well-defined initial value that is coherent with the heuristics applied by the vertices.
If possible, messages are combined.
A vertex value’s size is not proportional to the size of the graph.
The algorithm considers side cases like a vertex with no edges or only outgoing or incoming edges, and so on.
The size of messages does not increase too much over time (for example, exponentially), such as when incoming messages are concatenated and propagated to all neighbors iteration after iteration.
Although the API supports adding and removing vertices and edges, the algorithm should not assume that an entire graph is constructed iteration after iteration from scratch.
Aggregators are associative and commutative functions, and they do not occupy space proportional to the size of the graph (such as a “fake” aggregator that stores all the incoming values in memory).
When a received message needs to be reused, if possible store it locally instead of requiring it to be sent again, to save network I/O.

Characterizing Types of Graphs and Networks

Graph algorithms can be useful to build applications and run analytics on your graph. Depending on your business scenario, if you are modeling your data as a graph, you can find a bunch of graph algorithms that will prove useful to you. Some of them will probably belong to one of the classes of algorithms presented so far in this chapter. However, running graph algorithms on your graph can also help you better understand the processes or phenomena underlying your data. For example, you could study the type of social interactions that result in certain relationships in a social network, or the hyperlink structure of web pages.

Although all graphs consist of a number of vertices connected by edges, modeling various aspects of the world, the topologies of graphs modeling different data often have characteristics in common. There are many types of graphs, each with specific properties, many of which are common in real-world graphs—in particular the graphs described so far, such as social networks, the Internet, and the Web. This section presents a selection of the most common graph types and explains how to recognize whether a graph is of each type. It is very important for you to get to know your graph, and some of these algorithms are tools that allow you to do so.

Three characteristics are at the core of this analysis:

Average path length: The average number of hops between two nodes chosen at random. This number is usually compared to the number of vertices in the graph.
Degree distribution: The way vertex degrees are distributed across the vertices. Do vertices have approximately the same number of edges, or do a few have many edges while most of the others have a small number of edges?
Clustering coefficient: The extent to which vertices are embedded in tightly connected clusters.

These are only some of the graph characteristics you may want to look at to identify a graph type, but they are usually sufficient to give you a good picture of the graph. Note that these characteristics relate only to the topology of the graph and do not depend on what the graph actually represents (web pages, individuals, users and items, and so on).

For example, consider a graph that represents the United States’ numbered highway system. As you saw in Chapter 2, the road network can be modeled with a graph by using a vertex for each point where multiple roads can be chosen—such as a crossing, town, or city—and edges to represent the roads. If you look at Figure 4-13 and consider the road network as a graph, you can see that the graph has the following properties:

All vertices tend to have a low degree. Most have five or six edges, some more, some fewer. This is natural, because crossings and cities are connected directly only to the closest neighbors.
The clustering coefficient is very low, and neighbor vertices tend to share very few neighbors, if any.
The graph has a very large average path length. If two cities are far from each other, such as New York and Los Angeles, the path that connects them in the graph is long.

Figure 4-13. The United States numbered highway system

These properties are expected, because a road network resembles a grid (or lattice, or mesh).

As the name hints, a grid is a graph that, when drawn, has regular tiles, like squares. Figure 4-14 shows such a graph with 15 vertices. This is a very regular graph with a large average path length, which increases quickly as the graph grows in the number of vertices and edges. The degree distribution is very regular: the vast majority of the vertices have exactly four neighbors, except the vertices at the borders of the graph. The average clustering coefficient is zero, because vertices do not share any neighbors with their neighbors. This is a boring graph, and few real-world graphs have such a strict topology (although the road network in Figure 4-13 resembles this topology in a more relaxed way). However, as a toy graph, it serves the purpose of illustrating the discussion. Things get more interesting in a second.

Figure 4-14. Example of a grid graph

Now look at the routes of domestic airlines in the United States, shown in Figure 4-15. This is a very different graph, although it still connects cities in the United States. First, notice how certain (few) cities have a very high degree, whereas most have a low degree. Those with high degree are called hub airports, because they can be used to connect many cities with connecting flights. In graph terms, vertices with this kind of role in a graph are also called hubs. This degree distribution resembles a power-law distribution. This kind of connectivity pattern allows an important drop in the average path length. Graphs with these type of connectivity are also called scale-free networks, because when you add vertices to the graph, the average shortest path length tends to remain low. Every city can reach any other city by passing through hubs, without the need to connect every city with every other city. The clustering coefficient of this graph is still pretty low, because airlines try to remove redundant flights that can be replaced by two connected routes through hubs. Of course, the flight routes do not allow all cities to be reached—only those that have a (large enough) airport.

Figure 4-15. United States domestic flight routes

Suppose you merged the two graphs together, connecting cities through roads and airlines routes. What would their effect be on each other?

Because you would be connecting cities far from each other through flight routes, you would still have a graph with a smaller average path length than the pure road map. Still, the average path length would be larger than the average path length of the airline map, because certain paths would need to be traversed through the road map. Also, the degree distribution of the merged graph would have a power-law distribution, due to the hubs. The clustering coefficient would remain low, because both graphs have a low clustering coefficient and you would conveniently only add long-range connections to the road map. In graph terms, merging the two graphs is analogous to rewiring.

Let’s go back to the lattice in Figure 4-14. Take a number of edges from this graph, and rewire them at random. This means you take a number of the edges (say, 18%) and reconnect one endpoint to a vertex chosen at random in the graph. How does this rewiring affect the three characteristics of the graph? Well, most important, it decreases the average path length. In the grid topology, the average path length tends to be large because there can be vertices that are far from each other, such as those at far left and those at far right in Figure 4-14. Figure 4-16 presents such a rewired graph. Rewiring vertices in the grid increases the chance that areas that are far from each other in the graph are connected through the rewired edge. Basically, these rewired edges act like bridges. Shortest paths can now traverse these bridges, with an impact on their length. As far as the degree distribution is concerned, because you choose endpoints at random, you may end up adding more edges to certain vertices than the edges you remove from those same vertices. Hence, their degree may change, but on average edges should still be uniformly distributed. This means you have a slightly different degree distribution than the deterministic distribution of the grid. In other words, the average vertex degree should be very similar, but this time the deviation should be larger than in the grid topology. For the same reason, the average clustering coefficient should be different than 0 (although still very small), because there is a certain likelihood that a rewired edge introduces a triangle.

Figure 4-16. The grid and the rewired grid with 18% of edges rewired at random

To take this example to an extreme, you can generate a fully random graph, by taking the 15 vertices and connecting them completely at random with the same number of edges as before (continuing to rewire at random until you obtain a connected graph). For the same reasons discussed, this random graph, while having the same number of vertices and edges, has an even shorter average path length, a degree distribution with a similar mean but higher deviation, and a slightly higher clustering coefficient (although still small). Figure 4-17 shows such a graph. Grids and random graphs are not very realistic representations of real-world networks, but they serve to introduce the three characteristics that are commonly used to describe graph types.

Figure 4-17. A random graph can be obtained by rewiring all the edges at random

Before moving on to a realistic graph, let’s go back to the rewiring of the grid. If you used another rewiring strategy when rewiring a percentage of the edges, you could obtain a graph that resembles the graph that merges airlines map and the road map: it would have a power-law degree distribution. To obtain such a graph, when you choose the new endpoint of a rewired edge, instead of choosing randomly, you choose a vertex with a probability that is proportional to its degree. Vertices with high degree have a higher probability and are chosen more often. The more a vertex is chosen as the new endpoint for the rewired edge, the more likely it is that the next rewired edges will choose that vertex as an endpoint (due to its increasing degree). This process tends to make big vertices bigger. At the end of the process, you obtain a power-law distribution with few hubs and a long tail of vertices with low degree (this strategy is also called preferential attachment)—exactly like the merged graph. Figure 4-18 shows such a graph.

Figure 4-18. A grid with 60% of the edges rewired following preferential attachment

These examples have shown the relationships between graph topology and the degree distribution, the clustering coefficient, and the average path length. You have also learned about hubs and the way they influence these metrics. Let’s now look at a final important type of graph. In the last decade, the study of real-world networks has shown that many of these networks have small-world properties (that in turn makes them small-world networks or graphs). Social graphs tend to have such properties. Think of your network of friends: you tend to have friends in the region where you live (which may be the region where you also grew up), and many of these friends tend to be friends with each other. After all, people tend to be part of many communities. This means social networks tend to have a high clustering coefficient. Such a network with a local nature would tend to have a large average path length. However, you may also have friends who live far from you and who belong to very different communities. For example, you might have a friend in Japan, which puts you only a few hops from many Japanese people. These particular friendships act like bridges in the rewired grid, which significantly lowers the average path length. Finally, some people have a much larger network of friends than the average, perhaps because they travel a lot, or due to their job. This means social networks also have hubs. Summarizing, small-world networks tend to have a high clustering coefficient, small average path length, and a power-law degree distribution. Small-world properties are the reason Milgram discovered such a small average path length in a large social network like the population of the United States. The same sort of experiment was recently executed on the Facebook social graph, revealing a small average path length of four to seven! Figure 4-19 shows a graph with small-world properties.

Figure 4-19. A graph with small-world properties

You need a better understanding of your graph in order to design, build, and master graph applications. If you were to build a social network such as Twitter, you would be interested in the way information (tweets) propagates through the followers network. In a social network with a low average path length, you know that only a few retweets would be necessary to reach distant areas of the graph. Moreover, a graph with a high clustering coefficient would increase the likelihood that information would reach a large portion of a community from which it started.

Graph topology has many more aspects, such as how resilient a graph is to attacks or failures, and so on; but network science constitutes a larger body of work than this book can present. However, with the tools provided in this chapter, you should be ready to start your journey toward a better understanding and management of your graph data.

Summary

Modeling data as a graph is the first step toward using Giraph. The next step is to design an algorithm that can process it:

In Giraph, algorithms build on the principle of information propagation, where the state of a computation is distributed across the vertex values, and vertices share it through messages.
Paths are sequences of edges that connect vertices. SSSP is an algorithm to compute the distance between a source vertex and all other reachable vertices, and it fits the Giraph model naturally.
Paths can be used to define connected components, which are subgraphs where each vertex can reach any other vertex. Connected components can be used to study how information propagates in the graph.
Certain vertices are more important than others, and it is useful to identify them. PageRank is an algorithm that lets you rank vertices according to their importance in the graph.
If you model users and items and their ratings as a graph, you can compute recommendations for users about new items through the stochastic gradient descent (SGD) algorithm.
Graphs, and in particular social networks, can be characterized by communities (or clusters) where vertices are tightly connected to each other. Label propagation allows you to identify these communities and label each vertex with the community it belongs to.
Different types of graphs, regardless of the data they represent, are characterized by particular topological properties. Using a few metrics and some graph algorithms to compute them, you can find out the type of a graph.

Giraph provides a programming API that helps developers design scalable graph algorithms. You are now ready to dig into the Giraph Java API, to implement and run algorithms on a graph with a Hadoop cluster.

__________________

¹BFS is a strategy to traverse a graph that can be used for computing shortest paths but also for other operations. There are also other algorithms that can be used to compute shortest paths.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Chapter 4: Giraph Algorithmic Building Blocks

Create new playlist

Sign In

Sign Up

Table of Contents for
Chapter 4: Giraph Algorithmic Building Blocks