Chapter 7. Working with Graph Algorithms

In this chapter, I'll delve into graph libraries and algorithm implementations in Scala. In particular, I will introduce Graph for Scala (http://www.scala-graph.org), an open source project that was started in 2011 in the EPFL Scala incubator. Graph for Scala does not support distributed computing yet—the distributed computing aspects of popular graph algorithms is available in GraphX, which is a part of MLlib library that is part of Spark project (http://spark.apache.org/docs/latest/mllib-guide.html). Both, Spark and MLlib were started as class projects at UC Berkeley around or after 2009. I considered Spark in Chapter 3, Working with Spark and MLlib and introduced an RDD. In GraphX, a graph is a pair of RDDs, each of which is partitioned among executors and tasks, represents vertices and edges in a graph.

In this chapter, we will cover the following topics:

  • Configuring Simple Build Tool (SBT) to use the material in this chapter interactively
  • Learning basic operations on graphs supported by Graph for Scala
  • Learning how to enforce graph constraints
  • Learning how to import/export graphs in JSON
  • Performing connected components, triangle count, and strongly connected components running on Enron e-mail data
  • Performing PageRank computations on Enron e-mail data
  • Learning how to use SVD++

A quick introduction to graphs

What is a graph? A graph is a set of vertices where some pairs of these vertices are linked with edges. If every vertex is linked with every other vertex, we say the graph is a complete graph. On the contrary, if it has no edges, the graph is said to be empty. These are, of course, extremes that are rarely encountered in practice, as graphs have varying degrees of density; the more edges it has proportional to the number of vertices, the more dense we say it is.

Depending on what algorithms we intend to run on a graph and how dense is it expected to be, we can choose how to appropriately represent the graph in memory. If the graph is really dense, it pays off to store it as a square N x N matrix, where 0 in the nth row and mth column means that the n vertex is not connected to the m vertex. A diagonal entry expresses a node connection to itself. This representation is called the adjacency matrix.

If there are not many edges and we need to traverse the whole edge set without distinction, often it pays off to store it as a simple container of pairs. This structure is called an edge list.

In practice, we can model many real-life situations and events as graphs. We could imagine cities as vertices and plane routes as edges. If there is no flight between two cities, there is no edge between them. Moreover, if we add the numerical costs of plane tickets to the edges, we say that the graph is weighted. If there are some edges where only travels in one direction exist, we can represent that by making a graph directed as opposed to an undirected graph. So, for an undirected graph, it is true that the graph is symmetric, that is, if A is connected to B, then B is also connected to A—that is not necessarily true for a directed graph.

Graphs without cycles are called acyclic. Multigraph can contain multiple edges, potentially of different type, between the nodes. Hyperedges can connect arbitrary number of nodes.

The most popular algorithm on the undirected graphs is probably connected components, or partitioning of a graph into subgraph, in which any two vertices are connected to each other by paths. Partitioning is important to parallelize the operations on the graphs.

Google and other search engines made PageRank popular. According to Google, PageRank estimates of how important the website is by counting the number and quality of links to a page. The underlying assumption is that more important websites are likely to receive more links from other websites, especially more highly ranked ones. PageRank can be applied to many problems outside of websites ranking and is equivalent to finding eigenvectors and the most significant eigenvalue of the connectivity matrix.

The most basic, nontrivial subgraph, consists of three nodes. Triangle counting finds all the possible fully connected (or complete) triples of nodes and is another well-known algorithm used in community detection and CAD.

A clique is a fully connected subgraph. A strongly connected component is an analogous notion for a directed graph: every vertex in a subgraph is reachable from every other vertex. GraphX provides an implementation for both.

Finally, a recommender graph is a graph connecting two types of nodes: users and items. The edges can additionally contain the strength of a recommendation or a measure of satisfaction. The goal of a recommender is to predict the satisfaction for potentially missing edges. Multiple algorithms have been developed for a recommendation engine, such as SVD and SVD++, which are considered at the end of this chapter.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset