GraphFrames

Note that so far, to compute any interesting indicators on a given graph, we had to use the compute model of the graph, an extension of what we know from RDDs. With Spark's DataFrame or Dataset concept in mind, the reader may wonder if there is any possibility to use an SQL-like language to do run queries against a graph for analytics. Query languages often provide a convenient way to get results quickly.

This is indeed possible with GraphFrames. The library was developed by Databricks and serves as natural extension of GraphX graphs to Spark DataFrames. Unfortunately, GraphFrames are not part of Spark GraphX, but instead available as Spark package. To load GraphFrames upon starting spark-submit, simply run

spark-shell --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11

and suitably adapt preceding version numbers for both your preferred Spark and Scala versions. Converting a GraphX Graph to a GraphFrame and vice versa is as easy as it gets; in the following we convert our friend graph from earlier to a GraphFrame and then back:

import org.graphframes._

val friendGraphFrame = GraphFrame.fromGraphX(friendGraph)
val graph = friendGraphFrame.toGraphX

As indicated before, one added benefit of GraphFrames is that you can use Spark SQL with them, as they are built on top of DataFrames. This also means that GraphFrames are much faster than Graphs, since the Spark core team has brought a lot of speed gains to DataFrames through their catalyst and tungsten frameworks. Hopefully we see GraphFrames added to Spark GraphX in one of the next releases.

Instead of looking at a Spark SQL example, which should be familiar enough from previous chapters, we consider another query language available for GraphFrames, that has a very intuitive compute model. GraphFrames has borrowed the Cypher SQL dialect from the graph database neo4j, which can be used for very expressive queries. Continuing with the friendGraphFrame, we can very easily find all length two paths for which either end in the vertex "Chris" or pass through the edge "trusts" first by using one concise command:

friendGraphFrame.find("(v1)-[e1]->(v2); (v2)-[e2]->(v3)").filter(
  "e1.attr = 'trusts' OR v3.attr = 'Chris'"
).collect.foreach(println)

Note how we can specify the graph structure in a manner that lets you think in terms of the actual graph, that is, we have two edges e1 and e2, that are connected to each other by a common vertex v2. The result of this operation is listed in the following screenshot, which indeed gives back the three paths that suffice the preceding condition:

Unfortunately, we can not discuss GraphFrames here in more detail, but the interested reader is referred to the documentation available at https://graphframes.github.io/ for more details. Instead, we will now turn to the algorithms available in GraphX and apply them to a massive graph of actor data.

Table of Contents for GraphFrames

Create new playlist

Sign In

Sign Up

Table of Contents for
GraphFrames