GraphFrames is an abstraction built on DataFrames for performing graph analytics. Graph analytics stems from the mathematical field of Graph Theory, which represents relationships between entities as graphs that we can then analyze. You use Graph Theory in everyday life whenever you use Google: Google introduced the PageRank algorithm, which is based on Graph Theory and tries to identify the most influential web pages that best match your search.
Graph Theory is used across the sciences, and computer science relies on it to solve many problems. Its applications include social networks, travel routing, chip design, and many other fields. In fact, every time you run a Spark job, you are using Graph Theory: Spark records the operations that produce an RDD as a directed acyclic graph (DAG) and uses that graph to plan how your job is executed.
The diagram in Figure 9-1 represents a small family of four members: the husband (Andrew), the wife (Sierra), a son (Bob), and a daughter (Emily). People are represented as nodes, which are called vertices. Each person is connected to every other person, and the relationships are represented by the edges that connect the vertices. Because each relationship has a direction (Andrew is Bob’s father, while Bob is Andrew’s son), a four-member family already has 4 × 3 = 12 relationships. Now imagine a social media app, such as LinkedIn, that needs to connect millions of people; the number of edges is enormous. Regular databases are poorly suited to analytics on this kind of data, because traversing relationships requires repeated self-joins, and many self-joins over large tables can overwhelm the database.
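The count of 12 follows from treating every relationship as directed: each of the four people relates to the three others. The following plain-Python sketch reproduces that arithmetic. It uses only the names from the figure, and it is an illustration rather than GraphFrames code.

```python
from itertools import permutations

# The four family members named in Figure 9-1.
family = ["Andrew", "Sierra", "Bob", "Emily"]

# Every ordered pair (src, dst) of distinct people is one directed relationship.
relationships = list(permutations(family, 2))

print(len(relationships))  # 4 people * 3 others each
```

The same reasoning gives n × (n − 1) possible directed edges for n vertices, which is why the edge count grows so quickly for large networks.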
Graph Theory solves complex issues like this. Spark provides GraphFrames to represent this graph data. In this chapter, we will learn how to create GraphFrames and apply some of the most used algorithms to solve complex problems.
Recipe 9-1. Create GraphFrames
Recipe 9-2. Apply triangle counting in a GraphFrame
Recipe 9-3. Apply the PageRank algorithm
Recipe 9-4. Apply the Breadth First algorithm
Recipe 9-1. Create GraphFrames
Problem
You need to create a GraphFrame from a given dataset.
Solution
We will use a small dataset of persons and the relationships between them, which is convenient for everyone to use and understand. Let’s load this dataset.
How It Works
Before starting with GraphFrames, you need to install GraphFrames on your machine.
If you no longer see an error message, GraphFrames is installed correctly and you can use GraphFrame successfully.
A DataFrame that represents vertices should contain a column named id. Here, personsDf contains a column named id.
A DataFrame that represents edges should contain columns named src and dst. Here, relationshipDf contains the columns src and dst.
The result is a GraphFrame that contains v and e, where v represents the vertices and e represents the edges.
Now that you have successfully created a GraphFrame, it is important to understand degrees.
The degree of a vertex is the number of edges connected to it. GraphFrame supports inDegrees and outDegrees: inDegrees gives the number of incoming edges to a vertex, and outDegrees gives the number of outgoing edges from a vertex. It is very important to understand this distinction. Let’s see the output for a given example.
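To make the difference between incoming and outgoing edges concrete, here is a minimal plain-Python sketch that counts both from a list of (src, dst) pairs. The edge list is hypothetical and is not the chapter's dataset, and the code illustrates the idea rather than the GraphFrames API.

```python
from collections import Counter

# Hypothetical directed edges as (src, dst) pairs; not the chapter's dataset.
edges = [
    ("Andrew", "Sierra"),
    ("Sierra", "Andrew"),
    ("Andrew", "Bob"),
    ("Andrew", "Emily"),
    ("Sierra", "Bob"),
]

# Out-degree: how many edges start at a vertex (count by src).
out_degrees = Counter(src for src, _ in edges)

# In-degree: how many edges end at a vertex (count by dst).
in_degrees = Counter(dst for _, dst in edges)

print(out_degrees["Andrew"], in_degrees["Andrew"])
```

A vertex that never appears as src has no entry in out_degrees (a Counter reports 0 for it), so its out-degree is zero; the same holds for dst and in_degrees.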
Here you are going to find all the edges connected to Andrew.
With this, you have successfully created a GraphFrame from the vertices and edges dataset.
GraphFrames provides several built-in graph algorithms, exposed as easy-to-use APIs. Once you have prepared a GraphFrame, you can run any of the following algorithms with a single method call.
Connected components
Label propagation
PageRank
SVD++
Shortest Path
Strongly connected components
Triangle count
Recipe 9-2. Apply Triangle Counting in a GraphFrame
Problem
You need to find the triangle count value for each vertex.
Solution
GraphFrames provides an easy-to-use triangleCount API. When called on a GraphFrame, it returns a DataFrame of the vertices with an added count column, which gives the number of triangles each vertex participates in. A triangle is a set of three vertices in which each vertex is connected to the other two. Now we will see how to get the triangle count for each vertex. This is very helpful with route-finding problems and plays an important role in the PageRank algorithm.
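As a rough illustration of what a per-vertex triangle count measures, the following plain-Python sketch counts, for each vertex, the pairs of its neighbors that are also connected to each other. It assumes that edge direction is ignored when looking for triangles, and the vertex names and edges are hypothetical. It is a sketch of the idea, not the GraphFrames implementation.

```python
from itertools import combinations

# Hypothetical directed edges; direction is ignored when looking for triangles.
edges = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "a"), ("c", "d")]

# Build an undirected neighbor set per vertex, skipping self-loops.
neighbors = {}
for src, dst in edges:
    if src != dst:
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)

# A vertex takes part in one triangle for every pair of its neighbors
# that are themselves connected.
triangle_count = {vertex: 0 for vertex in neighbors}
for vertex, nbrs in neighbors.items():
    for u, w in combinations(sorted(nbrs), 2):
        if w in neighbors[u]:
            triangle_count[vertex] += 1

print(triangle_count)
```

Here a, b, and c form one triangle, while d hangs off c and belongs to none.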
How It Works
A new column, count, is added to the output and represents the triangle count. The output shows that Andrew and Sierra have the maximum triangle counts, since each is involved in three kinds of relationships: Andrew as father, friend, and husband, and Sierra as mother, friend, and wife. With this, you have successfully created a GraphFrame and applied analytics to it.
You can also register the DataFrame returned by triangleCount as a table and query it. Let’s use PySparkSQL to identify all the people with the maximum triangle count.
Using the following code, you can find the people in the family with the highest triangle counts.
After that, you can join it with the Persons DataFrame to be able to view the person’s details.
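The two steps just described (select the rows with the maximum count, then join them to the persons data on id) can be sketched in plain Python as follows. The rows, ids, and names are hypothetical placeholders, and list comprehensions stand in for the SQL query and the DataFrame join.

```python
# Hypothetical triangle-count rows and person rows; not the chapter's data.
triangle_counts = [
    {"id": 1, "count": 3},
    {"id": 2, "count": 3},
    {"id": 3, "count": 1},
]
persons = [
    {"id": 1, "name": "P1"},
    {"id": 2, "name": "P2"},
    {"id": 3, "name": "P3"},
]

# Step 1: find the maximum count and keep the ids that reach it.
max_count = max(row["count"] for row in triangle_counts)
top_ids = {row["id"] for row in triangle_counts if row["count"] == max_count}

# Step 2: "join" with persons on id to see each person's details.
top_persons = [person for person in persons if person["id"] in top_ids]

print(top_persons)
```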
Recipe 9-3. Apply the PageRank Algorithm
Problem
You need to apply the PageRank algorithm to find the most influential person in this family.
Solution
The PageRank algorithm was the foundation of Google’s search engine in its early days. Google’s founders developed it to identify the most important pages on the Internet. It rests on two ideas: the most important pages are the ones that other pages link to most often, and a link from a highly ranked page counts for more than a link from a low-ranked one. Thus, Google uses linked web pages represented in graph form to identify important pages for us.
The PageRank algorithm measures the importance of each vertex in a graph. Assume a scenario where one Twitter user has 10 important followers, and each of those followers has multiple followers in turn. That Twitter user gets a higher ranking compared to a Twitter user with 50 “normal” followers. This is to say that the PageRank algorithm considers each important follower a legitimate endorsement of the Twitter user and thereby gives a higher ranking to the user.
How It Works
resetProbability: The probability (alpha) of resetting to a random vertex at each step.
maxIter: The number of PageRank iterations to run.
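To show how these two parameters enter the computation, here is a minimal plain-Python sketch of a textbook PageRank iteration. The function name page_rank, its argument names, and the three-vertex graph are all hypothetical, and the scores are normalized to sum to 1, which may differ from how GraphFrames scales its output. It is a sketch of the general technique, not the GraphFrames implementation.

```python
def page_rank(edges, reset_probability=0.15, max_iter=20):
    """Textbook PageRank by repeated rank redistribution (illustrative only)."""
    vertices = sorted({vertex for edge in edges for vertex in edge})
    out_links = {vertex: [] for vertex in vertices}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(vertices)
    ranks = {vertex: 1.0 / n for vertex in vertices}
    for _ in range(max_iter):
        # Every vertex receives the "reset" share regardless of its links.
        new_ranks = {vertex: reset_probability / n for vertex in vertices}
        for src in vertices:
            # A vertex with no outgoing edges spreads its rank over all vertices.
            targets = out_links[src] or vertices
            share = (1.0 - reset_probability) * ranks[src] / len(targets)
            for dst in targets:
                new_ranks[dst] += share
        ranks = new_ranks
    return ranks


# Hypothetical graph: a and b both point to c, and c points back to a.
edges = [("a", "c"), ("b", "c"), ("c", "a")]
ranks = page_rank(edges, reset_probability=0.15, max_iter=50)
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this graph c collects rank from two vertices and ends up ranked highest, while b, which nothing links to, keeps only the reset share. A larger reset probability pulls all scores toward each other, and more iterations bring the scores closer to their steady-state values.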
Compared with the original persons schema, the vertices now have a new column called pagerank. Spark adds this column, which holds the PageRank score for each vertex.
As you can see from the schema, a new column called weight has been added to the original relationship schema. It holds the edge weight that contributed to the PageRank score.
Let’s look at the PageRank score for each vertex and the weight for each of the edges.
We are going to order the PageRank in descending order so that we can see the most connected person in the family based on the links with the other family members.
You can see from this output that Andrew is the most connected person.
Now that you understand the PageRank algorithm and have applied it to GraphFrames, let’s move on to the Breadth First algorithm.
Recipe 9-4. Apply the Breadth First Algorithm
Problem
You need to apply the Breadth First algorithm to find the shortest way to connect to a person.
Solution
You might have noticed LinkedIn telling you how far you are from a new user. For example, a user you would like to connect to may be shown as a second or third connection. A second connection is two edges away from you in the graph, and a third connection is three edges away. This is one way of identifying how far one vertex is from another.
Similarly, when airlines need to identify the shortest path between cities, they look for the route with the fewest intermediate airports. Of course, there may be additional conditions, such as travel time or whether stops are required along the way.
Problems very similar to these exist in every industry. For example, chip-designing companies need to identify the shortest circuitry path; telecom companies need to find the shortest path between routers, and so on.
Breadth First search is a path-finding algorithm that helps us identify the shortest path, measured in number of edges, between two vertices.
We are going to apply this algorithm to the persons dataset to find the shortest path for Bob to connect to William. You can see that only Andrew is connected to William. So, for Bob to be able to connect to William, what is the shortest path? This is what we are going to determine using the following code.
How It Works
fromExpr: Expression to identify the from vertex.
toExpr: Expression to identify the to vertex.
Notice that this code calls the bfs method with two inputs, fromExpr and toExpr, set to the filters "name = 'Bob'" and "name = 'William'". In other words, we are looking for the shortest path between Bob and William.
From the previous output, you can infer that for Bob to connect to William, he needs to go through Andrew.
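The search itself can be sketched in plain Python. The function bfs_path below is a hypothetical helper, not the GraphFrames bfs method, and the edge list is an assumed fragment in which only Andrew is connected to William, as described above. It explores vertices level by level, so the first path that reaches the goal has the fewest edges.

```python
from collections import deque


def bfs_path(edges, start, goal):
    """Return the shortest path from start to goal as a list, or None."""
    neighbors = {}
    for src, dst in edges:
        neighbors.setdefault(src, []).append(dst)

    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if vertex == goal:
            # Walk back through the recorded predecessors to rebuild the path.
            path = []
            while vertex is not None:
                path.append(vertex)
                vertex = previous[vertex]
            return path[::-1]
        for nxt in neighbors.get(vertex, []):
            if nxt not in previous:
                previous[nxt] = vertex
                queue.append(nxt)
    return None


# Hypothetical directed edges; only Andrew links to William.
edges = [
    ("Bob", "Andrew"),
    ("Bob", "Sierra"),
    ("Sierra", "Andrew"),
    ("Andrew", "William"),
]
print(bfs_path(edges, "Bob", "William"))
```

With these edges, the path through Andrew is found before the longer route through Sierra and Andrew, because vertices one edge away from Bob are all visited before any vertex two edges away.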
In the previous code snippet, we modified the expressions so that we are looking for all people younger than 20 to find ways to connect to Rachel.
Notice that Bob and Emily are both listed in the output. Since Rachel is a friend of both Andrew and Sierra, the parents are the intermediate vertices between the kids and Rachel: Bob and Emily each need to go through either Andrew or Sierra to be able to connect to Rachel.
If you want to restrict some of the paths—let’s say you want the kids to only go through the parents—then you can use the edgeFilter to determine through which relationships Bob and Emily can connect to Rachel.
The following code shows the usage of the edgeFilter attribute. We are going to say that only the daughter is allowed in the results.
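The effect of an edge filter can be illustrated with a plain-Python breadth-first search that only follows edges accepted by a predicate. Everything here is an assumption for illustration: the helper bfs_path_filtered, the (src, dst, relation) edge layout, and the relation labels are hypothetical and are not taken from the chapter's dataset or from the GraphFrames API.

```python
from collections import deque


def bfs_path_filtered(edges, start, goal, edge_ok):
    """Shortest path that uses only edges whose relation passes edge_ok."""
    neighbors = {}
    for src, dst, relation in edges:
        if edge_ok(relation):  # edges rejected by the filter are never followed
            neighbors.setdefault(src, []).append(dst)

    previous = {start: None}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if vertex == goal:
            path = []
            while vertex is not None:
                path.append(vertex)
                vertex = previous[vertex]
            return path[::-1]
        for nxt in neighbors.get(vertex, []):
            if nxt not in previous:
                previous[nxt] = vertex
                queue.append(nxt)
    return None


# Hypothetical edges with a relation label on each one.
edges = [
    ("Bob", "Andrew", "Son"),
    ("Emily", "Sierra", "Daughter"),
    ("Andrew", "Rachel", "Friend"),
    ("Sierra", "Rachel", "Friend"),
]


def not_son(relation):
    # Hypothetical filter: refuse to traverse edges labeled "Son".
    return relation != "Son"


print(bfs_path_filtered(edges, "Emily", "Rachel", not_son))
print(bfs_path_filtered(edges, "Bob", "Rachel", not_son))
```

With this filter, Emily still reaches Rachel through Sierra, while Bob has no allowed path because his only outgoing edge is excluded.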
With this, you have successfully learned about the Breadth First search algorithm and used it for analysis.