Directed acyclic graphs

The next notion we want to discuss is that of acyclicity. A cyclic graph is one in which at least one vertex is connected to itself by a path through the graph. We call such a path a cycle. In an undirected graph, any chain returning to its starting vertex will do, while in a directed graph we only speak of a cycle if the starting vertex can be reached by following the directed edges. For example, consider some of the graphs we have seen before. In Figure 2, there is precisely one cycle, formed by {e2, e4, e5}, while in its undirected version, shown in Figure 1, there are precisely two cycles, namely {e2, e4, e5} and {e1, e2, e3}.
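The difference between directed and undirected cycles can be made concrete with a short sketch. The following depth-first search detects directed cycles by looking for an edge back to a vertex that is still on the current search path; the example graphs are hypothetical and not the ones from Figures 1 and 2:

```python
def has_directed_cycle(adj):
    """Return True if the directed graph (an adjacency dict) contains a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on current DFS path / finished
    color = {v: WHITE for v in adj}

    def dfs(v):
        color[v] = GRAY
        for w in adj.get(v, []):
            if color[w] == GRAY:  # back edge: w is still on the current path
                return True
            if color[w] == WHITE and dfs(w):
                return True
        color[v] = BLACK
        return False

    return any(color[v] == WHITE and dfs(v) for v in adj)

cyclic = {"a": ["b"], "b": ["c"], "c": ["a"]}     # a -> b -> c -> a
acyclic = {"a": ["b", "c"], "b": ["c"], "c": []}  # same edges, reoriented
print(has_directed_cycle(cyclic))   # True
print(has_directed_cycle(acyclic))  # False
```

Note that the second graph uses the same vertex pairs as the first; only the edge directions differ, which is exactly why a directed graph can be acyclic while its undirected version is not.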

There are a few special cases of cyclic graphs worth mentioning here. Firstly, if a vertex is connected to itself by a single edge, we say the graph has a loop. Secondly, a directed graph that does not contain any two-loops, that is, without pairs of vertices joined by edges in both directions, is called an oriented graph. Thirdly, a graph containing three-loops is said to contain triangles. The notion of triangles is an important one, as it is often used to assess the connectivity of a graph, which we will discuss later on. The following diagram shows an artificial example with the different types of loops:

Figure 4: A toy graph illustrating loops or self-edges, two-loops and triangles.
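These loop types are easy to detect from an edge list. The sketch below uses a small hypothetical graph (not the one from Figure 4) containing a self-edge, a two-loop, and a directed three-loop:

```python
# Hypothetical directed edge list with one of each loop type:
edges = [
    ("a", "a"),                            # a loop (self-edge)
    ("b", "c"), ("c", "b"),                # a two-loop: edges in both directions
    ("d", "e"), ("e", "f"), ("f", "d"),    # a directed three-loop (triangle)
]

# Self-edges are simply edges whose endpoints coincide.
self_loops = [(u, v) for (u, v) in edges if u == v]

# A two-loop exists wherever both (u, v) and (v, u) are present.
edge_set = set(edges)
two_loops = {frozenset((u, v)) for (u, v) in edge_set
             if u != v and (v, u) in edge_set}

print(self_loops)      # [('a', 'a')]
print(len(two_loops))  # 1 -> this graph is not an oriented graph
```

Since the graph contains a two-loop, it is not an oriented graph in the sense defined above.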

In general, studying n-loops in a graph for any natural number n can tell you a lot about its structure, but triangles are the most commonly used. As directed cycles are not only more expensive to compute but also rarer than their undirected counterparts, we will often look for undirected triangles only; that is, we'll forget the graph's directed structure.
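Forgetting the directed structure amounts to treating each edge as an unordered pair. The following sketch, on a hypothetical graph, counts undirected triangles by checking every vertex triple; this brute-force approach is cubic in the number of vertices and only meant to illustrate the idea:

```python
from itertools import combinations

# Hypothetical directed edge list; note (c, a) points "backwards".
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]

# Forget direction: store each edge as an unordered pair.
und = {frozenset(e) for e in edges}
vertices = {v for e in edges for v in e}

# A triple forms a triangle if all three of its pairs are edges.
triangles = [t for t in combinations(sorted(vertices), 3)
             if all(frozenset(p) in und for p in combinations(t, 2))]
print(triangles)  # [('a', 'b', 'c')]
```

Viewed as a directed graph, {a, b, c} happens to form a directed cycle as well, but the undirected check would also have found it if the edge (c, a) were reversed, which is precisely why undirected triangles are more common than directed ones.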

An important class of graphs found repeatedly in many applications is that of Directed Acyclic Graphs (DAGs). We already know what a DAG is from the last paragraph, namely a directed graph without cycles, but since DAGs are so ubiquitous, we should spend a little more time on them.

One instance of a DAG that we have implicitly used throughout all the chapters leading up to this one is Spark's job execution graph. Remember that any Spark job consists of stages executed in a certain order. Stages consist of tasks executed on each partition, some of which may be independent, while others depend on each other. We can thus interpret the execution of a Spark job as a directed graph with stages (or tasks) as vertices, in which an edge represents the output of one computation being required for the next. The prototypical example is that of a reduce stage that needs the output of a preceding map stage. Naturally, this execution graph cannot contain any cycles, as that would mean feeding the output of some operator back into the graph ad infinitum, preventing our program from ever halting. Thus, this execution graph can be represented, and is in fact implemented in the Spark scheduler, as a DAG:
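Acyclicity is exactly what lets a scheduler find a valid execution order at all. The following sketch uses Kahn's topological sort on a hypothetical stage-dependency map (the stage names are illustrative, not actual Spark internals); it repeatedly runs a stage whose dependencies are all satisfied:

```python
from collections import deque

# Hypothetical dependency map: stage -> stages whose output it needs.
deps = {
    "reduce": ["map_a", "map_b"],
    "map_a": ["read"],
    "map_b": ["read"],
    "read": [],
}

# Count unmet dependencies and record, for each stage, who depends on it.
indeg = {s: len(d) for s, d in deps.items()}
dependents = {s: [] for s in deps}
for s, ds in deps.items():
    for d in ds:
        dependents[d].append(s)

# Repeatedly schedule any stage with no remaining unmet dependencies.
order, ready = [], deque(s for s, n in indeg.items() if n == 0)
while ready:
    s = ready.popleft()
    order.append(s)
    for t in dependents[s]:
        indeg[t] -= 1
        if indeg[t] == 0:
            ready.append(t)

print(order)  # ['read', 'map_a', 'map_b', 'reduce']
if len(order) < len(deps):
    raise ValueError("dependency graph contains a cycle")
```

If the graph contained a cycle, some stage would never reach an in-degree of zero and the sort would terminate early, which is why the final length check doubles as a cycle detector.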

Figure 5: Visualizing a chain of operations carried out on RDDs with Spark. The execution graph is by definition a DAG.