In the last section of this chapter, we will discuss, very briefly, how Spark works internally. For a more detailed discussion, see the References section at the end of the chapter.
When you open a Spark context, either explicitly or by launching the Spark shell, Spark starts a web UI with details of how the current task and past tasks have executed. Let's see this in action for the example mutual information program we wrote in the last section. To prevent the context from shutting down when the program completes, you can insert a call to readLine as the last line of the main method (after the call to takeOrdered). This expects input from the user, and will therefore pause program execution until you press enter.
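As a sketch (the surrounding structure of the main method is assumed here, not reproduced from the previous section), the pause looks like this:

```scala
import scala.io.StdIn

object MutualInformationExample {
  def main(args: Array[String]): Unit = {
    // ... build the Spark context, run the mutual information
    // pipeline, and call takeOrdered as in the previous section ...

    // Block until the user presses enter, keeping the Spark context
    // (and hence the web UI on port 4040) alive for inspection.
    StdIn.readLine()
  }
}
```

Note that in current Scala versions, readLine lives in scala.io.StdIn; the bare Predef.readLine was deprecated.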
To access the UI, point your browser to 127.0.0.1:4040. If you have other instances of the Spark shell running, the port may be 4041, or 4042, and so on.
The first page of the UI tells us that our application contains three jobs. A job occurs as the result of an action. There are, indeed, three actions in our application: the first two are called within the wordFractionInFiles function:
val nMessages = messages.count()
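To make the two jobs concrete, here is a hedged sketch of what wordFractionInFiles might look like; the real function from the previous section may differ, and the variable names besides nMessages and messages are assumptions. The point is that count() is an action, so each of the two calls to this function launches a job:

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hedged sketch, not the book's exact implementation: for a directory
// of messages, return the fraction of messages containing each word,
// together with the total message count.
def wordFractionInFiles(sc: SparkContext)(path: String): (RDD[(String, Double)], Long) = {
  val messages = sc.wholeTextFiles(path).map { case (_, text) => text }
  // count() is an action: it triggers a job as soon as the function runs.
  val nMessages = messages.count()
  // Count each word at most once per message, then divide by the
  // number of messages to get a fraction.
  val fractions = messages
    .flatMap(_.split("\\s+").distinct)
    .map(word => (word, 1))
    .reduceByKey(_ + _)
    .mapValues(_.toDouble / nMessages)
  (fractions, nMessages)
}

// Tiny demonstration on a temporary directory with two one-line "messages".
val dir = Files.createTempDirectory("messages")
Files.write(dir.resolve("msg1.txt"), "spark is fast".getBytes)
Files.write(dir.resolve("msg2.txt"), "spark is lazy".getBytes)

val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[*]"))
val (fractions, nMessages) = wordFractionInFiles(sc)(dir.toString)
val fractionMap = fractions.collect().toMap
sc.stop()
```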
The last job results from the call to takeOrdered, which forces the execution of the entire pipeline of RDD transformations that calculate the mutual information.
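This laziness is easy to demonstrate in isolation. In the following self-contained example (not the mutual information pipeline), the mapValues call produces no job; only the takeOrdered action at the end triggers execution, and it is that job that appears in the web UI:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("takeOrderedDemo").setMaster("local[*]")
val sc = new SparkContext(conf)

val scores = sc.parallelize(Seq(("a", 3.0), ("b", 1.0), ("c", 2.0)))
val negated = scores.mapValues(v => -v) // lazy: no job is launched here

// takeOrdered(n)(ordering) is an action: it runs the whole pipeline
// and returns the n smallest elements under the given ordering.
val top2 = negated.takeOrdered(2)(Ordering.by { case (_, v) => v })

sc.stop()
```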
The web UI lets us delve deeper into each job. Click on the takeOrdered job in the job table. You will be taken to a page that describes the job in more detail:
Of particular interest is the DAG visualization entry. This is a graph of the execution plan to fulfill the action, and provides a glimpse of the inner workings of Spark.
When you define a job by calling an action on an RDD, Spark looks at the RDD's lineage and constructs a graph mapping the dependencies: each RDD in the lineage is represented by a node, with directed edges going from this RDD's parent to itself. This type of graph is called a directed acyclic graph (DAG), and is a data structure useful for dependency resolution. Let's explore the DAG for the takeOrdered job in our program using the web UI. The graph is quite complex, and it is therefore easy to get lost, so here is a simplified reproduction that only lists the RDDs bound to variable names in the program.
As you can see, at the bottom of the graph, we have the mutualInformation RDD. This is the RDD that we need to construct for our action. This RDD depends on the intermediate elements in the sum, igFragment1, igFragment2, and so on. We can work our way back through the list of dependencies until we reach the other end of the graph: RDDs that do not depend on other RDDs, only on external sources.
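You can also inspect an RDD's lineage directly from code with toDebugString, which prints the same dependency information that the DAG visualization renders. A small self-contained example (again, not the mutual information pipeline):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("lineageDemo").setMaster("local[*]")
val sc = new SparkContext(conf)

val words = sc.parallelize(Seq("spark", "builds", "a", "dag"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

// toDebugString lists each RDD in the lineage, indented by stage:
// the ShuffledRDD produced by reduceByKey depends on the
// MapPartitionsRDD produced by map, which in turn depends on the
// ParallelCollectionRDD created by parallelize.
val lineage = counts.toDebugString
println(lineage)

sc.stop()
```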
Once the graph is built, the Spark engine formulates a plan to execute the job. The plan starts with the RDDs that only have external dependencies (such as RDDs built by loading files from disk or fetching from a database) or RDDs that already have cached data. Each arrow along the graph is translated to a set of tasks, with each task applying a transformation to a partition of the data.
Tasks are grouped into stages. A stage consists of a set of tasks that can all be performed without needing an intermediate shuffle.
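For example, in a classic word count (a hypothetical pipeline, not the one in this chapter), flatMap and map can run within a single stage, while reduceByKey introduces a shuffle and therefore a stage boundary:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("stageDemo").setMaster("local[*]")
val sc = new SparkContext(conf)

val lines = sc.parallelize(Seq("to be or not to be"))

// Stage 1: flatMap and map are narrow transformations -- each output
// partition depends on a single input partition, so no shuffle is needed
// and their tasks run back to back in the same stage.
val pairs = lines.flatMap(_.split(" ")).map(w => (w, 1))

// reduceByKey needs all values for a given key on the same partition,
// so Spark inserts a shuffle here; the tasks that run after the shuffle
// form a second stage, which you can see in the job's DAG visualization.
val counts = pairs.reduceByKey(_ + _).collect()

sc.stop()
```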