How it works...

To better understand the performance of the previous RDD and DataFrame, let's return to the Spark UI. For starters, when we run the flights RDD query, three separate jobs are executed, as can be seen in Databricks Community Edition in the following screenshot:

Each of these jobs spawns its own set of stages to read the text (or CSV) file and then execute the reduceByKey() and sortByKey() functions:

Two additional jobs are required to complete the sortByKey() execution:

As can be observed, using RDDs directly can introduce significant overhead, generating multiple jobs and stages to complete a single query.
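For reference, the RDD query described above follows the pattern sketched below. This is a minimal sketch for illustration only; the dataset path, header handling, and column positions are assumptions rather than the exact code from the earlier recipe:

```python
# Minimal sketch of an RDD-style flights query (assumed path and column layout).
flights_rdd = (
    sc.textFile("/databricks-datasets/flights/departuredelays.csv")  # hypothetical path
      .map(lambda line: line.split(","))
      .filter(lambda cols: cols[0] != "date")        # drop the header row (assumed layout)
      .map(lambda cols: (cols[3], int(cols[1])))     # (destination, delay) -- assumed columns
      .reduceByKey(lambda a, b: a + b)               # total delay per destination
      .sortByKey()                                   # order the results by destination
)

flights_rdd.take(5)
```

Each wide transformation here (reduceByKey() and sortByKey()) forces a shuffle, which is why the Spark UI shows multiple jobs and stages for what is logically a single query.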

In the case of Spark DataFrames, the same query is much simpler, consisting of a single job with two stages. Note that the Spark UI shows a number of DataFrame-specific steps, such as WholeStageCodegen and Exchange, that significantly improve the performance of Spark Dataset and DataFrame queries. More information about the Spark SQL engine's Catalyst optimizer can be found in the next chapter:
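A minimal sketch of the equivalent DataFrame query is shown below; the column names (destination, delay) and the file path are assumptions about the CSV schema, not the book's exact code. Because the query is expressed declaratively, Catalyst can plan it as a single job, and WholeStageCodegen fuses the scan, aggregation, and sort into tight generated code:

```python
# Minimal sketch of the equivalent DataFrame query (assumed path and column names).
from pyspark.sql import functions as F

flights_df = (
    spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("/databricks-datasets/flights/departuredelays.csv")  # hypothetical path
)

(flights_df
    .groupBy("destination")                     # assumed column name
    .agg(F.sum("delay").alias("total_delay"))   # assumed column name
    .orderBy("destination")
    .show(5))
```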
