Introduction

In this chapter, we will explore the current fundamental data structure: DataFrames. DataFrames take advantage of the developments in Project Tungsten and the Catalyst Optimizer. These two improvements bring the performance of PySpark on par with that of either Scala or Java.

Project Tungsten is a set of improvements to the Spark engine aimed at bringing its execution process closer to the bare metal. The main deliverables include:

Code generation at runtime: Leverages the optimizations implemented in modern compilers
Taking advantage of the memory hierarchy: Algorithms and data structures exploit the memory hierarchy for fast execution
Direct-memory management: Removes the overhead associated with Java garbage collection and JVM object creation and management
Low-level programming: Speeds up memory access by loading intermediate data into CPU registers
Elimination of virtual function dispatches: Removes the need for multiple CPU calls
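
The idea behind runtime code generation and the elimination of per-node dispatch can be illustrated with a toy analogy in plain Python. This is only a sketch of the general technique, not how Tungsten is implemented (Tungsten generates code for the JVM); the expression tree format, the column names, and the helper functions below are all hypothetical and made up for this example.

```python
import operator

# Toy expression tree for: (price * quantity) + shipping
# Each node is ("col", name), ("lit", value), or (op, left, right).
EXPR = ("+", ("*", ("col", "price"), ("col", "quantity")), ("col", "shipping"))

OPS = {"+": operator.add, "*": operator.mul}


def interpret(node, row):
    """Walk the tree for every row: one dynamic dispatch per node."""
    kind = node[0]
    if kind == "col":
        return row[node[1]]
    if kind == "lit":
        return node[1]
    return OPS[kind](interpret(node[1], row), interpret(node[2], row))


def to_source(node):
    """Render the tree as a single Python expression string."""
    kind = node[0]
    if kind == "col":
        return f"row[{node[1]!r}]"
    if kind == "lit":
        return repr(node[1])
    return f"({to_source(node[1])} {kind} {to_source(node[2])})"


def compile_expr(node):
    """Generate and compile a specialized function once, at runtime."""
    source = f"lambda row: {to_source(node)}"
    return eval(compile(source, "<generated>", "eval"))


# Hypothetical rows, invented for illustration.
rows = [
    {"price": 2.5, "quantity": 4, "shipping": 1.0},
    {"price": 10.0, "quantity": 1, "shipping": 0.0},
]

generated = compile_expr(EXPR)  # compiled once, reused for every row
interpreted_results = [interpret(EXPR, r) for r in rows]
generated_results = [generated(r) for r in rows]
```

Both approaches compute the same values; the difference is that the interpreted version re-walks the tree and looks up an operator for every node on every row, whereas the generated version pays that cost once and then runs a single flat expression.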

Check this blog from Databricks for more information: https://www.databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html.

The Catalyst Optimizer sits at the core of Spark SQL and powers both SQL queries and DataFrame operations executed against the data. The process starts with a query being issued to the engine. The logical plan of execution is optimized first. Based on the optimized logical plan, multiple physical plans are derived and pushed through a cost optimizer. The most cost-efficient plan is then selected and translated (using the code generation optimizations implemented as part of Project Tungsten) into optimized RDD-based execution code.
