Introduction

In this chapter, we will explore the current fundamental data structure: DataFrames. DataFrames take advantage of the developments in Project Tungsten and the Catalyst Optimizer. These two improvements bring the performance of PySpark on par with that of either Scala or Java.

Project Tungsten is a set of improvements to the Spark engine aimed at bringing its execution process closer to the bare metal. The main deliverables include:

  • Code generation at runtime: Leverages the optimizations implemented in modern compilers
  • Taking advantage of the memory hierarchy: The algorithms and data structures exploit the memory hierarchy for fast execution
  • Direct-memory management: Removes the overhead associated with Java garbage collection and JVM object creation and management
  • Low-level programming: Speeds up memory access by loading immediate data into CPU registers
  • Elimination of virtual function dispatches: Removes the need for multiple CPU calls
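To make the first and last items more concrete, here is a minimal sketch in plain Python. It is a toy, not Spark code, and it does not reflect how Tungsten is actually implemented; the classes `Col`, `Lit`, `Add`, `Mul`, and the helper `compile_expr` are hypothetical names invented for this illustration. It contrasts evaluating an expression by walking a tree (one method dispatch per node, per row) with generating the source for the whole expression once and compiling it into a single function.

```python
# Toy illustration only: hypothetical classes, not part of Spark or Tungsten.
from dataclasses import dataclass


@dataclass
class Col:
    name: str

    def eval(self, row):
        return row[self.name]

    def to_source(self):
        return f"row[{self.name!r}]"


@dataclass
class Lit:
    value: float

    def eval(self, row):
        return self.value

    def to_source(self):
        return repr(self.value)


@dataclass
class Add:
    left: object
    right: object

    def eval(self, row):
        # Interpreted path: one dynamic method call per child, per row.
        return self.left.eval(row) + self.right.eval(row)

    def to_source(self):
        return f"({self.left.to_source()} + {self.right.to_source()})"


@dataclass
class Mul:
    left: object
    right: object

    def eval(self, row):
        return self.left.eval(row) * self.right.eval(row)

    def to_source(self):
        return f"({self.left.to_source()} * {self.right.to_source()})"


def compile_expr(expr):
    """Generate source for the whole expression once and compile it."""
    source = f"lambda row: {expr.to_source()}"
    return eval(source), source


expr = Add(Mul(Col("price"), Col("qty")), Lit(5))
rows = [{"price": 2, "qty": 3}, {"price": 10, "qty": 1}]

# Tree-walking evaluation: dispatches through eval() at every node.
interpreted = [expr.eval(r) for r in rows]

# Generated code: the tree is flattened into one function with no node dispatch.
compiled_fn, generated_source = compile_expr(expr)
compiled = [compiled_fn(r) for r in rows]

print(generated_source)
print(interpreted, compiled)
```

Both paths produce the same values; the difference is that the generated function performs the arithmetic directly instead of calling a method on every node of the tree for every row.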

The Catalyst Optimizer sits at the core of Spark SQL and powers both the SQL queries and the DataFrame operations executed against the data. The process starts with a query being issued to the engine. The logical plan of execution is optimized first. Based on the optimized logical plan, multiple physical plans are derived and pushed through a cost optimizer. The most cost-efficient plan is then selected and translated (using the code generation optimizations implemented as part of Project Tungsten) into optimized RDD-based execution code.
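The stages described above can be sketched with a small, self-contained Python model. This is a toy under stated assumptions, not Catalyst's API: the plan classes (`Scan`, `Filter`, `Join`), the single pushdown rule, the strategy names, and the cost formulas are all invented for illustration, and the filter is assumed to reference only columns from the join's left input.

```python
# Toy illustration only: hypothetical names and made-up cost formulas,
# not Spark's Catalyst implementation.
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str
    rows: int


@dataclass(frozen=True)
class Filter:
    child: object
    selectivity: float  # fraction of rows that pass the filter


@dataclass(frozen=True)
class Join:
    left: object
    right: object


def optimize_logical(plan):
    """Rule-based step: push a Filter sitting above a Join down to the left input.

    Assumes the filter only references columns of the left input.
    """
    if isinstance(plan, Filter) and isinstance(plan.child, Join):
        join = plan.child
        return Join(Filter(join.left, plan.selectivity), join.right)
    return plan


def estimated_rows(plan):
    """Rough row-count estimate for a Scan or a Filter over a Scan."""
    if isinstance(plan, Scan):
        return plan.rows
    if isinstance(plan, Filter):
        return estimated_rows(plan.child) * plan.selectivity
    raise TypeError(f"no estimate for {type(plan).__name__}")


def physical_candidates(join, workers=8):
    """Derive candidate physical strategies for a join, each with an invented cost."""
    left = estimated_rows(join.left)
    right = estimated_rows(join.right)
    return {
        # Pretend both sides are shuffled and sorted.
        "sort_merge_join": 3 * (left + right),
        # Pretend the smaller side is copied to every worker.
        "broadcast_hash_join": max(left, right) + workers * min(left, right),
    }


# 1. A query arrives as an unoptimized logical plan.
logical = Filter(Join(Scan("orders", 1_000_000), Scan("customers", 1_000)), 0.1)

# 2. The logical plan is optimized first.
optimized = optimize_logical(logical)

# 3. Several physical plans are derived and costed.
costs = physical_candidates(optimized)

# 4. The cheapest plan is selected.
best = min(costs, key=costs.get)

print(optimized)
print(costs)
print(best)
```

In PySpark itself, calling `explain(True)` on a DataFrame prints the logical and physical plans that Spark produced for that query, which is a practical way to observe the real counterpart of these stages.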
