Introduction

Resilient Distributed Datasets (RDDs) are collections of immutable JVM objects that are distributed across an Apache Spark cluster. Please note that if you are new to Apache Spark, you may want to initially skip this chapter as Spark DataFrames/Datasets are both significantly easier to develop and typically have faster performance. More information on Spark DataFrames can be found in the next chapter.

An RDD is the most fundamental dataset type of Apache Spark; any action on a Spark DataFrame eventually gets translated into a highly optimized execution of transformations and actions on RDDs (see the paragraph on catalyst optimizer in Chapter 3, Abstracting Data with DataFrames, in the Introduction section).

Data in an RDD is split into chunks based on a key and then dispersed across all the executor nodes. RDDs are highly resilient, that is, there are able to recover quickly from any issues as the same data chunks are replicated across multiple executor nodes. Thus, even if one executor fails, another will still process the data. This allows you to perform your functional calculations against your dataset very quickly by harnessing the power of multiple nodes. RDDs keep a log of all the execution steps applied to each chunk. This, on top of the data replication, speeds up the computations and, if anything goes wrong, RDDs can still recover the portion of the data lost due to an executor error.

While it is common to lose a node in distributed environments (for example, due to connectivity issues, hardware problems), distribution and replication of the data defends against data loss, while data lineage allows the system to recover quickly.

Table of Contents for Introduction

Create new playlist

Sign In

Sign Up

Table of Contents for
Introduction