Resilient distributed datasets (RDD)

In this section, we will discuss the architecture, motivation, features, and other important concepts related to RDDs. We will also briefly cover the implementation methodology adopted by Spark and the various APIs/functions exposed by RDDs.

Frameworks such as Hadoop MapReduce are widely adopted for parallel and distributed data processing. There is no doubt that these frameworks introduced a new paradigm for distributed data processing, and a fault-tolerant one at that (no data is lost in the event of failures). However, these frameworks do have some limitations; for example, Hadoop is not well suited to problems that require iterative data processing, such as recursive functions or machine learning algorithms, because these use cases need the data to stay in memory across computations.

For all these scenarios, a new paradigm, RDD, was introduced. It retains the features of Hadoop-like systems, such as distributed processing and fault tolerance, but essentially keeps the data in memory, providing distributed in-memory data processing over a cluster of nodes.

RDD – by definition

RDD is a core component of the Spark framework. It is an independent concept that was developed at the University of California, Berkeley, and was first implemented in systems such as Spark to show its real usage and power.

RDD provides an in-memory representation of immutable datasets for parallel and distributed data processing. It is an abstraction layer, agnostic of the underlying data store, that provides the core functionality of in-memory data representation for storing and retrieving data objects. RDDs are further extended to capture various kinds of data structures, such as graphs, relational structures, or streaming data.
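
As a quick, minimal sketch (assuming a live SparkContext named sc, such as the one provided by the Spark shell), an RDD can be created from an existing collection and transformed in parallel:

    // Create an RDD from a local collection
    val lines = sc.parallelize(Seq("spark makes rdds", "rdds are immutable"))
    val words = lines.flatMap(_.split(" "))     // transformation: returns a new RDD
    val longWords = words.filter(_.length > 4)  // transformations are lazy
    longWords.collect()                         // action: triggers the actual computation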

Let's move on and discuss the important features of RDDs.

Fault tolerance

Fault tolerance is not a new concept; it has already been implemented in various distributed processing systems, such as Hadoop and key/value stores. These systems leverage data replication to achieve fault tolerance: they replicate and maintain multiple copies of the same dataset over the various nodes in the cluster, or they maintain a log of the updates applied to the original dataset and replay it on the other machines/nodes. This kind of architecture works well for disk-based systems, but the same mechanism is inefficient for data-intensive workloads or memory-based systems because, first, it requires copying large amounts of data over the cluster network, whose bandwidth is far lower than that of RAM, and second, it incurs substantial storage overheads.

The objective of RDDs was to solve these existing challenges and provide an efficient fault-tolerance mechanism for datasets that are already loaded and processed in memory.

RDDs introduced a new approach to fault tolerance: a coarse-grained interface based on transformations. Instead of replicating data or keeping logs of updates, an RDD keeps track of the transformations (such as map, reduce, join, and so on) applied to a particular dataset; this record is also called the data lineage.

This allows an efficient fault-tolerance mechanism: in the event of the loss of any partition, the RDD has enough information to derive that partition again by applying the same set of transformations to the original data. Moreover, this recomputation is parallel and spread across multiple nodes, so it is fast and efficient as well, in contrast to the costly replication used by other distributed data processing frameworks.
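
As a small illustration of lineage (again assuming a SparkContext named sc), the toDebugString method prints the chain of transformations from which an RDD was derived, which is exactly the information Spark uses to recompute lost partitions:

    val base = sc.parallelize(1 to 100, numSlices = 4)
    val derived = base.map(_ * 2).filter(_ % 3 == 0)
    // Prints the recorded lineage: the filter and map steps back to the parallelized collection
    println(derived.toDebugString)
    // If a partition of `derived` is lost, Spark reapplies map and filter
    // to the corresponding partition of `base` instead of restoring a replica.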

Storage

The architecture/design of RDDs allows data to be distributed and partitioned over a cluster of nodes. RDDs are kept in the cluster's memory, but they also provide operations that can be used to store an RDD on disk or in an external system. Spark and its core packages, by default, provide the API to work with data residing in local filesystems and HDFS. Other vendors and open source communities provide the appropriate packages and APIs for storing RDDs in external storage systems such as MongoDB, DataStax, Elasticsearch, and so on.

The following are a few of the functions that can be used to store RDDs:
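
For instance (a sketch only, assuming a SparkContext named sc and write access to the given paths, which are placeholders), the built-in save operations look like this:

    val words = sc.parallelize(Seq("spark", "rdds", "storage"))
    // Save as plain text files, one part file per partition; the path is a placeholder
    words.saveAsTextFile("hdfs://namenode:8020/output/words-text")
    // Save as serialized objects, which can be reloaded later with sc.objectFile(...)
    words.saveAsObjectFile("hdfs://namenode:8020/output/words-object")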

Persistence

Persistence of an RDD is also called caching of the RDD. It can be done simply by invoking <RDD>.persist(StorageLevel) or <RDD>.cache(). By default, an RDD is persisted in memory (the default for cache()), but persistence on disk or in other external systems is also available; the level is defined by the StorageLevel parameter of the persist(…) function (https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.storage.StorageLevel$).

The StorageLevel class is annotated with @DeveloperApi, so it can be extended to provide a custom implementation of persistence.

Caching or persistence is a key tool for iterative algorithms and fast interactive use. Whenever persist(…) is invoked on an RDD, each node stores the partitions of that RDD that it computes in memory and reuses them in other actions on that dataset (or on datasets derived from it). This, in turn, makes future actions much faster.
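
As a minimal sketch (assuming a SparkContext named sc; the input path is a placeholder), the following caches an RDD so that the second action reuses the materialized partitions instead of recomputing them:

    import org.apache.spark.storage.StorageLevel

    val logs = sc.textFile("hdfs://namenode:8020/input/app.log") // placeholder path
    val errors = logs.filter(_.contains("ERROR"))

    // cache() is equivalent to persist(StorageLevel.MEMORY_ONLY)
    errors.persist(StorageLevel.MEMORY_AND_DISK)
    errors.count() // first action: computes the RDD and materializes its partitions
    errors.count() // second action: reuses the persisted partitions, so it runs much faster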

Shuffling

Shuffling is another important concept in Spark. It redistributes data across the cluster so that it is grouped differently across partitions. It is a costly operation, as it involves the following activities:

  • Copying data across executors and nodes
  • Creating new partitions
  • Redistributing the newly created partitions across the cluster

There are certain transformation operations defined in org.apache.spark.rdd.RDD and org.apache.spark.rdd.PairRDDFunctions that initiate the shuffling process. These operations, illustrated in the sketch after this list, include the following:

  • RDD.repartition(…): This repartitions the existing dataset across the cluster of nodes
  • RDD.coalesce(…): This repartitions the existing dataset into a given, smaller number of partitions
  • All the operations that end with ByKey (except counting operations), such as PairRDDFunctions.reduceByKey() or PairRDDFunctions.groupByKey()
  • All the join operations, such as PairRDDFunctions.join(…) or PairRDDFunctions.cogroup(…)
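
As a short sketch (assuming a SparkContext named sc), each of the following operations triggers a shuffle:

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val other = sc.parallelize(Seq(("a", "x"), ("b", "y")))

    pairs.reduceByKey(_ + _) // ByKey operation: moves values with the same key to one partition
    pairs.groupByKey()       // ByKey operation: shuffles all values for each key
    pairs.repartition(8)     // redistributes the dataset into 8 new partitions
    pairs.join(other)        // join: co-locates matching keys across both datasets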

The shuffle is an expensive operation, since it involves disk I/O, data serialization, and network I/O, but there are certain configuration parameters that can help in tuning and performance optimization. Refer to https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior for a complete list of the parameters that can be used to optimize shuffle operations.
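
For example (a sketch only; these are standard shuffle-related properties from the configuration page above, but the right values are workload-dependent), a few of these parameters can be set on the SparkConf when the application is created:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("shuffle-tuning-sketch")
      .set("spark.shuffle.compress", "true")       // compress map output files
      .set("spark.shuffle.spill.compress", "true") // compress data spilled during shuffles
      .set("spark.shuffle.file.buffer", "64k")     // per-stream buffer size for shuffle files
    val sc = new SparkContext(conf)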

Note

Refer to https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf for more information on RDDs.
