In this section, we will discuss the architecture, motivation, features, and other important concepts related to RDDs. We will also briefly cover the implementation methodology adopted by Spark and the various APIs/functions exposed by RDDs.
Frameworks such as Hadoop and MapReduce are widely adopted for parallel and distributed data processing. There is no doubt that these frameworks introduced a new paradigm for distributed data processing, and a fault-tolerant one at that (without losing a single byte). However, these frameworks do have some limitations; for example, Hadoop is not suited to problems that need iterative data processing, as in recursive functions or machine learning algorithms, because in these kinds of use cases the data needs to stay in memory between computations.
For all these scenarios, a new paradigm, the RDD, was introduced. It retains the features of Hadoop-like systems, such as distributed processing and fault tolerance, but essentially keeps data in memory, providing distributed in-memory data processing over a cluster of nodes.
RDD is a core component of the Spark framework. It is an independent concept, developed at the University of California, Berkeley, and first implemented in systems such as Spark to show its real usage and power.
RDD provides an in-memory representation of immutable datasets for parallel and distributed data processing. It is an abstraction layer, agnostic of the underlying data store, that provides the core functionality of in-memory data representation for storing and retrieving data objects. RDDs are further extended to capture various kinds of data structures, such as graphs, relational structures, or streaming data.
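As a quick illustration of this abstraction, here is a minimal sketch of creating and acting on an RDD. It assumes the Spark libraries are on the classpath and runs in local mode; the object name `RDDIntro` and the commented-out HDFS path are illustrative, not from the original text.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RDDIntro {
  def main(args: Array[String]): Unit = {
    // Local-mode context for experimentation
    val conf = new SparkConf().setAppName("rdd-intro").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // An RDD can be built from an in-memory collection...
    val numbers = sc.parallelize(1 to 10)

    // ...or from an external source, for example a text file:
    // val lines = sc.textFile("hdfs://...")

    // Transformations (map) are lazy; the action (sum) triggers the computation
    println(numbers.map(_ * 2).sum())

    sc.stop()
  }
}
```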
Let's move on and discuss the important features of RDDs.
Fault tolerance is not a new concept; it has already been implemented in various distributed processing systems, such as Hadoop, key/value stores, and so on. These systems achieve fault tolerance through data replication: they replicate and maintain multiple copies of the same dataset on various nodes in the cluster, or they maintain a log of the updates applied to the original datasets and instantly replay it on other machines/nodes. This kind of architecture works well for disk-based systems, but the same mechanism is inefficient for data-intensive workloads or memory-based systems, first because it requires copying large amounts of data over the cluster network, whose bandwidth is far lower than that of RAM, and second because it incurs substantial storage overheads.
The objective of RDDs was to address these challenges and provide an efficient fault-tolerance mechanism for datasets that are already loaded and processed in memory.
RDDs introduced a new approach to fault tolerance: a coarse-grained interface based on transformations. Instead of replicating data or keeping logs of updates, an RDD keeps track of the transformations (such as map, reduce, and join) applied to a particular dataset; this record is also called the data lineage.
This allows an efficient fault-tolerance mechanism: in the event of the loss of any partition, the RDD has enough information to derive that partition again by applying the same set of transformations to the original dataset. Moreover, this recomputation is parallel and runs on multiple nodes, so it is fast and efficient, in contrast to the costly replication used by other distributed data processing frameworks.
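The lineage described above can be observed directly. The following sketch (assuming an existing SparkContext named `sc`, as created earlier) builds a small chain of transformations; `toDebugString` prints the recorded lineage, which is exactly the information Spark would replay to rebuild a lost partition.

```scala
// Assumes an existing SparkContext `sc`.
// Transformations are lazy: nothing is computed here, only recorded.
val base   = sc.parallelize(1 to 1000000)   // original dataset
val evens  = base.filter(_ % 2 == 0)        // transformation 1
val scaled = evens.map(_ * 10)              // transformation 2

// toDebugString shows the lineage of `scaled`; if one of its partitions
// is lost, Spark reapplies filter and map to that partition's input only.
println(scaled.toDebugString)
```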
The architecture and design of RDDs allow data to be distributed and partitioned over a cluster of nodes. RDDs are kept in the system's memory, but they also provide operations that can be used to store an RDD on disk or in an external system. Spark and its core packages provide, by default, the APIs to work with data residing in local filesystems and HDFS. Other vendors and open source communities provide packages and APIs for storing RDDs in external storage systems such as MongoDB, DataStax, Elasticsearch, and so on.
The following are a few of the functions that can be used to store RDDs:
saveAsTextFile(path): This writes the elements of the RDD to a text file in a local filesystem, HDFS, or any other mapped or mounted network drive.
saveAsSequenceFile(path): This writes the elements of the RDD as a Hadoop sequence file in a local filesystem, HDFS, or any other mapped or mounted network drive. It is available on RDDs of key/value pairs whose types implement Hadoop's Writable interface.
saveAsObjectFile(path): This writes the elements of the dataset to the given path using the Java serialization mechanism; they can then be loaded using SparkContext.objectFile(path).
You can refer to http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD (Scala version) or http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.api.java.JavaRDD (Java version) for more information on the APIs exposed by RDD.
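The save functions above can be sketched as follows, assuming an existing SparkContext `sc`; the `/tmp/...` output paths are purely illustrative.

```scala
// Assumes an existing SparkContext `sc`; output paths are illustrative.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))

pairs.saveAsTextFile("/tmp/rdd-as-text")          // plain text, one element per line
pairs.saveAsSequenceFile("/tmp/rdd-as-sequence")  // Hadoop sequence file (key/value pairs)
pairs.saveAsObjectFile("/tmp/rdd-as-objects")     // Java-serialized objects

// An object file can be read back with SparkContext.objectFile:
val restored = sc.objectFile[(String, Int)]("/tmp/rdd-as-objects")
println(restored.count())
```

Note that each call writes a directory of partition files rather than a single file, since every partition is saved in parallel.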
Persistence of an RDD is also called caching of the RDD. It can be done simply by invoking <RDD>.persist(StorageLevel) or <RDD>.cache(). By default, the RDD is persisted in memory (the default for cache()), but persistence on disk or in other external systems is also available; it is defined and provided by the persist(…) function and its parameter of the StorageLevel class (https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.storage.StorageLevel$).
The StorageLevel class is annotated as DeveloperApi() and can be extended to provide a custom implementation of persistence.
Caching or persistence is a key tool for iterative algorithms and fast interactive use. Whenever persist(…) is invoked on an RDD, each node stores the partitions it computes in memory and reuses them in other actions on that dataset. This, in turn, makes future actions much faster.
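A minimal sketch of this, again assuming an existing SparkContext `sc` (the squaring step stands in for a genuinely expensive computation):

```scala
import org.apache.spark.storage.StorageLevel

// Assumes an existing SparkContext `sc`.
val cached = sc.parallelize(1 to 100000)
  .map(x => x.toLong * x)                 // stand-in for an expensive computation
  .persist(StorageLevel.MEMORY_AND_DISK)  // keep in memory, spill to disk if needed

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY):
// cached.cache()

// The first action materializes and stores the partitions...
println(cached.count())
// ...so subsequent actions reuse them instead of recomputing the map.
println(cached.sum())
```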
Shuffling is another important concept in Spark. It redistributes data across the cluster so that it is grouped differently across the partitions, which makes it a costly operation.
There are certain transformation operations, defined in org.apache.spark.rdd.RDD and org.apache.spark.rdd.PairRDDFunctions, that initiate the shuffling process. These operations include the following:
RDD.repartition(…): This repartitions the existing dataset across the cluster of nodes.
RDD.coalesce(…): This repartitions the existing dataset into a smaller number of given partitions.
ByKey operations (except count operations), such as PairRDDFunctions.reduceByKey() or groupByKey.
The PairRDDFunctions.join(…) or PairRDDFunctions.cogroup(…) operations.
The shuffle is an expensive operation, since it involves disk I/O, data serialization, and network I/O, but there are certain configurations that can help with tuning and performance optimization. Refer to https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior for a complete list of parameters that can be used to optimize shuffle operations.
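The shuffle-inducing operations listed above can be sketched together as follows, assuming an existing SparkContext `sc`. Note that reduceByKey combines values per key on each partition before shuffling, so it typically moves less data over the network than groupByKey.

```scala
// Assumes an existing SparkContext `sc`.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

// A ByKey operation: combines values locally per partition, then shuffles
// the partial sums so that each key ends up in a single partition.
val sums = pairs.reduceByKey(_ + _)

// repartition always performs a full shuffle across the cluster...
val wide = sums.repartition(8)

// ...while coalesce can shrink the number of partitions without one.
val narrow = wide.coalesce(2)

println(narrow.collect().toList)
```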
Refer to https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf for more information on RDDs.