In this section, we will discuss the architecture of Spark and its various components in detail. We will also briefly talk about the various extensions/libraries of Spark that are developed on top of the core Spark framework.
Spark is a general-purpose computing engine that initially focused on providing solutions for iterative and interactive computations and workloads, for example, machine learning algorithms that reuse intermediate or working datasets across multiple parallel operations.
The real challenge with iterative computations is the dependency of the intermediate data/steps on the overall job. This intermediate data needs to be cached in memory for faster computation, because flushing it to disk and reading it back is an overhead that makes the overall process unacceptably slow.
The creators of Apache Spark not only provided scalability, fault tolerance, performance, and distributed data processing, but also in-memory processing of distributed data over a cluster of nodes.
To achieve this, a new layer of abstraction was introduced: distributed datasets that are partitioned over a set of machines (a cluster) and can be cached in memory to reduce latency. This new layer of abstraction is known as resilient distributed datasets (RDD).
RDD, by definition, is an immutable (read-only) collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.
It is important to note that Spark is capable of performing in-memory operations, but at the same time, it can also work on the data stored on the disk.
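As a minimal sketch of this idea (assuming an existing SparkContext named sc; the input path is purely illustrative), an RDD can be created, cached in memory, and then reused across multiple operations:

```scala
// Sketch only: assumes an existing SparkContext `sc` and an illustrative path.
val lines = sc.textFile("hdfs://namenode:9000/data/events.log")

// Cache the intermediate dataset in memory so that repeated
// computations do not re-read it from disk.
val errors = lines.filter(_.contains("ERROR")).cache()

// Both actions below reuse the cached partitions; if a partition is
// lost, Spark rebuilds it from its lineage (textFile + filter).
val errorCount  = errors.count()
val firstErrors = errors.take(10)
```

If the dataset does not fit in memory, `persist(StorageLevel.MEMORY_AND_DISK)` can be used instead of `cache()` to spill partitions to disk, which matches the point above that Spark can also work with data stored on disk.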
We will read more about RDDs in the next section, but let's move forward and discuss the components and architecture of Spark.
Spark provides a well-defined and layered architecture where all its layers and components are loosely coupled and integration with external components/libraries/extensions is performed using well-defined contracts.
Here is the high-level architecture of Spark 1.5.1 and its various components/layers:
The preceding diagram shows the high-level architecture of Spark. Let's discuss the roles and usage of each of the architecture components:
The preceding architecture should be sufficient to understand the various layers of abstraction provided by Spark. All the layers are loosely coupled and, if required, can be replaced or extended as per our requirements.
Spark extensions form one such layer, widely used by architects and developers to develop custom libraries. Let's move forward and talk more about the Spark extensions that are available for developing custom applications/jobs.
In this section, we will briefly discuss the usage of various Spark extensions/libraries that are available for different use cases.
The following extensions/libraries are available with Spark 1.5.1: Spark Streaming, Spark MLlib, Spark SQL, and Spark GraphX.
All the previously listed Spark extensions/libraries are part of the standard Spark distribution. Once we install and configure Spark, we can start using the APIs exposed by these extensions.
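As a small illustration (assuming an existing SparkContext named sc and Spark 1.5.1 on the classpath), the extension APIs are reached simply by importing their packages; for example, the Spark SQL entry point:

```scala
// Sketch only: assumes an existing SparkContext `sc` and Spark 1.5.1,
// where SQLContext is the entry point for the Spark SQL extension.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// The same pattern applies to the other extensions, for example:
// import org.apache.spark.streaming.{Seconds, StreamingContext}
// val ssc = new StreamingContext(sc, Seconds(1))
```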
Apart from these extensions, Spark also provides various other external packages that are developed and maintained by the open source community. These packages are not distributed with the standard Spark distribution, but they can be searched for and downloaded from http://spark-packages.org/. Spark packages provide libraries/packages for integration with various data sources, management tools, higher-level domain-specific libraries, machine learning algorithms, code samples, and other Spark content.
Let's move on to the next section where we will dive deep into the Spark packaging structure and execution model, and we will also talk about various other Spark components.
In this section, we will briefly talk about the packaging structure of the Spark code base. We will also discuss core packages and APIs, which will be frequently used by the architects and developers to develop custom applications with Spark.
Spark is written in Scala (http://www.scala-lang.org/), but for interoperability, it also provides equivalent APIs in Java and Python.
For brevity, we will only talk about the Scala and Java APIs, and for Python APIs, users can refer to https://spark.apache.org/docs/1.5.1/api/python/index.html.
A high-level Spark code base is divided into the following two packages:

- Core packages: The Spark core APIs are defined in the org.apache.spark.* package.
- Extension packages: The Spark Streaming APIs are defined in the org.apache.spark.streaming.* package, and the same packaging structure goes for the other extensions: Spark MLlib—org.apache.spark.mllib.*, Spark SQL—org.apache.spark.sql.*, and Spark GraphX—org.apache.spark.graphx.*.

For more information, refer to http://tinyurl.com/q2wgar8 for Scala APIs and http://tinyurl.com/nc4qu5l for Java APIs.
Two of the most important components of the Spark core APIs are SparkContext and SparkConf. Both of these components are used by each and every standard or customized Spark job and by every Spark library and extension. The terms/concepts Context and Config are not new; more or less, they have now become a standard architectural pattern. By definition, a Context is an entry point of the application that provides access to the various resources/features exposed by the framework, whereas a Config contains the application configuration, which helps define the environment of the application.

Let's move on to the nitty-gritty of the Scala APIs exposed by Spark Core:
- org.apache.spark: This is the base package for all Spark APIs; it contains the functionality to create/distribute/submit Spark jobs on the cluster.
- org.apache.spark.SparkContext: Creating a SparkContext is the first statement in any Spark job/application; the custom business logic of the job/application is defined afterwards. SparkContext is the entry point for accessing any of the Spark features that we may want to use or leverage, for example, connecting to the Spark cluster, submitting jobs, and so on. Even the references to all Spark extensions are provided by SparkContext. There can be only one SparkContext per JVM, which needs to be stopped if we want to create a new one. The SparkContext is immutable, which means that it cannot be changed or modified once it is started.
- org.apache.spark.rdd.RDD.scala: This is another important component of Spark that represents the distributed collection of datasets. It exposes various operations that can be executed in parallel over the cluster. The SparkContext exposes various methods to load data from HDFS, the local filesystem, or Scala collections, and finally create an RDD, on which various operations such as map, filter, join, and persist can be invoked. RDD also defines some useful child classes within the org.apache.spark.rdd.* package, such as PairRDDFunctions to work with key/value pairs, SequenceFileRDDFunctions to work with Hadoop sequence files, and DoubleRDDFunctions to work with RDDs of doubles. We will read more about RDD in the subsequent sections.
- org.apache.spark.annotation: This package contains the annotations used within the Spark API. This is an internal Spark package, and it is recommended that you do not use the annotations defined in this package while developing custom Spark jobs. The three main annotations defined within this package are as follows:
  - DeveloperApi: All APIs/methods marked with DeveloperApi are for advanced usage, where users are free to extend and modify the default functionality. These methods may be changed or removed in the next minor or major releases of Spark.
  - Experimental: All functions/APIs marked as Experimental are officially not adopted by Spark but are introduced temporarily in a specific release. These methods may be changed or removed in the next minor or major releases.
  - AlphaComponent: The functions/APIs that are still being tested by the Spark community are marked as AlphaComponent. These are not recommended for production use and may be changed or removed in the next minor or major releases.
- org.apache.spark.broadcast: This is one of the most important packages and is frequently used by developers in their custom Spark jobs. It provides the API for sharing read-only variables across Spark jobs. Once the variables are defined and broadcast, they cannot be changed. Broadcasting variables and data across the cluster is a complex task, and we need to ensure that an efficient mechanism is used so that it improves the overall performance of the Spark job and does not become an overhead. Spark provides two different implementations of broadcast: HttpBroadcast and TorrentBroadcast. The HttpBroadcast implementation leverages an HTTP server to fetch/retrieve the data from the Spark driver. In this mechanism, the broadcast data is fetched through an HTTP server running at the driver itself and further stored in the executor block manager for faster access. The TorrentBroadcast implementation, which is also the default implementation of broadcast, maintains its own block manager. The first request to access the data makes a call to its own block manager, and, if the data is not found, it is fetched in chunks from the executor or driver. It works on the principle of BitTorrent and ensures that the driver is not the bottleneck in fetching the shared variables and data. Spark also provides accumulators, which work like broadcast variables but are updatable shared variables across the Spark jobs, with some limitations. You can refer to https://spark.apache.org/docs/1.5.1/api/scala/index.html#org.apache.spark.Accumulator.
- org.apache.spark.io: This provides the implementation of various compression libraries that can be used at the block storage level. The whole package is marked as Developer API, so developers can extend it and provide their own custom implementations. By default, it provides three implementations: LZ4, LZF, and Snappy.
- org.apache.spark.scheduler: This provides various scheduler libraries that help in job scheduling, tracking, and monitoring. It defines the directed acyclic graph (DAG) scheduler (http://en.wikipedia.org/wiki/Directed_acyclic_graph). The Spark DAG scheduler implements stage-oriented scheduling: it keeps track of the completion of each RDD and the output of each stage and then computes a DAG, which is further submitted to the underlying org.apache.spark.scheduler.TaskScheduler API that executes the tasks on the cluster.
- org.apache.spark.storage: This provides APIs for structuring, managing, and finally persisting the data stored in RDDs within blocks. It also keeps track of the data and ensures that it is either stored in memory or, if the memory is full, flushed to the underlying persistent storage area.
- org.apache.spark.util: These are the utility classes used to perform common functions across the Spark APIs. For example, it defines MutablePair, which can be used as an alternative to Scala's Tuple2, with the difference that MutablePair is updatable while Scala's Tuple2 is not. It helps in optimizing memory and minimizing object allocations.

Let's move on to the next section, where we will dive deep into the Spark execution model, and we will also talk about various other Spark components.
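Of the packages described above, org.apache.spark.broadcast is the one most commonly used directly in application code. The following is a hedged sketch (assuming an existing SparkContext named sc; the lookup data is illustrative) of sharing a read-only lookup table and counting misses with an accumulator:

```scala
// Sketch only: assumes an existing SparkContext `sc`; data is illustrative.
val countryCodes = Map("IN" -> "India", "US" -> "United States")

// Ship the read-only map once per executor (TorrentBroadcast by default)
// instead of once per task; broadcast values must never be mutated.
val codes = sc.broadcast(countryCodes)

// An accumulator is the updatable counterpart: tasks may add to it,
// but only the driver can read its value.
val unknown = sc.accumulator(0, "unknown codes")

val names = sc.parallelize(Seq("IN", "US", "XX")).map { code =>
  codes.value.getOrElse(code, { unknown += 1; "unknown" })
}
names.collect() // the accumulator value is reliable only after an action
println(unknown.value)
```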
Spark essentially enables the distributed in-memory execution of a given piece of code. We discussed the Spark architecture and its various layers in the previous section. Let's also discuss its major components, which are used to configure the Spark cluster, and at the same time, they will be used to submit and execute our Spark jobs.
The following are the high-level components involved in setting up the Spark cluster or submitting a Spark job:
SparkContext: The entry point for any job, which defines the environment/configuration and the dependencies of the submitted job, is SparkContext. It connects to the cluster manager and requests resources for the further execution of the jobs.

The following diagram shows the high-level components and the master-worker view of Spark:
The preceding diagram depicts the various components involved in setting up the Spark cluster, and the same components are also responsible for the execution of the Spark job.
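To make the driver's role concrete, here is a hedged skeleton of a job (the application name and the local[2] master URL are illustrative; on a real cluster, the master URL would point to the cluster manager):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountJob {
  def main(args: Array[String]): Unit = {
    // The driver builds a SparkConf and a SparkContext; the context
    // connects to the cluster manager and requests executors.
    val conf = new SparkConf()
      .setAppName("WordCountJob") // illustrative name
      .setMaster("local[2]")      // placeholder: 2 local threads

    val sc = new SparkContext(conf)
    try {
      val counts = sc.parallelize(Seq("a", "b", "a"))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
      counts.collect().foreach(println)
    } finally {
      sc.stop() // only one SparkContext per JVM; stop it when done
    }
  }
}
```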
Although all the components are important, let's briefly discuss the cluster/resource manager, as it defines the deployment model and allocation of resources to our submitted jobs.
Spark enables and provides the flexibility to choose our resource manager. As of Spark 1.5.1, the following resource managers or deployment models are supported by Spark: the standalone mode, Apache Mesos, Hadoop YARN, and the local mode.
Refer to http://spark.apache.org/docs/latest/running-on-mesos.html for more information on running Spark jobs on Apache Mesos.
When running on YARN in the client deployment mode, the driver runs on the machine that submits the job, so the driver logs (and the output of statements such as println) are printed on the same console that is used to submit the job. In the cluster deployment mode, the driver runs within the cluster, so the driver logs (and the output of println) are written to the log files maintained by YARN and not on the machine that is used to submit our Spark job.

For more information on executing Spark applications on YARN, refer to http://spark.apache.org/docs/latest/running-on-yarn.html.
In the local mode, Spark executes the entire job within a single JVM on the local machine; we provide local[N] as the master URL, where N is the number of parallel threads.

We will soon implement the same execution model, but before that, let's move on to the next section and understand one of the most important components of Spark: resilient distributed datasets (RDD).