The architecture of Spark

In this section, we will discuss the architecture of Spark and its various components in detail. We will also briefly talk about the various extensions/libraries of Spark, which are built on top of the core Spark framework.

Spark is a general-purpose computing engine that initially focused on providing solutions for iterative and interactive computations and workloads, for example, machine learning algorithms that reuse intermediate or working datasets across multiple parallel operations.

The real challenge with iterative computations is that the subsequent steps depend on the intermediate data produced earlier in the job. This intermediate data needs to be cached in memory for faster computation, because flushing it to disk and reading it back is an overhead that makes the overall process unacceptably slow.

The creators of Apache Spark not only provided scalability, fault tolerance, performance, and distributed data processing, but also provided in-memory processing of distributed data over the cluster of nodes.

To achieve this, a new layer of abstraction was introduced: distributed datasets that are partitioned over the set of machines (the cluster) and can be cached in memory to reduce latency. This new layer of abstraction is known as resilient distributed datasets (RDDs).

An RDD, by definition, is an immutable (read-only) collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.

It is important to note that Spark is capable of performing in-memory operations, but at the same time, it can also work on the data stored on the disk.
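
To make the idea concrete, here is a minimal Scala sketch, assuming a hypothetical data/points.txt input file and a local master, in which a parsed dataset is cached once and then reused across several iterations instead of being re-read from disk every time:

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical local configuration; on a real cluster, the master URL would point to the cluster manager.
    val sc = new SparkContext(new SparkConf().setAppName("IterativeCacheExample").setMaster("local[2]"))

    // Parse the (hypothetical) input once and keep the working dataset in memory.
    val points = sc.textFile("data/points.txt")
      .map(_.split(",").map(_.toDouble))
      .cache()

    // Every iteration reuses the cached RDD instead of re-reading and re-parsing the file.
    for (i <- 1 to 10) {
      val sum = points.map(_.sum).reduce(_ + _)
      println(s"Iteration $i, sum = $sum")
    }

    sc.stop()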

We will read more about RDDs in the next section, but let's move forward and discuss the components and architecture of Spark.

High-level architecture

Spark provides a well-defined and layered architecture where all its layers and components are loosely coupled and integration with external components/libraries/extensions is performed using well-defined contracts.

Here is the high-level architecture of Spark 1.5.1 and its various components/layers:

High-level architecture

The preceding diagram shows the high-level architecture of Spark. Let's discuss the roles and usage of each of the architecture components:

  • Physical machines: This layer represents the physical or virtual machines/nodes on which Spark jobs are executed. These nodes collectively represent the total capacity of the cluster with respect to the CPU, memory, and data storage.
  • Data storage layer: This layer provides the APIs to store and retrieve data from the persistent storage area for Spark jobs/applications. It is used by Spark workers to dump data to persistent storage whenever the cluster memory is not sufficient to hold it. Spark is extensible and capable of using any kind of filesystem. RDDs, which hold the data, are agnostic to the underlying storage layer and can persist data in various persistent storage areas, such as local filesystems, HDFS, and S3, or in NoSQL databases such as HBase, Cassandra, MongoDB, and Elasticsearch (a minimal persistence sketch follows this list).
  • Resource manager: The architecture of Spark abstracts out the deployment of the Spark framework and its associated applications. Spark applications can leverage cluster managers such as YARN (http://tinyurl.com/pcymnnf) and Mesos (http://mesos.apache.org/) for the allocation and deallocation of various physical resources, such as CPU and memory, for the client jobs. The resource manager layer provides the APIs that are used to request the allocation and deallocation of available resources across the cluster.
  • Spark core libraries: The Spark core library represents the Spark Core engine, which is responsible for the execution of the Spark jobs. It contains APIs for in-memory distributed data processing and a generalized execution model that supports a wide variety of applications and languages.
  • Spark extensions/libraries: This layer represents the additional frameworks/APIs/libraries developed by extending the Spark core APIs to support different use cases. For example, Spark SQL is one such extension, which is developed to perform ad hoc queries and interactive analysis over large datasets.
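
As a small illustration of the data storage layer from an application's point of view, the following sketch persists an RDD with a memory-and-disk storage level and writes results back to storage; the HDFS paths and the namenode host are hypothetical placeholders and could equally be local or S3 paths:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext(new SparkConf().setAppName("StorageLayerExample").setMaster("local[2]"))

    // Read from the persistent storage layer (the HDFS path and namenode host are placeholders).
    val logs = sc.textFile("hdfs://namenode:8020/data/access.log")

    // Keep the filtered data in memory, spilling partitions to disk when memory is insufficient.
    val errors = logs.filter(_.contains("ERROR")).persist(StorageLevel.MEMORY_AND_DISK)
    println(s"Error lines: ${errors.count()}")

    // Write the result back to the persistent storage layer.
    errors.saveAsTextFile("hdfs://namenode:8020/output/errors")

    sc.stop()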

The preceding architecture should be sufficient to understand the various layers of abstraction provided by Spark. All the layers are loosely coupled and, if required, can be replaced or extended as per the requirements.

Spark extensions is one such layer that is widely used by architects and developers to develop custom libraries. Let's move forward and talk more about the Spark extensions that are available for developing custom applications/jobs.

Spark extensions/libraries

In this section, we will briefly discuss the usage of various Spark extensions/libraries that are available for different use cases.

The following are the extensions/libraries available with Spark 1.5.1:

  • Spark Streaming: Spark Streaming, as an extension, is developed over the core Spark API. It enables scalable, high-throughput, and fault-tolerant stream processing of live data streams. Spark Streaming enables the ingestion of data from various sources, such as Kafka, Flume, Kinesis, or TCP sockets. Once the data is ingested, it can be further processed using complex algorithms that are expressed with high-level functions, such as map, reduce, join, and window. Finally, the processed data can be pushed out to filesystems, databases, and live dashboards. Spark Streaming also makes it possible to apply Spark's machine learning and graph processing algorithms to data streams (a minimal word count sketch follows this list). For more information, refer to http://spark.apache.org/docs/latest/streaming-programming-guide.html.
  • Spark MLlib: Spark MLlib is another extension that provides distributed implementations of various machine learning algorithms. Its goal is to make practical machine learning scalable and easy to use. It provides implementations of common machine learning algorithms used for classification, regression, clustering, and more. For more information, refer to http://spark.apache.org/docs/latest/mllib-guide.html.
  • Spark GraphX: GraphX provides the API to create a directed multigraph with properties attached to each vertex and edge. It also provides various common operators used for aggregation, as well as distributed implementations of graph algorithms such as PageRank and triangle counting. For more information, refer to http://spark.apache.org/docs/latest/graphx-programming-guide.html.
  • Spark SQL: Spark SQL provides distributed processing of structured data and facilitates the execution of relational queries expressed in SQL (http://en.wikipedia.org/wiki/SQL). It provides a high-level abstraction known as DataFrames, which are distributed collections of data organized into named columns (a short DataFrame sketch follows this list). For more information, refer to http://spark.apache.org/docs/latest/sql-programming-guide.html.
  • SparkR: R (https://en.wikipedia.org/wiki/R_(programming_language)) is a popular programming language used for statistical computing and machine learning tasks. However, the execution of R is single threaded, which makes it difficult to use for processing large data (TBs or PBs); R can only process data that fits into the memory of a single machine. To overcome this limitation, Spark introduced a new extension: SparkR. SparkR provides an interface to invoke and leverage the Spark distributed execution engine from R, which allows us to run large-scale data analysis from the R shell. For more information, refer to http://spark.apache.org/docs/latest/sparkr.html.
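
To give a feel for the streaming API, here is a minimal word count sketch over a TCP socket; the host, port, and batch interval are arbitrary values chosen for the example:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Count words arriving on a TCP socket, in 10-second batches.
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    val lines = ssc.socketTextStream("localhost", 9999)
    val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    wordCounts.print() // push the result of every batch to the console

    ssc.start()            // start receiving and processing the stream
    ssc.awaitTermination() // keep the application running until it is stopped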

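Similarly, the following sketch shows the DataFrame abstraction exposed by Spark SQL in the 1.x API (through SQLContext); data/people.json is a hypothetical input file with one JSON record per line:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("DataFrameExample").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)

    // data/people.json is a hypothetical file with one JSON record per line, e.g. {"name":"Ana","age":32}
    val people = sqlContext.read.json("data/people.json")
    people.printSchema()

    // The same data can be queried through the DataFrame API or through plain SQL.
    people.filter(people("age") > 21).show()
    people.registerTempTable("people")
    sqlContext.sql("SELECT name, age FROM people WHERE age > 21").show()

    sc.stop()
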
All the previously listed Spark extensions/libraries are part of the standard Spark distribution. Once we install and configure Spark, we can start using the APIs exposed by these extensions.

Apart from these extensions, Spark also supports various other external packages that are developed and provided by the open source community. These packages are not distributed with the standard Spark distribution, but they can be searched for and downloaded from http://spark-packages.org/. Spark packages provide libraries for integration with various data sources, management tools, higher-level domain-specific libraries, machine learning algorithms, code samples, and other Spark content.

Let's move on to the next section where we will dive deep into the Spark packaging structure and execution model, and we will also talk about various other Spark components.

Spark packaging structure and core APIs

In this section, we will briefly talk about the packaging structure of the Spark code base. We will also discuss core packages and APIs, which will be frequently used by the architects and developers to develop custom applications with Spark.

Spark is written in Scala (http://www.scala-lang.org/), but for interoperability, it also provides equivalent APIs in Java and Python.

Note

For brevity, we will only talk about the Scala and Java APIs, and for Python APIs, users can refer to https://spark.apache.org/docs/1.5.1/api/python/index.html.

A high-level Spark code base is divided into the following two packages:

  • Spark extensions: All APIs for a particular extension are packaged in their own package structure. For example, all APIs for Spark Streaming are packaged in the org.apache.spark.streaming.* package, and the same packaging structure goes for other extensions: Spark MLlib—org.apache.spark.mllib.*, Spark SQL—org.apache.spark.sql.*, Spark GraphX—org.apache.spark.graphx.*.

    Note

    For more information, refer to http://tinyurl.com/q2wgar8 for Scala APIs and http://tinyurl.com/nc4qu5l for Java APIs.

  • Spark Core: Spark Core is the heart of Spark and provides two basic components: SparkContext and SparkConf. Both of these components are used by each and every standard or customized Spark job and by every Spark library and extension. The terms/concepts Context and Config are not new; they have more or less become a standard architectural pattern. By definition, a Context is the entry point of the application that provides access to the various resources/features exposed by the framework, whereas a Config contains the application configuration, which helps define the environment of the application (a short sketch of both follows this list).
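
Here is a minimal sketch of how SparkConf and SparkContext typically appear at the start of a job; the application name, master URL, and the spark.executor.memory value are purely illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    // The Config: defines the environment of the application.
    val conf = new SparkConf()
      .setAppName("MyFirstSparkJob")
      .setMaster("local[2]")              // illustrative; a real cluster master URL would be used in production
      .set("spark.executor.memory", "1g") // an example of an arbitrary configuration property

    // The Context: the entry point that gives access to the features exposed by the framework.
    val sc = new SparkContext(conf)
    println(s"Running Spark ${sc.version} as application ${sc.appName}")

    sc.stop()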

Let's move on to the nitty-gritty of the Scala APIs exposed by Spark Core:

  • org.apache.spark: This is the base package for all Spark APIs and contains the functionality to create/distribute/submit Spark jobs on the cluster.
  • org.apache.spark.SparkContext: Creating a SparkContext is the first statement in any Spark job/application; the custom business logic of the job/application is then defined on top of it. SparkContext is the entry point for accessing any of the Spark features that we may want to use or leverage, for example, connecting to the Spark cluster, submitting jobs, and so on. Even the references to all Spark extensions are provided by SparkContext. There can be only one SparkContext per JVM, which needs to be stopped if we want to create a new one. SparkContext is immutable, which means that it cannot be changed or modified once it is started.
  • org.apache.spark.rdd.RDD.scala: This is another important component of Spark that represents the distributed collection of datasets. It exposes various operations that can be executed in parallel over the cluster. The SparkContext exposes various methods to load the data from HDFS, the local filesystem, or Scala collections, and finally create an RDD on which various operations such as map, filter, join, and persist can be invoked (see the RDD sketch after this list). The org.apache.spark.rdd.* package also defines some useful classes, such as PairRDDFunctions to work with key/value pairs, SequenceFileRDDFunctions to work with Hadoop sequence files, and DoubleRDDFunctions to work with RDDs of doubles. We will read more about RDDs in the subsequent sections.
  • org.apache.spark.annotation: This package contains the annotations used within the Spark API. It is an internal Spark package, and it is recommended that you do not use the annotations defined in this package while developing custom Spark jobs. The three main annotations defined within this package are as follows:
    • DeveloperAPI: All APIs/methods marked with DeveloperAPI are for advanced usage, where users are free to extend and modify the default functionality. These methods may be changed or removed in the next minor or major releases of Spark.
    • Experimental: All functions/APIs marked as Experimental are officially not adopted by Spark but are introduced temporarily in a specific release. These methods may be changed or removed in the next minor or major releases.
    • AlphaComponent: The functions/APIs, which are still being tested by the Spark community, are marked as AlphaComponent. These are not recommended for production use and may be changed or removed in the next minor or major releases.
  • org.apache.spark.broadcast: This is one of the most important packages and is frequently used by developers in their custom Spark jobs. It provides the API for sharing read-only variables across Spark jobs; once the variables are defined and broadcast, they cannot be changed. Broadcasting variables and data across the cluster is a complex task, and we need to ensure that an efficient mechanism is used so that it improves the overall performance of the Spark job and does not become an overhead (a short sketch of broadcast variables and accumulators follows this list).
    Spark provides two different implementations of broadcast: HttpBroadcast and TorrentBroadcast. HttpBroadcast fetches/retrieves the data from the Spark driver through an HTTP server running at the driver itself; the data is then stored in the executor block manager for faster access. TorrentBroadcast, which is also the default implementation, maintains its own block manager. The first request to access the data makes a call to this block manager, and if the data is not found, it is fetched in chunks from the executors or the driver. It works on the principle of BitTorrent and ensures that the driver does not become a bottleneck when fetching the shared variables and data. Spark also provides accumulators, which work like broadcast variables but are updatable shared variables with some limitations: tasks can only add to them, and only the driver can read their value. You can refer to https://spark.apache.org/docs/1.5.1/api/scala/index.html#org.apache.spark.Accumulator.
  • org.apache.spark.io: This provides implementations of various compression codecs that can be used at the block storage level. The whole package is marked as a Developer API, so developers can extend it and provide their own custom implementations. By default, it provides three implementations: LZ4, LZF, and Snappy.
  • org.apache.spark.scheduler: This provides the various scheduler libraries that help in job scheduling, tracking, and monitoring. It defines the directed acyclic graph (DAG) scheduler (http://en.wikipedia.org/wiki/Directed_acyclic_graph). The Spark DAG scheduler implements stage-oriented scheduling: it keeps track of the completion of each RDD and the output of each stage, computes a DAG of stages, and then submits it to the underlying org.apache.spark.scheduler.TaskScheduler API, which executes the resulting tasks on the cluster.
  • org.apache.spark.storage: This provides APIs for structuring, managing, and finally persisting the data stored in RDDs within blocks. It also keeps track of the data and ensures that it is either stored in memory or, if the memory is full, flushed to the underlying persistent storage area.
  • org.apache.spark.util: These are the utility classes used to perform common functions across the Spark APIs. For example, it defines MutablePair, which can be used as an alternative to Scala's Tuple2 with the difference that MutablePair is updatable while Scala's Tuple2 is not. It helps in optimizing memory and minimizing object allocations.
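
To tie the SparkContext and RDD APIs together, here is a small sketch that creates RDDs from a Scala collection and from a hypothetical text file, applies a few lazy transformations, and triggers execution with actions:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("RddBasics").setMaster("local[2]"))

    // Create RDDs from a Scala collection and from a (hypothetical) text file.
    val numbers = sc.parallelize(1 to 100)
    val lines   = sc.textFile("data/input.txt")

    // Transformations are lazy; they only describe the computation.
    val evens     = numbers.filter(_ % 2 == 0)
    val wordPairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // reduceByKey comes from PairRDDFunctions, available on RDDs of (key, value) pairs.
    val wordCounts = wordPairs.reduceByKey(_ + _).cache()

    // Actions trigger the actual distributed execution.
    println(s"Even numbers: ${evens.count()}, distinct words: ${wordCounts.count()}")

    sc.stop()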

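And here is a short sketch of the shared variables discussed above: a broadcast variable used as a read-only lookup table and an accumulator that tasks can only add to; the country-code data is made up for the example:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("SharedVariables").setMaster("local[2]"))

    // A read-only lookup table shared efficiently with every executor.
    val countryCodes = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

    // An accumulator that tasks can only add to; only the driver reads its final value.
    val unknownCodes = sc.accumulator(0, "unknown country codes")

    val records = sc.parallelize(Seq("IN", "US", "XX", "IN"))
    val countries = records.map { code =>
      countryCodes.value.getOrElse(code, { unknownCodes += 1; "unknown" })
    }

    countries.collect().foreach(println)
    println(s"Unknown codes seen: ${unknownCodes.value}")

    sc.stop()
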
Let's move on to the next section where we will dive deep into the Spark execution model, and we will also talk about various other Spark components.

The Spark execution model – master-worker view

Spark essentially enables the distributed in-memory execution of a given piece of code. We discussed the Spark architecture and its various layers in the previous section. Let's now discuss the major components that are used to configure the Spark cluster and, at the same time, to submit and execute our Spark jobs.

The following are the high-level components involved in setting up the Spark cluster or submitting a Spark job:

  • Spark driver: This is the client program that defines the SparkContext. The SparkContext is the entry point for any job; it defines the environment/configuration and the dependencies of the submitted job. The driver connects to the cluster manager and requests the resources needed for the further execution of the jobs (see the driver sketch after this list).
  • Cluster manager/resource manager/Spark master: The cluster manager manages and allocates the required system resources to the Spark jobs. Furthermore, it coordinates and keeps track of the live/dead nodes in a cluster. It enables the execution of jobs submitted by the driver on the worker nodes (also called Spark workers) and finally tracks and shows the status of the various jobs run by the worker nodes.
  • Spark worker/executors: A worker actually executes the business logic submitted by the Spark driver. Spark workers are abstracted and are allocated dynamically by the cluster manager to the Spark driver for the execution of submitted jobs.
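
The following sketch shows these roles from the driver's point of view; the standalone master URL is a hypothetical placeholder, and in practice the master is often supplied via spark-submit rather than hard-coded:

    import org.apache.spark.{SparkConf, SparkContext}

    // The driver program: it creates the SparkContext, which connects to the cluster manager
    // (here a hypothetical standalone master) and requests executors on the Spark workers.
    val conf = new SparkConf()
      .setAppName("MasterWorkerExample")
      .setMaster("spark://master-host:7077") // hypothetical standalone master URL

    val sc = new SparkContext(conf)

    // The transformations below are shipped to the executors as tasks;
    // the driver only receives the final, aggregated result.
    val total = sc.parallelize(1L to 1000000L).reduce(_ + _)
    println(s"Sum computed on the workers: $total")

    sc.stop()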

The following diagram shows the high-level components and the master-worker view of Spark:

The Spark execution model – master-worker view

The preceding diagram depicts the various components involved in setting up the Spark cluster, and the same components are also responsible for the execution of the Spark job.

Although all the components are important, let's briefly discuss the cluster/resource manager, as it defines the deployment model and allocation of resources to our submitted jobs.

Spark enables and provides flexibility to choose our resource manager. As of Spark 1.5.1, the following are the resource managers or deployment models that are supported by Spark:

  • Apache Mesos: Apache Mesos (http://mesos.apache.org/) is a cluster manager that provides efficient resource isolation and sharing across distributed applications or frameworks. It can run Hadoop, MPI, Hypertable, Spark, and other frameworks on a dynamically shared pool of nodes. Apache Mesos and Spark are closely related to each other (but they are not the same). The story started back in 2009, when Mesos was ready and there were discussions about the ideas/frameworks that could be developed on top of Mesos, and that's exactly how Spark was born.

    Note

    Refer to http://spark.apache.org/docs/latest/running-on-mesos.html for more information on running Spark jobs on Apache Mesos.

  • Hadoop YARN: Hadoop 2.0 (http://tinyurl.com/lsk4uat), also known as YARN, was a complete change in the architecture. It introduced a generic cluster computing framework that is entrusted with the responsibility of allocating and managing the resources required to execute varied jobs or applications. It introduced new daemon services, such as the resource manager (RM), node manager (NM), and application master (AM), which are responsible for managing cluster resources, individual nodes, and the respective applications. YARN also introduced specific interfaces/guidelines that application developers can implement in order to submit or execute their custom applications on a YARN cluster. The Spark framework implements the interfaces exposed by YARN and provides the flexibility of executing Spark applications on YARN. Spark applications can be executed in the following two different modes on YARN:
    • YARN client mode: In this mode, the Spark driver executes on the client machine (the machine used for submitting the job), and the YARN application master is just used for requesting the resources from YARN. All our logs and sysouts (println) are printed on the same console, which is used to submit the job.
    • YARN cluster mode: In this mode, the Spark driver runs inside the YARN application master process, which is further managed by YARN on the cluster, and the client can go away just after submitting the application. Now as our Spark driver is executed on the YARN cluster, our application logs/sysouts (println) are also written in the log files maintained by YARN and not on the machine that is used to submit our Spark job.

    Note

    For more information on executing Spark applications on YARN, refer to http://spark.apache.org/docs/latest/running-on-yarn.html.

  • Standalone mode: The core Spark distribution contains the required APIs to create an independent, distributed, and fault-tolerant cluster without any external or third-party libraries or dependencies.
  • Local mode: Local mode should not be confused with standalone mode. In local mode, Spark jobs can be executed on a local machine without any special cluster setup by just passing local[N] as the master URL, where N is the number of parallel threads (the sketch after this list summarizes the master URL format for each deployment mode).
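
The following sketch summarizes the master URL values that select each of these deployment modes in Spark 1.5.1; all hosts and ports are hypothetical placeholders:

    import org.apache.spark.SparkConf

    // Master URL formats that select each deployment mode (values valid for Spark 1.5.1).
    // All hosts and ports are hypothetical placeholders; in practice, the master is usually
    // supplied through spark-submit's --master option rather than hard-coded.
    val conf = new SparkConf().setAppName("DeploymentModes")

    conf.setMaster("local[4]")                    // local mode with 4 worker threads
    // conf.setMaster("spark://master-host:7077") // standalone cluster
    // conf.setMaster("mesos://mesos-host:5050")  // Apache Mesos
    // conf.setMaster("yarn-client")              // YARN, driver on the client machine
    // conf.setMaster("yarn-cluster")             // YARN, driver inside the YARN application master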

We will soon implement the same execution model, but before that, let's move on to the next section and understand one of the most important components of Spark: resilient distributed datasets (RDD).
