Chapter 1. Introduction to Apache Spark: A Unified Analytics Engine

In this chapter, we’ll chart the course of Apache Spark’s short evolution: its genesis, inspiration, and adoption in the community as a de-facto big data unified processing engine.

If you are familiar with its history and high-level concepts, you can skip this chapter.

Today, most big data practitioners—data engineers or data scientists—either use it at scale for unified analytics or are on course to do so. And with the recent barrier mode support in the Apache Spark 2.4 scheduler1 for better integration with the MPI-based programs used by distributed deep learning frameworks, machine learning developers can employ Spark to perform distributed training and hyperparameter tuning as regular Spark jobs.

But first, let’s chart a brief history of distributed computing at scale.

The Genesis of Big Data and Distributed Computing at Google

When you think of scale today, you can’t help but think of Google’s search engine’s ability to index and search the world’s data on the internet at lightning speed. The name Google is synonymous with scale. In fact, Google is a deliberate misspelling of the mathematical term googol: that’s a 1 followed by 100 zeros!

Neither traditional storage systems such as RDBMSs nor imperative ways of programming could satisfy the scale at which Google wanted to build and search the internet’s indexed documents. This necessity led to the creation of the Google File System2 (GFS), MapReduce3 (MR), and Bigtable.4

While GFS provided a fault-tolerant and distributed file system across many commodity hardware servers in a cluster farm, Bigtable offered scalable storage of structured data across GFS, and MR introduced a new parallel programming paradigm, based on functional programming, for processing data distributed at scale over GFS and Bigtable.

In essence, your MR applications interact with the MapReduce system, which sends your computation code (mappers and reducers) to where the data resides, favoring data locality and cluster rack affinity rather than bringing data to your application.

The workers aggregate and reduce the intermediate computations and produce a final appended output from the reduce function, which is then written to distributed storage where it is accessible to your application.5 This approach significantly reduces network traffic and keeps most of the I/O local to disk rather than distributing it over the network.
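To make the functional idea behind MR concrete, here is a minimal single-machine sketch in Python of the map and reduce phases of a word count. The chunk data is invented for illustration, and unlike real MapReduce nothing here is distributed or fault tolerant:

from functools import reduce
from collections import Counter

# Hypothetical document chunks, standing in for data blocks stored near each worker
chunks = [["spark", "hadoop", "spark"], ["hadoop", "gfs"]]

# Map phase: each "mapper" emits intermediate (word, count) pairs for its local chunk
intermediate = [Counter(chunk) for chunk in chunks]

# Reduce phase: aggregate the intermediate results into the final word counts
final_counts = reduce(lambda a, b: a + b, intermediate)
print(final_counts)  # -> Counter({'spark': 2, 'hadoop': 2, 'gfs': 1})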

Most of the work Google did was proprietary, but the ideas expressed in the aforementioned three papers6 spurred innovation elsewhere in the open source community, especially at Yahoo!, which was dealing with similar big data challenges of scale for its search engine.

Hadoop at Yahoo!

The computational challenges and solutions expressed in Google’s GFS paper provided a blueprint for the Hadoop Distributed File System (HDFS),7 including the MapReduce implementation as a framework for distributed computing over data stored in HDFS. Donated to the Apache Software Foundation8 in April 2006, Apache Hadoop became a framework of related modules: Hadoop Common, MapReduce, HDFS, and Apache Hadoop YARN.9

Although Apache Hadoop garnered a large open source community of contributors and adoption outside Yahoo!, and led to two open source-based commercial companies (Cloudera and Hortonworks, now merged), the MapReduce framework on HDFS had a few shortcomings.

For one, it was hard to manage and administer, with cumbersome operational complexity. Second, its general batch-processing MapReduce API was verbose and required a lot of boilerplate setup code in Java, with brittle fault tolerance. Third, with large batches of data jobs comprising many pairs of MR tasks, each pair’s intermediate computed result was written to local disk for the subsequent stage of the operation; this repeated disk I/O took its toll. Large MR jobs could run for hours on end, or even days.

Figure 1-1. Intermittent iteration of reads and writes between map and reduce computations10

And finally, even though Hadoop MR was conducive to large-scale jobs for general batch processing, it fell short when it came to combining other workloads such as machine learning, streaming, or interactive SQL-like queries.

To satisfy these new workloads, engineers developed bespoke systems such as Apache Hive, Apache Storm, Apache Impala, Apache Giraph, Apache Drill, and Apache Mahout, each with its own APIs and cluster configurations, further adding to the operational complexity of Hadoop and steepening the learning curve for developers.

Learning from Alan Kay’s adage: “Simple things should be simple, complex things should be possible,” is it possible, then, to make Hadoop and MR simpler and faster?

Spark’s Early Years at AMPLab

Researchers at UC Berkeley’s AMPLab who had previously worked on Hadoop MapReduce entertained this question. They discovered MR was inefficient for interactive or iterative computing jobs and a complex framework to learn.

So from the outset, they embraced the idea of making Spark simpler, faster, and easier to use. This endeavor started in 2009 at the RAD Lab, which later became AMPLab (and is now RISELab).

Early papers published on Spark11 demonstrated that it was 10-20x faster than Hadoop MapReduce for certain jobs. Today, it’s orders of magnitude faster.12

The central thrust of the Spark project was to bring in ideas borrowed from Hadoop MapReduce but enhance them: make the system highly fault tolerant and embarrassingly parallel; support in-memory storage for intermediate results between iterative and interactive map and reduce computations; offer easy and composable APIs in multiple languages as a programming model; and support other workloads in a unified manner. We will return to the idea of unification shortly, after we explain what Spark is.

By 2013, Spark had garnered widespread use within AMPLab and beyond, and some of its original creators and researchers—Matei Zaharia, Ali Ghodsi, Reynold Xin, Patrick Wendell, Ion Stoica, and Andy Konwinski—formed a company called Databricks and donated the Spark project to the Apache Software Foundation (ASF), a vendor-neutral nonprofit organization.

Subsequently, Databricks and the community of open source developers worked to release Apache Spark 1.013 on May 11, 2014, under the governance of the ASF.

This first major release established the momentum for frequent future releases, with notable features contributed to Apache Spark by Databricks and more than 100 commercial vendors.

What is Apache Spark?

Since its first release, Apache Spark14 has become a unified engine designed for large-scale distributed data processing and machine learning on compute clusters, whether running on-premise at data centers or in the cloud.

Spark replaced Hadoop MapReduce by keeping intermediate computations in memory, which makes it much faster. It incorporates libraries with composable APIs for machine learning (MLlib), SQL for interactive queries (Spark SQL), stream processing (Structured Streaming) for interacting with real-time data, and graph processing (GraphX).

But more importantly, its design philosophy embraces (and continues to embrace) four characteristics:

Speed

Ease of Use

Modularity

Extensibility

Speed

First, Spark’s internal implementation benefits immensely from the hardware industry’s huge strides in both CPU and memory price and performance. Today’s commodity servers come cheap, with hundreds of gigabytes of memory and multiple cores, and the underlying UNIX-based operating system takes advantage of efficient multithreading and parallel processing.

Second, Spark builds its query computations as a directed acyclic graph (DAG); its DAG scheduler and Catalyst query optimizer construct an efficient computational graph that can usually be decomposed into tasks and executed in parallel stages across workers on the cluster. And third, its whole-stage code generation physical execution engine, Tungsten, generates compact code for execution. (We will cover Catalyst and whole-stage code generation in later chapters.)

With all the intermediate results retained in memory and limited disk I/O, Spark gets a huge performance boost.
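To peek at what the DAG scheduler, Catalyst, and Tungsten produce for a query, you can ask Spark to print its plans. Here is a small Python sketch, assuming a running SparkSession named spark:

# Build a simple aggregation query and inspect its plans
df = spark.range(0, 1000).selectExpr("id % 10 AS key").groupBy("key").count()
# explain(True) prints the parsed, analyzed, and optimized logical plans plus the physical plan;
# operators prefixed with "*" in the physical plan are covered by whole-stage code generation
df.explain(True)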

Ease of Use

Spark achieves simplicity by providing a fundamental abstraction, a simple logical data structure called the Resilient Distributed Dataset (RDD), upon which all other higher-level structured data abstractions, such as DataFrames and Datasets, are constructed. By providing a set of transformations and actions as operations, Spark offers a simple programming model that you can use to build parallel applications.

“Using this simple extension [model], Spark can capture a wide range of processing workloads that previously needed separate engines, including SQL, streaming, machine learning, and graph processing15” as library modules.
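As a minimal illustration of that programming model in Python (assuming a SparkSession named spark and a local README.md file, both hypothetical here), the following sketch chains a lazy transformation with an action that triggers the computation:

# Read a text file into a DataFrame of lines
strings = spark.read.text("README.md")
# filter() is a transformation: it is recorded lazily and nothing executes yet
filtered = strings.filter(strings.value.contains("Spark"))
# count() is an action: it triggers execution of the query across the cluster
print(filtered.count())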

Modularity

Spark operations can work across many workloads, and these operations can be expressed in any of the supported programming languages: Scala, Java, Python, SQL, and R. Spark offers unified libraries with well-documented APIs that include the following core components: Spark SQL, Structured Streaming, Machine Learning (MLlib), and GraphX, combining all the workloads running under one engine.

You can write a single Spark application that can do it all. No need for distinct engines for disparate workloads; no need to learn separate APIs—with Spark, you get a unified processing engine for your workloads.

Extensibility

Spark focuses more on its fast, parallel computation engine than on storage. That is, unlike Apache Hadoop, which included both storage and compute, Spark decouples itself from storage and can read data from myriad sources, including Hadoop-supported file formats. It does this by reading data into memory as RDDs and DataFrames (which we will discuss in later chapters).

As such, it offers the flexibility to read data stored in Apache Hadoop, Apache Cassandra, Apache HBase, MongoDB, Apache Hive, RDBMSs, and other diverse data sources. Its DataFrameReaders and DataFrameWriters can be extended to read data from other sources, such as Apache Kafka, Kinesis, Azure Storage, and Amazon S3, into its logical data abstraction, on which it can operate.

The community of Spark developers maintains a list of third-party Spark packages16 as part of its growing Spark ecosystem. This rich ecosystem of packages includes Spark connectors, extended DataSource readers from external data sources, performance monitors, and more (see Figure 1-2).

Figure 1-2. Apache Spark’s Ecosystem

Unified Analytics

While the notion of unification is not unique to Spark, it is a core part of its design philosophy and evolution. In November 2016, the ACM recognized Apache Spark and conferred upon its original creators the prestigious ACM Award for their paper describing Apache Spark as a “Unified Engine for Big Data Processing.”17 The award-winning paper notes that Apache Spark replaces the separate batch processing, graph, stream, and query engines such as Storm, Impala, Dremel, and Pregel with a unified stack of components that addresses diverse workloads under a single distributed fast engine.

With a single engine, unifying SQL, ETL, batch compute and “classical” machine learning, developers can build their big data applications with Spark’s supported programming languages and use its simple, composable APIs. Many developers in the community would contend that it’s hard not to think of Spark when you think of processing big data at scale in a distributed manner.

Apache Spark Components as a Unified Stack

Spark offers four distinct components as libraries for diverse workloads: Spark SQL, Spark MLlib, Spark Structured Streaming, and GraphX. Each component offers simple and composable APIs. These components are separate from Spark’s core fault-tolerant engine, in that your Spark code written with these library APIs is translated into a DAG that is executed by the engine. So whether you write your Spark code using the provided Structured APIs in Java, R, Scala, SQL, or Python, the underlying code is decomposed into highly compact bytecode that is executed in the workers’ JVMs across the cluster. Let’s look at each of these components in more detail.

Figure 1-3. Apache Spark components and APIs stack

Spark SQL

This module works well with structured data: you can read data stored in an RDBMS table or from files with structured data in CSV, text, JSON, Avro, ORC, or Parquet format and then construct permanent or temporary tables in Spark. Also, when using Spark’s Structured APIs in Java, Python, Scala, or R, you can issue SQL-like queries against the data just read into a Spark DataFrame.18 To date, Spark SQL is ANSI SQL:2003 compliant.19

For example, in this Scala code snippet, you can read from a JSON file stored on Amazon S3, create a temporary table, and issue a SQL-like query on the results read into memory as a Spark DataFrame.

Note: You can express the same code snippet in Python, R, or Java, and the generated bytecode will be identical, resulting in the same performance.

// Read data off an Amazon S3 bucket into a Spark DataFrame
spark.read.json("s3://apache_spark/data/committers.json")
  .createOrReplaceTempView("committers")
// Issue a SQL query and return the result as a Spark DataFrame
val results = spark.sql("""SELECT name, org, module, release, num_commits
    FROM committers WHERE module = 'mllib' AND num_commits > 10
    ORDER BY num_commits DESC""")

Spark MLlib

Spark comes with a library containing common classical machine learning (ML) algorithms called MLlib. Since Spark’s first release, the performance of this library component has improved because of Spark 2.x’s underlying engine enhancements. Today, MLlib provides data scientists with a popular list of machine learning algorithms, including classification, regression, clustering, and collaborative filtering, built atop high-level DataFrame-based APIs to build predictive models.

These APIs allow you to perform featurization to extract or transform features, build pipelines (for training and evaluating models), and persist models (saving and reloading them) during deployment. Additional utilities include common linear algebra operations and statistics. MLlib also includes other low-level ML primitives, such as a generic gradient descent optimization. Using just the high-level DataFrame-based APIs, the following Python code snippet encapsulates the basic operations a typical data scientist might perform when building a model.

Note: Starting with Apache Spark 1.6, the MLlib project split into two packages: spark.mllib and spark.ml. The latter contains the DataFrame-based APIs, while the former contains the RDD-based APIs, which are now in maintenance mode. All new features and development by the community go into spark.ml.20

This code snippet is just an illustration of the ease with which you can build your models—it feels Pythonic. More extensive examples will be discussed in chapters 11 and 12.

from pyspark.ml.classification import LogisticRegression
...
# Load training and test data
training = spark.read.csv("s3://...")
test = spark.read.csv("s3://...")

# Configure the model
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
# Fit the model
lrModel = lr.fit(training)
# Predict on the test data
lrModel.transform(test)
...

All these new enhancements in spark.ml are designed for developers’ ease of use and for scaling training across the Spark cluster.

Spark Structured Streaming

Apache Spark 2.0 introduced an experimental Continuous Streaming model21 and Structured Streaming APIs22, built atop the Spark SQL engine and DataFrame-based APIs. By Spark 2.2, Structured Streaming was generally available, meaning that developers could use it in their production environments.

Because big data developers need to combine and react in real time to both static data and streaming data from engines like Apache Kafka and other streaming sources, the new model views a stream as a continually growing table, with new rows of data appended at the end. Developers can treat this simply as a structured table and issue queries against it as they would a static table.

This new model obviated the old DStreams model from Spark’s 1.x series, which we will discuss in more detail in chapter 9. Underneath the Structured Streaming model, the Spark SQL core engine handles all aspects of fault tolerance and late-data semantics, allowing developers to focus on writing streaming applications with relative ease.

Furthermore, Spark 2.x extends its streaming data sources to include Apache Kafka, Kinesis, and HDFS-based or cloud storage.

For example, this code snippet shows the typical anatomy of a Structured Streaming application doing the streaming equivalent of the Hello World program: a word count.

Note: Take note of the composability of the high-level domain-specific language (DSL) syntax of the DataFrame-based APIs’ transformations and computations. It has the look and feel of a SQL-like aggregation, but expressed with the Python DataFrame API. We will expand on this in later chapters.

from pyspark.sql.functions import explode, split

# Read from a Kafka stream (the required kafka.bootstrap.servers option is elided here)
lines = (spark.readStream
  .format("kafka")
  .option("subscribe", "input")
  .load())
# Perform transformation: cast the message value to a string, split it into words,
# and keep a running count of each word
words = lines.select(explode(split(lines.value.cast("string"), " ")).alias("word"))
wordCounts = words.groupBy("word").count()
# Write back out to the stream; Kafka expects string (or binary) key and value columns,
# and the query starts running only when you call .start() on the writer
query = (wordCounts
  .selectExpr("word AS key", "CAST(count AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("topic", "output"))

GraphX

As the name suggests, GraphX is a library for manipulating graphs (e.g., social network graphs, routes and connection points, network topologies) and performing graph-parallel computations. Contributed to by many users in the community, it offers the standard graph algorithms for analysis, connections, and traversals, including PageRank, Connected Components, and Triangle Counting. This code snippet shows a simple example of how to join two graphs using the GraphX APIs.

// Create a graph from existing vertex and edge RDDs
val graph = Graph(vertices, edges)
// Load per-vertex messages (each line would need to be parsed into a (vertexId, message) pair)
val messages = spark.sparkContext.textFile("hdfs://...")
// Join the messages with the graph's vertices to produce a new graph
val graph2 = graph.joinVertices(messages) {
  (id, vertex, msg) => ...
}

Apache Spark’s Distributed Execution and Concepts

If you have read this far, you already know that Spark is a distributed data processing engine whose components work collaboratively on a cluster of machines. Before we explore programming Spark in the following chapters of this book, we need to understand how all the components within Spark’s distributed architecture work together and communicate, and what deployment modes make Spark flexible and relatively easy to configure and deploy.

At a high level, a Spark application consists of a driver program that is responsible for orchestrating parallel operations on the Spark cluster. The driver accesses the distributed components in the cluster—the Spark Executors and the Cluster Manager—through a SparkSession. Let’s look at each of the individual components shown in Figure 1-4 and how they fit into the architecture.

Figure 1-4. Apache Spark components and architecture

Spark Session

In Spark 2.0, the SparkSession became a unified conduit to all Spark operations and data. Not only did it subsume previous entry points to Spark like the SparkContext, SQLContext, HiveContext, SparkConf, and StreamingContext, but it also made working with Spark simpler and easier.23

Through this unified conduit, you can create JVM runtime parameters, define DataFrames and Datasets, read from data sources, access catalog metadata, and issue Spark SQL queries. This is the essence of simplicity in Spark 2.x: a single unified entry point to all of Spark’s functionality!

In a standalone Spark application, you can create a SparkSession using one of the high-level APIs in the programming language of your choice, whereas in the Spark shells (more on these in the next chapter) it’s created for you, and you can access it via a global variable called spark or sc.

For example, in a Spark application, you can create a SparkSession per JVM and use it to perform a number of Spark operations, whereas in Spark 1.x, you would have had to create individual contexts, introducing extra boilerplate code.

Note: Although in Spark 2.x the SparkSession subsumes all the other contexts, you can still access the individual contexts and their respective methods. In this way, the community maintained backward compatibility: your old 1.x code with SparkContext or SQLContext will still work.

import org.apache.spark.sql.SparkSession
// Build a SparkSession
val spark = SparkSession.builder()
  .appName("LearnSpark")
  .config("spark.sql.shuffle.partitions", 6)
  .getOrCreate()
...
val people = spark.read.json("...")
...
val resultsDF = spark.sql("SELECT city, pop, state, zip FROM zips_table")

Spark Driver

As the part of the Spark application responsible for instantiating a SparkSession, the Spark Driver has multiple responsibilities: it communicates with the Cluster Manager (explained below); it requests resources from the Cluster Manager for Spark’s Executor JVMs; and it transforms all the Spark operations into DAG computations, schedules them, and distributes their execution as tasks across the Spark Executors.

Its interaction with the Cluster Manager is merely to obtain Spark Executor resources: JVMs, CPU, memory, disk, etc. Once the resources are allocated, it communicates directly with the Spark Executors.
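For instance, the resource requests that the driver forwards to the Cluster Manager are typically expressed as configuration on the SparkSession. The following is a hedged Python sketch; the property names are standard Spark configuration keys, but the values are purely illustrative:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
  .appName("DriverResourceExample")
  .config("spark.executor.memory", "2g")   # memory requested for each Executor JVM
  .config("spark.executor.cores", "2")     # CPU cores requested for each Executor
  .getOrCreate())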

Cluster Manager

The Cluster Manager is responsible for managing and allocating resources for the cluster of nodes on which your Spark application runs, regardless of the deployment mode (see Table 1-1). Sometimes called Spark or cluster “workers,” these managed cluster nodes launch individual Spark Executors. Currently, Spark supports four cluster managers: the built-in standalone cluster manager, Apache Hadoop YARN, Apache Mesos, and Kubernetes.

Spark Executor

The Spark Executor runs on each worker node in the cluster. The Executors communicate with the driver program and are responsible for executing tasks on the workers. In most deployment modes, only a single Executor runs per node.

Deployment Modes

An attractive feature of Spark is its support for myriad deployment modes across various popular environments. Because the cluster manager is agnostic to where it runs (as long as it can manage Spark’s Executors and fulfill resource requests), Spark can be deployed in some of the most popular environments -- such as Apache Mesos, Apache Hadoop YARN, and Kubernetes -- and can operate in different modes.

Mode | Spark Driver | Spark Executor | Cluster Manager
Local | Runs on a single JVM, like a laptop or a single node | Runs on the same JVM as the driver | Runs on the same host
Standalone | Can run on any node in the cluster | Each node in the cluster launches its own Executor JVM | Can be allocated arbitrarily to any host in the cluster
YARN (client) | Runs on a client, not part of the cluster | Runs in YARN’s NodeManager containers | YARN’s Resource Manager works with YARN’s Application Master to allocate containers on NodeManagers for Executors
YARN (cluster) | Runs with the YARN Application Master | Same as YARN client mode | Same as YARN client mode
Mesos (client) | Runs on a client, not part of the Mesos cluster | Runs in containers within the Mesos Agents | Mesos Master
Mesos (cluster) | Runs within one of the Mesos Masters | Same as Mesos client mode | Mesos Master
Kubernetes | Runs within a Kubernetes pod | Each Executor runs within its own pod | Kubernetes Master

Table 1-1. Cheat sheet for Spark deployment modes24
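In practice, the deployment environment is usually selected through the master URL supplied at submission time or when building the SparkSession. The following Python sketch shows the URL formats only; the hosts and ports are placeholders, and each non-local option assumes a reachable, properly configured cluster:

from pyspark.sql import SparkSession

# Local mode: driver and executor share a single JVM on one machine
spark = SparkSession.builder.master("local[*]").getOrCreate()

# Other master URL formats (commented out; they require a running cluster):
#   .master("spark://<host>:7077")            -- standalone cluster manager
#   .master("yarn")                           -- Apache Hadoop YARN
#   .master("mesos://<host>:5050")            -- Apache Mesos
#   .master("k8s://https://<host>:<port>")    -- Kubernetes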

Distributed Data and Partitions

Actual physical data is distributed across an entire Spark cluster as partitions residing in either HDFS or cloud storage. Often more than one partition may reside on a node in a cluster.

While the data is physically distributed as partitions across the cluster, Spark treats the partitions as a high-level logical data abstraction -- an RDD or DataFrame in memory. Where possible, each Spark Executor is preferably allocated a task that requires it to read the partition closest to it in the network, observing data locality. See Figure 1-5.

Figure 1-5. Data distributed across physical machines

Partitioning allows for efficient parallelism. A distributed scheme of breaking up data into chunks or partitions allows Spark Executors to process only data that is close to them, minimizing network bandwidth. That is, each Executor’s core (or slot) is assigned its own data partition to work on (see Figure 1-6).

For example, this code snippet will break up the physical data stored across the cluster into eight partitions, and each Executor will get one or more of them to read into its memory:

log_df = spark.read.text("path_to_large_text_file").repartition(8)
print(log_df.rdd.getNumPartitions())

Or this code will create a DataFrame of 10,000 integers distributed over eight partitions in memory:

df = spark.range(0, 10000, 1, 8)
print(df.rdd.getNumPartitions())

Both code snippets will print out 8.

Figure 1-6. Each Executor’s core gets a partition of data to work on

In chapters 3 and 8, we will discuss how to tune and change the partitioning configuration for maximum parallelism based on how many cores (or slots) you have on your Executors.

Developer’s Experience

Of all the developer delights, none is more attractive than a set of composable APIs that increase productivity and are easy to use, intuitive, and expressive. One of Apache Spark’s appeals to developers has been its easy-to-use APIs for operating on small to large datasets, across languages: Scala, Java, Python, SQL, and R.25

One primary motivation behind Spark 2.x was to unify and simplify Spark by limiting the number of concepts that developers have to grapple with. Spark 2.x introduced higher-level abstraction APIs as domain-specific language constructs, which make programming Spark highly expressive and a pleasant developer experience. You express what you want the task or operation to compute, not how to compute it—let Spark ascertain how best to optimize your computation and do it for you. We will cover these Structured Spark APIs in chapter 3, but first let’s look at who Spark’s developers are.

Who Uses Spark, and for What?

Not surprisingly, most developers who grapple with big data are data engineers, data scientists or machine learning engineers. They are drawn to Spark because it is a unified engine for big data processing and it allows them to build a range of applications using a single engine, with a familiar programming language.

More recently, Spark 2.4 extended its execution engine to accommodate additional workloads such as deep learning; in particular, third-party packages like Horovod can be used with it. Unsurprisingly, then, the typical use cases extend not only to data science and data engineering but also to deep learning and machine learning tasks.

Of course, many developers wear multiple hats and may do one or all of these tasks, especially at startups or in smaller engineering groups. Among all these tasks, however, data -- massive amounts of data -- is the foundation.

Data Science Tasks

As a discipline that has come to prominence in the era of big data, data science is about using data to tell stories. But before data scientists can narrate those stories, they have to cleanse the data, explore it to discover patterns, and build models to predict or suggest outcomes.

Some of these tasks require a background in statistics, mathematics, computer science, and programming. Most data scientists are proficient in analytical tools like SQL, comfortable with libraries like NumPy and pandas, and conversant in programming languages like R and Python. But they also need to know how to wrangle or transform data and how to use established classification, regression, or clustering algorithms for building models. Often their tasks are iterative, interactive, ad hoc, or experimental, designed to test their hypotheses.

Fortunately, Spark supports these different tools. MLlib offers a common set of machine learning algorithms for building model pipelines using high-level estimators, transformers, and data featurizers, while Spark SQL and the Spark shells provide quick, interactive, and ad hoc exploration of data.

Additionally, Spark enables data scientists to tackle large data sets and scale their model training and evaluation, using a Spark cluster.

Often, after building their models, data scientists need to work with other team members who may be responsible for deploying the models, or they may need to work closely with others to build and transform raw, dirty data into clean data that is easily consumable or usable by other data scientists. For example, a classification or clustering model does not exist in isolation; it works in conjunction with other components, such as a web application or a streaming engine like Apache Kafka, as part of a larger data pipeline. This pipeline is often built by data engineers.

Data Engineering Tasks

Data engineers have a strong understanding of software engineering principles and methodologies, and possess skills for building scalable data pipelines for a stated business use case. Data pipelines enable end-to-end transformations of raw data coming from myriad sources -- data is cleansed so that it can be consumed downstream by developers, stored in the cloud or in NoSQL or RDBMS for report generation, or made accessible to data analysts via BI tools.

Spark 2.x introduced an evolutionary streaming model called continuous applications with Structured Streaming.26 With Structured Streaming APIs, data engineers can build complex data pipelines that include extracting, transforming, or loading (ETL) data from both real-time or static data sources. We will discuss the Structured Streaming model in chapter 9.

Data engineers use Spark because it provides a simple way to parallelize computations and hides all the complexity of distribution and fault tolerance. This leaves them free to focus on using high-level DataFrame-based APIs and DSL queries to do ETL, reading and combining data from multiple sources.

With the Spark 2.x performance improvements, due to the Catalyst optimizer for SQL and Tungsten for compact code generation,27 28 life for data engineers is much easier. They can choose to use any of the three Spark APIs—RDDs, DataFrames, or Datasets—that suit the task at hand, and reap the benefits of Spark 2.x.29

Machine Learning or Deep Learning Tasks

Apache Spark 2.4 introduced a new gang scheduler, as part of Project Hydrogen,30 to accommodate the fault-tolerant needs of training and scheduling deep learning models in a distributed manner. So developers whose tasks demand deep learning techniques can use Spark for both deep learning and traditional machine learning workloads.
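To give a flavor of what barrier (gang) execution looks like in code, here is a minimal Python sketch, assuming a SparkSession named spark; the training logic itself is hypothetical and would normally be handed off to an MPI-style distributed deep learning framework:

from pyspark import BarrierTaskContext

def train_partition(iterator):
    context = BarrierTaskContext.get()
    context.barrier()  # every task in the stage waits here until all tasks have started
    workers = [info.address for info in context.getTaskInfos()]
    # ... hand the list of worker addresses to a distributed training framework here ...
    yield len(workers)

# Run the function in barrier mode: all four tasks are scheduled together or not at all
rdd = spark.sparkContext.parallelize(range(8), 4)
print(rdd.barrier().mapPartitions(train_partition).collect())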

Whether you are a data engineer, data scientist or machine learning engineer, Spark is used for the following use cases:

process in parallel large data sets distributed across a cluster;

perform ad hoc or interactive queries to explore and visualize data sets;

build, train, and evaluate machine learning models using MLlib;

implement end-to-end data pipelines from myriad streams of data; and

analyze graph data sets and social networks.

Community Adoption and Expansion

Not surprisingly, Apache Spark struck a chord in the open source community, especially among data engineers and data scientists. Its four design characteristics, mentioned above, and its inclusion as part of the vendor-neutral Apache Software Foundation have fostered immense interest among the developer community.

Today, there are over 600 Apache Spark Meetup groups globally, with close to half a million members.31 Every week, someone somewhere in the world gives a talk at a meetup or a conference, delivers a tutorial, or shares a blog post on how to use Spark to build data pipelines or ML models, or to combine Spark with deep learning techniques. The Spark + AI Summit32 is the largest conference dedicated to the use of Spark for machine learning, data engineering, and data science across many verticals.

Since Spark’s first 1.0 release in May 2014, the community has delivered many minor and major releases, with the most recent major release, Spark 2.4, arriving in 2018. This book focuses on Spark 2.x, but by the time of its publication the community will be close to releasing (or will have released) Spark 3.0. Most of the code in this book will work with Spark 3.0, since the APIs have not changed.

Figure 1-7. The State of Apache Spark on GitHub33

Over the course of these releases, Spark has continued to attract contributors from across the globe and from numerous organizations. Today, Spark has close to 1,400 contributors, over 93 releases, 8,000 forks, and over 23,810 commits on GitHub, as Figure 1-7 shows. And we hope that when you finish this book, you will feel compelled to contribute too.

Now we can turn our attention to the fun of learning: where and how to start using Spark? In the next chapter, we’ll discuss how to download Spark and get started quickly in three simple steps.

1 https://databricks.com/blog/2018/11/08/introducing-apache-spark-2-4.html

2 http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf

3 http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf

4 http://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf

5 https://en.wikipedia.org/wiki/MapReduce

6 https://bowenli86.github.io/2016/10/23/distributed%20system/data/Big-Data-and-Google-s-Three-Papers-I-GFS-and-MapReduce/

7 http://storageconference.us/2010/Papers/MSST/Shvachko.pdf

8 https://www.apache.org/

9 https://en.wikipedia.org/wiki/Apache_Hadoop

10 https://brookewenig.github.io/SparkOverview.html#/20

11 https://www.usenix.org/legacy/event/hotcloud10/tech/full_papers/Zaharia.pdf

12 https://spark.apache.org

13 https://spark.apache.org/releases/spark-release-1-0-0.html

14 https://spark.apache.org

15 https://cacm.acm.org/magazines/2016/11/209116-apache-spark/abstract

16 https://spark.apache.org/third-party-projects.html

17 https://cacm.acm.org/magazines/2016/11/209116-apache-spark/abstract

18 https://spark.apache.org/sql

19 https://en.wikipedia.org/wiki/SQL:2003

20 https://spark.apache.org/docs/latest/ml-guide.html

21 https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html

22 https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html

23 https://databricks.com/blog/2016/08/15/how-to-use-sparksession-in-apache-spark-2-0.html

24 https://www.kdnuggets.com/2016/09/7-steps-mastering-apache-spark.html

25 https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

26 https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html

27 https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html

28 https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html

29 https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

30 https://databricks.com/session/databricks-keynote-2

31 https://www.meetup.com/topics/apache-spark/

32 https://databricks.com/sparkaisummit

33 https://github.com/apache/spark
