Book Description

Data is getting bigger, arriving faster, and coming in varied formats—and it all needs to be processed at scale for analytics or machine learning. How can you process such varied data workloads efficiently? Enter Apache Spark.

Updated to emphasize new features in Spark 2.x, this second edition shows data engineers and data scientists why structure and unification in Spark matter. Specifically, this book explains how to perform simple and complex data analytics and employ machine-learning algorithms. Through discourse, code snippets, and notebooks, you’ll be able to:

  • Learn Python, SQL, Scala, or Java high-level APIs: DataFrames and Datasets (a short PySpark sketch follows this list)
  • Peek under the hood of the Spark SQL engine to understand Spark transformations and performance
  • Inspect, tune, and debug your Spark operations with Spark configurations and Spark UI
  • Connect to data sources: JSON, Parquet, CSV, Avro, ORC, Hive, S3, or Kafka
  • Perform analytics on batch and streaming data using Structured Streaming (see the second sketch below)
  • Build reliable data pipelines with open source Delta Lake and Spark
  • Develop machine learning pipelines with MLlib and productionize models using MLflow
  • Use Koalas, the open source pandas-like framework on Spark, for data transformation and feature engineering
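
To make the first bullet concrete, here is a minimal sketch of the high-level DataFrame API in PySpark, loosely modeled on the “Counting M&Ms for the Cookie Monster” example from Chapter 2. The file name mnm_dataset.csv and the columns State, Color, and Count are illustrative assumptions; any CSV with a header row follows the same pattern.

    # Minimal DataFrame API sketch; the file and column names below are
    # assumptions for illustration, not the book's exact dataset.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("MnMCount")
             .getOrCreate())

    # Read a CSV into a DataFrame, inferring column types from the data
    mnm_df = (spark.read.format("csv")
              .option("header", "true")
              .option("inferSchema", "true")
              .load("mnm_dataset.csv"))

    # Aggregate the counts per state and color, highest totals first
    count_df = (mnm_df.groupBy("State", "Color")
                .sum("Count")
                .orderBy("sum(Count)", ascending=False))

    count_df.show(10)
    spark.stop()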

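The Structured Streaming bullet uses the same high-level API. Below is a hedged sketch of a streaming word count, arranged as the five steps that Chapter 7’s “Five Steps to Define a Streaming Query” walks through; the socket source on localhost:9999 and the checkpoint path are assumptions for a local experiment, not the book’s exact code.

    # Hedged Structured Streaming sketch; the socket source, port, and
    # checkpoint path are local-experiment assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

    # Step 1: define the input source (feed it with, e.g., `nc -lk 9999`)
    lines = (spark.readStream.format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    # Step 2: transform the lines into running word counts
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Steps 3-5: pick a sink and output mode, set a checkpoint location
    # for recovery, and start the query
    query = (counts.writeStream
             .format("console")
             .outputMode("complete")
             .option("checkpointLocation", "/tmp/wordcount_ckpt")
             .start())

    query.awaitTermination()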

Table of Contents

  1. Introduction to Apache Spark: A Unified Analytics Engine
    1. The Genesis of Big Data and Distributed Computing at Google
    2. Hadoop at Yahoo!
    3. Spark’s Early Years at AMPLab
    4. What Is Apache Spark?
    5. Unified Analytics
      1. Apache Spark Components as a Unified Stack
      2. Apache Spark’s Distributed Execution and Concepts
    6. Developer’s Experience
    7. Who Uses Spark, and for What?
      1. Data Science Tasks
      2. Data Engineering Tasks
      3. Machine Learning or Deep Learning Tasks
      4. Community Adoption and Expansion
  2. Downloading Apache Spark and Getting Started
    1. Step 1: Downloading Apache Spark
      1. Spark’s Directories and Files
    2. Step 2: Using the Scala or PySpark Shell
      1. Using the Local Machine
    3. Step 3: Understanding Spark Application Concepts
      1. Spark Application and SparkSession
      2. Spark Jobs
      3. Spark Stages
      4. Spark Tasks
      5. Transformations, Actions, and Lazy Evaluation
    4. Spark UI
      1. Databricks Community Edition
    5. First Standalone Application
      1. Using the Local Machine
      2. Counting M&Ms for the Cookie Monster
      3. Building Standalone Applications in Scala
    6. Summary
  3. Apache Spark’s Structured APIs
    1. Spark: What’s Underneath an RDD?
    2. Structuring Spark
      1. Key Merits and Benefits
    3. Structured APIs: DataFrames and Datasets
      1. DataFrame API
      2. Common DataFrame Operations
      3. Datasets API
      4. DataFrames vs Datasets
      5. What about RDDs?
    4. Spark SQL and the Underlying Engine
      1. Catalyst Optimizer
    5. Summary
  4. Spark SQL and DataFrames: Introduction to Built-in Data Sources
    1. Using Spark SQL in Spark Applications
      1. Basic Query Example
    2. SQL Tables and Views
    3. Data Sources for DataFrames and SQL Tables
      1. DataFrameReader
      2. DataFrameWriter
      3. Parquet
      4. JSON
      5. CSV
      6. Avro
      7. ORC
      8. Image
    4. Summary
  5. Spark SQL and Datasets
    1. Single API for Java and Scala
      1. Scala Case Classes and JavaBeans for Datasets
    2. Working with Datasets
      1. Creating Sample Data
      2. Transforming Sample Data
    3. Memory Management for Datasets and DataFrames
    4. Dataset Encoders
      1. Spark’s Internal Format vs Java Object Format
      2. Serialization and Deserialization (SerDe)
    5. Costs of Using Datasets
      1. Strategies to Mitigate Costs
    6. Summary
  6. Loading and Saving Your Data
    1. Motivation for Data Sources
    2. File Formats: Revisited
      1. Text Files
    3. Organizing Data for Efficient I/O
      1. Partitioning
      2. Bucketing
      3. Compression Schemes
    4. Saving as Parquet Files
      1. Delta Lake Storage Format
      2. Delta Lake Table
    5. Summary
  7. Structured Streaming
    1. Evolution of the Apache Spark Stream Processing Engine
      1. The Advent of Micro-batch Stream Processing
      2. Lessons Learned from Spark Streaming (DStreams)
      3. The Philosophy of Structured Streaming
    2. The Programming Model of Structured Streaming
    3. The Fundamentals of a Structured Streaming Query
      1. Five Steps to Define a Streaming Query
      2. Under the Hood of an Active Streaming Query
      3. Recovering from Failures with Exactly-once Guarantees
      4. Monitoring an Active Query
    4. Streaming Data Sources and Sinks
      1. Files
      2. Apache Kafka
      3. Custom Streaming Sources and Sinks
    5. Data Transformations
      1. Incremental Execution and Streaming State
      2. Stateless Transformations
      3. Stateful Transformations
    6. Stateful Streaming Aggregations
      1. Non-time-based Streaming Aggregations
      2. Aggregations with Event-Time Windows
      3. Handling Late Data with Watermarks
      4. Supported Output Modes
    7. Streaming Joins
      1. Stream-static Joins
      2. Stream-stream Joins
    8. Arbitrary Stateful Computations
      1. Modeling Arbitrary Stateful Operations with mapGroupsWithState
      2. Using Timeouts to Manage Inactive Groups
      3. Generalization with flatMapGroupsWithState
    9. Performance Tuning
    10. Summary