Index
A
- accumulators, Accumulators
- AccumulatorV2 interface, Accumulators
- actions, Lazy Evaluation
- add function, Accumulators
- agg API, Aggregates and groupBy, Grouped Operations on Datasets
- aggregateByKey function, Speeding up joins by assigning a known partitioner
- aggregateColumnFrequencies function, Sort and count values on each partition
- aggregations
- aggregates and groupBy, Aggregates and groupBy
- choosing aggregation operation for key/value data, Choosing an Aggregation Operation-Multiple RDD Operations
- computing aggregates over a window, Windowing
- extending Spark SQL with user-defined aggregate functions, Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs)
- on each partition in Goldilocks final example, Aggregate to ((cell value, column index), count) on each partition
- on grouped data in Datasets, Grouped Operations on Datasets
- optimizing, using array as aggregation object, Using Smaller Data Structures
- reducing number of records by key, Why GroupByKey fails
- reusing existing objects in, Reusing Existing Objects
- speeding up wide transformations in, Working with Key/Value Data
- Structured Streaming aggregates for Naive Bayes, Machine learning with Structured Streaming
- with bad implicit conversions (example), Using Smaller Data Structures
- alias operator, Simple DataFrame transformations and SQL expressions
- Anaconda, using to add packages on CDH clusters, PySpark dependency management
- Apache Bahir project, Sources and Sinks
- Apache Parquet (see Parquet files)
- Apache Toree, How Eclair JS Works
- Append (save mode), Save Modes
- applications (Spark), The Spark Application
- ArrayBuffer, using a map or flatMap instead of, An Example
- arrays
- as operator, Simple DataFrame transformations and SQL expressions, Interoperability with RDDs, DataFrames, and Local Collections
B
- Bahir project, Sources and Sinks
- batch intervals, Batch Intervals
- batch predictions, Predicting
- batch serialization, Python debugging
- big data ecosystem, Spark's place in, How Spark Fits into the Big Data Ecosystem
- Binarizer pipeline stage, explain params, Explain Params
- broadcast hash joins, Speeding up joins using a broadcast hash join
- broadcast variables, Broadcast Variables
- broadcasting training models, Preparing textual data
- builds, adding Spark SQL and Hive components to regular sbt build, Spark SQL Dependencies
C
- C#, using with Spark, Spark on the Common Language Runtime (CLR)—C# and Friends
- C/C++
- cache function, Persist and cache
- caching
- case classes, Basics of Schemas
- Catalyst query optimizer, Datasets, Query Optimizer-Debugging Spark SQL Queries
- CDH clusters, adding packages with Anaconda, PySpark dependency management
- checkpoint function, Checkpointing
- checkpointing
- ChiSqSelector, Feature Scaling and Selection
- class tags
- classification algorithms
- classification model, training in MLlib, MLlib Model Training
- Classifier class, Custom estimators
- Clojure, Beyond Scala within the JVM
- cloneComplement function, Sampling
- cluster managers, How Spark Fits into the Big Data Ecosystem
- clustering algorithms
- clusters
- co-located RDDs, Leveraging Co-Located and Co-Partitioned RDDs
- co-partitioned RDDs, Leveraging Co-Located and Co-Partitioned RDDs
- coalesce function, Wide Versus Narrow Dependencies, Partitioners and Key/Value Data
- code examples from this book
- code generation, by query optimizer, Code Generation
- cogroup function, Choosing a Join Type, Co-Grouping, Leveraging Co-Located and Co-Partitioned RDDs
- CoGroupedRDD, Co-Grouping, Leveraging Co-Located and Co-Partitioned RDDs
- collect action, Functions on RDDs: Transformations Versus Actions
- collectAsMap action, Functions on RDDs: Transformations Versus Actions, Goldilocks Version 1: groupByKey Solution
- collections
- colocated joins, Core Spark Joins
- column operators (Spark SQL), Simple DataFrame transformations and SQL expressions
- combineByKey function, Choosing a Join Type, Choosing an Aggregation Operation
- Common Language Runtime (CLR), Spark on the Common Language Runtime (CLR)—C# and Friends
- components and packages, Spark Components and Packages-Conclusion
- configuration settings, Spark Tuning and Cluster Sizing-Spark settings conclusion
- console sink (blocking), Stream status and debugging
- copartition joins, Core Spark Joins
- copy function, Accumulators
- count function, Functions on RDDs: Transformations Versus Actions
- countByKey function, Actions on Key/Value Pairs
- countByKeyApprox function, Partial manual broadcast hash join
- countByValue function, Actions on Key/Value Pairs
- counters, verifying performance with, Spark Counters for Verifying Performance
- CSV (comma-separated values), Using Pipe and Friends
- CUDA, Going Beyond Scala
D
- DAGs (directed acyclic graphs), Lazy Evaluation
- DAG Scheduler for Spark jobs, The DAG
- data cleaning (ML library), Data Cleaning
- data encoding (ML library), Data Encoding-Data Cleaning
- data formats (see formats for reading/writing data)
- data loading and saving operations, Persist and cache
- data property accumulators, Accumulators
- Data Source API, Data Loading and Saving Functions
- data sources
- data structures, smaller, using to enhance performance, Using Smaller Data Structures
- DataFrameReader, DataFrameWriter and DataFrameReader
- DataFrames, DataFrames, Datasets, and Spark SQL
- computing difference between, Computing RDD Difference
- converting to/from Datasets, Interoperability with RDDs, DataFrames, and Local Collections
- converting to/from RDDs, RDDs
- creating from JDBC data sources, JDBC
- creating from local collections, Local collections
- data representation in, Data Representation in DataFrames and Datasets-Tungsten
- DataFrame API, DataFrame API-Data Representation in DataFrames and Datasets
- Goldilocks data (example), Goldilocks Version 0: Iterative Solution
- inspecting the schema, Basics of Schemas
- joins, DataFrame Joins
- PySpark, PySpark DataFrames and Datasets
- RDD transformations, Effective Transformations
- RDDs versus, DataFrames, Datasets, and Spark SQL
- registering/saving as Hive tables to perform SQL queries against, Plain Old SQL Queries and Interacting with Hive Data
- sample and randomSplit functions, Sampling
- Structured Streaming based on, Stream Processing with Spark
- testing, Testing DataFrames
- working with as RDDs, loss of type information, What Type of RDD Does Your Transformation Return?
- working with C#, Spark on the Common Language Runtime (CLR)—C# and Friends
- working with R language, How SparkR Works
- DataFrameWriter, DataFrameWriter and DataFrameReader
- Dataset API, Datasets
- (see also Datasets)
- up-to-date documentation on, Datasets
- Datasets, DataFrames, Datasets, and Spark SQL, Datasets-Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs)
- compile-time strong typing, Compile-Time Strong Typing
- converting to RDDs, RDDs
- data representation in, Data Representation in DataFrames and Datasets
- easier functional transformations, Easier Functional (RDD “like”) Transformations
- grouping operations on, Grouped Operations on Datasets
- interoperability with RDDs, DataFrames, and local collections, Interoperability with RDDs, DataFrames, and Local Collections
- joins, Dataset Joins
- multi-Dataset relational transformations, Multi-Dataset Relational Transformations
- PySpark, PySpark DataFrames and Datasets
- RDD transformations, Effective Transformations
- relational transformations, Relational Transformations
- streaming aggregations on, Data Checkpoint Intervals
- use in Structured Streaming, Considerations for Structured Streaming
- versus RDDs, DataFrames, Datasets, and Spark SQL
- DataStreamWriter, Output operations
- debugging
- defaultCopy function, Custom transformers
- dense vectors, creating, Working with Spark vectors
- dependencies
- dependencies function, Immutability and the RDD Interface
- describe function, Aggregates and groupBy
- deserialized (storage level), Persist and cache
- directed acyclic graphs (DAGs), Lazy Evaluation
- DAG Scheduler for Spark jobs, The DAG
- disk space errors, Shuffle files, Out of Disk Space Errors
- distinct function, Choosing a Join Type, Set Operations
- distinct, reducing to on each partition (Goldilocks example), Goldilocks Version 4: Reduce to Distinct on Each Partition
- Docker-based Spark integration environments, Docker-based
- driver, Spark Job Scheduling, The Spark Application
- dropDuplicates function, Beyond row-by-row transformations
- DryadLINQ, How Spark Works
- DStreams, Stream Processing with Spark
- dynamic resource allocation, Resource Allocation Across Applications, Basic Spark Core Settings: How Many Resources to Allocate to the Spark Application?
E
- Eclair JS, Going Beyond Scala
- enableHiveSupport function, Getting Started with the SparkSession (or HiveContext or SQLContext)
- equality tests, Simple DataFrame transformations and SQL expressions
- equals function, Custom Partitioning
- ErrorIfExists (save mode), Save Modes
- Estimator interface, Extending Spark ML Pipelines with Your Own Algorithms, Custom estimators
- estimators, Working with Spark ML, Pipeline Stages
- evaluation, machine learning models
- Evaluator class, Automated model selection (parameter search)
- executors, The Spark Application
- explain params (ML pipeline stages), Explain Params
- explicit conversions
- explode function, Simple DataFrame transformations and SQL expressions
F
- fair scheduler, Default Spark Scheduler, Noisy Cluster Considerations
- fake class tags, Beyond Scala within the JVM
- fault tolerance
- feature selection and scaling
- feature transformers (ML library), Spark ML Organization and Imports
- features
- FIFO scheduler, Default Spark Scheduler, Noisy Cluster Considerations
- file sources in Spark Streaming, Sources and Sinks
- filter function, To Be a Spark Expert You Have to Learn a Little Scala Anyway
- filter pushdown in Spark SQL, Debugging Spark SQL Queries
- filterByRange function, Dictionary of OrderedRDDOperations
- filtering
- fit function, Explain Params, Training a Pipeline
- Flambo, Beyond Scala within the JVM
- flatMap function, To Be a Spark Expert You Have to Learn a Little Scala Anyway, Goldilocks Version 0: Iterative Solution, Map to (cell value, column index) pairs
- flatMapValues function, Dictionary of Mapping and Partitioning Functions PairRDDFunctions
- Flume, Sources and Sinks
- fold function, To Be a Spark Expert You Have to Learn a Little Scala Anyway
- fold operations, object reuse with, Reusing Existing Objects
- foldByKey function
- foreach function, Functions on RDDs: Transformations Versus Actions
- foreachPartition function, Reducing Setup Overhead
- foreachRDD function, Output operations
- formats for reading/writing data, Formats-Additional formats
- FORTRAN, Underneath Everything Is FORTRAN
- interacting with from Spark, using JNI, JNI
- fromML function, Spark ML Organization and Imports
- full outer joins, Choosing a Join Type, DataFrame Joins
- functions
G
- garbage collection (GC)
- getOrCreate function, Getting Started with the SparkSession (or HiveContext or SQLContext)
- getPartition function, Custom Partitioning
- GLM (generalized linear model), persisting, General Serving Considerations
- Goldilocks example, The Goldilocks Example-Actions on Key/Value Pairs
- review of all solutions, Goldilocks postmortem
- using PairRDDFunctions and OrderedRDDFunctions, How to Use PairRDDFunctions and OrderedRDDFunctions
- Version 0, iterative solution, Goldilocks Version 0: Iterative Solution
- Version 1, groupByKey solution, Goldilocks Version 1: groupByKey Solution
- Version 2, using secondary sort, Goldilocks Version 2: Secondary Sort-Performance
- Version 3, A Different Approach to Goldilocks-Goldilocks Version 3: Sort on Cell Values
- Version 4, Back to Goldilocks (Again)-Sort and find rank statistics
- GPUEnabler package, Getting to the GPU
- GPUs (graphics processing units), Getting to the GPU
- GraphX, Spark Components, GraphX
- groupBy function
- groupByKey function, Wide Versus Narrow Dependencies
- groupByKeyAndSortValues function, Leveraging repartitionAndSortWithinPartitions for a Group by Key and Sort Values Function
- GroupedDataset object, Grouped Operations on Datasets
- GroupedRDDFunctions class, Types of RDDs
- grouping operations on Datasets, Grouped Operations on Datasets
- groupSorted function, Goldilocks Version 2: Secondary Sort
I
- if/else in Spark SQL, Simple DataFrame transformations and SQL expressions
- Ignore (save mode), Save Modes
- immutability of RDDs, Immutability and the RDD Interface
- implicit conversions
- in-memory persistence, In-Memory Persistence and Memory Management, Deciding if Recompute Is Inexpensive Enough
- IndexToString, Data Encoding
- inner joins, DataFrame Joins
- integration testing, Integration Testing-Verifying Performance
- intercepts, including in training a simple MLlib classification model, MLlib Model Training
- intermediate object creation, Using Smaller Data Structures
- intersection function, Set Operations
- IPython, How PySpark Works
- isZero function, Accumulators
- Iterable objects, Co-Grouping
- iterative algorithms, large query plans and, Large Query Plans and Iterative Algorithms
- iterative computations, reusing RDDs in, Iterative computations
- iterative solution (Goldilocks example), Goldilocks Version 0: Iterative Solution
- iterator function, Immutability and the RDD Interface
- iterator-to-iterator transformations with mapPartitions, Iterator-to-Iterator Transformations with mapPartitions-Set Operations, Reducing Setup Overhead, A Different Approach to Goldilocks
- iterators
J
- Janino, Code Generation
- JARs (Java Archives)
- Java, Spark Components
- accessing backing Java objects in PySpark, Accessing the backing Java objects and mixing Scala code
- Iterable versus Iterator objects, Space and Time Advantages
- iterator implementation, java.util.Iterator, What Is an Iterator-to-Iterator Transformation?
- object serialization, Tungsten versus, Tungsten
- RDDs composed of Java objects, converting to DataFrames, RDDs
- Scala API versus Java API, The Spark Scala API Is Easier to Use Than the Java API
- simple Java JNI, JNI
- System.loadLibrary function, JNI
- writing Spark code in, Beyond Scala within the JVM
- Java Native Access (JNA), Java Native Access (JNA)
- Java Native Interface (JNI), Going Beyond Scala, JNI-Java Native Access (JNA)
- java.util.Properties object, JDBC
- JavaBeans, RDDs composed of, converting to DataFrames, RDDs
- JavaConverters object, Beyond Scala within the JVM
- JavaDoubleRDD, Beyond Scala within the JVM
- javah command, JNI
- JavaPairRDD, Beyond Scala within the JVM
- JavaRDD class, Types of RDDs
- JavaScript, Eclair JS, Going Beyond Scala, How Eclair JS Works
- JBLAS library, JNI
- JDBC
- JdbcDialect, JDBC
- JDWP (Java Debug Wire Protocol), Attaching debuggers
- JNI (see Java Native Interface)
- jobs
- join function, Wide Versus Narrow Dependencies
- joins, Joins (SQL and Core)-Conclusion, Implications for Performance
- JPMML evaluator project, PMML, General Serving Considerations
- JSON, Using Pipe and Friends
- Julia (Spark.jl), Spark.jl (Julia Spark)
- Jupyter notebook, Debugging in notebooks
- JVMs (Java Virtual Machines), The Spark Application
- Jython, PySpark and, PySpark DataFrames and Datasets
K
- Kafka, Sources and Sinks
- key/value data, working with, Working with Key/Value Data-Conclusion
- actions on key/value pairs, Actions on Key/Value Pairs
- choosing an aggregation operation, Choosing an Aggregation Operation-Multiple RDD Operations
- dangers of groupByKey function, What’s So Dangerous About the groupByKey Function
- Goldilocks example, The Goldilocks Example-Actions on Key/Value Pairs
- Goldilocks example, Version 3, A Different Approach to Goldilocks-Goldilocks Version 3: Sort on Cell Values
- groupByKey solution to Goldilocks example, Goldilocks Version 1: groupByKey Solution
- multiple RDD operations (co-grouping), Multiple RDD Operations
- OrderedRDDFunctions class, dictionary of operations, Dictionary of OrderedRDDOperations
- partitioners, Partitioners and Key/Value Data-Dictionary of Mapping and Partitioning Functions PairRDDFunctions
- performance issues, Working with Key/Value Data
- repartitioning keyed data, Repartitioning
- secondary sort and repartitionAndSortWithinPartitions, Secondary Sort and repartitionAndSortWithinPartitions-Performance
- straggler detection and unbalanced data, Straggler Detection and Unbalanced Data-Goldilocks postmortem
- keys function, Dictionary of Mapping and Partitioning Functions PairRDDFunctions
- Kinesis, Sources and Sinks
- KMeans model, Model Evaluation
- kontextfrei library, Mocking RDDs
- Kryo serialization, Data Representation in DataFrames and Datasets, Kryo
L
- LabeledPoint class, Getting Started with MLlib (Organization and Imports)
- labels
- lambdas
- lazy evaluation, Lazy Evaluation
- left anti joins, DataFrame Joins
- left outer joins, Choosing a Join Type, DataFrame Joins
- left semi joins, DataFrame Joins
- libraries, Spark Components
- limiting results, using sorting in Spark SQL, Sorting
- linear algebra package, Spark ML Organization and Imports
- linear models, persisting, General Serving Considerations
- Local Checkpointing option, Checkpointing example
- local mode, How Spark Fits into the Big Data Ecosystem
- LocalRelation, Local collections
- log4j, Configuring logging
- logging, Out of Disk Space Errors-Accessing logs
- logical plan (query optimizer), Logical and Physical Plans
- LogisticRegressionModel, Model Evaluation
- lookup function, Actions on Key/Value Pairs
- LRU caching, In-Memory Persistence and Memory Management, LRU Caching
M
- machine learning, Spark MLlib and ML-Conclusion
- choosing between Spark MLlib and Spark ML, Choosing Between Spark MLlib and Spark ML
- ML and MLlib packages, Spark Components
- modifying an existing algorithm, Custom estimators
- serving considerations in MLlib and ML library, General Serving Considerations
- with Structured Streaming, Machine learning with Structured Streaming
- working with ML library, Working with Spark ML-General Serving Considerations
- accessing individual pipeline stages, Accessing Individual Stages
- building a pipeline, Putting It All Together in a Pipeline
- data cleaning, Data Cleaning
- data encoding, Data Encoding-Data Cleaning
- data persistence, Data Persistence and Spark ML-Extending Spark ML Pipelines with Your Own Algorithms
- extending ML pipelines with your own algorithms, Extending Spark ML Pipelines with Your Own Algorithms-Conclusion
- getting started, organization and imports, Spark ML Organization and Imports
- models, Spark ML Models
- pipeline stages, Pipeline Stages
- training a pipeline, Training a Pipeline
- working with MLlib, Working with MLlib-Model Evaluation
- map function, To Be a Spark Expert You Have to Learn a Little Scala Anyway, Wide Versus Narrow Dependencies
- map-side combinations, aggregation operations, Preventing out-of-memory errors with aggregation operations
- mapGroups function, Grouped Operations on Datasets
- mapPartitions function, Reducing Setup Overhead, Dictionary of Mapping and Partitioning Functions PairRDDFunctions, Leveraging repartitionAndSortWithinPartitions for a Group by Key and Sort Values Function
- mappedRDD, Implications for Fault Tolerance
- MapReduce
- mapValues function, Dictionary of Mapping and Partitioning Functions PairRDDFunctions
- Maven build manager, Spark Components, PySpark dependency management
- memory errors, Why GroupByKey fails
- memory management
- MemoryStreams, Sources and Sinks
- MEMORY_AND_DISK_2 storage option, Noisy Cluster Considerations
- MEMORY_ONLY storage level, Persist and cache
- MEMORY_ONLY_SER storage option, Persist and cache
- merge function, Accumulators
- meta-algorithms, ML library, Spark ML Organization and Imports
- metadata
- MinMaxScaler (Spark ML), Data Cleaning
- missing data, working with on DataFrames, Specialized DataFrame transformations for missing and noisy data
- ML library, Spark Components, Spark MLlib and ML
- (see also machine learning)
- MLeap project, support for PMML model export, Model and Pipeline Persistence and Serving with Spark ML
- MLlib, Spark Components, Spark MLlib and ML
- MLUtils object, kfold function, Model Evaluation
- Mobius, Spark on the Common Language Runtime (CLR)—C# and Friends
- model training
- models (machine learning)
- multi-DataFrame transformations, Multi-DataFrame Transformations
- multiple actions on the same RDD, Multiple actions on the same RDD
- MutableAggregationBuffer, Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs)
- MySQL, including JDBC JAR in Spark Shell, JDBC
N
- na function, Specialized DataFrame transformations for missing and noisy data
- Naive Bayes algorithm
- NaN values, isNaN function on DataFrames, Specialized DataFrame transformations for missing and noisy data
- narrow dependencies, Immutability and the RDD Interface
- native loader decorator, JNI
- NewHadoopRDD class, Immutability and the RDD Interface
- noisy clusters, Noisy Cluster Considerations
- Normalizer, using in Spark ML, Data Cleaning, Putting It All Together in a Pipeline
- notebooks, debugging in, Debugging in notebooks
- null values, isNull function on DataFrames, Specialized DataFrame transformations for missing and noisy data
- numeric functions
- numPartitions function, Custom Partitioning
P
- packages, Spark Components and Packages
- PairRDDFunctions class, Types of RDDs, Working with Key/Value Data
- PairwiseRDD, PySpark RDDs
- parallel-ssh, installing packages via, PySpark dependency management
- parallelism value (SparkConf), Hash Partitioning
- parameters (pipeline stages in ML), Pipeline Stages
- Parquet files, Parquet, Data sources
- partitionBy function, Partitions (Discovery and Writing), Partitioners and Key/Value Data, Dictionary of Mapping and Partitioning Functions PairRDDFunctions
- partitioner function, Immutability and the RDD Interface
- partitioners
- partitions, Spark Model of Parallel Computing: RDDs
- partitions function, Immutability and the RDD Interface
- PartitionwiseSampledRDD, Sampling
- PCA (principal component analysis)
- performance
- considerations with aggregation operations, Dictionary of Aggregation Operations with Performance Considerations
- considerations with joins, Joins (SQL and Core)
- Goldilocks example, review of solutions, Goldilocks postmortem
- issues with key/value operations, Working with Key/Value Data
- narrow versus wide transformations, Implications for Performance
- PySpark DataFrames and Datasets versus RDDs, PySpark DataFrames and Datasets
- RDDs versus DataFrames, DataFrames, Datasets, and Spark SQL
- transformations, methods for improving, A Different Approach to Goldilocks
- user-defined functions and, Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs)
- verifying, Verifying Performance-Projects for Verifying Performance
- Perl script, calling from pipe interface, Using Pipe and Friends
- persist function, In-Memory Persistence and Memory Management, Iterative computations
- persistence
- persistencePriority function, In-Memory Persistence and Memory Management
- physical plan (query optimizer), Logical and Physical Plans
- pip installations, PySpark dependency management
- pipe interface, calling other languages from Spark, Using Pipe and Friends
- PipedRDD interface, PySpark RDDs
- pipelines (Spark ML)
- accessing individual stages, Accessing Individual Stages
- building a pipeline, Putting It All Together in a Pipeline
- extending with your own algorithms, Extending Spark ML Pipelines with Your Own Algorithms-Conclusion
- parameters, setting for pipeline stages, Pipeline Stages
- persistence in pipeline stages, Data Persistence and Spark ML
- Pipeline object, Putting It All Together in a Pipeline
- pipeline stages, Pipeline Stages
- support for Structured Streaming, Machine learning with Structured Streaming
- training a pipeline, Training a Pipeline
- transformers and estimators, Pipeline Stages
- PipelineStage interface, Extending Spark ML Pipelines with Your Own Algorithms
- PMML (Predictive Model Markup Language) models, Serving and Persistence
- PMMLExportable trait, PMML
- predictions
- Predictor class, Custom estimators
- preferredLocations function, Immutability and the RDD Interface
- printSchema function, Basics of Schemas
- programming languages, options with Spark, Going Beyond Scala-Conclusion
- properties (RDDs), Immutability and the RDD Interface
- property checking, using ScalaCheck, Property Checking with ScalaCheck-Computing RDD Difference
- pseudorandom number generators, creating, Reducing Setup Overhead
- Py4J, How PySpark Works, Accessing the backing Java objects and mixing Scala code
- Python, Spark Components
- IPython, How PySpark Works
- PySpark, Beyond Scala, and Beyond the JVM-Installing PySpark
- round-tripping through RDDs to cut query plans, Large Query Plans and Iterative Algorithms
- Scala performance versus, Scala Is More Performant Than Python
- Spark ML parameter documentation, Pipeline Stages
- Spark packages, Using Community Packages and Libraries
- user-defined function performance penalty, avoiding, Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs)
R
- R language, Spark Components
- RandomDataGenerator, Generating Large Datasets
- RandomRDDs, Generating Large Datasets
- RandomSampler trait, Sampling
- randomSplit function, Sampling
- range partitioning, Range Partitioning
- rank statistics, The Goldilocks Example
- (see also Goldilocks example)
- RDD class, Types of RDDs
- rdd function, Beyond Scala within the JVM
- RDDs (resilient distributed datasets), To Be a Spark Expert You Have to Learn a Little Scala Anyway, Spark Components, Spark Model of Parallel Computing: RDDs-Wide Versus Narrow Dependencies
- changing partitioning of, Partitioners and Key/Value Data
- computing difference between, Computing RDD Difference
- converting between Scala and Java, Beyond Scala within the JVM
- converting to data formats for use over pipe interface, Using Pipe and Friends
- converting to/from Datasets, Interoperability with RDDs, DataFrames, and Local Collections
- data storage space, DataFrame versus, Tungsten
- DataFrames and Datasets versus, DataFrames, Datasets, and Spark SQL
- DStreams, Stream Processing with Spark
- functions on, transformations versus actions, Functions on RDDs: Transformations Versus Actions
- immutability and the RDD interface, Immutability and the RDD Interface
- in-memory persistence and memory management, In-Memory Persistence and Memory Management
- joins, Core Spark Joins-Partial manual broadcast hash join
- lazy evaluation of, Lazy Evaluation
- mock RDDs for use in testing, Mocking RDDs
- operations with multiple RDDs and key/value data, Multiple RDD Operations
- performance, DataFrames versus, DataFrames, Datasets, and Spark SQL
- PySpark, PySpark RDDs
- reading and writing in Spark SQL, RDDs
- returned by transformations, types of, What Type of RDD Does Your Transformation Return?
- reusing, Reusing RDDs-Interaction with Accumulators
- round-tripping through to cut query plans, Large Query Plans and Iterative Algorithms
- sampling, Sampling
- testing transformations, Regular Spark jobs (testing with RDDs)
- transformations, Effective Transformations
- (see also transformations)
- types of, Types of RDDs
- wide versus narrow dependencies, Wide Versus Narrow Dependencies
- readStream function, Data sources
- receivers, Receivers
- recommendation algorithms, ML library, Spark ML Organization and Imports, Spark ML Models
- recomputing RDDs, deciding if it is inexpensive enough, Deciding if Recompute Is Inexpensive Enough
- record type information in RDDs, What Type of RDD Does Your Transformation Return?
- reduce function, To Be a Spark Expert You Have to Learn a Little Scala Anyway, Functions on RDDs: Transformations Versus Actions
- reduceByKey function, Wide Versus Narrow Dependencies, Speeding up joins by assigning a known partitioner, Partial manual broadcast hash join
- reduceByKeyAndWindow function, Considerations for DStreams
- registerJavaFunction, Accessing the backing Java objects and mixing Scala code
- regression algorithms
- relational transformations (Datasets), Relational Transformations
- repartition function, Wide Versus Narrow Dependencies, The Special Case of coalesce, Partitioners and Key/Value Data
- repartitionAndSortWithinPartitions function, Dictionary of OrderedRDDOperations
- replication (storage level), Persist and cache
- reset function, Accumulators
- resetAndCopy function, Accumulators
- resource allocation, Resource Allocation Across Applications, Basic Spark Core Settings: How Many Resources to Allocate to the Spark Application?-Number and Size of Partitions
- right outer joins, Choosing a Join Type, DataFrame Joins
- row equality, checking DataFrames for, Testing DataFrames
- Row objects
S
- sample function, Functions on RDDs: Transformations Versus Actions, Sampling
- sampleByKey function, Dictionary of Mapping and Partitioning Functions PairRDDFunctions, Sampling
- sampleByKeyExact function, Sampling
- sampling, Sampling
- save modes (Spark SQL), Save Modes
- Saveable trait, Saveable (internal format)
- saveAsObjectFile function, Functions on RDDs: Transformations Versus Actions
- saveAsSequenceFile function, Functions on RDDs: Transformations Versus Actions
- saveAsTextFile function, Functions on RDDs: Transformations Versus Actions
- SBT
- sbt-spark-package plug-in, Managing Spark Dependencies, Creating a Spark Package
- Scala, Why Scala?, Spark Components
- advantages for Spark development, To Be a Spark Expert You Have to Learn a Little Scala Anyway
- flatMap operation on iterators and collections, Goldilocks Version 0: Iterative Solution
- learning, resources for, Learning Scala
- quasiquotes, Code Generation
- RDDs in, converting to/from Java, Beyond Scala within the JVM
- reasons not to use for Spark development, Why Not Scala?
- simple Scala JNI, JNI
- Spark SQL Scala operators, Simple DataFrame transformations and SQL expressions
- type parameters, syntax of, What Type of RDD Does Your Transformation Return?
- ScalaCheck, property checking with, Property Checking with ScalaCheck-Computing RDD Difference
- scaling features in MLlib, Feature Scaling and Selection
- schema function, Basics of Schemas
- schemas
- adding schema information to data converted from RDDs to DataFrames, RDDs
- additional schema information in Datasets and DataFrames, DataFrames, Datasets, and Spark SQL
- DataFrames, working with as RDDs, What Type of RDD Does Your Transformation Return?
- inferring the schema from JSON data, Avoiding Hive JARs, JSON
- sampling schema inference, streaming and, Data sources
- Spark SQL, basics of, Basics of Schemas-DataFrame API
- specifying schema for local collection conversion to DataFrame, Local collections
- secondary sort and repartitionAndSortWithinPartitions, Secondary Sort and repartitionAndSortWithinPartitions-Performance
- select operator, Simple DataFrame transformations and SQL expressions
- selecting features in MLlib, Feature Scaling and Selection
- self joins, Self joins
- serialization, Persist and cache
- serialization/deserialization
- set-like operations
- setParameterName function, Explain Params
- settings (see configuration settings)
- setup overhead, reducing, Reducing Setup Overhead-Accumulators
- shared variables, Shared Variables
- shuffle files, Implications for Performance
- shuffle joins, Core Spark Joins
- shuffled hash joins, Choosing an Execution Plan
- ShuffleDependency object, Immutability and the RDD Interface, Wide Versus Narrow Dependencies
- ShuffledRDD class, Types of RDDs
- shuffles, Wide Versus Narrow Dependencies, Narrow Versus Wide Transformations
- sinks, custom, for Structured Streaming, Custom sinks
- slideDuration, Considerations for DStreams
- sort function, Wide Versus Narrow Dependencies
- sortBy function, Goldilocks Version 0: Iterative Solution
- sortByKey function, Dictionary of OrderedRDDOperations, Sort and count values on each partition
- sorting
- sources (see data sources)
- Spark
- about, What Is Spark and Why Performance Matters
- components, Spark Components
- design principles, How Spark Works
- in big data ecosystem, How Spark Fits into the Big Data Ecosystem
- job scheduling, Spark Job Scheduling-The Anatomy of a Spark Job
- jobs, anatomy of, The Anatomy of a Spark Job-Tasks
- libraries, Spark Components
- model of parallel computing, RDDs, Spark Model of Parallel Computing: RDDs-Wide Versus Narrow Dependencies
- performance, importance of, What Is Spark and Why Performance Matters
- Scala and, Why Scala?
- versions, Spark Versions
- Spark Core, Spark Components
- Spark Jobserver, Projects for Verifying Performance
- Spark Packages, PySpark dependency management
- Spark SQL, Spark Components
- components being built on top of, Spark Components and Packages
- data loading and saving functions, Data Loading and Saving Functions
- data representation in DataFrames and Datasets, Data Representation in DataFrames and Datasets
- DataFrame API, DataFrame API-Data Representation in DataFrames and Datasets
- DataFrames and Datasets, DataFrames, Datasets, and Spark SQL
- Datasets, Datasets-Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs)
- debugging queries, Debugging Spark SQL Queries
- dependencies, Spark SQL Dependencies-Avoiding Hive JARs
- extending with user-defined functions and user-defined aggregate functions, Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs)
- getting started with SparkSession, Getting Started with the SparkSession (or HiveContext or SQLContext)
- JDBC/ODBC server, JDBC/ODBC Server
- joins, Spark SQL Joins-Dataset Joins
- performance in Python, PySpark DataFrames and Datasets
- query optimizer, Query Optimizer-Debugging Spark SQL Queries
- Scala and Java interoperability, Beyond Scala within the JVM
- schemas, Basics of Schemas-DataFrame API
- SQLContext and HiveContext entry points, Getting Started with the SparkSession (or HiveContext or SQLContext)
- windowing, Windowing
- Spark Streaming, Spark Components
- spark-perf package, Projects for Verifying Performance
- spark-sql-perf project, Generating Large Datasets
- spark-validator project, Job Validation
- Spark.jl (Julia Spark), Spark.jl (Julia Spark)
- SparkConf object, The Spark Application
- SparkContext, Immutability and the RDD Interface
- sparkling, Beyond Scala within the JVM
- SparkListener, Spark Counters for Verifying Performance
- Sparklyr library, How SparkR Works
- SparkR, How SparkR Works-Spark.jl (Julia Spark)
- SparkSession, Immutability and the RDD Interface
- SparkVector, Getting Started with MLlib (Organization and Imports), Predicting
- sparse vectors, creating, Working with Spark vectors
- SQL
- SQLContext, Getting Started with the SparkSession (or HiveContext or SQLContext)
- stages, Stages
- static allocation of resources, Resource Allocation Across Applications
- status function, StreamingQuery, Stream status and debugging
- storage levels, Persist and cache
- straggler tasks, Working with Key/Value Data
- stratified sampling, Sampling
- stream processing with Spark, Stream Processing with Spark-GraphX
- streaming (see stream processing with Spark; Spark Streaming)
- StreamingActionBase class, Streaming
- StreamingSuiteBase class, Streaming
- StringIndexer, Data Encoding
- StringIndexerModel, Data Encoding
- strings
- StructField case class, Basics of Schemas
- StructType case class, Basics of Schemas
- Structured Streaming, Stream Processing with Spark
- subtract function, Set Operations
- supervised learning
- SWIG, writing wrappers with, JNI
T
- Tachyon, Persist and cache, Alluxio (nee Tachyon)
- take function, Functions on RDDs: Transformations Versus Actions
- tasks, The Spark Application
- TaskScheduler, The DAG, Tasks
- testing, Testing and Validation-Conclusion
- Testing Spark: Best Practices (speech), Projects for Verifying Performance
- text files, saving DStreams as, Output operations
- textual data, encoding for features
- this.type, Reusing Existing Objects
- toDebugString function, Types of RDDs
- tokenizer, using with HashingTF, Data Encoding
- toLocalIterator function, Regular Spark jobs (testing with RDDs)
- Toree, How Eclair JS Works
- training
- transform function, Pipeline Stages, Explain Params
- transformations, Effective Transformations-Conclusion
- actions versus, on RDDs, Functions on RDDs: Transformations Versus Actions
- DataFrame, Transformations-Sorting
- easier functional transformations with Datasets, Easier Functional (RDD “like”) Transformations
- in sinks, Custom sinks
- iterator-to-iterator, with mapPartitions, Iterator-to-Iterator Transformations with mapPartitions-Set Operations
- methods for improving performance of, A Different Approach to Goldilocks
- minimizing object creation, Minimizing Object Creation
- multi-Dataset relational transformations, Multi-Dataset Relational Transformations
- narrow versus wide dependencies, Wide Versus Narrow Dependencies, Narrow Versus Wide Transformations-The Special Case of coalesce
- preserving partitioning information across, Preserving Partitioning Information Across Transformations
- reducing setup overhead, Reducing Setup Overhead-Accumulators
- relational transformations with Datasets, Relational Transformations
- reusing RDDs, Reusing RDDs-Interaction with Accumulators
- stage boundaries, Tasks
- testing, Regular Spark jobs (testing with RDDs)
- types of RDD returned by, What Type of RDD Does Your Transformation Return?
- Transformer interface, Extending Spark ML Pipelines with Your Own Algorithms
- transformers, Pipeline Stages
- transformSchema function, Custom transformers
- tree algorithms (MLlib), Getting Started with MLlib (Organization and Imports)
- treeAggregate function, Preventing out-of-memory errors with aggregation operations
- Tungsten, Tungsten, The Future, Serialization Options
- tuning and cluster sizing (see configuration settings)
- tuples
- types