Getting ready
by Tomasz Drabas, Denny Lee
PySpark Cookbook
Getting ready
Ensure that you have created the graph GraphFrame from the preceding subsections.
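Before continuing, it can help to confirm that the graph object is actually in scope and has the shape a GraphFrame expects. The following is only a minimal sketch, not code from the book: the helper name check_graph_ready is hypothetical, and it relies solely on the fact that a GraphFrame exposes .vertices and .edges DataFrames, with an id column on the vertices and src and dst columns on the edges.

```python
def check_graph_ready(graph):
    """Return a list of problems; an empty list means the graph looks usable.

    Hypothetical helper: it only inspects the .vertices and .edges
    attributes and their column names, so it does not trigger a Spark job.
    """
    problems = []
    # GraphFrames expects an "id" column on vertices and "src"/"dst" on edges.
    for attr, required in (("vertices", {"id"}), ("edges", {"src", "dst"})):
        frame = getattr(graph, attr, None)
        if frame is None:
            problems.append(f"missing .{attr}")
            continue
        missing = required - set(frame.columns)
        if missing:
            problems.append(f".{attr} lacks column(s): {sorted(missing)}")
    return problems
```

In a notebook where graph was built in the earlier recipes, calling check_graph_ready(graph) should return an empty list. A NameError on graph itself means the preceding subsections have not been run in the current session.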