Table of Contents for Large Scale Machine Learning with Spark
Large Scale Machine Learning with Spark
by Md. Mahedi Kaysar, Md. Rezaul Karim
Credits
About the Authors
About the Reviewer
www.Packtpub.com
Why subscribe?
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Introduction to Data Analytics with Spark
Spark overview
Spark basics
Beauties of Spark
New computing paradigm with Spark
Traditional distributed computing
Moving code to the data
RDD – a new computing paradigm
Spark ecosystem
Spark core engine
Spark SQL
DataFrames and datasets unification
Spark streaming
Graph computation – GraphX
Machine learning and Spark ML pipelines
Statistical computation – SparkR
Spark machine learning libraries
Machine learning with Spark
Spark MLlib
Data types
Basic statistics
Classification and regression
Recommender system development
Clustering
Dimensionality reduction
Feature extraction and transformation
Frequent pattern mining
Spark ML
Installing and getting started with Spark
Packaging your application with dependencies
Running a sample machine learning application
Running a Spark application from the Spark shell
Running a Spark application on the local cluster
Running a Spark application on the EC2 cluster
References
Summary
2. Machine Learning Best Practices
What is machine learning?
Machine learning in modern literature
Machine learning and computer science
Machine learning in statistics and data analytics
Typical machine learning workflow
Machine learning tasks
Supervised learning
Unsupervised learning
Reinforcement learning
Recommender system
Semi-supervised learning
Practical machine learning problems
Machine learning classes
Classification and clustering
Rule extraction and regression
Most widely used machine learning problems
Large scale machine learning APIs in Spark
Spark machine learning libraries
Spark MLlib
Spark ML
Important notes for practitioners
Practical machine learning best practices
Best practice before developing an ML application
Good machine learning and data science worth huge
Best practice – feature engineering and algorithmic performance
Beware of overfitting and underfitting
Stay tuned and combining Spark MLlib with Spark ML
Making ML applications modular and simplifying pipeline synthesis
Thinking of an innovative ML system
Thinking and becoming smarter about Big Data complexities
Applying machine learning to dynamic data
Best practice after developing an ML application
How to enable real-time ML visualization
Do some error analysis
Keeping your ML application tuned
Keeping your ML application adaptive and scale-up
Choosing the right algorithm for your application
Considerations when choosing an algorithm
Accuracy
Training time
Linearity
Talking to your data when choosing an algorithm
Number of parameters
How large is your training set?
Number of features
Special notes on widely used ML algorithms
Logistic regression and linear regression
Recommendation systems
Decision trees
Random forests
Decision forests, decision jungles, and variants
Bayesian methods
Summary
3. Understanding the Problem by Understanding the Data
Analyzing and preparing your data
Data preparation process
Data selection
Data pre-processing
Data transformation
Resilient Distributed Dataset basics
Reading the Datasets
Reading from files
Reading from a text file
Reading multiple text files from a directory
Reading from existing collections
Pre-processing with RDD
Getting insight from the SMSSpamCollection dataset
Working with the key/value pair
mapToPair()
More about transformation
map and flatMap
groupByKey, reduceByKey, and aggregateByKey
sortByKey and sortBy
Dataset basics
Reading datasets to create the Dataset
Reading from the files
Reading from the Hive
Pre-processing with Dataset
More about Dataset manipulation
Running SQL queries on Dataset
Creating Dataset from the Java Bean
Dataset from string and typed class
Comparison between RDD, DataFrame and Dataset
Spark and data scientists workflow
Deeper into Spark
Shared variables
Broadcast variables
Accumulators
Summary
4. Extracting Knowledge through Feature Engineering
The state of the art of feature engineering
Feature extraction versus feature selection
Importance of feature engineering
Feature engineering and data exploration
Feature extraction – creating features out of data
Feature selection – filtering features from data
Importance of feature selection
Feature selection versus dimensionality reduction
Best practices in feature engineering
Understanding the data
Innovative way of feature extraction
Feature engineering with Spark
Machine learning pipeline – an overview
Pipeline – an example with Spark ML
Feature transformation, extraction, and selection
Transformation – RegexTokenizer
Transformation – StringIndexer
Transformation – StopWordsRemover
Extraction – TF
Extraction – IDF
Selection – ChiSqSelector
Advanced feature engineering
Feature construction
Feature learning
Iterative process of feature engineering
Deep learning
Summary
5. Supervised and Unsupervised Learning by Examples
Machine learning classes
Supervised learning
Supervised learning example
Supervised learning with Spark - an example
Air-flight delay analysis using Spark
Loading and parsing the Dataset
Feature extraction
Preparing the training and testing set
Training the model
Testing the model
Unsupervised learning
Unsupervised learning example
Unsupervised learning with Spark - an example
K-means clustering of the neighborhood
Recommender system
Collaborative filtering in Spark
Advanced learning and generalizations
Generalizations of supervised learning
Summary
6. Building Scalable Machine Learning Pipelines
Spark machine learning pipeline APIs
Dataset abstraction
Pipeline
Cancer-diagnosis pipeline with Spark
Breast-cancer-diagnosis pipeline with Spark
Background study
Dataset collection
Dataset description and preparation
Problem formalization
Developing a cancer-diagnosis pipeline with Spark ML
Cancer-prognosis pipeline with Spark
Dataset exploration
Breast-cancer-prognosis pipeline with Spark ML/MLlib
Market basket analysis with Spark Core
Background
Motivations
Exploring the dataset
Problem statements
Large-scale market basket analysis using Spark
The algorithm solution using Spark Core
Tuning and setting the correct parameters in SAMBA
OCR pipeline with Spark
Exploring and preparing the data
OCR pipeline with Spark ML and Spark MLlib
Topic modeling using Spark MLlib and ML
Topic modeling with Spark MLlib
Scalability
Credit risk analysis pipeline with Spark
What is credit risk analysis? Why is it important?
Developing a credit risk analysis pipeline with Spark ML
The dataset exploration
Credit risk pipeline with Spark ML
Performance tuning and suggestions
Scaling the ML pipelines
Size matters
Size versus skewness considerations
Cost and infrastructure
Tips and performance considerations
Summary
7. Tuning Machine Learning Models
Details about machine learning model tuning
Typical challenges in model tuning
Evaluating machine learning models
Evaluating a regression model
Evaluating a binary classification model
Evaluating a multiclass classification model
Evaluating a clustering model
Validation and evaluation techniques
Parameter tuning for machine learning models
Hyperparameter tuning
Grid search parameter tuning
Random search parameter tuning
Cross-validation
Hypothesis testing
Hypothesis testing using ChiSqTestResult of Spark MLlib
Hypothesis testing using the Kolmogorov–Smirnov test from Spark MLlib
Streaming significance testing of Spark MLlib
Machine learning model selection
Model selection via the cross-validation technique
Cross-validation and Spark
Cross-validation using Spark ML for SPAM filtering a dataset
Model selection via training validation split
Linear regression-based model selection for an OCR dataset
Logistic regression-based model selection for the cancer dataset
Summary
8. Adapting Your Machine Learning Models
Adapting machine learning models
Technical overview
The generalization of ML models
Generalized linear regression
Generalized linear regression with Spark
Adapting through incremental algorithms
Incremental support vector machine
Adapting SVMs for new data with Spark
Incremental neural networks
Multilayer perceptron classification with Spark
Incremental Bayesian networks
Classification using Naive Bayes with Spark
Adapting through reusing ML models
Problem statements and objectives
Data exploration
Developing a heart diseases predictive model
Machine learning in dynamic environments
Online learning
Statistical learning model
Adversarial model
Summary
9. Advanced Machine Learning with Streaming and Graph Data
Developing real-time ML pipelines
Streaming data collection as unstructured text data
Labeling the data towards making the supervised machine learning
Creating and building the model
Real-time predictive analytics
Tuning the ML model for improvement and model evaluation
Model adaptability and deployment
Time series and social network analysis
Time series analysis
Social network analysis
Movie recommendation using Spark
Model-based movie recommendation using Spark MLlib
Data exploration
Movie recommendation using Spark MLlib
Developing a real-time ML pipeline from streaming
Real-time tweet data collection from Twitter
Tweet collection using TwitterUtils API of Spark
Topic modeling using Spark
ML pipeline on graph data and semi-supervised graph-based learning
Introduction to GraphX
Getting and parsing graph data using the GraphX API
Finding the connected components
Summary
10. Configuring and Working with External Libraries
Third-party ML libraries with Spark
Using external libraries with Spark Core
Time series analysis using the Cloudera Spark-TS package
Time series data
Configuring Spark-TS
TimeSeriesRDD
Configuring SparkR with RStudio
Configuring Hadoop run-time on Windows
Summary