Practical Machine Learning with R

Copyright © 2019 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Authors: Brindha Priyadarshini Jeyaraman, Ludvig Renbo Olsen, and Monicah Wambugu

Technical Reviewers: Anil Kumar and Rohan Chikorde

Managing Editors: Steffi Monterio and Snehal Tambe

Acquisitions Editor: Koushik Sen

Production Editor: Samita Warang

Editorial Board: Shubhopriya Banerjee, Mayank Bhardwaj, Ewan Buckingham, Mahesh Dhyani, Taabish Khan, Manasa Kumar, Alex Mazonowicz, Pramod Menon, Bridget Neale, Dominic Pereira, Shiny Poojary, Erol Staveley, Ankita Thakur, Nitesh Thakur, and Jonathan Wray

First Published: August 2019

Production Reference: 1300819

ISBN: 978-1-83855-013-4

Published by Packt Publishing Ltd.

Livery Place, 35 Livery Street

Birmingham B3 2PB, UK

Table of Contents

Preface 

Chapter 1: An Introduction to Machine Learning

Introduction

The Machine Learning Process

Raw Data

Data Pre-Processing

The Data Splitting Process

The Training Process

Evaluation Process

Deployment Process

Process Flow for Making Predictions

Introduction to R

Exercise 1: Reading from a CSV File in RStudio

Exercise 2: Performing Operations on a Dataframe

Exploratory Data Analysis (EDA)

Viewing Built-in Datasets in R

Exercise 3: Loading Built-in Datasets

Exercise 4: Viewing Summaries of Data

Visualizing the Data

Activity 1: Finding the Distribution of Diabetic Patients in the PimaIndiansDiabetes Dataset

Activity 2: Grouping the PimaIndiansDiabetes Data

Activity 3: Performing EDA on the PimaIndiansDiabetes Dataset

Machine Learning Models

Types of Prediction

Supervised Learning

Unsupervised Learning

Applications of Machine Learning

Regression

Exercise 5: Building a Linear Classifier in R

Activity 4: Building Linear Models for the GermanCredit Dataset

Activity 5: Using Multiple Variables for a Regression Model for the Boston Housing Dataset

Summary

Chapter 2: Data Cleaning and Pre-processing

Introduction

Advanced Operations on Data Frames

Exercise 6: Sorting the Data Frame

Join Operations

Pre-Processing of Data Frames

Exercise 7: Centering Variables

Exercise 8: Normalizing the Variables

Exercise 9: Scaling the Variables

Activity 6: Centering and Scaling the Variables

Extracting the Principal Components

Exercise 10: Extracting the Principal Components

Subsetting Data

Exercise 11: Subsetting a Data Frame

Transposing Data

Identifying the Input and Output Variables

Identifying the Category of Prediction

Handling Missing Values, Duplicates, and Outliers

Handling Missing Values

Exercise 12: Identifying the Missing Values

Techniques for Handling Missing Values

Exercise 13: Imputing Using the MICE Package

Exercise 14: Performing Predictive Mean Matching

Handling Duplicates

Exercise 15: Identifying Duplicates

Techniques Used to Handle Duplicate Values

Handling Outliers

Exercise 16: Identifying Outlier Values

Techniques Used to Handle Outliers

Exercise 17: Predicting Values to Handle Outliers

Handling Missing Data

Exercise 18: Handling Missing Values

Activity 7: Identifying Outliers

Pre-Processing Categorical Data

Handling Imbalanced Datasets

Undersampling

Exercise 19: Undersampling a Dataset

Oversampling

Exercise 20: Oversampling

ROSE

Exercise 21: Oversampling Using ROSE

SMOTE

Exercise 22: Implementing the SMOTE Technique

Activity 8: Oversampling and Undersampling Using SMOTE

Activity 9: Sampling and Oversampling Using ROSE

Summary

Chapter 3: Feature Engineering

Introduction

Types of Features

Datatype-Based Features

Date and Time Features

Exercise 23: Creating Date Features

Exercise 24: Creating Time Features

Time Series Features

Exercise 25: Binning

Activity 10: Creating Time Series Features – Binning

Summary Statistics

Exercise 26: Finding Description of Features

Standardizing and Rescaling

Handling Categorical Variables

Skewness

Exercise 27: Computing Skewness

Activity 11: Identifying Skewness

Reducing Skewness Using Log Transform

Exercise 28: Using Log Transform

Derived Features or Domain-Specific Features

Adding Features to a Data Frame

Exercise 29: Adding a New Column to an R Data Frame

Handling Redundant Features

Exercise 30: Identifying Redundant Features

Text Features

Exercise 31: Automatically Generating Text Features

Feature Selection

Correlation Analysis

Exercise 32: Plotting Correlation between Two Variables

P-Value

Exercise 33: Calculating the P-Value

Recursive Feature Elimination

Exercise 34: Implementing Recursive Feature Elimination

PCA

Exercise 35: Implementing PCA

Activity 12: Generating PCA

Ranking Features

Variable Importance Approach with Learning Vector Quantization

Exercise 36: Implementing LVQ

Variable Importance Approach Using Random Forests

Exercise 37: Finding Variable Importance in the PimaIndiansDiabetes Dataset

Activity 13: Implementing the Random Forest Approach

Variable Importance Approach Using a Logistic Regression Model

Exercise 38: Implementing the Logistic Regression Model

Determining Variable Importance Using rpart

Exercise 39: Variable Importance Using rpart for the PimaIndiansDiabetes Data

Activity 14: Selecting Features Using Variable Importance

Summary

Chapter 4: Introduction to neuralnet and Evaluation Methods

Introduction

Classification

Binary Classification

Exercise 40: Preparing the Dataset

Balanced Partitioning Using the groupdata2 Package

Exercise 41: Partitioning the Dataset

Exercise 42: Creating Balanced Partitions

Leakage

Exercise 43: Ensuring an Equal Number of Observations Per Class

Standardizing

Neural Networks with neuralnet

Activity 15: Training a Neural Network

Model Selection

Evaluation Metrics

Accuracy

Precision

Recall

Exercise 44: Creating a Confusion Matrix

Exercise 45: Creating Baseline Evaluations

Overfitting and Underfitting

Adding Layers and Nodes in neuralnet

Cross-Validation

Creating Folds

Exercise 46: Writing a Cross-Validation Training Loop

Activity 16: Training and Comparing Neural Network Architectures

Activity 17: Training and Comparing Neural Network Architectures with Cross-Validation

Multiclass Classification Overview

Summary

Chapter 5: Linear and Logistic Regression Models

Introduction

Regression

Linear Regression

Exercise 47: Training Linear Regression Models

R²

Exercise 48: Plotting Model Predictions

Exercise 49: Incrementally Adding Predictors

Comparing Linear Regression Models

Evaluation Metrics

MAE

RMSE

Differences between MAE and RMSE

Exercise 50: Comparing Models with the cvms Package

Interactions

Exercise 51: Adding Interaction Terms to Our Model

Should We Standardize Predictors?

Repeated Cross-Validation

Exercise 52: Running Repeated Cross-Validation

Exercise 53: Validating Models with validate()

Activity 18: Implementing Linear Regression

Log-Transforming Predictors

Exercise 54: Log-Transforming Predictors

Logistic Regression

Exercise 55: Training Logistic Regression Models

Exercise 56: Creating Binomial Baseline Evaluations with cvms

Exercise 57: Creating Gaussian Baseline Evaluations with cvms

Regression and Classification with Decision Trees

Exercise 58: Training Random Forest Models

Model Selection by Multiple Disagreeing Metrics

Pareto Dominance

Exercise 59: Plotting the Pareto Front

Activity 19: Classifying Room Types

Summary

Chapter 6: Unsupervised Learning

Introduction

Overview of Unsupervised Learning (Clustering)

Hard versus Soft Clusters

Flat versus Hierarchical Clustering

Monothetic versus Polythetic Clustering

Exercise 60: Monothetic and Hierarchical Clustering on a Binary Dataset

DIANA

Exercise 61: Implementing Hierarchical Clustering Using DIANA

AGNES

Exercise 62: Agglomerative Clustering Using AGNES

Distance Metrics in Clustering

Exercise 63: Calculating Dissimilarity Matrices Using Euclidean and Manhattan Distance

Correlation-Based Distance Metrics

Exercise 64: Applying Correlation-Based Metrics

Applications of Clustering

k-means Clustering

Exploratory Data Analysis Using Scatter Plots

The Elbow Method

Exercise 65: Implementation of k-means Clustering in R

Activity 20: Performing DIANA, AGNES, and k-means on the Built-In Motor Car Dataset

Summary

Appendix   
