Practical Machine Learning with R

Copyright © 2019 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Authors: Brindha Priyadarshini Jeyaraman, Ludvig Renbo Olsen, and Monicah Wambugu

Technical Reviewers: Anil Kumar and Rohan Chikorde

Managing Editors: Steffi Monterio and Snehal Tambe

Acquisitions Editor: Koushik Sen

Production Editor: Samita Warang

Editorial Board: Shubhopriya Banerjee, Mayank Bhardwaj, Ewan Buckingham, Mahesh Dhyani, Taabish Khan, Manasa Kumar, Alex Mazonowicz, Pramod Menon, Bridget Neale, Dominic Pereira, Shiny Poojary, Erol Staveley, Ankita Thakur, Nitesh Thakur, and Jonathan Wray

First Published: August 2019

Production Reference: 1300819

ISBN: 978-1-83855-013-4

Published by Packt Publishing Ltd.

Livery Place, 35 Livery Street

Birmingham B3 2PB, UK

Table of Contents

Preface 

Chapter 1: An Introduction to Machine Learning

Introduction

The Machine Learning Process

Raw Data

Data Pre-Processing

The Data Splitting Process

The Training Process

Evaluation Process

Deployment Process

Process Flow for Making Predictions

Introduction to R

Exercise 1: Reading from a CSV File in RStudio

Exercise 2: Performing Operations on a Dataframe

Exploratory Data Analysis (EDA)

Viewing Built-in Datasets in R

Exercise 3: Loading Built-in Datasets

Exercise 4: Viewing Summaries of Data

Visualizing the Data

Activity 1: Finding the Distribution of Diabetic Patients in the PimaIndiansDiabetes Dataset

Activity 2: Grouping the PimaIndiansDiabetes Data

Activity 3: Performing EDA on the PimaIndiansDiabetes Dataset

Machine Learning Models

Types of Prediction

Supervised Learning

Unsupervised Learning

Applications of Machine Learning

Regression

Exercise 5: Building a Linear Classifier in R

Activity 4: Building Linear Models for the GermanCredit Dataset

Activity 5: Using Multiple Variables for a Regression Model for the Boston Housing Dataset

Summary

Chapter 2: Data Cleaning and Pre-processing

Introduction

Advanced Operations on Data Frames

Exercise 6: Sorting the Data Frame

Join Operations

Pre-Processing of Data Frames

Exercise 7: Centering Variables

Exercise 8: Normalizing the Variables

Exercise 9: Scaling the Variables

Activity 6: Centering and Scaling the Variables

Extracting the Principal Components

Exercise 10: Extracting the Principal Components

Subsetting Data

Exercise 11: Subsetting a Data Frame

Transposing Data

Identifying the Input and Output Variables

Identifying the Category of Prediction

Handling Missing Values, Duplicates, and Outliers

Handling Missing Values

Exercise 12: Identifying the Missing Values

Techniques for Handling Missing Values

Exercise 13: Imputing Using the MICE Package

Exercise 14: Performing Predictive Mean Matching

Handling Duplicates

Exercise 15: Identifying Duplicates

Techniques Used to Handle Duplicate Values

Handling Outliers

Exercise 16: Identifying Outlier Values

Techniques Used to Handle Outliers

Exercise 17: Predicting Values to Handle Outliers

Handling Missing Data

Exercise 18: Handling Missing Values

Activity 7: Identifying Outliers

Pre-Processing Categorical Data

Handling Imbalanced Datasets

Undersampling

Exercise 19: Undersampling a Dataset

Oversampling

Exercise 20: Oversampling

ROSE

Exercise 21: Oversampling Using ROSE

SMOTE

Exercise 22: Implementing the SMOTE Technique

Activity 8: Oversampling and Undersampling Using SMOTE

Activity 9: Sampling and Oversampling Using ROSE

Summary

Chapter 3: Feature Engineering

Introduction

Types of Features

Datatype-Based Features

Date and Time Features

Exercise 23: Creating Date Features

Exercise 24: Creating Time Features

Time Series Features

Exercise 25: Binning

Activity 10: Creating Time Series Features – Binning

Summary Statistics

Exercise 26: Finding Description of Features

Standardizing and Rescaling

Handling Categorical Variables

Skewness

Exercise 27: Computing Skewness

Activity 11: Identifying Skewness

Reducing Skewness Using Log Transform

Exercise 28: Using Log Transform

Derived Features or Domain-Specific Features

Adding Features to a Data Frame

Exercise 29: Adding a New Column to an R Data Frame

Handling Redundant Features

Exercise 30: Identifying Redundant Features

Text Features

Exercise 31: Automatically Generating Text Features

Feature Selection

Correlation Analysis

Exercise 32: Plotting Correlation between Two Variables

P-Value

Exercise 33: Calculating the P-Value

Recursive Feature Elimination

Exercise 34: Implementing Recursive Feature Elimination

PCA

Exercise 35: Implementing PCA

Activity 12: Generating PCA

Ranking Features

Variable Importance Approach with Learning Vector Quantization

Exercise 36: Implementing LVQ

Variable Importance Approach Using Random Forests

Exercise 37: Finding Variable Importance in the PimaIndiansDiabetes Dataset

Activity 13: Implementing the Random Forest Approach

Variable Importance Approach Using a Logistic Regression Model

Exercise 38: Implementing the Logistic Regression Model

Determining Variable Importance Using rpart

Exercise 39: Variable Importance Using rpart for the PimaIndiansDiabetes Data

Activity 14: Selecting Features Using Variable Importance

Summary

Chapter 4: Introduction to neuralnet and Evaluation Methods

Introduction

Classification

Binary Classification

Exercise 40: Preparing the Dataset

Balanced Partitioning Using the groupdata2 Package

Exercise 41: Partitioning the Dataset

Exercise 42: Creating Balanced Partitions

Leakage

Exercise 43: Ensuring an Equal Number of Observations Per Class

Standardizing

Neural Networks with neuralnet

Activity 15: Training a Neural Network

Model Selection

Evaluation Metrics

Accuracy

Precision

Recall

Exercise 44: Creating a Confusion Matrix

Exercise 45: Creating Baseline Evaluations

Overfitting and Underfitting

Adding Layers and Nodes in neuralnet

Cross-Validation

Creating Folds

Exercise 46: Writing a Cross-Validation Training Loop

Activity 16: Training and Comparing Neural Network Architectures

Activity 17: Training and Comparing Neural Network Architectures with Cross-Validation

Multiclass Classification Overview

Summary

Chapter 5: Linear and Logistic Regression Models

Introduction

Regression

Linear Regression

Exercise 47: Training Linear Regression Models

R²

Exercise 48: Plotting Model Predictions

Exercise 49: Incrementally Adding Predictors

Comparing Linear Regression Models

Evaluation Metrics

MAE

RMSE

Differences between MAE and RMSE

Exercise 50: Comparing Models with the cvms Package

Interactions

Exercise 51: Adding Interaction Terms to Our Model

Should We Standardize Predictors?

Repeated Cross-Validation

Exercise 52: Running Repeated Cross-Validation

Exercise 53: Validating Models with validate()

Activity 18: Implementing Linear Regression

Log-Transforming Predictors

Exercise 54: Log-Transforming Predictors

Logistic Regression

Exercise 55: Training Logistic Regression Models

Exercise 56: Creating Binomial Baseline Evaluations with cvms

Exercise 57: Creating Gaussian Baseline Evaluations with cvms

Regression and Classification with Decision Trees

Exercise 58: Training Random Forest Models

Model Selection by Multiple Disagreeing Metrics

Pareto Dominance

Exercise 59: Plotting the Pareto Front

Activity 19: Classifying Room Types

Summary

Chapter 6: Unsupervised Learning

Introduction

Overview of Unsupervised Learning (Clustering)

Hard versus Soft Clusters

Flat versus Hierarchical Clustering

Monothetic versus Polythetic Clustering

Exercise 60: Monothetic and Hierarchical Clustering on a Binary Dataset

DIANA

Exercise 61: Implementing Hierarchical Clustering Using DIANA

AGNES

Exercise 62: Agglomerative Clustering Using AGNES

Distance Metrics in Clustering

Exercise 63: Calculating Dissimilarity Matrices Using Euclidean and Manhattan Distance

Correlation-Based Distance Metrics

Exercise 64: Applying Correlation-Based Metrics

Applications of Clustering

k-means Clustering

Exploratory Data Analysis Using Scatter Plots

The Elbow Method

Exercise 65: Implementation of k-means Clustering in R

Activity 20: Performing DIANA, AGNES, and k-means on the Built-In Motor Car Dataset

Summary

Appendix   
