

Understand how machine learning works and gain hands-on experience using R to build algorithms that solve real-world problems

Key Features

  • Gain a comprehensive overview of different machine learning techniques
  • Explore methods for selecting the most suitable algorithm for a given problem
  • Implement a machine learning project from problem definition through to the final model

Book Description

With huge amounts of data being generated every moment, businesses need applications that apply complex mathematical calculations to data repeatedly and at speed. With machine learning techniques and R, you can develop these kinds of applications efficiently.

Practical Machine Learning with R begins by helping you grasp the basics of machine learning methods, while also highlighting how and why they work. You will understand how to get these algorithms to work in practice, rather than focusing on mathematical derivations. As you progress from one chapter to the next, you will gain hands-on experience of building a machine learning solution in R. Using R packages such as rpart, randomForest, and mice (multiple imputation by chained equations), you will learn to implement algorithms including neural network classifiers, decision trees, and linear and non-linear regression. As you progress through the book, you'll delve into various machine learning techniques for both supervised and unsupervised learning approaches. In addition, you'll learn how to partition datasets, evaluate the results of each model with appropriate metrics, and compare models against one another.

By the end of this book, you will be able to solve business problems end to end: forming a clear problem statement, selecting the most appropriate model for it, and ensuring that you do not overfit the model to your data.

What you will learn

  • Define a problem that can be solved by training a machine learning model
  • Obtain, verify, and clean data before transforming it into the correct format for use
  • Perform exploratory analysis and extract features from data
  • Build neural network, linear and non-linear regression, classification, and clustering models
  • Evaluate the performance of a model with the right metrics
  • Implement a classification problem using the neuralnet package
  • Employ a decision tree using the randomForest library

Who this book is for

If you are a data analyst, data scientist, or a business analyst who wants to understand the process of machine learning and apply it to a real dataset using R, this book is just what you need. Data scientists who use Python and want to implement their machine learning solutions using R will also find this book very useful. The book will also enable novice programmers to start their journey in data science. Basic knowledge of any programming language is all you need to get started.

Table of Contents

  1. Preface
    1. About the Book
      1. About the Authors
      2. Description
      3. Learning Objectives
      4. Audience
      5. Approach
      6. Minimum Hardware Requirements
      7. Software Requirements
      8. Conventions
      9. Installation and Setup
      10. Installing R
      11. Installing RStudio
      12. Installing Libraries
      13. Installing the Code Bundle
      14. Additional Resources
  2. Chapter 1
  3. An Introduction to Machine Learning
    1. Introduction
    2. The Machine Learning Process
      1. Raw Data
      2. Data Pre-Processing
      3. The Data Splitting Process
      4. The Training Process
      5. Evaluation Process
      6. Deployment Process
      7. Process Flow for Making Predictions
    3. Introduction to R
      1. Exercise 1: Reading from a CSV File in RStudio
      2. Exercise 2: Performing Operations on a Dataframe
      3. Exploratory Data Analysis (EDA)
      4. View Built-in Datasets in R
      5. Exercise 3: Loading Built-in Datasets
      6. Exercise 4: Viewing Summaries of Data
      7. Visualizing the Data
      8. Activity 1: Finding the Distribution of Diabetic Patients in the PimaIndiansDiabetes Dataset
      9. Activity 2: Grouping the PimaIndiansDiabetes Data
      10. Activity 3: Performing EDA on the PimaIndiansDiabetes Dataset
    4. Machine Learning Models
      1. Types of Prediction
      2. Supervised Learning
      3. Unsupervised Learning
      4. Applications of Machine Learning
    5. Regression
      1. Exercise 5: Building a Linear Classifier in R
      2. Activity 4: Building Linear Models for the GermanCredit Dataset
      3. Activity 5: Using Multiple Variables for a Regression Model for the Boston Housing Dataset
    6. Summary
  4. Chapter 2
  5. Data Cleaning and Pre-processing
    1. Introduction
    2. Advanced Operations on Data Frames
      1. Exercise 6: Sorting the Data Frame
      2. Join Operations
      3. Pre-Processing of Data Frames
      4. Exercise 7: Centering Variables
      5. Exercise 8: Normalizing the Variables
      6. Exercise 9: Scaling the Variables
      7. Activity 6: Centering and Scaling the Variables
      8. Extracting the Principal Components
      9. Exercise 10: Extracting the Principal Components
      10. Subsetting Data
      11. Exercise 11: Subsetting a Data Frame
      12. Data Transposes
    3. Identifying the Input and Output Variables
    4. Identifying the Category of Prediction
    5. Handling Missing Values, Duplicates, and Outliers
      1. Handling Missing Values
      2. Exercise 12: Identifying the Missing Values
      3. Techniques for Handling Missing Values
      4. Exercise 13: Imputing Using the MICE Package
      5. Exercise 14: Performing Predictive Mean Matching
      6. Handling Duplicates
      7. Exercise 15: Identifying Duplicates
      8. Techniques Used to Handle Duplicate Values
    6. Handling Outliers
      1. Exercise 16: Identifying Outlier Values
      2. Techniques Used to Handle Outliers
      3. Exercise 17: Predicting Values to Handle Outliers
      4. Handling Missing Data
      5. Exercise 18: Handling Missing Values
      6. Activity 7: Identifying Outliers
      7. Pre-Processing Categorical Data
      8. Handling Imbalanced Datasets
      9. Undersampling
      10. Exercise 19: Undersampling a Dataset
      11. Oversampling
      12. Exercise 20: Oversampling
      13. ROSE
      14. Exercise 21: Oversampling using ROSE
      15. SMOTE
      16. Exercise 22: Implementing the SMOTE Technique
      17. Activity 8: Oversampling and Undersampling using SMOTE
      18. Activity 9: Sampling and Oversampling using ROSE
    7. Summary
  6. Chapter 3
  7. Feature Engineering
    1. Introduction
    2. Types of Features
      1. Datatype-Based Features
      2. Date and Time Features
      3. Exercise 23: Creating Date Features
      4. Exercise 24: Creating Time Features
    3. Time Series Features
      1. Exercise 25: Binning
      2. Activity 10: Creating Time Series Features – Binning
      3. Summary Statistics
      4. Exercise 26: Finding Description of Features
      5. Standardizing and Rescaling
    4. Handling Categorical Variables
      1. Skewness
      2. Exercise 27: Computing Skewness
      3. Activity 11: Identifying Skewness
      4. Reducing Skewness Using Log Transform
      5. Exercise 28: Using Log Transform
    5. Derived Features or Domain-Specific Features
    6. Adding Features to a Data Frame
      1. Exercise 29: Adding a New Column to an R Data Frame
    7. Handling Redundant Features
      1. Exercise 30: Identifying Redundant Features
      2. Text Features
      3. Exercise 31: Automatically Generating Text Features
    8. Feature Selection
      1. Correlation Analysis
      2. Exercise 32: Plotting Correlation between Two Variables
      3. P-Value
      4. Exercise 33: Calculating the P-Value
      5. Recursive Feature Elimination
      6. Exercise 34: Implementing Recursive Feature Elimination
      7. PCA
      8. Exercise 35: Implementing PCA
      9. Activity 12: Generating PCA
      10. Ranking Features
      11. Variable Importance Approach with Learning Vector Quantization
      12. Exercise 36: Implementing LVQ
      13. Variable Importance Approach Using Random Forests
      14. Exercise 37: Finding Variable Importance in the PimaIndiansDiabetes Dataset
      15. Activity 13: Implementing the Random Forest Approach
      16. Variable Importance Approach Using a Logistic Regression Model
      17. Exercise 38: Implementing the Logistic Regression Model
      18. Determining Variable Importance Using rpart
      19. Exercise 39: Variable Importance Using rpart for the PimaIndiansDiabetes Data
      20. Activity 14: Selecting Features Using Variable Importance
    9. Summary
  8. Chapter 4
  9. Introduction to neuralnet and Evaluation Methods
    1. Introduction
    2. Classification
      1. Binary Classification
      2. Exercise 40: Preparing the Dataset
      3. Balanced Partitioning Using the groupdata2 Package
      4. Exercise 41: Partitioning the Dataset
      5. Exercise 42: Creating Balanced Partitions
      6. Leakage
      7. Exercise 43: Ensuring an Equal Number of Observations Per Class
      8. Standardizing
      9. Neural Networks with neuralnet
      10. Activity 15: Training a Neural Network
    3. Model Selection
      1. Evaluation Metrics
      2. Accuracy
      3. Precision
      4. Recall
      5. Exercise 44: Creating a Confusion Matrix
      6. Exercise 45: Creating Baseline Evaluations
      7. Over- and Underfitting
      8. Adding Layers and Nodes in neuralnet
      9. Cross-Validation
      10. Creating Folds
      11. Exercise 46: Writing a Cross-Validation Training Loop
      12. Activity 16: Training and Comparing Neural Network Architectures
      13. Activity 17: Training and Comparing Neural Network Architectures with Cross-Validation
    4. Multiclass Classification Overview
    5. Summary
  10. Chapter 5
  11. Linear and Logistic Regression Models
    1. Introduction
    2. Regression
    3. Linear Regression
      1. Exercise 47: Training Linear Regression Models
      2. R²
      3. Exercise 48: Plotting Model Predictions
      4. Exercise 49: Incrementally Adding Predictors
      5. Comparing Linear Regression Models
      6. Evaluation Metrics
      7. MAE
      8. RMSE
      9. Differences between MAE and RMSE
      10. Exercise 50: Comparing Models with the cvms Package
      11. Interactions
      12. Exercise 51: Adding Interaction Terms to Our Model
      13. Should We Standardize Predictors?
      14. Repeated Cross-Validation
      15. Exercise 52: Running Repeated Cross-Validation
      16. Exercise 53: Validating Models with validate()
      17. Activity 18: Implementing Linear Regression
      18. Log-Transforming Predictors
      19. Exercise 54: Log-Transforming Predictors
    4. Logistic Regression
      1. Exercise 55: Training Logistic Regression Models
      2. Exercise 56: Creating Binomial Baseline Evaluations with cvms
      3. Exercise 57: Creating Gaussian Baseline Evaluations with cvms
    5. Regression and Classification with Decision Trees
      1. Exercise 58: Training Random Forest Models
    6. Model Selection by Multiple Disagreeing Metrics
      1. Pareto Dominance
      2. Exercise 59: Plotting the Pareto Front
      3. Activity 19: Classifying Room Types
    7. Summary
  12. Chapter 6
  13. Unsupervised Learning
    1. Introduction
    2. Overview of Unsupervised Learning (Clustering)
      1. Hard versus Soft Clusters
      2. Flat versus Hierarchical Clustering
      3. Monothetic versus Polythetic Clustering
      4. Exercise 60: Monothetic and Hierarchical Clustering on a Binary Dataset
    3. DIANA
      1. Exercise 61: Implement Hierarchical Clustering Using DIANA
      2. AGNES
      3. Exercise 62: Agglomerative Clustering Using AGNES
      4. Distance Metrics in Clustering
      5. Exercise 63: Calculate Dissimilarity Matrices Using Euclidean and Manhattan Distance
      6. Correlation-Based Distance Metrics
      7. Exercise 64: Apply Correlation-Based Metrics
    4. Applications of Clustering
    5. k-means Clustering
      1. Exploratory Data Analysis Using Scatter Plots
      2. The Elbow Method
      3. Exercise 65: Implementation of k-means Clustering in R
      4. Activity 20: Perform DIANA, AGNES, and k-means on the Built-In Motor Car Dataset
    6. Summary
  14. Appendix
    1. Chapter 1: An Introduction to Machine Learning
      1. Activity 1: Finding the Distribution of Diabetic Patients in the PimaIndiansDiabetes Dataset
      2. Activity 2: Grouping the PimaIndiansDiabetes Data
      3. Activity 3: Performing EDA on the PimaIndiansDiabetes Dataset
      4. Activity 4: Building Linear Models for the GermanCredit Dataset
      5. Activity 5: Using Multiple Variables for a Regression Model for the Boston Housing Dataset
    2. Chapter 2: Data Cleaning and Pre-processing
      1. Activity 6: Pre-processing using Center and Scale
      2. Activity 7: Identifying Outliers
      3. Activity 8: Oversampling and Undersampling using SMOTE
      4. Activity 9: Sampling and Oversampling using ROSE
    3. Chapter 3: Feature Engineering
      1. Activity 10: Creating Time Series Features – Binning
      2. Activity 11: Identifying Skewness
      3. Activity 12: Generating PCA
      4. Activity 13: Implementing the Random Forest Approach
      5. Activity 14: Selecting Features Using Variable Importance
    4. Chapter 4: Introduction to neuralnet and Evaluation Methods
      1. Activity 15: Training a Neural Network
      2. Activity 16: Training and Comparing Neural Network Architectures
      3. Activity 17: Training and Comparing Neural Network Architectures with Cross-Validation
    5. Chapter 5: Linear and Logistic Regression Models
      1. Activity 18: Implementing Linear Regression
      2. Activity 19: Classifying Room Types
    6. Chapter 6: Unsupervised Learning
      1. Activity 20: Perform DIANA, AGNES, and k-means on the Built-In Motor Car Dataset