List of Figures

Chapter 1. The data science process

Figure 1.1. The lifecycle of a data science project: loops within loops

Figure 1.2. The fraction of defaulting loans by credit history category. The dark region of each bar represents the fraction of loans in that category that defaulted.

Figure 1.3. A decision tree model for finding bad loan applications. The outcome nodes show confidence scores.

Figure 1.4. Example slide from an executive presentation

Chapter 2. Starting with R and data

Figure 2.1. Chapter 2 mental model

Figure 2.2. Car data viewed as a table

Figure 2.3. Scatter plot of income (PINCP) as a function of age (AGEP)

Chapter 3. Exploring data

Figure 3.1. Chapter 3 mental model

Figure 3.2. Some information is easier to read from a graph, and some from a summary.

Figure 3.3. The density plot of age

Figure 3.4. A unimodal distribution (solid curve) can usually be modeled as coming from a single population of users. With a bimodal distribution (dashed curve), your data often comes from two populations of users.

Figure 3.5. A histogram tells you where your data is concentrated. It also visually highlights outliers and anomalies.

Figure 3.6. Density plots show where data is concentrated.

Figure 3.7. The density plot of income on a log10 scale highlights details of the income distribution that are harder to see in a regular density plot.

Figure 3.8. Bar charts show the distribution of categorical variables.

Figure 3.9. A horizontal bar chart can be easier to read when there are several categories with long names.

Figure 3.10. Using a dot plot and sorting by count makes the data even easier to read.

Figure 3.11. Example of a line plot

Figure 3.12. A scatter plot of income versus age

Figure 3.13. A scatter plot of income versus age, with a smoothing curve

Figure 3.14. Fraction of customers with health insurance, as a function of age

Figure 3.15. Hexbin plot of income versus age, with a smoothing curve superimposed

Figure 3.16. Health insurance versus marital status: stacked bar chart

Figure 3.17. Health insurance versus marital status: side-by-side bar chart

Figure 3.18. Health insurance versus marital status: shadow plot

Figure 3.19. Health insurance versus marital status: filled bar chart

Figure 3.20. Distribution of marital status by housing type: side-by-side bar chart

Figure 3.21. Distribution of marital status by housing type: faceted side-by-side bar chart

Figure 3.22. Comparing the distribution of marital status for widowed and never married populations

Figure 3.23. ShadowHist comparison of the age distributions of widowed and never married populations

Figure 3.24. Faceted plot of the age distributions of different marital statuses

Chapter 4. Managing data

Figure 4.1. Chapter 4 mental model

Figure 4.2. Even a few missing values can lose all your data.

Figure 4.3. Creating a new level for missing categorical values

Figure 4.4. Income data with missing values

Figure 4.5. Replacing missing values with the mean

Figure 4.6. Replacing missing values with the mean and adding an indicator column to track the altered values

Figure 4.7. Creating and applying a simple treatment plan

Figure 4.8. Is a 35-year-old young?

Figure 4.9. Faceted graph: is a 35-year-old young?

Figure 4.10. A nearly lognormal distribution and its log

Figure 4.11. Signed log lets you visualize non-positive data on a logarithmic scale.

Figure 4.12. Splitting data into training and test (or training, calibration, and test) sets

Figure 4.13. Example of a dataset with customers and households

Figure 4.14. Sampling the dataset by household rather than customer

Figure 4.15. Recording the data source, collection date, and treatment date with data

Chapter 5. Data engineering and data shaping

Figure 5.1. Chapter 5 mental model

Figure 5.2. Example iris plot

Figure 5.3. Selecting columns and rows

Figure 5.4. Removing rows with missing values

Figure 5.5. Ordering rows

Figure 5.6. Adding or altering columns

Figure 5.7. Ozone plot example

Figure 5.8. Filling in missing values

Figure 5.9. Ozone plot again

Figure 5.10. Aggregating rows

Figure 5.11. Iris plot

Figure 5.12. Unioning rows

Figure 5.13. Unioning columns

Figure 5.14. Left join

Figure 5.15. Inner join

Figure 5.16. Full join

Figure 5.17. Passenger deaths plot

Figure 5.18. Wide-to-tall conversion

Figure 5.19. Faceted passenger death plot

Figure 5.20. Chick count and weight over time

Figure 5.21. Moving from tall to wide form

Chapter 6. Choosing and evaluating models

Figure 6.1. Mental model

Figure 6.2. Assigning products to product categories

Figure 6.3. Notional example of determining the probability that a transaction is fraudulent

Figure 6.4. Notional example of clustering your customers by purchase pattern and purchase amount

Figure 6.5. Notional example of finding purchase patterns in your data

Figure 6.6. Schematic of model construction and evaluation

Figure 6.7. A notional illustration of overfitting

Figure 6.8. Splitting data into training and test (or training, calibration, and test) sets

Figure 6.9. Partitioning data for 3-fold cross-validation

Figure 6.10. Accuracy

Figure 6.11. Precision

Figure 6.12. Recall

Figure 6.13. Specificity

Figure 6.14. Scoring residuals

Figure 6.15. Distribution of scores broken up by known classes

Figure 6.16. ROC curve for the email spam example

Figure 6.17. ROC curve for an ideal model that classifies perfectly

Figure 6.18. Log likelihood of a spam filter prediction

Figure 6.19. Log likelihood penalizes mismatches between the prediction and the true class label.

Figure 6.20. Some kinds of models are easier to manually inspect than others.

Figure 6.21. Example of a document and the words that most strongly contributed to its classification as “atheist” by the model

Figure 6.22. Visualize the explanation of the model’s prediction.

Figure 6.23. Notional sketch of how LIME works

Figure 6.24. Explanations of the two iris examples

Figure 6.25. Distributions of petal and sepal dimensions by species

Figure 6.26. Creating a document-term matrix

Figure 6.27. Distribution of test prediction scores

Figure 6.28. Explanation of the prediction on the sample review

Figure 6.29. Text explanation of the prediction in listing 6.26

Figure 6.30. Explanation visualizations for the two sample reviews in listing 6.27

Figure 6.31. Explanation visualizations for test_10294

Chapter 7. Linear and logistic regression

Figure 7.1. Mental model

Figure 7.2. The linear relationship between daily_cals_down and pounds_lost

Equation 7.1. The expression for a linear regression model

Figure 7.3. Fit versus actuals for y = x^2

Figure 7.4. Building a linear model using lm()

Figure 7.5. Making predictions with a linear regression model

Figure 7.6. Plot of actual log income as a function of predicted log income

Figure 7.7. Plot of residual error as a function of prediction

Figure 7.8. An example of systematic errors in model predictions

Figure 7.9. The model coefficients

Figure 7.10. Model summary

Figure 7.11. Model summary coefficient columns

Figure 7.12. Mapping the odds of a flight delay to log-odds

Figure 7.13. Mapping log-odds to the probability of a flight delay via the sigmoid function

Equation 7.2. The expression for a logistic regression model

Figure 7.14. Distribution of score broken up by positive examples (TRUE) and negative examples (FALSE)

Figure 7.15. Reproduction of the spam filter score distributions from chapter 6

Figure 7.16. Enrichment (top) and recall (bottom) plotted as functions of threshold for the training set

Figure 7.17. Coefficients of the logistic regression model

Figure 7.18. Schematic of cv.glmnet()

Figure 7.19. Coefficients of the ridge regression model

Figure 7.20. Coefficients of the lasso regression model

Figure 7.21. Schematic of using cva.glmnet to pick alpha

Figure 7.22. Cross-validation error as a function of alpha

Chapter 8. Advanced data preparation

Figure 8.1. Mental model

Figure 8.2. vtreat three-way split strategy

Figure 8.3. KDD2009 churn rate

Figure 8.4. vtreat variable preparation

Figure 8.5. Preparing held-out data

Figure 8.6. vtreat three-way split strategy again

Figure 8.7. vtreat cross-frame strategy

Figure 8.8. Distribution of the glm model’s scores on test data

Figure 8.9. glm recall and enrichment as a function of threshold

Figure 8.10. The first few rows of the auto_mpg data

Figure 8.11. The two vtreat phases

Figure 8.12. Our simple example data: raw

Figure 8.13. Our simple example data: treated

Chapter 9. Unsupervised methods

Figure 9.1. Mental model

Figure 9.2. An example of data in three clusters

Figure 9.3. Manhattan vs. Euclidean distance

Figure 9.4. Cosine similarity

Figure 9.5. Comparison of Fr.Veg and RedMeat variables, unscaled (top) and scaled (bottom)

Figure 9.6. Dendrogram of countries clustered by protein consumption

Figure 9.7. The idea behind principal components analysis

Figure 9.8. Plot of countries clustered by protein consumption, projected onto the first two principal components

Figure 9.9. Jaccard similarity

Figure 9.10. Cluster WSS and total WSS for a set of four clusters

Figure 9.11. WSS as a function of k for the protein data

Figure 9.12. Total sum of squares for a set of four clusters

Figure 9.13. BSS and WSS as a function of k

Figure 9.14. The Calinski-Harabasz index as a function of k

Figure 9.15. The protein data dendrogram with two clusters

Figure 9.16. The k-means procedure. The two cluster centers are represented by the outlined star and diamond.

Figure 9.17. Top: Comparison of the (scaled) CH and average silhouette width indices for kmeans clusterings. Bottom: Comparison of CH indices for kmeans and hclust clusterings.

Figure 9.18. A density plot of basket sizes

Chapter 10. Exploring advanced methods

Figure 10.1. Mental model

Figure 10.2. Mortality rates of men and women as a function of body mass index

Figure 10.3. Example decision tree (from chapter 1)

Figure 10.4. Decision tree model for spam filtering

Figure 10.5. Bagging decision trees

Figure 10.6. Growing a random forest

Figure 10.7. Out-of-bag samples for datum x1

Figure 10.8. Calculating variable importance of variable v1

Figure 10.9. Plot of the most important variables in the spam model, as measured by accuracy

Figure 10.10. Building up a gradient-boosted tree model

Figure 10.11. Cross-validated log loss as a function of ensemble size

Figure 10.12. The effect of BMI on mortality: linear model vs. GAM

Figure 10.13. A spline that has been fit through a series of points

Figure 10.14. Linear model’s predictions vs. actual response. The solid line is the line of perfect prediction (prediction == actual).

Figure 10.15. GAM’s predictions vs. actual response. The solid line is the theoretical line of perfect prediction (prediction == actual).

Figure 10.16. Top: The non-linear function s(PWGT) discovered by gam(), as output by plot(gam_model). Bottom: The same spline superimposed over the training data.

Figure 10.17. Smoothing curves of each of the four input variables plotted against birth weight, compared with the splines discovered by gam(). All curves have been shifted to be zero mean for comparison of shape.

Figure 10.18. Notional illustration of a kernel transform (based on Cristianini and Shawe-Taylor, 2000)

Figure 10.19. The spiral counterexample

Figure 10.20. Identity kernel failing to learn the spiral concept

Figure 10.21. Radial kernel successfully learning the spiral concept

Figure 10.22. Notional illustration of SVM

Chapter 11. Documentation and deployment

Figure 11.1. Mental model

Figure 11.2. R markdown process schematic

Figure 11.3. The R markdown process

Figure 11.4. knitr documentation of Buzz data load

Figure 11.5. knitr documentation of Buzz data load 2019:

Figure 11.6. roxygen2-generated online help

Figure 11.7. Version control saving the day

Figure 11.8. RStudio new project pane

Figure 11.9. RStudio Git controls

Figure 11.10. Multiple repositories working together

Figure 11.11. git pull: rebase versus merge

Figure 11.12. Launching the Shiny server from RStudio

Figure 11.13. Interacting with the Shiny application

Figure 11.14. Top of HTML form that asks server for Buzz classification on submit

Figure 11.15. The top of the first tree (of 500) from the random forest model

Figure 11.16. Annotating CASE/WHEN paths

Chapter 12. Producing effective presentations

Figure 12.1. Mental model

Figure 12.2. Motivation for project

Figure 12.3. Stating the project goal

Figure 12.4. Describing the project and its results

Figure 12.5. Discussing your work in more detail

Figure 12.6. Optional slide on the modeling method

Figure 12.7. Discussing future work

Figure 12.8. Motivation for project

Figure 12.9. User workflow before and after the model

Figure 12.10. Present the model’s benefits from the users’ perspective.

Figure 12.11. Provide technical details that are relevant to users.

Figure 12.12. Describe how users will interact with the model.

Figure 12.13. An example instructional slide

Figure 12.14. Ask users for feedback.

Figure 12.15. Introducing the project

Figure 12.16. Discussing related work

Figure 12.17. Introducing the pilot study

Figure 12.18. Discussing model inputs and modeling approach

Figure 12.19. Showing model performance

Figure 12.20. Discussing future work

Appendix A. Starting with R and other tools

Figure A.1. RStudio file-browsing controls

Figure A.2. Downloading the book materials from GitHub

Figure A.3. Cloning the book repository

Figure A.4. RStudio options

Figure A.5. rquery operation plan diagram

Appendix B. Important statistical concepts

Figure B.1. The normal distribution with mean 0 and standard deviation 1

Figure B.2. The empirical distribution of points drawn from a normal with mean 0 and standard deviation 1. The dotted line represents the theoretical normal distribution.

Figure B.3. Illustrating x < qnorm(0.75)

Figure B.4. Top: The lognormal distribution X such that mean(log(X)) = 0 and sd(log(X)) = 1. The dashed line is the theoretical distribution, and the solid line is the distribution of a random lognormal sample. Bottom: The solid line is the distribution of log(X).

Figure B.5. The 75th percentile of the lognormal distribution with meanlog = 0, sdlog = 1

Figure B.6. The binomial distributions for 50 coin tosses, with coins of various fairnesses (probability of landing on heads)

Figure B.7. The observed distribution of the count of girls in 100 classrooms of size 20, when the population is 50% female. The theoretical distribution is shown with the dashed line.

Figure B.8. Earned income versus capital gains

Figure B.9. Biased earned income vs. capital gains

Figure B.10. View of rows from the bioavailability dataset

