Chapter 1. The data science process
Figure 1.1. The lifecycle of a data science project: loops within loops
Chapter 2. Starting with R and data
Figure 2.1. Chapter 2 mental model
Figure 2.2. Car data viewed as a table
Figure 2.3. Scatter plot of income (PINCP) as a function of age (AGEP)
Chapter 3. Exploring data
Figure 3.1. Chapter 3 mental model
Figure 3.2. Some information is easier to read from a graph, and some from a summary.
Figure 3.3. The density plot of age
Figure 3.6. Density plots show where data is concentrated.
Figure 3.8. Bar charts show the distribution of categorical variables.
Figure 3.10. Using a dot plot and sorting by count makes the data even easier to read.
Figure 3.11. Example of a line plot
Figure 3.12. A scatter plot of income versus age
Figure 3.13. A scatter plot of income versus age, with a smoothing curve
Figure 3.14. Fraction of customers with health insurance, as a function of age
Figure 3.15. Hexbin plot of income versus age, with a smoothing curve superimposed
Figure 3.16. Health insurance versus marital status: stacked bar chart
Figure 3.17. Health insurance versus marital status: side-by-side bar chart
Figure 3.18. Health insurance versus marital status: shadow plot
Figure 3.19. Health insurance versus marital status: filled bar chart
Figure 3.20. Distribution of marital status by housing type: side-by-side bar chart
Figure 3.21. Distribution of marital status by housing type: faceted side-by-side bar chart
Figure 3.22. Comparing the distribution of marital status for widowed and never married populations
Figure 3.23. ShadowHist comparison of the age distributions of widowed and never married populations
Figure 3.24. Faceted plot of the age distributions of different marital statuses
Chapter 4. Managing data
Figure 4.1. Chapter 4 mental model
Figure 4.2. Even a few missing values can lose all your data.
Figure 4.3. Creating a new level for missing categorical values
Figure 4.4. Income data with missing values
Figure 4.5. Replacing missing values with the mean
Figure 4.7. Creating and applying a simple treatment plan
Figure 4.8. Is a 35-year-old young?
Figure 4.9. Faceted graph: is a 35-year-old young?
Figure 4.10. A nearly lognormal distribution and its log
Figure 4.11. Signed log lets you visualize non-positive data on a logarithmic scale.
Figure 4.12. Splitting data into training and test (or training, calibration, and test) sets
Figure 4.13. Example of a dataset with customers and households
Figure 4.14. Sampling the dataset by household rather than customer
Figure 4.15. Recording the data source, collection date, and treatment date with data
Chapter 5. Data engineering and data shaping
Figure 5.1. Chapter 5 mental model
Figure 5.3. Selecting columns and rows
Figure 5.4. Removing rows with missing values
Figure 5.6. Adding or altering columns
Figure 5.7. Ozone plot example
Figure 5.8. Filling in missing values
Figure 5.17. Passenger deaths plot
Figure 5.18. Wide-to-tall conversion
Figure 5.19. Faceted passenger death plot
Chapter 6. Choosing and evaluating models
Figure 6.2. Assigning products to product categories
Figure 6.3. Notional example of determining the probability that a transaction is fraudulent
Figure 6.4. Notional example of clustering your customers by purchase pattern and purchase amount
Figure 6.5. Notional example of finding purchase patterns in your data
Figure 6.6. Schematic of model construction and evaluation
Figure 6.7. A notional illustration of overfitting
Figure 6.8. Splitting data into training and test (or training, calibration, and test) sets
Figure 6.9. Partitioning data for 3-fold cross-validation
Figure 6.14. Scoring residuals
Figure 6.15. Distribution of scores broken up by known classes
Figure 6.16. ROC curve for the email spam example
Figure 6.17. ROC curve for an ideal model that classifies perfectly
Figure 6.18. Log likelihood of a spam filter prediction
Figure 6.19. Log likelihood penalizes mismatches between the prediction and the true class label.
Figure 6.20. Some kinds of models are easier to manually inspect than others.
Figure 6.22. Visualize the explanation of the model’s prediction.
Figure 6.23. Notional sketch of how LIME works
Figure 6.24. Explanations of the two iris examples
Figure 6.25. Distributions of petal and sepal dimensions by species
Figure 6.26. Creating a document-term matrix
Figure 6.27. Distribution of test prediction scores
Figure 6.28. Explanation of the prediction on the sample review
Figure 6.29. Text explanation of the prediction in listing 6.26
Figure 6.30. Explanation visualizations for the two sample reviews in listing 6.27
Chapter 7. Linear and logistic regression
Figure 7.2. The linear relationship between daily_cals_down and pounds_lost
equation 7.1. The expression for a linear regression model
Figure 7.3. Fit versus actuals for y = x²
Figure 7.4. Building a linear model using lm()
Figure 7.5. Making predictions with a linear regression model
Figure 7.6. Plot of actual log income as a function of predicted log income
Figure 7.7. Plot of residual error as a function of prediction
Figure 7.8. An example of systematic errors in model predictions
Figure 7.9. The model coefficients
Figure 7.11. Model summary coefficient columns
Figure 7.12. Mapping the odds of a flight delay to log-odds
Figure 7.13. Mapping log-odds to the probability of a flight delay via the sigmoid function
equation 7.2. The expression for a logistic regression model
Figure 7.15. Reproduction of the spam filter score distributions from chapter 6
Figure 7.17. Coefficients of the logistic regression model
Figure 7.18. Schematic of cv.glmnet()
Figure 7.19. Coefficients of the ridge regression model
Figure 7.20. Coefficients of the lasso regression model
Chapter 8. Advanced data preparation
Figure 8.2. vtreat three-way split strategy
Figure 8.3. KDD2009 churn rate
Figure 8.4. vtreat variable preparation
Figure 8.5. Preparing held-out data
Figure 8.6. vtreat three-way split strategy again
Figure 8.7. vtreat cross-frame strategy
Figure 8.8. Distribution of the glm model’s scores on test data
Figure 8.9. glm recall and enrichment as a function of threshold
Figure 8.10. The first few rows of the auto_mpg data
Figure 8.11. The two vtreat phases
Chapter 9. Unsupervised methods
Figure 9.2. An example of data in three clusters
Figure 9.3. Manhattan vs. Euclidean distance
Figure 9.5. Comparison of Fr.Veg and RedMeat variables, unscaled (top) and scaled (bottom)
Figure 9.6. Dendrogram of countries clustered by protein consumption
Figure 9.7. The idea behind principal components analysis
Figure 9.9. Jaccard similarity
Figure 9.10. Cluster WSS and total WSS for a set of four clusters
Figure 9.11. WSS as a function of k for the protein data
Figure 9.12. Total sum of squares for a set of four clusters
Figure 9.13. BSS and WSS as a function of k
Figure 9.14. The Calinski-Harabasz index as a function of k
Chapter 10. Exploring advanced methods
Figure 10.2. Mortality rates of men and women as a function of body mass index
Figure 10.3. Example decision tree (from chapter 1)
Figure 10.4. Decision tree model for spam filtering
Figure 10.5. Bagging decision trees
Figure 10.6. Growing a random forest
Figure 10.7. Out-of-bag samples for datum x1
Figure 10.8. Calculating variable importance of variable v1
Figure 10.9. Plot of the most important variables in the spam model, as measured by accuracy
Figure 10.10. Building up a gradient-boosted tree model
Figure 10.11. Cross-validated log loss as a function of ensemble size
Figure 10.12. The effect of BMI on mortality: linear model vs. GAM
Figure 10.13. A spline that has been fit through a series of points
Figure 10.19. The spiral counterexample
Figure 10.20. Identity kernel failing to learn the spiral concept
Figure 10.21. Radial kernel successfully learning the spiral concept
Chapter 11. Documentation and deployment
Figure 11.2. R markdown process schematic
Figure 11.3. The R markdown process
Figure 11.4. knitr documentation of Buzz data load
Figure 11.5. knitr documentation of Buzz data load 2019: buzzm.md
Figure 11.6. roxygen2-generated online help
Figure 11.7. Version control saving the day
Figure 11.8. RStudio new project pane
Figure 11.9. RStudio Git controls
Figure 11.10. Multiple repositories working together
Figure 11.11. git pull: rebase versus merge
Figure 11.12. Launching the Shiny server from RStudio
Figure 11.13. Interacting with the Shiny application
Figure 11.14. Top of HTML form that asks server for Buzz classification on submit
Figure 11.15. The top of the first tree (of 500) from the random forest model
Chapter 12. Producing effective presentations
Figure 12.2. Motivation for project
Figure 12.3. Stating the project goal
Figure 12.4. Describing the project and its results
Figure 12.5. Discussing your work in more detail
Figure 12.6. Optional slide on the modeling method
Figure 12.7. Discussing future work
Figure 12.8. Motivation for project
Figure 12.9. User workflow before and after the model
Figure 12.10. Present the model’s benefits from the users’ perspective.
Figure 12.11. Provide technical details that are relevant to users.
Figure 12.12. Describe how users will interact with the model.
Figure 12.13. An example instructional slide
Figure 12.14. Ask users for feedback.
Figure 12.15. Introducing the project
Figure 12.16. Discussing related work
Figure 12.17. Introducing the pilot study
Figure 12.18. Discussing model inputs and modeling approach
Appendix A. Starting with R and other tools
Figure A.1. RStudio file-browsing controls
Figure A.2. Downloading the book materials from GitHub
Appendix B. Important statistical concepts
Figure B.1. The normal distribution with mean 0 and standard deviation 1
Figure B.3. Illustrating x < qnorm(0.75)
Figure B.5. The 75th percentile of the lognormal distribution with meanlog = 0, sdlog = 1
Figure B.8. Earned income versus capital gains