List of Listings

Chapter 1. The data science process

Listing 1.1. Calculating the confusion matrix

Chapter 2. Starting with R and data

Listing 2.1. Reading the UCI car data

Listing 2.2. Exploring the car data

Listing 2.3. Loading the credit dataset

Listing 2.4. Setting column names

Listing 2.5. Transforming the car data

Listing 2.6. Summary of Good_Loan and Purpose

Listing 2.7. PUMS data provenance documentation (PDSwR2/PUMS/download/LoadPUMS.Rmd)

Listing 2.8. Loading data into R from a relational database

Listing 2.9. Loading data from a database

Listing 2.10. Remapping values and selecting rows from data

Listing 2.11. Plotting the data

Chapter 3. Exploring data

Listing 3.1. The summary() command

Listing 3.2. Will the variable is_employed be useful for modeling?

Listing 3.3. Examples of invalid values and outliers

Listing 3.4. Looking at the data range of a variable

Listing 3.5. Checking units; mistakes can lead to spectacular errors

Listing 3.6. Plotting a histogram

Listing 3.7. Producing a density plot

Listing 3.8. Creating a log-scaled density plot

Listing 3.9. Producing a horizontal bar chart

Listing 3.10. Producing a dot plot with sorted categories

Listing 3.11. Producing a line plot

Listing 3.12. Examining the correlation between age and income

Listing 3.13. Creating a scatterplot of age and income

Listing 3.14. Producing a hexbin plot

Listing 3.15. Specifying different styles of bar chart

Listing 3.16. Plotting a bar chart with and without facets

Listing 3.17. Comparing population densities across categories

Listing 3.18. Comparing population densities across categories with ShadowHist()

Chapter 4. Managing data

Listing 4.1. Treating the age and income variables

Listing 4.2. Treating the gas_usage variable

Listing 4.3. Counting the missing values in each variable

Listing 4.4. Creating and applying a treatment plan

Listing 4.5. Comparing the treated data to the original

Listing 4.6. Examining the data treatment

Listing 4.7. Normalizing income by state

Listing 4.8. Normalizing by mean age

Listing 4.9. Centering and scaling age

Listing 4.10. Centering and scaling multiple numeric variables

Listing 4.11. Treating new data before feeding it to a model

Listing 4.12. Splitting into test and training using a random group mark

Listing 4.13. Ensuring test/train split doesn’t split inside a household

Chapter 6. Choosing and evaluating models

Listing 6.1. Building and applying a logistic regression spam model

Listing 6.2. Spam classifications

Listing 6.3. Spam confusion matrix

Listing 6.4. Entering the Akismet confusion matrix by hand

Listing 6.5. Seeing filter performance change when spam proportions change

Listing 6.6. Fitting the cricket model and making predictions

Listing 6.7. Calculating RMSE

Listing 6.8. Calculating R-squared

Listing 6.9. Making a double density plot

Listing 6.10. Plotting the receiver operating characteristic curve

Listing 6.11. Calculating log likelihood

Listing 6.12. Computing the null model’s log likelihood

Listing 6.13. Computing the deviance and pseudo R-squared

Listing 6.14. Loading the iris dataset

Listing 6.15. Fitting a model to the iris training data

Listing 6.16. Evaluating the iris model

Listing 6.17. Building a LIME explainer from the model and training data

Listing 6.18. An example iris datum

Listing 6.19. Explaining the iris example

Listing 6.20. More iris examples

Listing 6.21. Loading the IMDB training data

Listing 6.22. Converting the texts and fitting the model

Listing 6.23. Evaluating the review classifier

Listing 6.24. Building an explainer for a text classifier

Listing 6.25. Explaining the model’s prediction on a review

Listing 6.26. Explaining the model’s prediction

Listing 6.27. Examining two more reviews

Chapter 7. Linear and logistic regression

Listing 7.1. Loading the PUMS data and fitting a model

Listing 7.2. Plotting log income as a function of predicted log income

Listing 7.3. Plotting residuals as a function of predicted log income

Listing 7.4. Computing R-squared

Listing 7.5. Calculating root mean square error

Listing 7.6. Summarizing residuals

Listing 7.7. Loading the CDC data

Listing 7.8. Building the model formula

Listing 7.9. Fitting the logistic regression model

Listing 7.10. Applying the logistic regression model

Listing 7.11. Preserving marginal probabilities with logistic regression

Listing 7.12. Plotting distribution of prediction score grouped by known outcome

Listing 7.13. Exploring modeling trade-offs

Listing 7.14. Evaluating the chosen model

Listing 7.15. The model coefficients

Listing 7.16. The model summary

Listing 7.17. Computing deviance

Listing 7.18. Calculating the pseudo R-squared

Listing 7.19. Calculating the significance of the observed fit

Listing 7.20. Calculating the Akaike information criterion

Listing 7.21. Preparing the cars data

Listing 7.22. Fitting a logistic regression model

Listing 7.23. Looking at the model summary

Listing 7.24. Looking at the logistic model’s coefficients

Listing 7.25. The logistic model’s test performance

Listing 7.26. Fitting the ridge regression model

Listing 7.27. Looking at the ridge model’s coefficients

Listing 7.28. Looking at the ridge model’s test performance

Listing 7.29. The lasso model’s coefficients

Listing 7.30. The lasso model’s test performance

Listing 7.31. Cross-validating for both alpha and lambda

Listing 7.32. Finding the minimum error alpha

Listing 7.33. Fitting and evaluating the elastic net model

Chapter 8. Advanced data preparation

Listing 8.1. Preparing the KDD data for analysis

Listing 8.2. Attempting to model without preparation

Listing 8.3. Trying just one variable

Listing 8.4. Basic data preparation for classification

Listing 8.5. Preparing data with vtreat

Listing 8.6. Advanced data preparation for classification

Listing 8.7. Basic variable recoding and selection

Listing 8.8. An information-free dataset

Listing 8.9. The dangers of reusing data

Listing 8.10. Using mkCrossFrameNExperiment()

Chapter 9. Unsupervised methods

Listing 9.1. Reading the protein data

Listing 9.2. Rescaling the dataset

Listing 9.3. Hierarchical clustering

Listing 9.4. Extracting the clusters found by hclust()

Listing 9.5. Projecting the clusters on the first two principal components

Listing 9.6. Running clusterboot() on the protein data

Listing 9.7. Calculating total within sum of squares

Listing 9.8. Plotting WSS for a range of k

Listing 9.9. Plotting BSS and WSS as a function of k

Listing 9.10. The Calinski-Harabasz index

Listing 9.11. Running k-means with k = 5

Listing 9.12. Plotting cluster criteria

Listing 9.13. Running clusterboot() with k-means

Listing 9.14. A function to assign points to a cluster

Listing 9.15. Generating and clustering synthetic data

Listing 9.16. Unscaling the centers

Listing 9.17. An example of assigning points to clusters

Listing 9.18. Reading in the book data

Listing 9.19. Examining the transaction data

Listing 9.20. Examining the size distribution

Listing 9.21. Counting how often each book occurs

Listing 9.22. Finding the 10 most frequently occurring books

Listing 9.23. Finding the association rules

Listing 9.24. Scoring rules

Listing 9.25. Getting the five most confident rules

Listing 9.26. Finding rules with restrictions

Listing 9.27. Inspecting rules

Listing 9.28. Inspecting rules with restrictions

Chapter 10. Exploring advanced methods

Listing 10.1. Preparing Spambase data and evaluating a decision tree model

Listing 10.2. Bagging decision trees

Listing 10.3. Using random forests

Listing 10.4. randomForest variable importances

Listing 10.5. Fitting with fewer variables

Listing 10.6. Loading the iris data

Listing 10.7. Cross-validating to determine model size

Listing 10.8. Fitting an xgboost model

Listing 10.9. Loading the natality data

Listing 10.10. Using vtreat to prepare data for xgboost

Listing 10.11. Fitting and applying an xgboost model for birth weight

Listing 10.12. Preparing an artificial problem

Listing 10.13. Applying linear regression to the artificial example

Listing 10.14. Applying GAM to the artificial example

Listing 10.15. Comparing linear regression and GAM performance

Listing 10.16. Extracting a learned spline from a GAM

Listing 10.17. Applying linear regression (with and without GAM) to health data

Listing 10.18. Plotting GAM results

Listing 10.19. Checking GAM model performance on holdout data

Listing 10.20. GLM logistic regression

Listing 10.21. GAM logistic regression

Listing 10.22. Setting up the spirals data as a classification problem

Listing 10.23. SVM with a poor choice of kernel

Listing 10.24. SVM with a good choice of kernel

Listing 10.25. An artificial kernel example

Chapter 11. Documentation and deployment

Listing 11.1. R-annotated Markdown

Listing 11.2. Using the system() command to compute a file hash

Listing 11.3. Calculating model performance

Listing 11.4. Saving data

Listing 11.5. Example code comments

Listing 11.6. Checking your project status

Listing 11.7. Checking your project history

Listing 11.8. Finding out who committed what

Listing 11.9. Finding line-based differences between two committed versions

Listing 11.10. git remote

Listing 11.11. Buzz model as an R-based HTTP service

Listing 11.12. Calling the Buzz HTTP service

Appendix A. Starting with R and other tools

Listing A.1. Trying a few R commands

Listing A.2. Binding values to function arguments

Listing A.3. Demonstrating side effects

Listing A.4. R truth tables for Boolean operators

Listing A.5. Call-by-value effect

Listing A.6. Examples of R indexing operators

Listing A.7. R’s treatment of unexpected factor levels

Listing A.8. Confirming lm() encodes new strings correctly

Appendix B. Important statistical concepts

Listing B.1. Plotting the theoretical normal density

Listing B.2. Plotting an empirical normal density

Listing B.3. Working with the normal CDF

Listing B.4. Plotting x < qnorm(0.75)

Listing B.5. Demonstrating some properties of the lognormal distribution

Listing B.6. Plotting the lognormal distribution

Listing B.7. Plotting the binomial distribution

Listing B.8. Working with the theoretical binomial distribution

Listing B.9. Simulating a binomial distribution

Listing B.10. Working with the binomial distribution

Listing B.11. Working with the binomial CDF

Listing B.12. Building simulated A/B test data

Listing B.13. Summarizing the A/B test into a contingency table

Listing B.14. Calculating the observed A and B conversion rates

Listing B.15. Calculating the significance of the observed difference in rates

Listing B.16. Computing frequentist significance

Listing B.17. Building synthetic uncorrelated income

Listing B.18. Calculating the (non)significance of the observed correlation

Listing B.19. Misleading significance result from biased observations

Listing B.20. Plotting biased view of income and capital gains

Listing B.21. Summarizing our synthetic biological data

Listing B.22. Building data that improves over time

Listing B.23. A bad model (due to omitted variable bias)

Listing B.24. A better model

