Chapter 1. The data science process
Chapter 2. Starting with R and data
Listing 2.1. Reading the UCI car data
Listing 2.2. Exploring the car data
Listing 2.3. Loading the credit dataset
Listing 2.4. Setting column names
Listing 2.5. Transforming the car data
Listing 2.6. Summary of Good_Loan and Purpose
Listing 2.7. PUMS data provenance documentation (PDSwR2/PUMS/download/LoadPUMS.Rmd)
Listing 2.8. Loading data into R from a relational database
Listing 2.9. Loading data from a database
Chapter 3. Exploring data
Listing 3.1. The summary() command
Listing 3.2. Will the variable is_employed be useful for modeling?
Listing 3.3. Examples of invalid values and outliers
Listing 3.4. Looking at the data range of a variable
Listing 3.5. Checking units; mistakes can lead to spectacular errors
Listing 3.6. Plotting a histogram
Listing 3.7. Producing a density plot
Listing 3.8. Creating a log-scaled density plot
Listing 3.9. Producing a horizontal bar chart
Listing 3.10. Producing a dot plot with sorted categories
Listing 3.11. Producing a line plot
Listing 3.12. Examining the correlation between age and income
Listing 3.13. Creating a scatterplot of age and income
Listing 3.14. Producing a hexbin plot
Listing 3.15. Specifying different styles of bar chart
Listing 3.16. Plotting a bar chart with and without facets
Listing 3.17. Comparing population densities across categories
Listing 3.18. Comparing population densities across categories with ShadowHist()
Chapter 4. Managing data
Listing 4.1. Treating the age and income variables
Listing 4.2. Treating the gas_usage variable
Listing 4.3. Counting the missing values in each variable
Listing 4.4. Creating and applying a treatment plan
Listing 4.5. Comparing the treated data to the original
Listing 4.6. Examining the data treatment
Listing 4.7. Normalizing income by state
Listing 4.8. Normalizing by mean age
Listing 4.9. Centering and scaling age
Listing 4.10. Centering and scaling multiple numeric variables
Listing 4.11. Treating new data before feeding it to a model
Listing 4.12. Splitting into test and training using a random group mark
Listing 4.13. Ensuring test/train split doesn’t split inside a household
Chapter 6. Choosing and evaluating models
Listing 6.1. Building and applying a logistic regression spam model
Listing 6.2. Spam classifications
Listing 6.3. Spam confusion matrix
Listing 6.4. Entering the Akismet confusion matrix by hand
Listing 6.5. Seeing filter performance change when spam proportions change
Listing 6.6. Fitting the cricket model and making predictions
Listing 6.8. Calculating R-squared
Listing 6.9. Making a double density plot
Listing 6.10. Plotting the receiver operating characteristic curve
Listing 6.11. Calculating log likelihood
Listing 6.12. Computing the null model’s log likelihood
Listing 6.13. Computing the deviance and pseudo R-squared
Listing 6.14. Loading the iris dataset
Listing 6.15. Fitting a model to the iris training data
Listing 6.16. Evaluating the iris model
Listing 6.17. Building a LIME explainer from the model and training data
Listing 6.18. An example iris datum
Listing 6.19. Explaining the iris example
Listing 6.20. More iris examples
Listing 6.21. Loading the IMDB training data
Listing 6.22. Converting the texts and fitting the model
Listing 6.23. Evaluating the review classifier
Listing 6.24. Building an explainer for a text classifier
Listing 6.25. Explaining the model’s prediction on a review
Chapter 7. Linear and logistic regression
Listing 7.1. Loading the PUMS data and fitting a model
Listing 7.2. Plotting log income as a function of predicted log income
Listing 7.3. Plotting residual log income as a function of predicted log income
Listing 7.4. Computing R-squared
Listing 7.5. Calculating root mean square error
Listing 7.6. Summarizing residuals
Listing 7.7. Loading the CDC data
Listing 7.8. Building the model formula
Listing 7.9. Fitting the logistic regression model
Listing 7.10. Applying the logistic regression model
Listing 7.11. Preserving marginal probabilities with logistic regression
Listing 7.12. Plotting distribution of prediction score grouped by known outcome
Listing 7.13. Exploring modeling trade-offs
Listing 7.14. Evaluating the chosen model
Listing 7.15. The model coefficients
Listing 7.16. The model summary
Listing 7.17. Computing deviance
Listing 7.18. Calculating the pseudo R-squared
Listing 7.19. Calculating the significance of the observed fit
Listing 7.20. Calculating the Akaike information criterion
Listing 7.21. Preparing the cars data
Listing 7.22. Fitting a logistic regression model
Listing 7.23. Looking at the model summary
Listing 7.24. Looking at the logistic model’s coefficients
Listing 7.25. The logistic model’s test performance
Listing 7.26. Fitting the ridge regression model
Listing 7.27. Looking at the ridge model’s coefficients
Listing 7.28. Looking at the ridge model’s test performance
Listing 7.29. The lasso model’s coefficients
Listing 7.30. The lasso model’s test performance
Listing 7.31. Cross-validating for both alpha and lambda
Chapter 8. Advanced data preparation
Listing 8.1. Preparing the KDD data for analysis
Listing 8.2. Attempting to model without preparation
Listing 8.3. Trying just one variable
Listing 8.4. Basic data preparation for classification
Listing 8.5. Preparing data with vtreat
Listing 8.6. Advanced data preparation for classification
Listing 8.7. Basic variable recoding and selection
Listing 8.8. An information-free dataset
Chapter 9. Unsupervised methods
Listing 9.1. Reading the protein data
Listing 9.2. Rescaling the dataset
Listing 9.3. Hierarchical clustering
Listing 9.4. Extracting the clusters found by hclust()
Listing 9.5. Projecting the clusters on the first two principal components
Listing 9.6. Running clusterboot() on the protein data
Listing 9.7. Calculating total within sum of squares
Listing 9.8. Plotting WSS for a range of k
Listing 9.9. Plotting BSS and WSS as a function of k
Listing 9.10. The Calinski-Harabasz index
Listing 9.11. Running k-means with k = 5
Listing 9.12. Plotting cluster criteria
Listing 9.13. Running clusterboot() with k-means
Listing 9.14. A function to assign points to a cluster
Listing 9.15. Generating and clustering synthetic data
Listing 9.16. Unscaling the centers
Listing 9.17. An example of assigning points to clusters
Listing 9.18. Reading in the book data
Listing 9.19. Examining the transaction data
Listing 9.20. Examining the size distribution
Listing 9.21. Counting how often each book occurs
Listing 9.22. Finding the 10 most frequently occurring books
Listing 9.23. Finding the association rules
Listing 9.25. Getting the five most confident rules
Listing 9.26. Finding rules with restrictions
Chapter 10. Exploring advanced methods
Listing 10.1. Preparing Spambase data and evaluating a decision tree model
Listing 10.2. Bagging decision trees
Listing 10.3. Using random forests
Listing 10.4. randomForest variable importances
Listing 10.5. Fitting with fewer variables
Listing 10.6. Loading the iris data
Listing 10.7. Cross-validating to determine model size
Listing 10.8. Fitting an xgboost model
Listing 10.9. Loading the natality data
Listing 10.10. Using vtreat to prepare data for xgboost
Listing 10.11. Fitting and applying an xgboost model for birth weight
Listing 10.12. Preparing an artificial problem
Listing 10.13. Applying linear regression to the artificial example
Listing 10.14. Applying GAM to the artificial example
Listing 10.15. Comparing linear regression and GAM performance
Listing 10.16. Extracting a learned spline from a GAM
Listing 10.17. Applying linear regression (with and without GAM) to health data
Listing 10.18. Plotting GAM results
Listing 10.19. Checking GAM model performance on holdout data
Listing 10.20. GLM logistic regression
Listing 10.21. GAM logistic regression
Listing 10.22. Setting up the spirals data as a classification problem
Listing 10.23. SVM with a poor choice of kernel
Chapter 11. Documentation and deployment
Listing 11.1. R-annotated Markdown
Listing 11.2. Using the system() command to compute a file hash
Listing 11.3. Calculating model performance
Listing 11.5. Example code comments
Listing 11.6. Checking your project status
Listing 11.7. Checking your project history
Listing 11.8. Finding out who committed what
Listing 11.9. Finding line-based differences between two committed versions
Appendix A. Starting with R and other tools
Listing A.1. Trying a few R commands
Listing A.2. Binding values to function arguments
Listing A.3. Demonstrating side effects
Listing A.4. R truth tables for Boolean operators
Listing A.5. Call-by-value effect
Listing A.6. Examples of R indexing operators
Appendix B. Important statistical concepts
Listing B.1. Plotting the theoretical normal density
Listing B.2. Plotting an empirical normal density
Listing B.3. Working with the normal CDF
Listing B.4. Plotting x < qnorm(0.75)
Listing B.5. Demonstrating some properties of the lognormal distribution
Listing B.6. Plotting the lognormal distribution
Listing B.7. Plotting the binomial distribution
Listing B.8. Working with the theoretical binomial distribution
Listing B.9. Simulating a binomial distribution
Listing B.10. Working with the binomial distribution
Listing B.11. Working with the binomial CDF
Listing B.12. Building simulated A/B test data
Listing B.13. Summarizing the A/B test into a contingency table
Listing B.14. Calculating the observed A and B conversion rates
Listing B.15. Calculating the significance of the observed difference in rates
Listing B.16. Computing frequentist significance
Listing B.17. Building synthetic uncorrelated income
Listing B.18. Calculating the (non)significance of the observed correlation
Listing B.19. Misleading significance result from biased observations
Listing B.20. Plotting biased view of income and capital gains
Listing B.21. Summarizing our synthetic biological data
Listing B.22. Building data that improves over time