About This Book

This book is about data science: a field that uses results from statistics, machine learning, and computer science to create predictive models. Because of the broad nature of data science, it’s important to discuss it a bit and to outline the approach we take in this book.

What is data science?

The statistician William S. Cleveland defined data science as an interdisciplinary field larger than statistics itself. We define data science as managing the process that can transform hypotheses and data into actionable predictions. Typical predictive analytic goals include predicting who will win an election, what products will sell well together, which loans will default, and which advertisements will be clicked on. The data scientist is responsible for acquiring and managing the data, choosing the modeling technique, writing the code, and verifying the results.

Because data science draws on so many disciplines, it’s often a “second calling.” Many of the best data scientists we meet started as programmers, statisticians, business intelligence analysts, or scientists. By adding a few more techniques to their repertoire, they became excellent data scientists. That observation drives this book: we introduce the practical skills needed by the data scientist by concretely working through all of the common project steps on real data. Some steps you’ll know better than we do, some you’ll pick up quickly, and some you may need to research further.

Much of the theoretical basis of data science comes from statistics. But data science as we know it is strongly influenced by technology and software engineering methodologies, and has largely evolved in heavily computer science– and information technology– driven groups. We can call out some of the engineering flavor of data science by listing some famous examples:

  • Amazon’s product recommendation systems
  • Google’s advertisement valuation systems
  • LinkedIn’s contact recommendation system
  • Twitter’s trending topics
  • Walmart’s consumer demand projection systems

These systems share a lot of features:

  • All of these systems are built off large datasets. That’s not to say they’re all in the realm of big data. But none of them could’ve been successful if they’d only used small datasets. To manage the data, these systems require concepts from computer science: database theory, parallel programming theory, streaming data techniques, and data warehousing.
  • Most of these systems are online or live. Rather than producing a single report or analysis, the data science team deploys a decision procedure or scoring procedure to either directly make decisions or directly show results to a large number of end users. The production deployment is the last chance to get things right, as the data scientist can’t always be around to explain defects.
  • All of these systems are allowed to make mistakes at some non-negotiable rate.
  • None of these systems are concerned with cause. They’re successful when they find useful correlations and are not held to correctly sorting cause from effect.

This book teaches the principles and tools needed to build systems like these. We teach the common tasks, steps, and tools used to successfully deliver such projects. Our emphasis is on the whole process—project management, working with others, and presenting results to nonspecialists.

Roadmap

This book covers the following:

  • Managing the data science process itself. The data scientist must have the ability to measure and track their own project.
  • Applying many of the most powerful statistical and machine learning techniques used in data science projects. Think of this book as a series of explicitly worked exercises in using the R programming language to perform actual data science work.
  • Preparing presentations for the various stakeholders: management, users, deployment team, and so on. You must be able to explain your work in concrete terms to mixed audiences with words in their common usage, not in whatever technical definition is insisted on in a given field. You can’t get away with just throwing data science project results over the fence.

We’ve arranged the book topics in an order that we feel increases understanding. The material is organized as follows.

Part 1 describes the basic goals and techniques of the data science process, emphasizing collaboration and data. Chapter 1 discusses how to work as a data scientist. Chapter 2 works through loading data into R and shows how to start working with R.

Chapter 3 teaches what to first look for in data and the important steps in characterizing and understanding data. Data must be prepared for analysis, and data issues will need to be corrected. Chapter 4 demonstrates how to correct the issues identified in chapter 3.

Chapter 5 covers one more data preparation step: basic data wrangling. Data is not always available to the data scientist in a form or “shape” best suited for analysis. R provides many tools for manipulating and reshaping data into the appropriate structure; they are covered in this chapter.

Part 2 moves from characterizing and preparing data to building effective predictive models. Chapter 6 supplies a mapping of business needs to technical evaluation and modeling techniques. It covers the standard metrics and procedures used to evaluate model performance, and one specialized technique, LIME, for explaining specific predictions made by a model.

Chapter 7 covers basic linear models: linear regression, logistic regression, and regularized linear models. Linear models are the workhorses of many analytical tasks, and are especially helpful for identifying key variables and gaining insight into the structure of a problem. A solid understanding of them is immensely valuable for a data scientist.

Chapter 8 temporarily moves away from the modeling task to cover more advanced data treatment: how to prepare messy real-world data for the modeling step. Because understanding how these data treatment methods work requires some understanding of linear models and of model evaluation metrics, it seemed best to defer this topic until part 2.

Chapter 9 covers unsupervised methods: modeling methods that do not use labeled training data. Chapter 10 covers more advanced modeling methods that increase prediction performance and fix specific modeling issues. The topics covered include tree-based ensembles, generalized additive models, and support vector machines.

Part 3 moves away from modeling and back to process. We show how to deliver results. Chapter 11 demonstrates how to manage, document, and deploy your models. You’ll learn how to create effective presentations for different audiences in chapter 12.

The appendixes include additional technical details about R, statistics, and more tools that are available. Appendix A shows how to install R, get started working, and work with other tools (such as SQL). Appendix B is a refresher on a few key statistical ideas.

The material is organized in terms of goals and tasks, bringing in tools as they’re needed. The topics in each chapter are discussed in the context of a representative project with an associated dataset. You’ll work through a number of substantial projects over the course of this book. All the datasets referred to in this book are at the book’s GitHub repository, https://github.com/WinVector/PDSwR2. You can download the entire repository as a single zip file (one of GitHub’s services), clone the repository to your machine, or copy individual files as needed.

Audience

To work the examples in this book, you’ll need some familiarity with R and statistics. We recommend you have some good introductory texts already on hand. You don’t need to be expert in R before starting the book, but you will need to be familiar with it.

To start with R, we recommend Beyond Spreadsheets with R by Jonathan Carroll (Manning, 20108) or R in Action by Robert Kabacoff (now available in a second edition: http://www.manning.com/kabacoff2/), along with the text’s associated website, Quick-R (http://www.statmethods.net). For statistics, we recommend Statistics, Fourth Edition, by David Freedman, Robert Pisani, and Roger Purves (W. W. Norton & Company, 2007).

In general, here’s what we expect from our ideal reader:

  • An interest in working examples. By working through the examples, you’ll learn at least one way to perform all steps of a project. You must be willing to attempt simple scripting and programming to get the full value of this book. For each example we work, you should try variations and expect both some failures (where your variations don’t work) and some successes (where your variations outperform our example analyses).
  • Some familiarity with the R statistical system and the will to write short scripts and programs in R. In addition to Kabacoff, we list a few good books in appendix C. We’ll work specific problems in R; you’ll need to run the examples and read additional documentation to understand variations of the commands we didn’t demonstrate.
  • Some comfort with basic statistical concepts such as probabilities, means, standard deviations, and significance. We’ll introduce these concepts as needed, but you may need to read additional references as we work through examples. We’ll define some terms and refer to some topic references and blogs where appropriate. But we expect you will have to perform some of your own internet searches on certain topics.
  • A computer (macOS, Linux, or Windows) to install R and other tools on, as well as internet access to download tools and datasets. We strongly suggest working through the examples, examining R help() on various methods, and following up with some of the additional references.

What is not in this book?

  • This book is not an R manual. We use R to concretely demonstrate the important steps of data science projects. We teach enough R for you to work through the examples, but a reader unfamiliar with R will want to refer to appendix A as well as to the many excellent R books and tutorials already available.
  • This book is not a set of case studies. We emphasize methodology and technique. Example data and code is given only to make sure we’re giving concrete, usable advice.
  • This book is not a big data book. We feel most significant data science occurs at a database or file manageable scale (often larger than memory, but still small enough to be easy to manage). Valuable data that maps measured conditions to dependent outcomes tends to be expensive to produce, and that tends to bound its size. For some report generation, data mining, and natural language processing, you’ll have to move into the area of big data.
  • This is not a theoretical book. We don’t emphasize the absolute rigorous theory of any one technique. The goal of data science is to be flexible, have a number of good techniques available, and be willing to research a technique more deeply if it appears to apply to the problem at hand. We prefer R code notation over beautifully typeset equations even in our text, as the R code can be directly used.
  • This is not a machine learning tinkerer’s book. We emphasize methods that are already implemented in R. For each method, we work through the theory of operation and show where the method excels. We usually don’t discuss how to implement them (even when implementation is easy), as excellent R implementations are already available.

Code conventions and downloads

This book is example driven. We supply prepared example data at the GitHub repository (https://github.com/WinVector/PDSwR2), with R code and links back to original sources. You can explore this repository online or clone it onto your own machine. We also supply the code to produce all results and almost all graphs found in the book as a zip file (https://github.com/WinVector/PDSwR2/raw/master/CodeExamples.zip), since copying code from the zip file can be easier than copying and pasting from the book. Instructions on how to download, install, and get started with all the suggested tools and example data can be found in appendix A, in section A.1.

We encourage you to try the example R code as you read the text; even when we’re discussing fairly abstract aspects of data science, we’ll illustrate examples with concrete data and code. Every chapter includes links to the specific dataset(s) that it references.

In this book, code is set with a fixed-width font like this to distinguish it from regular text. Concrete variables and values are formatted similarly, whereas abstract math will be in italic font like this. R code is written without any command-line prompts such as > (which is often seen when displaying R code, but not to be typed in as new R code). Inline results are prefixed by R’s comment character #. In many cases, the original source code has been reformatted; we’ve added line breaks and reworked indentation to accommodate the available page space in the book. In rare cases, even this was not enough, and listings include line-continuation markers (). Additionally, comments in the source code have often been removed from the listings when the code is described in the text. Code annotations accompany many of the listings, highlighting important concepts.

Working with this book

Practical Data Science with R is best read while working at least some of the examples. To do this we suggest you install R, RStudio, and the packages commonly used in the book. We share instructions on how to do this in section A.1 of appendix A. We also suggest you download all the examples, which include code and data, from our GitHub repository at https://github.com/WinVector/PDSwR2.

Downloading the book’s supporting materials/repository

The contents of the repository can be downloaded as a zip file by using the “download as zip” GitHub feature, as shown in the following figure, from the GitHub URL https://github.com/WinVector/PDSwR2.

Clicking on the “Download ZIP” link should download the compressed contents of the package (or you can try a direct link to the ZIP material: https://github.com/WinVector/PDSwR2/archive/master.zip). Or, if you are familiar with working with the Git source control system from the command line, you can do this with the following command from a Bash shell (not from R):

git clone https://github.com/WinVector/PDSwR2.git

In all examples, we assume you have either cloned the repository or downloaded and unzipped the contents. This will produce a directory named PDSwR2. Paths we discuss will start with this directory. For example, if we mention working with PDSwR2/UCICar, we mean to work with the contents of the UCICar subdirectory of wherever you unpacked PDSwR2. You can change R’s working directory through the setwd() command (please type help(setwd) in the R console for some details). Or, if you are using RStudio, the file-browsing pane can also set the working directory from an option on the pane’s gear/more menu. All of the code examples from this book are included in the directory PDSwR2/CodeExamples, so you should not need to type them in (though to run them you will have to be working in the appropriate data directory—not in the directory you find the code in).

The examples in this book are supplied in lieu of explicit exercises. We suggest working through the examples and trying variations. For example, in section 2.3.1, where we show how to relate expected income to schooling and gender, it makes sense to try relating income to employment status or even age. Data science requires curiosity about programming, functions, data, variables, and relations, and the earlier you find surprises in your data, the easier they are to work through.

Book forum

Purchase of Practical Data Science with R includes free access to a private web forum run by Manning Publications where you can make comments about the book, ask technical questions, and receive help from the author and from other users. To access the forum, go to https://forums.manning.com/forums/practical-data-science-with-r-second-edition. You can also learn more about Manning's forums and the rules of conduct at https://forums.manning.com/forums/about.

Manning’s commitment to our readers is to provide a venue where a meaningful dialogue between individual readers and between readers and the authors can take place. It is not a commitment to any specific amount of participation on the part of the authors, whose contribution to the forum remains voluntary (and unpaid). We suggest you try asking them some challenging questions lest their interest stray! The forum and the archives of previous discussions will be accessible from the publisher’s website as long as the book is in print.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset