Chapter 2. Starting with R and data

This chapter covers

  • Starting to work with R and data
  • Mastering R’s data frame structure
  • Loading data into R
  • Recoding data for later analysis

This chapter works through how to start working with R and how to import data into R from diverse sources. This will prepare you to work examples throughout the rest of the book.

Figure 2.1 is a diagram representing a mental model for the book that has been reshaded to emphasize the purpose of this chapter: starting to work with R and importing data into R. The overall diagram shows the data science process diagram from chapter 1 combined with a rebus form of the book title. In each chapter, we will reshade this mental model to indicate the parts of the data science process we are emphasizing. For example: in this chapter, we are mastering the initial steps of collecting and managing data, and touching on issues of practicality, data, and R (but not yet the art of science).

Figure 2.1. Chapter 2 mental model

Many data science projects start when someone points the analyst toward a bunch of data, and the analyst is left to make sense of it.[5] Your first thought may be to use ad hoc tools and spreadsheets to sort through it, but you will quickly realize that you’re taking more time tinkering with the tools than actually analyzing the data. Luckily, there’s a better way: using R. By the end of the chapter, you’ll be able to confidently use R to extract, transform, and load data for analysis.

5

We assume the reader is interested in working as an analyst, statistician, or data scientist, so we will alternate using these terms to represent people similar to the reader.

R without data is like going to the theater to watch the curtain go up and down.

Adapted from Ben Katchor’s Julius Knipl, Real Estate Photographer: Stories

2.1. Starting with R

R is open source software that runs well on Unix, Linux, Apple’s macOS, and Microsoft Windows. This book will concentrate on how to work as a data scientist. However, to work the examples, the reader must be familiar with R programming. If you want to pick up some prerequisite knowledge, we suggest consulting free manuals from CRAN (the main R package repository: https://cran.r-project.org/manuals.html) and other online materials. A number of good books for starting with R include these:

  • R in Action, Second Edition, Robert Kabacoff, Manning, 2015
  • Beyond Spreadsheets with R, Jonathan Carroll, Manning, 2018
  • The Art of R Programming, Norman Matloff, No Starch Press, 2011
  • R for Everyone, Second Edition, Jared P. Lander, Addison-Wesley, 2017

Each book has a different teaching style, and some include material on statistics, machine learning, and data engineering. A little research may tell you which books work well for you. This book will concentrate on working substantial data science examples, demonstrating the steps needed to overcome typical issues found in your own future real-world applications.

It is our opinion that data science is repeatable: the same job rerun on the same data should give a similar quality result (the exact result may vary due to numeric issues, timing issues, issues arising from parallelism, and issues around pseudo-random numbers). In fact, we should insist on repeatability. This is why we are discussing programming in a data science book. Programming is the reliable way to specify a reusable sequence of operations. With this in mind, one should always consider a data refresh (getting newer, corrected, or larger data) as a good thing, because rerunning an analysis should be, by design, very easy. An analysis that has a number of steps performed by hand is never going to be easy to repeat.

2.1.1. Installing R, tools, and examples

We suggest you follow the steps in section A.1 of appendix A to install R, packages, tools, and the book examples.

Look for help

R includes a very nice help system. To get help on an R command, just run the help() command in the R console. For example, to see details about how to change directories, you would type help(setwd). You must know the name of the function to get help, so we strongly recommend keeping notes. For some simple functions, we will not explain the function and leave it to the reader to call help() to work out what the function does.
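For example, both of the following open the same documentation page (the ? operator is simply R's shorthand for help()):

help(setwd)   # documentation for setwd()
?setwd        # equivalent shorthand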

2.1.2. R programming

In this section, we will briefly describe some R programming conventions, semantics, and style issues. Details can be found in package-specific documentation, the R help() system, and by trying variations of the examples we present here. Here, we’ll concentrate on aspects that differ from other common programming languages, and conventions that we emphasize in the book. This should help you get into an R frame of mind.

There are a number of common R coding style guides. Coding style is an attempt to make things more consistent, clear, and readable. This book will follow a style variation we have found to be very effective in teaching and code maintenance. Obviously, our style is just one among many, and is in no way mandatory. Good starting references include these:

  • The Google R Style Guide
  • The style guide in Hadley Wickham’s Advanced R

We will try to minimize differences from current convention and call out where we have such differences. We also recommend “R tips and tricks” from the author’s blog.[6]

6

R is a rich and broad language, often with many ways to accomplish the same task. This represents a bit of an initial learning curve, as the meaning of R programs can be hard to discern until you get familiar with the notation. However, time spent reviewing some of the basic notation is well rewarded, as it will make working through the coming substantial examples much easier. We understand the grammar of R is itself uninteresting to the reader coming here to learn data science methods and practices (our exact target audience!), but this small initial exertion prevents a lot of confusion later. We will use this section to describe a bit of R’s notation and meaning, concentrating on that which is particularly useful and surprising. All the following are small and basic points, but many of them are subtle and worth experimenting with.

Prefer working code

Prefer programs, scripts, or code that works but does not yet do everything you want. Instead of writing a large, untested program or script that embodies every desired step of the analysis, write a program that performs one step correctly, and then iteratively revise the script to perform more steps correctly. This discipline of always moving from one working revision to the next usually gets to a final correct result much faster than attempting to debug a large, faulty system into correctness.

Examples and the comment character (#)

In examples, we will show R commands as free text, with the results prefixed by the hash mark, #, which is R’s comment character. In many examples, we will include the results after the commands, prefixed with the comment character. R printing usually includes array cell indices in square braces and often involves line wrapping. For example, printing the integers 1 through 25 looks like the following:

print(seq_len(25))
# [1]  1  2  3  4  5  6  7  8  9 10 11 12
# [13] 13 14 15 16 17 18 19 20 21 22 23 24
# [25] 25

Notice the numbers were wrapped to three lines, and each line starts with the index of the first cell reported on the line inside square brackets. Sometimes we will not show results, an extra encouragement to work these particular examples.

Printing

R has a number of rules that turn implicit or automatic printing on and off. Some packages such as ggplot2 use printing to trigger their intended operations. Typing a value usually triggers printing the value. Care must be taken in a function or a for loop, as in these contexts, R’s automatic printing of results is disabled. Printing of very large objects can be a problem, so you want to avoid printing objects of unknown size. Implicit printing can often be forced by adding extra parentheses such as in “(x <- 5)”.
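A few small examples of these rules, which you may want to try at the R console:

x <- 5                       # plain assignment does not print
x                            # typing a value at the console prints it
# [1] 5
(x <- 7)                     # extra parentheses force printing of an assignment
# [1] 7
for(i in 1:2) { i }          # prints nothing: automatic printing is off inside loops
for(i in 1:2) { print(i) }   # explicit print() works everywhere
# [1] 1
# [1] 2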

Vectors and lists

Vectors (sequential arrays of values) are fundamental R data structures. Lists can hold different types in each slot; vectors can only hold the same primitive or atomic type in each slot. In addition to numeric indexing, both vectors and lists support name-keys. Retrieving items from a list or vector can be done by the operators shown next.

Vector indexing

R vectors and lists are indexed from 1, and not from 0 as with many other programming languages.

example_vector <- c(10, 20, 30)                1
example_list <- list(a = 10, b = 20, c = 30)   2

example_vector[1]                              3
 ## [1] 10
example_list[1]
## $a
## [1] 10

example_vector[[2]]                            4
 ## [1] 20
example_list[[2]]
## [1] 20

example_vector[c(FALSE, TRUE, TRUE)]           5
 ## [1] 20 30
example_list[c(FALSE, TRUE, TRUE)]
## $b
## [1] 20
##
## $c
## [1] 30

example_list$b                                 6
 ## [1] 20

example_list[["b"]]
## [1] 20

  • 1 Builds an example vector. c() is R’s concatenate operator—it builds longer vectors and lists from shorter ones without nesting. For example, c(1) is just the number 1, and c(1, c(2, 3)) is equivalent to c(1, 2, 3), which in turn is the integers 1 through 3 (though stored in a floating-point format).
  • 2 Builds an example list
  • 3 Demonstrates vector and list use of []. Notice that for the list, [] returns a new short list, not the item.
  • 4 Demonstrates vector and list use of [[]]. In common cases, [[]] forces a single item to be returned, though for nested lists of complex type, this item itself could be a list.
  • 5 Vectors and lists can be indexed by vectors of logicals, integers, and (if the vector or list has names) characters.
  • 6 For named examples, the syntax example_list$b is essentially a short-hand for example_list[["b"]] (the same is true for named vectors).

We will not always share so many notes for every example, but we invite the reader to work as if there were such notes by calling help() on every function or command used. Also, we very much encourage trying variations. In R “errors” are just R’s way of saying it safely refused to complete an ill-formed operation (an error does not indicate “crash,” and results are not corrupted). So fear of errors should not limit experiments.

x <- 1:5
print(x)                                                                   1
# [1] 1 2 3 4 5

x <- cumsumMISSPELLED(x)                                                   2
# Error in cumsumMISSPELLED(x) : could not find function "cumsumMISSPELLED"

print(x)                                                                   3
# [1] 1 2 3 4 5

x <- cumsum(x)                                                             4
print(x)
# [1]  1  3  6 10 15

  • 1 Defines a value we are interested in and stores it in the variable x
  • 2 Attempts, and fails, to assign a new result to x
  • 3 Notice that in addition to supplying a useful error message, R preserves the original value of x.
  • 4 Tries the operation again, using the correct spelling of cumsum(). cumsum(), short for cumulative sum, is a useful function that computes running totals quickly.

Another aspect of vectors in R is that most R operations are vectorized. A function or operator is called vectorized when applying it to a vector is shorthand for applying a function to each entry of the vector independently. For example, the function nchar() counts how many characters are in a string. In R this function can be used on a single string, or on a vector of strings.

Lists and vectors are R’s map structures

Lists and vectors are R’s map structures. They can map strings to arbitrary objects. The primary list operations [], match(), and %in% are vectorized. This means that, when applied to a vector of values, they return a vector of results by performing one lookup per entry. For pulling individual elements out of a list, use the double-bracket notation [[]].

nchar("a string")
# [1] 8

nchar(c("a", "aa", "aaa", "aaaa"))
# [1] 1 2 3 4
Logical operations

R’s logical operators come in two flavors. R has standard infix scalar-valued operators that expect single values on each side and have the same behavior and names as you would see in C or Java: && and ||. R also has vectorized infix operators that work on vectors of logical values: & and |. Be sure to always use the scalar versions (&& and ||) in situations such as if statements, and the vectorized versions (& and |) when processing logical vectors.
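A small sketch contrasting the two flavors:

if(TRUE && FALSE) { print("both") } else { print("not both") }   # scalar form for control flow
# [1] "not both"
c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)                      # vectorized form, element by element
# [1]  TRUE FALSE FALSE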

NULL and NA (not available) values

In R NULL is just a synonym for the empty or length-zero vector formed by using the concatenate operator c() with no arguments. For example, when we type c() into the R console, we will see the value NULL returned. In R NULL is not any sort of invalid pointer (as it is in most C/Java-related languages). NULL is simply a length-zero vector. Concatenating NULL is a safe and well-defined operation (in fact it’s a “no operation” or “no-op” that does nothing). For example, c(c(), 1, NULL) is perfectly valid and returns the value 1.

NA stands for “not available” and is fairly unique to R. Most any simple type can take on the value NA. For example, the vector c("a", NA, "c") is a vector of three character strings where we do not know the value of the second entry. Having NA is a great convenience as it allows us to annotate missing or unavailable values in place, which can be critical in data processing. NA behaves a little bit like the NaN value in floating-point arithmetic,[7] except we are not restricted to using it only with floating-point types. Also, NA means “not available,” not invalid (as NaN denotes), so NA has some convenient rules such as the logical expression FALSE & NA simplifying to FALSE.
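A few quick checks of these rules at the R console:

c(c(), 1, NULL)    # concatenating NULL adds nothing
# [1] 1
length(NULL)       # NULL is the length-zero vector
# [1] 0
c("a", NA, "c")    # NA marks a missing value in place
# [1] "a" NA  "c"
FALSE & NA         # FALSE & NA is determined, so it simplifies to FALSE
# [1] FALSE
NA + 1             # most other operations involving NA remain NA
# [1] NA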

7

The limits of floating-point arithmetic, or how real numbers are commonly approximated in computers, are a common source of confusion and problems when working with numeric data. To appreciate the issues of working with numeric data, we recommend data scientists read David Goldberg’s 1991 Computing Surveys article, “What Every Computer Scientist Should Know About Floating-Point Arithmetic,” which has been publicly shared at https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html.

Identifiers

Identifiers or symbol names are how R refers to variables and functions. The Google R Style Guide insists on writing symbol names in what is called “CamelCase” (word boundaries in names are denoted by uppercase letters, as in “CamelCase” itself). The Advanced R guide recommends an underscore style where names inside identifiers are broken up with underscores (such as “day_one” instead of “DayOne”). Also, many R users use a dot to break up names within identifiers (such as “day.one”). In particular, important built-in R types such as data.frame and packages such as data.table use the dot notation convention.

We recommend using the underscore notation, but find we often must switch between conventions when working with others. If possible, avoid the dot convention, as this notation is usually used for other purposes in object-oriented languages and databases, and so needlessly confuses others.[8]

8

The dot notation likely comes from the Lisp world (which strongly influenced R) and the aversion to underscores likely is a holdover from when “_” was one of the usable assignment operators in R (it is no longer used as an assignment operator in R).

Line breaks

It is generally recommended to keep R source code lines at 80 columns or fewer. R accepts multiple-line statements as long as it remains unambiguous, at each line break, that the statement is not yet complete. For example, to break the single statement “1 + 2” into multiple lines, write the code as follows:

1 +
  2

Do not write code like the following, as the first line is itself a valid statement, creating ambiguity:

1
  + 2

The rule is this: break lines so that the text read so far is never itself a complete statement; otherwise R stops reading there and silently evaluates something you did not intend.

Semicolons

R allows semicolons as end-of-statement markers, but does not require them. Most style guides recommend not using semicolons in R code and certainly not using them at ends of lines.

Assignment

R has many assignment operators (see table 2.1); the preferred one is <-. = can be used for assignment in R, but is also used to bind argument values to argument names by name during function calls (so there is some potential ambiguity in using =).

Table 2.1. Primary R assignment operators

Operator   Purpose                                                            Example
<-         Assign the value on the right to the symbol on the left.          x <- 5  # assign the value of 5 to the symbol x
=          Assign the value on the right to the symbol on the left.          x = 5   # assign the value of 5 to the symbol x
->         Assign left to right, instead of the traditional right to left.   5 -> x  # assign the value of 5 to the symbol x

Left-hand sides of assignments

Many popular programming languages only allow assignment of values into variable names or symbols. R allows slice expressions on the left-hand sides of assignments, and both numeric and logical array indexing. This allows for very powerful array-slicing commands and coding styles. For example, we can replace all the missing values (denoted by “NA”) in a vector with zero as shown in the following example:

d <- data.frame(x = c(1, NA, 3))    1
print(d)
#    x
# 1  1
# 2 NA
# 3  3                              2

d$x[is.na(d$x)] <- 0                3
print(d)
#   x
# 1 1
# 2 0
# 3 3

  • 1 “data.frame” is R’s tabular data type, and the most important data type in R. A data.frame holds data organized in rows and columns.
  • 2 When printing data.frames, row numbers are shown in the first (unnamed) column, and column values are shown under their matching column names.
  • 3 We can place a slice or selection of the x column of d on the left-hand side of an assignment to easily replace all NA values with zero.
Factors

R can handle many kinds of data: numeric, logical, integer, strings (called character types), and factors. Factors are an R type that encodes a fixed set of strings as integers. Factors can save a lot on storage while appearing to behave as strings. However, factors can have potentially confusing interactions with the as.numeric() command (which returns the factor codes for factors, but parses text for character types). Factors also encode the entire set of allowed values, which is useful—but can make combining data from different sources (that saw different sets of values) a bit of a chore. To avoid issues, we suggest delaying conversion of strings to factors until late in an analysis. This is usually accomplished by adding the argument stringsAsFactors = FALSE to functions such as data.frame() or read.table(). We, however, do encourage using factors when you have a reason, such as wanting to use summary() or preparing to produce dummy indicators (see “A bit more on factor coding” after listing 2.10 for more details on dummy indicators and their relation to factors).
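To see the as.numeric() interaction described above, here is a small sketch (the example values are made up):

f <- factor(c("20", "10", "30"))
as.numeric(f)                   # returns the internal level codes, not the original values
# [1] 2 1 3
as.numeric(as.character(f))     # convert to character first to recover the values
# [1] 20 10 30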

Named arguments

R is centered around applying functions to data. Functions that take a large number of arguments rapidly become confusing and illegible. This is why R includes a named argument feature. As an example, if we wanted to set our working directory to “/tmp” we would usually use the setwd() command like so: setwd("/tmp"). However, help(setwd) shows us the first argument to setwd() has the name dir, so we could also write this as setwd(dir = "/tmp"). This becomes useful for functions that have a large number of arguments, and for setting optional function arguments. Note: named arguments must be set by =, and not by an assignment operator such as <-.

If you have a procedure with 10 parameters, you probably missed some.

Alan Perlis, “Epigrams on Programming,” ACM SIGPLAN Notices 17

Package notation

In R there are two primary ways to use a function from a package. The first is to attach the package with the library() command and then use the function name. The second is to use the package name and then :: to name the function. An example of this second method is stats::sd(1:5). The :: notation is good to avoid ambiguity or to leave a reminder of which package the function came from for when you read your own code later.
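As a small illustration, the two calling styles compute the same result (the stats package is attached by default, so the library() step is shown only for completeness):

library("stats")   # attach the package, then call the function by name
sd(1:5)
# [1] 1.581139
stats::sd(1:5)     # or qualify the function with its package name
# [1] 1.581139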

Value semantics

R is unusual in that it efficiently simulates “copy by value" semantics. Any time a user has two references to data, each evolves independently: changes to one do not affect the other. This is very desirable for part-time programmers and eliminates a large class of possible aliasing bugs when writing code. We give a quick example here:

d <- data.frame(x = 1, y = 2)     1
d2 <- d                           2
d$x <- 5                          3

print(d)
#   x y
# 1 5 2

print(d2)
#   x y
# 1 1 2

  • 1 Creates some example data and refers to it by the name d
  • 2 Creates an additional reference d2 to the same data
  • 3 Alters the value referred to by d

Notice d2 keeps the old value of 1 for x. This feature allows for very convenient and safe coding. Many programming languages protect references or pointers in function calls in this manner; however, R protects complex values and does so in all situations (not just function calls). Some care has to be taken when you want to share back changes, such as invoking a final assignment such as d2 <- d after all desired changes have been made. In our experience, R’s value isolation semantics prevents far more issues than the copy-back inconvenience it introduces.

Organizing intermediate values

Long sequences of calculations can become difficult to read, debug, and maintain. To avoid this, we suggest reserving the variable named “.” to store intermediate values. The idea is this: work slow to move fast. For example: a common data science problem is to sort revenue records and then calculate what fraction of total revenue is achieved up to a given sorting key. In R this can be done easily by breaking this task into small steps:

data <- data.frame(revenue = c(2, 1, 2),                        1
                   sort_key = c("b", "c", "a"),
                   stringsAsFactors = FALSE)
print(data)
#   revenue sort_key
# 1       2        b
# 2       1        c
# 3       2        a

. <- data                                                       2
. <- .[order(.$sort_key), , drop = FALSE]                       3
.$ordered_sum_revenue <- cumsum(.$revenue)
.$fraction_revenue_seen <- .$ordered_sum_revenue/sum(.$revenue)
result <- .                                                     4

print(result)
#   revenue sort_key ordered_sum_revenue fraction_revenue_seen
# 3       2        a                   2                   0.4
# 1       2        b                   4                   0.8
# 2       1        c                   5                   1.0

  • 1 Our notional, or example, data.
  • 2 Assign our data to a temporary variable named “.”. The original values will remain available in the “data” variable, making it easy to restart the calculation from the beginning if necessary.
  • 3 Use the order command to sort the rows. drop = FALSE is not strictly needed, but it is good to get in the habit of including it. For single-column data.frames without the drop = FALSE argument, the [,] indexing operator will convert the result to a vector, which is almost never the R user's true intent. The drop = FALSE argument turns off this conversion, and it is a good idea to include it “just in case” and a definite requirement when either the data.frame has a single column or when we don’t know if the data.frame has more than one column (as the data.frame comes from somewhere else).
  • 4 Assigns the result away from “.” to a more memorable variable name

The R package dplyr replaces the dot notation with what is called piped notation (supplied by another package named magrittr, and similar to JavaScript method chaining). Because dplyr is very popular, you are likely to see code written in this style, and we will use this style from time to time to help prepare you for such code. However, it is important to remember that dplyr is merely a popular alternative to standard R code, not a superior alternative.

library("dplyr")

result <- data %>%
  arrange(., sort_key) %>%
  mutate(., ordered_sum_revenue = cumsum(revenue)) %>%
  mutate(., fraction_revenue_seen = ordered_sum_revenue/sum(revenue))

Each step of this example has been replaced by the corresponding dplyr equivalent. arrange() is dplyr’s replacement for order(), and mutate() is dplyr’s replacement for assignment. The code translation is line by line, with the minor exception that assignment is written first (even though it happens after all other steps). The calculation steps are sequenced by the magrittr pipe symbol %>%.

The magrittr pipe allows you to write any of x %>% f, x %>% f(), or x %>% f(.) in place of f(x). Typically, x %>% f is the notation taught; however, we feel x %>% f(.) is the most explicit in representing what is happening.[9]
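As a quick sketch of the three equivalent forms (using sqrt() merely as a stand-in function):

library("magrittr")    # the package supplying %>% (also attached by dplyr)

4 %>% sqrt
# [1] 2
4 %>% sqrt()
# [1] 2
4 %>% sqrt(.)
# [1] 2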

9

For our own work, we actually prefer to use the “dot pipe” %.>% from the wrapr package, which enforces more notational consistency.

The details of the dplyr notation can be found here: http://dplyr.tidyverse.org/articles/dplyr.html. Be aware that debugging long dplyr pipelines is difficult, and during development and experimentation it makes sense to break dplyr pipelines into smaller steps, storing intermediate results into temporary variables.

The intermediate result notation has the advantages that it is easy to both restart and step-debug. In this book, we will use different notations as convenient.

The data.frame class

The R data.frame class is designed to store data in a very good “ready for analysis” format. data.frames are two-dimensional arrays where each column represents a variable, measurement, or fact, and each row represents an individual or instance. In this format, an individual cell represents what is known about a single fact or variable for a single instance. data.frames are implemented as a named list of column vectors (list columns are possible, but they are more of the exception than the rule for data.frames). In a data.frame, all columns have the same length, and this means we can think of the kth entry of all columns as forming a row.

Operations on data.frame columns tend to be efficient and vectorized. Adding, looking up, and removing columns is fast. Operations per row on a data.frame can be expensive, so you should prefer vectorized column notations for large data.frame processing.

R’s data.frame is much like a database table in that it has schema-like information: an explicit list of column names and column types. Most analyses are best expressed in terms of transformations over data.frame columns.

Let R do the work for you

Most common statistical or data processing operations already have a good implementation either in “base R” (R itself and its core packages such as utils and stats) or in an extension package. If you do not delegate to R, you end up fighting R. For example, a programmer coming from Java might expect to have to use a for loop to add every row of values from two data columns. In R, adding two data columns is considered fundamental and achieved as follows:

d <- data.frame(col1 = c(1, 2, 3), col2 = c(-1, 0, 1))
d$col3 <- d$col1 + d$col2
print(d)
#   col1 col2 col3
# 1    1   -1    0
# 2    2    0    2
# 3    3    1    4

data.frames are in fact named lists of columns. We will use them throughout the book. In R one tends to work over columns and let R’s vectorized nature perform the specified operation over every row at once. If you find yourself iterating over rows in R, you are fighting the language.

Search for ready-made solutions

Searching for the right R function can be tedious, but it is well worth the time (especially if you keep searchable notes). R was designed for data analysis, so the most common steps needed in a data analysis have already been implemented well in R, though possibly under an obscure name and possibly with odd default settings. It is as chemist Frank Westheimer said, “A couple of months in the laboratory can frequently save a couple of hours in the library.”[10] This is a deliberately ironic restatement of the move-fast-by-working-slow principle: researching available solutions costs time, but often saves far more direct coding time.

10

2.2. Working with data from files

The most common ready-to-go data format is in fact a family of tabular formats called structured values. Most of the data you find will be in (or nearly in) one of these formats. When you can read such files into R, you can analyze data from an incredible range of public and private data sources. In this section, we’ll work on two examples of loading data from structured files, and one example of loading data directly from a relational database. The point is to get data quickly into R so we can then use R to perform interesting analyses.

2.2.1. Working with well-structured data from files or URLs

The easiest data format to read is table-structured data with headers. As shown in figure 2.2, this data is arranged in rows and columns with a header showing the column names. Each column represents a different fact or measurement; each row represents an instance or datum about which we know the set of facts. A lot of public data is in this format, so being able to read it opens up a lot of opportunities.

Figure 2.2. Car data viewed as a table

Before we load the German credit data we used in the previous chapter, let’s demonstrate the basic loading commands with a simple dataset originally from the University of California Irvine Machine Learning Repository (http://archive.ics.uci.edu/ml/). The UCI data files tend to come without headers, so to save steps (and to keep things simple) we’ve pre-prepared our first data example from the UCI car dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/car/. Our pre-prepared file is included in the book support directory PDSwR2/UCICar (please see section FM.5.6 for instructions) and looks like the following:

buying,maint,doors,persons,lug_boot,safety,rating    1
vhigh,vhigh,2,2,small,low,unacc                      2
vhigh,vhigh,2,2,small,med,unacc
vhigh,vhigh,2,2,small,high,unacc
vhigh,vhigh,2,2,med,low,unacc
...

  • 1 The header row contains the names of the data columns, in this case separated by commas. When the separators are commas, the format is called comma-separated values, or .csv.
  • 2 The data rows are in the same format as the header row, but each row contains actual data values. In this case, the first row represents the set of name/value pairs: buying=vhigh, maintenance=vhigh, doors=2, persons=2, and so on.
Avoid “by hand” steps outside of R

We strongly encourage you to avoid performing steps “by hand” outside of R when importing data. It’s tempting to use an editor to add a header line to a file, as we did in our example. A better strategy is to write an R script to perform any necessary reformatting. Automating these steps greatly reduces the amount of trauma and work during the inevitable data refresh. Receiving new, better data should always feel like good news, and writing automated and replicable procedures is a big step in this direction.

Our example in section 2.2.2 will show how to add headers without editing files by hand as we did in this example.

Notice that this presentation is structured like a spreadsheet with easy-to-identify rows and columns. Each (non-header) row represents a review of a different model of car. The columns represent facts about each car model. Most of the columns are objective measurements (purchase cost, maintenance cost, number of doors, and so on), and the final subjective column “rating” is marked with the overall rating (vgood, good, acc, and unacc). These sorts of details come from the documentation found with the original data, and are key to projects (so we suggest keeping a lab book or notes).

Loading well-structured data

Loading data of this type into R is a one-liner: we use the R command utils::read.table() and we’re done.[11] To work this exercise, we assume that you have downloaded and unpacked the contents of this book’s GitHub repository https://github.com/WinVector/PDSwR2 and changed your working directory to PDSwR2/UCICar as explained in section “Working with this book” in the front matter (to do this, you will use the setwd() R function, and you will need to type in the full path to where you have saved PDSwR2, not just the text fragment we have shown). Once R is in the PDSwR2/UCICar directory, reading the data is done as shown in the following listing.

11

Another option is using functions from the readr package.

Listing 2.1. Reading the UCI car data
uciCar <- read.table(          1
    'car.data.csv',            2
    sep = ',',                 3
    header = TRUE,             4
    stringsAsFactors = TRUE    5
    )

View(uciCar)                   6

  • 1 Command to read from a file or URL and store the result in a new data frame object called uciCar
  • 2 Filename or URL to get the data from
  • 3 Specifies the column or field separator as a comma
  • 4 Tells R to expect a header line that defines the data column names
  • 5 Tells R to convert string values to factors. This is the default behavior, so we are just using this argument to document intent.
  • 6 Examines the data with R’s built-in table viewer

Listing 2.1 loads the data and stores it in a new R data frame object called uciCar, which we show a View() of in figure 2.2.

The read.table() command is powerful and flexible; it can accept many different types of data separators (commas, tabs, spaces, pipes, and others), and it has many options for controlling quoting and escaping data. read.table() can read from local files or remote URLs. If a resource name ends with the .gz suffix, read.table() assumes the file has been compressed in gzip style and will automatically decompress it while reading.
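For example, a minimal sketch of reading a gzip-compressed copy of the same data, assuming such a file exists under the hypothetical name car.data.csv.gz:

uciCar <- read.table(
    'car.data.csv.gz',        # hypothetical gzip-compressed copy; decompressed automatically
    sep = ',',
    header = TRUE,
    stringsAsFactors = TRUE)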

Examining our data

Once we’ve loaded the data into R, we’ll want to examine it. These are the commands to always try first:

  • class()— Tells you what kind of R object you have. In our case, class(uciCar) tells us the object uciCar is of class data.frame. Class is an object-oriented concept, which describes how an object is going to behave. R also has a (less useful) typeof() command, which reveals how the object’s storage is implemented.
  • dim()— For data frames, this command shows how many rows and columns are in the data.
  • head()— Shows the top few rows (or “head”) of the data. Example: head(uciCar).
  • help()— Provides the documentation for a class. In particular, try help(class(uciCar)).
  • str()— Gives you the structure for an object. Try str(uciCar).
  • summary()— Provides a summary of almost any R object. summary(uciCar) shows us a lot about the distribution of the UCI car data.
  • print()— Prints all the data. Note: for large datasets, this can take a very long time and is something you want to avoid.
  • View()— Displays the data in a simple spreadsheet-like grid viewer.
Many R functions are generic

Many R functions are generic in that they work much the same on many data types, or even object-oriented in that they pick a correct behavior depending on the runtime class of the object they are working with. We suggest that if you see a function used in an example on one object or class, try it on others. Common R functions that can be used on many different classes and types include length(), print(), saveRDS(), str(), and summary(). The R runtime is very robust and rewards experimentation. Most common errors are caught and cannot corrupt your data or crash the R interpreter. So please, experiment!

We show the results of a few of these steps next (R results are shown prefixed by “##” after each step).

Listing 2.2. Exploring the car data
class(uciCar)
## [1] "data.frame"               1
summary(uciCar)
##    buying      maint       doors
##  high :432   high :432   2    :432
##  low  :432   low  :432   3    :432
##  med  :432   med  :432   4    :432
##  vhigh:432   vhigh:432   5more:432
##
##  persons     lug_boot    safety
##  2   :576   big  :576   high:576
##  4   :576   med  :576   low :576
##  more:576   small:576   med :576
##
##    rating
##  acc  : 384
##  good :  69
##  unacc:1210
##  vgood:  65

dim(uciCar)
## [1] 1728    7                2

  • 1 The loaded object uciCar is of type data.frame.
  • 2 [1] is merely an output sequence marker. The actual information is this: uciCar has 1728 rows and 7 columns. Always try to confirm you got a good parse by at least checking that the number of rows is exactly one fewer than the number of lines of text in the original file. The difference of one is because the column header counts as a line of text, but not as a data row.

The summary() command shows us the distribution of each variable in the dataset. For example, we know each car in the dataset was declared to seat 2, 4, or more persons, and we know there were 576 two-seater cars in the dataset. Already we’ve learned a lot about our data, without having to spend a lot of time manually building pivot tables as we would have to in a spreadsheet.

Working with other data formats

.csv is not the only common data file format you’ll encounter. Other formats include .tsv (tab-separated values), pipe-separated (vertical bar) files, Microsoft Excel workbooks, JSON data, and XML. R’s built-in read.table() command can be made to read most separated value formats. Many of the deeper data formats have corresponding R packages:

  • CSV/TSV/FWF— The package readr (http://readr.tidyverse.org) supplies tools for reading “separated data” such as comma-separated values (CSV), tab-separated values (TSV), and fixed-width files (FWF).
  • SQL— https://CRAN.R-project.org/package=DBI
  • XLS/XLSX— http://readxl.tidyverse.org
  • .RData/.RDS— R has binary data formats (which can avoid complications of parsing, quoting, escaping, and loss of precision in reading and writing numeric or floating-point data as text). The .RData format is for saving sets of objects and object names, and is used through the save()/load() commands. The .RDS format is for saving single objects (without saving the original object name) and is used through the saveRDS()/readRDS() commands. For ad hoc work, .RData is more convenient (as it can save the entire R workspace), but for reusable work, the .RDS format is to be preferred as it makes saving and restoring a bit more explicit. To save multiple objects in .RDS format, we suggest using a named list (a short sketch of both formats follows this list).
  • JSON— https://CRAN.R-project.org/package=rjson
  • XML— https://CRAN.R-project.org/package=XML
  • MongoDB— https://CRAN.R-project.org/package=mongolite
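As promised, a minimal sketch of the two binary formats (the file paths are placeholders):

d <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

saveRDS(d, "d.RDS")          # .RDS: saves a single object, without its name
d2 <- readRDS("d.RDS")       # restore it under any name we like

save(d, file = "d.RData")    # .RData: saves objects together with their names
load("d.RData")              # restores an object named d into the workspace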

2.2.2. Using R with less-structured data

Data isn’t always available in a ready-to-go format. Data curators often stop just short of producing a ready-to-go machine-readable format. The German bank credit dataset discussed in chapter 1 is an example of this. This data is stored as tabular data without headers; it uses a cryptic coding of values that requires the dataset’s accompanying documentation to untangle. This isn’t uncommon and is often due to habits or limitations of other tools that commonly work with the data. Instead of reformatting the data before we bring it into R, as we did in the last example, we’ll now show how to reformat the data using R. This is a much better practice, as we can save and reuse the R commands needed to prepare the data.

Details of the German bank credit dataset can be found at http://mng.bz/mZbu, and we have included a copy of this data in the directory PDSwR2/Statlog. We’ll show how to transform this data into something meaningful using R. After these steps, you can perform the analysis already demonstrated in chapter 1. As we can see in our file excerpt, the data appears to initially be an incomprehensible block of codes:

A11 6 A34 A43 1169 A65 A75 4 A93 A101 4 ...
A12 48 A32 A43 5951 A61 A73 2 A92 A101 2 ...
A14 12 A34 A46 2096 A61 A74 2 A93 A101 3 ...
  ...
Transforming data in R

Data often needs a bit of transformation before it makes sense. In order to decrypt troublesome data, you need what’s called the schema documentation or a data dictionary. In this case, the included dataset description says the data is 20 input columns followed by one result column. In this example, there’s no header in the data file. The column definitions and the meaning of the cryptic A-* codes are all in the accompanying data documentation. Let’s start by loading the raw data into R. Start a copy of R or RStudio and type in the commands in the following listing.

Listing 2.3. Loading the credit dataset
setwd("PDSwR2/Statlog")                    1
 d <- read.table('german.data', sep=' ',
   stringsAsFactors = FALSE, header = FALSE)

  • 1 Replace this path with the actual path where you have saved PDSwR2.

As there was no column header in the file, our data.frame d will have useless column names of the form V#. We can change the column names to something meaningful with the c() command, as shown in the following listing.

Listing 2.4. Setting column names
d <- read.table('german.data',
                sep = " ",
                stringsAsFactors = FALSE, header = FALSE)

colnames(d) <- c('Status_of_existing_checking_account', 'Duration_in_month',
                 'Credit_history', 'Purpose', 'Credit_amount', 'Savings_account_bonds',
                 'Present_employment_since',
                 'Installment_rate_in_percentage_of_disposable_income',
                 'Personal_status_and_sex', 'Other_debtors_guarantors',
                 'Present_residence_since', 'Property', 'Age_in_years',
                 'Other_installment_plans', 'Housing',
                 'Number_of_existing_credits_at_this_bank', 'Job',
                 'Number_of_people_being_liable_to_provide_maintenance_for',
                 'Telephone', 'foreign_worker', 'Good_Loan')
str(d)
## 'data.frame':    1000 obs. of  21 variables:
##  $ Status_of_existing_checking_account                     : chr  "A11" "A
     12" "A14" "A11" ...
##  $ Duration_in_month                                       : int  6 48 12
     42 24 36 24 36 12 30 ...
##  $ Credit_history                                          : chr  "A34" "A
     32" "A34" "A32" ...
##  $ Purpose                                                 : chr  "A43" "A
     43" "A46" "A42" ...
##  $ Credit_amount                                           : int  1169 595
     1 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ Savings_account_bonds                                   : chr  "A65" "A
     61" "A61" "A61" ...
##  $ Present_employment_since                                : chr  "A75" "A
     73" "A74" "A74" ...
##  $ Installment_rate_in_percentage_of_disposable_income     : int  4 2 2 2
     3 2 3 2 2 4 ...
##  $ Personal_status_and_sex                                 : chr  "A93" "A
     92" "A93" "A93" ...
##  $ Other_debtors_guarantors                                : chr  "A101" "
     A101" "A101" "A103" ...
##  $ Present_residence_since                                 : int  4 2 3 4
     4 4 4 2 4 2 ...
##  $ Property                                                : chr  "A121" "
     A121" "A121" "A122" ...
##  $ Age_in_years                                            : int  67 22 49
     45 53 35 53 35 61 28 ...
##  $ Other_installment_plans                                 : chr  "A143" "
     A143" "A143" "A143" ...
##  $ Housing                                                 : chr  "A152" "
     A152" "A152" "A153" ...
##  $ Number_of_existing_credits_at_this_bank                 : int  2 1 1 1
     2 1 1 1 1 2 ...
##  $ Job                                                     : chr  "A173" "
     A173" "A172" "A173" ...
##  $ Number_of_people_being_liable_to_provide_maintenance_for: int  1 1 2 2
     2 2 1 1 1 1 ...
##  $ Telephone                                               : chr  "A192" "
     A191" "A191" "A191" ...
##  $ foreign_worker                                          : chr  "A201" "
     A201" "A201" "A201" ...
##  $ Good_Loan                                               : int  1 2 1 1
     2 1 1 1 1 2 ...

The c() command is R’s method to construct a vector.[12] We copied the column names directly from the dataset documentation. By assigning our vector of names into the data frame’s colnames(), we’ve reset the data frame’s column names to something sensible.

12

c() also concatenates lists or vectors, without introducing additional nesting.

Assigning to accessors

In R the data frame class has a number of data accessors such as colnames() and names(). Many of these data accessors can be assigned to, as we saw when we assigned new names in listing 2.4 with colnames(d) <- c('Status_of_existing_checking_account', ...). This ability to assign into accessors is a bit unusual, but a very useful feature of R.

The data documentation further tells us the column names, and also has a code dictionary of the meanings of all of the cryptic A-* codes. For example, it says in column 4 (now called Purpose, meaning the purpose of the loan) that the code A40 is a “new car loan,” A41 is a “used car loan,” and so on. We can use R’s list-mapping capabilities to remap the values to more descriptive terms. The file PDSwR2/Statlog/GCDSteps.Rmd is an R Markdown that includes all the steps up through now and also remaps the values from the A# forms to clearer names. The file first implements the dataset documentation’s value mapping as an R named vector. This allows us to change the illegible names (such as A11) into somewhat meaningful descriptions (such as ... < 0 DM, which itself is presumably shorthand for “zero or fewer deutsche marks reported”).[13] The first few lines of this map definition look like the following:

13

German currency at the time of data collection was the deutsche mark (DM).

mapping <- c('A11' = '... < 0 DM',
             'A12' = '0 <= ... < 200 DM',
             'A13' = '... >= 200 DM / salary assignments for at least 1 year',
             ...
                )

Note: In building a named map, you must use the argument binding symbol =, and not any of the assignment operators such as <-.

With the mapping list defined, we can then use the following for loop to convert values in each column that was of type character from the original cryptic A-* codes into short level descriptions taken directly from the data documentation. We, of course, skip any such transform for columns that contain numeric data.

Listing 2.5. Transforming the credit data
source("mapping.R")                             1
for(ci in colnames(d)) {                        2
    if(is.character(d[[ci]])) {
      d[[ci]] <- as.factor(mapping[d[[ci]]])    3
    }
}

  • 1 This file can be found at https://github.com/WinVector/PDSwR2/blob/master/Statlog/mapping.R.
  • 2 Prefer using column names to column indices.
  • 3 The [[]] notation is using the fact that data.frames are named lists of columns. So we are working on each column in turn. Notice the mapping lookup is vectorized: it is applied to all elements in the column in one step.

As we mentioned, the complete set of column preparations for this is in the R Markdown file PDSwR2/Statlog/GCDSteps.Rmd. We encourage readers to examine this file and try all of these steps themselves. For convenience, the prepared data is saved in PDSwR2/Statlog/creditdata.RDS.

Examining our new data

We can now easily examine the purpose of the first three loans with the command print(d[1:3,'Purpose']). We can look at the distribution of loan purpose with summary(d$Purpose). This summary is why we converted the values into factors, as summary() does not report much for string/character types, though we could also use table(d$Purpose, useNA = "always") directly on character types. We can also start to investigate the relation of loan type to loan outcome, as shown in the following listing.

Listing 2.6. Summary of Good_Loan and Purpose
setwd("PDSwR2/Statlog")                   1
 d <- readRDS("creditdata.RDS")           2

table(d$Purpose, d$Good_Loan)

##                       BadLoan GoodLoan
##   business                 34       63
##   car (new)                89      145
##   car (used)               17       86
##   domestic appliances       4        8
##   education                22       28
##   furniture/equipment      58      123
##   others                    5        7
##   radio/television         62      218
##   repairs                   8       14
##   retraining                1        8

  • 1 Sets the working directory. You will have to replace PDSwR2/Statlog with the actual full path to Statlog on your machine.
  • 2 Reads the prepared statlog data

From the output, we can see we have successfully loaded the data from the file. However, as mentioned, a lot of data is in other sources such as Excel spreadsheets (with the readxl package, these can be treated much like the way one works with files) and in databases (including big data systems such as Apache Spark). We will next discuss working with relational databases through the SQL query language and the DBI package.

2.3. Working with relational databases

In many production environments, the data you want lives in a relational or SQL database, not in files. Public data is often in files (as they are easier to share), but your most important client data is often in databases. Relational databases scale easily to hundreds of millions of records and supply important production features such as parallelism, consistency, transactions, logging, and audits. Relational databases are designed to support online transaction processing (OLTP), so they’re likely where transactions you need to know about were actually produced.

Often you can export the data into a structured file and use the methods from our previous sections to then transfer the data into R. But this is generally not the right way to do things. Exporting from databases to files is often unreliable and idiosyncratic due to loss of schema information, escaping, quoting, and character-encoding issues. The best way to work with data found in databases is to connect R directly to the database, which is what we’ll demonstrate in this section.

As a step of the demonstration, we’ll first show how to load data into a database. Relational databases are a good place for transformations such as joins or sampling (though packages such as sqldf and dplyr give R similar capabilities), which will be the topic of chapter 5. We will start working with data in a database for our next example.

2.3.1. A production-size example

For our production-size example, we’ll use the 2016 United States Census American Community Survey (ACS) Public Use Microdata Sample (PUMS) data, often called “ACS PUMS.” We have documentation on how to download and prepare a sample of this data in the directory PDSwR2/PUMS/download. We also have a ready-to-work-with recoded sample in the R data file PDSwR2/PUMS/PUMSsample.RDS, allowing you to skip the initial download and processing steps.

The PUMS data is ideal for setting up somewhat realistic data science scenarios: summarizing data and building models predicting one column of the data from other columns. We will return to this dataset later in this book.

The PUMS is a remarkable set of data involving around 3 million individuals and 1.5 million households. It is one of the few shared United States Census datasets that deals with individual people and households (instead of per-region summaries). This is important, as most common data science tasks are designed to use detailed per-individual records, so this is public data that is most like the private data a data scientist would work with. Each row contains over 200 facts about each individual or household (income, employment, education, number of rooms, and so on). The data has household cross-reference IDs so individuals can be joined to the household they’re in. The size of the dataset is interesting: a few gigabytes when compressed. So it’s small enough to store on a good network or thumb drive, but larger than is convenient to work with on a laptop with R in memory (which is more comfortable when working in the range of hundreds of thousands of rows).

Summaries or marginals

Moving from individual-oriented data to summaries or marginals is an easy process called summary statistics or basic analytics. Converting the other way is often not possible, or at best a deep statistical problem (beyond the scope of basic data science). Most United States Census data is shared as regional summaries, so it often requires sophisticated statistical imputation methodology to generate useful individual-level predictive models. The PUMS data is very useful because it is individually oriented.

Tens of millions of rows is a sweet spot size for relational database or SQL-assisted analysis on a single machine. We’re not yet forced to move into a database cluster or an Apache Spark cluster to do our work.

Curating the data

A hard rule of science is that you must be able to reproduce your results. At the very least, you should be able to repeat your own successful work through your recorded steps. Everything must either have directions on how to produce it or clear documentation on where it came from. We call this the “no alien artifacts” discipline. For example, when we said we’re using PUMS American Community Survey data, this statement isn’t precise enough for anybody to know what data we specifically mean. Our actual notebook entry (which we keep online, so we can search it) on the PUMS data is shown in the next listing.

Listing 2.7. PUMS data provenance documentation (PDSwR2/PUMS/download/LoadPUMS.Rmd)
Data downloaded 4/21/2018 from:                                          1

  https://www.census.gov/data/developers/data-sets/acs-1year.2016.html   2
  https://www.census.gov/programs-surveys/acs/technical-documentation/pums.html
  http://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMSDataDict16.txt
  https://www2.census.gov/programs-surveys/acs/data/pums/2016/1-Year/

First in a `bash` shell perform the following steps:

wget https://www2.census.gov/programs-surveys/acs/data/pums/2016/1-Year/csv_hus.zip    3
md5 csv_hus.zip
# MD5 (csv_hus.zip) = c81d4b96a95d573c1b10fc7f230d5f7a                                 4
wget https://www2.census.gov/programs-surveys/acs/data/pums/2016/1-Year/csv_pus.zip
md5 csv_pus.zip
# MD5 (csv_pus.zip) = 06142320c3865620b0630d74d74181db
wget http://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMSDataDict16.txt
md5 PUMSDataDict16.txt
# MD5 (PUMSDataDict16.txt) = 56b4e8fcc7596cc8b69c9e878f2e699a
unzip csv_hus.zip

  • 1 When we downloaded the data
  • 2 Where we found the data documentation. This is important to record, as many data files don’t contain links back to the documentation.
  • 3 The exact steps we took
  • 4 Cryptographic hashes of the file contents we downloaded. These are very short summaries (called hashes) that are highly unlikely to have the same value for different files. These summaries can later help us determine if another researcher in our organization is using the same data.
Keep notes

A big part of being a data scientist is being able to defend your results and repeat your work. We strongly advise keeping local copies of data and keeping a notebook. Notice that in listing 2.7 we not only show how and when we got the data, we also show what cryptographic hash the download had at the time. This is important to help ensure reproducible results and also to diagnose if and where something has changed. We also strongly advise keeping all of your scripts and code under version control, as we'll discuss in chapter 11. You absolutely need to be able to answer exactly what code and which data were used to build results you presented last week.

A particularly important form of note maintenance is using Git source control, which we will discuss in chapter 11.

Starting with the PUMS data

It is important to at least skim the downloaded PUMS data documentation: PDSwR2/PUMS/ACS2016_PUMS_README.pdf (a file that was in the downloaded zip container) and PDSwR2/PUMS/PUMSDataDict16.txt (one of the files we downloaded). Three things stand out: the data is distributed as comma-separated structured files with column headers, the values are coded as indecipherable integers (much like our earlier Statlog example), and the individuals are weighted to represent varying numbers of additional households. The R Markdown[14] script PDSwR2/PUMS/download/LoadPUMS.Rmd reads the CSV files (from a compressed intermediate file), recodes the values to more-meaningful strings, and takes a pseudo-random sample of the data with probabilities proportional to the specified household sampling weights. The proportional sampling both cuts the file size down to around 10 megabytes (a size easy to be distributed through GitHub) and builds a sample that can be used in a statistically correct manner, without further reference to the Census weights.

14

We will discuss R Markdown later in this book. It is an important format for storing both R code and text documentation together.
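To make weight-proportional sampling concrete, here is a minimal sketch (not the actual LoadPUMS.Rmd code). It assumes a data frame pus of person records with a numeric survey-weight column named PWGTP; the column name, sample size, and seed are illustrative assumptions.

set.seed(2018)                                         # fix the pseudo-random seed so the sample is repeatable

# Each row's chance of being picked is proportional to its survey weight
# (hypothetical PWGTP column).
idx <- sample.int(nrow(pus), size = 10000, prob = pus$PWGTP)
pus_sample <- pus[idx, , drop = FALSE]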

Sampling

When we say “pseudo-random sample,” we simply mean a sample built from R’s pseudo-random number generator. R’s random number generator is called “pseudo-random” as it is actually a deterministic sequence of choices that are hoped to be hard to predict and thus behave much like a truly random unpredictable sample. Pseudo-random samples are good to work with as they are repeatable: start the pseudo-random generator with the same seed, and you get the same sequences of choices. Prior to the widespread availability of digital computers, statisticians used to achieve this repeatability by using pre-prepared tables such as Rand Corporation’s 1955 book A Million Random Digits with 100,000 Normal Deviates. The intent is that a random sample should have properties very similar to the total population. The more common the feature you are working with, the more often this is in fact true.
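The repeatability is easy to see in a small sketch (the seed and values here are arbitrary; the point is that the two draws match):

set.seed(25643)           # start the pseudo-random generator at a known state
s1 <- sample(1:1000, 5)   # draw 5 "random" values

set.seed(25643)           # restart from the same seed ...
s2 <- sample(1:1000, 5)   # ... and draw again

identical(s1, s2)         # TRUE: the same seed gives the same sample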

Note: Some care must be taken around the repeatability of pseudo-random experiments. A number of things can interfere with the exact reproduction of pseudo-random samples and results. For example, using a different order of operations can produce different results (especially in the case of parallel algorithms), and R itself changed details of its pseudo-random number generator when it moved from version 3.5.* (used in the preparation of this book) to 3.6.* (the next version of R). As with things like floating-point representations, one must sometimes accept equivalent results in place of exactly identical results.

Structured data at a scale of millions of rows is best handled in a database, though R and the data.table package also work well at this scale. We will simulate working with data that lives in a database by copying our PUMS sample into an in-memory database, as shown next.

Listing 2.8. Loading data into R from a relational database
library("DBI")
library("dplyr")                                           1
library("rquery")

dlist <- readRDS("PUMSsample.RDS")                         2
db <- dbConnect(RSQLite::SQLite(), ":memory:")             3
dbWriteTable(db, "dpus", as.data.frame(dlist$ss16pus))     4
dbWriteTable(db, "dhus", as.data.frame(dlist$ss16hus)) 
rm(list = "dlist")                                         5

dbGetQuery(db, "SELECT * FROM dpus LIMIT 5")               6

dpus <- tbl(db, "dpus")                                    7
dhus <- tbl(db, "dhus")

print(dpus)                                                8
glimpse(dpus)

View(rsummary(db, "dpus"))                                 9

  • 1 Attaches some packages whose commands and functions we wish to use.
  • 2 Loads the data from the compressed RDS disk format into R memory. Note: You will need to change the path PUMSsample to where you have saved the contents of PDSwR2/PUMS.
  • 3 Connects to a new RSQLite in-memory database. We will use RSQLite for our examples. In practice you would connect to a preexisting database, such as PostgreSQL or Spark, with preexisting tables.
  • 4 Copies the data from the in-memory structure dlist into the database
  • 5 Removes our local copy of the data, as we are simulating having found the data in the database
  • 6 Uses the SQL query language for a quick look at up to five rows of our data
  • 7 Builds dplyr handles that refer to the remote database data
  • 8 Uses dplyr to examine and work with the remote data
  • 9 Uses the rquery package to get a summary of the remote data

In this listing, we have deliberately not shown any of the results the commands produce, as we would like you to try this example yourself.

Code examples

All code examples from this book are available in the directory PDSwR2/CodeExamples. Taking code from this directory can be easier than retyping it and more reliable than copying and pasting from an electronic copy of the book (avoiding issues of page breaks, character encodings, and formatting glitches such as smart quotes).

Note that this data, while small, is out of the range where using spreadsheets is convenient. Using dim(dlist$ss16hus) and dim(dlist$ss16pus) (before the rm() step, or after reloading the data), we see that our household sample has 50,000 rows and 149 columns, and the people sample has 109,696 rows and 203 columns. All columns and value codes are defined in the Census documentation. Such documentation is critical, and we supply links to the documentation in PDSwR2/PUMS.
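For example, the following quick checks (run before the rm(list = "dlist") step in listing 2.8, or after reloading PUMSsample.RDS) report the sizes just quoted:

dim(dlist$ss16hus)   # 50000 149  : household sample, 50,000 rows and 149 columns
dim(dlist$ss16pus)   # 109696 203 : person sample, 109,696 rows and 203 columns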

Examining and conditioning the PUMS data

The point of loading data into R is to facilitate modeling and analysis. Data analysts should always have their “hands in the data” and always take a quick look at their data after loading it. As our example, we’ll demonstrate how to perform a quick examination of some of the PUMS columns or fields.

Each row of PUMS data represents a single anonymized person or household. Personal data recorded includes occupation, level of education, personal income, and many other demographic variables. We loaded the data in listing 2.8, but before we continue, let’s discuss a few of the columns found in the dataset and its documentation:

  • Age— An integer found in column AGEP
  • Employment class— Examples: for-profit company, nonprofit company, and so on, found in column COW
  • Education level— Examples: no high school diploma, high school, college, and so on, found in column SCHL
  • Total person’s income— Found in column PINCP
  • Sex of worker— Found in column SEX

Our example problem will be to relate income (represented in US dollars in the PINCP field) to these other variables. This is a typical predictive modeling task: relating some variables we know the values of (age, employment, and so on) to a variable we wish to know (in this case, income). This task is an example of supervised learning, meaning we use a dataset where the observable variables (denoted “independent variables” in statistics) and the unobserved outcome (the “dependent variable”) are both available at the same time. You usually get such labeled data by buying data, employing annotators, or using older data where you have had time to observe the desired outcome.

Don’t be too proud to sample

Many data scientists spend too much time adapting algorithms to work directly with big data. Often this is wasted effort, as for many model types you would get almost exactly the same results on a reasonably sized data sample. You only need to work with “all of your data” when what you’re modeling isn’t well served by sampling, such as when characterizing rare events or performing linkage calculations over social networks.
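A minimal base-R sketch of this advice, assuming a large in-memory data frame d and an illustrative sample size of 10,000 rows:

set.seed(355798)                                                 # make the sample repeatable
d_sample <- d[sample.int(nrow(d), size = 10000), , drop = FALSE]
# Model on d_sample first; only move to all of d if the sample proves inadequate.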

We don’t want to spend too much time on the artificial aspects of the example problem; our goal is to illustrate modeling and data-handling procedures. Conclusions are very dependent on choices of data conditioning (what subset of the data you use) and data coding (how you map records to informative symbols). This is why empirical scientific papers have a mandatory “materials and methods” section describing how data was chosen and prepared. Our data treatment is to select a subset of “typical full-time workers” by restricting to rows that meet all of the following conditions:

  • Workers self-described as full-time employees
  • Workers reporting at least 30 hours a week of activity
  • Workers 18–65 years of age
  • Workers with an annual income between $1,000 and $250,000

The following listing shows the code to limit to our desired subset of the data. Continuing with our data from listing 2.8, we work as shown in listing 2.9. As our data is small (just a sample from PUMS), we use the DBI package to bring the data into R where we can work with it.

Listing 2.9. Loading data from a database
dpus <- dbReadTable(db, "dpus")                        1

dpus <- dpus[, c("AGEP", "COW", "ESR",  "PERNP",
                 "PINCP","SCHL", "SEX", "WKHP")]       2


for(ci in c("AGEP", "PERNP", "PINCP", "WKHP")) {       3
   dpus[[ci]] <- as.numeric(dpus[[ci]])
}

dpus$COW <- strtrim(dpus$COW, 50)                      4

str(dpus)                                              5

  • 1 Copies data from the database into R memory. This assumes we are continuing from the previous example, so the packages we have attached are still available and the database handle db is still valid.
  • 2 Selects the subset of columns we want to work with. Restricting columns is not required, but it improves legibility of later printing.
  • 3 All the columns in this copy of PUMS data are stored as the character type to preserve features such as leading zeros from the original data. Here we are converting the columns we wish to treat as numeric to the numeric type. Non-numeric values, often missing entries, get coded with the symbol NA, which stands for not available.
  • 4 The PUMS level names are very long (which is one of the reasons these columns are distributed as integers), so for this dataset that has level names instead of level codes, we are shortening the employment codes to no more than 50 characters.
  • 5 Looks at the first few rows of data in a column orientation.
Watch out for NAs

R’s representation for blank or missing data is NA. Unfortunately, a lot of R commands quietly skip NAs without warning. The command table(dpus$COW, useNA = 'always') will show NAs much like summary(dpus$COW) does.
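A tiny illustration (a toy vector, not the PUMS data):

x <- c("a", "b", NA, "a")
table(x)                        # counts only "a" and "b"; the NA is silently dropped
table(x, useNA = "always")      # adds an <NA> entry, so the missing value is counted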

We have now performed a few standard data analysis steps: loading the data, reprocessing a few columns, and taking a look at the data. These steps have been performed using what we call “base R,” which means using features and functions coming from the R language itself and the basic packages that are automatically attached (such as base, stats, and utils). R is well suited to data processing tasks, as this is what most users come to R to do. There are extension packages such as dplyr that have their own data processing notation and can perform many steps directly against data in a database in addition to being able to work on data held in memory. We share examples showing how to perform the same data processing steps using base R in the R Markdown example PDSwR2/PUMS/PUMS1.Rmd, or using dplyr in PDSwR2/PUMS/PUMS1_dplyr.Rmd, or using the advanced query generation package rquery in PDSwR2/PUMS/PUMS1_rquery.Rmd.
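For comparison, here is a sketch of the same column selection and numeric conversion in dplyr idiom (assuming dplyr 1.0 or later for across(); the repository scripts named above contain the full worked versions):

library("dplyr")

dpus <- dbReadTable(db, "dpus") %>%                             # copy the table from the database into R
  select(AGEP, COW, ESR, PERNP, PINCP, SCHL, SEX, WKHP) %>%     # keep only the columns of interest
  mutate(across(c(AGEP, PERNP, PINCP, WKHP), as.numeric)) %>%   # convert selected columns to numeric
  mutate(COW = strtrim(COW, 50))                                # shorten the long employment-class names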

We are now ready to work our notional problem in listing 2.10: characterizing income with relation to other facts known about individuals. We will start with some domain-specific steps: we will remap some level names and convert the levels to factors, each with a chosen reference level. Factors are strings taken from a specified set (much like an enumerated type in other languages). Factors also have one special level called the reference level; it is convention that each level is considered to be a difference from the reference level. For example, we will set all less-than-bachelor's-degree education levels to a new level called No Advanced Degree and make No Advanced Degree our reference level. Some R modeling functions will then score education levels such as Master’s Degree as how they differ from the reference level No Advanced Degree. This will be made clear in our example.
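As a tiny toy illustration (separate from the PUMS data) of how factor() and relevel() behave:

f <- factor(c("b", "a", "c", "a"))
levels(f)                     # "a" "b" "c": levels default to sorted order, so "a" is the reference

f2 <- relevel(f, ref = "c")
levels(f2)                    # "c" "a" "b": "c" is now the reference level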

Listing 2.10. Remapping values and selecting rows from data
target_emp_levs <- c(                                        1
  "Employee of a private for-profit company or busine",
  "Employee of a private not-for-profit, tax-exempt, ",
  "Federal government employee",
  "Local government employee (city, county, etc.)",
  "Self-employed in own incorporated business, profes",
  "Self-employed in own not incorporated business, pr",
  "State government employee")

complete <- complete.cases(dpus)                             2

stdworker <- with(dpus,                                      3
                   (PINCP>1000) &
                    (ESR=="Civilian employed, at work") &
                    (PINCP<=250000) &
                    (PERNP>1000) & (PERNP<=250000) &
                    (WKHP>=30) &
                    (AGEP>=18) & (AGEP<=65) &
                    (COW %in% target_emp_levs))

dpus <- dpus[complete & stdworker, , drop = FALSE]           4

no_advanced_degree <- is.na(dpus$SCHL) |                     5
   (!(dpus$SCHL %in% c("Associate's degree",
                      "Bachelor's degree",
                      "Doctorate degree",
                      "Master's degree",
                      "Professional degree beyond a bachelor's degree")))
dpus$SCHL[no_advanced_degree] <- "No Advanced Degree"

dpus$SCHL <- relevel(factor(dpus$SCHL),                      6
                      "No Advanced Degree")
dpus$COW <- relevel(factor(dpus$COW),
                    target_emp_levs[[1]])
dpus$ESR <- relevel(factor(dpus$ESR),
                    "Civilian employed, at work")
dpus$SEX <- relevel(factor(dpus$SEX),
                    "Male")

saveRDS(dpus, "dpus_std_employee.RDS")                       7

summary(dpus)                                                8

  • 1 Defines a vector of employment definitions we consider “standard”
  • 2 Builds a new logical vector indicating which rows have valid values in all of our columns of interest. In real applications, dealing with missing values is important and cannot always be handled by skipping incomplete rows. We will return to the issue of properly dealing with missing values when we discuss managing data.
  • 3 Builds a new logical vector indicating which workers we consider typical full-time employees. All of these column names are the ones we discussed earlier. The results of any analysis will be heavily influenced by this definition, so, in a real task, we would spend a lot of time researching the choices in this step. It literally controls who and what we are studying. Notice that to keep things simple and homogeneous, we restricted this study to civilians, which would be an unacceptable limitation in a complete work.
  • 4 Restricts to only rows or examples that meet our definition of a typical worker
  • 5 Recodes education, merging the less-than-bachelor’s-degree levels to the single level No Advanced Degree
  • 6 Converts our string-valued columns to factors, picking the reference level with the relevel() function
  • 7 Save this data to a file so we can use it in later examples. This file is also already available at the path PDSwR2/PUMS/dpus_std_employee.RDS.
  • 8 Takes a look at our data. One of the advantages of factors is that summary() builds up useful counts for them. However, it is best to delay converting string codes to factors until after we have finished remapping level codes.
A bit more on factor coding

R’s factor type encodes strings as integer indices into a known set of possible strings. For example, our SCHL column is represented in R as follows:

levels(dpus$SCHL)                                                          1
## [1] "No Advanced Degree"                            "Associate's degree"

## [3] "Bachelor's degree"                              "Doctorate degree"
## [5] "Master's degree"                                "Professional degree
     beyond a bachelor's degree"

head(dpus$SCHL)                                                            2
## [1] Associate's degree Associate's degree Associate's degree No Advanced Degree Doctorate degree Associate's degree
## 6 Levels: No Advanced Degree Associate's degree Bachelor's degree Doctorate degree ... Professional degree beyond a bachelor's degree

str(dpus$SCHL)                                                             3
##  Factor w/ 6 levels "No Advanced Degree",..: 2 2 2 1 4 2 1 5 1 1 ...

  • 1 Shows the possible levels for SCHL
  • 2 Shows the first few values of SCHL, displayed as their string labels
  • 3 Shows how the first few values are represented internally as integer codes

Non-statisticians are often surprised that you can use non-numeric columns (such as strings or factors) as inputs to or variables in models. This can be accomplished a number of ways, and the most common one is a method called introducing indicators or dummy variables. In R this encoding is often automatic and unseen. In other systems (such as Python’s scikit-learn), the analyst must specify an encoding (through a method name such as “one-hot”). In this book, we will use this encoding and additional, more sophisticated encodings from the vtreat package. The SCHL column can be explicitly converted into basic dummy variables as we show next. This recoding strategy will be used both implicitly and explicitly in the book, so we will demonstrate it here:

d <- cbind(                                                                1
   data.frame(SCHL = as.character(dpus$SCHL),                              2
              stringsAsFactors = FALSE),
   model.matrix(~SCHL, dpus)                                               3
)
d$'(Intercept)' <- NULL                                                    4
str(d)                                                                     5

## 'data.frame':    41305 obs. of  6 variables:
##  $ SCHL                                              : chr  "Associate's degree" "Associate's degree" "Associate's degree" "No Advanced Degree" ...
##  $ SCHLAssociate's degree                            : num  1 1 1 0 0 1 0 0 0 0 ...
##  $ SCHLBachelor's degree                             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SCHLDoctorate degree                              : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ SCHLMaster's degree                               : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ SCHLProfessional degree beyond a bachelor's degree: num  0 0 0 0 0 0 0 0 0 0 ...

  • 1 The cbind operator combines two data frames by columns; each row of the result is built by pairing the corresponding rows of the two data frames.
  • 2 Builds a data.frame with the SCHL column recoded as character strings instead of as a factor
  • 3 Builds a matrix with dummy variables generated from the SCHL factor column
  • 4 Removes a column named "(Intercept)" from the data.frame, as it is a side effect of model.matrix that we are not interested in at this time.
  • 5 Shows the structure that presents the original SCHL string form along with the indicators. str() presents the first few rows in transpose format (flipped so rows are now up and down and columns are across).

Notice that the reference level No Advanced Degree did not get a column; each of the other levels got an indicator column that has a 1 exactly when that value is in the original SCHL column. The No Advanced Degree rows have all-zero dummies, so we can also tell which examples had that value. This coding can be read as “all-zero rows are the base or normal case, and other rows differ from the all-zero case by having one indicator on (showing which case we are talking about).” Notice that this encoding contains all the information of the original string form, but all columns are now numeric (which is a format many machine learning and modeling procedures require). This format is implicitly used in many R machine learning and modeling functions, and the user may not even be aware of the conversion.
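As a quick consistency check (a sketch, assuming the data frame d built in the previous block is still in memory):

indicator_cols <- setdiff(colnames(d), "SCHL")        # the dummy-variable columns

stopifnot(all(rowSums(d[, indicator_cols]) <= 1))     # every row has at most one indicator switched on

recovered <- rep("No Advanced Degree", nrow(d))       # all-zero rows are the reference level
for(ci in indicator_cols) {
  recovered[d[[ci]] == 1] <- sub("^SCHL", "", ci)     # strip the "SCHL" prefix to get the level name back
}
stopifnot(all(recovered == d$SCHL))                   # the dummies carry all the original information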

Working with the PUMS data

At this point, we are ready to practice working on our problem with data. As we have seen, summary(dpus) already gives us information about the distribution of every variable in our dataset. We can also look at relations between variables with one of the tabulating commands: tapply() or table(). For example, to see a count of examples simultaneously broken down by level of schooling and sex, we could type in the command table(schooling = dpus$SCHL, sex = dpus$SEX). To get the mean income broken down the same way, we could use the command tapply(dpus$PINCP, list(dpus$SCHL, dpus$SEX), FUN = mean).

table(schooling = dpus$SCHL, sex = dpus$SEX)                            1

##                                                 sex
## schooling                                         Male Female
##   No Advanced Degree                             13178   9350
##   Associate's degree                              1796   2088
##   Bachelor's degree                               4927   4519
##   Doctorate degree                                 361    269
##   Master's degree                                 1792   2225
##   Professional degree beyond a bachelor's degree   421    379

tapply(                                                                 2
    dpus$PINCP,                                                         3
    list(dpus$SCHL, dpus$SEX),                                          4
    FUN = mean                                                          5
    )

##                                                     Male   Female
## No Advanced Degree                              44304.21 33117.37
## Associate's degree                              56971.93 42002.06
## Bachelor's degree                               76111.84 57260.44
## Doctorate degree                               104943.33 89336.99
## Master's degree                                 94663.41 69104.54
## Professional degree beyond a bachelor's degree 111047.26 92071.56

  • 1 Uses the table command to count how often each pair of SCHL and SEX occurs
  • 2 Uses tapply to compute the mean income for each pair of SCHL and SEX
  • 3 This argument is the vector of data we are aggregating or summarizing in the tapply.
  • 4 This argument list specifies how we are grouping the data, in this case simultaneously by SCHL and SEX.
  • 5 This argument specifies how we are aggregating values; in this case, we are taking the mean or average using the mean function.

The same calculation in dplyr idiom is as follows:

library("dplyr")

dpus %>%
  group_by(., SCHL, SEX)  %>%
  summarize(.,
            count = n(),
            mean_income = mean(PINCP)) %>%
  ungroup(.) %>%
  arrange(., SCHL, SEX)

## # A tibble: 12 x 4
##    SCHL                                           SEX    count mean_income
##    <fct>                                          <fct>  <int>       <dbl>
##  1 No Advanced Degree                             Male   13178      44304.
##  2 No Advanced Degree                             Female  9350      33117.
##  3 Associate's degree                             Male    1796      56972.
##  4 Associate's degree                             Female  2088      42002.
##  5 Bachelor's degree                              Male    4927      76112.
##  6 Bachelor's degree                              Female  4519      57260.
##  7 Doctorate degree                               Male     361     104943.
##  8 Doctorate degree                               Female   269      89337.
##  9 Master's degree                                Male    1792      94663.
## 10 Master's degree                                Female  2225      69105.
## 11 Professional degree beyond a bachelor's degree Male     421     111047.
## 12 Professional degree beyond a bachelor's degree Female   379      92072.

dplyr pipelines express tasks as sequences of basic data transformations. Also, notice that the tapply() result was in a so-called wide format (data cells keyed by row and column), while the dplyr output is in a tall format (data cells keyed by key columns in each row). Converting between such formats is one of the key topics covered in chapter 5.
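As a small preview (a base-R sketch; chapter 5 covers this in detail), as.data.frame() applied to a table-like object converts the wide tapply() result into the tall form:

wide <- tapply(dpus$PINCP,
               list(SCHL = dpus$SCHL, SEX = dpus$SEX),
               FUN = mean)                                            # wide: one cell per (SCHL, SEX) pair
tall <- as.data.frame(as.table(wide), responseName = "mean_income")   # tall: one row per (SCHL, SEX) pair
head(tall)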

We can even graph relations, as shown in listing 2.11. Finally, if we want a model estimating income as a joint function of all of our other variables simultaneously, we can try a regression, which is the topic of chapter 8.

Listing 2.11. Plotting the data
WVPlots::ScatterHist(
  dpus, "AGEP", "PINCP",
  "Expected income (PINCP) as function age (AGEP)",
  smoothmethod = "lm",
  point_alpha = 0.025)

This is a moment to celebrate, as we have finally achieved a data science goal. In figure 2.3, we are looking at the data and relations in the data. The technical task of explaining the summary information in the graph will be covered in chapter 8.

Figure 2.3. Scatter plot of income (PINCP) as a function of age (AGEP)

We’ll return to the Census data and demonstrate more-sophisticated modeling techniques a few times in this book. In all cases, we are working these examples to demonstrate the basic challenges one encounters when working hands-on with data, and to introduce some of the R tools that are ready to help. As a follow-up, we strongly advise running these examples, consulting the help() on all of these functions, and also searching online for official documentation and user guides.

Summary

In this chapter, we’ve worked through the basics of how to initially extract, transform, and load data for analysis. For smaller datasets, we perform the transformations using R and in memory. For larger datasets, we advise using a SQL database or even a big data system such as Spark (via the sparklyr package plus SQL, dplyr, or rquery). In any case, we save all the transformation steps as code (either in SQL or in R) that can be reused in the event of a data refresh. The intent of this chapter was to prepare for the actual interesting work in our next chapters: exploring, managing, correcting, and modeling data.

R is built to work with data, and the purpose of loading data into R is to examine and work with it. In chapter 3, we’ll demonstrate how to characterize your data through summaries, exploration, and graphing. These are key steps early in any modeling effort because it is through these steps that you learn the actual details and nature of the problem you’re hoping to model.

In this chapter you have learned

  • Data frames, with their discipline of each row being an instance and each column being a variable or measurement, are a preferred data structure for data analysis.
  • Use utils::read.table() or the readr package to load small, structured datasets into R.
  • The DBI package allows you to work directly with databases or Apache Spark using any of SQL, dplyr, or rquery.
  • R is designed to work with data in high-level steps, and has many ready-made, data-transforming commands and functions. Generally, if a task becomes difficult in R, it is because you are accidentally attempting to reimplement high-level data transforms in terms of low-level programming steps.