Chapter 5. Data engineering and data shaping

This chapter covers

  • Becoming comfortable with applying data transforms
  • Starting with important data manipulation packages including data.table and dplyr
  • Learning to control the layout of your data

This chapter will show you how to use R to organize or wrangle data into a shape useful for analysis. Data shaping is a set of steps you have to take if your data is not found all in one table or in an arrangement ready for analysis.

Figure 5.1 is the mental model for this chapter: working with data. Previous chapters have assumed the data is in a ready-to-go form, or we have pre-prepared the data to be in such a form for you. This chapter will prepare you to take these steps yourself. The basic concept of data wrangling is to visualize your data being structured to make your task easier, and then take steps to add this structure to your data. To teach this, we'll work a number of examples, each with a motivating task, and then work a transform that solves the problem. We'll concentrate on a set of transforms that are powerful and useful, and that cover most common situations.

Figure 5.1. Chapter 5 mental model

We will show data wrangling solutions using base R, data.table, and dplyr.[1] Each of these has its advantages, which is why we are presenting more than one solution. Throughout this book, we are deliberately using a polyglot approach to data wrangling: mixing base R, data.table, and dplyr, as convenient. Each of these systems has its strengths:

1

For database tasks, we suggest using dbplyr or rquery, which we will touch on briefly in appendix A.

  • Base R— This is code written in R that directly manipulates data.frames using R’s built-in capabilities. Breaking complex transforms into base R primitives can be a puzzle, but we will give you the tools to solve the puzzle in this chapter.
  • data.table is the package for fast and memory-efficient data manipulation in R. It differs from normal R semantics in that data.table uses reference semantics, where changes are made directly in a shared data structure (visible to all references to the same structure), instead of R’s more typical value semantics (where changes made in one reference do not become visible to other references); we give a short sketch of this difference just after this list. data.table notation specifies powerful transforms through a variation of the []-indexing operator, and is well explained in help(data.table, package="data.table") and vignette("datatable-intro", package="data.table").
  • dplyr is a popular data manipulation package that emphasizes data manipulation through sequences of SQL-like (or Codd-style) operators. dplyr is usually not as fast (or as space efficient) as data.table, but the notation is convenient.
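
Here is a small sketch illustrating the difference between value and reference semantics (a toy example of our own, not one of the chapter's datasets):

library("data.table")

d <- data.frame(x = 1:3)
d2 <- d                    # value semantics: d2 behaves as an independent copy
d2$x <- d2$x + 100
print(d$x)                 # d is unchanged
## [1] 1 2 3

dt <- data.table(x = 1:3)
dt2 <- dt                  # reference semantics: dt2 refers to the same structure
dt2[ , x := x + 100]       # := alters the shared data in place
print(dt$x)                # dt sees the change made through dt2
## [1] 101 102 103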

There are several good free guides for getting started on manipulating data in R; we will point to some of them (such as the data.table vignettes and the dplyr cheat sheet) as we go.

We want to improve your ability to write R code (to translate intent into implementation) and to read R code (to infer intent from existing code). To this end, this chapter, like much of this book, is designed as a sequence of worked examples. We strongly encourage you to try running the examples yourself (they are available here: https://github.com/WinVector/PDSwR2). It is key to get into the habit of planning (even drawing) your data transformation before coding. Finish clarifying your intent before getting bogged down by the details of coding. Trust that there are easy-to-find methods for most common data wrangling tasks in R, and plan with the assumption that you can find them when needed. The principle is this: make your analysis simpler by transforming your data into a simple “data matrix” format where each row is an observation, and each column is a measurement type. Fix issues, such as odd column names, early, and then you don’t have to write complicated code to work around them.

This chapter is organized in terms of the type of transform needed. For variety, we will bring in a number of small notional datasets, and spend a little time looking at each dataset. This task-to-example organization will give you a quick introduction to data transforms in R. The transforms we are going to cover include these:

  • Choosing a subset of columns
  • Choosing a subset of rows
  • Reordering rows
  • Creating new columns
  • Dealing with missing values
  • Combining two datasets by row
  • Combining two datasets by column
  • Joining two datasets
  • Aggregating rows
  • General data reshaping (tall versus wide forms)

The idea is that this list will give you enough tools for a very large number of tasks. We'll work from problems to solutions. We will show which command solves a given problem, and leave details of the command’s syntax to R’s help system and the guides and tutorials we suggest in this chapter. Please think of this chapter as a Rosetta Stone for data wrangling: each concept is explained once and then executed three times (usually in base R, data.table, and dplyr).

Our first application (subsetting rows and columns) will set up the general pattern of the applications, so it is worth reading through even if you are already confidently subsetting data in R.

Data sources

In this chapter, we’ll use small, toy-sized datasets to make it easier to examine data before and after the transforms. We strongly suggest you run all the examples along with us. All examples are either built-in data that comes with R or available from the book’s GitHub site: https://github.com/WinVector/PDSwR2. Also, all code examples can be found in the CodeExamples directory at the same location. We suggest cloning or downloading this material to make it easier to work along with the book.

For more information on R’s built-in examples, try the command help(datasets).

5.1. Data selection

This section covers removing rows, removing columns, reordering columns, removing missing data, and reordering data rows. In the era of big data, you often have too much to look at, so limiting your data to what you need can greatly speed up your work.

5.1.1. Subsetting rows and columns

A common task when working with a dataset is selecting a subset of rows or columns.

Situation

For our first example, we will use the iris dataset: measurements of sepal length and width and petal length and width for three species of iris.

First we will look at some aspects of the data. We suggest always doing this and making it part of an “eyes on the data” work discipline. For example, figure 5.2 shows the petal dimensions of our example irises:

library("ggplot2")                                                     1

summary(iris)                                                          2
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50

  • 1 Attaches the ggplot2 package for later plotting
  • 2 Takes a look at the built-in iris data

Figure 5.2. Example iris plot

Attaching packages

It is good practice to attach packages early. If a package won’t attach, try installing it with a command such as install.packages("ggplot2").

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

ggplot(iris,
       aes(x = Petal.Length, y = Petal.Width,
           shape = Species, color = Species)) +
  geom_point(size = 2) +
  ggtitle("Petal dimensions by iris species: all measurements")

The iris data comes preinstalled with R and is part of the datasets package. We will deliberately use small examples in this chapter so it is easy to look at results.

Scenario

Suppose we are assigned to generate a report on only petal length and petal width, by iris species, for irises where the petal length is greater than 2. To accomplish this, we need to select a subset of columns (variables) or a subset of rows (instances) from a data frame.

Column and row selections look like figure 5.3.

Figure 5.3. Selecting columns and rows

The diagrams

The diagrams in this chapter are meant to be mnemonic cartoons of the transform. We suggest looking at the actual data before and after the transform to get more details on what the data transforms are doing. We also suggest reviewing them again after working the solutions and noticing how they abstract the arrangement of the data before and after the transforms. Think of these diagrams as a visual index of transforms.

Solution 1: Base R

The base R solution works by using the [,] indexing operator.

drop = FALSE

When working with [,], always add a third argument, drop = FALSE, to work around the fact that, by default, selecting a single column from an R data.frame returns a vector rather than a data.frame containing the column. In many cases, we know we have more than one column and don’t strictly need the argument. But it is good to get in the habit of adding it to avoid unpleasant surprises.
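
Here is a small sketch of the behavior, using a tiny example of our own:

d <- data.frame(x = 1:3, y = 4:6)

class(d[ , "x"])                  # a single-column selection drops to a vector
## [1] "integer"

class(d[ , "x", drop = FALSE])    # drop = FALSE keeps the data.frame structure
## [1] "data.frame"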

The solution strategy is this:

  • Get desired columns by name or column index in the second position of [,].
  • Get desired rows by Boolean per-row selection in the first position of [,].
columns_we_want <- c("Petal.Length", "Petal.Width", "Species")
rows_we_want <- iris$Petal.Length > 2

# before
head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

iris_base <- iris[rows_we_want, columns_we_want, drop = FALSE]

# after
head(iris_base)

##    Petal.Length Petal.Width    Species
## 51          4.7         1.4 versicolor
## 52          4.5         1.5 versicolor
## 53          4.9         1.5 versicolor
## 54          4.0         1.3 versicolor
## 55          4.6         1.5 versicolor
## 56          4.5         1.3 versicolor

Notice column selection is also a good way to reorder columns. An advantage of base R is it tends to be fast and has very stable APIs: code written this year in base R is most likely to work next year (tidyverse packages, unfortunately, tend to have less-stable APIs). The one disadvantage is that a few base R defaults are irritating. For example, we included the drop=FALSE notation to work around the fact that base R would return a vector instead of a data.frame if we tried to select only one column.
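
For instance, here is a small example of our own that uses column selection purely to reorder the iris columns (petal measurements first):

head(iris[ , c("Species", "Petal.Length", "Petal.Width",
               "Sepal.Length", "Sepal.Width"), drop = FALSE])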

Solution 2: data.table

Row and column selection in data.table is performed similarly to base R. data.table uses a very powerful set of index notations. In this case, we use a .. notation to tell data.table that we are using the second index position to specify column names (and not to specify calculations, as we will demonstrate later).

library("data.table")

iris_data.table <- as.data.table(iris)                                 1

columns_we_want <- c("Petal.Length", "Petal.Width", "Species")
rows_we_want <- iris_data.table$Petal.Length > 2

iris_data.table <- iris_data.table[rows_we_want , ..columns_we_want]   2

head(iris_data.table)

##    Petal.Length Petal.Width    Species
## 1:          4.7         1.4 versicolor
## 2:          4.5         1.5 versicolor
## 3:          4.9         1.5 versicolor
## 4:          4.0         1.3 versicolor
## 5:          4.6         1.5 versicolor
## 6:          4.5         1.3 versicolor

  • 1 Converts to data.table class to get data.table semantics
  • 2 The .. notation tells data.table that columns_we_want isn’t itself the name of a column but a variable referring to names of columns.

The advantage of data.table is that it is the fastest and most memory-efficient solution for data wrangling in R at a wide range of scales. data.table also has a very helpful FAQ and a nice cheat sheet; both will make more sense if you approach them after working some examples from the data.table vignette, accessible in R with the command vignette("datatable-intro", package = "data.table").

Taking care when using data.table

data.table works like data.frames for packages that are not data.table-aware. This means you can use data.tables with just about any package, even those that predate data.table. In a data.table-aware situation (using data.table at the command line, or using a package that depends on data.table), data.table implements slightly enhanced semantics. We show a quick example here:

library("data.table")

df <- data.frame(x = 1:2, y = 3:4)                              1

df[, x]                                                         2
## Error in `[.data.frame`(df, , x) : object 'x' not found

x <- "y"                                                        3
dt <- data.table(df)

dt[, x]                                                         4
## [1] 1 2

dt[, ..x]                                                       5
##    y
## 1: 3
## 2: 4

  • 1 Example data.frame
  • 2 Notice that writing df[, x] instead of df[, "x"] is an error (assuming x is not bound to a value in our environment).
  • 3 Sets up data.table example
  • 4 Notice that this returns the column x much like dt$x would.
  • 5 This uses data.table’s “look up” idiom to get a data.table of columns referred to by the variable x.
Solution 3: dplyr

The dplyr solution is written in terms of select and filter:

  • dplyr::select to select desired columns
  • dplyr::filter to select desired rows

It is traditional to chain dplyr steps with the magrittr pipe operator %>%, but assigning to temporary variables works just as well. While teaching here, we'll use explicit dot notation, where the data pipeline is written as iris %>% select(., column) instead of the more-common implicit first-argument notation (iris %>% select(column)). Explicit dot notation was discussed in chapter 2 and is the topic of the following R Tip: http://www.win-vector.com/blog/2018/03/r-tip-make-arguments-explicit-in-magrittr-dplyr-pipelines/.[2]

2

A powerful alternate pipe called the dot arrow pipe (written as %.>%) that insists on explicit dot notation and supplies additional capabilities can be found in the wrapr package. For most of this book, we will stick to the magrittr pipe, but we encourage curious readers to check out the wrapr pipe in their own work.

library("dplyr")

iris_dplyr <- iris %>%
  select(.,
         Petal.Length, Petal.Width, Species) %>%
  filter(.,
         Petal.Length > 2)

head(iris_dplyr)

##   Petal.Length Petal.Width    Species
## 1          4.7         1.4 versicolor
## 2          4.5         1.5 versicolor
## 3          4.9         1.5 versicolor
## 4          4.0         1.3 versicolor
## 5          4.6         1.5 versicolor
## 6          4.5         1.3 versicolor

The advantage of dplyr is its emphasis on presenting data processing as a sequence of operations broken down into a visible pipeline.

There is a nice cheat sheet for dplyr available from https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf. Cheat sheets are always going to be a bit brief, so the sheet will become very useful after you have tried a few examples.

5.1.2. Removing records with incomplete data

An important variation of subsetting data is removing rows of data that have missing values. We will also deal with some simple strategies for replacing missing values by moving values across rows (using na.locf()) or moving values across columns (called coalescing).[3]

3

There is actually an entire science devoted to imputing values for missing data. A good resource on this topic is https://CRAN.R-project.org/view=MissingData.

In this section, we will show how to quickly select only rows that have no missing data or values. This is only an example; we generally suggest using the methodologies in chapters 4 and 8 for treating missing values in real-world applications.
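
As a quick preview of coalescing, here is a small sketch of our own using dplyr::coalesce() to fill missing values in one column from a fallback column (this toy example is separate from the msleep data):

library("dplyr")

d <- data.frame(measured  = c(1.2, NA, 3.1),
                estimated = c(1.0, 2.2, 3.0))

d$value <- coalesce(d$measured, d$estimated)    # use measured, fall back to estimated
d$value
## [1] 1.2 2.2 3.1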

Situation

As our preceding example does not have missing values, we will move to another example: the msleep dataset of sleep times of animals with different characteristics. In this dataset, several rows have missing values. An additional goal of this example is to familiarize you with a number of common practice datasets; these are the datasets you should reach for when trying out a new data wrangling method.

First, as always, let’s look at the data:

library("ggplot2")
data(msleep)               1

str(msleep)

## Classes 'tbl_df', 'tbl' and 'data.frame':    83 obs. of  11 variables:
##  $ name        : chr  "Cheetah" "Owl monkey" "Mountain beaver" "Greater sh
     ort-tailed shrew" ...
##  $ genus       : chr  "Acinonyx" "Aotus" "Aplodontia" "Blarina" ...
##  $ vore        : chr  "carni" "omni" "herbi" "omni" ...
##  $ order       : chr  "Carnivora" "Primates" "Rodentia" "Soricomorpha" ...
##  $ conservation: chr  "lc" NA "nt" "lc" ...
##  $ sleep_total : num  12.1 17 14.4 14.9 4 14.4 8.7 7 10.1 3 ...
##  $ sleep_rem   : num  NA 1.8 2.4 2.3 0.7 2.2 1.4 NA 2.9 NA ...
##  $ sleep_cycle : num  NA NA NA 0.133 0.667 ...
##  $ awake       : num  11.9 7 9.6 9.1 20 9.6 15.3 17 13.9 21 ...
##  $ brainwt     : num  NA 0.0155 NA 0.00029 0.423 NA NA NA 0.07 0.0982 ...
##  $ bodywt      : num  50 0.48 1.35 0.019 600 ...

summary(msleep)

##      name              genus               vore
##  Length:83          Length:83          Length:83
##  Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character
##
##
##
##
##     order           conservation        sleep_total      sleep_rem
##  Length:83          Length:83          Min.   : 1.90   Min.   :0.100
##  Class :character   Class :character   1st Qu.: 7.85   1st Qu.:0.900
##  Mode  :character   Mode  :character   Median :10.10   Median :1.500
##                                        Mean   :10.43   Mean   :1.875
##                                        3rd Qu.:13.75   3rd Qu.:2.400
##                                        Max.   :19.90   Max.   :6.600
##                                                        NA's   :22
##   sleep_cycle         awake          brainwt            bodywt
##  Min.   :0.1167   Min.   : 4.10   Min.   :0.00014   Min.   :   0.005
##  1st Qu.:0.1833   1st Qu.:10.25   1st Qu.:0.00290   1st Qu.:   0.174
##  Median :0.3333   Median :13.90   Median :0.01240   Median :   1.670
##  Mean   :0.4396   Mean   :13.57   Mean   :0.28158   Mean   : 166.136
##  3rd Qu.:0.5792   3rd Qu.:16.15   3rd Qu.:0.12550   3rd Qu.:  41.750
##  Max.   :1.5000   Max.   :22.10   Max.   :5.71200   Max.   :6654.000
##  NA's   :51                       NA's   :27

  • 1 Copies the msleep data from the ggplot2 package into our workspace
Scenario

We have been asked to build an extract of the msleep data that has no missing values. To accomplish this, we will remove all rows that have missing values. The cartoon of the transform is shown in figure 5.4.

Figure 5.4. Removing rows with missing values

Base R solution

  • complete.cases() returns a vector with one entry for each row of the data frame, which is TRUE if and only if the row has no missing entries. Once we know what rows we want, it is just a matter of selecting those rows (which we have seen earlier).
  • na.omit() performs the whole task in one step.
clean_base_1 <- msleep[complete.cases(msleep), , drop = FALSE]

summary(clean_base_1)

##      name              genus               vore
##  Length:20          Length:20          Length:20
##  Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character
##
##
##
##     order           conservation        sleep_total       sleep_rem
##  Length:20          Length:20          Min.   : 2.900   Min.   :0.600
##  Class :character   Class :character   1st Qu.: 8.925   1st Qu.:1.300
##  Mode  :character   Mode  :character   Median :11.300   Median :2.350
##                                        Mean   :11.225   Mean   :2.275
##                                        3rd Qu.:13.925   3rd Qu.:3.125
##                                        Max.   :19.700   Max.   :4.900
##   sleep_cycle         awake          brainwt            bodywt
##  Min.   :0.1167   Min.   : 4.30   Min.   :0.00014   Min.   :  0.0050
##  1st Qu.:0.1792   1st Qu.:10.07   1st Qu.:0.00115   1st Qu.:  0.0945
##  Median :0.2500   Median :12.70   Median :0.00590   Median :  0.7490
##  Mean   :0.3458   Mean   :12.78   Mean   :0.07882   Mean   : 72.1177
##  3rd Qu.:0.4167   3rd Qu.:15.07   3rd Qu.:0.03670   3rd Qu.:  6.1250
##  Max.   :1.0000   Max.   :21.10   Max.   :0.65500   Max.   :600.0000

nrow(clean_base_1)

## [1] 20

clean_base_2 = na.omit(msleep)
nrow(clean_base_2)

## [1] 20
data.table solution

The complete.cases() solution also works with data.table:

library("data.table")

msleep_data.table <- as.data.table(msleep)

clean_data.table = msleep_data.table[complete.cases(msleep_data.table), ]

nrow(clean_data.table)

## [1] 20
dplyr solution

dplyr::filter can also be used with complete.cases().

With magrittr pipe notation, a . is taken to mean the item being piped. So we can use . to refer to our data multiple times conveniently, such as telling dplyr::filter() to use the data both as the object to be filtered and as the object to pass to complete.cases().

library("dplyr")

clean_dplyr <- msleep %>%
filter(., complete.cases(.))

nrow(clean_dplyr)

## [1] 20

5.1.3. Ordering rows

In this section, we want to sort or control what order our data rows are in. Perhaps the data came to us unsorted, or sorted for a purpose other than ours.

Scenario

We are asked to build a running or cumulative sum of sales by time, but the data came to us out of order:

purchases <- wrapr::build_frame(        1
    "day", "hour", "n_purchase" |
   1    , 9     , 5            |
   2    , 9     , 3            |
   2    , 11    , 5            |
   1    , 13    , 1            |
   2    , 13    , 3            |
   1    , 14    , 1            )

  • 1 Uses wrapr::build_frame to type in data directly, in legible column order
Problem

Reorder the rows by day and then hour and compute a running sum. The abstract diagram is shown in figure 5.5.

Figure 5.5. Ordering rows

Base R solution
order_index <- with(purchases, order(day, hour))                          1

purchases_ordered <- purchases[order_index, , drop = FALSE]
purchases_ordered$running_total <- cumsum(purchases_ordered$n_purchase)   2

purchases_ordered

##   day hour n_purchase running_total
## 1   1    9          5             5
## 4   1   13          1             6
## 6   1   14          1             7
## 2   2    9          3            10
## 3   2   11          5            15
## 5   2   13          3            18

  • 1 with() executes the code in its second argument as if the columns of the first argument were variables. This lets us write day instead of purchases$day.
  • 2 Computes the running sum
data.table solution
library("data.table")

DT_purchases <- as.data.table(purchases)

order_cols <- c("day", "hour")              1
setorderv(DT_purchases, order_cols)
DT_purchases[ , running_total := cumsum(n_purchase)]

# print(DT_purchases)

  • 1 Reorders data
:= and []

Operations that alter data in place (such as :=) annotate the result to suppress printing. This is important, as often you are working with large structures and do not want intermediate data to print. [] is a no-operation that as a side effect restores printing.
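
For example, here is a small sketch of the printing behavior:

library("data.table")

dt <- data.table(x = 1:3)

dt[ , y := x * 2]        # alters dt in place; nothing is printed
dt[ , y := x * 2][]      # the trailing [] restores printing of the result
##    x y
## 1: 1 2
## 2: 2 4
## 3: 3 6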

setorderv() reorders data in place and takes a list of ordering column names to specify the order. This is much more convenient than the base R solution that takes multiple ordering columns as multiple arguments. wrapr::orderv() tries to bridge this gap by allowing the user to specify ordering constraints with a list of columns (column values, not column names).

dplyr solution

dplyr uses the word arrange to order data, and mutate to add a new column:

library("dplyr")

res <- purchases %>%
  arrange(., day, hour) %>%
  mutate(., running_total = cumsum(n_purchase))

# print(res)
Advanced use of ordering

For our advanced example, suppose we want the cumulative sum of sales to be per day—that is, to reset the sum at the start of each day.

Base R solution

The easiest base R solution is a split and recombine strategy:

order_index <- with(purchases, order(day, hour))                  1
purchases_ordered <- purchases[order_index, , drop = FALSE]

data_list <- split(purchases_ordered, purchases_ordered$day)      2

data_list <- lapply(                                              3
  data_list,
  function(di) {
    di$running_total <- cumsum(di$n_purchase)
    di
  })

purchases_ordered <- do.call(base::rbind, data_list)              4
rownames(purchases_ordered) <- NULL                               5

purchases_ordered

##   day hour n_purchase running_total
## 1   1    9          5             5
## 2   1   13          1             6
## 3   1   14          1             7
## 4   2    9          3             3
## 5   2   11          5             8
## 6   2   13          3            11

  • 1 First sorts the data
  • 2 Now splits the data into a list of groups
  • 3 Applies the cumsum to each group
  • 4 Puts the results back together into a single data.frame
  • 5 R often keeps annotations in rownames(). In this case, it is storing the original row numbers of the pieces we are assembling. This can confuse users when printing, so it is good practice to remove these annotations, as we do here.
data.table solution

The data.table solution is particularly concise. We order the data and then tell data.table to calculate the new running sum per-group with the by argument. The idea that the grouping is a property of the calculation, and not a property of the data, is similar to SQL and helps minimize errors.

:= versus =

In data.table, := means “assign in place”—it is used to alter or create a column in the incoming data.table. Conversely, = is used to mean “create in a new data.table,” and we wrap these sorts of assignments in a .() notation so that column names are not confused with arguments to data.table.

library("data.table")

# new copy for result solution
DT_purchases <- as.data.table(purchases)[order(day, hour),
             .(hour = hour,
               n_purchase = n_purchase,
               running_total = cumsum(n_purchase)),
             by = "day"]                                          1
# print(DT_purchases)                                             2

# in-place solution
DT_purchases <- as.data.table(purchases)
order_cols <- c("day", "hour")
setorderv(DT_purchases, order_cols)
DT_purchases[ , running_total := cumsum(n_purchase), by = day]
# print(DT_purchases)                                             3

# don't reorder the actual data variation!
DT_purchases <- as.data.table(purchases)
DT_purchases[order(day, hour),
             `:=`(hour = hour,
               n_purchase = n_purchase,
               running_total = cumsum(n_purchase)),
             by = "day"]

# print(DT_purchases)                                             4

  • 1 Adding the by keyword converts the calculation into a per-group calculation.
  • 2 First solution: the result is a second copy of the data, built with the .() notation. Only columns used in the calculation (such as day) and those explicitly assigned to are in the result.
  • 3 Second solution: result is computed in place by ordering the table before the grouped calculation.
  • 4 Third solution: result is in the same order as the original table, but the cumulative sum is computed as if we sorted the table, computed the grouped running sum, and then returned the table to the original order.
Sequencing data.table operations

Sequencing data.table operations is achieved either by writing in-place operations one after the other (as we did in these examples) or by starting a new open-[ right after a close-] for operations that create new copies (this is called method chaining and is equivalent to using a pipe operator).
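
For example, here is a small sketch of our own chaining several steps on the purchases data: reorder into a new copy, add the per-day running sum, then keep only the first day.

library("data.table")

DT_purchases <- as.data.table(purchases)

res <- DT_purchases[order(day, hour)][               # reorder into a new copy, then chain
  , running_total := cumsum(n_purchase), by = day][  # per-day running sum, then chain
  day == 1]                                          # keep only day 1

# print(res)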

dplyr solution

The dplyr solution works because the command mutate() (which we will discuss in the next section) works per-group if the data is grouped. We can make the data grouped by using the group_by() command:

library("dplyr")

res <- purchases %>%
  arrange(., day, hour) %>%
  group_by(., day) %>%
  mutate(., running_total = cumsum(n_purchase)) %>%
  ungroup(.)

# print(res)
ungroup()

In dplyr it is important to always ungroup your data when you are done performing per-group operations. This is because the presence of a dplyr grouping annotation can cause many downstream steps to calculate unexpected and incorrect results. We advise doing this even after a summarize() step, as summarize() removes one key from the grouping, leaving it unclear to a code reader whether the data remains grouped or not.
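
Here is a small sketch of the kind of surprise a leftover grouping can cause: the same summarize() call returns per-species means on the grouped data, but a single overall mean once the data is ungrouped.

library("dplyr")

grouped_iris <- iris %>% group_by(., Species)

grouped_iris %>%
  summarize(., mean_petal_length = mean(Petal.Length))    # three rows: one mean per species

grouped_iris %>% ungroup(.) %>%
  summarize(., mean_petal_length = mean(Petal.Length))    # one row: the overall mean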

5.2. Basic data transforms

This section covers adding and renaming columns.

5.2.1. Adding new columns

This section covers adding new variables (columns) to a data frame, or applying transformations to existing columns (see figure 5.6).

Figure 5.6. Adding or altering columns

Example data

For our example data, we will use air quality measurements with missing data and non-standard date formatting, for the year 1973:

library("datasets")
library("ggplot2")

summary(airquality)

##      Ozone           Solar.R           Wind             Temp
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00
##  NA's   :37       NA's   :7
##      Month            Day
##  Min.   :5.000   Min.   : 1.0
##  1st Qu.:6.000   1st Qu.: 8.0
##  Median :7.000   Median :16.0
##  Mean   :6.993   Mean   :15.8
##  3rd Qu.:8.000   3rd Qu.:23.0
##  Max.   :9.000   Max.   :31.0
##
Scenario

We are asked to convert this non-standard date representation into a new, more useful date column for queries and plotting.

library("lubridate")
library("ggplot2")

# create a function to make the date string.
datestr = function(day, month, year) {
  paste(day, month, year, sep="-")
}
Base R solution

In base R, we create new columns by assigning to them:

airquality_with_date <- airquality                                    1

airquality_with_date$date <- with(airquality_with_date,               2
                                   dmy(datestr(Day, Month, 1973)))

airquality_with_date <- airquality_with_date[,                        3
                                              c("Ozone", "date"),
                                              drop = FALSE]

head(airquality_with_date)                                            4

##   Ozone       date
## 1    41 1973-05-01
## 2    36 1973-05-02
## 3    12 1973-05-03
## 4    18 1973-05-04
## 5    NA 1973-05-05
## 6    28 1973-05-06
ggplot(airquality_with_date, aes(x = date, y = Ozone)) +              5
  geom_point() +
  geom_line() +
  xlab("Date") +
  ggtitle("New York ozone readings, May 1 - Sept 30, 1973")

  • 1 Builds a copy of the data
  • 2 Adds the date column, using with() to refer to columns without needing the table name
  • 3 Limits down to the columns of interest
  • 4 Shows the results
  • 5 Plots the results

The preceding code produces figure 5.7.

Figure 5.7. Ozone plot example

Base R has had transform-style (or pipeable) versions of these basic operators for quite some time (just no pipe!). Let’s work the example again in that style:

library("wrapr")                                                   1

airquality %.>%                                                    2
  transform(., date = dmy(datestr(Day, Month, 1973))) %.>%
  subset(., !is.na(Ozone), select =  c("Ozone", "date")) %.>%
  head(.)
##   Ozone       date
## 1    41 1973-05-01
## 2    36 1973-05-02
## 3    12 1973-05-03
## 4    18 1973-05-04
## 6    28 1973-05-06
## 7    23 1973-05-07

  • 1 Attaches the wrapr package to define the wrapr dot arrow pipe: %.>%. The dot arrow pipe is another R pipe and is described in the R Journal at https://journal.r-project.org/archive/2018/RJ-2018-042/index.html.
  • 2 Runs all the steps as before using transform() and subset(), adding an extra step of filtering down to rows that do not have missing Ozone values
data.table solution

data.table uses := to show column alterations or creations that are supposed to happen “in place” (the data.table at hand gets altered, instead of a new one being created).

library("data.table")

DT_airquality <-
  as.data.table(airquality)[                       1
     , date := dmy(datestr(Day, Month, 1973)) ][   2
       , c("Ozone", "date")]                       3

head(DT_airquality)

##    Ozone       date
## 1:    41 1973-05-01
## 2:    36 1973-05-02
## 3:    12 1973-05-03
## 4:    18 1973-05-04
## 5:    NA 1973-05-05
## 6:    28 1973-05-06

  • 1 Builds a data.table copy of the data
  • 2 Adds the date column
  • 3 Limits down to the columns of interest

Notice how the open-[ steps work a lot like pipes, connecting one data.table stage to another. This is one of the reasons data.table places so many operations inside the []: in R the [] naturally chains operations left to right.

dplyr solution

dplyr users will remember that in dplyr, new columns are produced with the mutate() command:

library("dplyr")

airquality_with_date2 <- airquality %>%
  mutate(., date = dmy(datestr(Day, Month, 1973))) %>%
  select(., Ozone, date)

head(airquality_with_date2)

##   Ozone       date
## 1    41 1973-05-01
## 2    36 1973-05-02
## 3    12 1973-05-03
## 4    18 1973-05-04
## 5    NA 1973-05-05
## 6    28 1973-05-06
The scenario continued

Notice the original Ozone graph had holes in the data, due to missing values. We will try to fix this by propagating the last known Ozone reading forward to the dates with missing values. This “the task was finished ... until we looked at the results” situation is typical of data science. So always look, and look for problems.

Filling in missing values from earlier in a column is illustrated in figure 5.8.

Figure 5.8. Filling in missing values

The zoo package supplies a function called na.locf(), which is designed to solve our issue. We will show how to apply this function now.

Base R solution
library("zoo")

airquality_corrected <- airquality_with_date
airquality_corrected$OzoneCorrected <-
  na.locf(airquality_corrected$Ozone, na.rm = FALSE)

summary(airquality_corrected)

##      Ozone             date            OzoneCorrected
##  Min.   :  1.00   Min.   :1973-05-01   Min.   :  1.00
##  1st Qu.: 18.00   1st Qu.:1973-06-08   1st Qu.: 16.00
##  Median : 31.50   Median :1973-07-16   Median : 30.00
##  Mean   : 42.13   Mean   :1973-07-16   Mean   : 39.78
##  3rd Qu.: 63.25   3rd Qu.:1973-08-23   3rd Qu.: 52.00
##  Max.   :168.00   Max.   :1973-09-30   Max.   :168.00
##  NA's   :37

ggplot(airquality_corrected, aes(x = date, y = Ozone)) +
  geom_point(aes(y=Ozone)) +
  geom_line(aes(y=OzoneCorrected)) +
  ggtitle("New York ozone readings, May 1 - Sept 30, 1973",
          subtitle = "(corrected)") +
  xlab("Date")

This produces figure 5.9.

Figure 5.9. Ozone plot again

Use na.rm = FALSE

Always use na.rm = FALSE with na.locf(); otherwise, it may delete initial NA elements from your data.
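
A quick illustration of the difference, on a small vector of our own:

library("zoo")

v <- c(NA, 1, NA, 3)

na.locf(v)                   # default na.rm = TRUE drops the leading NA: result is shorter than v
## [1] 1 1 3

na.locf(v, na.rm = FALSE)    # keeps the leading NA, so the result lines up with the input
## [1] NA  1  1  3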

data.table solution
library("data.table")
library("zoo")

DT_airquality[, OzoneCorrected := na.locf(Ozone, na.rm=FALSE)]

summary(DT_airquality)

##      Ozone             date            OzoneCorrected
##  Min.   :  1.00   Min.   :1973-05-01   Min.   :  1.00
##  1st Qu.: 18.00   1st Qu.:1973-06-08   1st Qu.: 16.00
##  Median : 31.50   Median :1973-07-16   Median : 30.00
##  Mean   : 42.13   Mean   :1973-07-16   Mean   : 39.78
##  3rd Qu.: 63.25   3rd Qu.:1973-08-23   3rd Qu.: 52.00
##  Max.   :168.00   Max.   :1973-09-30   Max.   :168.00
##  NA's   :37

Notice that data.table performed the correction “in place,” in DT_airquality instead of producing a new data.frame.

dplyr solution
library("dplyr")
library("zoo")

airquality_with_date %>%
  mutate(.,
         OzoneCorrected = na.locf(Ozone, na.rm = FALSE)) %>%
  summary(.)

##      Ozone             date            OzoneCorrected
##  Min.   :  1.00   Min.   :1973-05-01   Min.   :  1.00
##  1st Qu.: 18.00   1st Qu.:1973-06-08   1st Qu.: 16.00
##  Median : 31.50   Median :1973-07-16   Median : 30.00
##  Mean   : 42.13   Mean   :1973-07-16   Mean   : 39.78
##  3rd Qu.: 63.25   3rd Qu.:1973-08-23   3rd Qu.: 52.00
##  Max.   :168.00   Max.   :1973-09-30   Max.   :168.00
##  NA's   :37

5.2.2. Other simple operations

A number of additional simple operations commonly used in working with data are available—in particular, renaming columns by altering the column names directly, and also removing columns by assigning NULL. We will show these briefly:

d <- data.frame(x = 1:2, y = 3:4)
print(d)
#>   x y
#> 1 1 3
#> 2 2 4

colnames(d) <- c("BIGX", "BIGY")
print(d)
#>   BIGX BIGY
#> 1    1    3
#> 2    2    4

d$BIGX <- NULL
print(d)
#>   BIGY
#> 1    3
#> 2    4
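
For completeness, here is a small sketch of our own showing the same renaming and column removal using data.table and dplyr (setnames() and := NULL alter the data.table in place; rename() and select() build a new data frame):

library("data.table")
library("dplyr")

d <- data.frame(x = 1:2, y = 3:4)

# data.table: rename and remove in place
dt <- as.data.table(d)
setnames(dt, old = c("x", "y"), new = c("BIGX", "BIGY"))
dt[ , BIGX := NULL]
# print(dt)

# dplyr: rename and remove, producing a new data frame
d2 <- d %>%
  rename(., BIGX = x, BIGY = y) %>%
  select(., -BIGX)
# print(d2)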

5.3. Aggregating transforms

This section covers transforms that combine multiple rows or multiple columns.

5.3.1. Combining many rows into summary rows

Here we address the situation where there are multiple observations or measurements of a single subject (in this case, a species of iris) that we wish to aggregate into a single observation.

Scenario

We have been asked to make a report summarizing iris petals by species.

Problem

Summarize measurements by category, as shown in figure 5.10.

Figure 5.10. Aggregating rows

Example data

Again, we use measurements of petal length and width, by iris species, from the iris dataset :

library("datasets")
library("ggplot2")

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
Base R solution
iris_summary <- aggregate(
  cbind(Petal.Length, Petal.Width) ~ Species,
  data = iris,
  FUN = mean)

print(iris_summary)

#      Species Petal.Length Petal.Width
# 1     setosa        1.462       0.246
# 2 versicolor        4.260       1.326
# 3  virginica        5.552       2.026

library(ggplot2)
ggplot(mapping = aes(x = Petal.Length, y = Petal.Width,
                     shape = Species, color = Species)) +
  geom_point(data = iris, # raw data
             alpha = 0.5) +
  geom_point(data = iris_summary, # per-group summaries
             size = 5) +
  ggtitle("Average Petal dimensions by iris species
(with raw data for refer
     ence)")

This produces figure 5.11, a new iris plot with group mean annotations.

Figure 5.11. Iris plot

data.table solution
library("data.table")

iris_data.table <- as.data.table(iris)
iris_data.table <- iris_data.table[,
                                   .(Petal.Length = mean(Petal.Length),
                                     Petal.Width = mean(Petal.Width)),
                                   by = .(Species)]

# print(iris_data.table)
dplyr solution

  • dplyr::group_by
  • dplyr::summarize
  • A one-argument aggregation function, for example sum or mean
library("dplyr")

iris_summary <- iris %>% group_by(., Species) %>%
  summarize(.,
            Petal.Length = mean(Petal.Length),
            Petal.Width = mean(Petal.Width)) %>%
  ungroup(.)

# print(iris_summary)
Window functions

Both data.table and dplyr have grouped versions of the preceding operations (similar to what relational databases call window functions). This lets each row carry the per-group summary without having to build a summary table and then join it back in (the usual way to compute such quantities). For example:

iris_copy <- iris
iris_copy$mean_Petal.Length <-
      ave(iris$Petal.Length, iris$Species, FUN = mean)
iris_copy$mean_Petal.Width <- ave(iris$Petal.Width, iris$Species, FUN = mean)

# head(iris_copy)
# tail(iris_copy)

In data.table, the task looks like the following:

library("data.table")

iris_data.table <- as.data.table(iris)

iris_data.table[ ,
                 `:=`(mean_Petal.Length = mean(Petal.Length),
                      mean_Petal.Width = mean(Petal.Width)),
                 by = "Species"]

# print(iris_data.table)

Please run the preceding code and print iris_data.table to see that the computed means are per-group.

dplyr has similar functionality:

library("dplyr")

iris_dplyr <- iris %>%
  group_by(., Species) %>%
  mutate(.,
         mean_Petal.Length = mean(Petal.Length),
         mean_Petal.Width = mean(Petal.Width)) %>%
  ungroup(.)

# head(iris_dplyr)

Again, it is critical to ungroup() when applying per-group transforms. Also, be aware that dplyr grouped operations (in particular, row selection through filter()) can be much slower than ungrouped operations, so you want to make your group_by()/ungroup() intervals as short as possible. And dplyr grouped operations are usually much slower than data.table grouped operations in general.

5.4. Multitable data transforms

This section covers operations between multiple tables. This includes the tasks of splitting tables, concatenating tables, and joining tables.

5.4.1. Combining two or more ordered data frames quickly

Here we discuss combining two data frames, with the same number of rows or columns (and same order!). A more involved, but more general way to combine data is demonstrated in section 5.4.2.

Scenario

We have been asked to draw information about products from a sales database and produce a report. Typically, different facts (in this case, price and units sold) are stored in different tables, so to produce our report, we will have to combine data from more than one table.

For example, suppose our example data was the following:

productTable <- wrapr::build_frame(
   "productID", "price" |
   "p1"       , 9.99    |
   "p2"       , 16.29   |
   "p3"       , 19.99   |
   "p4"       , 5.49    |
   "p5"       , 24.49   )

salesTable <- wrapr::build_frame(
   "productID", "sold_store", "sold_online" |
   "p1"       , 6           , 64            |
   "p2"       , 31          , 1             |
   "p3"       , 30          , 23            |
   "p4"       , 31          , 67            |
   "p5"       , 43          , 51            )

productTable2 <- wrapr::build_frame(
   "productID", "price" |
   "n1"       , 25.49   |
   "n2"       , 33.99   |
   "n3"       , 17.99   )

productTable$productID <- factor(productTable$productID)
productTable2$productID <- factor(productTable2$productID)
Problem 1: Appending rows

When two tables have the exact same column structure, we can concatenate them to get a larger table, as in figure 5.12.

Figure 5.12. Unioning rows

Base R solution

rbind

rbind_base = rbind(productTable,
                   productTable2)

Note that rbind creates a new factor variable when merging incompatible factor variables:

str(rbind_base)

## 'data.frame':    8 obs. of  2 variables:
##  $ productID: Factor w/ 8 levels "p1","p2","p3",..: 1 2 3 4 5 6 7 8
##  $ price    : num  9.99 16.29 19.99 5.49 24.49 ...
data.table solution
library("data.table")

rbindlist(list(productTable,
               productTable2))

##    productID price
## 1:        p1  9.99
## 2:        p2 16.29
## 3:        p3 19.99
## 4:        p4  5.49
## 5:        p5 24.49
## 6:        n1 25.49
## 7:        n2 33.99
## 8:        n3 17.99

data.table also correctly merges factor types.

dplyr solution

dplyr::bind_rows

library("dplyr")

bind_rows(list(productTable,
               productTable2))

## Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character

## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector

## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector

##   productID price
## 1        p1  9.99
## 2        p2 16.29
## 3        p3 19.99
## 4        p4  5.49
## 5        p5 24.49
## 6        n1 25.49
## 7        n2 33.99
## 8        n3 17.99

Notice that bind_rows coerces incompatible factor variables to character.

Problem 2: Splitting tables

The inverse of row binding is splitting. Many difficult calculations can be made easy by splitting a data frame into a family of data frames, then working on each data frame, and finally rebinding them together. The best implementation of this pattern is the one found in data.table, which has some priority (being one of the first); data.table only simulates the splitting and recombining of the data (so it tends to be very fast).

Base R solution
# add an extra column telling us which table
# each row comes from
productTable_marked <- productTable
productTable_marked$table <- "productTable"
productTable2_marked <- productTable2
productTable2_marked$table <- "productTable2"

# combine the tables
rbind_base <- rbind(productTable_marked,
                    productTable2_marked)
rbind_base

##   productID price         table
## 1        p1  9.99  productTable
## 2        p2 16.29  productTable
## 3        p3 19.99  productTable
## 4        p4  5.49  productTable
## 5        p5 24.49  productTable
## 6        n1 25.49 productTable2
## 7        n2 33.99 productTable2
## 8        n3 17.99 productTable2

# split them apart
tables <- split(rbind_base, rbind_base$table)
tables

## $productTable
##   productID price        table
## 1        p1  9.99 productTable
## 2        p2 16.29 productTable
## 3        p3 19.99 productTable
## 4        p4  5.49 productTable
## 5        p5 24.49 productTable
##
## $productTable2
##   productID price         table
## 6        n1 25.49 productTable2
## 7        n2 33.99 productTable2
## 8        n3 17.99 productTable2
data.table solution

data.table combines the split, apply, and recombine steps into a single, very efficient operation. We will continue our example with the rbind_base object to show the effect. data.table is willing to call a user function or execute a user expression for each data group and supplies special variables to work per-group:

  • .BY is a named list of the grouping variables and values per group. .BY is a list of scalars, as by definition grouping variables do not vary within a group.
  • .SD is a data.table representation of the set of rows for the given group, with the grouping columns removed.

For instance, to compute a max price per group, we can do the following:

library("data.table")

# convert to data.table
dt <- as.data.table(rbind_base)

# arbitrary user defined function
f <- function(.BY, .SD) {
  max(.SD$price)
}

# apply the function to each group
# and collect results
dt[ , max_price := f(.BY, .SD), by = table]

print(dt)

##    productID price         table max_price
## 1:        p1  9.99  productTable     24.49
## 2:        p2 16.29  productTable     24.49
## 3:        p3 19.99  productTable     24.49
## 4:        p4  5.49  productTable     24.49
## 5:        p5 24.49  productTable     24.49
## 6:        n1 25.49 productTable2     33.99
## 7:        n2 33.99 productTable2     33.99
## 8:        n3 17.99 productTable2     33.99

Note that the preceding is a powerful general form not needed for such a simple task. Simple per-group aggregation of values is usually achieved by naming the columns:

library("data.table")

dt <- as.data.table(rbind_base)
grouping_column <- "table"
dt[ , max_price := max(price), by = eval(grouping_column)]

print(dt)

##    productID price         table max_price
## 1:        p1  9.99  productTable     24.49
## 2:        p2 16.29  productTable     24.49
## 3:        p3 19.99  productTable     24.49
## 4:        p4  5.49  productTable     24.49
## 5:        p5 24.49  productTable     24.49
## 6:        n1 25.49 productTable2     33.99
## 7:        n2 33.99 productTable2     33.99
## 8:        n3 17.99 productTable2     33.99

In this example, we took the liberty of showing how we would group by a column chosen by a variable.

dplyr solution

dplyr doesn’t have its own split implementation. dplyr tries to simulate working on subtables with its group_by() notation. For example, to compute the maximum price per group in dplyr, we would write code like the following:

rbind_base %>%
  group_by(., table) %>%
  mutate(., max_price = max(price)) %>%
  ungroup(.)

## # A tibble: 8 x 4
##   productID price table         max_price
##   <fct>     <dbl> <chr>             <dbl>
## 1 p1         9.99 productTable       24.5
## 2 p2        16.3  productTable       24.5
## 3 p3        20.0  productTable       24.5
## 4 p4         5.49 productTable       24.5
## 5 p5        24.5  productTable       24.5
## 6 n1        25.5  productTable2      34.0
## 7 n2        34.0  productTable2      34.0
## 8 n3        18.0  productTable2      34.0

This is not going to be as powerful as calling an arbitrary function per data group.

Problem 3: Appending columns

Append a data frame as columns to another data frame. The data frames must have the same number of rows and same row order (with respect to what we consider to be row-keys). This is illustrated in figure 5.13.

Figure 5.13. Unioning columns

Create a table of product information (price and units sold), from productTable and salesTable. This assumes that the products are sorted in the same order in both tables. If they are not, then sort them, or use a join command to merge the tables together (please see section 5.4.2).

Base R solution

cbind

cbind(productTable, salesTable[, -1])

##   productID price sold_store sold_online
## 1        p1  9.99          6          64
## 2        p2 16.29         31           1
## 3        p3 19.99         30          23
## 4        p4  5.49         31          67
## 5        p5 24.49         43          51
data.table solution

For binding columns, data.table methods require the data to already be of type data.table.

library("data.table")

cbind(as.data.table(productTable),
      as.data.table(salesTable[, -1]))

##    productID price sold_store sold_online
## 1:        p1  9.99          6          64
## 2:        p2 16.29         31           1
## 3:        p3 19.99         30          23
## 4:        p4  5.49         31          67
## 5:        p5 24.49         43          51
dplyr solution

dplyr::bind_cols

library("dplyr")

# list of data frames calling convention
dplyr::bind_cols(list(productTable, salesTable[, -1]))

##   productID price sold_store sold_online
## 1        p1  9.99          6          64
## 2        p2 16.29         31           1
## 3        p3 19.99         30          23
## 4        p4  5.49         31          67
## 5        p5 24.49         43          51

5.4.2. Principal methods to combine data from multiple tables

Join is the relational name for the process of combining two tables to create a third. The join results in a table that can have a row for every matching pair of rows from the original two tables (plus possibly rows from each table that did not have matches in the other table). Rows are matched on key values from one table to the other. The simplest case is when each table has a set of columns that uniquely determine each row (a unique key), and this is the case we will discuss here.

Scenario

Our example data is information about products in a sales database. Various facts (in this case, price and units sold) are stored in different tables. Missing values are allowed. We are tasked with combining these tables to produce a report.

First let’s set up some example data:

productTable <- wrapr::build_frame(
   "productID", "price" |
   "p1"       , 9.99    |
   "p3"       , 19.99   |
   "p4"       , 5.49    |
   "p5"       , 24.49   )

salesTable <- wrapr::build_frame(
   "productID", "unitsSold" |
   "p1"       , 10          |
   "p2"       , 43          |
   "p3"       , 55          |
   "p4"       , 8           )
Left join

The most important join for the data scientist is likely the left join. This join keeps every row from the left table and adds columns coming from matching rows in the right table. When there are no matching rows, NA values are substituted in. Usually, you design the right table (the second argument to your join command) to have unique keys; otherwise, the number of rows may grow (there is no need for the left table to have unique keys).
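
To see why non-unique keys in the right table matter, here is a small sketch of our own (separate from the product/sales example): the duplicated key makes the single left row match twice, so the result grows.

left  <- data.frame(k = c("a", "b"), x = 1:2)
right <- data.frame(k = c("a", "a"), y = c(10, 11))   # the key "a" is duplicated

merge(left, right, by = "k", all.x = TRUE)            # base R left join

##   k x  y
## 1 a 1 10
## 2 a 1 11
## 3 b 2 NA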

The operation is typically used to adjoin data from the second (or right) table into a copy of the first or left table, as shown in figure 5.14.

Figure 5.14. Left join

Base R solution

merge with argument all.x = TRUE

merge(productTable, salesTable, by = "productID", all.x = TRUE)

##   productID price unitsSold
## 1        p1  9.99        10
## 2        p3 19.99        55
## 3        p4  5.49         8
## 4        p5 24.49        NA
data.table solution
library("data.table")

productTable_data.table <- as.data.table(productTable)
salesTable_data.table <- as.data.table(salesTable)

# index notation for join
# idea is rows are produced for each row inside the []
salesTable_data.table[productTable_data.table, on = "productID"]

##    productID unitsSold price
## 1:        p1        10  9.99
## 2:        p3        55 19.99
## 3:        p4         8  5.49
## 4:        p5        NA 24.49

# data.table also overrides merge()
merge(productTable, salesTable, by = "productID", all.x = TRUE)

##   productID price unitsSold
## 1        p1  9.99        10
## 2        p3 19.99        55
## 3        p4  5.49         8
## 4        p5 24.49        NA
Base R indexing solution

The data.table index notation reminds us that there is another very good Base R way to use one table to add a single column to another: vectorized lookup through the match() and [] methods.

library("data.table")

joined_table <- productTable
joined_table$unitsSold <- salesTable$unitsSold[match(joined_table$productID,
                                                     salesTable$productID)]
print(joined_table)

##   productID price unitsSold
## 1        p1  9.99        10
## 2        p3 19.99        55
## 3        p4  5.49         8
## 4        p5 24.49        NA

match() found the matching indices, and [] used the indices to retrieve the data. Please see help(match) for more details.

dplyr solution
library("dplyr")

left_join(productTable, salesTable, by = "productID")

##   productID price unitsSold
## 1        p1  9.99        10
## 2        p3 19.99        55
## 3        p4  5.49         8
## 4        p5 24.49        NA
Right join

There is also a join called right join that is just the left join with the arguments reversed. As the right join is so similar to the left, we will forgo any right join examples.

Inner join

In an inner join, you merge two tables into a single table, keeping only the rows where the key exists in both tables. This produces an intersection of the two tables, as shown in figure 5.15.

Figure 5.15. Inner join

Base R solution

merge

merge(productTable, salesTable, by = "productID")

##   productID price unitsSold
## 1        p1  9.99        10
## 2        p3 19.99        55
## 3        p4  5.49         8
data.table solution
library("data.table")

productTable_data.table <- as.data.table(productTable)
salesTable_data.table <- as.data.table(salesTable)

merge(productTable, salesTable, by = "productID")

##   productID price unitsSold
## 1        p1  9.99        10
## 2        p3 19.99        55
## 3        p4  5.49         8
dplyr solution

inner_join

library("dplyr")

inner_join(productTable, salesTable, by = "productID")

##   productID price unitsSold
## 1        p1  9.99        10
## 2        p3 19.99        55
## 3        p4  5.49         8
Full join

In a full join, you merge two tables into a single table, keeping rows for all key values. Notice that the two tables have equal importance here. We show the result in figure 5.16.

Figure 5.16. Full join

Base R solution

merge with argument all=TRUE

# note that merge orders the result by key column by default
# use sort=FALSE to skip the sorting
merge(productTable, salesTable, by = "productID", all=TRUE)

##   productID price unitsSold
## 1        p1  9.99        10
## 2        p2    NA        43
## 3        p3 19.99        55
## 4        p4  5.49         8
## 5        p5 24.49        NA
data.table solution
library("data.table")

productTable_data.table <- as.data.table(productTable)
salesTable_data.table <- as.data.table(salesTable)

merge(productTable_data.table, salesTable_data.table,
      by = "productID", all = TRUE)

##    productID price unitsSold
## 1:        p1  9.99        10
## 2:        p2    NA        43
## 3:        p3 19.99        55
## 4:        p4  5.49         8
## 5:        p5 24.49        NA
dplyr solution

dplyr::full_join

library("dplyr")

full_join(productTable, salesTable, by = "productID")

##   productID price unitsSold
## 1        p1  9.99        10
## 2        p3 19.99        55
## 3        p4  5.49         8
## 4        p5 24.49        NA
## 5        p2    NA        43
A harder join problem

The examples we have given up to now do not use row order. Some problems can be solved much more efficiently with methods that do use row order, such as data.table's powerful rolling join operation.

Scenario

You are given historic stock trade and quote (bid/ask) data. You are asked to perform the following analysis: find the bid and ask prices that were current at the time each trade was performed. This involves using row order to indicate time, and sharing information between rows.

Example data

In stock markets, the bid is the highest price somebody has declared they are willing to pay for a stock, and the ask is the lowest price that somebody has declared they are willing to sell a stock for. Bid and ask data are called quotes, and they usually are in an irregular time series (as new quotes can be formed at arbitrary times, not just at regular intervals), such as the following example:

library("data.table")

quotes <- data.table(
  bid = c(5, 5, 7, 8),
  ask = c(6, 6, 8, 10),
  bid_quantity = c(100, 100, 100, 100),
  ask_quantity = c(100, 100, 100, 100),
  when = as.POSIXct(strptime(
    c("2018-10-18 1:03:17",
      "2018-10-18 2:12:23",
      "2018-10-18 2:15:00",
      "2018-10-18 2:17:51"),
    "%Y-%m-%d %H:%M:%S")))

print(quotes)

##    bid ask bid_quantity ask_quantity                when
## 1:   5   6          100          100 2018-10-18 01:03:17
## 2:   5   6          100          100 2018-10-18 02:12:23
## 3:   7   8          100          100 2018-10-18 02:15:00
## 4:   8  10          100          100 2018-10-18 02:17:51

Another irregular time series is trades. These are after-the-fact reports about exchanges of quantities of stock at a given price at a given time.

trades <- data.table(
  trade_id = c(32525, 32526),
  price = c(5.5, 9),
  quantity = c(100, 200),
  when = as.POSIXct(strptime(
    c("2018-10-18 2:13:42",
      "2018-10-18 2:19:20"),
    "%Y-%m-%d %H:%M:%S")))

print(trades)

##    trade_id price quantity                when
## 1:    32525   5.5      100 2018-10-18 02:13:42
## 2:    32526   9.0      200 2018-10-18 02:19:20
Rolling joins

The data.table rolling join is perfect for finding what was the most recent quote information for each trade. A rolling join is a type of join on an ordered column that gives us the most recent data available at the lookup time.

quotes[, quote_time := when]
trades[, trade_time := when]
quotes[trades, on = "when", roll = TRUE][
  , .(quote_time, bid, price, ask, trade_id, trade_time)]
##             quote_time bid price ask trade_id          trade_time
## 1: 2018-10-18 02:12:23   5   5.5   6    32525 2018-10-18 02:13:42
## 2: 2018-10-18 02:17:51   8   9.0  10    32526 2018-10-18 02:19:20

We read the preceding as “for each trade, look up the appropriate quote.” In the join, the when field comes from the trades, which is why we added a quote_time field so we could also see when the quote was established. The data.table rolling join is very fast, and also not easy to efficiently simulate in base R, SQL, or dplyr.

Rolling joins are unique to data.table. In R there are a number of tasks, such as matching most recent records, that are easily expressed as moving indexes across rows. However, moving indexes across rows tends to be inefficient in R, as it can’t be vectorized like column operations can. A rolling join is a direct way of solving such problems, and has an efficient implementation.
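For this simple single-key example, base R can approximate the lookup with findInterval(), which is vectorized; the sketch below is ours and assumes both tables are already sorted by time and that every trade has at least one earlier quote. The rolling join's real advantage shows up when the match must also respect grouping keys (for example, a per-stock-symbol match), which findInterval() does not handle directly.

# base R sketch of the "most recent quote" lookup
quotes_df <- as.data.frame(quotes)
trades_df <- as.data.frame(trades)

# for each trade time, index of the last quote time at or before it
idx <- findInterval(as.numeric(trades_df$when), as.numeric(quotes_df$when))

# adjoin the matching bid/ask; should agree with the rolling join result
cbind(trades_df, quotes_df[idx, c("bid", "ask")])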

5.5. Reshaping transforms

This section covers moving data between rows and columns. This is often called pivoting, a name from Pito Salas’s work that combines data summarization and shape transforms. Examples will be worked in three packages: data.table, cdata (which only reshapes data, and does not summarize data), and tidyr. Base R does have notations for these transforms (such as stack() and unstack()), but the package versions are significantly better tools.
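For comparison, here is a tiny sketch (ours) of base R's stack(): it moves the chosen columns into a values/ind pair, but unlike the package functions it does not carry along other columns (such as a date key) without extra work, which is one reason we prefer the package versions.

d <- data.frame(DriversKilled = c(115, 104),
                front = c(595, 673))
stack(d)

##   values           ind
## 1    115 DriversKilled
## 2    104 DriversKilled
## 3    595         front
## 4    673         front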

5.5.1. Moving data from wide to tall form

We will show how to move data records where all measurements are in single rows to a new record set where data is in multiple rows. We call this moving from a wide form to a thin or tall form.

Data example

Let’s work with measurements of vehicle drivers/passengers injured or killed, by month. The data includes additional information about fuel price and whether seatbelts are required by law.

Relevant variables for this example:

  • date: Year and month of measurement (numeric representation)
  • DriversKilled: Car drivers killed
  • front: Front seat passengers killed or seriously injured
  • rear: Rear seat passengers killed or seriously injured
  • law: Whether or not the seatbelt law was in effect (0/1)
library("datasets")
library("xts")

# move the date index into a column
dates <- index(as.xts(time(Seatbelts)))
Seatbelts <- data.frame(Seatbelts)
Seatbelts$date <- dates

# restrict down to 1982 and 1983
Seatbelts <- Seatbelts[ (Seatbelts$date >= as.yearmon("Jan 1982")) &
                          (Seatbelts$date <= as.yearmon("Dec 1983")),
                           , drop = FALSE]
Seatbelts$date <- as.Date(Seatbelts$date)
# mark if the seatbelt law was in effect
Seatbelts$law <- ifelse(Seatbelts$law==1, "new law", "pre-law")
# limit down to the columns we want
Seatbelts <- Seatbelts[, c("date", "DriversKilled", "front", "rear", "law")]

head(Seatbelts)

##           date DriversKilled front rear     law
## 157 1982-01-01           115   595  238 pre-law
## 158 1982-02-01           104   673  285 pre-law
## 159 1982-03-01           131   660  324 pre-law
## 160 1982-04-01           108   676  346 pre-law
## 161 1982-05-01           103   755  410 pre-law
## 162 1982-06-01           115   815  411 pre-law

To get our data into a presentable format, we have performed transforms described in earlier sections of this chapter: selecting rows, selecting columns, adding new derived columns, and so on. The data now has one row per date (we think of the date as the row-key) and contains information such as how many people were killed in each of three seating positions (driver, front, rear) and if the new seatbelt law was in effect.

We want to see if the new seatbelt law saves lives. Notice that we are missing a key bit of information: a normalizing factor such as number of cars owned per date, driving population size by date, or total miles driven per date (risks make more sense as rates than as absolute counts). This is an example of true data science being an iterative process: we are going to do the best job we can with the data at hand, but in a real project, we would also go back to sources and partners to try to get the critical missing data (or at least an estimate or proxy for the missing data).

Let’s plot the data conditioned on the law:

# let's give an example of the kind of graph we have in mind,
# using just driver deaths
library("ggplot2")

ggplot(Seatbelts,
       aes(x = date, y = DriversKilled, color = law, shape = law)) +
  geom_point() +
  geom_smooth(se=FALSE) +
  ggtitle("UK car driver deaths by month")

This code produces figure 5.17.

Figure 5.17. Passenger deaths plot

From the chart, it looks like the introduction of the seatbelt law produced a drop in deaths that is non-trivial when compared to the normal variation in deaths. It also looks like the effect may have reverted quickly.

Suppose our follow-up question is to break down this data further to seating position (as types of seatbelts differ quite a bit by seating position).

To make such a plot with ggplot2, we need to move the data from all facts being in each row to having one row per seating position. This is an example of moving from a wide or denormalized format, the natural format for machine learning tasks, to a tall or multiline record format.

Problem

Plot deaths conditioned on date and seating position, using ggplot2. ggplot2 requires the data to be in a long, rather than wide, format. So we will concentrate on how to perform this transform. We call this sort of transform moving data from row-oriented records to blocks of rows, as shown in figure 5.18.

Figure 5.18. Wide-to-tall conversion

Solution 1: data.table::melt.data.table()

We can solve this with data.table::melt.data.table(). Specify the columns of the original table that the values are to be taken from with the measure.vars argument. Specify the pair of columns the information is to be written into in the transformed table with the arguments variable.name (the new key column) and value.name (the new value column).

library("data.table")

seatbelts_long2 <-
  melt.data.table(as.data.table(Seatbelts),
                  id.vars = NULL,
                  measure.vars = c("DriversKilled", "front", "rear"),
                  variable.name = "victim_type",
                  value.name = "nvictims")

Once we plot the transformed data (the faceted plot we build after working through the remaining solutions; see figure 5.19), the new graphs show us something more: the law had essentially no effect on people in the rear seats. This could be because the law didn't cover those seats, because enforcing rear-seat compliance was difficult, or because rear-seat seatbelts were lap belts (instead of three-point restraints) and were less effective. The strongest benefit appears to go to front-seat passengers, which is not too unusual, as they tend to have high-quality seatbelts and are not sitting in front of the steering column (a primary source of fatal injuries).

Solution 2: cdata::unpivot_to_blocks()
library("cdata")

seatbelts_long3 <- unpivot_to_blocks(
  Seatbelts,
  nameForNewKeyColumn = "victim_type",
  nameForNewValueColumn = "nvictims",
  columnsToTakeFrom = c("DriversKilled", "front", "rear"))

cdata has simple methods to specify the coordinated conversions of many columns at once. A good introduction can be found at http://www.win-vector.com/blog/2018/10/faceted-graphs-with-cdata-and-ggplot2/.

We encourage you to try all three solutions and convince yourself that they produce equivalent results. We prefer the cdata solution, but it is newer and not as well known as the data.table or tidyr solutions.

Solution 3: tidyr::gather()
library("tidyr")

seatbelts_long1 <- gather(
  Seatbelts,
  key = victim_type,
  value = nvictims,
  DriversKilled, front, rear)

head(seatbelts_long1)
##         date     law   victim_type nvictims
## 1 1982-01-01 pre-law DriversKilled      115
## 2 1982-02-01 pre-law DriversKilled      104
## 3 1982-03-01 pre-law DriversKilled      131
## 4 1982-04-01 pre-law DriversKilled      108
## 5 1982-05-01 pre-law DriversKilled      103
## 6 1982-06-01 pre-law DriversKilled      115

ggplot(seatbelts_long1,
       aes(x = date, y = nvictims, color = law, shape = law)) +
  geom_point() +
  geom_smooth(se=FALSE) +
  facet_wrap(~victim_type, ncol=1, scales="free_y") +
  ggtitle("UK auto fatalities by month and seating position")

And we now have the passenger death data faceted by seating position in figure 5.19.

Figure 5.19. Faceted passenger death plot

5.5.2. Moving data from tall to wide form

We have been given data in a log style, where each detail of a measurement is written in a separate row. Colloquially, we call this a tall or thin data form (formally, it is related to information storage ideas such as RDF triples). The operation of moving to a wide form is very much like what Microsoft Excel users call pivoting, except that aggregations (sums, averages, counts) are not strictly part of moving from tall to wide form (we suggest aggregating first, before transforming). Also, moving from tall to wide form is, of course, the inverse of the wide-to-tall conversion we discussed earlier.

The data

For our example, we have taken the ChickWeight data from R’s datasets package. Please try these commands along with the book and take extra steps to examine the data (using commands such as View(), head(), summary() and so on):

library("datasets")
library("data.table")
library("ggplot2")

ChickWeight <- data.frame(ChickWeight) # get rid of attributes
ChickWeight$Diet <- NULL # remove the diet label
# pad names with zeros
padz <- function(x, n=max(nchar(x))) gsub(" ", "0", formatC(x, width=n))
# append "Chick" to the chick ids
ChickWeight$Chick <- paste0("Chick", padz(as.character(ChickWeight$Chick)))

head(ChickWeight)

##   weight Time   Chick
## 1     42    0 Chick01
## 2     51    2 Chick01
## 3     59    4 Chick01
## 4     64    6 Chick01
## 5     76    8 Chick01
## 6     93   10 Chick01

This data is organized so that each row is a single fact (weight) about a given chick at a given time. This is a very easy format to produce and transmit, which is why it is popular in scientific settings. To perform interesting work or learn from the data, we need to bring the data into a wider structure. For our problem, we would like all the weight facts about a chick to be in a single row, with time as the new column name.

Before doing that, let’s use some of our earlier lessons to get a look at the data. We can aggregate the data to move from information about individuals to overall trends.

# aggregate count and mean weight by time
ChickSummary <- as.data.table(ChickWeight)
ChickSummary <- ChickSummary[,
             .(count = .N,
               weight = mean(weight),
               q1_weight = quantile(weight, probs = 0.25),
               q2_weight = quantile(weight, probs = 0.75)),
             by = Time]
head(ChickSummary)

##    Time count    weight q1_weight q2_weight
## 1:    0    50  41.06000        41        42
## 2:    2    50  49.22000        48        51
## 3:    4    49  59.95918        57        63
## 4:    6    49  74.30612        68        80
## 5:    8    49  91.24490        83       102
## 6:   10    49 107.83673        93       124

In ChickSummary, the only key is Time (specified by the data.table by argument), and we can now see how many chicks are surviving at a given time, as well as the distribution of surviving chick weights at that time.

We can present this table graphically. To use ggplot2 to do this, we need to move the summarized data to a tall form (as ggplot2 prefers to work with tall data). We use cdata::unpivot_to_blocks:

library("ggplot2")

# unpivot the summaries into tall form for plotting
ChickSummary <- cdata::unpivot_to_blocks(
  ChickSummary,
  nameForNewKeyColumn = "measurement",
  nameForNewValueColumn = "value",
  columnsToTakeFrom = c("count", "weight"))

# make sure we have the exact set of columns needed for plotting
ChickSummary$q1_weight[ChickSummary$measurement == "count"] <- NA
ChickSummary$q2_weight[ChickSummary$measurement == "count"] <- NA
CW <- ChickWeight
CW$measurement <- "weight"

# make the plot
ggplot(ChickSummary, aes(x = Time, y = value, color = measurement)) +
  geom_line(data = CW, aes(x = Time, y = weight, group = Chick),
            color = "LightGray") +
  geom_line(size = 2) +
  geom_ribbon(aes(ymin = q1_weight, ymax = q2_weight),
              alpha = 0.3, colour = NA) +
  facet_wrap(~measurement, ncol = 1, scales = "free_y") +
  theme(legend.position = "none") +
  ylab(NULL) +
  ggtitle("Chick Weight and Count Measurements by Time",
          subtitle = "25% through 75% quartiles of weight shown shaded around mean")

This gives the chick weights organized by time and chick, as shown in figure 5.20.

Figure 5.20. Chick count and weight over time

Here we have plotted the total count of surviving chicks as a function of time, plus the weight trajectory of each individual chick, and the summary statistics (mean weight, and the 25% through 75% quartiles of weight).

Problem

We can now return to the example task of this section: putting all the information about each chick into a single row.

Diagrammatically, it looks like the following: one column’s (meastype) values are used as new column headers and a second column (meas) supplies values. We call this moving data from blocks to wide row records, as illustrated in figure 5.21.

Figure 5.21. Moving from tall to wide form

Solution 1: data.table::dcast.data.table()

To move data to a wide form using dcast.data.table(), we specify the rows and columns of a result matrix using a formula with the ~ notation. We then say how to populate the cells of this matrix with the value.var argument. In our case, to get a data frame with a row per chick, a column per time, and weight in the cells, we use the following step:

library("data.table")

ChickWeight_wide2 <- dcast.data.table(
  as.data.table(ChickWeight),
  Chick ~ Time,
  value.var = "weight")

This table is a matrix where the row is identified by the chick, and the column is time. The cells contain the weight for the given chick and time (and NA if the chick did not survive to a given time). Notice that this format is much easier to read and may be required for reporting.
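To take a quick look (a check we have added), print the first few rows; the values agree with the tidyr::spread() result shown at the end of this section.

head(ChickWeight_wide2, n = 3)

##      Chick  0  2  4  6  8  10  12  14  16  18  20  21
## 1: Chick01 42 51 59 64 76  93 106 125 149 171 199 205
## 2: Chick02 40 49 58 72 84 103 122 138 162 187 209 215
## 3: Chick03 43 39 55 67 84  99 115 138 163 187 198 202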

data.table’s implementation of dcast also allows more-powerful transforms, such as casting to multiple variables and aggregations at the same time.
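As a small sketch of that (ours, not a book example), the fun.aggregate argument lets dcast() aggregate while it reshapes; here we compute the mean weight per time, collapsing over chicks, which should reproduce the weight column of the ChickSummary table from earlier.

library("data.table")

dcast.data.table(
  as.data.table(ChickWeight),
  Time ~ .,                 # one row per Time, no column-splitting variable
  value.var = "weight",
  fun.aggregate = mean)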

Solution 2: cdata::pivot_to_rowrecs()

cdata::pivot_to_rowrecs() describes the intended table by row-keys, column to take new column keys from, and column to take values from:

library("cdata")

ChickWeight_wide3 <- pivot_to_rowrecs(
  ChickWeight,
  columnToTakeKeysFrom = "Time",
  columnToTakeValuesFrom = "weight",
  rowKeyColumns = "Chick")
Solution 3: tidyr::spread()
library("tidyr")

ChickWeight_wide1 <- spread(ChickWeight,
                            key = Time,
                            value = weight)

head(ChickWeight_wide1)

##     Chick  0  2  4  6  8  10  12  14  16  18  20  21
## 1 Chick01 42 51 59 64 76  93 106 125 149 171 199 205
## 2 Chick02 40 49 58 72 84 103 122 138 162 187 209 215
## 3 Chick03 43 39 55 67 84  99 115 138 163 187 198 202
## 4 Chick04 42 49 56 67 74  87 102 108 136 154 160 157
## 5 Chick05 41 42 48 60 79 106 141 164 197 199 220 223
## 6 Chick06 41 49 59 74 97 124 141 148 155 160 160 157

5.5.3. Data coordinates

There are a lot of details to data transforms. The important concept to retain is this: data has coordinates such as name of table, name of column, and identity of row. The exact way the coordinates are specified is an implementation detail to be overcome or transformed to a convenient state. All of this is a consequence of Codd’s second rule of database design: “Each and every datum (atomic value) in a relational database is guaranteed to be logically accessible by resorting to a combination of table name, primary key value, and column name.”[4] What we hope you have learned is this: what parts of the coordinates (or access plan) happen to be the table name, versus row-keys, versus the column name is an alterable implementation detail.

4
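As a small illustration (ours), the same chick-weight datum can be addressed through either layout we built above: in the wide table the time is a column name, while in the tall table it is part of the row key.

# wide layout: row key (Chick) plus a column name ("12")
ChickWeight_wide1[ChickWeight_wide1$Chick == "Chick01", "12"]

## [1] 106

# tall layout: the same value found by two row keys (Chick and Time)
ChickWeight$weight[ChickWeight$Chick == "Chick01" & ChickWeight$Time == 12]

## [1] 106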

Prefer simple code

Building temporary tables, adding columns, and correcting column names early is much better than having complicated analysis code. This follows Raymond’s “Rule of Representation.”

Raymond’s “Rule of Representation”

Fold knowledge into data, so program logic can be stupid and robust.

The Art of Unix Programming, Eric S. Raymond, Addison-Wesley, 2003

We suggest transforming your data to fix problems (correct column names, change the data layout) early, to make later steps easier. The format you should try to move to for predictive modeling is what database designers call a denormalized form, or what statisticians call a multivariate data matrix, or model matrix: a regular array where rows are individuals, and columns are possible observations.[5]

5

See W. J. Krzanowski and F. H. C. Marriott, Multivariate Analysis, Part 1, Edward Arnold, 1994.
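As a minimal sketch of what such a matrix looks like (ours, not from the book), base R's model.matrix() expands a data frame into the purely numeric array most modeling functions consume, turning a categorical column such as law into an indicator column; here we apply it to the Seatbelts data prepared earlier in this chapter.

# build a numeric model matrix from two of the Seatbelts columns;
# the character law column becomes a 0/1 indicator column
m <- model.matrix(~ DriversKilled + law, data = Seatbelts)
head(m, n = 3)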

The interested reader may want to pursue cdata’s powerful diagrammatic system of data layout, which is being broadly adopted, and is discussed here: https://github.com/WinVector/cdata.

Summary

In this chapter, we’ve worked through basic examples of transforming data for analysis and presentation.

At this point, we have worked through a great number of data transforms. Natural questions to ask are these: are these enough transforms? Can we quickly decompose any task into a small sequence of these transforms?

The answer is “no and yes.” There are more-specialized transforms such as “rolling window” functions and other time-series operations that are hard to express in terms of these transforms, but do in fact have their own efficient implementations in R and data.table. However, toward the “yes” answer, there are good reasons to consider the set of transforms we have learned as substantial. The basic manipulation transforms pretty much cover all of Edgar F. Codd’s relational algebra: a set of transforms that has been driving data engineering since 1970.

In this chapter you have learned

  • How to use a catalog of powerful data-reshaping transforms
  • How to apply these transforms to solve data organization issues

In part 2 of the book, we’ll talk about the process of building and evaluating models to meet your stated objectives.
