This chapter covers
This chapter will show you how to use R to organize or wrangle data into a shape useful for analysis. Data shaping is the set of steps you must take when your data is not all in one table, or not yet in an arrangement ready for analysis.
Figure 5.1 is the mental model for this chapter: working with data. Previous chapters have assumed the data is in a ready-to-go form, or we have pre-prepared the data to be in such a form for you. This chapter will prepare you to take these steps yourself. The basic concept of data wrangling is to visualize your data being structured to make your task easier, and then take steps to add this structure to your data. To teach this, we'll work a number of examples, each with a motivating task, and then work a transform that solves the problem. We'll concentrate on a set of transforms that are powerful and useful, and that cover most common situations.
We will show data wrangling solutions using base R, data.table, and dplyr.[1] Each of these has its advantages, which is why we are presenting more than one solution. Throughout this book, we are deliberately using a polyglot approach to data wrangling: mixing base R, data.table, and dplyr, as convenient. Each of these systems has its strengths:
For database tasks, we suggest using dbplyr or rquery, which we will touch on briefly in appendix A.
Some good places to start on manipulating data in R are the following free guides:
We want to improve your ability to write R code (to translate intent into implementation) and to read R code (to infer intent from existing code). To this end, this chapter, like much of this book, is designed as a sequence of worked examples. We strongly encourage you to try running the examples yourself (they are available here: https://github.com/WinVector/PDSwR2). It is key to get into the habit of planning (even drawing) your data transformation before coding. Finish clarifying your intent before getting bogged down by the details of coding. Trust that there are easy-to-find methods for most common data wrangling tasks in R, and plan with the assumption that you can find them when needed. The principle is this: make your analysis simpler by transforming your data into a simple “data matrix” format where each row is an observation, and each column is a measurement type. Fix issues, such as odd column names, early, and then you don’t have to write complicated code to work around them.
This chapter is organized in terms of the type of transform needed. For variety, we will bring in a number of small notional datasets, and spend a little time looking at each dataset. This task-to-example organization will give you a quick introduction to data transforms in R. The transforms we are going to cover include these:
The idea is that this list will give you enough tools for a very large number of tasks. We'll work from problems to solutions. We will show which command solves a given problem, and leave details of the command’s syntax to R’s help system and the guides and tutorials we suggest in this chapter. Please think of this chapter as a Rosetta Stone for data wrangling: each concept is explained once and then executed three times (usually in base R, data.table, and dplyr).
Our first application (subsetting rows and columns) will set up the general pattern of the applications, so it is worth reading through even if you are already confidently subsetting data in R.
In this chapter, we’ll use small, toy-sized datasets to make it easier to examine data before and after the transforms. We strongly suggest you run all the examples along with us. All examples are either built-in data that comes with R or available from the book’s GitHub site: https://github.com/WinVector/PDSwR2. Also, all code examples can be found in the CodeExamples at the same location. We suggest cloning or downloading this material to help working with this book.
For more information on R’s built-in examples, try the command help(datasets).
This section covers removing rows, removing columns, reordering columns, removing missing data, and reordering data rows. In the era of big data, you often have too much to look at, so limiting your data to what you need can greatly speed up your work.
A common task when working with a dataset is selecting a subset of rows or columns.
For our first example, we will use the iris dataset: measurements of sepal length and width and petal length and width for three species of iris.
First we will look at some aspects of the data. We suggest always doing this and making it part of an “eyes on the data” work discipline. For example, figure 5.2 shows the petal dimensions of our example irises:
library("ggplot2")
summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
It is good practice to attach packages early. If a package won’t attach, try installing it with a command such as install.packages("ggplot2").
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

ggplot(iris, aes(x = Petal.Length, y = Petal.Width,
                 shape = Species, color = Species)) +
  geom_point(size = 2) +
  ggtitle("Petal dimensions by iris species: all measurements")
The iris data comes preinstalled with R and is part of the datasets package. We will deliberately use small examples in this chapter so it is easy to look at results.
Suppose we are assigned to generate a report on only petal length and petal width, by iris species, for irises where the petal length is greater than 2. To accomplish this, we need to select a subset of columns (variables) or a subset of rows (instances) from a data frame.
Column and row selections look like figure 5.3.
The diagrams in this chapter are meant to be mnemonic cartoons of the transform. We suggest looking at the actual data before and after the transform to get more details on what the data transforms are doing. We also suggest reviewing them again after working the solutions and noticing how they abstract the arrangement of the data before and after the transforms. Think of these diagrams as a visual index of transforms.
The base R solution works by using the [,] indexing operator.
When working with [,], always add the third argument drop = FALSE to work around the issue that, by default, selecting a single column from an R data.frame returns a vector and not a data.frame containing the column. In many cases, we know we have more than one column and don’t strictly need the argument. But it is good to get in the habit of adding this argument to avoid unpleasant surprises.
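As a quick illustration of the behavior drop = FALSE protects against, consider selecting the single Species column from the built-in iris data:

```r
# single-column selection: the default drops the result down to a vector
class(iris[, "Species"])
## [1] "factor"

# drop = FALSE keeps the result as a one-column data.frame
class(iris[, "Species", drop = FALSE])
## [1] "data.frame"
```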
The solution strategy is this:
columns_we_want <- c("Petal.Length", "Petal.Width", "Species")
rows_we_want <- iris$Petal.Length > 2

# before
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

iris_base <- iris[rows_we_want, columns_we_want, drop = FALSE]

# after
head(iris_base)
##    Petal.Length Petal.Width    Species
## 51          4.7         1.4 versicolor
## 52          4.5         1.5 versicolor
## 53          4.9         1.5 versicolor
## 54          4.0         1.3 versicolor
## 55          4.6         1.5 versicolor
## 56          4.5         1.3 versicolor
Notice column selection is also a good way to reorder columns. An advantage of base R is it tends to be fast and has very stable APIs: code written this year in base R is most likely to work next year (tidyverse packages, unfortunately, tend to have less-stable APIs). The one disadvantage is that a few base R defaults are irritating. For example, we included the drop=FALSE notation to work around the fact that base R would return a vector instead of a data.frame if we tried to select only one column.
Row and column selection in data.table is performed similarly to base R. data.table uses a very powerful set of index notations. In this case, we use a .. notation to tell data.table that we are using the second index position to specify column names (and not to specify calculations, as we will demonstrate later).
library("data.table")

iris_data.table <- as.data.table(iris)
columns_we_want <- c("Petal.Length", "Petal.Width", "Species")
rows_we_want <- iris_data.table$Petal.Length > 2
iris_data.table <- iris_data.table[rows_we_want, ..columns_we_want]
head(iris_data.table)
##    Petal.Length Petal.Width    Species
## 1:          4.7         1.4 versicolor
## 2:          4.5         1.5 versicolor
## 3:          4.9         1.5 versicolor
## 4:          4.0         1.3 versicolor
## 5:          4.6         1.5 versicolor
## 6:          4.5         1.3 versicolor
The advantage of data.table is that it is the fastest and most memory efficient solution for data wrangling in R at a wide range of scales. data.table has a very helpful FAQ, and there is a nice cheat sheet:
Both of these will make more sense if you approach them after working some examples from the data.table vignette accessible in R with the command vignette("datatable-intro", package = "data.table").
data.table works like data.frames for packages that are not data.table-aware. This means you can use data.tables with just about any package, even those that predate data.table. In a data.table-aware situation (using data.table at the command line, or using a package that depends on data.table), data.table implements slightly enhanced semantics. We show a quick example here:
library("data.table")

df <- data.frame(x = 1:2, y = 3:4)
df[, x]       # data.frame: x must already be bound to a value
## Error in `[.data.frame`(df, , x) : object 'x' not found

x <- "y"
dt <- data.table(df)
dt[, x]       # data.table: x is looked up as a column of dt
## [1] 1 2
dt[, ..x]     # ..x: use the value of the variable x as a column name
##    y
## 1: 3
## 2: 4
The dplyr solution is written in terms of select and filter:
It is traditional to chain dplyr steps with the magrittr pipe operator %>%, but assigning to temporary variables works just as well. While teaching here, we'll use explicit dot notation, where the data pipeline is written as iris %>% select(., column) instead of the more-common implicit first-argument notation (iris %>% select(column)). Explicit dot notation was discussed in chapter 2 and is the topic of the following R Tip: http://www.win-vector.com/blog/2018/03/r-tip-make-arguments-explicit-in-magrittr-dplyr-pipelines/.[2]
A powerful alternate pipe called the dot arrow pipe (written as %.>%) that insists on explicit dot notation and supplies additional capabilities can be found in the wrapr package. For most of this book, we will stick to the magrittr pipe, but we encourage curious readers to check out the wrapr pipe in their own work.
library("dplyr")

iris_dplyr <- iris %>%
  select(., Petal.Length, Petal.Width, Species) %>%
  filter(., Petal.Length > 2)
head(iris_dplyr)
##   Petal.Length Petal.Width    Species
## 1          4.7         1.4 versicolor
## 2          4.5         1.5 versicolor
## 3          4.9         1.5 versicolor
## 4          4.0         1.3 versicolor
## 5          4.6         1.5 versicolor
## 6          4.5         1.3 versicolor
The advantage of dplyr is its emphasis on presenting data processing as a sequence of operations broken down into a visible pipeline.
There is a nice cheat sheet for dplyr available from https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf. Cheat sheets are always going to be a bit brief, so the sheet will become very useful after you have tried a few examples.
An important variation of subsetting data is removing rows of data that have missing values. We will also deal with some simple strategies for replacing missing values by moving values across rows (using na.locf()) or moving values across columns (called coalescing).[3]
There is actually an entire science devoted to imputing values for missing data. A good resource on this topic is https://CRAN.R-project.org/view=MissingData.
In this section, we will show how to quickly select only rows that have no missing data or values. This is only an example; we generally suggest using the methodologies in chapters 4 and 8 for treating missing values in real-world applications.
As our preceding example does not have missing values, we will move to another example: the msleep dataset of sleep times of animals with different characteristics. In this dataset, several rows have missing values. An additional goal of this example is to familiarize you with a number of common practice datasets. These are the datasets that you should break out to try a new data wrangling method.
First, as always, let’s look at the data:
library("ggplot2")

data(msleep)
str(msleep)
## Classes 'tbl_df', 'tbl' and 'data.frame': 83 obs. of 11 variables:
##  $ name        : chr "Cheetah" "Owl monkey" "Mountain beaver" "Greater short-tailed shrew" ...
##  $ genus       : chr "Acinonyx" "Aotus" "Aplodontia" "Blarina" ...
##  $ vore        : chr "carni" "omni" "herbi" "omni" ...
##  $ order       : chr "Carnivora" "Primates" "Rodentia" "Soricomorpha" ...
##  $ conservation: chr "lc" NA "nt" "lc" ...
##  $ sleep_total : num 12.1 17 14.4 14.9 4 14.4 8.7 7 10.1 3 ...
##  $ sleep_rem   : num NA 1.8 2.4 2.3 0.7 2.2 1.4 NA 2.9 NA ...
##  $ sleep_cycle : num NA NA NA 0.133 0.667 ...
##  $ awake       : num 11.9 7 9.6 9.1 20 9.6 15.3 17 13.9 21 ...
##  $ brainwt     : num NA 0.0155 NA 0.00029 0.423 NA NA NA 0.07 0.0982 ...
##  $ bodywt      : num 50 0.48 1.35 0.019 600 ...

summary(msleep)
##      name              genus               vore
##  Length:83          Length:83          Length:83
##  Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character
##
##     order           conservation        sleep_total      sleep_rem
##  Length:83          Length:83          Min.   : 1.90   Min.   :0.100
##  Class :character   Class :character   1st Qu.: 7.85   1st Qu.:0.900
##  Mode  :character   Mode  :character   Median :10.10   Median :1.500
##                                        Mean   :10.43   Mean   :1.875
##                                        3rd Qu.:13.75   3rd Qu.:2.400
##                                        Max.   :19.90   Max.   :6.600
##                                                        NA's   :22
##   sleep_cycle          awake          brainwt            bodywt
##  Min.   :0.1167   Min.   : 4.10   Min.   :0.00014   Min.   :   0.005
##  1st Qu.:0.1833   1st Qu.:10.25   1st Qu.:0.00290   1st Qu.:   0.174
##  Median :0.3333   Median :13.90   Median :0.01240   Median :   1.670
##  Mean   :0.4396   Mean   :13.57   Mean   :0.28158   Mean   : 166.136
##  3rd Qu.:0.5792   3rd Qu.:16.15   3rd Qu.:0.12550   3rd Qu.:  41.750
##  Max.   :1.5000   Max.   :22.10   Max.   :5.71200   Max.   :6654.000
##  NA's   :51                       NA's   :27
We have been asked to build an extract of the msleep data that has no missing values. To accomplish this, we will remove all rows that have missing values. The cartoon of the transform is shown in figure 5.4.
clean_base_1 <- msleep[complete.cases(msleep), , drop = FALSE]
summary(clean_base_1)
##      name              genus               vore
##  Length:20          Length:20          Length:20
##  Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character
##
##     order           conservation        sleep_total      sleep_rem
##  Length:20          Length:20          Min.   : 2.900   Min.   :0.600
##  Class :character   Class :character   1st Qu.: 8.925   1st Qu.:1.300
##  Mode  :character   Mode  :character   Median :11.300   Median :2.350
##                                        Mean   :11.225   Mean   :2.275
##                                        3rd Qu.:13.925   3rd Qu.:3.125
##                                        Max.   :19.700   Max.   :4.900
##   sleep_cycle          awake          brainwt            bodywt
##  Min.   :0.1167   Min.   : 4.30   Min.   :0.00014   Min.   :  0.0050
##  1st Qu.:0.1792   1st Qu.:10.07   1st Qu.:0.00115   1st Qu.:  0.0945
##  Median :0.2500   Median :12.70   Median :0.00590   Median :  0.7490
##  Mean   :0.3458   Mean   :12.78   Mean   :0.07882   Mean   : 72.1177
##  3rd Qu.:0.4167   3rd Qu.:15.07   3rd Qu.:0.03670   3rd Qu.:  6.1250
##  Max.   :1.0000   Max.   :21.10   Max.   :0.65500   Max.   :600.0000

nrow(clean_base_1)
## [1] 20

clean_base_2 <- na.omit(msleep)
nrow(clean_base_2)
## [1] 20
The complete.cases() solution also works with data.table:
library("data.table")

msleep_data.table <- as.data.table(msleep)
clean_data.table <- msleep_data.table[complete.cases(msleep_data.table), ]
nrow(clean_data.table)
## [1] 20
dplyr::filter can also be used with complete.cases().
With magrittr pipe notation, a . is taken to mean the item being piped. So we can conveniently use . to refer to our data multiple times, such as telling dplyr::filter to use the data both as the object to be filtered and as the object to pass to complete.cases().
library("dplyr")

clean_dplyr <- msleep %>%
  filter(., complete.cases(.))
nrow(clean_dplyr)
## [1] 20
In this section, we want to sort or control what order our data rows are in. Perhaps the data came to us unsorted, or sorted for a purpose other than ours.
We are asked to build a running or cumulative sum of sales by time, but the data came to us out of order:
purchases <- wrapr::build_frame(
   "day", "hour", "n_purchase" |
   1    , 9     , 5            |
   2    , 9     , 3            |
   2    , 11    , 5            |
   1    , 13    , 1            |
   2    , 13    , 3            |
   1    , 14    , 1            )
Reorder the rows by day and then hour and compute a running sum. The abstract diagram is shown in figure 5.5.
order_index <- with(purchases, order(day, hour))
purchases_ordered <- purchases[order_index, , drop = FALSE]
purchases_ordered$running_total <- cumsum(purchases_ordered$n_purchase)
purchases_ordered
##   day hour n_purchase running_total
## 1   1    9          5             5
## 4   1   13          1             6
## 6   1   14          1             7
## 2   2    9          3            10
## 3   2   11          5            15
## 5   2   13          3            18
library("data.table")

DT_purchases <- as.data.table(purchases)
order_cols <- c("day", "hour")
setorderv(DT_purchases, order_cols)
DT_purchases[ , running_total := cumsum(n_purchase)]
# print(DT_purchases)
Operations that alter data in place (such as :=) annotate the result to suppress printing. This is important, as often you are working with large structures and do not want intermediate data to print. [] is a no-operation that as a side effect restores printing.
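As a small illustration of this printing behavior (a sketch on a toy table, not part of the running example):

```r
library("data.table")

DT <- data.table(x = 1:3)
DT[, y := x * 2]        # := assigns in place; result is returned invisibly, nothing prints
DT[, y := x * 2][]      # trailing [] is a no-op that restores printing
##    x y
## 1: 1 2
## 2: 2 4
## 3: 3 6
```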
setorderv() reorders data in place and takes a list of ordering column names to specify the order. This is much more convenient than the base R solution that takes multiple ordering columns as multiple arguments. wrapr::orderv() tries to bridge this gap by allowing the user to specify ordering constraints with a list of columns (column values, not column names).
dplyr uses the word arrange to order data, and mutate to add a new column:
library("dplyr")

res <- purchases %>%
  arrange(., day, hour) %>%
  mutate(., running_total = cumsum(n_purchase))
# print(res)
For our advanced example, suppose we want the cumulative sum of sales to be per-day; that is, to reset the sum at the start of each day.
The easiest base R solution is a split-and-recombine strategy:
order_index <- with(purchases, order(day, hour))
purchases_ordered <- purchases[order_index, , drop = FALSE]
data_list <- split(purchases_ordered, purchases_ordered$day)
data_list <- lapply(
  data_list,
  function(di) {
    di$running_total <- cumsum(di$n_purchase)
    di
  })
purchases_ordered <- do.call(base::rbind, data_list)
rownames(purchases_ordered) <- NULL
purchases_ordered
##   day hour n_purchase running_total
## 1   1    9          5             5
## 2   1   13          1             6
## 3   1   14          1             7
## 4   2    9          3             3
## 5   2   11          5             8
## 6   2   13          3            11
The data.table solution is particularly concise. We order the data and then tell data.table to calculate the new running sum per-group with the by argument. The idea that the grouping is a property of the calculation, and not a property of the data, is similar to SQL and helps minimize errors.
In data.table, := means “assign in place”—it is used to alter or create a column in the incoming data.table. Conversely, = is used to mean “create in new data.table,” and we wrap these sorts of assignments in a .() notation so that column names are not confused with arguments to data.table.
library("data.table")

# new copy for result solution
DT_purchases <- as.data.table(purchases)[order(day, hour),
                   .(hour = hour,
                     n_purchase = n_purchase,
                     running_total = cumsum(n_purchase)),
                   by = "day"]
# print(DT_purchases)

# in-place solution
DT_purchases <- as.data.table(purchases)
order_cols <- c("day", "hour")
setorderv(DT_purchases, order_cols)
DT_purchases[ , running_total := cumsum(n_purchase), by = day]
# print(DT_purchases)

# don't-reorder-the-actual-data variation!
DT_purchases <- as.data.table(purchases)
DT_purchases[order(day, hour),
             `:=`(hour = hour,
                  n_purchase = n_purchase,
                  running_total = cumsum(n_purchase)),
             by = "day"]
# print(DT_purchases)
Sequencing data.table operations is achieved either by writing in-place operations one after the other (as we did in these examples) or by starting a new open-[ right after a close-] for operations that create new copies (this is called method chaining and is equivalent to using a pipe operator).
The dplyr solution works because the command mutate() (which we will discuss in the next section) works per-group if the data is grouped. We can make the data grouped by using the group_by() command:
library("dplyr")

res <- purchases %>%
  arrange(., day, hour) %>%
  group_by(., day) %>%
  mutate(., running_total = cumsum(n_purchase)) %>%
  ungroup(.)
# print(res)
In dplyr it is important to always ungroup your data when you are done performing per-group operations. This is because the presence of a dplyr grouping annotation can cause many downstream steps to calculate unexpected and incorrect results. We advise doing this even after a summarize() step, as summarize() removes one key from the grouping, leaving it unclear to a code reader if the data remains grouped or not.
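To see why lingering grouping is dangerous, here is a small sketch: the same mutate() call returns different answers depending on whether a grouping annotation is still attached to the data.

```r
library("dplyr")

grouped <- iris %>% group_by(., Species)

# n() counts rows; under a grouping it silently counts per group
with_groups <- grouped %>% mutate(., n_rows = n())
no_groups <- grouped %>% ungroup(.) %>% mutate(., n_rows = n())

unique(with_groups$n_rows)
## [1] 50
unique(no_groups$n_rows)
## [1] 150
```

A downstream user who did not know the data was still grouped would get 50 where they expected 150, with no warning.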
This section covers adding and renaming columns.
This section covers adding new variables (columns) to a data frame, or applying transformations to existing columns (see figure 5.6).
For our example data, we will use air quality measurements with missing data and non-standard date formatting, for the year 1973:
library("datasets")
library("ggplot2")

summary(airquality)
##      Ozone           Solar.R           Wind             Temp
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00
##  NA's   :37       NA's   :7
##      Month            Day
##  Min.   :5.000   Min.   : 1.0
##  1st Qu.:6.000   1st Qu.: 8.0
##  Median :7.000   Median :16.0
##  Mean   :6.993   Mean   :15.8
##  3rd Qu.:8.000   3rd Qu.:23.0
##  Max.   :9.000   Max.   :31.0
We are asked to convert this non-standard date representation into a new, more useful date column for queries and plotting.
library("lubridate")
library("ggplot2")

# create a function to make the date string
datestr <- function(day, month, year) {
  paste(day, month, year, sep = "-")
}
In base R, we create new columns by assigning to them:
airquality_with_date <- airquality
airquality_with_date$date <- with(airquality_with_date,
                                  dmy(datestr(Day, Month, 1973)))
airquality_with_date <- airquality_with_date[,
                                             c("Ozone", "date"),
                                             drop = FALSE]
head(airquality_with_date)
##   Ozone       date
## 1    41 1973-05-01
## 2    36 1973-05-02
## 3    12 1973-05-03
## 4    18 1973-05-04
## 5    NA 1973-05-05
## 6    28 1973-05-06

ggplot(airquality_with_date, aes(x = date, y = Ozone)) +
  geom_point() +
  geom_line() +
  xlab("Date") +
  ggtitle("New York ozone readings, May 1 - Sept 30, 1973")
The preceding code produces figure 5.7.
Base R has had transform-style (or pipeable) versions of these basic operators for quite some time (just no pipe!). Let’s work the example again in that style:
library("wrapr")

airquality %.>%
  transform(., date = dmy(datestr(Day, Month, 1973))) %.>%
  subset(., !is.na(Ozone), select = c("Ozone", "date")) %.>%
  head(.)
##   Ozone       date
## 1    41 1973-05-01
## 2    36 1973-05-02
## 3    12 1973-05-03
## 4    18 1973-05-04
## 6    28 1973-05-06
## 7    23 1973-05-07
data.table uses := to show column alterations or creations that are supposed to happen “in place” (the data.table at hand gets altered, instead of a new one being created).
library("data.table")

DT_airquality <- as.data.table(airquality)[
  , date := dmy(datestr(Day, Month, 1973)) ][
  , c("Ozone", "date")]
head(DT_airquality)
##    Ozone       date
## 1:    41 1973-05-01
## 2:    36 1973-05-02
## 3:    12 1973-05-03
## 4:    18 1973-05-04
## 5:    NA 1973-05-05
## 6:    28 1973-05-06
Notice how the open-[ steps work a lot like pipes, connecting one data.table stage to another. This is one of the reasons data.table places so many operations inside the []: in R the [] naturally chains operations left to right.
dplyr users will remember that new columns are produced with the mutate() command:
library("dplyr")

airquality_with_date2 <- airquality %>%
  mutate(., date = dmy(datestr(Day, Month, 1973))) %>%
  select(., Ozone, date)
head(airquality_with_date2)
##   Ozone       date
## 1    41 1973-05-01
## 2    36 1973-05-02
## 3    12 1973-05-03
## 4    18 1973-05-04
## 5    NA 1973-05-05
## 6    28 1973-05-06
Notice the original Ozone graph had holes in the data, due to missing values. We will try to fix this by propagating the last known Ozone reading forward to the dates with missing values. This “the task was finished ... until we looked at the results” situation is typical of data science. So always look, and look for problems.
Filling in missing values from earlier in a column is illustrated in figure 5.8.
The zoo package supplies a function called na.locf(), which is designed to solve our issue. We will show how to apply this function now.
library("zoo")

airquality_corrected <- airquality_with_date
airquality_corrected$OzoneCorrected <-
  na.locf(airquality_corrected$Ozone, na.rm = FALSE)
summary(airquality_corrected)
##      Ozone             date            OzoneCorrected
##  Min.   :  1.00   Min.   :1973-05-01   Min.   :  1.00
##  1st Qu.: 18.00   1st Qu.:1973-06-08   1st Qu.: 16.00
##  Median : 31.50   Median :1973-07-16   Median : 30.00
##  Mean   : 42.13   Mean   :1973-07-16   Mean   : 39.78
##  3rd Qu.: 63.25   3rd Qu.:1973-08-23   3rd Qu.: 52.00
##  Max.   :168.00   Max.   :1973-09-30   Max.   :168.00
##  NA's   :37

ggplot(airquality_corrected, aes(x = date, y = Ozone)) +
  geom_point(aes(y = Ozone)) +
  geom_line(aes(y = OzoneCorrected)) +
  ggtitle("New York ozone readings, May 1 - Sept 30, 1973",
          subtitle = "(corrected)") +
  xlab("Date")
This produces figure 5.9.
Always use na.rm = FALSE with na.locf(); otherwise, it may delete initial NA elements from your data.
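For example, on a small vector that starts with a missing value:

```r
library("zoo")

x <- c(NA, 1, NA, 2)
na.locf(x)                  # default na.rm = TRUE deletes the leading NA
## [1] 1 1 2
na.locf(x, na.rm = FALSE)   # keeps the leading NA, preserving length
## [1] NA  1  1  2
```

The shortened result of the first call would not line up with the original rows, which is exactly the surprise na.rm = FALSE avoids.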
library("data.table")
library("zoo")

DT_airquality[, OzoneCorrected := na.locf(Ozone, na.rm = FALSE)]
summary(DT_airquality)
##      Ozone             date            OzoneCorrected
##  Min.   :  1.00   Min.   :1973-05-01   Min.   :  1.00
##  1st Qu.: 18.00   1st Qu.:1973-06-08   1st Qu.: 16.00
##  Median : 31.50   Median :1973-07-16   Median : 30.00
##  Mean   : 42.13   Mean   :1973-07-16   Mean   : 39.78
##  3rd Qu.: 63.25   3rd Qu.:1973-08-23   3rd Qu.: 52.00
##  Max.   :168.00   Max.   :1973-09-30   Max.   :168.00
##  NA's   :37
Notice that data.table performed the correction “in place,” in DT_airquality instead of producing a new data.frame.
library("dplyr")
library("zoo")

airquality_with_date %>%
  mutate(., OzoneCorrected = na.locf(Ozone, na.rm = FALSE)) %>%
  summary(.)
##      Ozone             date            OzoneCorrected
##  Min.   :  1.00   Min.   :1973-05-01   Min.   :  1.00
##  1st Qu.: 18.00   1st Qu.:1973-06-08   1st Qu.: 16.00
##  Median : 31.50   Median :1973-07-16   Median : 30.00
##  Mean   : 42.13   Mean   :1973-07-16   Mean   : 39.78
##  3rd Qu.: 63.25   3rd Qu.:1973-08-23   3rd Qu.: 52.00
##  Max.   :168.00   Max.   :1973-09-30   Max.   :168.00
##  NA's   :37
A number of additional simple operations commonly used in working with data are available—in particular, renaming columns by altering the column names directly, and also removing columns by assigning NULL. We will show these briefly:
d <- data.frame(x = 1:2, y = 3:4)
print(d)
#>   x y
#> 1 1 3
#> 2 2 4

colnames(d) <- c("BIGX", "BIGY")
print(d)
#>   BIGX BIGY
#> 1    1    3
#> 2    2    4

d$BIGX <- NULL
print(d)
#>   BIGY
#> 1    3
#> 2    4
This section covers transforms that combine multiple rows or multiple columns.
Here we address the situation where there are multiple observations or measurements of a single subject, in this case species of Iris, that we wish to aggregate into a single observation.
We have been asked to make a report summarizing iris petals by species.
Summarize measurements by category, as shown in figure 5.10.
Again, we use measurements of petal length and width, by iris species, from the iris dataset:
library("datasets")
library("ggplot2")

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
iris_summary <- aggregate(
  cbind(Petal.Length, Petal.Width) ~ Species,
  data = iris,
  FUN = mean)
print(iris_summary)
#      Species Petal.Length Petal.Width
# 1     setosa        1.462       0.246
# 2 versicolor        4.260       1.326
# 3  virginica        5.552       2.026

library("ggplot2")
ggplot(mapping = aes(x = Petal.Length, y = Petal.Width,
                     shape = Species, color = Species)) +
  geom_point(data = iris,          # raw data
             alpha = 0.5) +
  geom_point(data = iris_summary,  # per-group summaries
             size = 5) +
  ggtitle("Average petal dimensions by iris species (with raw data for reference)")
This produces figure 5.11, a new iris plot with group mean annotations.
library("data.table")

iris_data.table <- as.data.table(iris)
iris_data.table <- iris_data.table[,
                     .(Petal.Length = mean(Petal.Length),
                       Petal.Width = mean(Petal.Width)),
                     by = .(Species)]
# print(iris_data.table)
library("dplyr")

iris_summary <- iris %>%
  group_by(., Species) %>%
  summarize(.,
            Petal.Length = mean(Petal.Length),
            Petal.Width = mean(Petal.Width)) %>%
  ungroup(.)
# print(iris_summary)
Both data.table and dplyr have grouped versions of the preceding operations (similar to what relational databases call window functions). These let each row carry its per-group summary, without having to build a summary table and join it back in (the usual way to compute such quantities). For example:
iris_copy <- iris
iris_copy$mean_Petal.Length <- ave(iris$Petal.Length, iris$Species, FUN = mean)
iris_copy$mean_Petal.Width <- ave(iris$Petal.Width, iris$Species, FUN = mean)
# head(iris_copy)
# tail(iris_copy)
In data.table, the task looks like the following:
library("data.table")

iris_data.table <- as.data.table(iris)
iris_data.table[,
                `:=`(mean_Petal.Length = mean(Petal.Length),
                     mean_Petal.Width = mean(Petal.Width)),
                by = "Species"]
# print(iris_data.table)
Please run the preceding code and print iris_data.table to see that the computed means are per-group.
dplyr has similar functionality:
library("dplyr")

iris_dplyr <- iris %>%
  group_by(., Species) %>%
  mutate(.,
         mean_Petal.Length = mean(Petal.Length),
         mean_Petal.Width = mean(Petal.Width)) %>%
  ungroup(.)
# head(iris_dplyr)
Again, it is critical to ungroup() when you have finished applying per-group transforms. Also, be aware that dplyr grouped operations (in particular, row selection through filter()) can be much slower than ungrouped operations, so you want to make your group_by()/ungroup() intervals as short as possible. And dplyr grouped operations are usually much slower than data.table grouped operations in general.
This section covers operations between multiple tables. This includes the tasks of splitting tables, concatenating tables, and joining tables.
Here we discuss combining two data frames, with the same number of rows or columns (and same order!). A more involved, but more general way to combine data is demonstrated in section 5.4.2.
We have been asked to draw information about products from a sales database and produce a report. Typically, different facts (in this case, price and units sold) are stored in different tables, so to produce our report, we will have to combine data from more than one table.
For example, suppose our example data was the following:
productTable <- wrapr::build_frame(
   "productID", "price" |
   "p1"       , 9.99    |
   "p2"       , 16.29   |
   "p3"       , 19.99   |
   "p4"       , 5.49    |
   "p5"       , 24.49   )

salesTable <- wrapr::build_frame(
   "productID", "sold_store", "sold_online" |
   "p1"       , 6           , 64            |
   "p2"       , 31          , 1             |
   "p3"       , 30          , 23            |
   "p4"       , 31          , 67            |
   "p5"       , 43          , 51            )

productTable2 <- wrapr::build_frame(
   "productID", "price" |
   "n1"       , 25.49   |
   "n2"       , 33.99   |
   "n3"       , 17.99   )

productTable$productID <- factor(productTable$productID)
productTable2$productID <- factor(productTable2$productID)
When two tables have the exact same column structure, we can concatenate them to get a larger table, as in figure 5.12.
rbind
rbind_base = rbind(productTable, productTable2)
Note that rbind creates a new factor variable when merging incompatible factor variables:
str(rbind_base)
## 'data.frame': 8 obs. of  2 variables:
##  $ productID: Factor w/ 8 levels "p1","p2","p3",..: 1 2 3 4 5 6 7 8
##  $ price    : num  9.99 16.29 19.99 5.49 24.49 ...
library("data.table")

rbindlist(list(productTable,
               productTable2))
##    productID price
## 1:        p1  9.99
## 2:        p2 16.29
## 3:        p3 19.99
## 4:        p4  5.49
## 5:        p5 24.49
## 6:        n1 25.49
## 7:        n2 33.99
## 8:        n3 17.99
data.table also correctly merges factor types.
library("dplyr")

bind_rows(list(productTable,
               productTable2))
## Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
##   productID price
## 1        p1  9.99
## 2        p2 16.29
## 3        p3 19.99
## 4        p4  5.49
## 5        p5 24.49
## 6        n1 25.49
## 7        n2 33.99
## 8        n3 17.99
Notice that bind_rows coerces incompatible factor variables to character.
The inverse of row binding is splitting. Many difficult calculations can be made easy by splitting a data frame into a family of data frames, working on each data frame, and finally rebinding them together. The best implementation is the one found in data.table, which has some priority (being one of the first); data.table only simulates splitting and recombining the data (so it tends to be very fast).
# add an extra column telling us which table
# each row comes from
productTable_marked <- productTable
productTable_marked$table <- "productTable"
productTable2_marked <- productTable2
productTable2_marked$table <- "productTable2"

# combine the tables
rbind_base <- rbind(productTable_marked,
                    productTable2_marked)
rbind_base
##   productID price         table
## 1        p1  9.99  productTable
## 2        p2 16.29  productTable
## 3        p3 19.99  productTable
## 4        p4  5.49  productTable
## 5        p5 24.49  productTable
## 6        n1 25.49 productTable2
## 7        n2 33.99 productTable2
## 8        n3 17.99 productTable2

# split them apart
tables <- split(rbind_base, rbind_base$table)
tables
## $productTable
##   productID price        table
## 1        p1  9.99 productTable
## 2        p2 16.29 productTable
## 3        p3 19.99 productTable
## 4        p4  5.49 productTable
## 5        p5 24.49 productTable
##
## $productTable2
##   productID price         table
## 6        n1 25.49 productTable2
## 7        n2 33.99 productTable2
## 8        n3 17.99 productTable2
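The split/work/rebind pattern can also be carried out entirely in base R with split(), lapply(), and do.call(rbind, ...). The following is a small self-contained sketch; it uses demo data mirroring the example above, under a fresh name (rb_demo) so the chapter's own rbind_base is left untouched:

```r
# Demo data mirroring rbind_base (fresh name, so the chapter's own
# rbind_base is not clobbered).
rb_demo <- data.frame(
  productID = c("p1", "p2", "p3", "n1", "n2"),
  price = c(9.99, 16.29, 19.99, 25.49, 33.99),
  table = c("productTable", "productTable", "productTable",
            "productTable2", "productTable2"),
  stringsAsFactors = FALSE)

pieces <- split(rb_demo, rb_demo$table)      # one data frame per group
pieces <- lapply(pieces, function(d) {
  d$max_price <- max(d$price)                # work on each piece
  d
})
recombined <- do.call(rbind, pieces)         # rebind into a single table
```

This base R version really does build the intermediate data frames, so it is more memory-hungry than the data.table approach shown next, but it makes the three steps explicit.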
data.table combines the split, apply, and recombine steps into a single, very efficient operation. We will continue our example with the rbind_base object to show the effect. data.table is willing to call a user function or execute a user expression for each data group and supplies special variables to work per-group:
For instance, to compute a max price per group, we can do the following:
library("data.table")

# convert to data.table
dt <- as.data.table(rbind_base)

# arbitrary user defined function
f <- function(.BY, .SD) {
  max(.SD$price)
}

# apply the function to each group
# and collect results
dt[ , max_price := f(.BY, .SD), by = table]
print(dt)
##    productID price         table max_price
## 1:        p1  9.99  productTable     24.49
## 2:        p2 16.29  productTable     24.49
## 3:        p3 19.99  productTable     24.49
## 4:        p4  5.49  productTable     24.49
## 5:        p5 24.49  productTable     24.49
## 6:        n1 25.49 productTable2     33.99
## 7:        n2 33.99 productTable2     33.99
## 8:        n3 17.99 productTable2     33.99
Note that the preceding is a powerful general form not needed for such a simple task. Simple per-group aggregation of values is usually achieved by naming the columns:
library("data.table")

dt <- as.data.table(rbind_base)
grouping_column <- "table"
dt[ , max_price := max(price), by = eval(grouping_column)]
print(dt)
##    productID price         table max_price
## 1:        p1  9.99  productTable     24.49
## 2:        p2 16.29  productTable     24.49
## 3:        p3 19.99  productTable     24.49
## 4:        p4  5.49  productTable     24.49
## 5:        p5 24.49  productTable     24.49
## 6:        n1 25.49 productTable2     33.99
## 7:        n2 33.99 productTable2     33.99
## 8:        n3 17.99 productTable2     33.99
In this example, we took the liberty of showing how we would group by a column chosen by a variable.
dplyr doesn’t have its own split implementation. dplyr tries to simulate working on subtables with its group_by() notation. For example, to compute the maximum price per group in dplyr, we would write code like the following:
rbind_base %>%
  group_by(., table) %>%
  mutate(., max_price = max(price)) %>%
  ungroup(.)
## # A tibble: 8 x 4
##   productID price table         max_price
##   <fct>     <dbl> <chr>             <dbl>
## 1 p1         9.99 productTable       24.5
## 2 p2        16.3  productTable       24.5
## 3 p3        20.0  productTable       24.5
## 4 p4         5.49 productTable       24.5
## 5 p5        24.5  productTable       24.5
## 6 n1        25.5  productTable2      34.0
## 7 n2        34.0  productTable2      34.0
## 8 n3        18.0  productTable2      34.0
This is not going to be as powerful as calling an arbitrary function per data group.
Append a data frame as columns to another data frame. The data frames must have the same number of rows and same row order (with respect to what we consider to be row-keys). This is illustrated in figure 5.13.
Create a table of product information (price and units sold), from productTable and salesTable. This assumes that the products are sorted in the same order in both tables. If they are not, then sort them, or use a join command to merge the tables together (please see section 5.4.2).
cbind
cbind(productTable, salesTable[, -1])
##   productID price sold_store sold_online
## 1        p1  9.99          6          64
## 2        p2 16.29         31           1
## 3        p3 19.99         30          23
## 4        p4  5.49         31          67
## 5        p5 24.49         43          51
For binding columns, data.table methods require the data to already be of type data.table.
library("data.table")

cbind(as.data.table(productTable),
      as.data.table(salesTable[, -1]))
##    productID price sold_store sold_online
## 1:        p1  9.99          6          64
## 2:        p2 16.29         31           1
## 3:        p3 19.99         30          23
## 4:        p4  5.49         31          67
## 5:        p5 24.49         43          51
library("dplyr")

# list of data frames calling convention
dplyr::bind_cols(list(productTable,
                      salesTable[, -1]))
##   productID price sold_store sold_online
## 1        p1  9.99          6          64
## 2        p2 16.29         31           1
## 3        p3 19.99         30          23
## 4        p4  5.49         31          67
## 5        p5 24.49         43          51
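Because column binding trusts row order completely, a defensive sketch can confirm the key columns actually agree before binding. safe_cbind() is a hypothetical helper (not part of any package), shown here with small demo tables under fresh names so the chapter's own data is untouched:

```r
# Hypothetical helper: refuse to cbind() unless both tables have
# the same keys in the same order.
safe_cbind <- function(a, b, key) {
  stopifnot(nrow(a) == nrow(b),
            all(a[[key]] == b[[key]]))
  cbind(a, b[, setdiff(colnames(b), key), drop = FALSE])
}

# small demo tables mirroring the example data
prod_demo  <- data.frame(productID = c("p1", "p2"),
                         price = c(9.99, 16.29),
                         stringsAsFactors = FALSE)
sales_demo <- data.frame(productID = c("p1", "p2"),
                         sold_store = c(6, 31),
                         stringsAsFactors = FALSE)

combined <- safe_cbind(prod_demo, sales_demo, key = "productID")
```

If the keys do not line up, the helper stops with an error instead of silently pairing the wrong rows, which is exactly the failure mode a raw cbind() cannot catch.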
Join is the relational name for the process of combining two tables to create a third. The join results in a table that possibly has a new row for every pair of rows from the original two tables (plus possibly rows from each table that did not have matches from the other table). Rows are matched on key-values, matching from one table to another. The simplest case is when each table has a set of columns that uniquely determine each row (a unique key), and this is the case we will discuss here.
Our example data is information about products in a sales database. Various facts (in this case, price and units sold) are stored in different tables. Missing values are allowed. We are tasked with combining these tables to produce a report.
First let’s set up some example data:
productTable <- wrapr::build_frame(
   "productID", "price" |
   "p1"       , 9.99    |
   "p3"       , 19.99   |
   "p4"       , 5.49    |
   "p5"       , 24.49   )

salesTable <- wrapr::build_frame(
   "productID", "unitsSold" |
   "p1"       , 10          |
   "p2"       , 43          |
   "p3"       , 55          |
   "p4"       , 8           )
The most important join for the data scientist is likely the left join. This join keeps every row from the left table and adds columns coming from matching rows in the right table. When there are no matching rows, NA values are substituted in. Usually, you design the right table (the second argument to your join command) to have unique keys; otherwise, the number of rows may grow (there is no need for the left table to have unique keys).
The operation is typically used to adjoin data from the second (or right) table into a copy of the first or left table, as shown in figure 5.14.
merge with argument all.x = TRUE
merge(productTable, salesTable, by = "productID", all.x = TRUE)
##   productID price unitsSold
## 1        p1  9.99        10
## 2        p3 19.99        55
## 3        p4  5.49         8
## 4        p5 24.49        NA
library("data.table")

productTable_data.table <- as.data.table(productTable)
salesTable_data.table <- as.data.table(salesTable)

# index notation for join
# idea is rows are produced for each row inside the []
salesTable_data.table[productTable_data.table, on = "productID"]
##    productID unitsSold price
## 1:        p1        10  9.99
## 2:        p3        55 19.99
## 3:        p4         8  5.49
## 4:        p5        NA 24.49

# data.table also overrides merge()
merge(productTable, salesTable, by = "productID", all.x = TRUE)
##   productID price unitsSold
## 1        p1  9.99        10
## 2        p3 19.99        55
## 3        p4  5.49         8
## 4        p5 24.49        NA
The data.table index notation reminds us that there is another very good Base R way to use one table to add a single column to another: vectorized lookup through the match() and [] methods.
joined_table <- productTable
joined_table$unitsSold <- salesTable$unitsSold[
  match(joined_table$productID, salesTable$productID)]
print(joined_table)
##   productID price unitsSold
## 1        p1  9.99        10
## 2        p3 19.99        55
## 3        p4  5.49         8
## 4        p5 24.49        NA
match() found the matching indices, and [] used the indices to retrieve the data. Please see help(match) for more details.
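A minimal, self-contained illustration of the match()/[] lookup idiom may make the two steps clearer. match() returns the position of each key in a lookup vector (NA when absent), and [] pulls the corresponding values through (the vectors here are demo data mirroring the example):

```r
keys      <- c("p1", "p3", "p4", "p5")   # productIDs we want data for
lookup    <- c("p1", "p2", "p3", "p4")   # productIDs in the sales table
unitsSold <- c(10, 43, 55, 8)

idx <- match(keys, lookup)               # positions: 1 3 4 NA
got <- unitsSold[idx]                    # values:   10 55 8 NA
```

Note that indexing with an NA position yields an NA value, which is exactly the behavior we want for unmatched keys.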
library("dplyr")

left_join(productTable, salesTable, by = "productID")
##   productID price unitsSold
## 1        p1  9.99        10
## 2        p3 19.99        55
## 3        p4  5.49         8
## 4        p5 24.49        NA
There is also a join called right join that is just the left join with the arguments reversed. As the right join is so similar to the left, we will forgo any right join examples.
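To see the left/right symmetry concretely, here is a quick sketch: a base R right join (all.y = TRUE) returns the same rows as a left join with the arguments swapped (only the column order differs). The demo tables mirror the example data under fresh names:

```r
# demo tables mirroring the example data (fresh names)
prod_demo  <- data.frame(productID = c("p1", "p3"),
                         price = c(9.99, 19.99),
                         stringsAsFactors = FALSE)
sales_demo <- data.frame(productID = c("p1", "p2"),
                         unitsSold = c(10, 43),
                         stringsAsFactors = FALSE)

# right join: keep every row of the second (right) table
a <- merge(prod_demo, sales_demo, by = "productID", all.y = TRUE)

# the same rows, as a left join with the arguments swapped
b <- merge(sales_demo, prod_demo, by = "productID", all.x = TRUE)
```

Both results keep rows for p1 and p2 (the keys of sales_demo), with NA for p2's missing price; dplyr::right_join() behaves analogously.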
In an inner join, you merge two tables into a single table, keeping only the rows where the key exists in both tables. This produces an intersection of the two tables, as shown in figure 5.15.
merge(productTable, salesTable, by = "productID")
##   productID price unitsSold
## 1        p1  9.99        10
## 2        p3 19.99        55
## 3        p4  5.49         8
library("data.table")

productTable_data.table <- as.data.table(productTable)
salesTable_data.table <- as.data.table(salesTable)

merge(productTable, salesTable, by = "productID")
##   productID price unitsSold
## 1        p1  9.99        10
## 2        p3 19.99        55
## 3        p4  5.49         8
inner_join
library("dplyr")

inner_join(productTable, salesTable, by = "productID")
##   productID price unitsSold
## 1        p1  9.99        10
## 2        p3 19.99        55
## 3        p4  5.49         8
In a full join, you merge two tables into a single table, keeping rows for all key values. Notice that the two tables have equal importance here. We show the result in figure 5.16.
# note that merge orders the result by key column by default
# use sort = FALSE to skip the sorting
merge(productTable, salesTable, by = "productID", all = TRUE)
##   productID price unitsSold
## 1        p1  9.99        10
## 2        p2    NA        43
## 3        p3 19.99        55
## 4        p4  5.49         8
## 5        p5 24.49        NA
library("data.table")

productTable_data.table <- as.data.table(productTable)
salesTable_data.table <- as.data.table(salesTable)

merge(productTable_data.table, salesTable_data.table,
      by = "productID", all = TRUE)
##    productID price unitsSold
## 1:        p1  9.99        10
## 2:        p2    NA        43
## 3:        p3 19.99        55
## 4:        p4  5.49         8
## 5:        p5 24.49        NA
dplyr::full_join
library("dplyr")

full_join(productTable, salesTable, by = "productID")
##   productID price unitsSold
## 1        p1  9.99        10
## 2        p3 19.99        55
## 3        p4  5.49         8
## 4        p5 24.49        NA
## 5        p2    NA        43
The examples we have given up to now do not use row order. Some problems can be solved much more efficiently with methods that do use row order, such as data.table's powerful rolling join operation.
You are given historic stock trade and quote (bid/ask) data. You are asked to find what the bid and ask prices were at the time each trade was performed. This involves using row order to indicate time, and sharing information between rows.
In stock markets, the bid is the highest price somebody has declared they are willing to pay for a stock, and the ask is the lowest price that somebody has declared they are willing to sell a stock for. Bid and ask data are called quotes, and they usually are in an irregular time series (as new quotes can be formed at arbitrary times, not just at regular intervals), such as the following example:
library("data.table")

quotes <- data.table(
  bid = c(5, 5, 7, 8),
  ask = c(6, 6, 8, 10),
  bid_quantity = c(100, 100, 100, 100),
  ask_quantity = c(100, 100, 100, 100),
  when = as.POSIXct(strptime(
    c("2018-10-18 1:03:17",
      "2018-10-18 2:12:23",
      "2018-10-18 2:15:00",
      "2018-10-18 2:17:51"),
    "%Y-%m-%d %H:%M:%S")))
print(quotes)
##    bid ask bid_quantity ask_quantity                when
## 1:   5   6          100          100 2018-10-18 01:03:17
## 2:   5   6          100          100 2018-10-18 02:12:23
## 3:   7   8          100          100 2018-10-18 02:15:00
## 4:   8  10          100          100 2018-10-18 02:17:51
Another irregular time series is trades. These are after-the-fact reports about exchanges of quantities of stock at a given price at a given time.
trades <- data.table(
  trade_id = c(32525, 32526),
  price = c(5.5, 9),
  quantity = c(100, 200),
  when = as.POSIXct(strptime(
    c("2018-10-18 2:13:42",
      "2018-10-18 2:19:20"),
    "%Y-%m-%d %H:%M:%S")))
print(trades)
##    trade_id price quantity                when
## 1:    32525   5.5      100 2018-10-18 02:13:42
## 2:    32526   9.0      200 2018-10-18 02:19:20
The data.table rolling join is perfect for finding what was the most recent quote information for each trade. A rolling join is a type of join on an ordered column that gives us the most recent data available at the lookup time.
quotes[, quote_time := when]
trades[, trade_time := when]

quotes[trades, on = "when", roll = TRUE][
  , .(quote_time, bid, price, ask, trade_id, trade_time)]
##             quote_time bid price ask trade_id          trade_time
## 1: 2018-10-18 02:12:23   5   5.5   6    32525 2018-10-18 02:13:42
## 2: 2018-10-18 02:17:51   8   9.0  10    32526 2018-10-18 02:19:20
We read the preceding as “for each trade, look up the appropriate quote.” In the join, the when field comes from the trades, which is why we added a quote_time field so we could also see when the quote was established. The data.table rolling join is very fast, and also not easy to efficiently simulate in base R, SQL, or dplyr.
Rolling joins are unique to data.table. In R there are a number of tasks, such as matching most recent records, that are easily expressed as moving indexes across rows. However, moving indexes across rows tends to be inefficient in R, as it can’t be vectorized like column operations can. A rolling join is a direct way of solving such problems, and has an efficient implementation.
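For the simple roll = TRUE case above, base R's findInterval() can approximate the same lookup: it returns the index of the last quote at or before each trade time. This is only a sketch of the idea (a trade occurring before the first quote would return index 0 and need extra handling), not a general replacement for data.table's rolling join:

```r
# quote data mirroring the example (vectors, not a data.table)
quote_when <- as.POSIXct(c("2018-10-18 01:03:17",
                           "2018-10-18 02:12:23",
                           "2018-10-18 02:15:00",
                           "2018-10-18 02:17:51"))
bid <- c(5, 5, 7, 8)

trade_when <- as.POSIXct(c("2018-10-18 02:13:42",
                           "2018-10-18 02:19:20"))

# index of the last quote at or before each trade time
i <- findInterval(trade_when, quote_when)
bid_at_trade <- bid[i]      # bids 5 and 8, matching the rolling join
```

findInterval() requires the reference times to be sorted nondecreasing, which quote streams naturally are; data.table's rolling join additionally handles multi-column keys, per-group rolls, and roll windows, which this sketch does not.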
This section covers moving data between rows and columns. This is often called pivoting, a name from Pito Salas’s work that combines data summarization and shape transforms. Examples will be worked in three packages: data.table, cdata (which only reshapes data, and does not summarize data), and tidyr. Base R does have notations for these transforms (such as stack() and unstack()), but the package versions are significantly better tools.
We will show how to move data records where all measurements are in single rows to a new record set where data is in multiple rows. We call this moving from a wide form to a thin or tall form.
Let’s work with measurements of vehicle drivers/passengers injured or killed, by month. The data includes additional information about fuel price and whether seatbelts are required by law.
Relevant variables for this example:

- date — month of the measurement (as a date)
- DriversKilled — count of car drivers killed
- front — count of front-seat passengers killed or seriously injured
- rear — count of rear-seat passengers killed or seriously injured
- law — whether the seatbelt law was in effect that month
library("datasets")
library("xts")

# move the date index into a column
dates <- index(as.xts(time(Seatbelts)))
Seatbelts <- data.frame(Seatbelts)
Seatbelts$date <- dates

# restrict down to 1982 and 1983
Seatbelts <- Seatbelts[
  (Seatbelts$date >= as.yearmon("Jan 1982")) &
    (Seatbelts$date <= as.yearmon("Dec 1983")), , drop = FALSE]
Seatbelts$date <- as.Date(Seatbelts$date)

# mark if the seatbelt law was in effect
Seatbelts$law <- ifelse(Seatbelts$law == 1, "new law", "pre-law")

# limit down to the columns we want
Seatbelts <- Seatbelts[, c("date", "DriversKilled", "front", "rear", "law")]
head(Seatbelts)
##           date DriversKilled front rear     law
## 157 1982-01-01           115   595  238 pre-law
## 158 1982-02-01           104   673  285 pre-law
## 159 1982-03-01           131   660  324 pre-law
## 160 1982-04-01           108   676  346 pre-law
## 161 1982-05-01           103   755  410 pre-law
## 162 1982-06-01           115   815  411 pre-law
To get our data into a presentable format, we have performed transforms described in earlier sections of this chapter: selecting rows, selecting columns, adding new derived columns, and so on. The data now has one row per date (we think of the date as the row-key) and contains information such as how many people were killed in each of three seating positions (driver, front, rear) and if the new seatbelt law was in effect.
We want to see if the new seatbelt law saves lives. Notice that we are missing a key bit of information: a normalizing factor such as number of cars owned per date, driving population size by date, or total miles driven per date (risks make more sense as rates than as absolute counts). This is an example of true data science being an iterative process: we are going to do the best job we can with the data at hand, but in a real project, we would also go back to sources and partners to try to get the critical missing data (or at least an estimate or proxy for the missing data).
Let’s plot the data conditioned on the law:
# let's give an example of the kind of graph we have in mind,
# using just driver deaths
library("ggplot2")

ggplot(Seatbelts,
       aes(x = date, y = DriversKilled, color = law, shape = law)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  ggtitle("UK car driver deaths by month")
This code produces figure 5.17.
From the chart, it looks like the introduction of the seatbelt law produced a drop in deaths that is non-trivial when compared to the normal variation in deaths. It also looks like the effect may have reverted quickly.
Suppose our follow-up question is to break down this data further to seating position (as types of seatbelts differ quite a bit by seating position).
To make such a plot with ggplot2, we need to move the data from all facts being in each row to having one row per seating position. This is an example of moving from a wide or denormalized format, the natural format for machine learning tasks, to a tall or multiline record format.
Plot deaths conditioned on date and seating position, using ggplot2. ggplot2 requires the data to be in a long, rather than wide, format. So we will concentrate on how to perform this transform. We call this sort of transform moving data from row-oriented records to blocks of rows, as shown in figure 5.18.
We can solve this with data.table::melt.data.table(). Specify the columns of the original table that the values are to be taken from with the measure.vars argument. Specify the pair of columns the information is to be written into in the transformed table with the arguments variable.name (the new key column) and value.name (the new value column).
library("data.table")

seatbelts_long2 <- melt.data.table(
  as.data.table(Seatbelts),
  id.vars = NULL,
  measure.vars = c("DriversKilled", "front", "rear"),
  variable.name = "victim_type",
  value.name = "nvictims")
These new graphs do show us something more: the law had essentially no effect on people in the rear seats. This could be perhaps because the law didn't cover these seats, perhaps enforcing rear seat compliance was difficult, or perhaps rear-seat seatbelts were lap belts (instead of three-point restraints) and were not effective. The strongest benefit seems to be to front-seat passengers, and that is not too unusual, as they tend to have high-quality seatbelts and are not sitting in front of the steering column (a primary source of fatal injuries).
library("cdata")

seatbelts_long3 <- unpivot_to_blocks(
  Seatbelts,
  nameForNewKeyColumn = "victim_type",
  nameForNewValueColumn = "nvictims",
  columnsToTakeFrom = c("DriversKilled", "front", "rear"))
cdata has simple methods to specify the coordinated conversions of many columns at once. A good introduction can be found at http://www.win-vector.com/blog/2018/10/faceted-graphs-with-cdata-and-ggplot2/.
We encourage you to try all three solutions and convince yourself they produce the equivalent results. We prefer the cdata solution, but it is new and not as well known as the data.table or tidyr solutions.
library("tidyr")

seatbelts_long1 <- gather(
  Seatbelts,
  key = victim_type,
  value = nvictims,
  DriversKilled, front, rear)
head(seatbelts_long1)
##         date     law   victim_type nvictims
## 1 1982-01-01 pre-law DriversKilled      115
## 2 1982-02-01 pre-law DriversKilled      104
## 3 1982-03-01 pre-law DriversKilled      131
## 4 1982-04-01 pre-law DriversKilled      108
## 5 1982-05-01 pre-law DriversKilled      103
## 6 1982-06-01 pre-law DriversKilled      115

ggplot(seatbelts_long1,
       aes(x = date, y = nvictims, color = law, shape = law)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  facet_wrap(~victim_type, ncol = 1, scale = "free_y") +
  ggtitle("UK auto fatalities by month and seating position")
And we now have the passenger death data faceted by seating position in figure 5.19.
We have been given data in a log style, where each detail of a measurement is written in a separate row. Colloquially, we call this a tall or thin data form (formally, it is related to information storage ideas such as RDF triples). The operation of moving to a wide form is very much like what Microsoft Excel users call pivoting, except aggregations (sums, averages, counts) are not strictly part of moving from tall to wide form (we suggest aggregating first, before transforming). Moving from tall to wide form is, of course, the inverse of the wide-to-tall conversion we discussed earlier.
For our example, we have taken the ChickWeight data from R’s datasets package. Please try these commands along with the book and take extra steps to examine the data (using commands such as View(), head(), summary() and so on):
library("datasets")
library("data.table")
library("ggplot2")

ChickWeight <- data.frame(ChickWeight)  # get rid of attributes
ChickWeight$Diet <- NULL                # remove the diet label

# pad names with zeros
padz <- function(x, n = max(nchar(x))) {
  gsub(" ", "0", formatC(x, width = n))
}

# append "Chick" to the chick ids
ChickWeight$Chick <- paste0("Chick",
                            padz(as.character(ChickWeight$Chick)))

head(ChickWeight)
##   weight Time   Chick
## 1     42    0 Chick01
## 2     51    2 Chick01
## 3     59    4 Chick01
## 4     64    6 Chick01
## 5     76    8 Chick01
## 6     93   10 Chick01
This data is organized so each row is a single fact (weight) about a given chick at a given time. This is a very easy format to produce and transmit, which is why it is popular in scientific settings. To perform interesting work or learn from the data, we need to bring the data into a wider structure. For our problem, we would like all the weight facts about a chick to be in a single row, with time as the new column name.
Before doing that, let’s use some of our earlier lessons to get a look at the data. We can aggregate the data to move from information about individuals to overall trends.
# aggregate count and mean weight by time
ChickSummary <- as.data.table(ChickWeight)
ChickSummary <- ChickSummary[,
  .(count = .N,
    weight = mean(weight),
    q1_weight = quantile(weight, probs = 0.25),
    q2_weight = quantile(weight, probs = 0.75)),
  by = Time]

head(ChickSummary)
##    Time count    weight q1_weight q2_weight
## 1:    0    50  41.06000        41        42
## 2:    2    50  49.22000        48        51
## 3:    4    49  59.95918        57        63
## 4:    6    49  74.30612        68        80
## 5:    8    49  91.24490        83       102
## 6:   10    49 107.83673        93       124
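For comparison, the same per-time count and mean weight can be computed with base R's aggregate(); this sketch works on a fresh copy (cw) of the built-in data so the chapter's modified ChickWeight is untouched:

```r
library("datasets")

cw <- data.frame(ChickWeight)   # fresh copy of the built-in data

# one data frame per statistic (aggregate computes one FUN at a time)
counts <- aggregate(weight ~ Time, data = cw, FUN = length)
means  <- aggregate(weight ~ Time, data = cw, FUN = mean)

counts$weight[counts$Time == 0]   # 50 chicks measured at time 0
means$weight[means$Time == 0]     # mean weight 41.06 at time 0
```

Notice that aggregate() returns one result column per call, so collecting several statistics means several calls and a merge; the data.table form above computes all of them in one grouped pass.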
In ChickSummary, the only key is Time (specified by the data.table by argument), and we can now see how many chicks are surviving at a given time and the distribution of surviving chick weights at a given time.
We can present this table graphically. To use ggplot2 to do this, we need to move the summarized data to a tall form (as ggplot2 prefers to work with tall data). We use cdata::unpivot_to_blocks:
library("ggplot2")

# unpivot the summarized data to a tall form for plotting
ChickSummary <- cdata::unpivot_to_blocks(
  ChickSummary,
  nameForNewKeyColumn = "measurement",
  nameForNewValueColumn = "value",
  columnsToTakeFrom = c("count", "weight"))

# only show the weight quartiles on the weight facet
ChickSummary$q1_weight[ChickSummary$measurement == "count"] <- NA
ChickSummary$q2_weight[ChickSummary$measurement == "count"] <- NA
CW <- ChickWeight
CW$measurement <- "weight"

# plot the individual trajectories plus the summaries
ggplot(ChickSummary,
       aes(x = Time, y = value, color = measurement)) +
  geom_line(data = CW,
            aes(x = Time, y = weight, group = Chick),
            color = "LightGray") +
  geom_line(size = 2) +
  geom_ribbon(aes(ymin = q1_weight, ymax = q2_weight),
              alpha = 0.3, colour = NA) +
  facet_wrap(~measurement, ncol = 1, scales = "free_y") +
  theme(legend.position = "none") +
  ylab(NULL) +
  ggtitle("Chick Weight and Count Measurements by Time",
          subtitle = "25% through 75% quartiles of weight shown shaded around mean")
This gives the chick weights organized by time and chick, as shown in figure 5.20.
Here we have plotted the total count of surviving chicks as a function of time, plus the weight trajectory of each individual chick, and the summary statistics (mean weight, and 25% through 75% quartiles of weight).
We can now return to the example task of this section: putting all the information about each chick into a single row.
Diagrammatically, it looks like the following: one column’s (meastype) values are used as new column headers and a second column (meas) supplies values. We call this moving data from blocks to wide row records, as illustrated in figure 5.21.
To move data to a wide form using dcast.data.table(), we specify the rows and columns of a result matrix using a formula with the ~ notation. We then say how to populate the cells of this matrix with the value.var argument. In our case, to get a data frame with a row per chick, a column per time, and weight in the cells, we use the following step:
library("data.table")

ChickWeight_wide2 <- dcast.data.table(
  as.data.table(ChickWeight),
  Chick ~ Time,
  value.var = "weight")
This table is a matrix where the row is identified by the chick, and the column is time. The cells contain the weight for the given chick and time (and NA if the chick did not survive to a given time). Notice that this format is much easier to read and may be required for reporting.
data.table’s implementation of dcast also allows more-powerful transforms, such as casting to multiple variables and aggregations at the same time.
cdata::pivot_to_rowrecs() describes the intended table by row-keys, column to take new column keys from, and column to take values from:
library("cdata")

ChickWeight_wide3 <- pivot_to_rowrecs(
  ChickWeight,
  columnToTakeKeysFrom = "Time",
  columnToTakeValuesFrom = "weight",
  rowKeyColumns = "Chick")
library("tidyr")

ChickWeight_wide1 <- spread(ChickWeight, key = Time, value = weight)
head(ChickWeight_wide1)
##     Chick  0  2  4  6  8  10  12  14  16  18  20  21
## 1 Chick01 42 51 59 64 76  93 106 125 149 171 199 205
## 2 Chick02 40 49 58 72 84 103 122 138 162 187 209 215
## 3 Chick03 43 39 55 67 84  99 115 138 163 187 198 202
## 4 Chick04 42 49 56 67 74  87 102 108 136 154 160 157
## 5 Chick05 41 42 48 60 79 106 141 164 197 199 220 223
## 6 Chick06 41 49 59 74 97 124 141 148 155 160 160 157
There are a lot of details to data transforms. The important concept to retain is this: data has coordinates such as name of table, name of column, and identity of row. The exact way the coordinates are specified is an implementation detail to be overcome or transformed to a convenient state. All of this is a consequence of Codd’s second rule of database design: “Each and every datum (atomic value) in a relational database is guaranteed to be logically accessible by resorting to a combination of table name, primary key value, and column name.”[4] What we hope you have learned is this: what parts of the coordinates (or access plan) happen to be the table name, versus row-keys, versus the column name is an alterable implementation detail.
Building temporary tables, adding columns, and correcting column names early is much better than having complicated analysis code. This follows Raymond’s “Rule of Representation.”
Raymond’s “Rule of Representation”
Fold knowledge into data, so program logic can be stupid and robust.
The Art of Unix Programming, Eric S. Raymond, Addison-Wesley, 2003
We suggest transforming your data to fix problems (correct column names, change the data layout) early, to make later steps easier. The format you should try to move to for predictive modeling is what database designers call a denormalized form, or what statisticians call a multivariate data matrix, or model matrix: a regular array where rows are individuals, and columns are possible observations.[5]
See W. J. Krzanowski and F. H. C. Marriott, Multivariate Analysis, Part 1, Edward Arnold, 1994.
The interested reader may want to pursue cdata’s powerful diagrammatic system of data layout, which is being broadly adopted, and is discussed here: https://github.com/WinVector/cdata.
In this chapter, we’ve worked through basic examples of transforming data for analysis and presentation.
At this point, we have worked through a great number of data transforms. Natural questions to ask are these: are these enough transforms? Can we quickly decompose any task into a small sequence of these transforms?
The answer is “no and yes.” There are more-specialized transforms such as “rolling window” functions and other time-series operations that are hard to express in terms of these transforms, but do in fact have their own efficient implementations in R and data.table. However, toward the “yes” answer, there are good reasons to consider the set of transforms we have learned as substantial. The basic manipulation transforms pretty much cover all of Edgar F. Codd’s relational algebra: a set of transforms that has been driving data engineering since 1970.
In this chapter you have learned
In part 2 of the book, we’ll talk about the process of building and evaluating models to meet your stated objectives.