What You’ll Learn in This Hour:
The dplyr package
Piping commands together
The data.table package
Options for improving efficiency
In Hour 11, “Data Manipulation and Transformation,” we looked at some standard methods for processing data in R. In particular, you saw how to sort and merge data. In previous hours we discussed how to subscript and summarize data using the “apply” family of functions. Now we will look at two packages, dplyr and data.table, that enable us to do all of these tasks for data frames within consistent, highly efficient frameworks.
We will begin the hour by looking at Hadley Wickham’s incredibly popular dplyr package. Although dplyr is actually the more recent of the two packages we’ll discuss in this hour, it fits in with packages such as readr and tidyr from the previous two hours. The data.table package is a standalone package for data manipulation that offers greater efficiency for very large data.
The dplyr package is another Hadley Wickham package that is revolutionizing the way people work with data in R. The package, which was first released in January 2014, fits into an analysis workflow that Hadley Wickham has helped define. In Hour 10, “Importing and Exporting,” you saw how packages such as readr, haven, and readxl can be used to import data into R. In Hour 11, you saw how the tidyr package can be used to transform data into a new shape. We will now look at how dplyr can be used to sort, subset, merge and summarize data.
The dplyr package can be thought of as an evolution of the popular plyr package, although it focuses solely on the manipulation of rectangular data structures, whereas plyr provides a more general framework. The focus of dplyr is very much on usability; however, there has also been considerable effort to ensure that dplyr is fast and efficient.
The dplyr package is intended to be used in a data analysis workflow in which data is imported using packages such as readr, haven, and readxl and then (possibly) transformed using tidyr. Each of these packages contains functions that produce an object of the tbl_df class. A tbl_df object is a dplyr construct that extends a data frame, affecting the way it prints. The tbl_df class extension does not affect standard data frame operations; however, each of the data-manipulation functions within dplyr returns a tbl_df object, so it is worth spending a little time to see what a tbl_df actually looks like. We can create a tbl_df object directly from a data.frame using the tbl_df function. An example of this is shown in Listing 12.1.
1 : > library(dplyr)
2 : >
3 : > # Create a tbl_df object from mtcars
4 : > head(mtcars)
5 : mpg cyl disp hp drat wt qsec vs am gear carb
6 : Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
7 : Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
8 : Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
9 : Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
10 : Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
11 : Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
12 : >
13 : > carData <- tbl_df(mtcars)
14 : > carData
15 : Source: local data frame [32 x 11]
16 :
17 : mpg cyl disp hp drat wt qsec vs am gear carb
18 : 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
19 : 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
20 : 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
21 : 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
22 : 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
23 : 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
24 : 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
25 : 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
26 : 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
27 : 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
28 :.. ... ... ... ... ... ... ... .. .. ... ...
29 : >
30 : > class(carData) # A tbl_df object is just an extension to a data.frame object
31 : [1] "tbl_df" "tbl" "data.frame"
In addition to changing the way in which data frames print, the creation of a tbl_df object also removes row names. In Listing 12.1 we can see how the creation of the carData “tbl_df” removes the row names from the original mtcars data. This is intentional and enforces the tidy data principle that all meaningful information should be stored in the same way (in columns). However, it can of course be a little frustrating if you have meaningful row names! The terms “tbl_df” and “data frame” will be used interchangeably throughout the remainder of this hour.
Note: Working with Data Tables
The dplyr package allows us to work with data table objects via the tbl_dt function, which extends the data.table class to create a tbl_dt object. A tbl_dt object behaves just like a tbl_df object.
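As a short sketch of the idea (assuming a version of dplyr that still exports tbl_dt; in later releases this functionality moved to the dtplyr package):

```r
library(data.table)
library(dplyr)

# Wrap a data.table so that it also prints and behaves like a tbl
carsDT <- tbl_dt(data.table(mtcars))
class(carsDT)  # includes "tbl_dt", "tbl", and "data.table"
```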
In dplyr we sort data using the arrange function. The arrange function expects a data frame (or a tbl_df) as the first argument. We can then list any number of columns as the subsequent arguments. The data is sorted by the first column we provide, then by the second, and so on. By default, an ascending sort is used. In the example below, we sort the carData data by carb and then by cyl:
> arrange(carData, carb, cyl)
Source: local data frame [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
1 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
2 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
3 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
4 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
5 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
6 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
7 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
10 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
.. ... ... ... ... ... ... ... .. .. ... ...
If we want to sort by descending values for any of our sort columns, we can wrap the column name in a call to the desc function; for example, to sort by carb and then descending values of cyl we would write arrange(carData, carb, desc(cyl)). Alternatively, we can simply place a minus sign in front of the column name, as shown here:
arrange(carData, carb, -cyl)
The dplyr package defines subscripting as two distinct operations: choosing rows and choosing columns. These are performed respectively by filter and select. As with all of the dplyr functions we are discussing in this hour, each function expects a data frame (or tbl_df object) as the first argument. This allows us to reference variables directly in subsequent arguments without using dollar signs or square brackets. In the subsequent arguments, we choose how we wish to “filter” the rows or “select” the columns. Let’s start by using the filter function to create a subset of carData containing only four-cylinder cars:
> cyl4 <- filter(carData, cyl == 4)
> cyl4
Source: local data frame [11 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
1 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
2 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
3 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
4 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
5 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
6 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
7 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
8 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
9 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
10 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
11 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
We can use any standard logical operations to filter our data. In addition to the standard ampersand (&), dplyr also permits us to separate “and” operations with a comma:
> filter(carData, cyl == 4, gear == 5) # equivalent to cyl == 4 & gear == 5
Source: local data frame [2 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
1 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
2 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
The select function operates in much the same way as filter. We can use either column names or column numbers to select which columns to keep or drop, much like the select option in the subset function. The standard way to select multiple columns is to separate each column with a comma. Note again that we do not use quotes to specify columns.
> select(carData, mpg, wt, cyl) # Return just these columns
Source: local data frame [32 x 3]
mpg wt cyl
1 21.0 2.620 6
2 21.0 2.875 6
3 22.8 2.320 4
4 21.4 3.215 6
5 18.7 3.440 8
6 18.1 3.460 6
7 14.3 3.570 8
8 24.4 3.190 4
9 22.8 3.150 4
10 19.2 3.440 6
.. ... ... ...
> select(carData, -vs, -am) # Return everything except these columns
Source: local data frame [32 x 9]
mpg cyl disp hp drat wt qsec gear carb
1 21.0 6 160.0 110 3.90 2.620 16.46 4 4
2 21.0 6 160.0 110 3.90 2.875 17.02 4 4
3 22.8 4 108.0 93 3.85 2.320 18.61 4 1
4 21.4 6 258.0 110 3.08 3.215 19.44 3 1
5 18.7 8 360.0 175 3.15 3.440 17.02 3 2
6 18.1 6 225.0 105 2.76 3.460 20.22 3 1
7 14.3 8 360.0 245 3.21 3.570 15.84 3 4
8 24.4 4 146.7 62 3.69 3.190 20.00 4 2
9 22.8 4 140.8 95 3.92 3.150 22.90 4 2
10 19.2 6 167.6 123 3.92 3.440 18.30 4 4
.. ... ... ... ... ... ... ... ... ...
Another nice property of the select function is that we can choose a sequence of columns using the column names in addition to the column numbers. For example, we could specify select(carData, mpg:wt). Choosing the columns that we want is simplified via a number of additional utility functions, as listed in Table 12.1.
Caution: Specialist functions within select
The functions described in Table 12.1 only work inside the select function and cannot be used to find patterns in standard character vectors.
The mutate function enables us to easily add new columns to our data. We can either provide a vector of values in the same way we would with a standard data frame or we can create new columns from existing variables. In the following example, we create a new column containing the original row names from the mtcars data frame. We then use the information contained within the hp and wt columns to create a second new column containing the power-to-weight ratio.
> fullCarData <- mutate(carData, type = rownames(mtcars), pwr2wt = hp/wt)
> fullCarData
Source: local data frame [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb type pwr2wt
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 41.98473
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag 38.26087
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 Datsun 710 40.08621
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Hornet 4 Drive 34.21462
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 Hornet Sportabout 50.87209
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 Valiant 30.34682
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 Duster 360 68.62745
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 Merc 240D 19.43574
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 Merc 230 30.15873
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 Merc 280 35.75581
.. ... ... ... ... ... ... ... .. .. ... ... ... ...
We can also drop columns by assigning existing names to NULL. The mutate function is similar to the base R function transform. However, unlike transform, the mutate function creates variables in the order in which we specify them, allowing variables that we create to be used immediately to create further new variables.
> fullCarData <- mutate(carData, type = rownames(mtcars),
+ drat = NULL, qsec = NULL,
+ pwr2wt = hp/wt, pwr2wt.Sq = pwr2wt^2)
> head(fullCarData,3)
Source: local data frame [3 x 12]
mpg cyl disp hp wt vs am gear carb type pwr2wt pwr2wt.Sq
1 21.0 6 160 110 2.620 0 1 4 4 Mazda RX4 41.98473 1762.718
2 21.0 6 160 110 2.875 0 1 4 4 Mazda RX4 Wag 38.26087 1463.894
3 22.8 4 108 93 2.320 1 1 4 1 Datsun 710 40.08621 1606.904
In Hour 11, you saw how the merge function can be used to merge data frames. The merge function allows us to specify arguments such as all.x in order to achieve what is also commonly known as a “left join.” In contrast, dplyr splits these arguments out into separate functions. These can be seen in Table 12.2. As with merge, we refer to our two datasets as x and y.
The first four functions listed in Table 12.2 operate in the same way as the merge function. For example, inner_join(demoData, pkData) provides an equivalent to merge(demoData, pkData). In addition, dplyr offers us the concepts of a semi-join and an anti-join. The semi_join function does not actually perform a merge. Instead, it returns the rows in x that would be retained if we were to merge x with y. Conversely, the anti_join function returns the rows of x that would not be retained if we were to merge with y. Listing 12.2 illustrates a semi-join and an anti-join using two (fabricated) sample data frames.
1 : > # Fabricate two datasets to merge
2 : > beerData <- data.frame(ID = c(1, 2, 3), Beer = c(75, 64, 92))
3 : > diaperData <- data.frame(ID = c(1, 3, 4), Diapers = c(51, 68, 32))
4 : > beerData
5 : ID Beer
6 : 1 1 75
7 : 2 2 64
8 : 3 3 92
9 : > diaperData
10 : ID Diapers
11 : 1 1 51
12 : 2 3 68
13 : 3 4 32
14 : >
15 : > # Rows of beerData that have a corresponding "ID" in diaperData
16 : > semi_join(beerData, diaperData, by = "ID")
17 : ID Beer
18 : 1 1 75
19 : 2 3 92
20 : > # Rows of beerData that do not have a corresponding "ID" in diaperData
21 : > anti_join(beerData, diaperData, by = "ID")
22 : ID Beer
23 : 1 2 64
24 : > # An inner join of the two datasets
25 : > inner_join(beerData, diaperData, by = "ID")
26 : ID Beer Diapers
27 : 1 1 75 51
28 : 2 3 92 68
Note that in each case we specified the “by” variable for the merge as "ID", but we did not have to. Like merge, each of the dplyr *_join functions will automatically determine the merge-by variables for us if we do not specify them. Because we stated that the data in the example is to be merged by the ID variable, the semi-join looks for ID values in beerData that also appear in diaperData. These are the rows that would be merged using either inner_join (as in lines 25 to 28) or left_join. Accordingly, anti_join returns the remaining rows that would not be merged.
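To round out the first four functions from Table 12.2, here is a short sketch of left_join and full_join using the same fabricated data as Listing 12.2:

```r
library(dplyr)

beerData <- data.frame(ID = c(1, 2, 3), Beer = c(75, 64, 92))
diaperData <- data.frame(ID = c(1, 3, 4), Diapers = c(51, 68, 32))

# Keep every row of beerData; unmatched IDs get NA for Diapers
left_join(beerData, diaperData, by = "ID")

# Keep unmatched rows from both datasets
full_join(beerData, diaperData, by = "ID")
```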
In addition to facilitating data manipulation, dplyr also provides an easy-to-use syntax for data aggregation that is a marked improvement upon its more generic predecessor, the plyr package. In dplyr terminology, data aggregation is referred to as a data summary. We therefore use a function called summarize to obtain numeric summaries of our data. As always when using dplyr, we pass the data as the first argument. In the subsequent arguments we can use standard summary functions to summarize columns in the data. In the following example, we use the mean function to summarize the mpg column within carData:
> summarize(carData, mean(mpg))
Source: local data frame [1 x 1]
mean(mpg)
1 20.09062
We can summarize using any function we like, including custom-written functions. The only restrictions are that the function must expect a vector as its input and must return a single value. We cannot, therefore, use a function such as range, because it returns a vector of length 2. However, we can make as many summaries as we like in a single call to summarize.
> summarize(carData, min(mpg), median(mpg), max(mpg))
Source: local data frame [1 x 3]
min(mpg) median(mpg) max(mpg)
1 10.4 19.2 33.9
When creating multiple summaries in this way, it can be helpful to be able to manually control the labels of the resulting data. In order to do so we simply specify the name of the resulting output column when creating the summary, as follows:
> mpgSummary <- summarize(carData, Min=min(mpg), Median=median(mpg), Max=max(mpg))
> mpgSummary
Source: local data frame [1 x 3]
Min Median Max
1 10.4 19.2 33.9
Sometimes we may find that we need to pass additional arguments to our summary functions. For example, we may need to specify na.rm = TRUE when summarizing a variable with missing values. To pass extra arguments to our summary functions, we supply them as if we were calling the function directly. Here’s an example:
summarize(airquality, mean(Ozone, na.rm = TRUE)).
If all we needed to do was summarize columns of data using standard numeric summary functions, then dplyr wouldn’t really offer anything new. If anything, it makes the process more tedious. However, the real advantage of the summarize function is that it facilitates easy “by” operations. In order to summarize our data by one or more variables, we use the group_by function to define a grouping within our data. We can actually group our data at any time, and the grouping will be retained by any other operations we perform. We can group by as many variables as we like.
To demonstrate the concept of grouped data, let’s group carData by the cyl variable and observe what happens when we filter the data by carb. The code for the operation is shown in Listing 12.3.
1: > cylGrouping <- group_by(carData, cyl)
2: > head(cylGrouping)
3: Source: local data frame [6 x 11]
4: Groups: cyl
5:
6: mpg cyl disp hp drat wt qsec vs am gear carb
7: 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
8: 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
9: 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
10: 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
11: 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
12: 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
13: >
14: > filter(cylGrouping, carb == 4)
15: Source: local data frame [10 x 11]
16: Groups: cyl
17:
18: mpg cyl disp hp drat wt qsec vs am gear carb
19: 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
20: 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
21: 3 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
22: 4 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
23: 5 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
24: 6 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
25: 7 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
26: 8 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
27: 9 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
28: 10 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Notice first of all that grouping by the cyl variable has the effect of adding a line to the output (see line 4). As can be seen in line 16, the cyl grouping was retained when we filtered the data. In both cases the sort order remains unaffected by the grouping. The effect of grouping our data is only felt when we summarize it. In the following example, we summarize the mpg column in our grouped data, cylGrouping:
> mpgSummaryByCyl <- summarize(cylGrouping, min(mpg), median(mpg), max(mpg))
> mpgSummaryByCyl
Source: local data frame [3 x 4]
cyl min(mpg) median(mpg) max(mpg)
1 4 21.4 26.0 33.9
2 6 17.8 19.7 21.4
3 8 10.4 15.2 19.2
The result of performing a summary operation on grouped data is that the output is summarized by each level of the grouping variable(s). In keeping with the concept of tidy data, the output is a data frame (in fact, a tbl_df). The operation returns a separate column for each variable that we grouped by, with additional columns for each summary we specified.
You have already seen that when we filter our data, the grouping variables are retained. However, we can also use the grouping to our advantage within the filter itself. In the following example, we use a grouping on the cyl variable to extract the maximum mpg value for each value of cyl. The comparison mpg == max(mpg) is performed within each group (that is, within each value of cyl).
> cylGrouping <- group_by(carData, cyl)
> # Extract the maximum mpg for each cyl category
> filter(cylGrouping, mpg == max(mpg))
Source: local data frame [3 x 11]
Groups: cyl
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
2 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
3 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Grouping our data also facilitates the generation of new aggregation variables. For example, we could create a new variable, meanMPGbyCyl, that is the mean of the mpg column for each value of cyl, as shown here:
> mutate(cylGrouping, meanMPGbyCyl = mean(mpg))
Source: local data frame [32 x 12]
Groups: cyl
mpg cyl disp hp drat wt qsec vs am gear carb meanMPGbyCyl
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 19.74286
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 19.74286
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 26.66364
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 19.74286
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 15.10000
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 19.74286
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 15.10000
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 26.66364
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 26.66364
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 19.74286
.. ... ... ... ... ... ... ... .. .. ... ... ...
Note: Remove a Grouping
We can remove any groupings in our data using the ungroup function.
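For example, a minimal sketch of removing the grouping before summarizing:

```r
library(dplyr)

cylGrouping <- group_by(tbl_df(mtcars), cyl)

summarize(cylGrouping, mean(mpg))           # one row per cyl value
summarize(ungroup(cylGrouping), mean(mpg))  # a single overall mean
```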
Functions in dplyr have been written to take advantage of what is commonly referred to as the “pipe” operator. The pipe operator, %>%, originates in the magrittr package and is by no means restricted to usage within dplyr. The pipe operator allows us to chain functions together such that the output from one function becomes the input to the first argument (by default) of the next. This has led to it being called the “then” operator in some quarters (do this, then this, then this, and so on). It is particularly useful if we have many steps to perform on a single type of object, such as a data frame. The advantage of this approach is that it avoids intermediary objects (that is, those that we create simply to break up nested function calls).
Note: Piping to Other Arguments
When you use the pipe operator, the output from a function does not have to be used as the input to the first argument of the next function. It can in fact become the input to any argument within the following function. However, the code is generally a lot more readable if we feed the output into the first argument of the following function.
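With the magrittr pipe, a dot (.) can be used as a placeholder for the left-hand side when it should feed an argument other than the first. A small sketch:

```r
library(dplyr)  # re-exports the magrittr pipe, %>%

# mtcars is piped into the "data" argument of lm(), not the first argument
mtcars %>% lm(mpg ~ wt, data = .)
```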
The dplyr package has been written with the pipe operator very much in mind. In a typical analysis workflow we might arrange, filter, select, mutate, group_by, and summarize several times over. Each of these functions takes a data frame as its first input and returns another data frame as its output. This is ideal for piping together function calls. Consider the example in Listing 12.4 using mtcars. In the first instance we use the traditional approach to data processing. To avoid nesting, we end up creating three intermediate datasets on the way to obtaining our summary. We then perform the same operations using the pipe operator. In the second case, no intermediate datasets are required.
1: > # A standard workflow, mean mpg by cyl for manual cars
2: > # The traditional way:
3: > carsByCyl <- arrange(mtcars, cyl)
4: > groupByCyl <- group_by(carsByCyl, cyl)
5: > manualCars <- filter(groupByCyl, am == 1)
6: > summarize(manualCars, Mean.MPG=mean(mpg))
7: Source: local data frame [3 x 2]
8:
9: cyl Mean.MPG
10: 1 4 28.07500
11: 2 6 20.56667
12: 3 8 15.40000
13: >
14: > # Using pipes
15: > mtcars %>%
16: + arrange(cyl) %>%
17: + group_by(cyl) %>%
18: + filter(am == 1) %>%
19: + summarize(Mean.MPG=mean(mpg))
20: Source: local data frame [3 x 2]
21:
22: cyl Mean.MPG
23: 1 4 28.07500
24: 2 6 20.56667
25: 3 8 15.40000
The pipe operator is not to everyone’s taste, and it can be harder to debug than well-written code using a traditional syntax. However, it is becoming an increasingly popular means of working with data—and before long it may not be possible to avoid it!
The data.table package predates dplyr by several years, having first been released to CRAN in April 2006. It is still actively maintained by its primary author, Matt Dowle, and despite the growing popularity of the dplyr package, data.table remains one of the most popular and well-documented packages on CRAN. In addition to the standard help and a quick-start guide, Matt Dowle has written an extensive FAQ document tackling some of the less-obvious aspects of the package.
The focus of the package is very much on reading, processing, and aggregating large data efficiently. The data.table object is essentially an enhancement to the data.frame class. It allows us to index, merge, and group data much faster than we can with standard data frames.
Like any analysis workflow, the data.table workflow begins with importing data. In Hour 10 we looked briefly at the performance of the fread function contained within data.table. The fread function is similar to read.table in terms of usage, though it is much faster for large datasets. Conveniently, the output of the function is a data.table object.
> dji <- fread("djiData.csv")
> dji
Date DJI.Open DJI.High DJI.Low DJI.Close DJI.Volume DJI.Adj.Close
1: 12/31/2014 17987.66 18043.22 17820.88 17823.07 82840000 17823.07
2: 12/30/2014 18035.02 18035.02 17959.70 17983.07 47490000 17983.07
3: 12/29/2014 18046.58 18073.04 18021.57 18038.23 53870000 18038.23
4: 12/26/2014 18038.30 18103.45 18038.30 18053.71 52570000 18053.71
5: 12/24/2014 18035.73 18086.24 18027.78 18030.21 42870000 18030.21
---
248: 01/08/2014 16527.66 16528.88 16416.69 16462.74 103260000 16462.74
249: 01/07/2014 16429.02 16562.32 16429.02 16530.94 81270000 16530.94
250: 01/06/2014 16474.04 16532.99 16405.52 16425.10 89380000 16425.10
251: 01/03/2014 16456.89 16518.74 16439.30 16469.99 72770000 16469.99
252: 01/02/2014 16572.17 16573.07 16416.49 16441.35 80960000 16441.35
The appearance of a data table is similar to that of a standard data frame. When we print a small dataset (one containing 100 rows or fewer), the entire dataset is returned, but with the header row repeated at the base of the table. For larger datasets, only the first and last five rows are returned. We can turn existing data frames into data.table objects by directly calling the data.table function—for example, air <- data.table(airquality). We can also create a data.table from scratch in the same way we would using the data.frame function.
Tip: Keeping Track of Tables
If we create many data table objects, the tables function can be used to find out what tables we have, what they contain, and how much memory they have been allocated.
One of the primary focuses of the data.table package is performance. To achieve this performance, we define a key. In some ways this is similar to a primary key in a relational database. However, in data.table the key can be made up of several columns and does not have to be unique. In fact, it is often more useful if the key is not unique. The key is used for sorting, indexing, and summarizing. It is defined using a function called setkey. In Listing 12.5 we define a simple data.table using the demoData data in the mangoTraining package and then set the key based on the variables Sex and Smokes.
1: > # Create a data.table and define the key
2: > demoDT <- data.table(demoData)
3: > setkey(demoDT, Sex, Smokes)
4: > head(demoDT)
5: Subject Sex Age Weight Height BMI Smokes
6: 1: 3 F 23 72 170 25.1 No
7: 2: 6 F 29 67 169 23.5 No
8: 3: 12 F 32 77 182 23.1 No
9: 4: 15 F 27 73 172 24.8 No
10: 5: 23 F 26 82 175 26.8 No
11: 6: 26 F 25 58 175 18.9 No
The obvious effect of defining a key is that when printing, the data is sorted by the key variables from left to right as we defined them. In Listing 12.5 the rows are sorted by Sex and then by Smokes. The purpose of defining the sort key is not just for printing, however; it also enables faster indexing when subscripting.
Notice that we wrote setkey(demoDT, Sex, Smokes) as opposed to demoDT <- setkey(demoDT, Sex, Smokes). Functions in data.table update the data table directly, so we do not need to use <- to copy/replace the original data. Updating by reference in this way reduces the memory required to perform manipulation tasks and improves speed.
Tip: Querying the Key
We can find out whether a data table has a key using the haskey function, which returns TRUE if the data table has a key and FALSE otherwise. The key function tells us what the key is.
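A minimal sketch:

```r
library(data.table)

dt <- data.table(mtcars)
haskey(dt)        # FALSE: no key has been defined yet

setkey(dt, cyl, gear)
haskey(dt)        # TRUE
key(dt)           # "cyl" "gear"
```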
In the data.table syntax, we can reference columns directly, as if they were objects in their own right. In other words, we can drop the “dataName$” prefix. This saves some typing, though the real benefit is the speed gain we get from using data.table in the first place.
> demoDT[Sex == "F",]
Subject Sex Age Weight Height BMI Smokes
1: 3 F 23 72 170 25.1 No
2: 6 F 29 67 169 23.5 No
3: 12 F 32 77 182 23.1 No
4: 15 F 27 73 172 24.8 No
5: 23 F 26 82 175 26.8 No
6: 26 F 25 58 175 18.9 No
7: 28 F 28 69 172 23.4 No
8: 30 F 33 61 175 19.9 No
9: 17 F 41 62 172 20.9 Yes
10: 27 F 36 82 190 22.6 Yes
If our data table has a key and we want to subset by that key, we can go one step further and drop the reference to the variable we want to subset altogether (for example, demoDT["F",]). In fact, we don’t even need the comma to specify rows as we would with a data frame, though it can sometimes be confusing to leave it out.
If we have defined a key using multiple variables, we can provide the subset values by separating them with a comma. We enclose the values using J(), where J stands for “join.” In the following example, we subset the demography data to return female smokers:
> key(demoDT)
[1] "Sex" "Smokes"
> demoDT[J("F", "Yes"),]
Subject Sex Age Weight Height BMI Smokes
1: 17 F 41 62 172 20.9 Yes
2: 27 F 36 82 190 22.6 Yes
Note: Alternatives to J
The J function is the data.table specification of a “join” of two keys. The practice of joining based on keys has its roots in SQL, but in practice it is just a means of separating variables. As an alternative, the function list (base) or . (plyr) could be used in exactly the same way.
Occasionally we may want to return a subset in which the variables of interest match multiple criteria. To achieve this, we can specify a vector of values. If we have defined a key from multiple variables, any vector we specify must be contained within a call to the J function. An example of this is shown here:
> setkey(demoDT, Sex, Weight)
> demoDT[J("M", c(76, 77)),]
Subject Sex Age Weight Height BMI Smokes
1: 4 M 25 76 188 21.4 No
2: 31 M 25 76 174 25.1 No
3: 13 M 21 77 180 23.6 No
4: 20 M 22 77 183 23.1 No
Caution: Numeric Keys
The data.table package allows us to define a key using numeric variables. However, in order to subset using these keys we must use the . function. This is because, like data frames, data tables also allow us to subset by specifying row numbers. If we wanted to return all the rows in demoDT for which Weight is equal to 72, we would write the following:
> setkey(demoDT, Weight)
> demoDT[.(72),]
Subject Sex Age Weight Height BMI Smokes
1: 3 F 23 72 170 25.1 No
The data.table package makes adding variables to an existing data table much easier and quicker than when working with standard data frames. Whenever we add a column to a standard data frame, we make a copy of the data. When we work with data tables, the new column is instead appended by reference; in other words, R points to the existing table and tells it to add a new column. This makes it much faster and more efficient.
We create new variables in our data, by reference, using the := operator. To create variables by reference, we use the square subscripting brackets on the existing data table and avoid any standard R assignment. If we are generating the new variable from existing variables, we refer to them directly, as in the following example:
> demoDT[, HeightInM.sq := (Height^2)/10000]
> head(demoDT)
Subject Sex Age Weight Height BMI Smokes HeightInM.sq
1: 1 M 43 57 166 20.7 No 2.7556
2: 2 M 22 71 179 22.2 No 3.2041
3: 3 F 23 72 170 25.1 No 2.8900
4: 4 M 25 76 188 21.4 No 3.5344
5: 5 M 29 82 175 26.8 No 3.0625
6: 6 F 29 67 169 23.5 No 2.8561
Caution: Updating the Values in the Key
If we update the values in any of the columns that make up our key, we need to redefine the key.
To create multiple new columns, we must provide the names of the new columns as a character vector and the transformations as a list. The vector of names and the list of transformations are separated by the := operator, as shown in Listing 12.6. We can also remove columns by setting them to NULL using the := operator.
1: > demoDT[, c("SexNum", "SmokesNum") := list(as.numeric(Sex), as.numeric(Smokes))]
2: > head(demoDT)
3: Subject Sex Age Weight Height BMI Smokes HeightInM.sq SexNum SmokesNum
4: 1: 1 M 43 57 166 20.7 No 2.7556 2 1
5: 2: 26 F 25 58 175 18.9 No 3.0625 1 1
6: 3: 30 F 33 61 175 19.9 No 3.0625 1 1
7: 4: 22 M 27 61 170 21.0 No 2.8900 2 1
8: 5: 17 F 41 62 172 20.9 Yes 2.9584 1 2
9: 6: 14 M 26 64 170 22.0 No 2.8900 2 1
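To illustrate column removal, the two columns created in Listing 12.6 can be deleted again by assigning NULL by reference:

```r
> # Remove the SexNum and SmokesNum columns by reference
> demoDT[, c("SexNum", "SmokesNum") := NULL]
```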
We can rename columns using the setnames
function. Once again the renaming is performed by reference to avoid copying the entire dataset. The setnames
function expects a data table as its first argument, with further arguments old
and new
, which respectively expect a vector of column names to change from and to.
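As a minimal sketch, we could rename the HeightInM.sq column created earlier (the new name HeightM2 is purely for illustration):

```r
> # Rename a single column by reference; no copy of the data is made
> setnames(demoDT, old = "HeightInM.sq", new = "HeightM2")
```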
Note: Multiple Ways to Create New Variables
There are normally several ways of doing the same thing with data.table, and everyone tends to have their preference. In order to create new variables in Listing 12.6, we could also have used the following syntax:
demoDT[, `:=` (SexNum = as.numeric(Sex), SmokesNum = as.numeric(Smokes))]
We could also have used the set
function to achieve the same result.
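A sketch of the set approach: set expects the data table, optional row indices i, a column name or number j, and the values to assign.

```r
> # Equivalent to demoDT[, SexNum := as.numeric(Sex)]
> set(demoDT, j = "SexNum", value = as.numeric(demoDT$Sex))
```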
Although the rbind
function in base R can be used to append rows to a data table, the function rbindlist
is optimized for speed and memory efficiency. The rbindlist
function can be used to join data tables and/or regular data frames that are stored as a list. We can join together as many datasets as we wish, but we must first store them together in a list. Unlike the standard rbind
that we looked at in Hour 11, rbindlist
will permit us to bind together datasets for which the column names do not match by setting fill = TRUE
. An example of this is shown in Listing 12.7. First we generate a list by splitting the airquality
data by the Month
variable and then combine it back together in line 5. Then we use rbindlist
again in line 24 to add on new rows of data.
1: > # Create a list containing airquality data for each available month
2: > airSplit <- split(airquality, airquality$Month)
3: >
4: > # Bind these together into a single data table
5: > airDT <- rbindlist(airSplit)
6: > airDT
7: Ozone Solar.R Wind Temp Month Day
8: 1: 41 190 7.4 67 5 1
9: 2: 36 118 8.0 72 5 2
10: 3: 12 149 12.6 74 5 3
11: 4: 18 313 11.5 62 5 4
12: 5: NA NA 14.3 56 5 5
13: ---
14: 149: 30 193 6.9 70 9 26
15: 150: NA 145 13.2 77 9 27
16: 151: 14 191 14.3 75 9 28
17: 152: 18 131 8.0 76 9 29
18: 153: 20 223 11.5 68 9 30
19: >
20: > # Now assume two new records arrive but with missing columns
21: > month10 <- data.table(Ozone = c(24, 28), Month = 10, Day = 1:2)
22: >
23: > # Bind this to our original data
24: > newAirDT <- rbindlist(list(airDT, month10), fill = TRUE)
25: > tail(newAirDT)
26: Ozone Solar.R Wind Temp Month Day
27: 1: NA 145 13.2 77 9 27
28: 2: 14 191 14.3 75 9 28
29: 3: 18 131 8.0 76 9 29
30: 4: 20 223 11.5 68 9 30
31: 5: 24 NA NA NA 10 1
32: 6: 28 NA NA NA 10 2
Merging data tables works in much the same way as a typical merge on a data frame using the merge
function. However, the default behavior of merge
for data tables is to use the respective keys for the two data tables. We must therefore either define keys for the two data tables or specify the “by” variables manually. In Listing 12.8 we create two data tables from the demoData
and pkData
data frames contained within the mangoTraining package and set the keys accordingly. In line 8 we perform a merge
, similar to that used in Hour 11.
1: > # Create data tables and define the keys accordingly
2: > demoDT <- data.table(demoData)
3: > setkey(demoDT, Subject)
4: > pkDT <- data.table(pkData)
5: > setkey(pkDT, Subject)
6: >
7: > # Merge the two data tables together
8: > allPKDT <- merge(demoDT, pkDT)
9: > allPKDT
10: Subject Sex Age Weight Height BMI Smokes Dose Time Conc
11: 1: 1 M 43 57 166 20.7 No 25 0 0.00
12: 2: 1 M 43 57 166 20.7 No 25 1 660.13
13: 3: 1 M 43 57 166 20.7 No 25 6 178.92
14: 4: 1 M 43 57 166 20.7 No 25 12 88.99
15: 5: 1 M 43 57 166 20.7 No 25 24 42.71
16: ---
17: 161: 33 M 30 80 180 24.8 No 25 0 0.00
18: 162: 33 M 30 80 180 24.8 No 25 1 453.13
19: 163: 33 M 30 80 180 24.8 No 25 6 205.30
20: 164: 33 M 30 80 180 24.8 No 25 12 146.69
21: 165: 33 M 30 80 180 24.8 No 25 24 46.84
For large datasets you will notice that using merge
with data tables is significantly faster than with data frames. For those who need that little bit of extra performance, however, the package offers an alternative that is even faster. To perform the data table merge, we return to using square brackets. For a standard merge (a.k.a. an inner join), we put one data table inside the brackets and one outside. An example of an inner join or standard merge is shown here:
> demoDT[pkDT]
Subject Sex Age Weight Height BMI Smokes Dose Time Conc
1: 1 M 43 57 166 20.7 No 25 0 0.00
2: 1 M 43 57 166 20.7 No 25 1 660.13
3: 1 M 43 57 166 20.7 No 25 6 178.92
4: 1 M 43 57 166 20.7 No 25 12 88.99
5: 1 M 43 57 166 20.7 No 25 24 42.71
---
161: 33 M 30 80 180 24.8 No 25 0 0.00
162: 33 M 30 80 180 24.8 No 25 1 453.13
163: 33 M 30 80 180 24.8 No 25 6 205.30
164: 33 M 30 80 180 24.8 No 25 12 146.69
165: 33 M 30 80 180 24.8 No 25 24 46.84
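A brief sketch of the other join variants: X[Y] keeps every row of Y, so swapping the tables changes which side is fully retained, and nomatch = 0 drops unmatched rows.

```r
> # Keep all rows of demoDT (a left join onto demoDT)
> pkDT[demoDT]
> # Drop rows with no match on either side (an inner join),
> # even when the keys do not fully overlap
> demoDT[pkDT, nomatch = 0]
```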
In addition to transforming and manipulating our data, we can also use data.table to summarize our data. As usual, we start by specifying the name of the data and use square brackets to create a summary. We can perform simple summary operations on columns using standard statistical summary functions such as mean
.
> # Calculate the mean height
> demoDT <- data.table(demoData)
> demoDT[ , mean(Height)]
[1] 176.1515
So far we have seen nothing special. However, data.table permits the use of a “by” argument, which allows aggregation. The return object is also a data table. Here, we calculate the mean height again by sex:
> demoDT[ , mean(Height), by = Sex]
Sex V1
1: M 176.5652
2: F 175.2000
Tip: Counting Records
In data.table we can use .N
to count records within by-groups. For example, to count the number of males and females in the demoDT
data table, we would write demoDT[, .N, by = Sex]
.
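For example, with the demoDT table built from demoData:

```r
> demoDT[, .N, by = Sex]
   Sex  N
1:   M 23
2:   F 10
```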
We can summarize by multiple variables by providing them as a list using .
or list
. The result is another data table with a column for each “by” variable and an additional column for the summary.
> demoDT[ , mean(Height), by = list(Sex, Smokes)]
Sex Smokes V1
1: M No 177.3158
2: F No 173.7500
3: M Yes 173.0000
4: F Yes 181.0000
We can provide multiple summaries and name them using a list. Again, the result is a data table.
> demoDT[ , list(Mean.Height = mean(Height), Mean.Weight = mean(Weight)),
+ by = list(Sex, Smokes)]
Sex Smokes Mean.Height Mean.Weight
1: M No 177.3158 74.10526
2: F No 173.7500 69.87500
3: M Yes 173.0000 74.25000
4: F Yes 181.0000 72.00000
Caution: Summary Functions That Return Multiple Values
It is possible to summarize using functions that return multiple values, such as range
and quantile
. However, the effect is that a new row is created for each element of the return vector—for example, one for the minimum and one for the maximum if using range
. Other than the sort order, there is no way to tell which row corresponds to which value in the output vector.
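A minimal sketch of this behavior using range, which returns two values per group:

```r
> # Each group produces two rows: one for the minimum, one for the maximum,
> # distinguishable only by their order
> demoDT[, range(Height), by = Sex]
```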
The aggregation that we have seen thus far creates a new data table that we can use for publishing, plotting, or modeling. The original table is unaffected by the operation. However, if we want to merge the results of the aggregation back on to the original data, we can easily do so using the :=
operator.
> demoDT[, MeanWeightBySex := mean(Weight), by = Sex]
> head(demoDT, 5)
Subject Sex Age Weight Height BMI Smokes MeanWeightBySex
1: 1 M 43 57 166 20.7 No 74.13043
2: 2 M 22 71 179 22.2 No 74.13043
3: 3 F 23 72 170 25.1 No 70.30000
4: 4 M 25 76 188 21.4 No 74.13043
5: 5 M 29 82 175 26.8 No 74.13043
In order to generate multiple summaries, we may use any of the methods associated with :=
for creating new variables.
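For instance, the character-vector form from Listing 12.6 also accepts a by argument (the column names here are our own choices):

```r
> # Add two grouped summaries to the original table by reference
> demoDT[, c("MeanWeightBySex", "MeanHeightBySex") :=
+            list(mean(Weight), mean(Height)), by = Sex]
```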
There are always many ways of achieving the same goal using data.table, and we have presented just a small selection of options in most cases. There are also many more features, such as rolling means, that we simply do not have the time to cover in any detail. If you are interested in digging into data.table further, Matt Dowle has crammed the package help files full of examples. The package FAQ offers further guidance.
For the vast majority of readers, dplyr and data.table will be more than sufficient for your data needs. In particular, data.table has been shown to be extremely performant. On a standard desktop, it can comfortably handle basic summary operations on datasets with a billion rows, containing several thousand groups, within a matter of minutes. However, for some that might still not be enough!
Without parallelizing your code and/or turning to high-performance computing solutions, you might find two further packages to be of assistance. The first of these is bigmemory. The bigmemory package is designed to work with matrices that can be held in your computer’s memory but cannot be processed by standard R functions for data structures. The package takes advantage of C++ and allows objects to be shared across multiple sessions on the same machine.
An alternative approach to handling very large datasets is to use the ff package. Instead of storing large datasets in memory, the ff package stores data on disk. Only a tiny portion of the data is ever mapped to memory. Though the data is stored on disk, it behaves in much the same way as standard R objects held in memory. On the back end, C++ is used to perform the requested operations.
Still further options are available beyond the two packages covered in this hour, though typically they involve parallelizing your operation and are beyond the scope of this tutorial.
In this hour, we have looked at the two most popular packages for efficient data handling in R: dplyr and data.table. We have looked at the basic syntax of the packages as well as common data-handling tasks such as sorting, subscripting, merging, and aggregation. If you are still unsure as to which is right for you, you can now have a go at using them both during the workshop.
Having seen how R can be used to import and manipulate data, we will spend the next three hours looking at how we can visualize our data using the graphics package and the popular alternatives lattice and ggplot2.
Q. Which is better, dplyr or data.table?
A. In short, it depends! In terms of speed, most benchmarking examples show the packages to be comparable to a point, but as the number of rows and/or groups increases, data.table comes out on top. If speed or memory usage matter to you and you have more than a million rows or 100,000 groups within your data, you should probably use data.table. If data size (and hence performance) is not that important to you, choose whichever you feel more comfortable with.
Q. We have now seen a data.frame, a tbl_df, and a data.table. Why do I need to learn about three different structures?
A. First of all, both a tbl_df
and a data.table
are just an extension to a data.frame
. Generally, there is therefore very little difference, though functions such as print
behave in a slightly different manner for tbl_df
and data.table
objects than they do with a data.frame
. This is due to R’s S3 class system, which we will look at more closely in Hour 16, “Introduction to R Models and Object Orientation,” and then again in Hour 21, “Writing R Classes.”
The workshop contains quiz questions and exercises to help you solidify your understanding of the material covered. Try to answer all questions before looking at the “Answers” section that follows.
1. True or false? When using select
, you must provide a character vector of column names.
2. Which of the following is a dplyr function that allows you to create new columns?
A. transform
B. subset
C. mutate
3. Assuming you have created a data.table
object called demoDT
from the demoData
data frame and set the key to be the Smokes
column, which of the following would return a subset containing all records for subjects that smoke?
A. demoDT[demoDT$Smokes == "Yes", ]
B. demoDT[Smokes == "Yes", ]
C. demoDT["Yes", ]
D. demoDT["Yes"]
4. What is “wrong” with the following syntax when working with a data.table called demoDT
?
demoDT$Height.Sq <- demoDT$Height^2
1. False. You specify each column name as a separate argument. In fact, if you do try to use a character vector, the function will return an error.
2. C. The transform
and subset
functions are contained in the base R package. The transform
function is actually quite similar to mutate
, though it does not allow you to base new variables on other variables that you are creating within the call to transform. The subset
function offers similar functionality to the dplyr functions filter
and select
.
3. A, B, C and D. The data.table syntax is extremely flexible, and all four methods achieve the same end result.
4. Nothing is technically “wrong” with the statement, though data.table is optimized for efficiency, and the command shown is a standard, less efficient way of creating a new column, Height.Sq
. The more efficient method in data.table would be
demoDT[, Height.Sq := Height^2]
1. Using the dplyr package, perform the following actions:
Create a tbl_df
object named air
from the airquality
data frame.
Sort the data by the Wind
column.
Remove any rows for which the Ozone
column has a missing value.
Remove the Solar.R
column and create a new column containing the ratio of Ozone
to Wind
.
Create a subset of the original airquality
data containing just three columns: Month
, Day
, and Solar.R
. The data should only contain data for June and July. Name the output solar
.
Merge the air
and solar
datasets together, retaining all records from the air
dataset (that is, a left join).
Calculate the median Ozone
value by Month
for the merged data.
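One possible dplyr solution is sketched below; the new column name OzoneWindRatio is our own choice, and there are many equally valid orderings of these steps.

```r
library(dplyr)

# Create a tbl_df and sort by Wind
air <- tbl_df(airquality)
air <- arrange(air, Wind)

# Remove rows with missing Ozone values
air <- filter(air, !is.na(Ozone))

# Drop Solar.R and add the Ozone/Wind ratio
air <- mutate(select(air, -Solar.R), OzoneWindRatio = Ozone / Wind)

# Month, Day, and Solar.R for June and July only
solar <- filter(select(airquality, Month, Day, Solar.R), Month %in% c(6, 7))

# Left join, retaining all records from air
merged <- left_join(air, solar, by = c("Month", "Day"))

# Median Ozone by Month
summarise(group_by(merged, Month), MedianOzone = median(Ozone))
```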
2. Now using the data.table package, perform the same actions:
Create a data.table
object named air
from the airquality
data frame.
Sort the data by the Wind
column.
Remove any rows for which the Ozone
column has a missing value.
Remove the Solar.R
column and create a new column containing the ratio of Ozone
to Wind
.
Create a subset of the original airquality
data containing just three columns: Month
, Day
, and Solar.R
. The data should only contain data for June and July. Name the output solar
.
Merge the air
and solar
datasets together, retaining all records from the air
dataset (that is, a left join).
Calculate the median Ozone
value by Month
for the merged data.