Hour 12. Efficient Data Handling in R


What You’ll Learn in This Hour:

The dplyr package

Piping commands together

The data.table package

Options for improving efficiency


In Hour 11, “Data Manipulation and Transformation,” we looked at some standard methods for processing data in R. In particular, you saw how to sort and merge data. In previous hours we discussed how to subscript and summarize data using the “apply” family of functions. Now we will look at two packages, dplyr and data.table, that enable us to do all of these tasks for data frames within consistent, highly efficient frameworks.

We will begin the hour by looking at Hadley Wickham’s incredibly popular dplyr package. Although dplyr is actually the more recent of the two packages we’ll discuss in this hour, it fits in with packages such as readr and tidyr from the previous two hours. The data.table package is a standalone package for data manipulation that offers greater efficiency for very large data.

dplyr: A New Way of Handling Data

The dplyr package is another Hadley Wickham package that is revolutionizing the way people work with data in R. The package, which was first released in January 2014, fits into an analysis workflow that Hadley Wickham has helped define. In Hour 10, “Importing and Exporting,” you saw how packages such as readr, haven, and readxl can be used to import data into R. In Hour 11, you saw how the tidyr package can be used to transform data into a new shape. We will now look at how dplyr can be used to sort, subset, merge, and summarize data.

The dplyr package can be thought of as an evolution of the popular plyr package, although it focuses solely on the manipulation of rectangular data structures, whereas plyr provides a more general framework. The focus of dplyr is very much on usability; however, there has also been considerable effort to ensure that dplyr is fast and efficient.

Creating a dplyr (tbl_df) Object

The dplyr package is intended to be used in a data analysis workflow in which data is imported using packages such as readr, haven, and readxl and then (possibly) transformed using tidyr. Each of these packages contains functions that produce an object of the tbl_df class. A tbl_df object is a dplyr construct that extends a data frame, affecting the way it prints.

The tbl_df class extension does not affect standard data frame operations; however, each of the data-manipulation functions within dplyr returns a tbl_df object, so it is worth spending a little time to see what a tbl_df actually looks like. We can create a tbl_df object directly from a data.frame using the tbl_df function. An example of this is shown in Listing 12.1.

LISTING 12.1 Creating tbl_df Objects


 1 : > library(dplyr)
 2 : >
 3 : > # Create a tbl_df object from mtcars
 4 : > head(mtcars)
 5 :                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
 6 : Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
 7 : Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
 8 : Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
 9 : Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
10 : Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
11 : Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
12 : >
13 : > carData <- tbl_df(mtcars)
14 : > carData
15 : Source: local data frame [32 x 11]
16 :
17 :     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
18 : 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
19 : 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
20 : 3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
21 : 4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
22 : 5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
23 : 6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
24 : 7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
25 : 8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
26 : 9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
27 : 10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
28 :..  ... ...   ... ...  ...   ...   ... .. ..  ...  ...
29 : >
30 : > class(carData)    # A tbl_df object is just an extension to a data.frame object
31 : [1] "tbl_df"     "tbl"        "data.frame"


In addition to changing the way in which data frames print, the creation of a tbl_df object also removes row names. In Listing 12.1 we can see how the creation of the carData tbl_df object removes the row names from the original mtcars data. This is intentional and enforces the tidy data principle that all meaningful information should be stored in the same way (in columns). However, it can of course be a little frustrating if you have meaningful row names! The terms “tbl_df” and “data frame” will be used interchangeably throughout the remainder of this hour.


Note: Working with Data Tables

The dplyr package allows us to work with data table objects via the tbl_dt function, which extends the data.table class to create a tbl_dt object. A tbl_dt object behaves just like a tbl_df object.


Sorting

In dplyr we sort data using the arrange function. The arrange function expects a data frame (or a tbl_df) as the first argument. We can then list any number of columns as the subsequent arguments. The data is sorted by the first column we provide, then by the second, and so on. By default, an ascending sort is used. In the example below, we sort the carData data by carb and then by cyl:

> arrange(carData, carb, cyl)
Source: local data frame [32 x 11]

    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
2  32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
3  33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
4  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
5  27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
6  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
7  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
10 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
..  ... ...   ... ...  ...   ...   ... .. ..  ...  ...

If we want to sort by descending values for any of our sort columns, we can wrap the column name in a call to the desc function; for example, to sort by carb and then descending values of cyl we would write arrange(carData, carb, desc(cyl)). Alternatively, we can simply place a minus sign in front of the column name, as shown here:

arrange(carData, carb, -cyl)

Subscripting

The dplyr package defines subscripting as two distinct operations: choosing rows and choosing columns. These are defined respectively as filter and select. As with all of the dplyr functions we are discussing in this hour, each function expects a data frame (or tbl_df object) as the first argument. This allows us to reference variables directly in subsequent arguments without using dollar signs or square brackets. In the second argument, we choose how we wish to “filter” the rows or “select” the columns. Let’s start by using the filter function to create a subset of carData containing only four-cylinder cars:

> cyl4 <- filter(carData, cyl == 4)
> cyl4
Source: local data frame [11 x 11]

    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
2  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
3  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
4  32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
5  30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
6  33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
7  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
8  27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
9  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
10 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
11 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

We can use any standard logical operations to filter our data. In addition to the standard ampersand (&), dplyr also permits us to separate “and” operations with a comma:

> filter(carData, cyl == 4, gear == 5)    # equivalent to cyl == 4 & gear == 5
Source: local data frame [2 x 11]

   mpg cyl  disp  hp drat    wt qsec vs am gear carb
1 26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
2 30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2

The select function operates in much the same way as filter. We can use either column names or column numbers to select which columns to keep or drop, much like the select argument of the subset function. The standard way to select multiple columns is to separate each column with a comma. Note again that we do not use quotes to specify columns.

> select(carData, mpg, wt, cyl)    # Return just these columns
Source: local data frame [32 x 3]

    mpg    wt cyl
1  21.0 2.620   6
2  21.0 2.875   6
3  22.8 2.320   4
4  21.4 3.215   6
5  18.7 3.440   8
6  18.1 3.460   6
7  14.3 3.570   8
8  24.4 3.190   4
9  22.8 3.150   4
10 19.2 3.440   6
..  ...   ... ...
> select(carData, -vs, -am)    # Return everything except these columns
Source: local data frame [32 x 9]

    mpg cyl  disp  hp drat    wt  qsec gear carb
1  21.0   6 160.0 110 3.90 2.620 16.46    4    4
2  21.0   6 160.0 110 3.90 2.875 17.02    4    4
3  22.8   4 108.0  93 3.85 2.320 18.61    4    1
4  21.4   6 258.0 110 3.08 3.215 19.44    3    1
5  18.7   8 360.0 175 3.15 3.440 17.02    3    2
6  18.1   6 225.0 105 2.76 3.460 20.22    3    1
7  14.3   8 360.0 245 3.21 3.570 15.84    3    4
8  24.4   4 146.7  62 3.69 3.190 20.00    4    2
9  22.8   4 140.8  95 3.92 3.150 22.90    4    2
10 19.2   6 167.6 123 3.92 3.440 18.30    4    4
..  ... ...   ... ...  ...   ...   ...  ...  ...

Another nice property of the select function is that we can choose a sequence of columns using the column names in addition to the column numbers. For example, we could specify select(carData, mpg:wt). Choosing the columns that we want is simplified via a number of additional utility functions, as listed in Table 12.1.


TABLE 12.1 Utility Functions for Selecting Columns


Caution: Specialist Functions Within select

The functions described in Table 12.1 only work inside the select function and cannot be used to find patterns in standard character vectors.
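To illustrate the kind of helpers typically listed in Table 12.1, here is a short sketch using two that dplyr provides, starts_with and contains. Because select also accepts plain data frames, we can use mtcars directly:

```r
library(dplyr)

# Keep columns whose names start with "c" (here: cyl and carb)
select(mtcars, starts_with("c"))

# Keep columns whose names contain an "a" (here: drat, am, gear, and carb)
select(mtcars, contains("a"))
```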


Adding New Columns

The mutate function enables us to easily add new columns to our data. We can either provide a vector of values, in the same way we would with a standard data frame, or we can create new columns from existing variables. In the following example, we create a new column containing the original row names from the mtcars data frame. We then use the information contained within the hp and wt columns to create a second new column containing the power-to-weight ratio.

> fullCarData <- mutate(carData, type = rownames(mtcars), pwr2wt = hp/wt)
> fullCarData
Source: local data frame [32 x 13]

    mpg cyl  disp  hp drat    wt  qsec vs am gear carb              type   pwr2wt
1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4         Mazda RX4 41.98473
2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4     Mazda RX4 Wag 38.26087
3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1        Datsun 710 40.08621
4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1    Hornet 4 Drive 34.21462
5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2 Hornet Sportabout 50.87209
6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1           Valiant 30.34682
7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4        Duster 360 68.62745
8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2         Merc 240D 19.43574
9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2          Merc 230 30.15873
10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4          Merc 280 35.75581
..  ... ...   ... ...  ...   ...   ... .. ..  ...  ...               ...      ...

We can also drop columns by assigning existing names to NULL. The mutate function is similar to the base R function transform. However, unlike transform, the mutate function creates variables in the order in which we specify them, allowing newly created variables to be used immediately when defining further variables.

> fullCarData <- mutate(carData, type = rownames(mtcars),
+                       drat = NULL, qsec = NULL,
+                       pwr2wt = hp/wt, pwr2wt.Sq = pwr2wt^2)
> head(fullCarData,3)
Source: local data frame [3 x 12]

   mpg cyl disp  hp    wt vs am gear carb          type   pwr2wt pwr2wt.Sq
1 21.0   6  160 110 2.620  0  1    4    4     Mazda RX4 41.98473  1762.718
2 21.0   6  160 110 2.875  0  1    4    4 Mazda RX4 Wag 38.26087  1463.894
3 22.8   4  108  93 2.320  1  1    4    1    Datsun 710 40.08621  1606.904

Merging

In Hour 11, you saw how the merge function can be used to merge data frames. The merge function allows us to specify arguments such as all.x in order to achieve what is also commonly known as a “left join.” In contrast, dplyr splits these arguments out into separate functions. These can be seen in Table 12.2. As with merge, we refer to our two datasets as x and y.


TABLE 12.2 Functions for Merging Data in dplyr

The first four functions listed in Table 12.2 operate in the same way as the merge function. For example, inner_join(demoData, pkData) provides an equivalent to merge(demoData, pkData). In addition, dplyr offers us the concepts of a semi-join and an anti-join. The semi_join function does not actually perform a merge. Instead, it returns rows in x that would be retained if we were to merge x with y. Conversely, the anti_join function returns rows of x that would not be retained if we were to merge with y. Listing 12.2 illustrates a semi-join and an anti-join using two (fabricated) sample data frames.

LISTING 12.2 Sample Joins


 1 : > # Fabricate two datasets to merge
 2 : > beerData <- data.frame(ID = c(1, 2, 3), Beer = c(75, 64, 92))
 3 : > diaperData <- data.frame(ID = c(1, 3, 4), Diapers = c(51, 68, 32))
 4 : > beerData
 5 :   ID Beer
 6 : 1  1   75
 7 : 2  2   64
 8 : 3  3   92
 9 : > diaperData
10 :   ID Diapers
11 : 1  1      51
12 : 2  3      68
13 : 3  4      32
14 : >
15 : > # Rows of beerData that have a corresponding "ID" in diaperData
16 : > semi_join(beerData, diaperData, by = "ID")
17 :   ID Beer
18 : 1  1   75
19 : 2  3   92
20 : > # Rows of beerData that do not have a corresponding "ID" in diaperData
21 : > anti_join(beerData, diaperData, by = "ID")
22 :   ID Beer
23 : 1  2   64
24 : > # An inner join of the two datasets
25 : > inner_join(beerData, diaperData, by = "ID")
26 :   ID Beer Diapers
27 : 1  1   75      51
28 : 2  3   92      68


Note that in each case we specified the “by” variable for the merge as "ID", but we did not have to. Like merge, each of the dplyr *_join functions will automatically determine the merge-by variables for us if we do not specify them. Because we stated that the data in the example is to be merged by the ID variable, the semi-join looks for ID values in beerData that also appear in diaperData. These are the rows that would be merged using either inner_join (as in lines 25 to 28) or left_join. Accordingly, anti_join returns the remaining rows that would not be merged.
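Although Listing 12.2 focuses on semi- and anti-joins, the same fabricated data also illustrates the first four functions in Table 12.2. A left join, for example, keeps every row of x and fills unmatched columns from y with NA:

```r
library(dplyr)

beerData   <- data.frame(ID = c(1, 2, 3), Beer = c(75, 64, 92))
diaperData <- data.frame(ID = c(1, 3, 4), Diapers = c(51, 68, 32))

# All three rows of beerData are kept; ID 2 has no match, so Diapers is NA
left_join(beerData, diaperData, by = "ID")
```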

Aggregation

In addition to facilitating data manipulation, dplyr also provides an easy-to-use syntax for data aggregation that is a marked improvement upon its more generic predecessor, the plyr package. In dplyr terminology, data aggregation is referred to as a data summary. We therefore use a function called summarize to obtain numeric summaries of our data. As always, when using dplyr we pass the data as the first argument. In the subsequent arguments we can use standard summary functions to summarize columns in the data. In the following example, we use the mean function to summarize the mpg column within carData:

> summarize(carData, mean(mpg))
Source: local data frame [1 x 1]

  mean(mpg)
1  20.09062

We can summarize using any function we like, including custom-written functions. The only restrictions are that the function we use must expect a vector as the input and that it must return a single value. We cannot therefore use a function such as range because this returns a vector of length 2. However, we can make as many summaries as we like in a single call to summarize.

> summarize(carData, min(mpg), median(mpg), max(mpg))
Source: local data frame [1 x 3]

  min(mpg) median(mpg) max(mpg)
1     10.4        19.2     33.9
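As an illustration of the point about custom-written functions, the sketch below defines a hypothetical helper, cv (the coefficient of variation), and uses it inside summarize; any function that maps a vector to a single value would work the same way:

```r
library(dplyr)

# A custom single-value summary: the coefficient of variation
cv <- function(x) sd(x) / mean(x)

summarize(mtcars, CV.mpg = cv(mpg), CV.wt = cv(wt))
```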

When creating multiple summaries in this way, it can be helpful to be able to manually control the labels of the resulting data. In order to do so we simply specify the name of the resulting output column when creating the summary, as follows:

> mpgSummary <- summarize(carData, Min=min(mpg), Median=median(mpg), Max=max(mpg))
> mpgSummary
Source: local data frame [1 x 3]

   Min Median  Max
1 10.4   19.2 33.9

Sometimes we may find that we need to pass additional arguments to our summary functions. For example, we may need to specify na.rm = TRUE when summarizing a variable with missing values. In order to pass extra arguments to our summary functions, we pass the arguments as if we were calling the function directly. Here’s an example:

summarize(airquality, mean(Ozone, na.rm = TRUE))

Grouped Data

If all we needed to do was summarize columns of data using standard numeric summary functions, then dplyr doesn’t really offer anything new. If anything, it makes the process more tedious. However, the real advantage of using the summarize function is that it facilitates easy “by” operations. In order to summarize our data by variable(s), we use the group_by function to define a grouping within our data. We can actually group our data at any time, and the grouping will be retained by any other operations we perform. We can group by as many variables as we like.

To demonstrate the concept of grouped data, let’s group carData by the cyl variable and observe what happens when we filter the data by carb. The code for the operation is shown in Listing 12.3.

LISTING 12.3 The Effect of group_by


 1: > cylGrouping <- group_by(carData, cyl)
 2: > head(cylGrouping)
 3: Source: local data frame [6 x 11]
 4: Groups: cyl
 5:
 6:    mpg cyl disp  hp drat    wt  qsec vs am gear carb
 7: 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
 8: 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
 9: 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
10: 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
11: 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
12: 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
13: >
14: > filter(cylGrouping, carb == 4)
15: Source: local data frame [10 x 11]
16: Groups: cyl
17:
18:     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
19: 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
20: 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
21: 3  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
22: 4  19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
23: 5  17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
24: 6  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
25: 7  10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
26: 8  14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
27: 9  13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
28: 10 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4


Notice first of all that grouping by the cyl variable has the effect of adding a line to the output (see line 4). As can be seen in line 16, the cyl grouping was retained when we filtered the data. In both cases the sort order remains unaffected by the grouping. The effect of grouping our data is only felt when we summarize it. In the following example, we summarize the mpg column in our grouped data, cylGrouping:

> mpgSummaryByCyl <- summarize(cylGrouping, min(mpg), median(mpg), max(mpg))
> mpgSummaryByCyl
Source: local data frame [3 x 4]

  cyl min(mpg) median(mpg) max(mpg)
1   4     21.4        26.0     33.9
2   6     17.8        19.7     21.4
3   8     10.4        15.2     19.2

The result of performing a summary operation on grouped data is that the output is summarized by each level of the grouping variable(s). In keeping with the concept of tidy data, the output is a data frame (in fact, a tbl_df). The operation returns a separate column for each variable that we grouped by, with additional columns for each summary we specified.
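Grouping by more than one variable works in the same way. The sketch below groups mtcars by both cyl and am and adds a row count using dplyr's n() helper, producing one output row per cyl/am combination:

```r
library(dplyr)

byCylAm <- group_by(mtcars, cyl, am)

# One row per combination of cyl and am, plus a count of cars in each group
summarize(byCylAm, Mean.MPG = mean(mpg), N = n())
```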

Other Uses of group_by

You have already seen that when we filter our data, the grouping variables are retained. However, we can also use the grouping to our advantage within the filter itself. In the following example, we use a grouping on the cyl variable to extract the maximum mpg value for each value of cyl. The comparison mpg == max(mpg) is performed within each group (that is, each value of cyl).

> cylGrouping <- group_by(carData, cyl)
> # Extract maximum mpg for each cyl category
> filter(cylGrouping, mpg == max(mpg))
Source: local data frame [3 x 11]
Groups: cyl

   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
2 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
3 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2

Grouping our data also facilitates the generation of new aggregation variables. For example, we could create a new variable, meanMPGbyCyl, that is the mean of the mpg column for each value of cyl, as shown here:

> mutate(cylGrouping, meanMPGbyCyl = mean(mpg))
Source: local data frame [32 x 12]
Groups: cyl

    mpg cyl  disp  hp drat    wt  qsec vs am gear carb meanMPGbyCyl
1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4     19.74286
2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4     19.74286
3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1     26.66364
4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1     19.74286
5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2     15.10000
6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1     19.74286
7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4     15.10000
8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2     26.66364
9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2     26.66364
10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4     19.74286
..  ... ...   ... ...  ...   ...   ... .. ..  ...  ...          ...


Note: Remove a Grouping

We can remove any groupings in our data using the ungroup function.


The Pipe Operator

Functions in dplyr have been written in order to take advantage of what is commonly referred to as the “pipe” operator. The pipe operator, %>%, originates in the magrittr package and is by no means restricted to usage within dplyr. The pipe operator allows us to chain functions together such that the output from one function becomes the input to the first argument (by default) of the next. This has led to it being called the “then” operator in some quarters (do this, then this, then this, and so on). It is particularly useful if we have many steps to perform on a single type of object such as a data frame. The advantage of this approach is that it avoids intermediary objects (that is, those that we create simply to break up nested function calls).
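Before applying the pipe to dplyr, a minimal sketch shows the mechanics: x %>% f(y) is shorthand for f(x, y), so each result feeds the first argument of the next call. (This assumes the magrittr package is installed; dplyr re-exports %>%, so loading dplyr works too.)

```r
library(magrittr)

# Equivalent to sum(sqrt(c(1, 4, 9)))
c(1, 4, 9) %>% sqrt() %>% sum()    # returns 6
```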


Note: Piping to Other Arguments

When you use the pipe operator, the output from a function does not have to be used as the input to the first argument of the next function. It can in fact become the input to any argument within the following function. However, the code is generally a lot more readable if we feed the output into the first argument of the following function.


The dplyr package has been written with the pipe operator very much in mind. In a typical analysis workflow we might arrange, filter, select, mutate, group_by, and summarize several times over. Each of these functions takes a data frame as its first input and returns another data frame as the output. This is ideal for piping together function calls. Consider the example in Listing 12.4 using mtcars. In the first instance we use the traditional approach to data processing. To avoid nesting, we end up creating three intermediate datasets on the way to obtaining our summary. We then perform the same operations using the pipe operator. In the second case, no intermediate datasets are required.

LISTING 12.4 Workflow Examples With and Without the Pipe Operator


 1: > # A standard workflow, mean mpg by cyl for manual cars
 2: > # The traditional way:
 3: > carsByCyl <- arrange(mtcars, cyl)
 4: > groupByCyl <- group_by(carsByCyl, cyl)
 5: > manualCars <- filter(groupByCyl, am == 1)
 6: > summarize(manualCars, Mean.MPG=mean(mpg))
 7: Source: local data frame [3 x 2]
 8:
 9:   cyl Mean.MPG
10: 1   4 28.07500
11: 2   6 20.56667
12: 3   8 15.40000
13: >
14: > # Using pipes
15: > mtcars %>%
16: +   arrange(cyl) %>%
17: +   group_by(cyl) %>%
18: +   filter(am == 1) %>%
19: +   summarize(Mean.MPG=mean(mpg))
20: Source: local data frame [3 x 2]
21:
22:   cyl Mean.MPG
23: 1   4 28.07500
24: 2   6 20.56667
25: 3   8 15.40000


The pipe operator is not to everyone’s taste, and it can be harder to debug than well-written code using a traditional syntax. However, it is becoming an increasingly popular means of working with data—and before long it may not be possible to avoid it!

Efficient Data Handling with data.table

The data.table package predates dplyr by several years, having been first released to CRAN in April 2006. However, it is still actively maintained by its primary author and maintainer, Matt Dowle, and despite the growing popularity of the dplyr package, data.table remains one of the most popular and well-documented packages on CRAN. In addition to the standard help and a quick-start guide, Matt Dowle has written an extensive FAQ document tackling some of the package’s less-obvious aspects.

The focus of the package is very much on reading, processing, and aggregating large data efficiently. The data.table object is essentially an enhancement to the data.frame class. It allows us to index, merge, and group data much faster than we can with standard data frames.

Creating a data.table

Like any analysis workflow, the data.table workflow begins with importing data. In Hour 10 we looked briefly at the performance of the fread function contained within data.table. The fread function is similar to read.table in terms of usage, though it is much faster for large datasets. Conveniently, the output of the function is a data.table object.

> dji <- fread("djiData.csv")
> dji
           Date DJI.Open DJI.High  DJI.Low DJI.Close DJI.Volume DJI.Adj.Close
  1: 12/31/2014 17987.66 18043.22 17820.88  17823.07   82840000      17823.07
  2: 12/30/2014 18035.02 18035.02 17959.70  17983.07   47490000      17983.07
  3: 12/29/2014 18046.58 18073.04 18021.57  18038.23   53870000      18038.23
  4: 12/26/2014 18038.30 18103.45 18038.30  18053.71   52570000      18053.71
  5: 12/24/2014 18035.73 18086.24 18027.78  18030.21   42870000      18030.21
 ---
248: 01/08/2014 16527.66 16528.88 16416.69  16462.74  103260000      16462.74
249: 01/07/2014 16429.02 16562.32 16429.02  16530.94   81270000      16530.94
250: 01/06/2014 16474.04 16532.99 16405.52  16425.10   89380000      16425.10
251: 01/03/2014 16456.89 16518.74 16439.30  16469.99   72770000      16469.99
252: 01/02/2014 16572.17 16573.07 16416.49  16441.35   80960000      16441.35

The appearance of a data table is similar to that of a standard data frame. When we choose to print a small dataset (one containing 100 rows or fewer), the entire dataset is returned, but with the header row repeated at the base of the table. For larger datasets, only the first and last five rows are returned. We can turn existing data frames into data.table objects by calling the data.table function directly—for example, air <- data.table(airquality). We can also create a data.table from scratch in the same way we would using the data.frame function.


Tip: Keeping Track of Tables

If we create many data table objects, the tables function can be used to find out what tables we have, what they contain, and how much memory they have been allocated.
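A minimal sketch of the conversions and the tables function described above (airquality ships with R; the output of tables reflects whatever data tables exist in your session, so the names and sizes shown will vary):

```r
library(data.table)

air    <- data.table(airquality)                        # from an existing data frame
scores <- data.table(ID = 1:3, Score = c(10, 20, 30))   # from scratch

tables()   # one row per data.table in memory: name, rows, columns, size
```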


Setting a Key

A primary focus of the data.table package is performance. To achieve this performance, we define a key. In some ways this is similar to a primary key that would be used in a relational database. However, in data.table the key can be made up of several columns and does not have to be unique. In fact, it is often more useful if the key is not unique. The key is used for sorting, indexing, and summarizing. It is defined using a function called setkey. In Listing 12.5 we define a simple data.table using the demoData data in the mangoTraining package and then set the key based on the variables Sex and Smokes.

LISTING 12.5 Defining a Key


 1: > # Create a data.table and define the key
 2: > demoDT <- data.table(demoData)
 3: > setkey(demoDT, Sex, Smokes)
 4: > head(demoDT)
 5:    Subject Sex Age Weight Height  BMI Smokes
 6: 1:       3   F  23     72    170 25.1     No
 7: 2:       6   F  29     67    169 23.5     No
 8: 3:      12   F  32     77    182 23.1     No
 9: 4:      15   F  27     73    172 24.8     No
10: 5:      23   F  26     82    175 26.8     No
11: 6:      26   F  25     58    175 18.9     No


The obvious effect of defining a key is that when printing, the data is sorted by the key variables from left to right as we defined them. In Listing 12.5 the rows are sorted by Sex and then by Smokes. The key is not just for printing purposes, however. It enables faster indexing when subscripting.

Notice that we wrote setkey(demoDT, Sex, Smokes) as opposed to demoDT <- setkey(demoDT, Sex, Smokes). Functions in data.table update the data table directly, so we do not need to use <- to copy/replace the original data. Updating by reference in this way reduces the memory required to perform manipulation tasks and improves speed.
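The update-by-reference behavior is easy to verify on a small, made-up table; setkey returns its result invisibly and reorders the existing object rather than a copy:

```r
library(data.table)

dt <- data.table(id = c(2, 1, 3), val = c("b", "a", "c"))
setkey(dt, id)    # no assignment needed: dt itself is sorted and keyed

key(dt)    # "id"
dt$id      # 1 2 3: the rows were reordered in place
```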


Tip: Querying the Key

We can find out if a data table has a key using the function haskey, which returns TRUE if the data table has a key and FALSE otherwise.

The key function tells us what the key is.


Subscripting

In the data.table syntax, we can reference columns directly as if they were objects in their own right. In other words, we can drop the “dataName$” syntax. This saves some typing, though the real benefit is the speed gain we get from using data.table in the first place.

> demoDT[Sex == "F",]
    Subject Sex Age Weight Height  BMI Smokes
 1:       3   F  23     72    170 25.1     No
 2:       6   F  29     67    169 23.5     No
 3:      12   F  32     77    182 23.1     No
 4:      15   F  27     73    172 24.8     No
 5:      23   F  26     82    175 26.8     No
 6:      26   F  25     58    175 18.9     No
 7:      28   F  28     69    172 23.4     No
 8:      30   F  33     61    175 19.9     No
 9:      17   F  41     62    172 20.9    Yes
10:      27   F  36     82    190 22.6    Yes

If our data table has a key and we want to subset by that key, we can go one step further and drop the reference to the variable we want to subset altogether (for example, demoDT["F",]). In fact, we don’t even need the comma to specify rows as we would with a data frame, though it can sometimes be confusing to leave it out.

If we have defined a key using multiple variables, we can provide the subset values by separating them with commas. We enclose the values using J(), where J stands for “join.” In the following example, we subset the demography data to return female smokers:

> key(demoDT)
[1] "Sex"    "Smokes"
> demoDT[J("F", "Yes"),]
   Subject Sex Age Weight Height  BMI Smokes
1:      17   F  41     62    172 20.9    Yes
2:      27   F  36     82    190 22.6    Yes


Note: Alternatives to J

The J function is the data.table specification of a “join” of two keys. The practice of joining based on keys has its roots in SQL, but in practice it is just a means of separating variables. As an alternative, the base function list or data.table’s shorthand alias .() can be used in exactly the same way.


Occasionally we may want to return a subset in which the variables of interest match multiple criteria. To achieve this we can specify a vector of values. If we have defined a key from multiple variables, any vector we specify must be contained within a call to the J function. An example of this is shown here:

> setkey(demoDT, Sex, Weight)
> demoDT[J("M", c(76, 77)),]
   Subject Sex Age Weight Height  BMI Smokes
1:       4   M  25     76    188 21.4     No
2:      31   M  25     76    174 25.1     No
3:      13   M  21     77    180 23.6     No
4:      20   M  22     77    183 23.1     No


Caution: Numeric Keys

The data.table package allows us to define a key using numeric variables. However, in order to subset using these keys we must wrap the values in .() (or J()). This is because, like data frames, data tables also allow us to subset by specifying row numbers, so demoDT[72,] would return row 72 rather than the rows where the key equals 72. If we wanted to return all the rows in demoDT for which Weight is equal to 72, we would write the following:

> setkey(demoDT, Weight)
> demoDT[.(72),]
   Subject Sex Age Weight Height  BMI Smokes
1:       3   F  23     72    170 25.1     No


Adding New Columns and Rows

The data.table package makes adding variables to an existing data table much easier and quicker than when working with standard data frames. Whenever we add a column to a standard data frame, we make a copy of the data. When we work with data tables, the new column is instead appended by reference; in other words, R points to the existing table and tells it to add a new column. This makes it much faster and more efficient.

Adding and Renaming Columns

We create new variables in our data, by reference, using the := operator inside the square brackets of an existing data table; no standard R assignment with <- is required. If we are generating the new variable from existing variables, we refer to them directly, as in the following example:

> demoDT[, HeightInM.sq := (Height^2)/10000]
> head(demoDT)
   Subject Sex Age Weight Height  BMI Smokes HeightInM.sq
1:       1   M  43     57    166 20.7     No       2.7556
2:       2   M  22     71    179 22.2     No       3.2041
3:       3   F  23     72    170 25.1     No       2.8900
4:       4   M  25     76    188 21.4     No       3.5344
5:       5   M  29     82    175 26.8     No       3.0625
6:       6   F  29     67    169 23.5     No       2.8561


Caution: Updating the Values in the Key

If we update the values in any of the columns that make up our key, we need to redefine the key.


To create multiple new columns, we must provide the names of the new columns as a character vector and the transformations as a list. The vector of names and list of transformations should be separated by the := operator, as shown in Listing 12.6. We can also remove columns by setting them to NULL using the := operator.

LISTING 12.6 Creating New Columns


 1: > demoDT[, c("SexNum", "SmokesNum") := list(as.numeric(Sex), as.numeric(Smokes))]
 2: > head(demoDT)
 3:    Subject Sex Age Weight Height  BMI Smokes HeightInM.sq SexNum SmokesNum
 4: 1:       1   M  43     57    166 20.7     No       2.7556      2         1
 5: 2:      26   F  25     58    175 18.9     No       3.0625      1         1
 6: 3:      30   F  33     61    175 19.9     No       3.0625      1         1
 7: 4:      22   M  27     61    170 21.0     No       2.8900      2         1
 8: 5:      17   F  41     62    172 20.9    Yes       2.9584      1         2
 9: 6:      14   M  26     64    170 22.0     No       2.8900      2         1


We can rename columns using the setnames function. Once again the renaming is performed by reference to avoid copying the entire dataset. The setnames function expects a data table as its first argument, with further arguments old and new, which respectively expect a vector of column names to change from and to.
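
For example, on a toy table with a Height column, we might rename it as follows (the table and the name HeightCM are purely illustrative):

```r
library(data.table)

dt <- data.table(Height = c(166, 179), Weight = c(57, 71))

# Rename Height by reference; the table is not copied
setnames(dt, old = "Height", new = "HeightCM")
names(dt)   # "HeightCM" "Weight"
```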


Note: Multiple Ways to Create New Variables

There are normally several ways of doing the same thing with data.table, and everyone tends to have their preference. In order to create new variables in Listing 12.6, we could also have used the following syntax:

demoDT[, `:=` (SexNum = as.numeric(Sex), SmokesNum = as.numeric(Smokes))]

We could also have used the set function to achieve the same result.


Adding Rows

Although the rbind function in base R can be used to append rows to a data table, the function rbindlist is optimized for speed and memory efficiency. The rbindlist function joins any number of data tables and/or regular data frames, which must first be stored together in a list. Unlike the standard rbind that we looked at in Hour 11, rbindlist permits us to bind together datasets whose column names do not match by setting fill = TRUE. An example of this is shown in Listing 12.7. First we generate a list by splitting the airquality data by the Month variable, then combine it back together in line 5. We use rbindlist again in line 24 to add new rows of data.

LISTING 12.7 Adding New Rows


 1: > # Create a list containing airquality data for each available month
 2: > airSplit <- split(airquality, airquality$Month)
 3: >
 4: > # Bind these together into a single data table
 5: > airDT <- rbindlist(airSplit)
 6: > airDT
 7:      Ozone Solar.R Wind Temp Month Day
 8:   1:    41     190  7.4   67     5   1
 9:   2:    36     118  8.0   72     5   2
10:   3:    12     149 12.6   74     5   3
11:   4:    18     313 11.5   62     5   4
12:   5:    NA      NA 14.3   56     5   5
13:  ---
14: 149:    30     193  6.9   70     9  26
15: 150:    NA     145 13.2   77     9  27
16: 151:    14     191 14.3   75     9  28
17: 152:    18     131  8.0   76     9  29
18: 153:    20     223 11.5   68     9  30
19: >
20: > # Now assume two new records arrive but with missing columns
21: > month10 <- data.table(Ozone = c(24, 28), Month = 10, Day = 1:2)
22: >
23: > # Bind this to our original data
24: > newAirDT <- rbindlist(list(airDT, month10), fill = TRUE)
25: > tail(newAirDT)
26:    Ozone Solar.R Wind Temp Month Day
27: 1:    NA     145 13.2   77     9  27
28: 2:    14     191 14.3   75     9  28
29: 3:    18     131  8.0   76     9  29
30: 4:    20     223 11.5   68     9  30
31: 5:    24      NA   NA   NA    10   1
32: 6:    28      NA   NA   NA    10   2


Merging

Merging data tables works in much the same way as a typical merge on a data frame using the merge function. However, the default behavior of merge for data tables is to use the respective keys of the two data tables. We must therefore either define keys for the two data tables or specify the “by” variables manually. In Listing 12.8 we create two data tables from the demoData and pkData data frames contained within the mangoTraining package and set the keys accordingly. In line 8 we perform a merge similar to that used in Hour 11.

LISTING 12.8 Merging Two Data Tables


 1: > # Create data tables and define the keys accordingly
 2: > demoDT <- data.table(demoData)
 3: > setkey(demoDT, Subject)
 4: > pkDT <- data.table(pkData)
 5: > setkey(pkDT, Subject)
 6: >
 7: > # Merge the two data tables together
 8: > allPKDT <- merge(demoDT, pkDT)
 9: > allPKDT
10:      Subject Sex Age Weight Height  BMI Smokes Dose Time   Conc
11:   1:       1   M  43     57    166 20.7     No   25    0   0.00
12:   2:       1   M  43     57    166 20.7     No   25    1 660.13
13:   3:       1   M  43     57    166 20.7     No   25    6 178.92
14:   4:       1   M  43     57    166 20.7     No   25   12  88.99
15:   5:       1   M  43     57    166 20.7     No   25   24  42.71
16:  ---
17: 161:      33   M  30     80    180 24.8     No   25    0   0.00
18: 162:      33   M  30     80    180 24.8     No   25    1 453.13
19: 163:      33   M  30     80    180 24.8     No   25    6 205.30
20: 164:      33   M  30     80    180 24.8     No   25   12 146.69
21: 165:      33   M  30     80    180 24.8     No   25   24  46.84


For large datasets you will notice that using merge with data tables is significantly faster than with data frames. For those who need that little bit of extra performance, however, the package offers an alternative that is even faster. To perform the data table merge, we return to using square brackets, putting one data table inside the brackets and one outside. Note that demoDT[pkDT] keeps every row of pkDT whether or not it matches a row of demoDT (supplying nomatch = 0 would give a strict inner join); here every subject appears in both tables, so the result is the same as the standard merge:

> demoDT[pkDT]
     Subject Sex Age Weight Height  BMI Smokes Dose Time   Conc
  1:       1   M  43     57    166 20.7     No   25    0   0.00
  2:       1   M  43     57    166 20.7     No   25    1 660.13
  3:       1   M  43     57    166 20.7     No   25    6 178.92
  4:       1   M  43     57    166 20.7     No   25   12  88.99
  5:       1   M  43     57    166 20.7     No   25   24  42.71
 ---
161:      33   M  30     80    180 24.8     No   25    0   0.00
162:      33   M  30     80    180 24.8     No   25    1 453.13
163:      33   M  30     80    180 24.8     No   25    6 205.30
164:      33   M  30     80    180 24.8     No   25   12 146.69
165:      33   M  30     80    180 24.8     No   25   24  46.84

Aggregation

In addition to transforming and manipulating our data, we can also use data.table to summarize our data. As usual, we start by specifying the name of the data and use square brackets to create a summary. We can perform simple summary operations on columns using standard statistical summary functions such as mean.

> # Calculate the mean height
> demoDT <- data.table(demoData)
> demoDT[ , mean(Height)]
[1] 176.1515

So far we have seen nothing special. However, data.table permits the use of a “by” argument, which allows aggregation. The return object is also a data table. Here, we calculate the mean height again by sex:

> demoDT[ , mean(Height), by = Sex]
   Sex       V1
1:   M 176.5652
2:   F 175.2000


Tip: Counting Records

In data.table we can use .N to count records within by-groups. For example, to count the number of males and females in the demoDT data table, we would write demoDT[, .N, by = Sex].
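
Using a toy table, the tip above can be sketched as follows:

```r
library(data.table)

dt <- data.table(Sex = c("M", "F", "F", "M", "M"))

# .N gives the number of rows in each by-group
dt[, .N, by = Sex]   # M: 3, F: 2
```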


We can summarize by multiple variables by providing them as a list using . or list. The result is another data table with a column for each “by” variable and an additional column for the summary.

> demoDT[ , mean(Height), by = list(Sex, Smokes)]
   Sex Smokes       V1
1:   M     No 177.3158
2:   F     No 173.7500
3:   M    Yes 173.0000
4:   F    Yes 181.0000

We can provide multiple summaries and name them using a list. Again, the result is a data table.

> demoDT[ , list(Mean.Height = mean(Height), Mean.Weight = mean(Weight)),
+ by = list(Sex, Smokes)]
   Sex Smokes Mean.Height Mean.Weight
1:   M     No    177.3158    74.10526
2:   F     No    173.7500    69.87500
3:   M    Yes    173.0000    74.25000
4:   F    Yes    181.0000    72.00000


Caution: Summary Functions That Return Multiple Values

It is possible to summarize using functions that return multiple values, such as range and quantile. However, the effect is that a new row is created for each element of the return vector—for example, one for the minimum and one for the maximum if using range. Other than the sort order, there is no way to tell which row corresponds to which value in the output vector.
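
To illustrate with made-up data, the following sketch summarizes using range, producing two rows per group:

```r
library(data.table)

dt <- data.table(Sex = c("M", "M", "F", "F"),
                 Height = c(166, 188, 170, 175))

# range() returns c(min, max), so each Sex group yields two rows:
# the first holds the group minimum, the second the group maximum
dt[, range(Height), by = Sex]
```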


The aggregation that we have seen thus far creates a new data table that we can use for publishing, plotting, or modeling. The original table is unaffected by the operation. However, if we want to merge the results of the aggregation back on to the original data, we can easily do so using the := operator.

> demoDT[, MeanWeightBySex := mean(Weight), by = Sex]
> head(demoDT, 5)
   Subject Sex Age Weight Height  BMI Smokes MeanWeightBySex
1:       1   M  43     57    166 20.7     No        74.13043
2:       2   M  22     71    179 22.2     No        74.13043
3:       3   F  23     72    170 25.1     No        70.30000
4:       4   M  25     76    188 21.4     No        74.13043
5:       5   M  29     82    175 26.8     No        74.13043

In order to generate multiple summaries, we may use any of the methods associated with := for creating new variables.
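
For example, one possible sketch (on a toy table, with the illustrative column names MeanHeight and MeanWeight) combines := with a list of summaries and a “by” group:

```r
library(data.table)

dt <- data.table(Sex = c("M", "M", "F"),
                 Height = c(166, 188, 170),
                 Weight = c(57, 71, 72))

# Append two grouped summary columns to the original table in one step
dt[, c("MeanHeight", "MeanWeight") := list(mean(Height), mean(Weight)),
   by = Sex]
```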

More with data.table

There are always many ways of achieving the same goal using data.table, and we have presented just a small selection of options in most cases. There are also many more features, such as rolling means, that we simply do not have the time to cover in any detail. If you are interested in digging into data.table further, Matt Dowle has crammed the package help files full of examples. The package FAQ offers further guidance.

Too Large for data.table

For the vast majority of readers, dplyr and data.table will be more than sufficient for your data needs. In particular, data.table has been shown to be extremely performant. On a standard desktop, it can comfortably handle basic summary operations on datasets with a billion rows, containing several thousand groups, within a matter of minutes. However, for some that might still not be enough!

Without parallelizing your code and/or turning to high-performance computing solutions, you might find two further packages to be of assistance. The first of these is bigmemory. The bigmemory package is designed to work with matrices that can be held in your computer’s memory but cannot be processed by standard R functions for data structures. The package takes advantage of C++ and allows objects to be shared across multiple sessions on the same machine.

An alternative approach to handling very large datasets is to use the ff package. Instead of storing large datasets in memory, the ff package stores data on disk. Only a tiny portion of the data is ever mapped to memory. Though the data is stored on disk, it behaves in very much the same way as standard R objects held in memory. On the back end, C++ is used to perform the requested operations.

Still further options are available beyond the two packages covered in this hour, though typically they involve parallelizing your operation and are beyond the scope of this tutorial.

Summary

In this hour, we have looked at the two most popular packages for efficient data handling in R: dplyr and data.table. We have looked at the basic syntax of the packages as well as common data-handling tasks such as sorting, subscripting, merging, and aggregation. If you are still unsure as to which is right for you, you can now have a go at using them both during the workshop.

Having seen how R can be used to import and manipulate data, we will spend the next three hours looking at how we can visualize our data using the graphics package and the popular alternatives lattice and ggplot2.

Q&A

Q. Which is better, dplyr or data.table?

A. In short, it depends! In terms of speed, most benchmarking examples show the packages to be comparable to a point, but as the number of rows and/or groups increases, data.table comes out on top. If speed or memory usage matter to you and you have more than a million rows or 100,000 groups within your data, you should probably use data.table. If data size (and hence performance) is not that important to you, choose whichever you feel more comfortable with.

Q. We have now seen a data.frame, a tbl_df, and a data.table. Why do I need to learn about three different structures?

A. First of all, both a tbl_df and a data.table are just an extension to a data.frame. Generally, there is therefore very little difference, though functions such as print behave in a slightly different manner for tbl_df and data.table objects than they do with a data.frame. This is due to R’s S3 class system, which we will look at more closely in Hour 16, “Introduction to R Models and Object Orientation,” and then again in Hour 21, “Writing R Classes.”

Workshop

The workshop contains quiz questions and exercises to help you solidify your understanding of the material covered. Try to answer all questions before looking at the “Answers” section that follows.

Quiz

1. True or false? When using select, you must provide a character vector of column names.

2. Which of the following is a dplyr function that allows you to create new columns?

A. transform

B. subset

C. mutate

3. Assuming you have created a data.table object called demoDT from the demoData data frame and set the key to be the Smokes column, which of the following would return a subset containing all records for subjects that smoke?

A. demoDT[demoDT$Smokes == "Yes", ]

B. demoDT[Smokes == "Yes", ]

C. demoDT["Yes", ]

D. demoDT["Yes"]

4. What is “wrong” with the following syntax when working with a data.table called demoDT?

demoDT$Height.Sq <- demoDT$Height^2

Answers

1. False. You specify each column name as a separate argument. In fact, if you do try to use a character vector, the function will return an error.

2. C. The transform and subset functions are contained in the base R package. The transform function is actually quite similar to mutate, though it does not allow you to base new variables on other variables that you are creating within the call to transform. The subset function offers similar functionality to the dplyr functions filter and select.

3. A, B, C and D. The data.table syntax is extremely flexible, and all four methods achieve the same end result.

4. Nothing is technically “wrong” with the statement, though data.table is optimized for efficiency, and the command shown is a standard, less efficient way of creating a new column, Height.Sq. The more efficient method in data.table would be

demoDT[, Height.Sq := Height^2]

Activities

1. Using the dplyr package, perform the following actions:

Image Create a tbl_df object named air from the airquality data frame.

Image Sort the data by the Wind column.

Image Remove any rows for which the Ozone column has a missing value.

Image Remove the Solar.R column and create a new column containing the ratio of Ozone to Wind.

Image Create a subset of the original airquality data containing just three columns: Month, Day, and Solar.R. The data should only contain data for June and July. Name the output solar.

Image Merge the air and solar datasets together, retaining all records from the air dataset (that is, a left join).

Image Calculate the median Ozone value by Month for the merged data.

2. Now, using the data.table package, perform the same actions:

Image Create a data.table object named air from the airquality data frame.

Image Sort the data by the Wind column.

Image Remove any rows for which the Ozone column has a missing value.

Image Remove the Solar.R column and create a new column containing the ratio of Ozone to Wind.

Image Create a subset of the original airquality data containing just three columns: Month, Day, and Solar.R. The data should only contain data for June and July. Name the output solar.

Image Merge the air and solar datasets together, retaining all records from the air dataset (that is, a left join).

Image Calculate the median Ozone value by Month for the merged data.
