What You’ll Learn in This Hour:
Sorting
Setting and merging
Handling duplicate values
Restructuring data frames
Data aggregation
In the previous hour, we walked through a variety of methods for reading data into R as well as exporting it. This included working with flat files, R’s .RData format, databases, and Microsoft Excel. However, reading data into R is only the start of the data analysis workflow. As data scientists and statisticians, we rarely get to control the structure and format of our data. In Hour 5, “Dates, Times, and Factors,” and Hour 6, “Common R Utility Functions,” you saw some useful functions for working with the format of your data. We looked at dates, times, factors, and missing data. We also looked at common functions for working with numeric and character data. Now we will look a little closer at the structure of our data.
Analysts quote all kinds of figures for the proportion of a data analysis workflow taken up by data manipulation, or "data munging" as it is increasingly known. One thing most people agree on, however, is that it takes more time than it should, and considerably more time than the interesting analysis at the end! These days you can make a career out of being an expert data wrangler!
Several approaches to data manipulation in R have evolved over time. In this hour, we start by looking at what could be called “traditional” approaches to the data manipulation tasks of sorting, setting, and merging. We will then look at the popular packages reshape, reshape2, and tidyr for data restructuring. We will then continue the data manipulation theme into Hour 12, “Efficient Data Handling in R,” where we will look deeper at two of the most popular packages for data manipulation and aggregation, data.table and dplyr.
In R we are rarely required to sort our data in order to use a particular function; most functions sort for us when needed. However, if we are calculating cumulative sums, analyzing time series, or simply want to view our data in a way that makes sense to us, then we need to sort the data ourselves. Base R contains a function named sort that enables us to easily sort vectors. By default, the function sorts from low to high, though we can sort in descending order by specifying decreasing = TRUE.
> sort(airquality$Wind)[1:10]
[1] 1.7 2.3 2.8 3.4 4.0 4.1 4.6 4.6 4.6 4.6
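For instance, to see the strongest winds first we reverse the default order. A minimal sketch, again using the built-in airquality data:

```r
# Sort wind speeds from high to low by setting decreasing = TRUE
sort(airquality$Wind, decreasing = TRUE)[1:5]
```

Note that sort also drops missing values by default (its na.last argument defaults to NA), so the result may be shorter than the input vector.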
Unfortunately, the sort function only works with vectors, and it is of no use to us if we want to sort data frames. To do so, we need to use the order function.
The order function returns a vector of positions, or indices, corresponding to the elements we would select if we were to order our data. Let's create a simple numeric vector, myVec, and examine the output when we feed it to the order function:
> myVec <- c(63, 31, 48, 82, 51, 20, 72, 99, 84, 53)
> order(myVec)
[1] 6 2 3 5 10 1 7 4 9 8
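We can use this output directly as a subscript: subscripting myVec by order(myVec) returns the sorted vector, exactly as sort would:

```r
myVec <- c(63, 31, 48, 82, 51, 20, 72, 99, 84, 53)

# Using the order output as a subscript sorts the vector
myVec[order(myVec)]
# [1] 20 31 48 51 53 63 72 82 84 99

# Equivalent to sort(myVec)
```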
The first value of the output vector is 6. This tells us that if we were to sort our data from low to high, the first value in the myVec vector we should select is the sixth value (in this case, the number 20). Next, we should select the second value, which is 31, and so on. The sort order that the order function produces can be used to sort vectors; however, the real benefit is felt when working with data frames. In Listing 11.1 we use order to sort the entire airquality data frame based on the Wind column. The order function is used to select rows in the subscript.
1: > sortedByWind <- airquality[order(airquality$Wind), ]
2: > head(sortedByWind, 10)
3: Ozone Solar.R Wind Temp Month Day
4: 53 NA 59 1.7 76 6 22
5: 121 118 225 2.3 94 8 29
6: 126 73 183 2.8 93 9 3
7: 117 168 238 3.4 81 8 25
8: 99 122 255 4.0 89 8 7
9: 62 135 269 4.1 84 7 1
10: 54 NA 91 4.6 76 6 23
11: 66 64 175 4.6 83 7 5
12: 98 66 NA 4.6 87 8 6
13: 127 91 189 4.6 93 9 4
Another benefit of the order function is that it allows us to order data by more than one variable. Looking again at Listing 11.1, we can see that each of the last four printed rows has a Wind value of 4.6. Where two or more values match like this, R uses the original order of the data for the sorting. To instead specify a second ordering variable, we simply add that variable as the second argument to order. We can continue to add as many ordering variables as we like in this way.
The order function has a decreasing argument, which, if set to TRUE, sorts from high to low instead of the default low to high. However, this only really helps if we are sorting by a single variable or want all the ordering variables sorted from high to low. If we want to be specific about which variables are ascending and which are descending, we accept the default decreasing = FALSE and place a minus sign (-) in front of any variables that require a descending sort. An example of this is shown in Listing 11.2, where the airquality data is sorted by Wind and then by descending values of Temp.
1: > sortedByWindandDescTemp <- airquality[order(airquality$Wind, -airquality$Temp), ]
2: > head(sortedByWindandDescTemp, 10)
3: Ozone Solar.R Wind Temp Month Day
4: 53 NA 59 1.7 76 6 22
5: 121 118 225 2.3 94 8 29
6: 126 73 183 2.8 93 9 3
7: 117 168 238 3.4 81 8 25
8: 99 122 255 4.0 89 8 7
9: 62 135 269 4.1 84 7 1
10: 127 91 189 4.6 93 9 4
11: 98 66 NA 4.6 87 8 6
12: 66 64 175 4.6 83 7 5
13: 54 NA 91 4.6 76 6 23
Appending, also commonly referred to as combining or setting, normally occurs when data arrive in chunks over a period of time. Each dataset we receive is structurally identical to the last but contains one or more new rows of data. All we therefore need to do is append the new rows to our existing data. In R this can be achieved with the rbind function, which you first saw in action with data frames in Hour 4, "Multi-Mode Data Structures." To use rbind with data frames, we need to ensure that the column names and the types of data contained within the columns match between the two data frames. The rbind function is clever enough to resolve any potential issues with factor levels.
> # New data arrives each month
> jan <- data.frame(Month = "Jan", Value = 46.4)
> feb <- data.frame(Month = "Feb", Value = 55.2)
> rbind(jan, feb)
Month Value
1 Jan 46.4
2 Feb 55.2
R tends not to compare favorably with languages such as SAS when it comes to merging, though as a user of both I actually find it slightly easier to merge data in R than in SAS, and it certainly beats Excel! In R, there is no need to sort before a merge. In many cases you can also get away without specifying the variable(s) to merge by, though it is generally considered bad practice not to do so explicitly. The function we use is merge.
The merge function allows us to merge two datasets by one or more common variables. The function has a number of arguments that control the "by" variables and how the rows in each dataset are matched. These arguments are listed in Table 11.1.
To see the merge function in action, let's walk through an example using two of the datasets contained within the mangoTraining package, demoData and pkData. The data frames contain data from a fictitious clinical trial in which 33 subjects were given doses of a drug and then monitored over time. First of all, let's preview the data frames:
> head(demoData, 3)
Subject Sex Age Weight Height BMI Smokes
1 1 M 43 57 166 20.7 No
2 2 M 22 71 179 22.2 No
3 3 F 23 72 170 25.1 No
> head(pkData, 7)
Subject Dose Time Conc
1 1 25 0 0.00
2 1 25 1 660.13
3 1 25 6 178.92
4 1 25 12 88.99
5 1 25 24 42.71
6 2 25 0 0.00
7 2 25 1 445.55
For each of the 33 subjects in demoData there are five corresponding records in pkData, representing times at which blood samples were taken during the fictitious study. In order to model drug concentration, Conc, as a response to Dose and each subject's demographic information, we need to create a single data frame containing all the relevant information. We do this by merging the two data frames by the Subject column:
> fullPk <- merge(x = demoData, y = pkData, by = "Subject")
The merge function requires at least an x and a y argument to specify the two data frames we want to merge. Here, we specified by = "Subject" to make explicit that we were merging by the common variable Subject. However, because this is a common variable, we could just as easily have omitted the argument and let R find the common variables to merge by:
> fullPk <- merge(x = demoData, y = pkData)
The arguments by.x and by.y come into play when the names of the variable(s) we want to merge by differ between the two data frames. The x and y refer to the first two arguments of the function. Therefore, if Subject had been labeled ID in the pkData data frame (our "y" data frame), we would have specified by.x = "Subject", by.y = "ID".
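As a sketch (the ID column here is hypothetical; the real pkData uses Subject), the call would look like this:

```r
# Tiny stand-ins: a demographics table and a hypothetical copy of the
# PK data in which the subject column is named ID instead of Subject
demo      <- data.frame(Subject = 1:2, Sex = c("M", "M"))
pkRenamed <- data.frame(ID   = c(1, 1, 2),
                        Time = c(0, 1, 0),
                        Conc = c(0, 660.13, 0))

# by.x names the key in the first (x) data frame, by.y in the second (y)
merge(x = demo, y = pkRenamed, by.x = "Subject", by.y = "ID")
```

Note that the merged result keeps the by.x name, so the key column in the output is called Subject.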
The all, all.x, and all.y arguments control the way in which records are merged when a value of the by variable appears in only one of the two data frames. By default, each of these arguments is set to FALSE, meaning that records are only merged if the value of the by variable appears in both data frames. In database terminology, this is commonly referred to as an inner join. This is probably best illustrated with an example. Suppose we take tiny subsets of demoData and pkData, keeping only data for the first two subjects in demoData and subjects 2 and 3 in pkData.
> demo1and2 <- demoData[demoData$Subject %in% 1:2, ]
> pk2and3 <- pkData[pkData$Subject %in% 2:3, ]
>
> demo1and2
Subject Sex Age Weight Height BMI Smokes
1 1 M 43 57 166 20.7 No
2 2 M 22 71 179 22.2 No
> pk2and3
Subject Dose Time Conc
6 2 25 0 0.00
7 2 25 1 445.55
8 2 25 6 129.31
9 2 25 12 93.33
10 2 25 24 46.11
11 3 25 0 0.00
12 3 25 1 500.65
13 3 25 6 146.04
14 3 25 12 116.93
15 3 25 24 68.25
The default behavior of merge only merges data for subject 2 because this is the only subject that appears in both data frames:
> merge(demo1and2, pk2and3)
Subject Sex Age Weight Height BMI Smokes Dose Time Conc
1 2 M 22 71 179 22.2 No 25 0 0.00
2 2 M 22 71 179 22.2 No 25 6 129.31
3 2 M 22 71 179 22.2 No 25 12 93.33
4 2 M 22 71 179 22.2 No 25 24 46.11
5 2 M 22 71 179 22.2 No 25 1 445.55
Specifying all.x = TRUE retains all records in our "x" data (that is, demo1and2), regardless of whether they appear in pk2and3 (a.k.a. a "left join"). Specifying all.y = TRUE does likewise for pk2and3 (a "right join"). An "outer join," where all records in each data frame are merged regardless of whether there is a matching value to merge by in the other data frame, is achieved by specifying all = TRUE. An example of an outer join is provided next. Notice that in cases where the merge-by variable only has records in the "x" data frame, values for all other variables in the "y" data frame are set to NA, and vice versa.
> merge(demo1and2, pk2and3, all = TRUE)
Subject Sex Age Weight Height BMI Smokes Dose Time Conc
1 1 M 43 57 166 20.7 No NA NA NA
2 2 M 22 71 179 22.2 No 25 0 0.00
3 2 M 22 71 179 22.2 No 25 6 129.31
4 2 M 22 71 179 22.2 No 25 12 93.33
5 2 M 22 71 179 22.2 No 25 24 46.11
6 2 M 22 71 179 22.2 No 25 1 445.55
7 3 <NA> NA NA NA NA <NA> 25 12 116.93
8 3 <NA> NA NA NA NA <NA> 25 0 0.00
9 3 <NA> NA NA NA NA <NA> 25 1 500.65
10 3 <NA> NA NA NA NA <NA> 25 6 146.04
11 3 <NA> NA NA NA NA <NA> 25 24 68.25
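For comparison, a left join on the same subsets keeps subject 1 from demo1and2 even though it has no matching pharmacokinetic records; Dose, Time, and Conc for that row are set to NA, and subject 3 is dropped entirely. This sketch assumes demoData and pkData from the mangoTraining package are loaded:

```r
demo1and2 <- demoData[demoData$Subject %in% 1:2, ]
pk2and3   <- pkData[pkData$Subject %in% 2:3, ]

# Left join: keep every row of the "x" data frame
leftJoined <- merge(demo1and2, pk2and3, all.x = TRUE)
```

The result has six rows: one NA-padded row for subject 1 and the five merged records for subject 2.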
Note: Naming Common Variables
If our two datasets have common variables that we do not wish to merge by, then R appends ".x" and ".y" to those column names in the resulting data frame. The suffixes argument can be used to specify alternative suffixes.
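A small hypothetical example (the score data frames here are invented for illustration):

```r
# Two hypothetical datasets sharing a Score column we do not merge by
scores2014 <- data.frame(ID = 1:3, Score = c(57, 45, 63))
scores2015 <- data.frame(ID = 1:3, Score = c(60, 48, 61))

# The defaults would give Score.x and Score.y; here we choose our own
merge(scores2014, scores2015, by = "ID", suffixes = c(".2014", ".2015"))
```

The result contains the columns ID, Score.2014, and Score.2015.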
The duplicated function finds duplicate values. It does so by asking the question, "Have I seen this before?" For example, take the Month column from the airquality data frame. The airquality data frame contains daily records for five months (May through September). In total there are 153 individual values in the Month column, but most are repeats. Calling duplicated on the column yields the following:
> isMonthValueADuplicate <- duplicated(airquality$Month)
> isMonthValueADuplicate[1:10] # View first 10 records
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
The fact that we can generate these TRUE and FALSE values like this is very useful. By placing ! in front of the call to duplicated, we switch the TRUE and FALSE values around. The resulting logical vector can then be used to remove duplicate values and hence subset our data, leaving only the first instance of each value. Here, we use this to extract the first record for each month in the airquality dataset:
> airquality[!duplicated(airquality$Month), ]
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
32 NA 286 8.6 78 6 1
62 135 269 4.1 84 7 1
93 39 83 6.9 81 8 1
124 96 167 6.9 91 9 1
Perhaps a more standard use of the duplicated function is to find and remove duplicated records. To achieve this, we can call duplicated directly on a data frame:
> # Create data with a duplicate record for ID==2
> duplicateData <- data.frame(ID = c(1,2,2,3,4), Score = c(57, 45, 45, 63, 54))
> duplicateData
ID Score
1 1 57
2 2 45
3 2 45
4 3 63
5 4 54
> # Remove the duplicate record
> duplicateData[!duplicated(duplicateData),]
ID Score
1 1 57
2 2 45
4 3 63
5 4 54
Tip: Unique Values
If we just want to identify the unique values within a vector, the unique function removes all duplicates and returns a smaller vector containing only the unique values.
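For example, applied to the Month column of airquality, unique returns just the five month numbers:

```r
# One value per month, in order of first appearance
unique(airquality$Month)
# [1] 5 6 7 8 9
```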
Before we can begin to fit models or even plot our data, we need to ensure that it is in a suitable structure. If it is not, we will need to restructure the data. SAS users would call this transposing the data. Excel users might call it pivoting. Others might call it reshaping or tidying. In R, the best known and most used packages for restructuring data are reshape, reshape2, and recently tidyr. Each of the packages has been written by Hadley Wickham and is based around the notion of what he now refers to as “tidy” data. We can think of the packages as an evolution (beginning with reshape and ending with tidyr). The terminology and usability have improved slightly with each, though the scope of these packages has actually decreased. We will therefore take a little time to look at the packages in turn.
Although the term “tidy data” might be unfamiliar, the concept is nothing new. If you are familiar with relational databases, the basic aim is to structure the data as you would in a database table. In other words, we structure the data such that
Each variable forms a column.
Each observation forms a row.
This differs from Excel, for which it is common to spread values that we want to compare across multiple columns in order to treat them as separate series when working with Excel’s plotting wizards. The tidy structure is, however, very standard in R, and most of the graphical and analytical packages in R expect a data frame in the tidy format.
The reshape and reshape2 packages offer essentially the same functionality for restructuring our data. We will work through an example using reshape and highlight the differences in reshape2. There are several utility functions contained within the reshape package, but the main restructuring functions are melt, cast, and recast. The basic idea is to "melt" a data frame (using the melt function) into a very long and thin structure and then, if necessary, "cast" it (using the cast function in reshape or dcast in reshape2) into a new structure.
Tip: Getting to Grips with reshape via reshapeGUI
Reshaping data can be hard! The melt and cast functions in reshape are great but can take some getting used to. The reshapeGUI package provides an interactive graphical user interface for practicing with the melt and cast functions. When we use the GUI to select ID and measurement variables, it builds up the equivalent line of R code for us. The GUI also allows us to preview the results before we submit the code to the R console.
The trick to understanding the melt function is being able to identify what the package refers to as ID and measurement ("measured") variables. ID variables represent fixed information about the data collected: usually IDs or names, geographic information about where the data was collected, the date and time the data was collected, and so on. The measurement variables contain the data we have collected. If you consider fitting a model to the data, then as a rough guide the measurement variables would be the response variables and the ID variables would be the explanatory variables.
Once we've decided which variables are ID variables and which are measurement variables, we feed them into the respective id.vars and measure.vars arguments. Any variables we are not interested in can be ignored and are excluded from the restructuring. To save some typing, we need only specify one of id.vars and measure.vars; R assumes that the rest of our variables fall into the unspecified category.
The melt function is best seen through an example. Listing 11.3 shows a simple example using the french_fries data contained within the reshape package. The data was originally collected from a sensory experiment conducted at Iowa State University in 2004 to investigate fryer oils.
1: > # Let's begin by loading the package and looking at the data
2: > library(reshape)
3: > head(french_fries, 3)
4: time treatment subject rep potato buttery grassy rancid painty
5: 61 1 1 3 1 2.9 0.0 0 0.0 5.5
6: 25 1 1 3 2 14.0 0.0 0 1.1 0.0
7: 62 1 1 10 1 11.0 6.4 0 0.0 0.0
8: > tail(french_fries, 3)
9: time treatment subject rep potato buttery grassy rancid painty
10: 695 10 3 78 2 3.3 0 0 2.5 1.4
11: 666 10 3 86 1 2.5 0 0 7.0 10.5
12: 696 10 3 86 2 2.5 0 0 8.2 9.4
13:
14: # Now we 'melt' having identified the ID variables
15: > fryMelt <- melt(french_fries,
16: + id.vars = c("time", "treatment", "subject", "rep"))
17:
18: # Our new data is long and thin
19: > head(fryMelt, 3)
20: time treatment subject rep variable value
21: 1 1 1 3 1 potato 2.9
22: 2 1 1 3 2 potato 14.0
23: 3 1 1 10 1 potato 11.0
24: > tail(fryMelt, 3)
25: time treatment subject rep variable value
26: 3478 10 3 78 2 painty 1.4
27: 3479 10 3 86 1 painty 10.5
28: 3480 10 3 86 2 painty 9.4
Lines 1 to 12 of the listing show the basic structure of our data. We can deduce that at each time point, a subject was given two French fries to taste that had undergone one of three treatments. The subject rated each of the fries using the criteria defined in the remaining columns. These remaining columns are therefore our measurement variables, and the variables time, treatment, subject, and rep are our ID variables. Once we have identified the ID and measurement variables, the code is fairly straightforward: we call the melt function and specify the ID variables using id.vars. As can be seen from line 17 onward in the listing, the resulting data is very long and thin. The column names for the measurement variables have been stacked into a single column named variable, and the ID variables have been repeated accordingly. The associated values for the measurement variables have been stacked into a column named value.
Calling the melt function on a data frame produces a data frame in this standard long format. More often than not, however, some further work is required in order to "cast" the data into the structure we actually want. The cast function in reshape (or dcast in reshape2) accepts a formula that describes the shape of the output. It has the following basic form:
untouched_column_1 + untouched_column_2 ~ column_to_split_1 + column_to_split_2
On the left side of the formula we specify the columns that are to remain as they are. On the right side we specify the columns that are to be split apart into new columns; a new column is created for each unique combination of values of the variables on the right side. We never reference the value column, because it represents our content, or measured data. The behavior is best seen using an example. In Listing 11.4 we create two new columns from the fryMelt data of Listing 11.3, based on the rep variable. The "..." notation means "all other columns." A single period can also be used to represent "no variable" in the casting formula.
1: > # Create two new columns based on the rep variable
2: > fryReCast <- cast(fryMelt, ... ~ rep)
3: > head(fryReCast, 3)
4: time treatment subject variable 1 2
5: 1 1 1 3 potato 2.9 14
6: 2 1 1 3 buttery 0.0 0
7: 3 1 1 3 grassy 0.0 0
Note: Differences Between reshape and reshape2
In reshape2 a distinction is made between casting to data frames and casting to arrays. Instead of the cast function, there are two new functions: acast for arrays and dcast for data frames.
Using melt and then cast (or dcast) helps break up the reshaping process. For more complicated examples, it can be really useful to check that the intermediate "melted" data frame is as expected before casting it into a new shape. However, this is not actually a necessary step. The entire transformation can be performed in a single step using the recast function. The only difference when using recast is that instead of the id.vars and measure.vars arguments that we used with melt, we drop the "s" and use id.var and measure.var instead.
> recast(french_fries,
+ id.var = c("time", "treatment", "subject", "rep"),
+ formula = ... ~ rep)
time treatment subject variable 1 2
1 1 1 3 potato 2.9 14.0
2 1 1 3 buttery 0.0 0.0
3 1 1 3 grassy 0.0 0.0
...
Note: Aggregation Using reshape
The fun.aggregate argument to cast (and dcast in reshape2) provides the ability to aggregate the data using summary functions such as mean.
> # Mean across replicates
> replicateMeans <-
+ cast(fryMelt, time + treatment + subject + variable ~ ., mean)
> head(replicateMeans, 3)
time treatment subject variable (all)
1 1 1 3 potato 8.45
2 1 1 3 buttery 0.00
3 1 1 3 grassy 0.00
Although it is possible to aggregate data using reshape, we will look at more straightforward aggregation techniques later in this hour and again in Hour 12.
The main difference between the reshape approach to restructuring and tidyr is the terminology: melt and cast (or dcast) become gather and spread. Otherwise, the idea is very much the same. In tidyr we also have a third option, separate, which comes in handy when multiple pieces of information are stored together in a single variable.
When the values of a particular variable are spread over several columns, we look to "gather" the data into a single column. We do this using the gather function, whose required arguments are shown in Table 11.2.
Let's look at how we would use gather with some real data. For this example, we will use the djiData stock data contained within the mangoTraining package. To simplify the example, we first subset the data to obtain a data frame with three columns: the date, and the low and high values of the DJI for each date:
> djiHighLow <- djiData[, c("Date", "DJI.High", "DJI.Low")]
> head(djiHighLow, 3)
Date DJI.High DJI.Low
1 12/31/2014 18043.22 17820.88
2 12/30/2014 18035.02 17959.70
3 12/29/2014 18073.04 18021.57
Suppose that we want to create a single graphic of the high and low DJI values using one of the packages described in Hours 13–15. We need one column containing the values to plot and another column specifying whether each value was a high or a low. We do this using the gather function.
Having loaded the package, we specify each of the columns we wish to gather, separated by commas, referencing each by name directly and without wrapping it in quotes. As highlighted in Table 11.2, we must also specify names for the key and value columns in the gathered data frame. In this example we gather two columns, DJI.High and DJI.Low, but in general we can specify as many columns as we like:
> gatheredDJI <- gather(djiHighLow, key="DJI", value="Value", DJI.High, DJI.Low)
> head(gatheredDJI, 4)
Date DJI Value
1 2014-12-31 DJI.High 18043.22
2 2014-12-30 DJI.High 18035.02
3 2014-12-29 DJI.High 18073.04
4 2014-12-26 DJI.High 18103.45
Variables that are not listed, such as Date in the preceding example, are unaffected by the gathering process. If we find ourselves needing to gather the majority of columns in our data, then instead of specifying what to gather we can specify what not to gather, by listing the columns we are not interested in with a minus sign in front of each one.
Tip: Lots to Gather?
The tidyr package allows a special use of the : operator for sequencing over column names, specifying a "from" and a "to" column. Therefore, a:z is interpreted as: start gathering at column "a" and gather all columns up to column "z."
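A sketch of this, assuming the columns to gather sit next to each other as in djiHighLow (rebuilt here so the example stands alone, and assuming tidyr is installed):

```r
library(tidyr)

djiHighLow <- data.frame(Date     = c("12/31/2014", "12/30/2014"),
                         DJI.High = c(18043.22, 18035.02),
                         DJI.Low  = c(17820.88, 17959.70))

# Gather every column from DJI.High through DJI.Low
gather(djiHighLow, key = "DJI", value = "Value", DJI.High:DJI.Low)
```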
Spreading is similar to casting in reshape. It enables us to take a column of values and a column of labels for those values (the "key") and "spread" the contents over several new columns. The primary arguments to spread are again key and value; a new column is created for each label in the key column. This can be useful if we need to calculate, say, changes over time: we take a column of values (value) and a column of times (key) at which those values occurred, and spread the information into a new column for each time point. In the following example, we undo the earlier gathering of the low and high DJI values into a single column, spreading them back into the two original columns:
> backToOriginal <- spread(gatheredDJI, key = DJI, value = Value)
> head(backToOriginal, 3)
Date DJI.High DJI.Low
1 01/02/2014 16573.07 16416.49
2 01/03/2014 16518.74 16439.30
3 01/06/2014 16532.99 16405.52
Tip: Piping Commands
The tidyr package has been designed to work with magrittr’s pipe operator. This allows us to chain commands together, thus avoiding intermediate data frames. You will learn more about the pipe operator in Hour 12.
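As a brief preview (assuming the magrittr and tidyr packages are installed; the djiData frame here is a small invented stand-in for the mangoTraining dataset), the subset-then-gather steps above could be chained like this:

```r
library(magrittr)
library(tidyr)

# Hypothetical two-row stand-in for djiData from mangoTraining
djiData <- data.frame(Date     = c("12/31/2014", "12/30/2014"),
                      DJI.High = c(18043.22, 18035.02),
                      DJI.Low  = c(17820.88, 17959.70))

# Each result is piped into the first argument of the next call
djiData[, c("Date", "DJI.High", "DJI.Low")] %>%
  gather(key = "DJI", value = "Value", -Date) %>%
  head(3)
```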
Occasionally we may find ourselves in a situation where two separate pieces of information are joined together in a single variable. R packages provide a nice example of this. An R package source name is made up of a package name and version number. An example of this is shown here:
> Packages <- data.frame(Source=c("reshape_0.8.5", "tidyr_0.2.0"))
> Packages
Source
1 reshape_0.8.5
2 tidyr_0.2.0
We can use the separate function to split the package names from the version numbers. Additional arguments such as sep specify the character to split on:
> separate(Packages, Source, into = c("Package", "Version"), sep = "_")
Package Version
1 reshape 0.8.5
2 tidyr 0.2.0
By default, the original variable is removed. We can override this behavior by specifying remove = FALSE.
In Hour 9, “Loops and Summaries,” you saw two ways of applying simple functions to more complex data structures:
Iterate over sections of data with a loop.
Use one of the apply family of functions.
Let's consider adding a new column to airquality containing the difference between the Wind speed for a particular day and the median Wind speed for that Month. To achieve this, we need to perform three tasks:
Calculate the median Wind speed by Month.
Align the calculated median Wind speed values with the daily Wind speed data.
Calculate the difference between the daily Wind speed and the aligned median.
One option would be a simple loop over the rows of our data:
Create an empty column in our data.
For each row in the data:
Look at the Month value for this row.
Calculate the median Wind for all data with that Month value.
Calculate the difference between the daily Wind value and this median.
Insert this value in the cell.
This approach is very inefficient. For example, it involves calculating a median repeatedly (once per row). Instead, we could calculate the medians using one loop and then reference the values in a second loop, using an approach like this:
Create an empty column in our data.
For each unique Month value, calculate and store the median Wind.
For each row in the data:
Look at the Month value for this row.
Reference the correct median Wind for that Month value (from the previous loop).
Calculate the difference between the daily Wind value and this median.
Insert this value in the cell.
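The two-loop approach above can be sketched as follows; it works, but as we are about to see, the vectorized alternatives are both shorter and faster:

```r
# First pass: one median per unique Month
months  <- unique(airquality$Month)
medians <- numeric(length(months))
for (i in seq_along(months)) {
  medians[i] <- median(airquality$Wind[airquality$Month == months[i]])
}

# Second pass: look up the stored median for each row
airquality$DiffWind <- NA
for (row in seq_len(nrow(airquality))) {
  m <- medians[months == airquality$Month[row]]
  airquality$DiffWind[row] <- airquality$Wind[row] - m
}
head(airquality$DiffWind, 3)
```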
Again, this isn’t ideal. Let’s instead consider (and see) an approach using the “apply” functions that we saw in Hour 9.
The first thing we have to decide is which "apply" function to use. Let's use the tapply function (or split and sapply) to return the median Wind by Month:
> head(airquality) # Print airquality
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
> windMedians <- tapply(airquality$Wind, airquality$Month, median)
> windMedians
5 6 7 8 9
11.5 9.7 8.6 8.6 10.3
This is straightforward and calculates the median Wind speed by Month, storing the results in a named vector. The next step is to align the daily values with the corresponding windMedians values so we can calculate the differences. This is perhaps the most complex part of the process.
As you saw in Hour 3, "Single-Mode Data Structures," we can reference values from a vector using square brackets with blank, positive, negative, logical, or character inputs. In this case, we have a vector of Month values with which to reference values from the windMedians vector. Let's convert our Month values to characters and then use those values to reference the (named) elements of windMedians:
> charMonths <- as.character(airquality$Month) # Character values of Month
> # Use character values to reference named elements
> head(windMedians[charMonths])
   5    5    5    5    5    5
11.5 11.5 11.5 11.5 11.5 11.5
Now we can create a column of medians in our dataset and calculate differences from it. Of course, we don't have to create the column of intermediate values, but we include it here to help illustrate the process:
> airquality$MedianWind <- windMedians[charMonths] # Add MedianWind column
> airquality$DiffWind <- airquality$Wind - airquality$MedianWind # Differences
> head(airquality, 3) # First few rows
  Ozone Solar.R Wind Temp Month Day MedianWind DiffWind
1    41     190  7.4   67     5   1       11.5     -4.1
2    36     118  8.0   72     5   2       11.5     -3.5
3    12     149 12.6   74     5   3       11.5      1.1
> tail(airquality, 3) # Last few rows
    Ozone Solar.R Wind Temp Month Day MedianWind DiffWind
151    14     191 14.3   75     9  28       10.3      4.0
152    18     131  8.0   76     9  29       10.3     -2.3
153    20     223 11.5   68     9  30       10.3      1.2
This approach works, but the second step (aligning the medians with the daily values) was perhaps a little complex. If we later decide to perform the same process for a number of columns, the solution becomes more verbose and complex. We can simplify the approach using the aggregate function.
The aggregate function allows us to apply functions over sections of a data frame, returning a data frame as the output. We can use aggregate in two different ways:
We can supply a "formula" describing the data over which to apply the function.
We can specify a set of variables to summarize and a set of variables by which to summarize.
Let’s first see an example using a formula to define the structure of the data.
We can use a formula with aggregate to specify the variables to summarize and the variables by which to perform the summary. A basic formula has the form Y ~ X, where Y is the variable to summarize and X is the variable by which to summarize. The aggregate function additionally accepts a data argument (specifying the data frame containing the data) and a FUN argument (specifying the function to apply). Let's look at a simple example where we again calculate the median Wind by Month:
> aggregate(Wind ~ Month, data = airquality, FUN = median)
Month Wind
1 5 11.5
2 6 9.7
3 7 8.6
4 8 8.6
5 9 10.3
As you can see, the return structure is a data frame, which is a very simple and usable structure.
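The second approach mentioned above, which passes the data and grouping variables directly instead of a formula, produces the same result; note that the by argument must be a (named) list:

```r
# Same summary via the list interface; the name given in the list
# becomes the grouping column's name in the output
aggregate(airquality["Wind"], by = list(Month = airquality$Month),
          FUN = median)
```

One practical difference worth knowing: the formula interface drops rows with missing values in any formula variable before summarizing, whereas the list interface passes them through to FUN.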
If we want to apply the function by more than one variable, we can add the names of the variables to the set of variables in the formula:
> aggregate(Wind ~ Month + cut(Temp, 2), data = airquality, FUN = median)
Month cut(Temp, 2) Wind
1 5 (56,76.5] 11.5
2 6 (56,76.5] 9.7
3 7 (56,76.5] 10.6
4 8 (56,76.5] 12.6
5 9 (56,76.5] 10.9
6 5 (76.5,97] 10.3
7 6 (76.5,97] 9.7
8 7 (76.5,97] 8.6
9 8 (76.5,97] 8.3
10 9 (76.5,97] 7.7
Again, the return structure is a data frame.
If we want to perform the same summary on a number of variables at the same time, we can combine the summary variables in a call to cbind. For example, let's calculate the median Wind and Ozone values by Month:
> aggregate(cbind(Wind, Ozone) ~ Month, data = airquality, FUN = median, na.rm = TRUE)
Month Wind Ozone
1 5 11.5 18
2 6 11.5 23
3 7 7.7 60
4 8 8.0 52
5 9 10.3 23
In the preceding examples, we used the median function, which returns a single value. If, instead, we used a function that returned multiple values, these would be returned as separate columns. To illustrate this behavior, let's repeat the last three examples with the range function:
> # Range of Wind values by Month
> aggregate(Wind ~ Month, data = airquality, FUN = range, na.rm = TRUE)
Month Wind.1 Wind.2
1 5 5.7 20.1
2 6 1.7 20.7
3 7 4.1 14.9
4 8 2.3 15.5
5 9 2.8 16.6
> # Range of Wind AND Ozone values by Month
> aggregate(cbind(Wind, Ozone) ~ Month, data = airquality, FUN = range, na.rm = TRUE)
Month Wind.1 Wind.2 Ozone.1 Ozone.2
1 5 5.7 20.1 1 115
2 6 8.0 20.7 12 71
3 7 4.1 14.9 7 135
4 8 2.3 15.5 9 168
5 9 2.8 16.6 7 96
> # Range of Wind AND Ozone values by Month AND grouped Temp
> aggregate(cbind(Wind, Ozone) ~ Month + cut(Temp, 2), data = airquality,
+ FUN = range, na.rm = TRUE)
Month cut(Temp, 2) Wind.1 Wind.2 Ozone.1 Ozone.2
1 5 (56,76.5] 6.9 20.1 1 41
2 6 (56,76.5] 9.2 20.7 12 37
3 7 (56,76.5] 6.9 14.3 10 16
4 8 (56,76.5] 7.4 14.3 9 23
5 9 (56,76.5] 6.9 16.6 7 30
6 5 (76.5,97] 5.7 14.9 45 115
7 6 (76.5,97] 8.0 14.9 21 71
8 7 (76.5,97] 4.1 14.9 7 135
9 8 (76.5,97] 2.3 15.5 9 168
10 9 (76.5,97] 2.8 15.5 16 96
In these examples, the values returned are named based on the column that was summarized and an index of the return value. If, instead, the function returned “named” elements, these names would be appended to the summarized column names:
> aggregate(Wind ~ Month, data = airquality,
+ FUN = function(X) {
+ c(MIN = min(X), MAX = max(X))
+ })
Month Wind.MIN Wind.MAX
1 5 5.7 20.1
2 6 1.7 20.7
3 7 4.1 14.9
4 8 2.3 15.5
5 9 2.8 16.6
Instead of the formula, we can use aggregate by specifying variables separately in the function call. Specifically, we supply lists of variables, which we can rename within the call if we want to control the names of the resulting summary variables:
The first input specifies the variable(s) to summarize.
The second input specifies the grouping variable(s).
The third input is the function to apply.
Let's again calculate the median Wind by Month, this time specifying the inputs as described earlier:
> aggregate(list(aveWind = airquality$Wind), list(Month = airquality$Month), median)
Month aveWind
1 5 11.5
2 6 9.7
3 7 8.6
4 8 8.6
5 9 10.3
The output is a data frame, with the variables named as specified in the input lists.
If we want to apply the function by more than one variable, we can add these variables to the list, as follows:
> aggregate(list(aveWind = airquality$Wind),
+ list(Month = airquality$Month, TempGroup = cut(airquality$Temp, 2)), median)
Month TempGroup aveWind
1 5 (56,76.5] 11.5
2 6 (56,76.5] 9.7
3 7 (56,76.5] 10.6
4 8 (56,76.5] 12.6
5 9 (56,76.5] 10.9
6 5 (76.5,97] 10.3
7 6 (76.5,97] 9.7
8 7 (76.5,97] 8.6
9 8 (76.5,97] 8.3
10 9 (76.5,97] 7.7
Again, this approach allows us to easily control the names of the resulting variables (for example, naming the TempGroup and aveWind columns).
If we want to perform the same summary on a number of variables at the same time, we can provide multiple variables in the first input list, as follows:
> aggregate(list(aveWind = airquality$Wind, aveOzone = airquality$Ozone),
+ list(Month = airquality$Month), median, na.rm = TRUE)
Month aveWind aveOzone
1 5 11.5 18
2 6 9.7 23
3 7 8.6 60
4 8 8.6 52
5 9 10.3 23
Tip: Specifying Inputs as Data Frames
Because a data frame is, structurally, a list of vectors, we can supply data frame inputs directly instead of lists, if preferred. This is most useful when there are multiple variables being specified. For example, we could rewrite the last example as follows:
> aggregate(airquality[,c("Wind", "Ozone")],
+ list(Month = airquality$Month), median, na.rm = TRUE)
Month Wind Ozone
1 5 11.5 18
2 6 9.7 23
3 7 8.6 60
4 8 8.6 52
5 9 10.3 23
Although this is far more concise, we do lose the ability to directly rename the variables (for example, to aveWind and aveOzone as in the previous example).
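If the names matter, one simple option is to rename the columns after the call. This is a sketch using base R only; aveStats is just an illustrative name:

```r
# Aggregate using data frame inputs, then rename the summary columns afterwards
aveStats <- aggregate(airquality[, c("Wind", "Ozone")],
                      list(Month = airquality$Month), median, na.rm = TRUE)
names(aveStats)[names(aveStats) == "Wind"]  <- "aveWind"
names(aveStats)[names(aveStats) == "Ozone"] <- "aveOzone"
aveStats
```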
As with the example where we specified formulas, we can apply functions that return multiple values. In this case, the index of values is appended to the summarized variable name:
> aggregate(list(Wind = airquality$Wind),
+ list(Month = airquality$Month), range)
Month Wind.1 Wind.2
1 5 5.7 20.1
2 6 1.7 20.7
3 7 4.1 14.9
4 8 2.3 15.5
5 9 2.8 16.6
Again, if our function returns named elements, these are appended instead of the index values:
> aggregate(list(Wind = airquality$Wind),
+ list(Month = airquality$Month),
+ function(X) {
+ c(MIN = min(X), MAX = max(X))
+ })
Month Wind.MIN Wind.MAX
1 5 5.7 20.1
2 6 1.7 20.7
3 7 4.1 14.9
4 8 2.3 15.5
5 9 2.8 16.6
At the start of the Data Aggregation section, we introduced a task that we were aiming to complete and discussed how the previous approaches (for loops and apply functions) could be used to achieve it. To recap, we are aiming to add a new column to airquality, containing the difference between the Wind speed for a particular day and the median Wind speed for that Month.
To achieve this, we need to perform three tasks:
Calculate the median Wind speed by Month.
Align the calculated median Wind speed values with the daily Wind speed data.
Calculate the difference between the daily Wind speed and the median values.
Using the aggregate function, we can calculate the median Wind by Month, returning our results as a data frame:
> windMedians <- aggregate(list(MedianWind = airquality$Wind),
+ list(Month = airquality$Month), median)
> windMedians
Month MedianWind
1 5 11.5
2 6 9.7
3 7 8.6
4 8 8.6
5 9 10.3
Note: Using List Inputs to Aggregate
In this example, I'm specifying the inputs to aggregate as list elements, instead of a formula, so I can explicitly control the naming of the summary (that is, the MedianWind column). If I used a formula, I'd need to rename the column to MedianWind as a second step.
Now that we have our median Wind values in a data frame, we can merge this onto our original dataset to create the MedianWind column:
> airquality <- merge(airquality, windMedians)
> head(airquality)
Month Ozone Solar.R Wind Temp Day MedianWind
1 5 41 190 7.4 67 1 11.5
2 5 36 118 8.0 72 2 11.5
3 5 12 149 12.6 74 3 11.5
4 5 18 313 11.5 62 4 11.5
5 5 NA NA 14.3 56 5 11.5
6 5 28 NA 14.9 66 6 11.5
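All that remains is the difference calculation itself. The three steps can be put together as a single sketch (working on a copy, aq, so that the original airquality is not overwritten):

```r
# Step 1: median Wind by Month, returned as a data frame
windMedians <- aggregate(list(MedianWind = airquality$Wind),
                         list(Month = airquality$Month), median)
# Step 2: align the medians with the daily data by merging on Month
aq <- merge(airquality, windMedians)
# Step 3: difference between each daily Wind speed and its monthly median
aq$DiffWind <- aq$Wind - aq$MedianWind
head(aq, 3)
```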
In this hour, you saw how to sort, set, and merge data using traditional R functions. We looked at the popular reshape (reshape2) and tidyr packages for restructuring our data, ready for plotting and modeling. We also looked at various options for aggregating data, including the powerful aggregate function.
In the next hour, we will look closer at two packages that are changing the way people manipulate and summarize data with R. The data.table and dplyr packages offer speed and efficiency, borrowing approaches from the database world.
Q. I tried to sort the airquality data using airquality[sort(airquality$Wind),] but got strange results. What happened?
A. To sort a data frame in this way, you need to know which rows to select. The sort order is returned by the order function, not sort.
Q. I have two data frames, each containing data for specified locations at specified times. Can I merge by both variables?
A. Absolutely. You can merge by as many variables as you like. Pass their names to merge as a character vector via the by argument.
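As a small illustration (the sales and temps data frames here are hypothetical):

```r
# Two hypothetical data frames sharing Location and Time keys
sales <- data.frame(Location = c("A", "A", "B"), Time = c(1, 2, 3),
                    Sales = c(10, 20, 30))
temps <- data.frame(Location = c("A", "B"), Time = c(1, 3),
                    Temp = c(15, 18))
# Merge by both variables by passing a character vector to 'by'
merge(sales, temps, by = c("Location", "Time"))
```

Only the rows where both Location and Time match in the two data frames are returned.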
Q. Is it possible to merge three data frames at once using merge?
A. Unfortunately, no. However, the merge_recurse function in reshape provides this functionality.
Q. Should I be using reshape2 instead of reshape?
A. Development of reshape ceased in 2011. However, it depends on what you want to do. In some sense, reshape2 supersedes reshape; however, there is arguably more functionality contained within reshape. If you want to use reshape/reshape2 for data aggregation, it is worth noting that the cast function can handle summary functions such as range that produce a vector of multiple values, whereas dcast cannot and fails with an error.
The workshop contains quiz questions and exercises to help you solidify your understanding of the material covered. Try to answer all questions before looking at the “Answers” section that follows.
1. What is the difference between sort and order?
2. Which function can be used to return the unique values in a vector?
3. What function would you use to append rows to a data frame?
4. What does the “d” represent in the dcast function?
1. The sort function is used to sort vectors. It cannot be used to sort data frames. The order function provides a sort order that can be used to sort vectors or data frames.
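For example, a sketch using the built-in airquality data:

```r
sort(airquality$Wind)[1:3]                        # sort() returns a sorted vector
aqSorted <- airquality[order(airquality$Wind), ]  # order() gives row indices
head(aqSorted, 3)                                 # data frame sorted by Wind
```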
2. The unique function directly returns the unique values. Alternatively, duplicated could be used as a means to subscript and obtain the same result.
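A quick sketch of both approaches:

```r
x <- c(5, 6, 5, 7, 6)
unique(x)           # directly returns the unique values: 5 6 7
x[!duplicated(x)]   # subscripting with duplicated() gives the same result
```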
3. The rbind function is a simple means of appending new rows to a data frame.
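For example (df and newRow are hypothetical names):

```r
df <- data.frame(A = 1:2, B = c("x", "y"))
newRow <- data.frame(A = 3L, B = "z")
rbind(df, newRow)   # appends newRow as a third row
```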
4. The “d” stands for “data frame.” In reshape2, the more generic cast was replaced with the acast and dcast functions to allow casting to both arrays and data frames via separate functions.
1. Sort the mtcars data frame by the number of cylinders and then descending by miles per gallon.
2. Extract the “Employees” and “Orders” tables from the Northwind.mdb file contained within the mangoTraining package using RODBC. Merge the two data frames by EmployeeID.
3. Use melt and dcast to find the average tip size by the sex and smoking habit of the bill payer, using the tips data contained within the reshape2 package.
4. Separate the Date column within djiData into three new columns: Month, Day, and Year. Ensure that you keep the original Date column.