Sometimes data requires more complex storage than simple vectors and thankfully R provides a host of data structures. The most common are the data.frame, matrix and list followed by the array. Of these, the data.frame will be most familiar to anyone who has used a spreadsheet, the matrix to people familiar with matrix math and the list to programmers.
Perhaps one of the most useful features of R is the data.frame. It is one of the most often cited reasons for R’s ease of use.
On the surface a data.frame is just like an Excel spreadsheet in that it has columns and rows. In statistical terms, each column is a variable and each row is an observation.
In terms of how R organizes data.frames, each column is actually a vector, each of which has the same length. That is very important because it lets each column hold a different type of data (see Section 4.3). This also implies that within a column each element must be of the same type, just like with vectors.
There are numerous ways to construct a data.frame, the simplest being to use the data.frame function. Let’s create a basic data.frame using some of the vectors we have already introduced, namely x, y and q.
> x <- 10:1
> y <- -4:5
> q <- c("Hockey", "Football", "Baseball", "Curling", "Rugby",
+ "Lacrosse", "Basketball", "Tennis", "Cricket", "Soccer")
> theDF <- data.frame(x, y, q)
> theDF
x y q
1 10 -4 Hockey
2 9 -3 Football
3 8 -2 Baseball
4 7 -1 Curling
5 6 0 Rugby
6 5 1 Lacrosse
7 4 2 Basketball
8 3 3 Tennis
9 2 4 Cricket
10 1 5 Soccer
This creates a 10x3 data.frame consisting of those three vectors. Notice that the column names of theDF are simply the names of the variables. We could have assigned names during the creation process, which is generally a good idea.
> theDF <- data.frame(First = x, Second = y, Sport = q)
> theDF
First Second Sport
1 10 -4 Hockey
2 9 -3 Football
3 8 -2 Baseball
4 7 -1 Curling
5 6 0 Rugby
6 5 1 Lacrosse
7 4 2 Basketball
8 3 3 Tennis
9 2 4 Cricket
10 1 5 Soccer
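To make the point about columns being equal-length vectors of possibly different types concrete, here is a minimal sketch. The object smallDF and its contents are hypothetical and exist only for this illustration; stringsAsFactors is set explicitly so the result does not depend on the R version.

```r
# hypothetical example data, smaller than theDF
x <- 3:1
sport <- c("Hockey", "Football", "Baseball")
smallDF <- data.frame(First = x, Sport = sport, stringsAsFactors = FALSE)

# every column is a vector of the same length
sapply(smallDF, length)

# each column keeps its own type
sapply(smallDF, class)

# vectors whose lengths cannot be reconciled are rejected with an error
result <- tryCatch(data.frame(a = 1:3, b = 1:2),
                   error = function(e) "error")
result
```

The tryCatch call is only there so that the failed construction can be inspected without stopping the script.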
data.frames are complex objects with many attributes. The most frequently checked attributes are the number of rows and columns. Of course there are functions to do this for us: nrow and ncol. And in case both are wanted at the same time there is the dim function.
> nrow(theDF)
[1] 10
> ncol(theDF)
[1] 3
> dim(theDF)
[1] 10 3
Checking the column names of a data.frame is as simple as using the names function. This returns a character vector listing the columns. Since it is a vector we can access individual elements of it just like any other vector.
> names(theDF)
[1] "First" "Second" "Sport"
> names(theDF)[3]
[1] "Sport"
We can also check and assign the row names of a data.frame.
> rownames(theDF)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
> rownames(theDF) <- c("One", "Two", "Three", "Four", "Five", "Six",
+ "Seven", "Eight", "Nine", "Ten")
> rownames(theDF)
[1] "One" "Two" "Three" "Four" "Five" "Six" "Seven" "Eight"
[9] "Nine" "Ten"
> # set them back to the generic index
> rownames(theDF) <- NULL
> rownames(theDF)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
Usually a data.frame has far too many rows to print them all to the screen, so thankfully the head function prints out only the first few rows and the tail function only the last few.
> head(theDF)
First Second Sport
1 10 -4 Hockey
2 9 -3 Football
3 8 -2 Baseball
4 7 -1 Curling
5 6 0 Rugby
6 5 1 Lacrosse
> head(theDF, n = 7)
First Second Sport
1 10 -4 Hockey
2 9 -3 Football
3 8 -2 Baseball
4 7 -1 Curling
5 6 0 Rugby
6 5 1 Lacrosse
7 4 2 Basketball
> tail(theDF)
First Second Sport
5 6 0 Rugby
6 5 1 Lacrosse
7 4 2 Basketball
8 3 3 Tennis
9 2 4 Cricket
10 1 5 Soccer
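Both functions accept the n argument shown above for head. As a brief sketch on a hypothetical data.frame named df (not one from the text), a positive n keeps that many rows, while a negative n passed to head drops that many rows from the end.

```r
# hypothetical data for illustration
df <- data.frame(id = 1:10, value = (1:10)^2)

# the last three rows; the original row names are kept
lastThree <- tail(df, n = 3)
lastThree

# all but the last eight rows, leaving rows 1 and 2
firstTwo <- head(df, n = -8)
firstTwo
```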
As we can with other variables, we can check the class of a data.frame using the class function.
> class(theDF)
[1] "data.frame"
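Since each column of the data.frame is an individual vector, it can be accessed individually and each has its own class. Like many other aspects of R, there are multiple ways to access an individual column. There is the $ operator and also the square brackets. Running theDF$Sport will give the third column in theDF. That allows us to specify one particular column by name. The factor output shown here comes from a version of R earlier than 4.0.0, where data.frame converted character columns to factors by default; in later versions, pass stringsAsFactors = TRUE to data.frame to reproduce it.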
> theDF$Sport
[1] Hockey Football Baseball Curling Rugby Lacrosse
[7] Basketball Tennis Cricket Soccer
10 Levels: Baseball Basketball Cricket Curling Football ... Tennis
Similar to vectors, data.frames allow us to access individual elements by their position using square brackets, but instead of having one position two are specified. The first is the row number and the second is the column number. So to get the third row from the second column we use theDF[3, 2].
> theDF[3, 2]
[1] -2
To specify more than one row or column use a vector of indices.
> # row 3, columns 2 through 3
> theDF[3, 2:3]
Second Sport
3 -2 Baseball
>
> # rows 3 and 5, column 2
> # since only one column was selected it was returned as a vector
> # hence the column names will not be printed
> theDF[c(3, 5), 2]
[1] -2 0
>
> # rows 3 and 5, columns 2 through 3
> theDF[c(3, 5), 2:3]
Second Sport
3 -2 Baseball
5 0 Rugby
To access an entire row, specify that row while not specifying any column. Likewise, to access an entire column, specify that column while not specifying any row.
> # all of column 3
> # since it is only one column a vector is returned
> theDF[, 3]
[1] Hockey Football Baseball Curling Rugby Lacrosse
[7] Basketball Tennis Cricket Soccer
10 Levels: Baseball Basketball Cricket Curling Football ... Tennis
>
> # all of columns 2 through 3
> theDF[, 2:3]
Second Sport
1 -4 Hockey
2 -3 Football
3 -2 Baseball
4 -1 Curling
5 0 Rugby
6 1 Lacrosse
7 2 Basketball
8 3 Tennis
9 4 Cricket
10 5 Soccer
>
> # all of row 2
> theDF[2, ]
First Second Sport
2 9 -3 Football
>
> # all of rows 2 through 4
> theDF[2:4, ]
First Second Sport
2 9 -3 Football
3 8 -2 Baseball
4 7 -1 Curling
To access multiple columns by name, make the column argument a character vector of the names.
> theDF[, c("First", "Sport")]
First Sport
1 10 Hockey
2 9 Football
3 8 Baseball
4 7 Curling
5 6 Rugby
6 5 Lacrosse
7 4 Basketball
8 3 Tennis
9 2 Cricket
10 1 Soccer
Yet another way to access a specific column is to use its column name (or its number) either as the second argument to the square brackets or as the only argument to either single or double square brackets.
> # just the "Sport" column
> # since it is one column it returns as a (factor) vector
> theDF[, "Sport"]
[1] Hockey Football Baseball Curling Rugby Lacrosse
[7] Basketball Tennis Cricket Soccer
10 Levels: Baseball Basketball Cricket Curling Football ... Tennis
> class(theDF[, "Sport"])
[1] "factor"
>
> # just the "Sport" column
> # this returns a one column data.frame
> theDF["Sport"]
Sport
1 Hockey
2 Football
3 Baseball
4 Curling
5 Rugby
6 Lacrosse
7 Basketball
8 Tennis
9 Cricket
10 Soccer
> class(theDF["Sport"])
[1] "data.frame"
>
> # just the "Sport" column
> # this also returns a (factor) vector
> theDF[["Sport"]]
[1] Hockey Football Baseball Curling Rugby Lacrosse
[7] Basketball Tennis Cricket Soccer
10 Levels: Baseball Basketball Cricket Curling Football ... Tennis
> class(theDF[["Sport"]])
[1] "factor"
All of these methods have differing outputs. Some return a vector, some return a single-column data.frame. To ensure a single-column data.frame while using single-square brackets, there is a third argument: drop=FALSE. This also works when specifying a single column by number.
> theDF[, "Sport", drop = FALSE]
Sport
1 Hockey
2 Football
3 Baseball
4 Curling
5 Rugby
6 Lacrosse
7 Basketball
8 Tennis
9 Cricket
10 Soccer
> class(theDF[, "Sport", drop = FALSE])
[1] "data.frame"
>
> theDF[, 3, drop = FALSE]
Sport
1 Hockey
2 Football
3 Baseball
4 Curling
5 Rugby
6 Lacrosse
7 Basketball
8 Tennis
9 Cricket
10 Soccer
> class(theDF[, 3, drop = FALSE])
[1] "data.frame"
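The access forms covered so far fall into two groups, which the following sketch checks programmatically. The data.frame df is a hypothetical three-row stand-in for theDF, and stringsAsFactors = TRUE is set explicitly so the Sport column is a factor regardless of R version.

```r
# hypothetical stand-in for theDF
df <- data.frame(First = 1:3,
                 Sport = c("Hockey", "Football", "Baseball"),
                 stringsAsFactors = TRUE)

# forms that return the column itself as a vector
asVector <- list(df$Sport, df[["Sport"]], df[, "Sport"], df[, 2], df[[2]])
allSame <- all(sapply(asVector, identical, df$Sport))
allSame

# forms that return a one-column data.frame
asDF <- list(df["Sport"], df[2], df[, "Sport", drop = FALSE])
allDF <- all(sapply(asDF, is.data.frame))
allDF
```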
In Section 4.4.2 we see that factors are stored specially. To see how they would be represented in data.frame form, use model.matrix to create a set of indicator (or dummy) variables. That is one column for each level of a factor, with a 1 if a row contains that level or a 0 otherwise.
> newFactor <- factor(c("Pennsylvania", "New York", "New Jersey", "New York",
+ "Tennessee", "Massachusetts", "Pennsylvania", "New York"))
> model.matrix(~newFactor - 1)
newFactorMassachusetts newFactorNew Jersey newFactorNew York
1 0 0 0
2 0 0 1
3 0 1 0
4 0 0 1
5 0 0 0
6 1 0 0
7 0 0 0
8 0 0 1
newFactorPennsylvania newFactorTennessee
1 1 0
2 0 0
3 0 0
4 0 0
5 0 1
6 0 0
7 1 0
8 0 0
attr(,"assign")
[1] 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$newFactor
[1] "contr.treatment"
We learn more about formulas (the argument to model.matrix) in Sections 11.2 and 12.3.2 and Chapters 15 and 16.
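The - 1 in the formula above removes the intercept, which is why every level gets its own indicator column. As a small sketch of the difference, using a hypothetical factor named states: with the intercept left in, model.matrix adds an "(Intercept)" column and omits the indicator for the first level.

```r
# hypothetical factor for illustration
states <- factor(c("Pennsylvania", "New York", "New Jersey", "New York"))

# default formula: an intercept column plus indicators for all but the first level
withIntercept <- model.matrix(~ states)
colnames(withIntercept)

# intercept removed: one indicator column per level
noIntercept <- model.matrix(~ states - 1)
colnames(noIntercept)

# in the no-intercept form exactly one indicator is set in each row
rowSums(noIntercept)
```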
Often a container is needed to hold arbitrary objects of either the same type or varying types. R accomplishes this through lists. They store any number of items of any type. A list can contain all numerics or characters or a mix of the two or data.frames or, recursively, other lists.
Lists are created with the list function where each argument to the function becomes an element of the list.
> # creates a three element list
> list(1, 2, 3)
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
>
> # creates a single element list where the only element is a vector
> # that has three elements
> list(c(1, 2, 3))
[[1]]
[1] 1 2 3
>
> # creates a two element list
> # the first element is a three element vector
> # the second element is a five element vector
> (list3 <- list(c(1, 2, 3), 3:7))
[[1]]
[1] 1 2 3
[[2]]
[1] 3 4 5 6 7
>
> # two element list
> # first element is a data.frame
> # second element is a 10 element vector
> list(theDF, 1:10)
[[1]]
First Second Sport
1 10 -4 Hockey
2 9 -3 Football
3 8 -2 Baseball
4 7 -1 Curling
5 6 0 Rugby
6 5 1 Lacrosse
7 4 2 Basketball
8 3 3 Tennis
9 2 4 Cricket
10 1 5 Soccer
[[2]]
[1] 1 2 3 4 5 6 7 8 9 10
>
> # three element list
> # first is a data.frame
> # second is a vector
> # third is list3, which holds two vectors
> list5 <- list(theDF, 1:10, list3)
> list5
[[1]]
First Second Sport
1 10 -4 Hockey
2 9 -3 Football
3 8 -2 Baseball
4 7 -1 Curling
5 6 0 Rugby
6 5 1 Lacrosse
7 4 2 Basketball
8 3 3 Tennis
9 2 4 Cricket
10 1 5 Soccer
[[2]]
[1] 1 2 3 4 5 6 7 8 9 10
[[3]]
[[3]][[1]]
[1] 1 2 3
[[3]][[2]]
[1] 3 4 5 6 7
Notice in the previous block of code (where list3 was created) that enclosing an expression in parentheses displays the results after execution.
Like data.frames, lists can have names. Each element can be given a name, and the names can be either viewed or assigned using names.
> names(list5)
NULL
> names(list5) <- c("data.frame", "vector", "list")
> names(list5)
[1] "data.frame" "vector" "list"
> list5
$data.frame
First Second Sport
1 10 -4 Hockey
2 9 -3 Football
3 8 -2 Baseball
4 7 -1 Curling
5 6 0 Rugby
6 5 1 Lacrosse
7 4 2 Basketball
8 3 3 Tennis
9 2 4 Cricket
10 1 5 Soccer
$vector
[1] 1 2 3 4 5 6 7 8 9 10
$list
$list[[1]]
[1] 1 2 3
$list[[2]]
[1] 3 4 5 6 7
Names can also be assigned to list elements during creation using name-value pairs.
> list6 <- list(TheDataFrame = theDF, TheVector = 1:10, TheList = list3)
> names(list6)
[1] "TheDataFrame" "TheVector" "TheList"
> list6
$TheDataFrame
First Second Sport
1 10 -4 Hockey
2 9 -3 Football
3 8 -2 Baseball
4 7 -1 Curling
5 6 0 Rugby
6 5 1 Lacrosse
7 4 2 Basketball
8 3 3 Tennis
9 2 4 Cricket
10 1 5 Soccer
$TheVector
[1] 1 2 3 4 5 6 7 8 9 10
$TheList
$TheList[[1]]
[1] 1 2 3
$TheList[[2]]
[1] 3 4 5 6 7
Creating an empty list of a certain size is, perhaps confusingly, done with vector.
> (emptyList <- vector(mode = "list", length = 4))
[[1]]
NULL
[[2]]
NULL
[[3]]
NULL
[[4]]
NULL
To access an individual element of a list, use double square brackets, specifying either the element number or name. Note that this allows access to only one element at a time.
> list5[[1]]
First Second Sport
1 10 -4 Hockey
2 9 -3 Football
3 8 -2 Baseball
4 7 -1 Curling
5 6 0 Rugby
6 5 1 Lacrosse
7 4 2 Basketball
8 3 3 Tennis
9 2 4 Cricket
10 1 5 Soccer
> list5[["data.frame"]]
First Second Sport
1 10 -4 Hockey
2 9 -3 Football
3 8 -2 Baseball
4 7 -1 Curling
5 6 0 Rugby
6 5 1 Lacrosse
7 4 2 Basketball
8 3 3 Tennis
9 2 4 Cricket
10 1 5 Soccer
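For contrast, single square brackets on a list do not extract the element; they return a shorter list. The sketch below uses a hypothetical list named demoList to show the difference, and that single brackets can select several elements at once.

```r
# hypothetical list for illustration
demoList <- list(numbers = 1:3, chars = c("a", "b"))

# double brackets return the element itself
class(demoList[["numbers"]])

# single brackets return a list containing that element
class(demoList["numbers"])

# single brackets can select more than one element
length(demoList[c("numbers", "chars")])
```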
Once an element is accessed it can be treated as if that actual element is being used, allowing nested indexing of elements.
> list5[[1]]$Sport
[1] Hockey Football Baseball Curling Rugby Lacrosse
[7] Basketball Tennis Cricket Soccer
10 Levels: Baseball Basketball Cricket Curling Football ... Tennis
> list5[[1]][, "Second"]
[1] -4 -3 -2 -1 0 1 2 3 4 5
> list5[[1]][, "Second", drop = FALSE]
Second
1 -4
2 -3
3 -2
4 -1
5 0
6 1
7 2
8 3
9 4
10 5
It is possible to append elements to a list simply by using an index (either numeric or named) that does not exist.
> # see how long it currently is
> length(list5)
[1] 3
>
> # add a fourth element, unnamed
> list5[[4]] <- 2
> length(list5)
[1] 4
>
> # add a fifth element, named
> list5[["NewElement"]] <- 3:6
> length(list5)
[1] 5
>
> names(list5)
[1] "data.frame" "vector" "list" "" "NewElement"
> list5
$data.frame
First Second Sport
1 10 -4 Hockey
2 9 -3 Football
3 8 -2 Baseball
4 7 -1 Curling
5 6 0 Rugby
6 5 1 Lacrosse
7 4 2 Basketball
8 3 3 Tennis
9 2 4 Cricket
10 1 5 Soccer
$vector
[1] 1 2 3 4 5 6 7 8 9 10
$list
$list[[1]]
[1] 1 2 3
$list[[2]]
[1] 3 4 5 6 7
[[4]]
[1] 2
$NewElement
[1] 3 4 5 6
Occasionally appending to a list (or vector or data.frame for that matter) is fine, but doing so repeatedly is computationally expensive. So it is best to create a list as long as its final desired size and then fill it in using the appropriate indices.
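A minimal sketch of that preallocate-then-fill pattern follows. The names n and squares are hypothetical, and the loop body is a placeholder for whatever computation produces each element.

```r
# final size is known in advance
n <- 5

# preallocate a list of n NULL elements
squares <- vector(mode = "list", length = n)

# fill each slot by index instead of appending
for (i in seq_len(n)) {
    squares[[i]] <- i^2
}

length(squares)
squares[[5]]
```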
A very common mathematical structure that is essential to statistics is a matrix. This is similar to a data.frame in that it is rectangular with rows and columns except that every single element, regardless of column, must be the same type, most commonly all numerics. They also act similarly to vectors with element-by-element addition, multiplication, subtraction, division and equality. The nrow, ncol and dim functions work just like they do for data.frames.
> # create a 5x2 matrix
> A <- matrix(1:10, nrow = 5)
> # create another 5x2 matrix
> B <- matrix(21:30, nrow = 5)
> # create a 2x10 matrix
> C <- matrix(21:40, nrow = 2)
> A
[,1] [,2]
[1,] 1 6
[2,] 2 7
[3,] 3 8
[4,] 4 9
[5,] 5 10
> B
[,1] [,2]
[1,] 21 26
[2,] 22 27
[3,] 23 28
[4,] 24 29
[5,] 25 30
> C
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 21 23 25 27 29 31 33 35 37 39
[2,] 22 24 26 28 30 32 34 36 38 40
> nrow(A)
[1] 5
> ncol(A)
[1] 2
> dim(A)
[1] 5 2
> # add them
> A + B
[,1] [,2]
[1,] 22 32
[2,] 24 34
[3,] 26 36
[4,] 28 38
[5,] 30 40
> # multiply them
> A * B
[,1] [,2]
[1,] 21 156
[2,] 44 189
[3,] 69 224
[4,] 96 261
[5,] 125 300
> # see if the elements are equal
> A == B
[,1] [,2]
[1,] FALSE FALSE
[2,] FALSE FALSE
[3,] FALSE FALSE
[4,] FALSE FALSE
[5,] FALSE FALSE
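The requirement that every element of a matrix share one type has a practical consequence when converting from a data.frame. As a hedged sketch with hypothetical objects mixed and numMat: as.matrix on a data.frame holding both numbers and text yields a character matrix, so arithmetic is only available when all columns are numeric.

```r
# hypothetical data.frame with one integer and one character column
mixed <- data.frame(num = 1:2, chr = c("a", "b"), stringsAsFactors = FALSE)

# converting forces a single type, here character
m <- as.matrix(mixed)
typeof(m)

# an all-numeric matrix supports element-by-element arithmetic
numMat <- matrix(1:4, nrow = 2)
numMat * 2
```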
Matrix multiplication is a commonly used operation in mathematics, requiring the number of columns of the left-hand matrix to be the same as the number of rows of the right-hand matrix. Both A and B are 5x2 so we will transpose B so it can be used on the right-hand side.
> A %*% t(B)
[,1] [,2] [,3] [,4] [,5]
[1,] 177 184 191 198 205
[2,] 224 233 242 251 260
[3,] 271 282 293 304 315
[4,] 318 331 344 357 370
[5,] 365 380 395 410 425
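The dimension rule can be checked directly. The sketch below rebuilds A and B as defined earlier and compares the shapes of the two valid products; the tryCatch wrapper and the name conformable are additions for illustration, used only to show that multiplying the two 5x2 matrices without a transpose fails.

```r
A <- matrix(1:10, nrow = 5)
B <- matrix(21:30, nrow = 5)

# (5x2) %*% (2x5) gives a 5x5 result
dim(A %*% t(B))

# (2x5) %*% (5x2) gives a 2x2 result
dim(t(A) %*% B)

# (5x2) %*% (5x2) is not conformable and raises an error
conformable <- tryCatch({
    A %*% B
    TRUE
}, error = function(e) FALSE)
conformable
```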
Another similarity with data.frames is that matrices can also have row and column names.
> colnames(A)
NULL
> rownames(A)
NULL
> colnames(A) <- c("Left", "Right")
> rownames(A) <- c("1st", "2nd", "3rd", "4th", "5th")
>
> colnames(B)
NULL
> rownames(B)
NULL
> colnames(B) <- c("First", "Second")
> rownames(B) <- c("One", "Two", "Three", "Four", "Five")
>
> colnames(C)
NULL
> rownames(C)
NULL
> colnames(C) <- LETTERS[1:10]
> rownames(C) <- c("Top", "Bottom")
There are two special vectors, letters and LETTERS, that contain the lower-case and upper-case letters, respectively.
Notice the effect when transposing a matrix and multiplying matrices. Transposing naturally flips the row and column names. Matrix multiplication keeps the row names from the left matrix and the column names from the right matrix.
> t(A)
1st 2nd 3rd 4th 5th
Left 1 2 3 4 5
Right 6 7 8 9 10
> A %*% C
A B C D E F G H I J
1st 153 167 181 195 209 223 237 251 265 279
2nd 196 214 232 250 268 286 304 322 340 358
3rd 239 261 283 305 327 349 371 393 415 437
4th 282 308 334 360 386 412 438 464 490 516
5th 325 355 385 415 445 475 505 535 565 595
An array is essentially a multidimensional vector. It must all be of the same type and individual elements are accessed in a similar fashion using square brackets. The first element is the row index, the second is the column index and the remaining elements are for outer dimensions.
> theArray <- array(1:12, dim = c(2, 3, 2))
> theArray
, , 1
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
, , 2
[,1] [,2] [,3]
[1,] 7 9 11
[2,] 8 10 12
> theArray[1, , ]
[,1] [,2]
[1,] 1 7
[2,] 3 9
[3,] 5 11
> theArray[1, , 1]
[1] 1 3 5
> theArray[, , 1]
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
The main difference between an array and a matrix is that matrices are restricted to two dimensions while arrays can have an arbitrary number.
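One way to see this relationship is to ask R whether an object is a matrix. In the sketch below, twoD and threeD are hypothetical names: a two-dimensional array is treated as a matrix, a three-dimensional one is not, and fixing the third index yields a matrix slice.

```r
# a two-dimensional array is a matrix
twoD <- array(1:6, dim = c(2, 3))
is.matrix(twoD)

# a three-dimensional array is not
threeD <- array(1:12, dim = c(2, 3, 2))
is.matrix(threeD)

# fixing the third index returns a 2x3 matrix
is.matrix(threeD[, , 1])

# a single element: row 2, column 3, second outer slice
threeD[2, 3, 2]
```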
Data come in many types and structures, which can pose a problem for some analysis environments but R handles them with aplomb. The most common data structure is the one-dimensional vector, which forms the basis of everything in R. The most powerful structure is the data.frame, something special in R that most other languages do not have, which handles mixed data types in a spreadsheet-like format. Lists are useful for storing collections of items like a hash in Perl.