Factors

Recall from Chapter 1, Introducing Machine Learning, that features representing a characteristic with categories of values are known as nominal. Although it is possible to use a character vector to store nominal data, R provides a data structure known as a factor specifically for this purpose. A factor is a special case of vector that is used solely for representing nominal variables. In the medical dataset we are building, we might use a factor to represent gender, because it uses two categories: MALE and FEMALE.

Why not use character vectors? One advantage of factors is that they can be more efficient than character vectors because the category labels are stored only once. Rather than storing MALE, MALE, FEMALE, the computer may store 2, 2, 1, where each number gives the position of a label in the factor's list of levels (FEMALE, MALE). This can save memory. Additionally, certain machine learning algorithms use special routines to handle categorical variables. Coding categorical variables as factors ensures that the model will treat this data appropriately.

To create a factor from a character vector, simply apply the factor() function. For example:

> gender <- factor(c("MALE", "FEMALE", "MALE"))
> gender
[1] MALE   FEMALE MALE
Levels: FEMALE MALE

Notice that when the gender data was displayed, R printed additional information indicating the levels of the gender factor. The levels comprise the set of possible categories the data could take, in this case MALE or FEMALE.
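
To make the storage scheme concrete, the following sketch recreates the gender factor and inspects its two parts with the base R functions levels() and as.integer(). The variable names gender_levels and gender_codes are introduced here only for illustration:

```r
# Recreate the gender factor from the example above
gender <- factor(c("MALE", "FEMALE", "MALE"))

# The category labels are stored once, as the factor's levels.
# By default, the levels are sorted alphabetically.
gender_levels <- levels(gender)

# Each element is stored as an integer position within those levels
gender_codes <- as.integer(gender)

print(gender_levels)  # "FEMALE" "MALE"
print(gender_codes)   # 2 1 2
```

Because FEMALE sorts before MALE, MALE is coded as 2 and FEMALE as 1.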

When factors are created, we can add additional levels that may not appear in the data. Suppose we added another factor for blood type, as shown in the following example:

> blood <- factor(c("O", "AB", "A"),
                  levels = c("A", "B", "AB", "O"))
> blood
[1] O  AB A 
Levels: A B AB O

Notice that when we defined the blood factor for the three patients, we specified an additional vector of four possible blood types using the levels argument. As a result, even though our data include only types O, AB, and A, all four types are stored with the blood factor, as indicated by the output Levels: A B AB O. Storing the additional level allows for the possibility of adding data with the remaining blood type in the future. It also ensures that if we were to create a table of blood types, we would know that type B exists, despite it not being recorded in our data.
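
As a quick check of the last point, the base R table() function can be applied to the factor. This is a minimal sketch; blood_counts is a name chosen here for illustration:

```r
# Recreate the blood factor with all four possible levels
blood <- factor(c("O", "AB", "A"),
                levels = c("A", "B", "AB", "O"))

# table() counts every level, including levels with no observations
blood_counts <- table(blood)
print(blood_counts)

# Type B appears in the table with a count of zero
print(blood_counts[["B"]])  # 0
```

Had blood been stored as a plain character vector, type B would not appear in the table at all.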

Lists

Another special type of vector, a list, is used for storing an ordered set of values. However, unlike a vector that requires all elements to be the same type, a list allows different types of values to be collected. Due to this flexibility, lists are often used to store various types of input and output data and sets of configuration parameters for machine learning models.

To illustrate lists, consider the medical patient dataset we have been constructing, with data for three patients stored in five vectors. If we wanted to display all the data on John Doe (subject 1), we would need to enter five R commands:

> subject_name[1]
[1] "John Doe"
> temperature[1]
[1] 98.1
> flu_status[1]
[1] FALSE
> gender[1]
[1] MALE
Levels: FEMALE MALE
> blood[1]
[1] O
Levels: A B AB O

This seems like a lot of work to display one patient's medical data. The list structure allows us to group all of a patient's data into one object we can use repeatedly.

Similar to creating a vector with c(), a list is created using the list() function, as shown in the following example. One notable difference is that when a list is constructed, you have the option of providing a name (such as fullname in the following example) for each value in the sequence of items. The names are not required, but they allow the list's values to be accessed later by name rather than by numbered position, as with vectors:

> subject1 <- list(fullname = subject_name[1], 
                   temperature = temperature[1],
                   flu_status = flu_status[1],
                   gender = gender[1],
                   blood = blood[1])

Printing a patient's data is now a matter of typing a single command:

> subject1
$fullname
[1] "John Doe"

$temperature
[1] 98.1

$flu_status
[1] FALSE

$gender
[1] MALE
Levels: FEMALE MALE

$blood
[1] O
Levels: A B AB O

Note that the values are labeled with the names we specified in the preceding command. A list can be accessed using the same methods as a vector, but doing so requires remembering the position of each value. For example, the temperature value is the second item in the list:

> subject1[2]
$temperature
[1] 98.1

It is often easier to access temperature directly, by appending a $ and the value's name to the name of the list:

> subject1$temperature
[1] 98.1

Accessing the value by name also ensures that if you add or remove values from the list, you will not accidentally retrieve the wrong list item when the ordering changes.
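
One detail worth noting is that single brackets return a list (which is why the earlier output for subject1[2] is still labeled $temperature), whereas $ returns the value itself. The sketch below uses a shortened version of subject1 and also shows R's double-bracket operator, [[ ]], which is not covered in the text above but behaves like $ for a single item:

```r
# A shortened version of the subject1 list, recreated for illustration
subject1 <- list(fullname = "John Doe",
                 temperature = 98.1,
                 flu_status = FALSE)

as_sublist <- subject1[2]                # single brackets: a list of length 1
as_value   <- subject1[[2]]              # double brackets: the value itself
by_name    <- subject1[["temperature"]]  # same value, looked up by name

print(class(as_sublist))  # "list"
print(class(as_value))    # "numeric"
print(as_value + 1)       # arithmetic works on the extracted value
```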

It is possible to obtain several items in a list by specifying a vector of names:

> subject1[c("temperature", "flu_status")]
$temperature
[1] 98.1

$flu_status
[1] FALSE

Although entire datasets could be constructed using lists (or lists of lists), constructing a dataset is common enough that R provides a specialized data structure specifically for this task.

Data frames

By far the most important R data structure utilized in machine learning is the data frame, a structure analogous to a spreadsheet or database since it has both rows and columns of data. In R terms, a data frame can be understood as a list of vectors or factors, each having exactly the same number of values. Because the data frame is literally a list of vectors, it combines aspects of both vectors and lists.

Let's create a data frame for our patient dataset. Using the patient data vectors we created previously, the data.frame() function combines them into a data frame:

> pt_data <- data.frame(subject_name, temperature, flu_status,
                        gender, blood, stringsAsFactors = FALSE)

You might notice something new in the preceding code: we included an additional parameter, stringsAsFactors = FALSE. In versions of R before 4.0.0, omitting this option causes R to automatically convert every character vector to a factor (from R 4.0.0 onward the default is FALSE, so specifying it is harmless but no longer required). Automatic conversion is occasionally useful, but it is also sometimes excessive. Here, for example, the subject_name field is definitely not categorical data; names are not categories of values. Therefore, setting the stringsAsFactors option to FALSE allows us to convert to factors only where it makes sense for the project.
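
The effect of the option can be seen by setting it explicitly both ways and checking the class of each column. This is a small sketch using two of the patient vectors; the names as_text and as_factor are chosen here for illustration:

```r
# Two of the patient vectors, recreated from the values shown in this section
subject_name <- c("John Doe", "Jane Doe", "Steve Graves")
gender <- factor(c("MALE", "FEMALE", "MALE"))

as_text   <- data.frame(subject_name, gender, stringsAsFactors = FALSE)
as_factor <- data.frame(subject_name, gender, stringsAsFactors = TRUE)

print(class(as_text$subject_name))    # "character"
print(class(as_factor$subject_name))  # "factor"

# Columns that were already factors stay factors in either case
print(class(as_text$gender))          # "factor"
```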

When we display the pt_data data frame, we see that the structure is quite different from the data structures we worked with previously:

> pt_data
  subject_name temperature flu_status gender blood
1     John Doe        98.1      FALSE   MALE     O
2     Jane Doe        98.6      FALSE FEMALE    AB
3 Steve Graves       101.4       TRUE   MALE     A

Compared to the one-dimensional vectors, factors, and lists, a data frame has two dimensions and it is therefore displayed in matrix format. The data frame has one column for each vector of patient data and one row for each patient. In machine learning terms, the columns are the features or attributes and the rows are the examples.
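
The dual nature of the data frame, as both a list of columns and a two-dimensional table, can be confirmed with a few base R functions. The sketch below rebuilds pt_data from the values printed above so that it runs on its own; col_classes is a name introduced here for illustration:

```r
# Rebuild the patient data frame from the values shown in this section
pt_data <- data.frame(
  subject_name = c("John Doe", "Jane Doe", "Steve Graves"),
  temperature  = c(98.1, 98.6, 101.4),
  flu_status   = c(FALSE, FALSE, TRUE),
  gender       = factor(c("MALE", "FEMALE", "MALE")),
  blood        = factor(c("O", "AB", "A"),
                        levels = c("A", "B", "AB", "O")),
  stringsAsFactors = FALSE
)

print(is.list(pt_data))  # TRUE: a data frame is a list of columns
print(dim(pt_data))      # 3 5: three examples (rows), five features (columns)

# Each column keeps its own type
col_classes <- sapply(pt_data, class)
print(col_classes)
```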

To extract entire columns (vectors) of data, we can take advantage of the fact that a data frame is simply a list of vectors. Similar to lists, the most direct way to extract a single element, in this case a vector or column of data, is by referring to it by name. For example, to obtain the subject_name vector, type:

> pt_data$subject_name
[1] "John Doe"     "Jane Doe"     "Steve Graves"

Also similar to lists, a vector of names can be used to extract several columns from a data frame:

> pt_data[c("temperature", "flu_status")]
  temperature flu_status
1        98.1      FALSE
2        98.6      FALSE
3       101.4       TRUE

When we access the data frame in this way, the result is a data frame containing all rows of data for the requested columns. You could also enter pt_data[2:3] to extract the temperature and flu_status columns, but listing the columns by name results in clear and easy-to-maintain R code.

To extract values in the data frame, we can use methods like those we learned for accessing values in vectors, with one important exception: because the data frame is two-dimensional, you must specify the position of both the rows and the columns you would like to extract. Rows are specified first, followed by a comma, followed by the columns, in a format like this: [rows, columns]. As with vectors, both rows and columns are numbered starting from 1.

For instance, to extract the value in the first row and second column of the patient data frame (the temperature value for John Doe), you would enter:

> pt_data[1, 2]
[1] 98.1

If you would like more than one row or column of data, this can be done by specifying vectors for the row and column numbers you would like. The following statement will pull data from rows 1 and 3, and columns 2 and 4:

> pt_data[c(1, 3), c(2, 4)]
  temperature gender
1        98.1   MALE
3       101.4   MALE

To extract all of the rows or columns, rather than listing every one, simply leave the row or column portion blank. For example, to extract all rows of the first column:

> pt_data[, 1]
[1] "John Doe"     "Jane Doe"     "Steve Graves"

To extract all columns for the first row:

> pt_data[1, ]
  subject_name temperature flu_status gender blood
1     John Doe        98.1      FALSE   MALE     O

And to extract everything:

> pt_data[ , ]
  subject_name temperature flu_status gender blood
1     John Doe        98.1      FALSE   MALE     O
2     Jane Doe        98.6      FALSE FEMALE    AB
3 Steve Graves       101.4       TRUE   MALE     A

The methods we have learned for accessing values in lists and vectors can also be used for retrieving data frame rows and columns. For example, columns can be accessed by name rather than position, and negative signs can be used to exclude rows or columns of data. Therefore, the statement:

> pt_data[c(1, 3), c("temperature", "gender")]

is equivalent to:

> pt_data[-2, c(-1, -3, -5)]
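
The equivalence can be checked directly. The following sketch rebuilds pt_data from the values shown earlier and compares the two results with all.equal(); the names by_name and by_exclusion are chosen here for illustration:

```r
# Rebuild the patient data frame from the values shown in this section
pt_data <- data.frame(
  subject_name = c("John Doe", "Jane Doe", "Steve Graves"),
  temperature  = c(98.1, 98.6, 101.4),
  flu_status   = c(FALSE, FALSE, TRUE),
  gender       = factor(c("MALE", "FEMALE", "MALE")),
  blood        = factor(c("O", "AB", "A"),
                        levels = c("A", "B", "AB", "O")),
  stringsAsFactors = FALSE
)

# Select rows 1 and 3 and two columns by name
by_name <- pt_data[c(1, 3), c("temperature", "gender")]

# Exclude row 2 and columns 1, 3, and 5 by position
by_exclusion <- pt_data[-2, c(-1, -3, -5)]

print(by_name)
print(isTRUE(all.equal(by_name, by_exclusion)))  # TRUE
```

The name-based form is usually easier to maintain, since the negative indexes silently select different columns if the column order ever changes.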

To become familiar working with data frames, try practicing these operations with the patient data, or better yet, use your own dataset. These types of operations are crucial to much of the work we will do in later chapters.

Matrixes and arrays

In addition to data frames, R provides other structures that store values in tabular form. A matrix is a data structure that represents a two-dimensional table, with rows and columns of data. R matrixes can contain any single type of data, although they are most often used for mathematical operations and therefore typically store only numeric data.

To create a matrix, simply supply a vector of data to the matrix() function, along with a parameter specifying the number of rows (nrow) or number of columns (ncol). For example, to create a 2x2 matrix storing the first four letters of the alphabet, we can use the nrow parameter to request the data to be divided into two rows:

> m <- matrix(c('a', 'b', 'c', 'd'), nrow = 2)
> m
     [,1] [,2]
[1,] "a"  "c" 
[2,] "b"  "d"

This is equivalent to the matrix produced using ncol = 2:

> m <- matrix(c('a', 'b', 'c', 'd'), ncol = 2)
> m
     [,1] [,2]
[1,] "a"  "c" 
[2,] "b"  "d"

You will notice that R loaded the first column of the matrix first, then loaded the second column. This is called column-major order. To illustrate this further, let's see what happens if we add a few more values to the matrix.

With six values, requesting two rows creates a matrix with three columns:

> m <- matrix(c('a', 'b', 'c', 'd', 'e', 'f'), nrow = 2)
> m
     [,1] [,2] [,3]
[1,] "a"  "c"  "e" 
[2,] "b"  "d"  "f"

Similarly, requesting two columns creates a matrix with three rows:

> m <- matrix(c('a', 'b', 'c', 'd', 'e', 'f'), ncol = 2)
> m
     [,1] [,2]
[1,] "a"  "d" 
[2,] "b"  "e" 
[3,] "c"  "f"
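
If row-by-row filling is preferred, matrix() accepts a byrow argument. It is not used in the examples above, so the following is offered only as a sketch of the contrast with the default column-major order; the names letters6, by_column, and by_row are chosen here for illustration:

```r
letters6 <- c("a", "b", "c", "d", "e", "f")

# Default behavior: fill down each column first (column-major order)
by_column <- matrix(letters6, nrow = 2)

# byrow = TRUE: fill across each row first
by_row <- matrix(letters6, nrow = 2, byrow = TRUE)

print(by_column[1, ])  # "a" "c" "e"
print(by_row[1, ])     # "a" "b" "c"
```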

As with data frames, values in matrixes can be extracted using [row, column] notation. For instance, m[1, 1] will return the value a and m[3, 2] will extract f from the m matrix. Similarly, entire rows or columns can be requested:

> m[1, ]
[1] "a" "d"
> m[, 1]
[1] "a" "b" "c"

Closely related to the matrix structure is the array, which is a multi-dimensional table of data. Where a matrix has rows and columns of values, an array has rows, columns, and any number of additional layers of values. Although we will occasionally use matrixes in later chapters, the use of arrays is outside the scope of this book.
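
For completeness, here is a brief sketch of what an array looks like, using the base R array() function. The dimensions and the name arr are arbitrary choices made for illustration:

```r
# A 2 x 3 x 2 array: two rows, three columns, and two layers
arr <- array(1:12, dim = c(2, 3, 2))

print(dim(arr))      # 2 3 2

# Values are indexed as [row, column, layer]
print(arr[1, 2, 2])  # 9

# Leaving the row and column blank returns an entire layer as a matrix
print(arr[, , 1])
```

As with matrixes, the values are filled in column-major order, one layer at a time.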
