A.2. Starting with R

R implements a dialect of a statistical programming language called S. The original implementation of S evolved into a commercial package called S+. So most of R’s language-design decisions can be traced back to S. To avoid confusion, we’ll mostly just say R when describing features. You might wonder what sort of command and programming environment S/R is. It’s a pretty powerful one, with a nice command interpreter that we encourage you to type directly into.

Work clean

In R or RStudio, it is important to “work clean”—that is, to start with an empty workspace and explicitly bring in the packages, code, and data you want. This ensures you know how to get into your ready-to-go state (as you have to perform or write down the steps to get there) and you aren’t held hostage to state you don’t know how to restore (what we call the “no alien artifact” rule).

To work clean in R, you must turn off any sort of autorestore of the workspace. In “base R” this is done by restarting R with the --no-restore command-line flag set. In RStudio, the Session > Restart R menu option serves a similar role, if the “Restore .RData into workspace on startup” option is not checked.

Working with R and issuing commands to R is in fact scripting or programming. We assume you have some familiarity with scripting (perhaps using Visual Basic, Bash, Perl, Python, Ruby, and so on) or programming (perhaps using C, C#, C++, Java, Lisp, Scheme, and so on), or are willing to use one of our references to learn. We don’t intend to write long programs in R, but we’ll have to show how to issue R commands. R’s programming, though powerful, is a bit different than many of the popular programming languages, but we feel that with a few pointers, anyone can use R. If you don’t know how to use a command, try using the help() call to get at some documentation.

Throughout this book, we’ll instruct you to run various commands in R. This will almost always mean typing the text (or the text following the command prompt >) into the RStudio console window and then pressing Return. For example, if we tell you to type 1/5, you can type that into the console window, and when you press Return, you’ll see a result such as [1] 0.2. The [1] portion of the result is just R’s way of labeling result rows (and is to be ignored), and the 0.2 is the floating-point representation of one-fifth, as requested.

Help

Always try calling help() to learn about commands. For example, help('if') will bring up help about R’s if command.

Let’s try a few commands to help you become familiar with R and its basic data types. R commands can be terminated with a line break or a semicolon (or both), but interactive content isn’t executed until you press Return. The following listing shows a few experiments you should run in your copy of R.

Listing A.1. Trying a few R commands
1
## [1] 1
1/2
## [1] 0.5
'Joe'
## [1] "Joe"
"Joe"
## [1] "Joe"
"Joe"=='Joe'
## [1] TRUE
c()
## NULL
is.null(c())
## [1] TRUE
is.null(5)
## [1] FALSE
c(1)
## [1] 1
c(1, 2)
## [1] 1 2
c("Apple", 'Orange')
## [1] "Apple"  "Orange"
length(c(1, 2))
## [1] 2
vec <- c(1, 2)
vec
## [1] 1 2
# is R’s comment character

The # mark is R’s comment character. It indicates that the rest of the line is to be ignored. We use it to include comments, and also to show output alongside the commands that produced it.

A.2.1. Primary features of R

R commands look like a typical procedural programming language. This is deceptive, as the S language (which the language R implements) was actually inspired by functional programming and also has a lot of object-oriented features.

Assignment

R has five common assignment operators: =, <-, ->, <<-, and ->>. Traditionally, in R, <- is the preferred assignment operator, and = is thought of as a late addition and an amateurish alias for it.

The main advantage of the <- notation is that <- always means assignment, whereas = can mean assignment, list slot binding, function argument binding, or case statement, depending on the context. One mistake to avoid is accidentally inserting a space in the assignment operator:

x <- 2
x < - 3
## [1] FALSE
print(x)
## [1] 2

We actually like = assignment better because data scientists tend to work in more than one language at a time and more bugs are caught early with =. But this advice is too heterodox to burden others with (see http://mng.bz/hfug). We try to consistently use <- in this book, but some habits are hard to break.

Multiline commands in R

R is good with multiline commands. To enter a multiline command, just make sure it would be a syntax error to stop parsing where you break a line. For example, to enter 1+2 as two lines, add the line break after the plus sign and not before. To get out of R’s multiline mode, press Escape. A lot of cryptic R errors are caused by either a statement ending earlier than you wanted (a line break that doesn’t force a syntax error on early termination) or not ending where you expect (needing an additional line break or semicolon).
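As a minimal sketch of the line-break rule, the following compares the two ways of splitting 1+2 across lines. The variable names total and partial are ours, chosen only for illustration.

```r
# Break AFTER the plus sign: the first line is incomplete,
# so R keeps reading and treats both lines as one expression.
total <- 1 +
  2
print(total)
## [1] 3

# Break BEFORE the plus sign: the first line is already a complete
# statement, so R stops there; the second line is a separate expression.
partial <- 1
+ 2
## [1] 2
print(partial)
## [1] 1
```

Note that the second form is not an error. R silently evaluates two statements, which is why this kind of mistake can be hard to spot.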

The = operator is primarily used to bind values to function arguments (and <- can’t be so used) as shown in the next listing.

Listing A.2. Binding values to function arguments
divide <- function(numerator,denominator) { numerator/denominator }
divide(1, 2)
## [1] 0.5

divide(2, 1)
## [1] 2

divide(denominator = 2, numerator = 1)
## [1] 0.5

divide(denominator <- 2, numerator <- 1)  # wrong symbol <-, yields 2, a wrong answer!
## [1] 2

The -> operator is just a left-to-right assignment that lets you write things like 5 -> x. It’s cute, but not game changing.

The <<- and ->> operators are to be avoided unless you actually need their special abilities. They are intended to write values outside of the current execution environment, which is an example of a side effect. Side effects seem great when you need them (often for error tracking and logging), but when overused they make code maintenance, debugging, and documentation much harder. In the following listing, we show a good function that doesn’t have a side effect and a bad function that does have one.

Listing A.3. Demonstrating side effects
x<-1
good <- function() { x <- 5}
good()
print(x)
## [1] 1

bad <- function() { x <<- 5}
bad()
print(x)
## [1] 5

Vectorized operations

Many R operations are called vectorized, which means they work on every element of a vector. These operators are convenient and to be preferred over explicit code like for loops. For example, the vectorized logic operators are ==, &, and |. The next listing shows some examples using these operators on R’s logical types TRUE and FALSE.

Listing A.4. R truth tables for Boolean operators
c(TRUE, TRUE, FALSE, FALSE) == c(TRUE, FALSE, TRUE, FALSE)
## [1]  TRUE FALSE FALSE  TRUE

c(TRUE, TRUE, FALSE, FALSE) & c(TRUE, FALSE, TRUE, FALSE)
## [1]  TRUE FALSE FALSE FALSE

c(TRUE, TRUE, FALSE, FALSE) | c(TRUE, FALSE, TRUE, FALSE)
## [1]  TRUE  TRUE  TRUE FALSE

To test if two vectors are a match, we’d use R’s identical() or all.equal() functions.
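The difference between element-wise comparison and whole-object comparison can be sketched as follows. The vectors a, b, and d are illustrative values of our own.

```r
a <- c(1, 2, 3)
b <- c(1, 2, 3)
d <- c(1, 2, 4)

# == compares element by element and returns one logical per position.
a == d
## [1]  TRUE  TRUE FALSE

# identical() returns a single TRUE/FALSE for the whole object.
identical(a, b)
## [1] TRUE
identical(a, d)
## [1] FALSE

# all.equal() returns TRUE or a description of the difference,
# so wrap it in isTRUE() when a single logical is needed.
isTRUE(all.equal(a, d))
## [1] FALSE

# all.equal() tolerates tiny numeric differences; identical() does not.
isTRUE(all.equal(1, 1 + 1e-10))
## [1] TRUE
identical(1, 1 + 1e-10)
## [1] FALSE
```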

When to use && or || in R

&& and || work only on scalars, not vectors. So always use && and || in if() statements, and never use & or | in if() statements. Similarly prefer & and | when working with general data (which may need these vectorized versions).

R also supplies a vectorized selector called ifelse(,,) (the basic R-language if statement isn’t vectorized).
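A small sketch of how these pieces fit together, using illustrative values of our own: ifelse() for per-element choices, && inside if(), and & for whole vectors.

```r
v <- c(-2, 1, 3)

# ifelse() is vectorized: it picks a value for every element of v.
labels <- ifelse(v > 0, 'positive', 'negative')
print(labels)
## [1] "negative" "positive" "positive"

# if() expects a single TRUE/FALSE, so combine scalar tests with && or ||.
x <- 5
if (is.numeric(x) && x > 0) {
  print('x is a positive number')
}
## [1] "x is a positive number"

# & and | are the element-wise versions for whole vectors.
in_range <- (v > -5) & (v < 2)
print(in_range)
## [1]  TRUE  TRUE FALSE
```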

R’s object system

Every item in R is an object and has a type definition called a class. You can ask for the type of any item using the class() command. For example, class(c(1,2)) is numeric. R in fact has two object-oriented systems. The first one is called S3 and is closest to what a C++ or Java programmer would expect. In the S3 class system, you can have multiple commands with the same name. For example, there may be more than one command called print(). Which print() actually gets called when you type print(x) depends on what type x is at runtime. S3 is a unique object system in that methods are global functions, and are not strongly associated with object definitions, prototypes, or interfaces. R also has a second object-oriented system called S4, which supports more detailed classes and allows methods to be picked based on the types of more than just the first argument. Unless you’re planning on becoming a professional R programmer (versus a professional R user or data scientist), we advise not getting into the complexities of R’s object-oriented systems. Mostly you just need to know that most R objects define useful common methods like print(), summary(), and class(). We also advise leaning heavily on the help() command. To get class-specific help, you use a notation method.class; for example, to get information on the predict() method associated with objects of class glm, you would type help(predict.glm).
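To make the S3 dispatch idea concrete, here is a minimal sketch. The generic describe() and its methods are hypothetical names invented for this illustration; they are not part of R.

```r
# describe() is a hypothetical S3 generic, invented for this illustration.
describe <- function(x, ...) UseMethod('describe')

# Methods are ordinary global functions named generic.class.
describe.numeric <- function(x, ...) 'a numeric vector'
describe.character <- function(x, ...) 'a character vector'
describe.default <- function(x, ...) 'some other kind of object'

# Which method runs depends on the class of the argument at run time.
describe(c(1, 2))
## [1] "a numeric vector"
describe('Joe')
## [1] "a character vector"
describe(TRUE)
## [1] "some other kind of object"
```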

R’s share-by-value characteristics

In R each reference to a value is isolated: changes to one reference are not seen by other references. This is a useful feature similar to what other languages term “call by value semantics,” or even the immutable data types of some languages.

This means, from the programmer’s point of view, that each variable or each argument of a function behaves as if it were a separate copy of what was passed to the function. Technically, R’s calling semantics are actually a combination of references and what is called lazy copying. But until you start directly manipulating function argument references, you see what looks like call-by-value behavior.

Share-by-value is a great choice for analysis software: it makes for fewer side effects and bugs. But most programming languages aren’t share-by-value, so share-by-value semantics often come as a surprise. For example, many professional programmers rely on changes made to values inside a function being visible outside the function. Here’s an example of call-by-value at work.

Listing A.5. Call-by-value effect
a <- c(1, 2)
b <- a

print(b)
## [1] 1 2

a[[1]] <- 5     # 1

print(a)
## [1] 5 2

print(b)        # 2
## [1] 1 2

  • 1 Alters a. This is implemented by building an entirely new vector and reassigning a to refer to this new vector. The old value remains as it was, and any references continue to see the old, unaltered value.
  • 2 Notice that b’s value is not changed.
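The same behavior applies to function arguments. In this sketch, set_first() is a hypothetical helper of our own; the change it makes is visible only through its return value.

```r
# set_first() is a hypothetical helper for this illustration.
set_first <- function(v) {
  v[[1]] <- 5   # changes only the function's own copy of v
  v
}

original <- c(1, 2)
changed <- set_first(original)

print(original)   # the caller's vector is untouched
## [1] 1 2
print(changed)    # the altered copy is only available as the return value
## [1] 5 2
```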

A.2.2. Primary R data types

While the R language and its features are interesting, it’s the R data types that are most responsible for R’s style of analysis. In this section, we’ll discuss the primary data types and how to work with them.

Vectors

R’s most basic data type is the vector, or array. In R, vectors are arrays of same-typed values. They can be built with the c() notation, which converts a comma-separated list of arguments into a vector (see help(c)). For example, c(1,2) is a vector whose first entry is 1 and second entry is 2. Try typing print(c(1,2)) into R’s command prompt to see what vectors look like and notice that print(class(1)) returns numeric, which is R’s name for numeric vectors.

R is fairly unique in having no scalar types. A single number such as the number 5 is represented in R as a vector with exactly one entry (5).

Numbers in R

Numbers in R are primarily represented in double-precision floating-point. This differs from some programming languages, such as C and Java, that default to integers. This means you don’t have to write 1.0/5.0 to prevent 1/5 from being rounded down to 0, as you would in C or Java. It also means that some fractions aren’t represented perfectly. For example, 1/5 in R is actually (when formatted to 20 digits by sprintf("%.20f", 1 / 5)) 0.20000000000000001110, not the 0.2 it’s usually displayed as. This isn’t unique to R; this is the nature of floating-point numbers. A good example to keep in mind is 1 / 5 != 3 / 5 - 2 / 5, because 1 / 5 - (3 / 5 - 2 / 5) is equal to 5.55e-17.
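The following sketch reproduces the floating-point behavior described above and shows one common way to compare such values, using a tolerance instead of exact equality. The variable name gap is ours.

```r
print(1 / 5)
## [1] 0.2

# Formatting to 20 digits reveals the stored approximation.
sprintf("%.20f", 1 / 5)
## [1] "0.20000000000000001110"

# Exact comparison fails because of the rounding in each division.
1 / 5 == 3 / 5 - 2 / 5
## [1] FALSE

# The difference is tiny but not zero.
gap <- 1 / 5 - (3 / 5 - 2 / 5)
abs(gap) < 1e-15
## [1] TRUE

# all.equal() compares numbers up to a small tolerance.
isTRUE(all.equal(1 / 5, 3 / 5 - 2 / 5))
## [1] TRUE
```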

R doesn’t generally expose any primitive or scalar types to the user. For example, the number 1.1 is actually converted into a numeric vector with a length of 1 whose first entry is 1.1. Note that print(class(1.1)) and print(class(c(1.1, 0))) are identical. Note also that length(1.1) and length(c(1.1)) are also identical. What we call scalars (or single numbers or strings) are in R just vectors with a length of 1. R’s most common types of vectors are these:

  • Numeric— Arrays of double-precision floating-point numbers.
  • Character— Arrays of strings.
  • Factor— Arrays of strings chosen from a fixed set of possibilities (called enums in many other languages).
  • Logical— Arrays of TRUE/FALSE.
  • NULL— The empty vector c() (which always has type NULL). Note that length(NULL) is 0 and is.null(c()) is TRUE.

R uses square-bracket notation (and others) to refer to entries in vectors.[7] Unlike most modern programming languages, R numbers vectors starting from 1 and not 0. Here’s some example code showing the creation of a variable named vec holding a numeric vector. This code also shows that most R data types are mutable, in that we’re allowed to change them:

[7] The most commonly used index notation is []. When extracting single values, we prefer the double square-bracket notation [[]], as it signals out-of-bounds errors in situations where [] doesn’t.

vec <- c(2, 3)
vec[[2]] <- 5
print(vec)
## [1] 2 5

Number sequences

Number sequences are easy to generate with commands like 1:10. Watch out: the : operator binds more tightly than arithmetic operators such as * and +, so you need to get in the habit of using extra parentheses. For example, 1:5 * 4 + 1 means (1:5) * 4 + 1, not 1:21. For sequences of constants, try using rep().
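A short sketch of how R parses sequence expressions, and how parentheses change the result. The variable n is an illustrative value of our own.

```r
# The sequence is built first, then the arithmetic is applied to it.
1:5 * 4 + 1
## [1]  5  9 13 17 21

# Parentheses are needed to get the sequence from 1 to 21.
length(1:(5 * 4 + 1))
## [1] 21

# A common trap when looping up to n - 1.
n <- 3
1:n - 1      # parsed as (1:n) - 1
## [1] 0 1 2
1:(n - 1)
## [1] 1 2

# rep() builds a sequence of constants.
rep(7, 3)
## [1] 7 7 7
```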

Lists

In addition to vectors (created with the c() operator), R has two types of lists. Lists, unlike vectors, can store more than one type of object, so they’re the preferred way to return more than one result from a function. The basic R list is created with the list() operator, as in list(6, 'fred'). Basic lists aren’t really that useful, so we’ll skip over them to named lists. In named lists, each item has a name. An example of a named list would be created with list('a' = 6, 'b' = 'fred'). Usually the quotes on the list names are left out, but the list names are always constant strings (not variables or other types). In R, named lists are essentially the only convenient mapping structure (the other mapping structure being environments, which give you mutable lists). The ways to access items in lists are the $ operator and the [[]] operator (see help('[[') in R’s help system). Here’s a quick example.

Listing A.6. Examples of R indexing operators
x <- list('a' = 6, b = 'fred')
names(x)
## [1] "a" "b"
x$a
## [1] 6
x$b
## [1] "fred"
x[['a']]
## [1] 6

x[c('a', 'a', 'b', 'b')]
## $a
## [1] 6
##
## $a
## [1] 6
##
## $b
## [1] "fred"
##
## $b
## [1] "fred"

Labels use case-sensitive partial match

The R list label operators (such as $) allow partial matches. For example, list('abe' = 'lincoln')$a returns lincoln, which is fine and dandy until you add a slot actually labeled a to such a list and your older code breaks. In general, it would be better if list('abe'='lincoln')$a was an error, so you’d have a chance of being signaled of a potential problem the first time you made such an error. You can ask R to warn about such partial matches with options(warnPartialMatchDollar = TRUE), but this only issues a warning rather than an error, and it’s likely to produce warnings from any other code that’s quietly depending on such shorthand notation.

As you see in our example, the [] operator is vectorized, which makes lists incredibly useful as translation maps.
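As a sketch of the translation-map idea, a named list can translate a whole vector of keys in one step. The names fruit_map and codes are hypothetical, chosen for this illustration.

```r
# fruit_map is a hypothetical lookup table for this illustration.
fruit_map <- list(a = 'apple', b = 'banana', c = 'cherry')
codes <- c('b', 'a', 'b', 'c')

# [] looks up every code at once and returns a list of matches;
# unlist() flattens that list into a character vector.
fruits <- unlist(fruit_map[codes], use.names = FALSE)
print(fruits)
## [1] "banana" "apple"  "banana" "cherry"
```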

Selection: [[]] versus []

[[]] is the strictly correct operator for selecting a single element from a list or vector. At first glance, [] appears to work as a convenient alias for [[]], but this is not strictly correct for single-value (scalar) arguments. [] is actually an operator that can accept vectors as its argument (try list(a='b')[c('a','a')]) and return nontrivial vectors (vectors of length greater than 1, or vectors that don’t look like scalars) or lists. The operator [[]] has different (and better) single-element semantics for both lists and vectors (though, unfortunately, [[]] has different semantics for lists than for vectors).

Really, you should never use [] when [[]] can be used (when you want only a single result). Everybody, including the authors, forgets this and uses [] way more often than is safe. For lists, the main issue is that [[]] usefully unwraps the returned values from the list type (as you’d want: compare class(list(a='b')['a']) to class(list(a='b')[['a']])). For vectors, the issue is that [] fails to signal out-of-bounds access (compare c('a','b')[[7]] to c('a','b')[7] or, even worse, c('a','b')[NA]).
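The comparisons suggested above can be run as follows. This is a minimal sketch; tryCatch() is used only so the out-of-bounds error can be shown without halting the script.

```r
l <- list(a = 'b')
class(l['a'])     # [] returns a list of length 1
## [1] "list"
class(l[['a']])   # [[]] unwraps the stored value
## [1] "character"

v <- c('a', 'b')
v[7]              # out of bounds, yet no error: silently returns NA
## [1] NA

# [[]] raises an error for the same index; tryCatch() captures its message.
msg <- tryCatch(v[[7]], error = function(e) conditionMessage(e))
print(msg)
## [1] "subscript out of bounds"
```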

Data frames

R’s central data structure is the data frame. A data frame is organized into rows and columns. It is a list of columns of different types. Each row has a value for each column. An R data frame is much like a database table: the column types and names are the schema, and the rows are the data. In R, you can quickly create a data frame using the data.frame() command. For example, d = data.frame(x=c(1,2),y=c('x','y')) is a data frame.

The correct way to read a column out of a data frame is with the [[]] or $ operators, as in d[['x']], d$x or d[[1]]. Columns are also commonly read with the d[, 'x'] or d['x'] notations. Note that not all of these operators return the same type (some return data frames, and some return arrays).

Sets of rows can be accessed from a data frame using the d[rowSet,] notation, where rowSet is a vector of Booleans with one entry per data row. We prefer to use d[rowSet,, drop = FALSE] or subset(d,rowSet), as they’re guaranteed to always return a data frame and not some unexpected type like a vector (which doesn’t support all of the same operations as a data frame).[8] Single rows can be accessed with the d[k,] notation, where k is a row index. Useful functions to call on a data frame include dim(), summary(), and colnames(). Finally, individual cells in the data frame can be addressed using a row-and-column notation, like d[1, 'x'].

[8] To see the problem, type class(data.frame(x = c(1, 2))[1, ]), which reports the class as numeric, instead of as data.frame.
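The following sketch shows which access notations return a plain vector and which return a data frame. The data frames d and d1 and the selector rows are illustrative values of our own.

```r
d <- data.frame(x = c(1, 2), y = c('x', 'y'))

# Column access: these three return the column itself (a numeric vector).
class(d[['x']])
## [1] "numeric"
class(d$x)
## [1] "numeric"
class(d[, 'x'])
## [1] "numeric"

# Single-bracket access by name returns a one-column data frame.
class(d['x'])
## [1] "data.frame"

# Row access on a one-column data frame silently drops to a vector
# unless drop = FALSE (or subset()) is used.
d1 <- data.frame(x = c(1, 2))
rows <- c(TRUE, FALSE)
class(d1[rows, ])
## [1] "numeric"
class(d1[rows, , drop = FALSE])
## [1] "data.frame"
class(subset(d1, rows))
## [1] "data.frame"
```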

From R’s point of view, a data frame is a single table that has one row per example you’re interested in and one column per feature you may want to work with. This is, of course, an idealized view. The data scientist doesn’t expect to be so lucky as to find such a dataset ready for them to work with. In fact, 90% of the data scientist’s job is figuring out how to transform data into this form. We call this task data tubing, and it involves joining data from multiple sources, finding new data sources, and working with business and technical partners. But the data frame is exactly the right abstraction. Think of a table of data as the ideal data scientist API. It represents a nice demarcation between preparatory steps that work to get data into this form and analysis steps that work with data in this form.

Data frames are essentially lists of columns. This makes operations like printing summaries or types of all columns especially easy, but makes applying batch operations to all rows less convenient. R matrices are organized as rows, so converting to/from matrices (and using transpose t()) is one way to perform batch operations on data frame rows. But be careful: converting a data frame to a matrix using something like the model.matrix() command (to change categorical variables into multiple columns of numeric level indicators) doesn’t track how multiple columns may have been derived from a single variable and can potentially confuse algorithms that have per-variable heuristics (like stepwise regression and random forests).

Data frames would be useless if the only way to populate them was to type them in. The two primary ways to populate data frames are R’s read.table() command and database connectors (which we’ll cover in section A.3).

Matrices

In addition to data frames, R supports matrices. Matrices are two-dimensional structures addressed by rows and columns. Matrices differ from data frames in that matrices are lists of rows, and every cell in a matrix has the same type. When indexing matrices, we advise using the drop = FALSE notation; without this, selections that should return single-row matrices instead return vectors. This would seem okay, except that in R, vectors aren’t substitutable for matrices, so downstream code that’s expecting a matrix will mysteriously crash at run time. And the crash may be rare and hard to demonstrate or find, as it only happens if the selection happens to return exactly one row.
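A minimal sketch of the single-row problem and the drop = FALSE fix, using a small matrix of our own:

```r
m <- matrix(1:6, nrow = 2)   # a 2-by-3 integer matrix

# Selecting one row without drop = FALSE returns a plain vector.
one_row <- m[1, ]
is.matrix(one_row)
## [1] FALSE

# With drop = FALSE the result stays a 1-by-3 matrix.
kept <- m[1, , drop = FALSE]
is.matrix(kept)
## [1] TRUE
dim(kept)
## [1] 1 3
```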

NULL and NA (not available) values

R has two special values: NULL and NA. In R, NULL is just an alias for c(), the empty vector. It carries no type information, so an empty vector of numbers is the same type as an empty vector of strings (a design flaw, but consistent with how most programming languages handle so-called null pointers). NULL can only occur where a vector or list is expected; it can’t represent missing scalar values (like a single number or string).

For missing scalar values, R uses a special symbol, NA, which indicates missing or unavailable data. In R, NA behaves like the not-a-number or NaN seen in most floating-point implementations (except NA can represent any scalar, not just a floating-point number). The value NA represents a nonsignaling error or missing value. Nonsignaling means that you don’t get a printed warning, and your code doesn’t halt (not necessarily a good thing). NA is inconsistent about whether it propagates: 2+NA is NA, as we’d hope, but paste(NA,'b') is a valid non-NA string.

Even though class(NA) claims to be logical, NAs can be present in any vector, list, slot, or data frame.
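The behaviors described in this section can be checked directly. This is a short sketch; the vector v is an illustrative value of our own.

```r
2 + NA            # NA propagates through arithmetic
## [1] NA
paste(NA, 'b')    # but paste() turns it into the string "NA"
## [1] "NA b"
class(NA)
## [1] "logical"

v <- c(1, NA, 3)
is.na(v)          # is.na() is the reliable way to find missing values
## [1] FALSE  TRUE FALSE
mean(v)           # one NA makes the whole summary NA
## [1] NA
mean(v, na.rm = TRUE)
## [1] 2

length(NULL)      # NULL is the empty vector
## [1] 0
```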

Factors

In addition to a string type called character, R also has a special “set of strings” type similar to what Java programmers would call an enumerated type. This type is called a factor, and a factor is just a string value guaranteed to be chosen from a specified set of values called levels. The advantage of factors is they are exactly the right data type to represent the different values or levels of categorical variables.

The following example shows the string red encoded as a factor (note how it carries around the list of all possible values) and a failing attempt to encode apple into the same set of factors (returning NA, R’s special not-a-value symbol).

Listing A.7. R’s treatment of unexpected factor levels
factor('red', levels = c('red', 'orange'))
## [1] red
## Levels: red orange

factor('apple', levels = c('red', 'orange'))
## [1] <NA>
## Levels: red orange

Factors are useful in statistics, and you’ll want to convert most string values into factors at some point in your data science process. Usually, the later you do this, the better (as you tend to know more about the variation in your data as you work), so we suggest using the optional argument stringsAsFactors = FALSE when reading data or creating new data frames (this has been the default since R 4.0.0).
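A minimal sketch of the "convert late" workflow: keep strings as character data when the data frame is built, then convert to a factor once the set of levels is known. The column name color and its values are illustrative.

```r
# Keep the column as plain strings when the data frame is created.
d <- data.frame(color = c('red', 'orange', 'red'),
                stringsAsFactors = FALSE)
class(d$color)
## [1] "character"

# Convert to a factor later, with explicitly chosen levels.
d$color <- factor(d$color, levels = c('red', 'orange'))
class(d$color)
## [1] "factor"
levels(d$color)
## [1] "red"    "orange"
```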

Making sure factor levels are consistent

In this book, we often prepare training and test data separately (simulating the fact that new data will usually be prepared after the original training data). For factors, this introduces two fundamental issues: consistency of numbering of factor levels during training and application, and discovery of new factor level values during application. For the first issue, it’s the responsibility of R code to make sure factor numbering is consistent. The following listing demonstrates that lm() correctly handles factors as strings and is consistent even when a different set of factors is discovered during application (this is something you may want to double-check for non-core libraries). For the second issue, discovering a new factor during application is a modeling issue. The data scientist either needs to ensure this can’t happen or develop a coping strategy (such as falling back to a model not using the variable in question).

Listing A.8. Confirming lm() encodes new strings correctly
d <- data.frame(x = factor(c('a', 'b', 'c')),
                y = c(1, 2, 3))
m <- lm(y ~ 0 + x, data = d)                                      # 1
print(predict(m,                                                  # 2
   newdata = data.frame(x = 'b'))[[1]])
## [1] 2
print(predict(m,
   newdata = data.frame(x = factor('b', levels = c('b'))))[[1]])  # 3
## [1] 2

  • 1 Builds a data frame and linear model mapping a,b,c to 1,2,3
  • 2 Shows that the model gets the correct prediction for b as a string
  • 3 Shows that the model gets the correct prediction for b as a factor, encoded with a different number of levels. This shows that lm() is correctly treating factors as strings.
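For the second issue (new levels appearing during application), one possible coping strategy is to detect unseen values before calling predict(). The following is only a sketch under our own assumptions: the names train, new_data, novel, and known are hypothetical, and dropping the unseen rows is just one option among several.

```r
train <- data.frame(x = factor(c('a', 'b', 'c')), y = c(1, 2, 3))
m <- lm(y ~ 0 + x, data = train)

# New data arrives with a level ('d') that was never seen in training.
new_data <- data.frame(x = c('b', 'd'), stringsAsFactors = FALSE)

# Find the values that the model has no coefficient for.
novel <- setdiff(unique(new_data$x), levels(train$x))
print(novel)
## [1] "d"

# One coping strategy: score only the rows with known levels.
known <- new_data[new_data$x %in% levels(train$x), , drop = FALSE]
print(predict(m, newdata = known)[[1]])
## [1] 2
```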

Slots

In addition to lists, R can store values by name in object slots. Object slots are addressed with the @ operator (see help('@')). To list all the slots on an object, try slotNames(). Slots and objects (in particular the S3 and S4 object systems) are advanced topics we don’t cover in this book. You need to know that R has object systems, as some packages will return them to you, but you shouldn’t be creating your own objects early in your R career.
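So that slots are recognizable when a package hands you one, here is a minimal sketch. The S4 class Person is hypothetical and is defined only to have something to inspect; in practice you would usually receive such objects from a package.

```r
# "Person" is a hypothetical S4 class, defined only for illustration.
setClass('Person', representation(name = 'character', age = 'numeric'))
p <- new('Person', name = 'Joe', age = 30)

# Slots are read with the @ operator.
p@name
## [1] "Joe"
p@age
## [1] 30

# slotNames() lists every slot on the object.
slotNames(p)
## [1] "name" "age"
```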
