Chapter 15

Using Open Source R for Data Science

IN THIS CHAPTER

check Grasping the basic concepts and vocabulary of R

check Exploring objects in R

check Previewing popular R packages

check Playing with more advanced R packages

R is an open-source, free statistical software system that, like Python, has been widely adopted across the data science sector over the past decade. In fact, a somewhat never-ending squabble takes place among data science types about which programming language is best suited for data science. Practitioners who favor R generally do so because of its advanced statistical programming and data visualization capabilities, which they often find harder to match in Python. Both languages have large user bases among data science practitioners. (For more on Python, see Chapter 14.)

You can download the R programming language and the packages that support it from http://cran.r-project.org.

R is not as easy to learn as Python, but R can be more powerful for certain types of advanced statistical analyses. Although R’s learning curve is somewhat steeper than Python’s, the programming language is nonetheless relatively straightforward. All you really need to do is master the basic vocabulary used to describe the language and then it shouldn’t be too hard to get a grasp on how the software works.

R’s Basic Vocabulary

Although the vocabulary associated with R may sound exotic at first, you can quickly master it through practice. For starters, you can run R in one of these two modes:

  • Non-interactive: You run your R code by executing it as a script file (R script files conventionally carry the .R or .r file extension) directly from the command line, typically with the Rscript command.
  • Interactive: You generally work in a software application that interacts with you by prompting you to enter your data and R code. In an R session within interactive mode, you can import datasets or enter the raw data directly; assign names to variables and data objects; and use functions, operators, and built-in iterators to help you gain some insight into your source data.

remember R is an object-oriented language, which simply means that the different parts that comprise the language belong to classes — each class has its own specific definition and role. A specific example of a class is known as an instance of that class, and so it inherits the class’s characteristics. Classes are polymorphic: The subclasses of a class can have their own set of unique behaviors yet share some of the same functionality of the parent class. To illustrate this concept, consider R’s print function: print( ). Because this function is polymorphic, it works slightly differently depending on the class of the object it’s told to print. Thus, this function and many others perform the same general job in many classes but differ slightly according to class. In the section “Observing How Objects Work,” later in this chapter, I elaborate on object-oriented programming and its advantages, but for now I want to introduce objects and their names and definitions.

R works with the following main object types:

  • Vector: A vector is an ordered sequence of elements of the same mode: character (alphanumeric), numeric, or logical (Boolean). A vector can have any length. For instance, the vector A = [“a”, “cat”, “def”] is a vector of length 3 and mode character, and B = [2, 3.1, -5, 33] is a vector of length 4 and mode numeric. (In R code, you'd create them with c("a", "cat", "def") and c(2, 3.1, -5, 33).) To pick out specific elements of these vectors, you could enter the following at the prompt in interactive mode: A[[1]] returns “a”, A[[2]] returns “cat”, A[[3]] returns “def”, B[[1]] returns 2, B[[2]] returns 3.1, B[[3]] returns -5, and B[[4]] returns 33. R views a single number as a vector of length one. Because they can’t be broken down further in R, vectors are also known as atomic vectors (which are not the same as generic vectors that are actually list objects, as I discuss under “Lists”). R’s treatment of atomic vectors gives the language tremendous advantages with respect to speed and efficiency (as I describe in the section “Iterating in R,” later in this chapter).
  • Matrix: Think of a matrix as a collection of vectors. A matrix can be of any mode (numeric, character, or logical), but all elements in the matrix must be of the same mode. A matrix is also characterized by its dimensions. Whereas a vector has only a length, a matrix has exactly two dimensions: number of rows and number of columns.
  • List: A list is a list of items of arbitrary modes, including other lists or vectors.

    technicalstuff Lists are sometimes also called generic vectors because some of the same operations performed on vectors can be performed on lists as well.

  • Data frame: A data frame is a type of list that’s analogous to a table in a database. Technically speaking, a data frame is a list of vectors, each of which is the same length. A row in a table contains the information for an individual record, but elements in the row most likely will not be of the same mode. All elements in a specific column, however, are all of the same mode. Data frames are structured in this same way — each vector in a data frame corresponds to a column in a data table, and each possible index for these vectors is a row.
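To make these four object types a little more concrete, here's a minimal sketch that builds one of each and asks R what it thinks they are. The object names (Pets, Scores, Grid, Mixed, and Roster) are made up for illustration; they don't come from any dataset in this chapter.

```r
# Two atomic vectors: one of mode character, one of mode numeric
Pets <- c("a", "cat", "def")
Scores <- c(2, 3.1, -5, 33)

# A matrix: every element shares one mode, arranged in rows and columns
Grid <- matrix(1:6, nrow = 2, ncol = 3)

# A list: items of arbitrary modes, including another vector
Mixed <- list(name = "Smith", grades = c(10, 8), union = TRUE)

# A data frame: a list of equal-length vectors, one per column
Roster <- data.frame(Name = c("Ann", "Bob"), Grade = c(10, 8))

length(Pets)      # number of elements in the vector
mode(Scores)      # "numeric"
dim(Grid)         # number of rows, then number of columns
class(Mixed)      # "list"
is.list(Roster)   # TRUE, because a data frame is built on a list
```

Notice that length( ) describes a vector, whereas dim( ) describes a matrix by its rows and columns.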

There are two ways to access members of vectors, matrices, and lists in R:

  • Single brackets [ ] give a vector, matrix, or list (respectively) of the element(s) that are indexed.
  • Double brackets [[ ]] give a single element.

R users sometimes disagree about the proper use of the brackets for indexing. Generally speaking, the double bracket has several advantages over the single bracket. For example, the double bracket returns an error message if you enter an index that’s out of bounds. If, however, you want to indicate more than one element of a vector, matrix, or list, you should use a single bracket.
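The difference between the two bracket styles is easier to see in action. The following sketch reuses the vector B from the earlier bullet and adds a small made-up list named Mixed; the tryCatch( ) wrapper is there only so that the out-of-bounds error doesn't stop the rest of the code from running.

```r
B <- c(2, 3.1, -5, 33)

B[2]        # single bracket: a vector holding the second element
B[c(1, 3)]  # single bracket with several indices: a vector of two elements
B[[2]]      # double bracket: the second element itself

# Out-of-bounds indices behave differently
B[10]       # single bracket quietly returns NA
OutOfBounds <- tryCatch(B[[10]], error = function(e) "error raised")
OutOfBounds # the double bracket raised an error, which tryCatch() caught

# With a list, the difference in what comes back is more obvious
Mixed <- list(name = "Smith", grades = c(10, 8))
class(Mixed["grades"])    # "list": a one-item list
class(Mixed[["grades"]])  # "numeric": the vector stored inside the list
```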

Now that you have a grasp of R’s basic vocabulary, you’re probably eager to see how it works with some actual programming. Imagine that you’re using a simple EmployeeRoll dataset and entering the dataset into R by hand. You’d come up with something that looks like Listing 15-1.

LISTING 15-1 Assigning an Object and Concatenating in R

> EmployeeRoll <- data.frame(list(EmployeeName=c("Smith, John","O'Bannon, Tom","Simmons, Sarah"),Grade=c(10,8,12),Salary=c(100000,75000,125000), Union=c(TRUE, FALSE, TRUE)))
> EmployeeRoll
    EmployeeName Grade Salary Union
1    Smith, John    10 100000  TRUE
2  O'Bannon, Tom     8  75000 FALSE
3 Simmons, Sarah    12 125000  TRUE

The combined symbol <- in the first line of Listing 15-1 is pronounced “gets.” It assigns the contents on its right to the name on its left. You can think of this relationship in even simpler terms by considering the following statement, which assigns the number 3 to the variable c:

> c <- 3

Line 1 of Listing 15-1 also exhibits the use of R’s concatenate function — c( ) — which is used to create a vector. The concatenate function is being used to form the atomic vectors that comprise the vector list that makes up the EmployeeRoll data frame. Line 2 of Listing 15-1, EmployeeRoll, instructs R to display the object’s contents on the screen. (Figure 15-1 breaks out the data in more diagrammatic form.)

image

FIGURE 15-1: The relationship between atomic vectors, lists, and data-frame objects.

One other object within R is vitally important: the function. Functions use atomic vectors, matrices, lists, and data frames to accomplish whatever analysis or computation you want done. (In the following section, I discuss functions more thoroughly. For now, you should simply understand their general role.) Each analysis you perform in R may be done in one or more sessions, each of which consists of entering a set of instructions that tells R what you want it to do with the data you’ve entered or imported. In each session, you specify the functions your script uses. These blocks of code then process any input they receive and return an output. A function’s input (also known as a function’s arguments) can be any R object or combination of objects: vectors, matrices, arrays, data frames, tables, or even other functions.

Invoking a function in R is known as calling a function.

technicalstuff Commenting in R works the same as in Python. (Python is covered in Chapter 14.) As an R coder, you’d insert any comments you may have on the code by prefixing them with a hash symbol — the # symbol, in other words.

Delving into Functions and Operators

You can choose one of two methods when writing your functions: a quick, simple method and a more complex, but ultimately more useful, method. Of course, you achieve the same result from choosing either approach, but each method is advantageous in its own ways. If you want to call a function and generate a result as simply and as quickly as possible, and if you don’t think you’ll want to reuse the function later, use Method 1. If you want to write a function that you can call for different purposes and use with different datasets in the future, then use Method 2 instead.

To illustrate the difference between these two methods, consider again the EmployeeRoll dataset defined in Listing 15-1. Say you want to come up with a function you can use to derive a mean value for employee salary. Using the first, simpler method, you call a single function to handle that task: You simply define an operation by writing the name of the function you want to use, and then include whatever argument(s) the function requires in the set of parentheses following the function name. More specifically, you call the built-in statistical function mean( ) to calculate the mean value of employee salaries, as shown here:

> #Method 1 of Calculating the Mean Salary
> MeanSalary1 <- mean(EmployeeRoll$Salary)
> MeanSalary1
[1] 1e+05

In this method, the mean( ) function calculates and saves the average salary, 100,000 (or 1e+05, in scientific notation) as an object (a vector, of course!) named MeanSalary1.

technicalstuff The $ symbol refers R to a particular field in the dataset. In this example, it’s referring R to the Salary field of the EmployeeRoll dataset.
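The $ symbol is one of several ways to reach a column. As a quick sketch, the following code rebuilds the EmployeeRoll data frame from Listing 15-1 and pulls out the Salary field three different ways; all three should hand back the same numeric vector.

```r
EmployeeRoll <- data.frame(
  EmployeeName = c("Smith, John", "O'Bannon, Tom", "Simmons, Sarah"),
  Grade = c(10, 8, 12),
  Salary = c(100000, 75000, 125000),
  Union = c(TRUE, FALSE, TRUE)
)

EmployeeRoll$Salary        # the Salary column, as a numeric vector
EmployeeRoll[["Salary"]]   # the same column, using double brackets
EmployeeRoll[, "Salary"]   # the same column, using row-and-column indexing
EmployeeRoll$Salary[2]     # the second employee's salary
```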

Method 2 illustrates a more complicated but possibly more useful approach. Rather than define only a single operation, as in Method 1, Method 2’s function can define a series of separate operations if they’re needed; therefore, the method can oftentimes get quite complex. In the following chunk of code, the statement MeanSalary2 <- function(x) creates a function named MeanSalary2, which takes one argument, x. The statements between the curly braces ({ }) make up this function. The job of {return(mean(x))} is to calculate the mean of some entity x and then return that value as the function’s result (which R prints on the screen when you call the function at the prompt):

> #Method 2 of Calculating the Mean Salary
> #This method allows the user to create a custom set of instructions for R that can be used again and again.
> MeanSalary2 <- function(x) {return(mean(x))}
>
> MeanSalary2(EmployeeRoll$Salary)
[1] 1e+05

The argument of the function definition isn’t the Salary field from the EmployeeRoll dataset, because this type of function can be called and used for different purposes on different datasets and different fields of said datasets. Also, nothing happens when you finish typing the function and press Return after entering the ending curly brace; in the next line, you just get another prompt (>). That’s because you set up the function correctly. (You know it’s correct because you didn’t get an error message.) You now can call this function when you actually need it — that’s what the last instruction entered at the prompt in the preceding code does. Typing MeanSalary2(EmployeeRoll$Salary) is a function call, and it replaces the function’s placeholder argument x with EmployeeRoll$Salary — a real object that allows the function to generate a solution.

Of course, the function that’s written in Method 2 yields the same mean salary as did the function in Method 1, but the Method 2 function can now be reused for different applications. To illustrate how you’d use this same function on a different dataset, imagine that you have another business with its own payroll. It has five employees with the following salaries: $500,000; $1,000,000; $75,000; $112,000; and $400,000. If you want to call and use the MeanSalary2 function to find the mean salary of these employees, you could simply write the following:

> MeanSalary2(c(500000,1000000,75000,112000,400000))
[1] 417400

As instructed in Method 2, the MeanSalary2 function quickly generates a mean value for this new dataset — in this case, $417,400.
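Method 2 really earns its keep when the function body holds more than one operation. Here's a minimal sketch of that idea; SalarySummary is a hypothetical helper I made up for this illustration, and it's only one of many ways you could bundle several steps into a single reusable function.

```r
# Hypothetical helper: several operations inside one function body
SalarySummary <- function(x) {
  average <- mean(x)                  # step 1: the mean
  spread <- max(x) - min(x)           # step 2: highest minus lowest
  c(Mean = average, Range = spread)   # the last expression is returned
}

# Reuse the same function on the salaries from Listing 15-1
Result <- SalarySummary(c(100000, 75000, 125000))
Result
```

Because the function takes a plain vector as its argument, you can call it on any salary column or any vector of numbers you type in by hand.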

The primary benefit of using functions in R is that they make it easier to write cleaner, more concise code that’s easy to read and more readily reusable. But at the most fundamental level, R is simply using functions to apply operators. Although applying operators and calling functions both serve the same purpose, you can distinguish the two techniques by their differing syntaxes. R uses many of the same operators that are used in other programming languages. Table 15-1 lists the more commonly used operators.

TABLE 15-1 Popular Operators

| Operation | Operator |
| plus | + |
| minus | - |
| times | * |
| divide | / |
| modulo | %% |
| power | ^ |
| greater than | > |
| greater than or equal to | >= |
| less than | < |
| less than or equal to | <= |
| equals | == |
| not equals | != |
| not (logical) | ! |
| and (logical) | & |
| or (logical) | | |
| is assigned; gets | <- |
| is assigned to | -> |

remember Operators act as functions in R. (I warned you that learning the vocabulary of R can be tricky!)

This code snippet shows several examples of where operators are used as functions:

> "<"(2,3)
[1] TRUE
> "<"(100,10)
[1] FALSE
> "+"(100,1)
[1] 101
> "/"(4,2)
[1] 2
> "+"(2,5,6,3,10)
Error in `+`(2, 5, 6, 3, 10) : operator needs one or two arguments

In the preceding code, the comparison operator less than (<) returns a value of either TRUE or FALSE, just as greater than (>) would. Also, do you see the error message that’s generated by the last line of code? That error happened because the operator + can take only one or two arguments, and in that example, I provided five, which is three more than it could handle.

tip You can use the + operator to add two numbers or two vectors. In fact, all arithmetic operators in R can accept both numbers and vectors as arguments. For more on arithmetic operators, check out the following section.
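As a quick illustration of a few operators from Table 15-1 working on whole vectors, consider the following sketch. The vectors a and b are made up for the example. The last two lines show two ways around the two-argument limit of the + operator: the built-in sum( ) function, and Reduce( ), which applies a two-argument function over and over.

```r
a <- c(1, 2, 3)
b <- c(10, 20, 30)

a + b     # element-by-element sum
b %% 3    # remainder after dividing each element by 3
a ^ 2     # each element squared
a < 2     # a comparison returns a logical vector

# "+" accepts at most two arguments, so add several numbers another way
Total1 <- sum(2, 5, 6, 3, 10)
Total2 <- Reduce("+", c(2, 5, 6, 3, 10))
```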

Iterating in R

Because of the way R handles vectors, programming in R offers you an efficient way to handle loops and iterations. Essentially, R has built-in iterators that automatically loop over elements without the added hassle of you having to write out the loops yourself.

To better conceptualize this process, called vectorization, imagine that you want to add a constant c = 3 to a series of three numbers that you’ve stored as a vector, m = [10, 6, 9]. You can use the following code:

> c <- 3
> m <- c(10, 6, 9)
> m <- m + c
> m
[1] 13 9 12

The preceding method works because of an R property known as recyclability (R’s own documentation calls it the recycling rule): If you’re performing operations on two vectors that aren’t the same length, R repeats and reuses the shorter vector to make the operation work. In this example, c was a vector of length one, but R reused its value three times, in effect treating it as a vector of length three, so that the operation could be performed on m.

Here’s the logic behind this process:

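\[
\begin{pmatrix} 10 \\ 6 \\ 9 \end{pmatrix} + \begin{pmatrix} 3 \\ 3 \\ 3 \end{pmatrix} = \begin{pmatrix} 13 \\ 9 \\ 12 \end{pmatrix}
\]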

This method works also because of the vectorization of the + operator, which performs the + operation on the vectors m and c — in effect, looping through each of the vectors to add their corresponding elements.

Here’s another way of writing this process that makes the vectorization of the + operator obvious:

> m <- "+"(m,c)

tip R vectorizes all arithmetic operators, including +, -, /, *, and ^.
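Here's a small sketch that spells out what recycling does behind the scenes. I've named the constant k rather than c, only so that it isn't confused with the c( ) function; that renaming is my choice, not something R requires. The rep( ) call builds by hand the repeated vector that R otherwise creates for you.

```r
m <- c(10, 6, 9)
k <- 3                      # a vector of length one

Recycled <- m + k                    # R recycles k automatically
ByHand <- m + rep(k, length(m))      # the same sum with k repeated explicitly
identical(Recycled, ByHand)

# Recycling also works when the shorter vector has more than one element:
# here c(1, 2) is reused as 1, 2, 1, 2
Longer <- c(10, 20, 30, 40) + c(1, 2)
Longer
```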

When you’re using conditional statements within iterative loops, R uses vectorization to make this process more efficient. If you’ve used other programming languages, you’ve probably seen a structure that looks something like this:

for (y = 1 through 5) { if (3*y <= 4) then z = 1 else z = 0}

This loop iterates the code within the braces ({ }) sequentially for each y equal to 1, 2, 3, 4, and 5. Within this loop, for each y-value, the conditional statement 3*y <= 4 evaluates to either TRUE or FALSE. For y-values that yield TRUE, z is set to 1; otherwise, it’s set to 0. This loop thus generates the following:

| y | 3*y | 3*y <= 4 | z |
| 1 | 3 | TRUE | 1 |
| 2 | 6 | FALSE | 0 |
| 3 | 9 | FALSE | 0 |
| 4 | 12 | FALSE | 0 |
| 5 | 15 | FALSE | 0 |

Now check out how you can do this same thing using R:

> y <- 1:5
> z <- ifelse(3*y <= 4, 1, 0)
> z
[1] 1 0 0 0 0

It’s much more compact, right? In the preceding R code, the y term represents the numerical vector [1, 2, 3, 4, 5]. As was the case earlier, in the R code the operator <= is vectorized, and recyclability is again applied so that the apparent scalar 4 is treated as the length-5 vector [4, 4, 4, 4, 4] to make the vector operation work. As before, only where y = 1 is the condition met; consequently, z[[1]] is 1 and every element of z[2:5] is 0.
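If you'd like to convince yourself that the vectorized version really does the same job as a loop, you can write the loop out the long way in R and compare the results. This is only a sketch for comparison; the names z_loop and z_vec are mine.

```r
y <- 1:5

# The explicit loop, written out the long way
z_loop <- numeric(length(y))   # start with a vector of five zeros
for (i in seq_along(y)) {
  if (3 * y[i] <= 4) {
    z_loop[i] <- 1
  } else {
    z_loop[i] <- 0
  }
}

# The vectorized version: one line, no loop
z_vec <- ifelse(3 * y <= 4, 1, 0)

identical(z_loop, z_vec)       # both approaches give the same vector
```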

tip In R, you often see something that looks like 1:10. This colon operator notates a sequence of numbers — the first number, the last number, and the sequence that lies between them. Thus, the vector 1:10 is equivalent to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 and 2:5 is equal to 2, 3, 4, 5.
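A few quick variations on the colon operator are sketched here. The seq( ) function is a base R relative of the colon operator that lets you choose a step size other than 1.

```r
UpToTen <- 1:10            # 1, 2, ..., 10
Middle <- 2:5              # 2, 3, 4, 5
Countdown <- 10:1          # the colon operator also counts downward
Evens <- seq(2, 10, by = 2)  # 2, 4, 6, 8, 10

length(UpToTen)
Middle
```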

Observing How Objects Work

R’s object-oriented approach makes deploying and maintaining code relatively quick and easy. As part of this object-oriented functionality, objects in R are distinguished by characteristics known as attributes. Each object is defined by its attributes; more specifically, each object is defined by its class attribute.

As an example, the USDA provides data on the percentages of insect-resistant and herbicide-tolerant corn planted per year, for years ranging from 2000 through 2014. You could take this information and use a linear regression function to predict the percentage of herbicide-tolerant corn planted in Illinois during 2000 to 2014, from the percentage of insect-resistant corn planted in Illinois during these same years. The dataset and function are shown in Listing 15-2.

LISTING 15-2 Exploring Objects in R

> GeneticallyEngineeredCorn <- data.frame(list(year=c(2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014),Insect =c(13, 12,18,23,26,25,24,19,13, 10, 15, 14, 14, 4, 3), herbicide=c(3,3,3,4,5,6,12,15,15,15,15,17,18,7,5)))
> GeneticallyEngineeredCorn
   year Insect herbicide
1  2000     13         3
2  2001     12         3
3  2002     18         3
4  2003     23         4
5  2004     26         5
6  2005     25         6
7  2006     24        12
8  2007     19        15
9  2008     13        15
10 2009     10        15
11 2010     15        15
12 2011     14        17
13 2012     14        18
14 2013      4         7
15 2014      3         5
> PredictHerbicide <- lm(GeneticallyEngineeredCorn$herbicide ~ GeneticallyEngineeredCorn$Insect)
> attributes(PredictHerbicide)$names
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"
> attributes(PredictHerbicide)$class
[1] "lm"
> PredictHerbicide$coef
                     (Intercept) GeneticallyEngineeredCorn$Insect
                     10.52165581                      -0.06362591

In Listing 15-2, the expression PredictHerbicide <- lm(GeneticallyEngineeredCorn$herbicide ~ GeneticallyEngineeredCorn$Insect) instructs R to perform a linear regression and assign the results to the PredictHerbicide object. In the linear regression, GeneticallyEngineeredCorn is defined as the source dataset, the Insect column acts as the independent variable, and the herbicide column acts as the dependent variable.

R’s attributes( ) function allows you to get information about an object’s attributes. In this example, typing attributes(PredictHerbicide)$names instructs R to list the names of the components stored in the PredictHerbicide object, and attributes(PredictHerbicide)$class instructs R to identify the object’s class. You can see from Listing 15-2 that the PredictHerbicide object has 12 named components (the attributes listed under $names) and has class lm (which stands for linear model).

R allows you to request specifics on each of these attributes; but to keep this example brief, simply ask R to specify the coefficients of the linear regression equation. Looking back, you can see that this is the first attribute that’s provided for the PredictHerbicide object. To ask R to show the coefficients obtained by fitting the linear model to the data, enter PredictHerbicide$coef, as shown in Listing 15-2, and R returns the following information:

                     (Intercept) GeneticallyEngineeredCorn$Insect
                     10.52165581                      -0.06362591

In plain math, the preceding result translates into the equation shown in Figure 15-2.

image

FIGURE 15-2: Linear regression coefficients from R, translated into a plain math equation.

Translated into mathematical terms, this is equivalent to the following:

Percentage of Genetically Engineered Herbicide-Tolerant Corn = 10.5 – 0.06*Percentage of Genetically Engineered Insect-Resistant Corn

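The slope is close to zero, so the relationship between the two variables appears rather weak; on this evidence, the percentage of genetically engineered, insect-resistant corn planted wouldn’t be a good predictor of the percentage of herbicide-tolerant corn planted. (A small slope alone doesn’t settle the question. To confirm, you’d check a goodness-of-fit measure such as R-squared, which summary(PredictHerbicide) reports.)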
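If you want to double-check how weak the relationship is, you can ask R for the model's R-squared value. The following sketch refits the same regression using the numbers from Listing 15-2, stored under the shorter, made-up name Corn. It also passes the data frame through lm( )'s data argument, which is simply an alternative to the $ notation used in the listing, so the slope is labeled Insect rather than GeneticallyEngineeredCorn$Insect.

```r
# Same numbers as Listing 15-2, under a shorter, made-up name
Corn <- data.frame(
  year = 2000:2014,
  Insect = c(13, 12, 18, 23, 26, 25, 24, 19, 13, 10, 15, 14, 14, 4, 3),
  herbicide = c(3, 3, 3, 4, 5, 6, 12, 15, 15, 15, 15, 17, 18, 7, 5)
)

# The data argument lets the formula use bare column names
Fit <- lm(herbicide ~ Insect, data = Corn)

coef(Fit)                            # intercept and slope, as in Listing 15-2
RSquared <- summary(Fit)$r.squared   # share of variance the model explains
RSquared

# Predicted herbicide-tolerant percentage when Insect is 20
predict(Fit, newdata = data.frame(Insect = 20))
```

An R-squared value near zero means the insect-resistant percentages explain almost none of the variation in the herbicide-tolerant percentages.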

This example also illustrates the polymorphic nature of functions in R: the same function can adapt to the class it’s used with, so it’s applicable to many different classes. In this example, attributes( ) works on the lm (linear model) object just as it would on a histogram object (what the hist( ) function returns) or on an object of nearly any other class. True generic functions, such as print( ) and summary( ), go a step further and run a different, class-specific method for each class.

remember If you want to get a quick orientation when working with instances of an unfamiliar class, R’s polymorphic generic functions can come in handy. These functions generally tend to make R a more efficiently mastered programming language.
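To see this class-based behavior for yourself, try the following sketch. It uses the cars dataset that ships with R, and it defines a tiny, hypothetical class named employee with its own print method. The class name and the print.employee function are made up for this illustration; they aren't part of R.

```r
Numbers <- c(3, 8, 12)
Fit <- lm(dist ~ speed, data = cars)   # cars is a dataset that ships with R

class(Numbers)   # "numeric"
class(Fit)       # "lm"

# summary() is generic: it picks its method based on the object's class
summary(Numbers)          # a six-number summary of the vector
class(summary(Fit))       # "summary.lm", a regression-specific summary

# A hypothetical class of my own, with a print method written for it
print.employee <- function(x, ...) {
  cat("Employee:", x$name, "\n")
  invisible(x)
}
Worker <- structure(list(name = "Smith, John"), class = "employee")
print(Worker)    # R dispatches to print.employee() because of the class
```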

Sorting Out Popular Statistical Analysis Packages

R has a plethora of easy-to-install packages and functions, many of which are quite useful in data science. In an R context, packages are bundles composed of specific functions, data, and code suited for performing specific types of analyses or sets of analyses. The CRAN site lists the current packages available for download at http://cran.r-project.org/web/packages, along with directions on how to download and install them. In this section, I discuss some popular packages and then delve deeper into the capabilities of a few of the more advanced packages that are available.

The robust R packages can help you do things like forecasting, multivariate analysis, and factor analysis. In this section, I quickly present an overview of a few of the more popular packages that are useful for this type of work.

R’s forecast package contains various forecasting functions that you can adapt to use for ARIMA (AutoRegressive Integrated Moving Average) time series forecasting, or for other types of univariate time series forecasts. Or perhaps you want to use R for quality management. You can use R’s Quality Control Charts package (qcc) for quality and statistical process control.

In the practice of data science, you’re likely to benefit from almost any package that specializes in multivariate analysis. If you want to carry out logistic regression, you can use R’s multinomial logit model (mlogit), in which observations of a known class are used to “train” the software so that it can identify classes of other observations whose classes are unknown. (For example, you could use logistic regression to train software so that it can successfully predict customer churn, which you can read about in Chapter 3.)

If you want to use R to take undifferentiated data and identify which of its factors are significant for some specific purpose, you can use factor analysis. To better illustrate the fundamental concept of factor analysis, imagine that you own a restaurant. You want to do everything you can to make sure your customer satisfaction rating is as high as possible, right? Well, factor analysis can help you determine which exact factors have the largest impact on customer satisfaction ratings — those could coalesce into the general factors of ambience, restaurant layout, and employee appearance/attitude/knowledge. With this knowledge, you can work on improving these factors to increase customer satisfaction and, with that, brand loyalty.

remember Few people enter data manually into R. Data is more often imported from either Microsoft Excel or a relational database. You can find driver packages available to import data from various types of relational databases, including RSQLite, RPostgreSQL, RMySQL, and RODBC, as well as packages for many other RDBMSs. One of R’s strengths is how it equips users with the ability to produce publication-quality graphical illustrations or even just data visualizations that can help you understand your data. The ggplot2 package offers a ton of different data visualization options; I tell you more about this package later in this chapter.
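Importing doesn't always require an add-on package. As a minimal sketch, base R's read.csv( ) function reads a comma-separated file, which is a format that Excel can save to. To keep the example self-contained, the code first writes a small made-up CSV file to a temporary location; in real life, you'd point read.csv( ) at your own file instead.

```r
# Write a small CSV to a temporary file so the example is self-contained
CsvPath <- tempfile(fileext = ".csv")
writeLines(
  c("EmployeeName,Grade,Salary",
    "\"Smith, John\",10,100000",
    "\"O'Bannon, Tom\",8,75000"),
  CsvPath
)

# Import the file into a data frame
Imported <- read.csv(CsvPath, stringsAsFactors = FALSE)

str(Imported)            # column names and modes, as R understood them
mean(Imported$Salary)    # the imported data behaves like any data frame
```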

For information on additional R packages, look through the R Project website at www.r-project.org. You can find a lot of existing online documentation to help you identify what packages best suit your needs. Also, coders in R’s active community are making new packages and functions available all the time.

Examining Packages for Visualizing, Mapping, and Graphing in R

If you’ve read earlier sections in this chapter, you should have (I hope!) a basic understanding of how functions, objects, and R’s built-in iterators work. You also should be able to think of a few data science tasks that R can help you accomplish. In the remainder of this chapter, I introduce you to some powerful R packages for data visualization, network graph analysis, and spatial point pattern analysis.

Visualizing R statistics with ggplot2

If you’re looking for a fast and efficient way to produce good-looking data visualizations that you can use to derive and communicate insights from your datasets, look no further than R’s ggplot2 package. It was designed to help you create all different types of data graphics in R, including histograms, scatter plots, bar charts, boxplots, and density plots. It offers a wide variety of design options as well, including choices in colors, layout, transparency, and line density. ggplot2 is useful if you want to do data showcasing, but it’s probably not the best option if you’re looking to do data storytelling or data art. (You can read about these data visualization design options in Chapter 9.)

To better understand how the ggplot2 package works, consider the following example. Figure 15-3 shows a simple scatter plot that was generated using ggplot2. This scatter plot depicts the concentrations (in parts per million, or ppm) of four types of pesticides that were detected in a stream between 2000 and 2013. The scatter plot could have been designed to show only the pesticide concentrations for each year, but ggplot2 provides an option for fitting a regression line to each of the pesticide types. The regression lines are the solid lines shown on the plot. ggplot2 can also present these pesticide types in different colors. The colored areas enclosing the regression lines represent 95 percent confidence intervals for the regression models.

image

FIGURE 15-3: A scatter plot, generated in the ggplot2 package.

The scatter plot chart makes it clear that all pesticides except for ryanoids are showing decreasing stream concentrations. Organochlorides had the highest concentration in 2000, but then exhibited the greatest decrease in concentration over the 13-year period.
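A plot along the lines of Figure 15-3 might be sketched as follows. I don't have the stream data behind the figure, so the numbers here are made up purely for illustration, with two invented pesticide types instead of four. The code assumes that you've installed the ggplot2 package; if you haven't, the plotting step is skipped.

```r
# Made-up numbers for illustration only; not the data behind Figure 15-3
set.seed(1)
Stream <- data.frame(
  year = rep(2000:2013, times = 2),
  pesticide = rep(c("Type A", "Type B"), each = 14)
)
Stream$ppm <- ifelse(Stream$pesticide == "Type A", 50, 30) -
  1.5 * (Stream$year - 2000) + rnorm(nrow(Stream), sd = 2)

# Build the plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  StreamPlot <- ggplot(Stream, aes(x = year, y = ppm, colour = pesticide)) +
    geom_point() +                            # one point per year and type
    geom_smooth(method = "lm", level = 0.95)  # regression line and 95% band
  # Type StreamPlot at the prompt to draw the chart
}
```

Each layer is added with the + operator: geom_point( ) draws the scatter plot, and geom_smooth( ) with method = "lm" fits a straight regression line to each pesticide type and shades its confidence interval.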

Analyzing networks with statnet and igraph

Social networks and social network data volumes have absolutely exploded over the past decade. Therefore, knowing how to make sense of network data has become increasingly important for analysts. Social network analysis skills enable you to analyze social networks to uncover how accounts are connected and the ways in which information is shared across those connections. You can use network analysis methods to determine how fast information spreads across the Internet. You can even use network analysis methods in genetic mapping to better understand how one gene affects and influences the activity of other genes, or use them in hydraulic modeling to figure out how to best design a water-distribution or sewer-collection system.

Two R packages were explicitly written for network analysis purposes: statnet and igraph. You can use either statnet or igraph to collect network statistics or statistics on network components. Figure 15-4 shows sample output from network analysis in R, generated using the statnet package. This output is just a simple network in which the direction of the arrows shows the direction of flow within the network, from one vertex to another. The network has five vertices and nine edges (the connections between the vertices).

image

FIGURE 15-4: A network diagram that was generated using the statnet package.
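To give you a feel for collecting network statistics, here's a minimal sketch that uses igraph rather than statnet. The edge list is made up for illustration and is not the network shown in Figure 15-4. The code assumes that the igraph package is installed; if it isn't, only the base R part runs.

```r
# A made-up directed network, stored as a list of edges
Edges <- data.frame(
  from = c("A", "A", "B", "C", "D"),
  to   = c("B", "C", "C", "D", "E")
)

# Base R can already count outgoing connections per vertex
OutCounts <- table(Edges$from)
OutCounts

# igraph adds proper graph objects and network statistics
if (requireNamespace("igraph", quietly = TRUE)) {
  Net <- igraph::graph_from_data_frame(Edges, directed = TRUE)
  VertexCount <- igraph::vcount(Net)     # number of vertices
  EdgeCount <- igraph::ecount(Net)       # number of edges
  igraph::degree(Net, mode = "out")      # outgoing edges per vertex
}
```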

Mapping and analyzing spatial point patterns with spatstat

If you want to analyze spatial data in R, you can use the spatstat package. This package is most commonly used in analyzing point pattern data, but you can also use it to analyze line patterns, pixels, and linear network data. By default, the package installs with geographical, ecological, and environmental datasets that you can use to support your analyses, if appropriate. With its space-time point pattern analysis capabilities, spatstat can help you visualize a spatiotemporal change in one or several variables over time. The package even comes with 3-dimensional graphing capabilities. Because spatstat is a geographic data analysis package, it’s commonly used in ecology, geosciences, and botany, or for environmental studies, although the package could easily be used for location-based studies that relate to business, logistics, sales, marketing, and more.
