Now that we have seen some of R's basic functionality, it is time to load in data. As with everything in R, there are numerous ways to get data; the most common is probably reading comma-separated values (CSV) files. Of course, there are many other options that we will cover as well.
The best way to read data from a CSV file is to use read.table. It might be tempting to use read.csv, but that is more trouble than it is worth, and all it does is call read.table with some arguments preset. The result of using read.table is a data.frame.
The first argument to read.table is the full path of the file to be loaded. The file can be sitting on disk or even on the Web. For the purposes of this book we will read from the Web. Any CSV will work, but we have posted an incredibly simple CSV at http://www.jaredlander.com/data/Tomato%20First.csv. Let's read that into R using read.table.
> theUrl <- "http://www.jaredlander.com/data/Tomato%20First.csv"
> tomato <- read.table(file = theUrl, header = TRUE, sep = ",")
This can now be seen using head.
> head(tomato)
Round Tomato Price Source Sweet Acid Color Texture
1 1 Simpson SM 3.99 Whole Foods 2.8 2.8 3.7 3.4
2 1 Tuttorosso (blue) 2.99 Pioneer 3.3 2.8 3.4 3.0
3 1 Tuttorosso (green) 0.99 Pioneer 2.8 2.6 3.3 2.8
4 1 La Fede SM DOP 3.99 Shop Rite 2.6 2.8 3.0 2.3
5 2 Cento SM DOP 5.49 D Agostino 3.3 3.1 2.9 2.8
6 2 Cento Organic 4.99 D Agostino 3.2 2.9 2.9 3.1
Overall Avg.of.Totals Total.of.Avg
1 3.4 16.1 16.1
2 2.9 15.3 15.3
3 2.9 14.3 14.3
4 2.8 13.4 13.4
5 3.1 14.4 15.2
6 2.9 15.5 15.1
As mentioned before, the first argument is the file name, in quotes (or as a character variable). Notice how we explicitly used the argument names file, header and sep. As discussed in Section 4.5, function arguments can be specified positionally, without their names, but naming the arguments is good practice. The second argument, header, indicates that the first row of data holds the column names. The third argument gives the delimiter separating data cells. Changing this to other values such as "\t" (tab delimited) or ";" (semicolon delimited) allows it to read other types of files.
One often unknown argument that is helpful to use is stringsAsFactors. Setting this to FALSE (the default is TRUE) prevents character columns from being converted to factor columns. This both saves computation time, which can be dramatic for a large dataset with many character columns with many unique values, and keeps the columns as character data, which are easier to work with.
Although we did not mention this argument in Section 5.1, stringsAsFactors can also be used in data.frame. Re-creating that first bit of code results in an easier-to-use "Sport" column.
> x <- 10:1
> y <- -4:5
> q <- c("Hockey", "Football", "Baseball", "Curling", "Rugby",
+ "Lacrosse", "Basketball", "Tennis", "Cricket", "Soccer")
> theDF <- data.frame(First=x, Second=y, Sport=q, stringsAsFactors=FALSE)
> theDF$Sport
[1] "Hockey" "Football" "Baseball" "Curling" "Rugby"
[6] "Lacrosse" "Basketball" "Tennis" "Cricket" "Soccer"
There are numerous other arguments to read.table, the most useful being quote and colClasses, which specify the character used for enclosing cells and the data type for each column, respectively.
Sometimes CSVs (or tab-delimited files) are poorly built, with the cell separator used inside a cell. In this case read.csv2 (or read.delim2) should be used instead of read.table.
While Excel may be the world's most popular data analysis tool, it is unfortunately difficult to read Excel data into R. The simplest method is to use Excel (or another spreadsheet program) to convert the Excel file to a CSV file. That might sound like a cop-out, but it is the easiest method to use. The R community abounds with hacks to get data from Excel into R, such as using the clipboard to copy and paste, but those are inelegant at best and can fail with large amounts of data.
A number of packages exist to tackle this problem, such as gdata, XLConnect, xlsReadWrite and others, but they all have some onerous requirement such as Java, Perl or 32-bit R, which is neither preferable nor so common anymore. The RODBC package has a function, odbcConnectExcel2007, that reads Excel files but requires a DSN connection,¹ which is not a feasible everyday strategy.
1. A DSN is a data source connection used to describe communication to a data source, often a database.
We understand that Excel 2007 files are essentially XML files. This would mean that they could theoretically be parsed using the XML package, but we have not seen this done as of yet.
Databases arguably store the vast majority of the world's data. Most of these, whether they be Microsoft SQL Server, DB2, MySQL or Microsoft Access, provide an ODBC connection. Accordingly, R makes use of ODBC through the aptly named RODBC package, available from CRAN. Like any other package, it must be installed and loaded before use.
> require(RODBC)
The first step to reading from a database is to create a DSN. This differs by operating system but should result in a string name for that connection. That name is passed to odbcConnect to create a connection for R to use. Optional, but common, arguments are uid and pwd for the database username and password, respectively.
> db <- odbcConnect("QV Training")
At this point we are ready to run a query on that database using sqlQuery. This can be any valid SQL query of arbitrary complexity. sqlQuery returns a data.frame just like any other. Fortunately, sqlQuery has the stringsAsFactors argument first seen in Section 6.1. Again, setting this to FALSE is usually a good idea, as it will save processing time.
> # simple SELECT * query from one table
> ordersTable <- sqlQuery(db, "SELECT * FROM Orders",
+ stringsAsFactors=FALSE)
> # simple SELECT * query from one table
> detailsTable <- sqlQuery(db, "SELECT * FROM [Order Details]",
+ stringsAsFactors=FALSE)
> # do a join between the two tables
> longQuery <- "SELECT * FROM Orders, [Order Details]
+ WHERE Orders.OrderID = [Order Details].OrderID"
> detailsJoin <- sqlQuery(db, longQuery, stringsAsFactors=FALSE)
We can easily check the results of these queries by viewing the resulting data.frames.
> head(ordersTable)
OrderID OrderDate CustomerID EmployeeID ShipperID Freight
1 10248 2008-06-29 4 2 2 43.48
2 10249 2007-06-29 79 7 2 29.20
3 10250 2008-07-03 34 2 2 79.17
4 10251 2007-12-02 1 7 2 43.41
5 10252 2008-04-04 76 5 1 23.20
6 10253 2008-07-05 34 3 2 66.54
> head(detailsTable)
OrderID LineNo ProductID Quantity UnitPrice Discount
1 10402 2 63 65 18.94 0.00
2 10403 1 48 70 31.83 0.15
3 10403 2 16 21 10.15 0.15
4 10404 1 42 40 13.37 0.05
5 10404 2 49 30 19.82 0.05
6 10404 3 26 30 33.93 0.05
> head(detailsJoin)
OrderID OrderDate CustomerID EmployeeID ShipperID Freight
1 10402 2006-04-28 20 4 1 46.63
2 10403 2006-09-28 20 4 1 26.43
3 10403 2006-09-28 20 4 1 26.43
4 10404 2006-04-19 49 6 1 72.73
5 10404 2006-04-19 49 6 1 72.73
6 10404 2006-04-19 49 6 1 72.73
OrderID.1 LineNo ProductID Quantity UnitPrice Discount
1 10402 2 63 65 18.94 0.00
2 10403 1 48 70 31.83 0.15
3 10403 2 16 21 10.15 0.15
4 10404 1 42 40 13.37 0.05
5 10404 2 49 30 19.82 0.05
6 10404 3 26 30 33.93 0.05
While it is not necessary, it is good practice to close the ODBC connection using odbcClose, although it will close automatically when either R closes or we open another connection using odbcConnect. Only one connection may be open at a time.
In an ideal world, no tool besides R would ever be needed, but in reality data are sometimes locked in a proprietary format such as those from SAS, SPSS or Octave. The foreign package provides a number of functions similar to read.table for reading in data from other tools.
A partial list of functions to read data from commonly used statistical tools is in Table 6.1. The arguments to these functions are generally similar to those of read.table. These functions usually return the data as a data.frame, but they do not always succeed.
While read.ssd can read SAS data, it requires a valid SAS license. This can be sidestepped by using Revolution R from Revolution Analytics, with their special RxSasData function in their RevoScaleR package.
When working with other R programmers, a good way to pass around data, or any R objects such as variables and functions, is to use RData files. These are binary files that represent R objects of any kind. They can store a single object or multiple objects and can be passed among Windows, Mac and Linux without a problem.
First, let's create an RData file, remove the object that created it, and then read it back into R.
> # save the tomato data.frame to disk
> save(tomato, file = "data/tomato.rdata")
> # remove tomato from memory
> rm(tomato)
> # check if it still exists
> head(tomato)
Error: object 'tomato' not found
> # read it from the rdata file
> load("data/tomato.rdata")
> # check if it exists now
> head(tomato)
Round Tomato Price Source Sweet Acid Color Texture
1 1 Simpson SM 3.99 Whole Foods 2.8 2.8 3.7 3.4
2 1 Tuttorosso (blue) 2.99 Pioneer 3.3 2.8 3.4 3.0
3 1 Tuttorosso (green) 0.99 Pioneer 2.8 2.6 3.3 2.8
4 1 La Fede SM DOP 3.99 Shop Rite 2.6 2.8 3.0 2.3
5 2 Cento SM DOP 5.49 D Agostino 3.3 3.1 2.9 2.8
6 2 Cento Organic 4.99 D Agostino 3.2 2.9 2.9 3.1
Overall Avg.of.Totals Total.of.Avg
1 3.4 16.1 16.1
2 2.9 15.3 15.3
3 2.9 14.3 14.3
4 2.8 13.4 13.4
5 3.1 14.4 15.2
6 2.9 15.5 15.1
Now let's create a few objects to store in a single RData file, remove them and then load them again.
> # create some objects
> n <- 20
> r <- 1:10
> w <- data.frame(n, r)
> # check them out
> n
[1] 20
> r
[1] 1 2 3 4 5 6 7 8 9 10
> w
n r
1 20 1
2 20 2
3 20 3
4 20 4
5 20 5
6 20 6
7 20 7
8 20 8
9 20 9
10 20 10
> # save them
> save(n, r, w, file = "data/multiple.rdata")
> # delete them
> rm(n, r, w)
> # are they gone?
> n
Error: object 'n' not found
> r
Error: object 'r' not found
> w
Error: object 'w' not found
> # load them back
> load("data/multiple.rdata")
> # check them out again
> n
[1] 20
> r
[1] 1 2 3 4 5 6 7 8 9 10
> w
n r
1 20 1
2 20 2
3 20 3
4 20 4
5 20 5
6 20 6
7 20 7
8 20 8
9 20 9
10 20 10
R and some packages come with data included, so we can easily have data to use. Accessing these data is simple as long as we know what to look for. ggplot2, for instance, comes with a dataset about diamonds. It can be loaded using the data function.
> require(ggplot2)
> data(diamonds)
> head(diamonds)
carat cut color clarity depth table price x y z
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
To find a list of available data, simply type data() into the console.
These days a lot of data are displayed on Web pages. If we are lucky, the data are stored neatly in an HTML table. If we are not so lucky, we might need to parse the text of the page.
If the data are stored neatly in an HTML table, we can use readHTMLTable in the XML package to extract them easily. On my site there is a post about a Super Bowl pool I was asked to analyze, at http://www.jaredlander.com/2012/02/another-kind-of-super-bowl-pool. In that post there is a table with three columns that we wish to extract. It is fairly simple to do with the following code.
> require(XML)
> theURL <- "http://www.jaredlander.com/2012/02/another-kind-of-
+ super-bowl-pool/"
> bowlPool <- readHTMLTable(theURL, which = 1, header = FALSE,
+ stringsAsFactors = FALSE)
> bowlPool
V1 V2 V3
1 Participant 1 Giant A Patriot Q
2 Participant 2 Giant B Patriot R
3 Participant 3 Giant C Patriot S
4 Participant 4 Giant D Patriot T
5 Participant 5 Giant E Patriot U
6 Participant 6 Giant F Patriot V
7 Participant 7 Giant G Patriot W
8 Participant 8 Giant H Patriot X
9 Participant 9 Giant I Patriot Y
10 Participant 10 Giant J Patriot Z
Here the first argument was the URL, but it could also have been a file on disk. The which argument allows us to choose which table to read if there are multiple tables. For this example there was only one table, but it could easily have been the second or third or fourth. We set header to FALSE to indicate that the table had no header. Last, we used stringsAsFactors=FALSE so that the character columns would not be converted to factors.
If the data are not so neatly stored, it is possible to scrape them off the page, although this is a very involved process. It requires good pattern matching and regular expressions, which are covered in Section 13.14. The idea is to figure out what common pattern surrounds the different pieces of data, and this requires at least a basic knowledge of HTML.
Reading data is the first step in any analysis; without data there is nothing to do. The most common way to read data into R is from a CSV using read.table. RODBC provides an excellent method for reading from any database with a DSN. Reading data trapped in HTML tables is made easy using the XML package. R also has a special binary file format, RData, for the quick storage, loading and transfer of R objects.