Managing data with R

One of the challenges faced when working with massive datasets involves gathering, preparing, and otherwise managing data from a variety of sources. This task is facilitated by R's tools for loading data from many common formats.

Saving and loading R data structures

When you have spent a lot of time getting a particular data frame into the format that you want, you shouldn't need to recreate your work each time you restart your R session. To save a particular data structure to a file that can be reloaded later or transferred to another system, you can use the save() function. The save() function writes R data structures to the location specified by the file parameter. R data files have the file extension .RData.

If we had three objects named x, y, and z, we could save them to a file mydata.RData using the following command:

> save(x, y, z, file = "mydata.RData")

Regardless of whether x, y, and z are vectors, factors, lists, or data frames, they will be saved to the file.

The load() command will recreate any data structures already saved that were to an .RData file. To load the mydata.RData file we saved in the preceding code, simply type:

> load("mydata.RData")

This will recreate the x, y, and z data structures.

Tip

Be careful what you are loading! All data structures stored in the file you are importing with the load() command will be added to your workspace, even if they overwrite something else you are working on.

If you need to wrap up your R session in a hurry, the save.image() command will write your entire session to a file simply called .RData. By default, R will look for this file the next time when you start R, and your session will be recreated just as you had left it.

Importing and saving data from CSV files

It is very common for publically-available data to be stored in text files. Text files can be read on virtually any computer or operating system, making the format nearly universal. They can also be exported and imported from/to programs such as Microsoft Excel, providing a quick and easy way to work with spreadsheet data.

A tabular (as in "table") data file is structured in matrix form, in such a way that each line of text reflects one example, and each example has the same number of features. The feature values on each line are separated by a predefined symbol known as a delimiter. Often, the first line of a tabular data file lists the names of the columns of data. This is called a header line.

Perhaps the most common tabular text file format is the Comma-Separated Values (CSV) file, which as the name suggests, uses the comma as a delimiter. The CSV files can be imported to and exported from many common applications. A CSV file representing the medical dataset constructed previously would look as follows:

subject_name,temperature,flu_status,gender,blood_type
John Doe,98.1,FALSE,MALE,O
Jane Doe,98.6,FALSE,FEMALE,AB 
Steve Graves,101.4,TRUE,MALE,A

To load this CSV file into R, the read.csv() is used as follows:

> pt_data <- read.csv("pt_data.csv", stringsAsFactors = FALSE)

Given a patient data file named pt_data.csv located in the R working directory, this will read the CSV file into a data frame titled pt_data. Just as we had done previously when constructing a data frame, we need to use the stringsAsFactors = FALSE parameter to prevent R from converting all text variables to factors; this step is better left to you, not R, to perform.

If your data reside outside the R working directory, you can specify the path to the CSV file by specifying the full path, for example, /path/to/mydata.csv when calling the read.csv() function.

By default, R assumes that the CSV file includes a header line listing the names of the features in the dataset. If a CSV file does not have a header, specify the option header = FALSE as shown in the following command, and R will assign default feature names in the form V1, V2, and so on:

> mydata <- read.csv("mydata.csv", stringsAsFactors = FALSE,header = FALSE)

The read.csv() function is a special case of the read.table() function, which can read tabular data in many different forms, including other delimited formats such as Tab-Separated Value (TSV). For more detailed information on the read.table() family of functions, refer to the R help page using the command ?read.table.

To save a data frame to a CSV file, use the write.csv() function. If your data frame is named pt_data, simply enter:

> write.csv(pt_data, file = "pt_data.csv")

This will write a CSV file with the name pt_data.csv to the R working folder.

Importing data from SQL databases

If your data is stored in an ODBC (Open Database Connectivity) SQL (Structured Query Language) database such as Oracle, MySQL, PostgreSQL, Microsoft SQL, or SQLite, the RODBC package created by Brian Ripley can be used to import this data directly into an R data frame.

ODBC is a standard protocol for connecting to databases regardless of operating system or DBMS (Database Management System). If you have previously connected to a database via ODBC, you most likely will have referred to it via its DSN (Data Source Name). You will need the DSN, plus a username and password (if your database requires it) for using RODBC.

Tip

The instructions for configuring an ODBC connection are highly specific to the combination of operating system and DBMS. If you are having trouble setting up an ODBC connection, check with your database administrator. Another way to obtain help is the RODBC package vignette, which you can access in R with the command print(vignette("RODBC")).

If you have not already done so, you will need to install and load the RODBC package:

> install.packages("RODBC")
> library(RODBC)

Next, we will open a connection called mydb to the database with the DSN my_dsn:

> mydb <- odbcConnect("my_dsn")

Alternatively, if your ODBC connection requires a username and password, they should be specified when calling the odbcConnect() function:

> mydb <- odbcConnect("my_dsn", uid = "my_username"pwd = "my_password")

Now that we have an open database connection, we can use the sqlQuery() function to create an R data frame from the database rows pulled by SQL queries. This function, like many functions that create data frames, allows us to specify stringsAsFactors = FALSE, which prevents R from converting character data to factors.

The sqlQuery() function uses typical SQL queries as shown in the following command:

> patient_query <- "select * from patient_data where alive = 1"
> patient_data <- sqlQuery(channel = mydb, query = patient_query,stringsAsFactors = FALSE)

The resulting patient_data variable will be a data frame containing all of the rows selected using the SQL query stored in patient_query.

When you are done using the database, the connection can be closed as shown in the following command:

> odbcClose(mydb)

This will close the mydb connection. Although R will automatically close ODBC connections at the end of an R session, it is better practice to do so explicitly.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset