What You’ll Learn in This Hour:
The dplyr package
Piping commands together
The data.table package
Options for improving efficiency
In Hour 11, “Data Manipulation and Transformation,” we looked at some standard methods for processing data in R. In particular, you saw how to sort and merge data. In previous hours we discussed how to subscript and summarize data using the “apply” family of functions. Now we will look at two packages, dplyr and data.table, that enable us to do all of these tasks for data frames within consistent, highly efficient frameworks.
We will begin the hour by looking at Hadley Wickham’s incredibly popular dplyr package. Although dplyr is actually the more recent of the two packages we’ll discuss in this hour, it fits in with packages such as readr and tidyr from the previous two hours. The data.table package is a standalone package for data manipulation that offers greater efficiency for very large data.
The dplyr package is another Hadley Wickham package that is revolutionizing the way people work with data in R. The package, which was first released in January 2014, fits into an analysis workflow that Hadley Wickham has helped define. In Hour 10, “Importing and Exporting,” you saw how packages such as readr, haven, and readxl can be used to import data into R. In Hour 11, you saw how the tidyr package can be used to transform data into a new shape. We will now look at how dplyr can be used to sort, subset, merge and summarize data.
The dplyr package can be thought of as an evolution of the popular plyr package, although it focuses solely on the manipulation of rectangular data structures, whereas plyr provides a more general framework. The focus of dplyr is very much on usability; however, there has also been considerable effort to ensure that dplyr is fast and efficient.
The dplyr package is intended to be used in a data analysis workflow in which data is imported using packages such as readr, haven, and readxl and then (possibly) transformed using tidyr. Each of these packages contains functions that produce an object of the tbl_df class. A tbl_df object is a dplyr construct that extends a data frame, affecting the way it prints. The tbl_df class extension does not affect standard data frame operations; however, each of the data-manipulation functions within dplyr returns a tbl_df object, so it is worth spending a little time to see what a tbl_df actually looks like. We can create a tbl_df object directly from a data.frame using the tbl_df function. An example of this is shown in Listing 12.1.
1 : > library(dplyr)
2 : >
3 : > # Create a tbl_df object from mtcars
4 : > head(mtcars)
5 : mpg cyl disp hp drat wt qsec vs am gear carb
6 : Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
7 : Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
8 : Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
9 : Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
10 : Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
11 : Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
12 : >
13 : > carData <- tbl_df(mtcars)
14 : > carData
15 : Source: local data frame [32 x 11]
16 :
17 : mpg cyl disp hp drat wt qsec vs am gear carb
18 : 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
19 : 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
20 : 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
21 : 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
22 : 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
23 : 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
24 : 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
25 : 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
26 : 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
27 : 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
28 :.. ... ... ... ... ... ... ... .. .. ... ...
29 : >
30 : > class(carData) # A tbl_df object is just an extension to a data.frame object
31 : [1] "tbl_df" "tbl" "data.frame"
In addition to changing the way in which data frames print, the creation of a tbl_df object also removes row names. In Listing 12.1 we can see how the creation of the carData “tbl_df” removes the row names from the original mtcars data. This is intentional and enforces the tidy data principle that all meaningful information should be stored in the same way (in columns). However, it can of course be a little frustrating if you have meaningful row names! The terms “tbl_df” and “data frame” will be used interchangeably throughout the remainder of this hour.
Note: Working with Data Tables
The dplyr package allows us to work with data table objects via the tbl_dt function, which extends the data.table class to create a tbl_dt object. A tbl_dt object behaves just like a tbl_df object.
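As a short sketch of the idea (assuming a version of dplyr that still exports tbl_dt; in later releases this functionality moved to the dtplyr package):

```r
library(data.table)
library(dplyr)

# Wrap a data.table so that it also prints and behaves like a tbl
carsDT <- tbl_dt(data.table(mtcars))
class(carsDT)  # includes "tbl_dt", "tbl", and "data.table"
```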
In dplyr we sort data using the arrange function. The arrange function expects a data frame (or a tbl_df) as the first argument. We can then list any number of columns as the subsequent arguments. The data is sorted by the first column we provide, then by the second, and so on. By default, an ascending sort is used. In the example below, we sort the carData data by carb and then by cyl:
> arrange(carData, carb, cyl)
Source: local data frame [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
1 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
2 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
3 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
4 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
5 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
6 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
7 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
10 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
.. ... ... ... ... ... ... ... .. .. ... ...
If we want to sort by descending values for any of our sort columns, we can wrap the column name in a call to the desc function; for example, to sort by carb and then descending values of cyl we would write arrange(carData, carb, desc(cyl)). Alternatively, we can simply place a minus sign in front of the column name, as shown here:
arrange(carData, carb, -cyl)
The dplyr package defines subscripting as two distinct operations: choosing rows and choosing columns. These are performed respectively by filter and select. As with all of the dplyr functions we are discussing in this hour, each function expects a data frame (or tbl_df object) as the first argument. This allows us to reference variables directly in subsequent arguments without using dollar signs or square brackets. In the subsequent arguments, we choose how we wish to “filter” the rows or “select” the columns. Let’s start by using the filter function to create a subset of carData containing only four-cylinder cars:
> cyl4 <- filter(carData, cyl == 4)
> cyl4
Source: local data frame [11 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
1 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
2 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
3 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
4 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
5 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
6 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
7 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
8 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
9 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
10 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
11 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
We can use any standard logical operations to filter our data. In addition to the standard ampersand (&), dplyr also permits us to separate “and” operations with a comma:
> filter(carData, cyl == 4, gear == 5) # equivalent to cyl == 4 & gear == 5
Source: local data frame [2 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
1 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
2 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
The select function operates in much the same way as filter. We can use either column names or column numbers to select which columns to keep or drop, much like the select option in the subset function. The standard way to select multiple columns is to separate each column with a comma. Note again that we do not use quotes to specify columns.
> select(carData, mpg, wt, cyl) # Return just these columns
Source: local data frame [32 x 3]
mpg wt cyl
1 21.0 2.620 6
2 21.0 2.875 6
3 22.8 2.320 4
4 21.4 3.215 6
5 18.7 3.440 8
6 18.1 3.460 6
7 14.3 3.570 8
8 24.4 3.190 4
9 22.8 3.150 4
10 19.2 3.440 6
.. ... ... ...
> select(carData, -vs, -am) # Return everything except these columns
Source: local data frame [32 x 9]
mpg cyl disp hp drat wt qsec gear carb
1 21.0 6 160.0 110 3.90 2.620 16.46 4 4
2 21.0 6 160.0 110 3.90 2.875 17.02 4 4
3 22.8 4 108.0 93 3.85 2.320 18.61 4 1
4 21.4 6 258.0 110 3.08 3.215 19.44 3 1
5 18.7 8 360.0 175 3.15 3.440 17.02 3 2
6 18.1 6 225.0 105 2.76 3.460 20.22 3 1
7 14.3 8 360.0 245 3.21 3.570 15.84 3 4
8 24.4 4 146.7 62 3.69 3.190 20.00 4 2
9 22.8 4 140.8 95 3.92 3.150 22.90 4 2
10 19.2 6 167.6 123 3.92 3.440 18.30 4 4
.. ... ... ... ... ... ... ... ... ...
Another nice property of the select function is that we can choose a sequence of columns using the column names in addition to the column numbers. For example, we could specify select(carData, mpg:wt). Choosing the columns that we want is simplified via a number of additional utility functions, as listed in Table 12.1.
Caution: Specialist functions within select
The functions described in Table 12.1 only work inside the select function and cannot be used to find patterns in standard character vectors.
The mutate function enables us to easily add new columns to our data. We can either provide a vector of values in the same way we would with a standard data frame or we can create new columns from existing variables. In the following example, we create a new column containing the original row names from the mtcars data frame. We then use the information contained within the hp and wt columns to create a second new column containing the power-to-weight ratio.
> fullCarData <- mutate(carData, type = rownames(mtcars), pwr2wt = hp/wt)
> fullCarData
Source: local data frame [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb type pwr2wt
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 41.98473
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag 38.26087
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 Datsun 710 40.08621
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Hornet 4 Drive 34.21462
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 Hornet Sportabout 50.87209
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 Valiant 30.34682
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 Duster 360 68.62745
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 Merc 240D 19.43574
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 Merc 230 30.15873
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 Merc 280 35.75581
.. ... ... ... ... ... ... ... .. .. ... ... ... ...
We can also drop columns by assigning existing names to NULL. The mutate function is similar to the base R function transform. However, unlike transform, the mutate function creates variables in the order in which we specify them, allowing variables that we create to be used immediately to create further new variables.
> fullCarData <- mutate(carData, type = rownames(mtcars),
+ drat = NULL, qsec = NULL,
+ pwr2wt = hp/wt, pwr2wt.Sq = pwr2wt^2)
> head(fullCarData,3)
Source: local data frame [3 x 12]
mpg cyl disp hp wt vs am gear carb type pwr2wt pwr2wt.Sq
1 21.0 6 160 110 2.620 0 1 4 4 Mazda RX4 41.98473 1762.718
2 21.0 6 160 110 2.875 0 1 4 4 Mazda RX4 Wag 38.26087 1463.894
3 22.8 4 108 93 2.320 1 1 4 1 Datsun 710 40.08621 1606.904
In Hour 11, you saw how the merge function can be used to merge data frames. The merge function allows us to specify arguments such as all.x in order to achieve what is also commonly known as a “left join.” In contrast, dplyr splits these arguments out into separate functions. These can be seen in Table 12.2. As with merge, we refer to our two datasets as x and y.
The first four functions listed in Table 12.2 operate in the same way as the merge function. For example, inner_join(demoData, pkData) provides an equivalent to merge(demoData, pkData). In addition, dplyr offers us the concepts of a semi-join and an anti-join. The semi_join function does not actually perform a merge. Instead, it returns the rows in x that would be retained if we were to merge x with y. Conversely, the anti_join function returns the rows of x that would not be retained if we were to merge with y. Listing 12.2 illustrates a semi-join and an anti-join using two (fabricated) sample data frames.
1 : > # Fabricate two datasets to merge
2 : > beerData <- data.frame(ID = c(1, 2, 3), Beer = c(75, 64, 92))
3 : > diaperData <- data.frame(ID = c(1, 3, 4), Diapers = c(51, 68, 32))
4 : > beerData
5 : ID Beer
6 : 1 1 75
7 : 2 2 64
8 : 3 3 92
9 : > diaperData
10 : ID Diapers
11 : 1 1 51
12 : 2 3 68
13 : 3 4 32
14 : >
15 : > # Rows of beerData that have a corresponding "ID" in diaperData
16 : > semi_join(beerData, diaperData, by = "ID")
17 : ID Beer
18 : 1 1 75
19 : 2 3 92
20 : > # Rows of beerData that do not have a corresponding "ID" in diaperData
21 : > anti_join(beerData, diaperData, by = "ID")
22 : ID Beer
23 : 1 2 64
24 : > # An inner join of the two datasets
25 : > inner_join(beerData, diaperData, by = "ID")
26 : ID Beer Diapers
27 : 1 1 75 51
28 : 2 3 92 68
Note that in each case we specified the “by” variable for the merge as "ID", but we did not have to. Like merge, each of the dplyr *_join functions will automatically determine the merge-by variables for us if we do not specify them. Because we stated that the data in the example is to be merged by the ID variable, the semi-join looks for ID values in beerData that also appear in diaperData. These are the rows that would be merged using either inner_join (as in lines 25 to 28) or left_join. Accordingly, anti_join returns the remaining rows that would not be merged.
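To round out the first four functions from Table 12.2, here is a short sketch of left_join and full_join using the same fabricated data as Listing 12.2:

```r
library(dplyr)

beerData <- data.frame(ID = c(1, 2, 3), Beer = c(75, 64, 92))
diaperData <- data.frame(ID = c(1, 3, 4), Diapers = c(51, 68, 32))

# Keep every row of beerData; unmatched IDs get NA for Diapers
left_join(beerData, diaperData, by = "ID")

# Keep unmatched rows from both datasets
full_join(beerData, diaperData, by = "ID")
```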
In addition to facilitating data manipulation, dplyr also provides an easy-to-use syntax for data aggregation that is a marked improvement upon its more generic predecessor, the plyr package. In dplyr terminology, data aggregation is referred to as a data summary. We therefore use a function called summarize to obtain numeric summaries of our data. As always when using dplyr, we pass the data as the first argument. In the subsequent arguments we can use standard summary functions to summarize columns in the data. In the following example, we use the mean function to summarize the mpg column within carData:
> summarize(carData, mean(mpg))
Source: local data frame [1 x 1]
mean(mpg)
1 20.09062
We can summarize using any function we like, including custom-written functions. The only restrictions are that the function must expect a vector as its input and must return a single value. We cannot, therefore, use a function such as range, because it returns a vector of length 2. However, we can make as many summaries as we like in a single call to summarize.
> summarize(carData, min(mpg), median(mpg), max(mpg))
Source: local data frame [1 x 3]
min(mpg) median(mpg) max(mpg)
1 10.4 19.2 33.9
When creating multiple summaries in this way, it can be helpful to be able to manually control the labels of the resulting data. In order to do so we simply specify the name of the resulting output column when creating the summary, as follows:
> mpgSummary <- summarize(carData, Min=min(mpg), Median=median(mpg), Max=max(mpg))
> mpgSummary
Source: local data frame [1 x 3]
Min Median Max
1 10.4 19.2 33.9
Sometimes we may find that we need to pass additional arguments to our summary functions. For example, we may need to specify na.rm = TRUE when summarizing a variable with missing values. To pass extra arguments to our summary functions, we supply them as if we were calling the function directly. Here’s an example:
summarize(airquality, mean(Ozone, na.rm = TRUE)).
If all we needed to do was summarize columns of data using standard numeric summary functions, then dplyr wouldn’t really offer anything new. If anything, it makes the process more tedious. However, the real advantage of the summarize function is that it facilitates easy “by” operations. In order to summarize our data by one or more variables, we use the group_by function to define a grouping within our data. We can actually group our data at any time, and the grouping will be retained by any other operations we perform. We can group by as many variables as we like.
To demonstrate the concept of grouped data, let’s group carData by the cyl variable and observe what happens when we filter the data by carb. The code for the operation is shown in Listing 12.3.
1: > cylGrouping <- group_by(carData, cyl)
2: > head(cylGrouping)
3: Source: local data frame [6 x 11]
4: Groups: cyl
5:
6: mpg cyl disp hp drat wt qsec vs am gear carb
7: 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
8: 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
9: 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
10: 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
11: 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
12: 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
13: >
14: > filter(cylGrouping, carb == 4)
15: Source: local data frame [10 x 11]
16: Groups: cyl
17:
18: mpg cyl disp hp drat wt qsec vs am gear carb
19: 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
20: 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
21: 3 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
22: 4 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
23: 5 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
24: 6 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
25: 7 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
26: 8 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
27: 9 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
28: 10 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Notice first of all that grouping by the cyl variable has the effect of adding a line to the output (see line 4). As can be seen in line 16, the cyl grouping was retained when we filtered the data. In both cases the sort order remains unaffected by the grouping. The effect of grouping our data is only felt when we summarize it. In the following example, we summarize the mpg column in our grouped data, cylGrouping:
> mpgSummaryByCyl <- summarize(cylGrouping, min(mpg), median(mpg), max(mpg))
> mpgSummaryByCyl
Source: local data frame [3 x 4]
cyl min(mpg) median(mpg) max(mpg)
1 4 21.4 26.0 33.9
2 6 17.8 19.7 21.4
3 8 10.4 15.2 19.2
The result of performing a summary operation on grouped data is that the output is summarized by each level of the grouping variable(s). In keeping with the concept of tidy data, the output is a data frame (in fact, a tbl_df). The operation returns a separate column for each variable that we grouped by, with additional columns for each summary we specified.
You have already seen that when we filter our data, the grouping variables are retained. However, we can also use the grouping to our advantage within the filter itself. In the following example, we use a grouping on the cyl variable to extract the maximum mpg value for each value of cyl. The comparison mpg == max(mpg) is performed within each group (that is, within each value of cyl).
> cylGrouping <- group_by(carData, cyl)
> # Extract the maximum mpg for each cyl category
> filter(cylGrouping, mpg == max(mpg))
Source: local data frame [3 x 11]
Groups: cyl
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
2 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
3 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Grouping our data also facilitates the generation of new aggregation variables. For example, we could create a new variable, meanMPGbyCyl, that is the mean of the mpg column for each value of cyl, as shown here:
> mutate(cylGrouping, meanMPGbyCyl = mean(mpg))
Source: local data frame [32 x 12]
Groups: cyl
mpg cyl disp hp drat wt qsec vs am gear carb meanMPGbyCyl
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 19.74286
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 19.74286
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 26.66364
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 19.74286
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 15.10000
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 19.74286
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 15.10000
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 26.66364
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 26.66364
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 19.74286
.. ... ... ... ... ... ... ... .. .. ... ... ...
Note: Remove a Grouping
We can remove any groupings in our data using the ungroup function.
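For example, a minimal sketch of removing the grouping before summarizing:

```r
library(dplyr)

cylGrouping <- group_by(tbl_df(mtcars), cyl)

summarize(cylGrouping, mean(mpg))           # one row per cyl value
summarize(ungroup(cylGrouping), mean(mpg))  # a single overall mean
```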
Functions in dplyr have been written to take advantage of what is commonly referred to as the “pipe” operator. The pipe operator, %>%, originates in the magrittr package and is by no means restricted to usage within dplyr. The pipe operator allows us to chain functions together such that the output from one function becomes the input to the first argument (by default) of the next. This has led to it being called the “then” operator in some quarters (do this, then this, then this, and so on). It is particularly useful if we have many steps to perform on a single type of object, such as a data frame. The advantage of this approach is that it avoids intermediary objects (that is, those that we create simply to break up nested function calls).
Note: Piping to Other Arguments
When you use the pipe operator, the output from a function does not have to be used as the input to the first argument of the next function. It can in fact become the input to any argument within the following function. However, the code is generally a lot more readable if we feed the output into the first argument of the following function.
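With the magrittr pipe, a dot (.) can be used as a placeholder for the left-hand side when it should feed an argument other than the first. A small sketch:

```r
library(dplyr)  # re-exports the magrittr pipe, %>%

# mtcars is piped into the "data" argument of lm(), not the first argument
mtcars %>% lm(mpg ~ wt, data = .)
```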
The dplyr package has been written with the pipe operator very much in mind. In a typical analysis workflow we might arrange, filter, select, mutate, group_by, and summarize several times over. Each of these functions takes a data frame as its first input and returns another data frame as its output. This is ideal for piping together function calls. Consider the example in Listing 12.4 using mtcars. In the first instance we use the traditional approach to data processing. To avoid nesting, we end up creating three intermediate datasets on the way to obtaining our summary. We then perform the same operations using the pipe operator. In the second case, no intermediate datasets are required.
1: > # A standard workflow, mean mpg by cyl for manual cars
2: > # The traditional way:
3: > carsByCyl <- arrange(mtcars, cyl)
4: > groupByCyl <- group_by(carsByCyl, cyl)
5: > manualCars <- filter(groupByCyl, am == 1)
6: > summarize(manualCars, Mean.MPG=mean(mpg))
7: Source: local data frame [3 x 2]
8:
9: cyl Mean.MPG
10: 1 4 28.07500
11: 2 6 20.56667
12: 3 8 15.40000
13: >
14: > # Using pipes
15: > mtcars %>%
16: + arrange(cyl) %>%
17: + group_by(cyl) %>%
18: + filter(am == 1) %>%
19: + summarize(Mean.MPG=mean(mpg))
20: Source: local data frame [3 x 2]
21:
22: cyl Mean.MPG
23: 1 4 28.07500
24: 2 6 20.56667
25: 3 8 15.40000
The pipe operator is not to everyone’s taste, and it can be harder to debug than well-written code using a traditional syntax. However, it is becoming an increasingly popular means of working with data—and before long it may not be possible to avoid it!
The data.table package predates dplyr by several years, having first been released to CRAN in April 2006. It is still actively maintained by its primary author, Matt Dowle, and despite the growing popularity of the dplyr package, data.table remains one of the most popular and well-documented packages on CRAN. In addition to the standard help and a quick-start guide, Matt Dowle has written an extensive FAQ document tackling some of the less-obvious aspects of the package.
The focus of the package is very much on reading, processing, and aggregating large data efficiently. The data.table object is essentially an enhancement to the data.frame class. It allows us to index, merge, and group data much faster than we can with standard data frames.
Like any analysis workflow, the data.table workflow begins with importing data. In Hour 10 we looked briefly at the performance of the fread function contained within data.table. The fread function is similar to read.table in terms of usage, though it is much faster for large datasets. Conveniently, the output of the function is a data.table object.
> dji <- fread("djiData.csv")
> dji
Date DJI.Open DJI.High DJI.Low DJI.Close DJI.Volume DJI.Adj.Close
1: 12/31/2014 17987.66 18043.22 17820.88 17823.07 82840000 17823.07
2: 12/30/2014 18035.02 18035.02 17959.70 17983.07 47490000 17983.07
3: 12/29/2014 18046.58 18073.04 18021.57 18038.23 53870000 18038.23
4: 12/26/2014 18038.30 18103.45 18038.30 18053.71 52570000 18053.71
5: 12/24/2014 18035.73 18086.24 18027.78 18030.21 42870000 18030.21
---
248: 01/08/2014 16527.66 16528.88 16416.69 16462.74 103260000 16462.74
249: 01/07/2014 16429.02 16562.32 16429.02 16530.94 81270000 16530.94
250: 01/06/2014 16474.04 16532.99 16405.52 16425.10 89380000 16425.10
251: 01/03/2014 16456.89 16518.74 16439.30 16469.99 72770000 16469.99
252: 01/02/2014 16572.17 16573.07 16416.49 16441.35 80960000 16441.35
The appearance of a data table is similar to that of a standard data frame. When we print a small dataset (one containing 100 rows or fewer), the entire dataset is returned, but with the header row repeated at the base of the table. For larger datasets, only the first and last five rows are returned. We can turn existing data frames into data.table objects by directly calling the data.table function—for example, air <- data.table(airquality). We can also create a data.table from scratch in the same way we would using the data.frame function.
Tip: Keeping Track of Tables
If we create many data table objects, the tables function can be used to find out what tables we have, what they contain, and how much memory they have been allocated.
One of the primary focuses of the data.table package is performance. To achieve this performance, we define a key. In some ways this is similar to a primary key in a relational database. However, in data.table the key can be made up of several columns and does not have to be unique. In fact, it is often more useful if the key is not unique. The key is used for sorting, indexing, and summarizing. It is defined using a function called setkey. In Listing 12.5 we define a simple data.table using the demoData data in the mangoTraining package and then set the key based on the variables Sex and Smokes.
1: > # Create a data.table and define the key
2: > demoDT <- data.table(demoData)
3: > setkey(demoDT, Sex, Smokes)
4: > head(demoDT)
5: Subject Sex Age Weight Height BMI Smokes
6: 1: 3 F 23 72 170 25.1 No
7: 2: 6 F 29 67 169 23.5 No
8: 3: 12 F 32 77 182 23.1 No
9: 4: 15 F 27 73 172 24.8 No
10: 5: 23 F 26 82 175 26.8 No
11: 6: 26 F 25 58 175 18.9 No
The obvious effect of defining a key is that when printing, the data is sorted by the key variables from left to right as we defined them. In Listing 12.5 the rows are sorted by Sex and then by Smokes. The purpose of defining the sort key is not just for printing, however; it also enables faster indexing when subscripting.
Notice that we wrote setkey(demoDT, Sex, Smokes) as opposed to demoDT <- setkey(demoDT, Sex, Smokes). Functions in data.table update the data table directly, so we do not need to use <- to copy/replace the original data. Updating by reference in this way reduces the memory required to perform manipulation tasks and improves speed.
Tip: Querying the Key
We can find out whether a data table has a key using the haskey function, which returns TRUE if the data table has a key and FALSE otherwise. The key function tells us what the key is.
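A minimal sketch:

```r
library(data.table)

dt <- data.table(mtcars)
haskey(dt)        # FALSE: no key has been defined yet

setkey(dt, cyl, gear)
haskey(dt)        # TRUE
key(dt)           # "cyl" "gear"
```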
In the data.table syntax, we can reference columns directly, as if they were objects in their own right. In other words, we can drop the “dataName$” prefix. This saves some typing, though the real benefit is the speed gain we get from using data.table in the first place.
> demoDT[Sex == "F",]
Subject Sex Age Weight Height BMI Smokes
1: 3 F 23 72 170 25.1 No
2: 6 F 29 67 169 23.5 No
3: 12 F 32 77 182 23.1 No
4: 15 F 27 73 172 24.8 No
5: 23 F 26 82 175 26.8 No
6: 26 F 25 58 175 18.9 No
7: 28 F 28 69 172 23.4 No
8: 30 F 33 61 175 19.9 No
9: 17 F 41 62 172 20.9 Yes
10: 27 F 36 82 190 22.6 Yes
If our data table has a key and we want to subset by that key, we can go one step further and drop the reference to the variable we want to subset altogether (for example, demoDT["F",]). In fact, we don’t even need the comma to specify rows as we would with a data frame, though it can sometimes be confusing to leave it out.
If we have defined a key using multiple variables, we can provide the subset values by separating them with a comma. We enclose the values using J(), where J stands for “join.” In the following example, we subset the demography data to return female smokers:
> key(demoDT)
[1] "Sex" "Smokes"
> demoDT[J("F", "Yes"),]
Subject Sex Age Weight Height BMI Smokes
1: 17 F 41 62 172 20.9 Yes
2: 27 F 36 82 190 22.6 Yes
Note: Alternatives to J
The J function is the data.table specification of a “join” of two keys. The practice of joining based on keys has its roots in SQL, but in practice it is just a means of separating variables. As an alternative, the function list (base) or . (plyr) could be used in exactly the same way.
Occasionally we may want to return a subset in which the variables of interest match multiple criteria. To achieve this, we can specify a vector of values. If we have defined a key from multiple variables, any vector we specify must be contained within a call to the J function. An example of this is shown here:
> setkey(demoDT, Sex, Weight)
> demoDT[J("M", c(76, 77)),]
Subject Sex Age Weight Height BMI Smokes
1: 4 M 25 76 188 21.4 No
2: 31 M 25 76 174 25.1 No
3: 13 M 21 77 180 23.6 No
4: 20 M 22 77 183 23.1 No
Caution: Numeric Keys
The data.table package allows us to define a key using numeric variables. However, in order to subset using these keys we must use the . function. This is because, like data frames, data tables also allow us to subset by specifying row numbers. If we wanted to return all the rows in demoDT for which Weight is equal to 72, we would write the following:
> setkey(demoDT, Weight)
> demoDT[.(72),]
Subject Sex Age Weight Height BMI Smokes
1: 3 F 23 72 170 25.1 No
The data.table package makes adding variables to an existing data table much easier and quicker than when working with standard data frames. Whenever we add a column to a standard data frame, we make a copy of the data. When we work with data tables, the new column is instead appended by reference; in other words, R points to the existing table and tells it to add a new column. This makes it much faster and more efficient.
We create new variables in our data, by reference, using the := operator. To create variables by reference, we use the square subscripting brackets on the existing data table and avoid any standard R assignment. If we are generating the new variable from existing variables, we refer to them directly, as in the following example:
> demoDT[, HeightInM.sq := (Height^2)/10000]
> head(demoDT)
Subject Sex Age Weight Height BMI Smokes HeightInM.sq
1: 1 M 43 57 166 20.7 No 2.7556
2: 2 M 22 71 179 22.2 No 3.2041
3: 3 F 23 72 170 25.1 No 2.8900
4: 4 M 25 76 188 21.4 No 3.5344
5: 5 M 29 82 175 26.8 No 3.0625
6: 6 F 29 67 169 23.5 No 2.8561
Caution: Updating the Values in the Key
If we update the values in any of the columns that make up our key, we need to redefine the key.
To create multiple new columns, we must provide the names of the new columns as a character vector and the transformations as a list. The vector of names and the list of transformations are separated by the := operator, as shown in Listing 12.6. We can also remove columns by setting them to NULL using the := operator.
1: > demoDT[, c("SexNum", "SmokesNum") := list(as.numeric(Sex), as.numeric(Smokes))]
2: > head(demoDT)
3: Subject Sex Age Weight Height BMI Smokes HeightInM.sq SexNum SmokesNum
4: 1: 1 M 43 57 166 20.7 No 2.7556 2 1
5: 2: 26 F 25 58 175 18.9 No 3.0625 1 1
6: 3: 30 F 33 61 175 19.9 No 3.0625 1 1
7: 4: 22 M 27 61 170 21.0 No 2.8900 2 1
8: 5: 17 F 41 62 172 20.9 Yes 2.9584 1 2
9: 6: 14 M 26 64 170 22.0 No 2.8900 2 1
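To illustrate column removal, the two columns created in Listing 12.6 can be deleted again by assigning NULL by reference:

```r
> # Remove the SexNum and SmokesNum columns by reference
> demoDT[, c("SexNum", "SmokesNum") := NULL]
```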
We can rename columns using the setnames
function. Once again the renaming is performed by reference to avoid copying the entire dataset. The setnames
function expects a data table as its first argument, with further arguments old
and new
, which respectively expect a vector of column names to change from and to.
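As a minimal sketch, we could rename the HeightInM.sq column created earlier (the new name HeightM2 is purely for illustration):

```r
> # Rename a single column by reference; no copy of the data is made
> setnames(demoDT, old = "HeightInM.sq", new = "HeightM2")
```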
Note: Multiple Ways to Create New Variables
There are normally several ways of doing the same thing with data.table, and everyone tends to have their preference. In order to create new variables in Listing 12.6, we could also have used the following syntax:
demoDT[, `:=` (SexNum = as.numeric(Sex), SmokesNum = as.numeric(Smokes))]
We could also have used the set
function to achieve the same result.
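A sketch of the set approach: set expects the data table, optional row indices i, a column name or number j, and the values to assign.

```r
> # Equivalent to demoDT[, SexNum := as.numeric(Sex)]
> set(demoDT, j = "SexNum", value = as.numeric(demoDT$Sex))
```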
Although the rbind
function in base R can be used to append rows to a data table, the function rbindlist
is optimized for speed and memory efficiency. The rbindlist
function can be used to join data tables and/or regular data frames that are stored as a list. We can join together as many datasets as we wish, but we must first store them together in a list. Unlike the standard rbind
that we looked at in Hour 11, rbindlist
will permit us to bind together datasets for which the column names do not match by setting fill = TRUE
. An example of this is shown in Listing 12.7. First we generate a list by splitting the airquality
data by the Month
variable and then combine it back together in line 5. Then we use rbindlist
again in line 24 to add on new rows of data.
1: > # Create a list containing airquality data for each available month
2: > airSplit <- split(airquality, airquality$Month)
3: >
4: > # Bind these together into a single data table
5: > airDT <- rbindlist(airSplit)
6: > airDT
7: Ozone Solar.R Wind Temp Month Day
8: 1: 41 190 7.4 67 5 1
9: 2: 36 118 8.0 72 5 2
10: 3: 12 149 12.6 74 5 3
11: 4: 18 313 11.5 62 5 4
12: 5: NA NA 14.3 56 5 5
13: ---
14: 149: 30 193 6.9 70 9 26
15: 150: NA 145 13.2 77 9 27
16: 151: 14 191 14.3 75 9 28
17: 152: 18 131 8.0 76 9 29
18: 153: 20 223 11.5 68 9 30
19: >
20: > # Now assume two new records arrive but with missing columns
21: > month10 <- data.table(Ozone = c(24, 28), Month = 10, Day = 1:2)
22: >
23: > # Bind this to our original data
24: > newAirDT <- rbindlist(list(airDT, month10), fill = TRUE)
25: > tail(newAirDT)
26: Ozone Solar.R Wind Temp Month Day
27: 1: NA 145 13.2 77 9 27
28: 2: 14 191 14.3 75 9 28
29: 3: 18 131 8.0 76 9 29
30: 4: 20 223 11.5 68 9 30
31: 5: 24 NA NA NA 10 1
32: 6: 28 NA NA NA 10 2
Merging data tables works in much the same way as a typical merge on a data frame using the merge
function. However, the default behavior of merge
for data tables is to use the respective keys for the two data tables. We must therefore either define keys for the two data tables or specify the “by” variables manually. In Listing 12.8 we create two data tables from the demoData
and pkData
data frames contained within the mangoTraining package and set the keys accordingly. In line 8 we perform a merge
, similar to that used in Hour 11.
1: > # Create data tables and define the keys accordingly
2: > demoDT <- data.table(demoData)
3: > setkey(demoDT, Subject)
4: > pkDT <- data.table(pkData)
5: > setkey(pkDT, Subject)
6: >
7: > # Merge the two data tables together
8: > allPKDT <- merge(demoDT, pkDT)
9: > allPKDT
10: Subject Sex Age Weight Height BMI Smokes Dose Time Conc
11: 1: 1 M 43 57 166 20.7 No 25 0 0.00
12: 2: 1 M 43 57 166 20.7 No 25 1 660.13
13: 3: 1 M 43 57 166 20.7 No 25 6 178.92
14: 4: 1 M 43 57 166 20.7 No 25 12 88.99
15: 5: 1 M 43 57 166 20.7 No 25 24 42.71
16: ---
17: 161: 33 M 30 80 180 24.8 No 25 0 0.00
18: 162: 33 M 30 80 180 24.8 No 25 1 453.13
19: 163: 33 M 30 80 180 24.8 No 25 6 205.30
20: 164: 33 M 30 80 180 24.8 No 25 12 146.69
21: 165: 33 M 30 80 180 24.8 No 25 24 46.84
For large datasets you will notice that using merge
with data tables is significantly faster than with data frames. For those who need that little bit of extra performance, however, the package offers an alternative that is even faster. To perform the data table merge, we return to using square brackets. For a standard merge (a.k.a. an inner join), we put one data table inside the brackets and one outside. An example of an inner join or standard merge is shown here:
> demoDT[pkDT]
Subject Sex Age Weight Height BMI Smokes Dose Time Conc
1: 1 M 43 57 166 20.7 No 25 0 0.00
2: 1 M 43 57 166 20.7 No 25 1 660.13
3: 1 M 43 57 166 20.7 No 25 6 178.92
4: 1 M 43 57 166 20.7 No 25 12 88.99
5: 1 M 43 57 166 20.7 No 25 24 42.71
---
161: 33 M 30 80 180 24.8 No 25 0 0.00
162: 33 M 30 80 180 24.8 No 25 1 453.13
163: 33 M 30 80 180 24.8 No 25 6 205.30
164: 33 M 30 80 180 24.8 No 25 12 146.69
165: 33 M 30 80 180 24.8 No 25 24 46.84
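A brief sketch of the other join variants: X[Y] keeps every row of Y, so swapping the tables changes which side is fully retained, and nomatch = 0 drops unmatched rows.

```r
> # Keep all rows of demoDT (a left join onto demoDT)
> pkDT[demoDT]
> # Drop rows with no match on either side (an inner join),
> # even when the keys do not fully overlap
> demoDT[pkDT, nomatch = 0]
```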
In addition to transforming and manipulating our data, we can also use data.table to summarize our data. As usual, we start by specifying the name of the data and use square brackets to create a summary. We can perform simple summary operations on columns using standard statistical summary functions such as mean
.
> # Calculate the mean height
> demoDT <- data.table(demoData)
> demoDT[ , mean(Height)]
[1] 176.1515
So far we have seen nothing special. However, data.table permits the use of a “by” argument, which allows aggregation. The return object is also a data table. Here, we calculate the mean height again by sex:
> demoDT[ , mean(Height), by = Sex]
Sex V1
1: M 176.5652
2: F 175.2000
Tip: Counting Records
In data.table we can use .N
to count records within by-groups. For example, to count the number of males and females in the demoDT
data table, we would write demoDT[, .N, by = Sex]
.
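For example, with the demoDT table built from demoData:

```r
> demoDT[, .N, by = Sex]
   Sex  N
1:   M 23
2:   F 10
```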
We can summarize by multiple variables by providing them as a list using .
or list
. The result is another data table with a column for each “by” variable and an additional column for the summary.
> demoDT[ , mean(Height), by = list(Sex, Smokes)]
Sex Smokes V1
1: M No 177.3158
2: F No 173.7500
3: M Yes 173.0000
4: F Yes 181.0000
We can provide multiple summaries and name them using a list. Again, the result is a data table.
> demoDT[ , list(Mean.Height = mean(Height), Mean.Weight = mean(Weight)),
+ by = list(Sex, Smokes)]
Sex Smokes Mean.Height Mean.Weight
1: M No 177.3158 74.10526
2: F No 173.7500 69.87500
3: M Yes 173.0000 74.25000
4: F Yes 181.0000 72.00000
Caution: Summary Functions That Return Multiple Values
It is possible to summarize using functions that return multiple values, such as range
and quantile
. However, the effect is that a new row is created for each element of the return vector—for example, one for the minimum and one for the maximum if using range
. Other than the sort order, there is no way to tell which row corresponds to which value in the output vector.
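A minimal sketch of this behavior using range, which returns two values per group:

```r
> # Each group produces two rows: one for the minimum, one for the maximum,
> # distinguishable only by their order
> demoDT[, range(Height), by = Sex]
```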
The aggregation that we have seen thus far creates a new data table that we can use for publishing, plotting, or modeling. The original table is unaffected by the operation. However, if we want to merge the results of the aggregation back on to the original data, we can easily do so using the :=
operator.
> demoDT[, MeanWeightBySex := mean(Weight), by = Sex]
> head(demoDT, 5)
Subject Sex Age Weight Height BMI Smokes MeanWeightBySex
1: 1 M 43 57 166 20.7 No 74.13043
2: 2 M 22 71 179 22.2 No 74.13043
3: 3 F 23 72 170 25.1 No 70.30000
4: 4 M 25 76 188 21.4 No 74.13043
5: 5 M 29 82 175 26.8 No 74.13043
In order to generate multiple summaries, we may use any of the methods associated with :=
for creating new variables.
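For instance, the character-vector form from Listing 12.6 also accepts a by argument (the column names here are our own choices):

```r
> # Add two grouped summaries to the original table by reference
> demoDT[, c("MeanWeightBySex", "MeanHeightBySex") :=
+            list(mean(Weight), mean(Height)), by = Sex]
```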
There are always many ways of achieving the same goal using data.table, and we have presented just a small selection of options in most cases. There are also many more features, such as rolling means, that we simply do not have the time to cover in any detail. If you are interested in digging into data.table further, Matt Dowle has crammed the package help files full of examples. The package FAQ offers further guidance.
For the vast majority of readers, dplyr and data.table will be more than sufficient for your data needs. In particular, data.table has been shown to be extremely performant. On a standard desktop, it can comfortably handle basic summary operations on datasets with a billion rows, containing several thousand groups, within a matter of minutes. However, for some that might still not be enough!
Without parallelizing your code and/or turning to high-performance computing solutions, you might find two further packages to be of assistance. The first of these is bigmemory. The bigmemory package is designed to work with matrices that can be held in your computer’s memory but cannot be processed by standard R functions for data structures. The package takes advantage of C++ and allows objects to be shared across multiple sessions on the same machine.
An alternative approach to handling very large datasets is to use the ff package. Instead of storing large datasets in memory, the ff package stores data on disk. Only a tiny portion of the data is ever mapped to memory. Though the data is stored on disk, it behaves in much the same way as standard R objects held in memory. On the back end, C++ is used to perform the requested operations.
Still further options are available beyond the two packages covered in this hour, though typically they involve parallelizing your operation and are beyond the scope of this tutorial.
In this hour, we have looked at the two most popular packages for efficient data handling in R: dplyr and data.table. We have looked at the basic syntax of the packages as well as common data-handling tasks such as sorting, subscripting, merging, and aggregation. If you are still unsure as to which is right for you, you can now have a go at using them both during the workshop.
Having seen how R can be used to import and manipulate data, we will spend the next three hours looking at how we can visualize our data using the graphics package and the popular alternatives lattice and ggplot2.
Q. Which is better, dplyr or data.table?
A. In short, it depends! In terms of speed, most benchmarking examples show the packages to be comparable to a point, but as the number of rows and/or groups increases, data.table comes out on top. If speed or memory usage matter to you and you have more than a million rows or 100,000 groups within your data, you should probably use data.table. If data size (and hence performance) is not that important to you, choose whichever you feel more comfortable with.
Q. We have now seen a data.frame, a tbl_df, and a data.table. Why do I need to learn about three different structures?
A. First of all, both a tbl_df
and a data.table
are just an extension to a data.frame
. Generally, there is therefore very little difference, though functions such as print
behave in a slightly different manner for tbl_df
and data.table
objects than they do with a data.frame
. This is due to R’s S3 class system, which we will look at more closely in Hour 16, “Introduction to R Models and Object Orientation,” and then again in Hour 21, “Writing R Classes.”
The workshop contains quiz questions and exercises to help you solidify your understanding of the material covered. Try to answer all questions before looking at the “Answers” section that follows.
1. True or false? When using select
, you must provide a character vector of column names.
2. Which of the following is a dplyr function that allows you to create new columns?
A. transform
B. subset
C. mutate
3. Assuming you have created a data.table
object called demoDT
from the demoData
data frame and set the key to be the Smokes
column, which of the following would return a subset containing all records for subjects that smoke?
A. demoDT[demoDT$Smokes == "Yes", ]
B. demoDT[Smokes == "Yes", ]
C. demoDT["Yes", ]
D. demoDT["Yes"]
4. What is “wrong” with the following syntax when working with a data.table called demoDT
?
demoDT$Height.Sq <- demoDT$Height^2
1. False. You specify each column name as a separate argument. In fact, if you do try to use a character vector, the function will return an error.
2. C. The transform
and subset
functions are contained in the base R package. The transform
function is actually quite similar to mutate
, though it does not allow you to base new variables on other variables that you are creating within the call to transform. The subset
function offers similar functionality to the dplyr functions filter
and select
.
3. A, B, C and D. The data.table syntax is extremely flexible, and all four methods achieve the same end result.
4. Nothing is technically “wrong” with the statement, though data.table is optimized for efficiency, and the command shown is a standard, less efficient way of creating a new column, Height.Sq
. The more efficient method in data.table would be
demoDT[, Height.Sq := Height^2]
1. Using the dplyr package, perform the following actions:
Create a tbl_df
object named air
from the airquality
data frame.
Sort the data by the Wind
column.
Remove any rows for which the Ozone
column has a missing value.
Remove the Solar.R
column and create a new column containing the ratio of Ozone
to Wind
.
Create a subset of the original airquality
data containing just three columns: Month
, Day
, and Solar.R
. The data should only contain data for June and July. Name the output solar
.
Merge the air
and solar
datasets together, retaining all records from the air
dataset (that is, a left join).
Calculate the median Ozone
value by Month
for the merged data.
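One possible dplyr solution is sketched below; the new column name OzoneWindRatio is our own choice, and there are many equally valid orderings of these steps.

```r
library(dplyr)

# Create a tbl_df and sort by Wind
air <- tbl_df(airquality)
air <- arrange(air, Wind)

# Remove rows with missing Ozone values
air <- filter(air, !is.na(Ozone))

# Drop Solar.R and add the Ozone/Wind ratio
air <- mutate(select(air, -Solar.R), OzoneWindRatio = Ozone / Wind)

# Month, Day, and Solar.R for June and July only
solar <- filter(select(airquality, Month, Day, Solar.R), Month %in% c(6, 7))

# Left join, retaining all records from air
merged <- left_join(air, solar, by = c("Month", "Day"))

# Median Ozone by Month
summarise(group_by(merged, Month), MedianOzone = median(Ozone))
```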
2. Now using the data.table package, perform the same actions:
Create a data.table
object named air
from the airquality
data frame.
Sort the data by the Wind
column.
Remove any rows for which the Ozone
column has a missing value.
Remove the Solar.R
column and create a new column containing the ratio of Ozone
to Wind
.
Create a subset of the original airquality
data containing just three columns: Month
, Day
, and Solar.R
. The data should only contain data for June and July. Name the output solar
.
Merge the air
and solar
datasets together, retaining all records from the air
dataset (that is, a left join).
Calculate the median Ozone
value by Month
for the merged data.