Simplifying Data Manipulation with dplyr

Both R and pandas make data manipulation more expressive than it is in most general-purpose programming languages. For example, many tasks that would otherwise require a for loop, such as selecting a column, can be done in a single line of code.
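As a minimal sketch of that contrast, the following base R snippet selects a column twice: once with an explicit loop and once with a single expression. The patients data frame and its columns are made up for illustration and do not come from the earlier chapters.

```r
# Hypothetical data frame, used only for illustration.
patients <- data.frame(
  id = 1:4,
  age = c(34, 51, 27, 45)
)

# Loop-based approach: copy the age value out of each row, one at a time.
ages_loop <- numeric(nrow(patients))
for (i in seq_len(nrow(patients))) {
  ages_loop[i] <- patients[i, "age"]
}

# Expressive approach: select the whole column in a single line.
ages_direct <- patients$age

# Both approaches produce the same vector.
same_result <- identical(ages_loop, ages_direct)
```

The single-line version states the intent (take the age column) directly, while the loop spells out how to collect it.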

However, some aspects of data manipulation could still be expressed more directly. Recall that in the previous chapter, a number of processing steps and intermediate variables were needed to filter the data and find the result. It can be hard to express a large number of data manipulation operations in a way that is both descriptive and self-contained.

Ideally, it should be possible to express all of the steps for processing data in one sequence of code, and in a way that reflects the function of each step. A number of packages build on the R programming language and environment to make it more expressive, concise, and consistent. One well-developed effort to make data processing in R more elegant and intuitive is a collection of packages called the tidyverse.

At the time of writing, the tidyverse includes five packages, two of which I will be using in this chapter:

tibble is just another version of R's dataframe that has a few improvements. In particular, the printout is a bit cleaner.
dplyr, as the documentation states, is a grammar for data manipulation. It contains a series of functions that allow you to express data manipulation operations easily and intuitively. The syntax for using dplyr takes some getting used to.
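To give a rough feel for how these two packages fit together, here is a small sketch that assumes both tibble and dplyr are installed. The visits data and the threshold of 10 are invented for illustration. The filter(), group_by(), and summarize() functions and the %>% pipe are part of dplyr, and later sections of this chapter cover them in more detail.

```r
library(tibble)
library(dplyr)

# Hypothetical data, made up for illustration only.
visits <- tibble(
  clinic = c("north", "north", "south", "south", "south"),
  patients = c(12, 18, 7, 22, 15)
)

# Each step is named for what it does, and the steps read top to bottom:
# keep the busier days, group them by clinic, then average within each group.
busy_by_clinic <- visits %>%
  filter(patients >= 10) %>%
  group_by(clinic) %>%
  summarize(mean_patients = mean(patients))

print(busy_by_clinic)
```

The whole calculation is a single sequence of code with no intermediate variables, which is the style of expression this chapter works toward.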

The tidyverse also includes a few more packages that may be of use, but I won't cover all of them here. Excellent documentation on all of the tidyverse packages is available at https://www.tidyverse.org.

In this chapter, I will walk through some of the basic functionality of the dplyr package and show how it can be used to manipulate data. This chapter will include the following sections:

Logistical overview
Introducing dplyr
Getting started with dplyr
Chaining operations together
Filtering the rows of a dataframe
Summarizing data by category
Rewriting code using dplyr
