Both R and pandas make data manipulation more expressive than it is in most general-purpose programming languages. For example, many tasks that would otherwise require a for loop (such as selecting a column) can be done in a single line of code.
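As a small illustration of that point, the following sketch in base R compares a loop-based column selection with the one-line equivalent. The `people` dataframe and its column names are made up for this example.

```r
# Hypothetical example data: a small dataframe of names and ages.
people <- data.frame(
  name = c("Ana", "Ben", "Cal"),
  age = c(34, 27, 45),
  stringsAsFactors = FALSE
)

# Loop-based approach: visit each row and collect its age value.
ages_loop <- numeric(0)
for (i in seq_len(nrow(people))) {
  ages_loop <- c(ages_loop, people[i, "age"])
}

# Expressive approach: select the whole column in a single line.
ages_direct <- people$age

# Both approaches produce the same numeric vector.
same_result <- identical(ages_loop, ages_direct)
```

The single-line version states what is wanted (the `age` column) rather than how to collect it row by row.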
However, some aspects of data manipulation could still be expressed more directly. Recall that in the previous chapter, a number of processing steps and variables were used to filter the data and find the result. It can be hard to express a large number of data manipulation operations in a way that is both descriptive and self-contained.
Ideally, it should be possible to express all of the steps for processing data in one sequence of code, and in a way that reflects the function of each step. A number of packages build on the R programming language and environment to make it more expressive, concise, neat, and consistent. One well-developed effort to make data processing in R more elegant and intuitive is a collection of packages called the tidyverse.
At the time of writing, the tidyverse includes five packages, two of which I will use in this chapter:
- tibble is a version of R's dataframe with a few improvements. In particular, its printout is a bit cleaner.
- dplyr, as the documentation states, is a grammar for data manipulation. It contains a series of functions that allow you to express data manipulation operations easily and intuitively. The syntax for using dplyr takes some getting used to.
The tidyverse also includes a few more packages that may be of use, but I won't cover all of them here. Excellent documentation on all of the tidyverse packages is available at https://www.tidyverse.org.
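To give a rough feel for what this looks like in practice, here is a minimal sketch that builds a tibble and processes it with a short sequence of dplyr functions. The `sales` data and its column names are invented for illustration, and the sketch assumes the tibble and dplyr packages are installed.

```r
library(tibble)
library(dplyr)

# Hypothetical example data, stored as a tibble rather than a base dataframe.
sales <- tibble(
  region = c("east", "east", "west", "west"),
  amount = c(10, 20, 5, 40)
)

# Each step is named for what it does, and the steps read top to bottom.
result <- sales %>%
  filter(amount > 8) %>%           # keep only rows with amount above 8
  group_by(region) %>%             # treat each region as a separate group
  summarize(total = sum(amount))   # one row per region with its total

print(result)
```

The whole computation is written as one sequence, without the intermediate variables that a step-by-step base R version would typically need.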
In this chapter, I will walk through some of the basic functionality of the dplyr package and show how it can be used to manipulate data. This chapter will include the following sections:
- Logistical overview
- Introducing dplyr
- Getting started with dplyr
- Chaining operations together
- Filtering the rows of a dataframe
- Summarizing data by category
- Rewriting code using dplyr