Simplifying Data Manipulation with dplyr

Both R and pandas make data manipulation more expressive than it is in most general-purpose programming languages. For example, many tasks that would otherwise require a for loop (such as selecting a column) can be done in a single line of code.
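As a small illustration of that point, the sketch below selects a column twice in base R: once with an explicit for loop and once with a single expression. The scores dataframe and its column names are invented for this example.

```r
# Hypothetical toy dataframe, used only for illustration
scores <- data.frame(
  name  = c("Ana", "Ben", "Cy"),
  score = c(90, 72, 85)
)

# Loop-based approach: copy the values of one column row by row
score_loop <- numeric(nrow(scores))
for (i in seq_len(nrow(scores))) {
  score_loop[i] <- scores[i, "score"]
}

# Direct approach: the same selection in a single expression
score_direct <- scores$score

# Both approaches produce the same numeric vector
same_result <- identical(score_loop, score_direct)
```

The loop needs a preallocated vector, an index, and an assignment, while the direct form states only what is wanted.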

However, there are still aspects of data manipulation that could be expressed more directly. Recall that in the previous chapter, a number of processing steps and intermediate variables were needed to filter the data and find the result. It can be hard to express a large number of data manipulation operations in a way that is both descriptive and self-contained.

Ideally, it should be possible to express all of the steps for processing data in one sequence of code, and in a way that reflects the function of each step. A number of packages build on the R programming language and environment to make it more expressive, concise, and consistent. One well-developed effort to make data processing in R more elegant and intuitive is a collection of packages called the tidyverse.

At the time of writing, the tidyverse includes five packages, two of which I will be using in this chapter:

  • tibble is just another version of R's dataframe that has a few improvements. In particular, the printout is a bit cleaner.
  • dplyr, as the documentation states, is a grammar for data manipulation. It contains a series of functions that allow you to express data manipulation operations easily and intuitively. The syntax for using dplyr takes some getting used to.
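To give a first feel for how these two packages fit together, here is a minimal sketch that assumes both tibble and dplyr are installed. The sales data, its column names, and the threshold of 90 are made up for illustration and do not come from the previous chapter.

```r
library(tibble)
library(dplyr)

# Hypothetical data, invented for this sketch
sales <- tibble(
  region = c("east", "east", "west", "west"),
  amount = c(100, 250, 80, 300)
)

# Each step is named for what it does, and the steps read top to bottom
result <- sales %>%
  filter(amount > 90) %>%         # keep rows with amount above 90
  group_by(region) %>%            # group the remaining rows by region
  summarize(total = sum(amount))  # compute one total per region
```

Here result is a tibble with one row per region. The whole computation is a single sequence of steps with no intermediate variables, which is the style the rest of this chapter builds toward.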

The tidyverse also includes a few more packages that may be of use, but I won't cover all of them here. Excellent documentation on all of the tidyverse packages is available at https://www.tidyverse.org.

In this chapter, I will walk through some of the basic functionality of the dplyr package and show how it can be used to manipulate data. This chapter will include the following sections:

  • Logistical overview
  • Introducing dplyr
  • Getting started with dplyr
  • Chaining operations together
  • Filtering the rows of a dataframe
  • Summarizing data by category
  • Rewriting code using dplyr