Filtering the rows of a dataframe

You can filter rows based on their content using the filter() function. (Recall from the previous chapters that you used filtering steps to remove outliers and NA values.)

In the filter() function, each of the arguments following the first is what the documentation refers to as a logical predicate. In other words, each argument is an assertion that some logical expression should be true.
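As a minimal sketch of this behavior, the snippet below uses a small made-up tibble (the toy data and the variable names are assumptions for illustration, not part of the fuel economy dataset). Passing two predicates as separate arguments keeps only the rows for which both are true, which gives the same result as joining them with the & operator.

```r
library(dplyr)

# Hypothetical toy data, invented purely for illustration
toy <- tibble(
  make  = c("Toyota", "Toyota", "Honda"),
  model = c("Camry", "Corolla", "Civic"),
  year  = c(2010, 2010, 2012)
)

# Two predicates passed as separate arguments: a row must satisfy both
both.args <- toy %>% filter(make == "Toyota", model == "Camry")

# The same filter written as a single predicate using the & operator
one.arg <- toy %>% filter(make == "Toyota" & model == "Camry")

print(both.args)
```

Writing the predicates as separate arguments tends to read more cleanly when there are several conditions, which is the style used in the rest of this section.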

The logical expressions used for filtering are defined in terms of the column names of the input dataframe. Here are some examples of logical predicates that could be used as arguments to the filter() function:

  • column.name > 6
  • column.name == "abc"
  • !is.na( column.name )
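The three kinds of predicate listed above can be tried out on a small made-up tibble. This is only a sketch: the toy data and the column names value and label are assumptions standing in for column.name.

```r
library(dplyr)

# Hypothetical toy data, invented purely for illustration
toy <- tibble(
  value = c(3, 8, NA, 10),
  label = c("abc", "abc", "xyz", "xyz")
)

# Numeric comparison; rows where the comparison evaluates to NA are dropped too
big.values <- toy %>% filter(value > 6)

# String equality
abc.rows <- toy %>% filter(label == "abc")

# Keep only the rows where value is not missing
non.missing <- toy %>% filter(!is.na(value))

print(big.values)
print(abc.rows)
print(non.missing)
```

Note that filter() keeps a row only when the predicate evaluates to TRUE, so a comparison against a missing value removes that row even without an explicit is.na() check.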

A good application of the filter() function to the fuel economy dataset could be to find data for just one model. There is likely a lot of variation in the fuel economy data from model to model, so we could get a more consistent result by just focusing on one model.

In the following continuation of dplyr_intro.R, all of the rows corresponding to the Toyota Camry model are selected from the original dataset using the filter() function. To keep the printout neat, only the make, model, and year columns are selected for now:

## filter data to just the Toyota Camry
## (as_tibble() is the current name for the deprecated as.tibble())
vehicles.camry <- as_tibble(
  car_data %>%
    filter(
      make == "Toyota",
      model == "Camry"
    ) %>%
    select(make, model, year)
)
print(vehicles.camry)

Printing the filtered data reveals a list of entries in which the make is Toyota and the model is Camry.

Not surprisingly, there are multiple entries for each make, model, and year, likely due to different sizes, trims, engines, and so on. To get a general sense of how the fuel economy changes from year to year, one possible approach is to find the average fuel economy across all of the variations of the Camry by year. This isn't a very accurate measure, since it does not account for the number of sales of each variation, but it may be enough to reveal a general trend that is worth exploring further.

In the next section, I will show how to use the group_by() and summarize() functions together to create a new dataframe that summarizes the data by group. In particular, they will allow you to compute the yearly average fuel economy across all of the Toyota Camry variations.
