Chapter 23. Filtering

One of the most important factors when cleaning data is deciding whether the data:

  • Can be cleaned up

  • Should be ignored

  • Should be removed

If you decide on the latter option, then you need to filter your data set. This sounds like a very easy decision to make but it shouldn’t be, especially if you are preparing data for others to use. Being certain that you or the end users won’t need this data going forward is difficult. If you are positive the data isn’t needed, don’t remove it except as the last step in the data prep process before publishing, after you’ve considered the following:

  • Does that data give the user context on other data points?

  • Is the data messy but manageable? Just because the data might be hard to tidy up doesn’t mean it couldn’t have value to end users.

  • If the business logic changes—that is, the user has a different business experience—will the data suddenly have meaning?

With those caveats in mind, let’s explore what a filter is and where to use one.

What Is a Filter?

A filter allows you to keep (a Keep Only filter) or remove (an Exclude filter) data from a data set. Once you have decided what you would like to keep or get rid of, you have several types of filters to choose from:

  • Selection

  • Calculation

  • Wildcard

  • Null values

There are also two different forms of filters that can be applied within each type:

  • Data field filters remove columns (fields) of data.

  • Data value filters remove rows of data.

Different Types of Filters

Let’s dig into each type of filter to understand how and when to use it.

Selection

Selection is the most basic form of filtering. The experience is different depending on whether you want to filter out data fields or data values.

You can filter out data fields (columns) in multiple places within Prep Builder, but the metadata grid is ideal. You can easily select and deselect the checkbox options as required. The metadata grid is available on the Input step, as well as Clean steps throughout the flow (Figure 23-1).

The metadata grid in Prep
Figure 23-1. The metadata grid in Prep

Columns can also be removed through the ellipsis menu at the top of a data field (Figure 23-2).

Removing an individual column through the data field menu
Figure 23-2. Removing an individual column through the data field menu

For data value filters, you select, through a range of actions, what to apply the filter to. You can do this in multiple places within Prep Builder, as outlined next.

In the Profile pane

Within the Profile pane, the distribution of instances of a value is indicated by the gray bar, as discussed in Chapter 11. By right-clicking on the bar, you can keep or exclude its value (Figure 23-3). To select multiple values, drag your mouse over the items or hold down the Ctrl or Command key while clicking each item.

Filtering by selecting records in the Profile pane
Figure 23-3. Filtering by selecting records in the Profile pane

In the Data pane

Within the Data pane, you can select any value at the bottom of the screen to filter out all other values (Keep Only) or remove the selected value (Exclude), as shown in Figure 23-4.

Using the Data pane to filter
Figure 23-4. Using the Data pane to filter

From a data field

By clicking on the ellipsis menu when mousing over the data field in the Profile pane, you can choose the filter option Selected Values; this allows you to select the values you want to Keep Only or Exclude (Figure 23-5).

Filtering by Selected Values in the data field menu
Figure 23-5. Filtering by Selected Values in the data field menu

Calculation

After selections, calculations are the most common form of filter. The easiest way to set these filters is through the Filter Values icons located on the gray bar separating the Flow pane and the Profile/Data pane (Figure 23-6).

Filter Values icon in the dynamic options bar
Figure 23-6. Filter Values icon in the dynamic options bar

Selecting Filter Values opens the Add Filter dialog (Figure 23-7). Notice this differs slightly from Add Field dialog that appears when you create calculated fields from scratch.

Add Filter dialog
Figure 23-7. Add Filter dialog

Creating a Boolean calculation will leave the True values in the data set to use in future steps. The False values will be removed from any potential output. You can also trigger this type of filter through the ellipsis menu at the top of a data field in the Profile pane. The only difference with this method is that the selected data field will be automatically added to the Calculation Editor.

Wildcard

Also found in the ellipsis menu at the top of data fields, the Wildcard filter (Figure 23-8) will be familiar to Tableau Desktop users who use this option when filtering by a discrete field.

The Wildcard filter
Figure 23-8. The Wildcard filter

You’ll be prompted to enter a value within the selected field. The wildcard means the value you enter doesn’t have to exactly match the value found in the data field; that is, it can be only a substring of the full value. You have four options to specify where the substring should appear:

Contains
The substring can appear anywhere within the string.
Starts with
The substring has to match the first characters of the value being assessed.
Ends with
The substring has to match the last characters of the value being assessed.
Exact match
The substring must be a perfect match for the entire value in the data field.

If the match is found, the row of data for that value will continue to be used in the data set if the Keep Only filter option is selected; otherwise, that row is removed from the data set. Exact matches will be filtered out if you choose the Exclude filter option.

Null Values

In the Null Values option (again triggered from the ellipsis menu at the top of a data field in the Profile pane), you select whether you want to keep only rows that contain null values or keep those with non-null values (Figure 23-9).

The Null Values filter
Figure 23-9. The Null Values filter

When to Filter Out Columns

Removing columns of data is a data preparation step that you shouldn’t take lightly. Once you’ve made this decision, there is very little (apart from a confusing join from an earlier step) that will bring this data back from oblivion.

That said, being able to remove unnecessary data fields is valuable in the following cases:

  • Removing blank columns. When using Excel or text files, you will often find apparently blank columns, which often have a space lurking in just one of the rows. Remove these and don’t look back!

  • Removing columns that are mostly nulls. Just because certain rows are null doesn’t instantly mean that you should remove the full column. Is there a way to populate those rows of data using the merge functionality? Is there a reason why that column should contain nulls? If so, leave the column in place; otherwise, filter it out.

When to Filter Out Rows

As with columns, there are many reasons for filtering out rows of data. Taking care not to remove data that others might find useful in their analysis is important. Remember, by filtering values out of a data set, you remove all the rest of the information within that row.

Some of the reasons for filtering rows of data out of your data set include:

  • Removing data errors. It’s impossible to correct erroneous data if you don’t know what the value should be. In that case, removing the data record is the only option.

  • Removing rows outside of scope. Your data set might be on the last complete year of transactions, in which case there is no point to having multiple years of data. Filter out the additional years.

  • Improving performance. As mentioned in Chapter 12, reducing the number of rows improves Prep Builder’s response time when forming and running flows.

  • Securing data. You may be in the situation where certain end users shouldn’t have access to certain records. Within Tableau Server, you’d set up User Filters to manage data security, but Prep Builder doesn’t have that feature. Therefore, you would want to use a filter to remove the confidential records and then output a new data set that’s safe for users to access.

Summary

The filter may be the most used cleaning technique in data preparation, so this chapter could have been much longer. This chapter has covered the basics, and Chapter 37 goes into more advanced filtering techniques. It’s crucial to practice filtering and get comfortable with what you are removing from the data set. Deploying the right filtering technique will not only prepare your data for analysis but also future-proof your flows for updating that data.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset