Chapter 37. Using Smarter Filtering

The basics of filtering were covered in Chapter 23, but filtering doesn’t have to mean just manual clicking or setting date ranges to specify what data should be included in your output. Calculations, joins, and measuring percentage variance can all make filters smarter and more automated. Removing manual work during the data preparation stage reduces the risk of mistakes and makes it easier to share the data set with others once it is published.

Calculations

Any calculation can be used to trigger a filter, but certain types of calculations are used for filtering more often than others.

Boolean Calculations

Testing whether or not a value in a data field meets a condition—for example, if a value exceeds a logical limit—is a great way to clean data and remove the records that you do not want. In the example shown in Figure 37-1, we know that Percent Complete cannot exceed 100%, so we use a Boolean filter calculation in Prep to ensure the value is 1 or less by removing any values that exceed 1.

Figure 37-1. Limiting Percent Complete to 100%

Remember, percentages are often stored as values between 0 and 1 rather than 0 and 100, as this makes them easier to format in Tableau Desktop.
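
The calculation in Figure 37-1 isn’t reproduced in the text, but a minimal sketch of a Boolean filter like this, assuming the field is named [Percent Complete], would be:

[Percent Complete] <= 1

Used as a filter calculation in Prep, only the rows that return True are kept, so any record with a value above 1 (that is, above 100%) is removed.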

Logical Calculations

IF and CASE calculations can contain quite complex logic that makes filtering easier by matching certain scenarios across multiple data fields. In the sports data set in Figure 37-2, we want to return only games where one team went from winning at half-time to losing the game, or vice versa. Here are the fields the calculation is using:

[HTf]
Half-time points for the team being assessed
[HTa]
Half-time points for the opposition
[Result]
Final result in the game
Figure 37-2. An IF statement in Prep
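
The exact conditions shown in Figure 37-2 aren’t spelled out here, but a sketch of the logic, assuming [Result] holds "Win" or "Loss" for the team being assessed, might look like this:

IF [HTf] > [HTa] AND [Result] = "Loss" THEN "Upset"
ELSEIF [HTf] < [HTa] AND [Result] = "Win" THEN "Upset"
ELSE "Expected"
END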

This calculation returns the same value, 'Upset', for both game scenarios, so we can then filter the 'Expected' results out of the data set.

IF statements work through their conditions in order, from first to last. As soon as one condition (or set of conditions) is met, the calculation returns the corresponding value. Conditions can be very detailed, so IF statements can grow into large, complex calculations, but sometimes that complexity is exactly what you need to determine whether the data should be filtered out.

Regex Calculations

For more complex filtering, regular expressions (regexes) can be a massive help. As covered in Chapter 31, regexes let you test values against a pattern. For example, if a letter appears in a field that should contain only numbers, a regex can identify those records so they can be handled separately (Figure 37-3).

Figure 37-3. Regex function being used in Prep
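
The exact function used in Figure 37-3 isn’t shown in the text, but a sketch using Tableau’s REGEXP_MATCH() function, assuming the field is named [For], would be:

REGEXP_MATCH([For], "[A-Za-z]")

REGEXP_MATCH() returns True whenever the pattern is found anywhere in the value, so any entry containing a letter is flagged.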

This calculation tests whether a letter is present in the For column, which should contain only points scored, and will produce a True/False response (Figure 37-4).

Figure 37-4. Result of the calculation in Figure 37-3

A filter can then be configured to return just those values in the column that don’t contain any letters (Figure 37-5).

Figure 37-5. Filter to remove values

After converting the data field For to “Number (whole),” we get the data set we needed, containing only the numerical point values (Figure 37-6).

Figure 37-6. Results of For data field conversion after regex calculation and filter

Join Ranges

Using join conditions that intentionally exclude data prevents you from loading data that you would need to filter out of the data set later. One technique for doing this, a join range, allows you to join two data sets together based on join conditions using less than or greater than instead of equal to (the most common join condition).
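
As an illustrative sketch (the field names here are assumptions, not taken from the chapter’s examples), a join range between a transactions input and a rate-card input might replace the single equality condition with a pair of range conditions:

[Transaction Date] >= [Rate Start Date]
[Transaction Date] <= [Rate End Date]

Only the transactions that fall inside a rate’s validity window form rows in the joined output, so out-of-range records never enter the flow and never need to be filtered out later.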

Note

Chapter 32 covers advanced join scenarios in a lot more detail. Chapter 39 covers scaffolding, another advanced join technique that uses join ranges to filter data.

By preventing rows from being formed during joins that you’d only have to filter out later, join ranges reduce the amount of data processed and require less computing power. Since Prep Builder is used primarily by “normal business users” on laptop and desktop computers, where processing power can be limited, this is a huge advantage.

Percentage Variance

Another form of smarter filtering is testing key measures to ensure they fall within a tolerance level. Tolerance levels often depend on business logic and rules. A significant variance from the previous minimum, maximum, or average is a strong indication that something in the data is amiss and needs addressing.

Percentage variances are often good checks to place into a flow where the data has been manually entered, increasing the risk of typos, or where the data will be reloaded. Let’s look at each scenario in turn.

Manual Entry: Level of Detail Calculations

Because manually entered data is likely to be correct most of the time, with only the odd mistake, you can set the benchmark from the data set already being used in the flow and test its values against that benchmark.

With the addition of Level of Detail (LOD) calculations in Tableau Prep Builder 2020.1.3, it became a lot easier to check the percentage variance because these calculations add a new column to the data set. We’ll use the Soap Production Cost data set in Figure 37-7 to check for any outliers.

Figure 37-7. Soap Production Cost data set

To identify outliers in the data set, you first need to establish the “normal” level. For this data set, let’s assume we know that production costs normally differ between the types of product. We can build this calculation using the Visual Editor (Figure 37-8).

Figure 37-8. Visual Editor for LOD calculations
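
Assuming the fields are named [Type] and [Production Cost], the calculation built through the Visual Editor in Figure 37-8 is likely equivalent to this LOD expression:

{ FIXED [Type] : AVG([Production Cost]) }

This adds a new column containing the average production cost for each product type alongside every row of the data set.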

This gives us two values to test the production costs against. Creating two calculations that add and subtract the percentage variance will allow us to test each value. For this data set, the percent tolerance level is 30% above and below the average production cost per product. We can create the Upper Check field with the calculation shown in Figure 37-9.

Figure 37-9. Calculation for the Upper Check field

We create the Lower Check field with the calculation shown in Figure 37-10.

Figure 37-10. Calculation for the Lower Check field
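
A sketch of both check calculations, assuming the LOD result from Figure 37-8 was named [Avg Cost per Type]:

Upper Check: [Avg Cost per Type] * 1.3
Lower Check: [Avg Cost per Type] * 0.7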

We can then check the tolerance by assessing whether the Production Cost value lies between the Upper and Lower Check bounds. To do this, we use the IF statement shown in Figure 37-11 in a new calculation.

Figure 37-11. Using the Upper and Lower Check bounds to set a tolerance level
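
A sketch of that IF statement, using the field names assumed earlier in this section:

IF [Production Cost] > [Upper Check] THEN "Outside Tolerance"
ELSEIF [Production Cost] < [Lower Check] THEN "Outside Tolerance"
ELSE "Within Tolerance"
END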

The flow can then be split based on whether the values are within the specified tolerance level. The records containing values that sit outside the Upper and Lower Check bounds can be branched off for further checks and validation (Figure 37-12).

Figure 37-12. Branching the flow to handle records within and outside tolerance

Reloaded Data: Join to Previous Output

When a system goes wrong, it often goes very wrong. If every value in a data field has fundamentally changed, the averaging technique we just used will not catch the issue, because the benchmark average is calculated from the very column that has changed. To resolve this, we need to test the newly loaded data against the previous data set.

Let’s structure a flow to recognize the issue in the data set shown in Figure 37-13, where the decimal points have suddenly disappeared from the Production Cost field.

Figure 37-13. Data set resulting from a system error

We will use many of the same steps as in the earlier technique, but with some key differences.

Aggregating the Average Production Cost per Type

This Aggregate step changes the granularity of the data to the level of the tolerance check. We do this calculation on the original data, not the new data set being loaded. In Figure 37-14, we aggregate up to a single record per product type.

Figure 37-14. Aggregate step using the original data set

The benefit of using an Aggregate step is that it reduces the granularity to one record per element you want to join by. You will still need to calculate the Upper and Lower Checks based on the results from this step.

Joining the Data Sets Together

Now we’ll join the new data set and original data set together to add the checks to the new data set (Figure 37-15).

Figure 37-15. Join condition to join together the new and old data set
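
The join clause itself is likely a simple equality on the shared field, for example [Type] = [Type] (the field name is an assumption), which brings the aggregated average for each product type onto every row of the newly loaded data.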

This results in a data set where we can use the same calculation from earlier to determine whether or not the Production Cost values are within the tolerance levels (Figure 37-16).

Figure 37-16. Resulting data set of the Join in Figure 37-15

We can then make our filtering choices the same way as in the earlier technique.

Combining Techniques

Using the join range and percentage variance techniques together can also automate the filtering to leave only the results you require and flag potential issues in your data. In the previous example, we could set the join condition shown in Figure 37-17 to remove any values outside of the specified tolerance levels.

Figure 37-17. Join condition set up to combine both smart filtering techniques
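
A sketch of that combined condition, again assuming the field names used earlier, would pair the equality with two range clauses:

[Type] = [Type]
[Production Cost] >= [Lower Check]
[Production Cost] <= [Upper Check]

Only the rows whose Production Cost sits inside the tolerance band survive the join, so out-of-tolerance records are excluded automatically.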

This prevents any data that could lead to misinformed decisions from being processed and shared with end users. If there is any doubt that the filtered values may be real records, even if they are outliers, make sure to add checks to your process so you aren’t removing genuine data and compromising the accuracy of your analysis.

Summary

The techniques described in this chapter can provide safeguards when things go wrong, whether through data entry mistakes or because the systems that produce data sets fail. Calculations, join ranges, and percentage variances take a rules-based approach to help you filter your data sets.
