Chapter 15. Aggregating in Prep Builder

Thanks to the two tools’ similar look and feel, Tableau users can prepare their data in Prep Builder for visual analysis in Desktop quickly and easily—that is, until some of those familiar features start to differ. When I’m teaching Prep Builder, aggregation is a common source of pain for my students for this reason. In this chapter, we will cover how aggregation in Prep Builder differs from Desktop, how the Aggregate step works, and how to overcome the biggest challenge of aggregation: adding back into Desktop any column that you may want for further analysis.

Comparing Calculations in Prep Builder and Desktop

Like Desktop, Prep Builder will work out each calculation at a row level if you don’t use any form of aggregation. For example, if I wanted to calculate the total cost in the simple Bathroom Renovation data set shown in Figure 15-1, I would simply add one cost value to the other, one row at a time.

Bathroom Renovation data set
Figure 15-1. Bathroom Renovation data set

To add these costs together, we would write the same calculation in Prep Builder as we would in Tableau Desktop (Figure 15-2).

Calculation for total cost
Figure 15-2. Calculation for total cost

This adds the values together for each row, while creating a new column to hold the total (Figure 15-3).

Result of total cost calculation
Figure 15-3. Result of total cost calculation

The total cost calculation and the additional column to hold it are identical between the two tools. For heavy Excel users who haven’t used a tool that works with columns, the approach is the same in Excel: to calculate this value, you simply select which cells to add together rather than selecting the column names as you do in Tableau (Figure 15-4).

Excel equivalent of the calculation in #calculation_for_total_cost
Figure 15-4. Excel equivalent of the calculation in Figure 15-2

Which Calculations in Prep Builder Differ?

So if our row-based calculations are the same in both Prep Builder and Desktop, which calculations aren’t? You’ve probably guessed the answer: calculations where you are aggregating multiple rows together in the same column. Aggregations in Prep Builder allow you to change the granularity of the data set and create subtotals or totals.

Let’s use totals as an example. In Desktop, to add up each item’s total cost, we would write the calculation shown in Figure 15-5.

Calculating a data set total in Tableau Desktop
Figure 15-5. Calculating a data set total in Tableau Desktop

If you’ve used Desktop for a while, you probably know that if you add an aggregation to your calculation, Tableau will add up all those values based on what discrete fields are within your view (Figure 15-6).

Note

Discrete fields are blue and continuous fields are green. These fields are sometimes called pills because of their distinctive shape.

Desktop aggregates by the dimensions used on a worksheet
Figure 15-6. Desktop aggregates by the dimensions used on a worksheet

We don’t have any discrete fields in our view, so Desktop simply sums up all of our items’ total costs. The equivalent in Excel looks like Figure 15-7.

The Excel equivalent of the calculation in #calculating_a_data_set_total_in_tableau
Figure 15-7. The Excel equivalent of the calculation in Figure 15-5

But in Prep Builder, the sum() function doesn’t exist within Calculated Fields, so trying to use it throws an error (see Figure 15-8).

Trying to use sum() in Prep throws an error
Figure 15-8. Trying to use sum() in Prep throws an error

The reason for this error is because we need to tell Prep Builder what to sum up and to what level of granularity. In Desktop, Tableau determines this from the discrete fields in the view (placed on the Columns or Rows shelf or used on the Marks card). In Figure 15-9 we have put Type of Item on the Columns shelf, so Desktop is now showing the total costs for each type of item rather than the sum of all of them together.

Note

The Marks card controls how the data point is shown in the visualization. You can control the type, color, size, or even whether the mark has a label attached to it.

Adding the Type of Item dimension breaks down the total cost by each item
Figure 15-9. Adding the Type of Item dimension breaks down the total cost by each item

As you can see, this gives you a huge amount of flexibility and is why it’s so easy to explore data in Desktop. You’d only use Prep Builder for this aggregation when you know what level of granularity you want, so we set this up using the aggregation step rather than a Calculated Field.

Some examples of where you would want to aggregate data in Prep Builder include:

Showing percent of total calculations
You can use the Aggregate step to calculate totals, or, by adding a column to the “Group by” (which we’ll cover shortly) to calculate subtotals. You can then use these totals and subtotals as the denominator for the individual records.
Changing the granularity of the data
Input data sets are not always at the level of granularity you need for analysis. Reducing the granularity by aggregating the data set can boost performance when you are analyzing the data in Desktop. You might also change the granularity so it matches that of another data set in order to make joining data sets easier.
Analyzing data in Prep Builder
You can use aggregation when you want to answer questions about the data within Prep Builder instead of having to output the values to Desktop.

Adding the Aggregate Step

Before we dive into the “how” of aggregation, let’s go over adding an Aggregate step. First, click the plus sign at the end of your flow (or wherever you want to add it) and select Add Aggregate (Figure 15-10).

Selecting Add Aggregate in Prep
Figure 15-10. Selecting Add Aggregate in Prep

The basic Aggregation pane is shown in Figure 15-11.

The Aggregation pane in Prep
Figure 15-11. The Aggregation pane in Prep

The gray pane on the left shows all the fields in your data set and their default aggregation based on their data type. There are two ways to configure this tool. The first is to drag the fields you want to aggregate to the Aggregated Fields pane on the right and select “Group by.” You can change the aggregation by clicking on the current type of aggregation (e.g., Sum) at the top of the column (Figure 15-12). The second option is to click the SUM or GROUP BY text next to each field in the left-hand pane. This opens a drop-down menu of all possible methods of aggregating. Selecting “Group by” will move the field to the Grouped Fields pane at the center of the screen, and selecting a method of aggregation will move the field to the Aggregated Fields pane at the right.

Changing the type of aggregation
Figure 15-12. Changing the type of aggregation

If you haven’t used SQL or other data coding languages, the concept of “Group by” might be a bit strange at first. Basically, adding a field to “Group by” is the equivalent of saying, “For each of these ‘things,’ I want the aggregated value to be returned.” If you have multiple fields in the Group by, then this means you get an aggregated value for each combination of the grouped fields.

To clarify this, let’s look at a few examples with our Bathroom Renovation data set:

Group by: Nothing; Aggregate: sum(Total Cost)

When you group by nothing, Prep Builder will aggregate the data however you’ve asked for it. For this example, summing total cost, Prep Builder adds up all the total costs in the data set (Figure 15-13).

Total cost for the entire data set
Figure 15-13. Total cost for the entire data set

The equivalent in Excel is summing the whole rightmost column (Figure 15-14).

Calculating total cost for the data set in Excel
Figure 15-14. Calculating total cost for the data set in Excel
Group by: Type of Item; Aggregate: sum(Total Cost)

By adding Type of Item to “Group by,” we are asking Prep Builder to break down the total cost by each different type of item in our data set (Figure 15-15).

Breaking down the total cost by type of item
Figure 15-15. Breaking down the total cost by type of item

Notice how the data set has three different types of items, so there are three rows of data shown at the bottom right of the screen. The Profile pane has only two values, as two rows have the same value (65) in our output.

In Excel, we would use the SUMIF() function to calculate the three group totals (Figure 15-16).

If a different, new type of item were added to the data set, the next time Prep Builder is run, it would create a fourth row of data.

Breaking down total cost by type of item in Excel
Figure 15-16. Breaking down total cost by type of item in Excel
Group by: Type of Item, Item; Aggregate: sum(Total Cost)

If we group by each categorical data field in our view (i.e., Type of Item and Item), the aggregated values look the same as they did before the aggregation step (Figure 15-17).

Increasing the number of data fields to group by further divides the total cost
Figure 15-17. Increasing the number of data fields to group by further divides the total cost
In this example, this aggregation isn’t actually aggregating anything, as each row of data was already a unique combination of the Type of Item and Item dimensions. Therefore, we have as many rows flowing in to the step as flowing out (so it’s a pointless aggregation for this particular data set!).

Where’s the Rest of My Data?

Following your aggregation, you might have additional preparation steps to take. The challenge with the Aggregate step is that you output only the fields in your Group by and Aggregation setup.

Let’s go back to the second Group by example, where we are grouping only by Type of Item, and see what flows into the next Clean step (Figure 15-18).

The Aggregate step outputs only grouped by or aggregated data
Figure 15-18. The Aggregate step outputs only grouped by or aggregated data

We have three different types of items, so we have three rows of data, but notice that we have left behind all the other columns of data. To get these back, you need to join the total you have just added back to the original data set. Where you draw the data from before the Aggregate step depends on what changes have been made to that data. Your flow will look something like Figure 15-19.

Self-join to reintroduce the data lost during an Aggregate step
Figure 15-19. Self-join to reintroduce the data lost during an Aggregate step

To add the total value to your existing data, use a join condition(s) for whatever is within the “Group by” part of your aggregation. In this example, the join condition is where Type of Item in our original data set has the same value as Type of Item in our aggregated data set.

Level of Detail Calculation Option

If you want to avoid rejoining your data after the aggregation, consider using Level of Detail calculations. As Chapter 33 will cover, this type of calculation doesn’t change the granularity of the data set like the Aggregate step does. Instead, you can set how and to what level one measure is aggregated but leave the rest of the data set unchanged. Therefore, there is no need to rejoin the data set.

Summary

The Aggregate step in Prep Builder differs significantly from how Desktop users may think about data aggregation. Changing the granularity of the data is a common aspect of data preparation, as it enables you to join data sets together more easily or reduce the number of records to be processed during analysis. You must be careful not to overaggregate the data source if you are visualizing the data in Desktop, as this could negate some of the tool’s benefits. Specifically, because Desktop aggregates the data based on the discrete fields within the view—giving you the freedom to analyze what you want, how you want—you should not aggregate your data unless it is necessary, or you risk losing this flexibility.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset