Chapter 17. Unioning

Most software that works with data demands that you form a single table of data to work from. However, the world is often not that simple, so you’ll often have to pull together many tables of data to build that single table. Unioning is a data preparation technique that will help you with this task.

What Is a Union?

You can think of unioning as stacking one data set on top of another. Columns that contain the same content should be unioned as part of the data preparation process. As you’ll soon see, this requires the data structures to be very similar.

Let’s look at two separate example data sets: York Store Sales (Figure 17-1) and Leeds Store Sales (Figure 17-2).

York Store Sales data set
Figure 17-1. York Store Sales data set
Leeds Store Sales data set
Figure 17-2. Leeds Store Sales data set

Unioning these two tables removes the extra set of column headers and stacks the rows of data on top of each other (Figure 17-3).

Unioned result of York and Leeds data sets
Figure 17-3. Unioned result of York and Leeds data sets

The union is usually determined by whether the data fields are:

  • Named the same in each data set. The contents of columns with the same name will be stacked in rows under that column name in the resulting data set. Currently, Prep Builder allows only for this type of union.

  • Positioned in the same order. The contents of the first column will be stacked on top of the contents of the first column in the other data set, regardless of what it contains. This type of union isn’t supported within Prep Builder.

  • Manually matched by the user. Some software allows the user to match columns of their own choice before unioning the data sets together. This functionality also isn’t supported within Prep Builder.

In the previous example, the tables could have been unioned by either their position or name. In Prep Builder, you can union columns using the other techniques by giving the relevant columns the same name to satisfy the requirements of the first technique.

What If the Data Structure Isn’t Identical?

Because unioning in Prep Builder is based on matching column names, when those names are not the same it creates a mismatch in the resulting data set. For example, the same data is captured under the Scent column header in the Leeds Store Sales table and the Type column header in the York Store Sales table (Figure 17-4 and Figure 17-5).

Leeds Sales Stores with Scent column header
Figure 17-4. Leeds Sales Stores with Scent column header
York Store Sales with Type column header
Figure 17-5. York Store Sales with Type column header

Therefore, in a normal union, even though you might assume the column contents should be stacked on top of each other, they will be mismatched in the resulting table (i.e., there will be two separate columns for the same data). When this happens, rows from each table will have nulls where no matching column is found in the other table (Figure 17-6).

Unioned data with mismatched fields
Figure 17-6. Unioned data with mismatched fields

In Prep Builder, you can clearly see these mismatched fields within the Union step (Figure 17-7).

Mismatched fields from a Union step in the Profile pane
Figure 17-7. Mismatched fields from a Union step in the Profile pane

You can merge the mismatched fields by selecting them and clicking Merge Fields (Figure 17-8).

Merge Fields option
Figure 17-8. Merge Fields option

When to Union Data

So now you have seen how to union, but when do you deploy the technique? Let’s look at some common scenarios where unioning data is a good strategy.

Monthly Data Sets

When working with other teams in your organization or external third parties, you’ll often receive files to analyze on a monthly basis. Automated data preparation flows can save you a lot of time shaping the data into a consistent, useful state for analysis. However, if the data needs to build up over time, then you’ll need to append the files to each other. Unions are the perfect technique for this. Figure 17-9 shows how you’d use a union to change the York Store Sales data from monthly sales files to a single file for analysis.

Monthly data sets unioned into one table
Figure 17-9. Monthly data sets unioned into one table

Data Sets from Web Sources

Pulling together similar data sets from web-based sources is another case where unioning files can be useful. For example, you might need to take team rosters from their individual web pages and union them together. You might also consider adding the URL, or sheet name, as a reference in the data set.

In the example shown in Figure 17-10, which uses ESPN’s team rosters and Google Sheets’ IMPORTHTML() function to pull together a full list of players, you can union the two team roster tables together, but you would lose the team each player plays for. When you use the Union step in Prep, Tableau adds the name of the table where the data originated.

Creating a Table Names column as part of the Union step
Figure 17-10. Creating a Table Names column as part of the Union step

This allows you to split the team name off from the rest of the table source (Figure 17-11).

Splitting the team name from the table name
Figure 17-11. Splitting the team name from the table name

Company Mergers

With the rise of cloud-based software, more companies are using similar tools and data structures. This means that when departments and organizations merge, they can union together their data exports. For example, in a merger of two companies that both use Salesforce, using a union to pull together a combined sales pipeline list would be very simple.

Multiple Tables and Wildcard Unions

When unioning parts of your flow, you can add more than two inputs into a single union. However, if you have several inputs to union, it can quickly become tedious to add a single input for each data source. Instead, during the Input step in Prep Builder, you can select “Wildcard union” rather than “Single table.” The Wildcard has three sections that allow you to select any folder (including subfolders), file, or sheet that matches the pattern you specify (Figure 17-12).

Setting up wildcard unions on the Input step
Figure 17-12. Setting up wildcard unions on the Input step

In the sections called out in Figure 17-12, you can see an asterisk (*) beside the Matching Pattern. The asterisk symbol is used as the wildcard character in data circles and acts as a catchall, enabling you to better control what should and shouldn’t be captured in the union.

For example, say you have one Excel workbook with three sheets: York Sales, Leeds Sales, and Reference Table. If you wanted to pull the York and Leeds Sales tables together, then in the Sheets section, you would enter *Sales for the Matching Pattern.

That way, any characters in the sheet name before the word “Sales” would be ignored, but the York Sales and Leeds Sales sheets would be included. Because the Reference Table doesn’t meet the criteria, it wouldn’t be included in the union or available for analysis. This functionality also future-proofs the flow, as any new sheets that match the specified pattern set would be added to the flow and processed.

Summary

Like joins, unions are a fundamental data preparation technique. Arguably a lot easier to use than joins, as there are no conditions to set up, unioning data simply involves stacking data sets to create a single table for analysis. Finally, using the wildcard union technique on the Input step ensures that as matching files are added to a folder or other location, they will be added to the data set the next time the flow is run, saving you an additional Input step.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset