Whether your data comes from files, databases, or both, you’ll need to master joins, as most software requires one large table of data to reference for analysis. Your data sets will come from a lot of different sources, so being able to combine them is key. This will allow you to take columns from each data source and use them alongside one another in the output. The rows of data they contain are added to the resulting data set.
Most of the examples in this chapter will use the 2019: Week 29 inputs. We will cover the logic and terminology for working with joins, how to join data sets together in Prep Builder, and which type of join to use in different situations.
Joining separate data sources is very easy in Prep Builder. Simply drag one data source toward another, and Prep Builder will give you a number of options (Figure 16-1).
By dragging the second data set (in this case Subscription Pricing) to the original (blue) data flow (Clean 2), you’ll be given the option to union or join the data sets. If you hover over the Join option and then release the mouse button, Prep Builder will automatically link the two flows of data together (Figure 16-2).
Another approach to joining two data sets is to add a Join step, drag the second data set onto it, and then release your mouse button above the Add section (Figure 16-3).
You might have noticed that an error is generated in the flow in Figure 16-3; this is because the join condition hasn’t automatically been set. This occurs when Prep Builder doesn’t find matching data field names in the two separate inputs (Figure 16-4). Be careful to check that the join condition makes sense—just because the column headers match, it doesn’t mean the fields have exactly the same values.
You can add the join condition by clicking Add or the circled plus sign in the Applied Join Clauses section of the Join step. To set the condition, select a field to assess from each data source and then choose the type of assessment. Equals is the most common, but you can also set qualifiers like Does Not Equal or Less Than to handle more complex logic and save the data’s end users from having to filter the result. If you want to add more join conditions, click the plus sign and repeat the process for each addition join condition.
Once the join condition is set, Prep Builder will demonstrate the results of the join, as shown in Figure 16-5.
The Summary of Join Results at the bottom of this view lists the number of rows in the resulting data set (Join Result), as well as how many rows from each original data set are included. If the Join Result is significantly higher than either of the input data sets, you should investigate if the join is creating duplicates or other additional rows. This information can give you insight into the potential impact of different joins as you decide which type to use. To change the join type, click the different sections of the Venn diagram in the central Join Type area. Any areas shaded in dark gray will be included in the resulting output. If you want to prevent a section of the Venn diagram from being returned, click that section again to deselect it.
The majority of data software, including Prep, uses similar logic and terminology for joining data sets together. However, joining data sets together in Prep is a lot more user-friendly than in other tools.
Figure 16-6 and Figure 16-7 show two sales-related tables that we might want to join together to analyze revenue. Since the number of subscription packages sold and the package price appear in separate tables, we need to join the tables to find out the revenue generated.
As noted earlier, there are two aspects to creating this join:
Join condition, which determines how to link the two tables together
Join type, which determines what will be returned by the join
In the sales example, the join condition would be where the values match in the data fields Packages and Subscription Package. To conduct the revenue analysis, we’d want to analyze the Price field for each package along with the number sold. Remember, when joining data sets together, you can set multiple join conditions using different fields, so it doesn’t have to be simply a case of one value matching another in a different data set. Join conditions need to match not just the values but the data types as well: a string with the value 1
would not match an integer with the value 1
.
Joins in many tools, Tableau included, are represented by a Venn diagram. This is because we need to consider what parts of the tables we return, which determines the join type we’ll use. The terminology associated with joins is aligned to Venn logic also; you can perform left joins and right joins. One data set corresponds to the left-hand circle in the Venn diagram, and the other to the right-hand circle. Keep in mind that you can join only two data sets at a time. If you have more than two, simply join two together and add another Join step for each additional data source.
To explore the impact of using each join type in Prep Builder, we’ll use two simple data sets: Left Table (Figure 16-8) and Right Table (Figure 16-9). Each example will demonstrate the output that would result if we set the join condition such that the category from the Left Table is the same as the category from the Right Table.
In a real-world scenario, the output from Prep Builder would contain both tables’ Category columns, but these are usually removed by the user, so I’ve just recorded one here.
Prep Builder gives you the following Join Type options:
The inner join will return only records that meet the join condition(s) (Figure 16-10).
For the example data set, an inner join would return only Category values A
and B
but would have values for both Measure 1 and Measure 2. Figure 16-11 shows the result.
The left join returns all rows from the left table, plus the records from the right table that meet the join condition(s). Any rows from the right table that do not meet the join condition will not be carried through to the next step (Figure 16-12). Records in the left table that do not have a match in the right table will have a null value in the resulting table.
For the example data set, the left join returns all columns and values from the Left Table but returns only the values from the Right Table where there is a matching Category value (Figure 16-13).
The leftOnly join returns records that do not meet the join condition(s) only from the left table (Figure 16-14).
The example data set has only one unmatched Category value and returns only one row of data (Figure 16-15).
Similar in theory to the left join, the right join returns everything from the right table, plus records from the left table that meet the join condition(s) (Figure 16-16).
The example data set would return all the data for the right table, but only the two matching Category values from the left table (Figure 16-17).
Similar to leftOnly, the rightOnly join will return records that do not meet the join condition only from the right table (Figure 16-18).
As in the leftOnly example, one value is returned for the rightOnly join (Figure 16-19).
The full join produces a data set that contains all the values from both tables (Figure 16-20). When a value in either table doesn’t have a match in the other table, you’ll see a null value in the resulting data set.
Using the example data set, all the values from both the Left Table and Right Table data sets are returned, but nulls appear where the Category fields don’t have a corresponding value in the other table (Figure 16-21). Unlike the other join examples where I’ve omitted the duplicate Category field, I’ve left it in here to show the differences in the output. The second Category field will automatically be appended with the value –1
as each data field in a Prep Builder data set must have a unique name.
The notInner join will return the same data as the full join except it will not return the records that meet the join condition (Figure 16-22).
As with the full join, I’ve retained the duplicate Category field for the notInner join in Figure 16-23 to highlight the difference in its output compared to the earlier join types.
Just because a join returns the correct number of records, it doesn’t mean the join is correct. You must carefully think through what data you’ll be including and excluding when choosing a join condition and type of join.
Let’s consider a few situations that call for a specific join type:
Joins can be a fantastic way to add further context to your analysis, as you can add data that didn’t reside in the original source. There are challenges with non-unique join conditions—a topic we will cover in future chapters—but hopefully this chapter has helped you better understand how and when to use different types of joins.