Chapter 16. Joining Data Sets Together

Whether your data comes from files, databases, or both, you’ll need to master joins, as most software requires one large table of data to reference for analysis. Your data sets will come from a lot of different sources, so being able to combine them is key. This will allow you to take columns from each data source and use them alongside one another in the output. The rows of data they contain are added to the resulting data set.

Most of the examples in this chapter will use the 2019: Week 29 inputs. We will cover the logic and terminology for working with joins, how to join data sets together in Prep Builder, and which type of join to use in different situations.

How to Join Data Sets in Prep Builder

Joining separate data sources is very easy in Prep Builder. Simply drag one data source toward another, and Prep Builder will give you a number of options (Figure 16-1).

Dragging one step toward another triggers the Join/Union options
Figure 16-1. Dragging one step toward another triggers the Join/Union options

By dragging the second data set (in this case Subscription Pricing) to the original (blue) data flow (Clean 2), you’ll be given the option to union or join the data sets. If you hover over the Join option and then release the mouse button, Prep Builder will automatically link the two flows of data together (Figure 16-2).

The Join step in Prep
Figure 16-2. The Join step in Prep

Another approach to joining two data sets is to add a Join step, drag the second data set onto it, and then release your mouse button above the Add section (Figure 16-3).

Adding a data set to a Join step
Figure 16-3. Adding a data set to a Join step

You might have noticed that an error is generated in the flow in Figure 16-3; this is because the join condition hasn’t automatically been set. This occurs when Prep Builder doesn’t find matching data field names in the two separate inputs (Figure 16-4). Be careful to check that the join condition makes sense—just because the column headers match, it doesn’t mean the fields have exactly the same values.

Join step configuration pane without a join condition set
Figure 16-4. Join step configuration pane without a join condition set

You can add the join condition by clicking Add or the circled plus sign in the Applied Join Clauses section of the Join step. To set the condition, select a field to assess from each data source and then choose the type of assessment. Equals is the most common, but you can also set qualifiers like Does Not Equal or Less Than to handle more complex logic and save the data’s end users from having to filter the result. If you want to add more join conditions, click the plus sign and repeat the process for each addition join condition.

Once the join condition is set, Prep Builder will demonstrate the results of the join, as shown in Figure 16-5.

Viewing the effects of the join
Figure 16-5. Viewing the effects of the join

The Summary of Join Results at the bottom of this view lists the number of rows in the resulting data set (Join Result), as well as how many rows from each original data set are included. If the Join Result is significantly higher than either of the input data sets, you should investigate if the join is creating duplicates or other additional rows. This information can give you insight into the potential impact of different joins as you decide which type to use. To change the join type, click the different sections of the Venn diagram in the central Join Type area. Any areas shaded in dark gray will be included in the resulting output. If you want to prevent a section of the Venn diagram from being returned, click that section again to deselect it.

Join Logic and Terminology

The majority of data software, including Prep, uses similar logic and terminology for joining data sets together. However, joining data sets together in Prep is a lot more user-friendly than in other tools.

Figure 16-6 and Figure 16-7 show two sales-related tables that we might want to join together to analyze revenue. Since the number of subscription packages sold and the package price appear in separate tables, we need to join the tables to find out the revenue generated.

Customer Order Frequency table
Figure 16-6. Customer Order Frequency table
Subscription Pricing table
Figure 16-7. Subscription Pricing table

As noted earlier, there are two aspects to creating this join:

  • Join condition, which determines how to link the two tables together

  • Join type, which determines what will be returned by the join

In the sales example, the join condition would be where the values match in the data fields Packages and Subscription Package. To conduct the revenue analysis, we’d want to analyze the Price field for each package along with the number sold. Remember, when joining data sets together, you can set multiple join conditions using different fields, so it doesn’t have to be simply a case of one value matching another in a different data set. Join conditions need to match not just the values but the data types as well: a string with the value 1 would not match an integer with the value 1.

Joins in many tools, Tableau included, are represented by a Venn diagram. This is because we need to consider what parts of the tables we return, which determines the join type we’ll use. The terminology associated with joins is aligned to Venn logic also; you can perform left joins and right joins. One data set corresponds to the left-hand circle in the Venn diagram, and the other to the right-hand circle. Keep in mind that you can join only two data sets at a time. If you have more than two, simply join two together and add another Join step for each additional data source.

Types of Join in Prep Builder

To explore the impact of using each join type in Prep Builder, we’ll use two simple data sets: Left Table (Figure 16-8) and Right Table (Figure 16-9). Each example will demonstrate the output that would result if we set the join condition such that the category from the Left Table is the same as the category from the Right Table.

Note

In a real-world scenario, the output from Prep Builder would contain both tables’ Category columns, but these are usually removed by the user, so I’ve just recorded one here.

Left Table data set
Figure 16-8. Left Table data set
Right Table data set
Figure 16-9. Right Table data set

Prep Builder gives you the following Join Type options:

inner

The inner join will return only records that meet the join condition(s) (Figure 16-10).

Inner join
Figure 16-10. Inner join

For the example data set, an inner join would return only Category values A and B but would have values for both Measure 1 and Measure 2. Figure 16-11 shows the result.

Results of the inner join
Figure 16-11. Results of the inner join
left

The left join returns all rows from the left table, plus the records from the right table that meet the join condition(s). Any rows from the right table that do not meet the join condition will not be carried through to the next step (Figure 16-12). Records in the left table that do not have a match in the right table will have a null value in the resulting table.

Left join
Figure 16-12. Left join

For the example data set, the left join returns all columns and values from the Left Table but returns only the values from the Right Table where there is a matching Category value (Figure 16-13).

Results of the left join
Figure 16-13. Results of the left join
leftOnly

The leftOnly join returns records that do not meet the join condition(s) only from the left table (Figure 16-14).

leftOnly join
Figure 16-14. leftOnly join

The example data set has only one unmatched Category value and returns only one row of data (Figure 16-15).

Results of the leftOnly join
Figure 16-15. Results of the leftOnly join
right

Similar in theory to the left join, the right join returns everything from the right table, plus records from the left table that meet the join condition(s) (Figure 16-16).

Right join
Figure 16-16. Right join

The example data set would return all the data for the right table, but only the two matching Category values from the left table (Figure 16-17).

Results of the right join
Figure 16-17. Results of the right join
rightOnly

Similar to leftOnly, the rightOnly join will return records that do not meet the join condition only from the right table (Figure 16-18).

rightOnly join
Figure 16-18. rightOnly join

As in the leftOnly example, one value is returned for the rightOnly join (Figure 16-19).

Results of the rightOnly join
Figure 16-19. Results of the rightOnly join
full

The full join produces a data set that contains all the values from both tables (Figure 16-20). When a value in either table doesn’t have a match in the other table, you’ll see a null value in the resulting data set.

Full join
Figure 16-20. Full join

Using the example data set, all the values from both the Left Table and Right Table data sets are returned, but nulls appear where the Category fields don’t have a corresponding value in the other table (Figure 16-21). Unlike the other join examples where I’ve omitted the duplicate Category field, I’ve left it in here to show the differences in the output. The second Category field will automatically be appended with the value –1 as each data field in a Prep Builder data set must have a unique name.

Result of the full join
Figure 16-21. Result of the full join
notInner

The notInner join will return the same data as the full join except it will not return the records that meet the join condition (Figure 16-22).

notInner join
Figure 16-22. notInner join

As with the full join, I’ve retained the duplicate Category field for the notInner join in Figure 16-23 to highlight the difference in its output compared to the earlier join types.

Results of notInner join
Figure 16-23. Results of notInner join

When to Use Each Join Type

Just because a join returns the correct number of records, it doesn’t mean the join is correct. You must carefully think through what data you’ll be including and excluding when choosing a join condition and type of join.

Let’s consider a few situations that call for a specific join type:

inner
The inner join is the workhorse of join techniques. You use it as a way of filtering out missing records that would be returned as a null value in a left or right join, thereby protecting calculations that require only non-null values.
notInner
This initially seems to be a bizarre join type. Why join two data sets together that fundamentally won’t join due to not matching the specified join condition? The answer is data quality. By creating a notInner join, you can treat the data that does not meet the join condition as if it were going to be rejoined in the future. This is a good error-checking technique, as it helps you validate what has been returned by an inner join. Depending on the data set, you could use leftOnly or rightOnly joins in a similar way but return the data that doesn’t meet the join conditions from only one of the tables.
left
Left joins are almost as common inner joins. Because a left join returns every record from the left table, you can think of the right table as being appended to the relevant rows in the left table. However, when two or more rows from the right table match a single row from the left table, the left table row gets duplicated for each row from the right table.
full
Use a full join when you want to return all data fields from both tables but want to create a single row for data that meets the specified join condition(s). When joining disparate sources together where you want to return all the data—for example, when you are joining two customer data sets during a merger of two organizations—a full join is likely to be the right solution.

Summary

Joins can be a fantastic way to add further context to your analysis, as you can add data that didn’t reside in the original source. There are challenges with non-unique join conditions—a topic we will cover in future chapters—but hopefully this chapter has helped you better understand how and when to use different types of joins.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset