In this chapter, we’ll gain a deeper understanding of how to model and structure data with Tableau. We’ve seen the Data Source screen in previous chapters and briefly explored how to drag and drop tables to form relationships. Now, we’ll explore some of Tableau’s more complex features to gain a good understanding of how Tableau allows you to relate multiple tables together, either logically or physically.
We’ll start with a broad overview of Tableau’s new data model and then examine some details of different types of joins and blends. The data model and blending apply primarily to Tableau Desktop (and Server), but pay special attention to the discussion of joins, as a good understanding of join types will aid you greatly when we discuss Tableau preparation in Chapter 16, Taming Data with Tableau Prep.
The data model was introduced in Tableau 2020.2. If you are using an older version, the explanation of joins and blends will be directly applicable, while the explanation of the data model will serve as inspiration for you to upgrade!
In this chapter, we’ll cover the following topics:
We’ll start by understanding the sample dataset included in the workbook for this chapter. This is so that you have a good foundation of knowledge before working through the examples.
For this chapter, we’ll use a sample dataset of patient visits to the hospital. The data itself is contained in the Excel file Hospital Visits.xlsx
in the Learning TableauChapter 14
directory. In this example, each tab of the Excel file represent tables of data. The principles described here are directly applicable to relational database tables, multiple files, or tables in any data source. The relationship between those tables is illustrated here:
Figure 14.1: The four tabs of the Excel file illustrated as four tables with relationships
Excel does not explicitly define the relationships, but they are shown here as they might exist in a relational database using foreign key lookups. Here is a brief explanation of the tables and their relationships:
The tables relate to each other in different ways. Here are some details:
With a solid grasp of the sample data source, let’s turn our attention to how we might build a data model in Tableau.
Every data source you create in Tableau will use the data model. Any time you relate two or more sets of data, you’ll need to give some thought as to the various ways you might relate them in the data model. We’ll cover some of the basics of what a data model is and how to create it first.
We’ve briefly looked at the Data Source screen in Chapter 2, Connecting to Data in Tableau. Now, we’ll take a deeper look at the concepts behind the interface. Feel free to follow along with the following example in the Chapter 14 Starter.twb
workbook, or examine the end results in Chapter 14 Complete.twbx
.
We’ll start by creating a connection to the Hospital Visits.xlsx
file in the Chapter 14
directory. The Data Source screen will look like this upon first connecting to the file:
Figure 14.2: The Data Source screen lists the tabs in the Excel workbook and invites you to start a data model
We’ll build the data model by dragging and dropping tables onto the canvas. We’ll add all four tables. Tableau will suggest relationships for each new table added based on any matching field names and types. For our tables, we’ll accept the default settings because the ID
fields that indicate the correct relationship are identically named and typed.
The first table added is the root and forms the start of the data model.
You may change the root table by right-clicking a table that is not currently the root table and selecting the Swap with root option.
In this example, the order in which you add the tables won’t matter, though you may notice a slightly different display depending on which table you start with. In the following screenshot, we’ve started with Hospital Visit (which is the primary table and, therefore, makes sense to be the root table) and then added all of the other tables:
Figure 14.3: All tables have been added to the data model
Selecting a connector between tables (sometimes called a “noodle” in the data model) will allow you to edit the relationship in the bottom pane. In Figure 14.3, you’ll notice the relationship options for the connector between Hospital Visit and Patient. Tableau automatically created our relationships because the ID fields had the same name and type between both tables. If necessary, you could manually edit the relationships to change which fields define the relationship.
A relationship simply defines which fields connect the tables together. It does not define exactly how the tables relate to each other. We’ll discuss the concepts of join types (for example, left join or inner join) later in this chapter, but relationships are not restricted to a certain join type. Instead, Tableau will use the appropriate kind of join as well as the correct aggregations depending on which fields you use in your view. For the most part, you won’t have to think about what Tableau is doing behind the scenes, but we’ll examine some unique behaviors in the next section.
Also, notice the Performance Options drop-down menu in the relationship editor, as shown here:
Figure 14.4: The Edit Relationship dialog box includes options to improve performance
These performance options allow Tableau to generate more efficient queries if the nature of the relationship is known. If you do not know the exact nature of the relationship, it is best to leave the options in their default settings as an incorrect setting can lead to incorrect results.
There are two basic concepts covered by the performance options:
If Tableau is able to determine constraints from a relational database, those constraints will be used. Otherwise, Tableau will set the defaults to Many and Some records match. For the examples in this chapter, we do know the precise nature of the relationships (they are described in the previous section), but we’ll accept the performance defaults as the dataset is small enough that there won’t be any perceptible performance gain in modifying them.
With our initial data model created, let’s take a moment to explore the two layers of the data model paradigm.
A data model consists of two layers:
Consider the following screenshot of a canvas containing our four tables:
Figure 14.5: The logical layer of the data model
This initial canvas defines the logical tables of the data model. A logical table is a collection of data that defines a single structure or object that relates to other logical structures of data. Double-click on the Hospital Visit table on the canvas, and you’ll see another layer beneath the logical layer:
Figure 14.6: The physical layer of the physical tables that make up Hospital Visit
This is the physical layer for the logical Hospital Visit table. This physical layer is made up of physical tables of data—potentially unioned or joined together. In this case, we are informed that Hospital Visit is made of 1 table. So, in this case, the logical layer of Hospital Visit is identical to the physical layer underneath. In the Using joins section of this chapter, we’ll explore examples of how we might extend the complexity of the physical layer with multiple tables while still treating the collection of tables as a single object.
Go ahead and close the physical layer of Hospital Visit with the X icon in the upper-right corner. Then navigate to the Analysis tab of the workbook for this chapter, and we’ll explore how the data model works in practice.
For the most part, working with the data model will be relatively intuitive. There are a few data pane interface features that will help you work with the data model. There are also a few data model behaviors you should learn to expect. Once you are comfortable with them, your analysis will exceed expectations!
With multiple tables related in a data model, you’ll find the Data pane looks something like this:
Figure 14.7: The Data pane is organized by logical tables and shows a separation of dimensions and measures per table
You’ll notice that the Data pane is organized by logical tables, with fields belonging to each table. Measures and dimensions are separated by a thin line rather than appearing in different sections as they did previously.
This makes it easier to find the fields relevant to your analysis and also helps you to understand the expected behavior of the data model. Each logical table also has its own number of records field that is named using the convention Table Name (Count), which will appear at the end of the list of measures for that table.
With an overview of how the data pane helps you work with the data model, let’s look at some behaviors you can expect from the data model.
In the Analysis tab of the Starter
workbook, experiment with creating different visualizations. Especially note dimensions, what values are shown, and how measures are aggregated. We’ll walk through a few examples to illustrate (which you can replicate in the Starter
workbook or examine in the Complete
workbook).
First, notice that dragging Name from the Patient table to Rows reveals 10 patients. It turns out that not all of these patients have hospital visits, but when we use one or more dimensions from the same logical table, we see the full domain of values in Tableau. That is, we see all the patients, whether or not they had visited the hospital. We can verify how many visits each patient had by adding the Hospital Visit (Count) field, resulting in the following view:
Figure 14.8: All patients are shown, even those with 0 visits
But if we add Primary Diagnosis to the table, notice that only 6 out of the 10 patients are shown:
Figure 14.9: Only patients with visits are shown; most patients had a single visit with a given diagnosis, but one came in twice with the same diagnosis
This highlights another behavior: when you include dimensions from two or more tables, only matching values are shown. In essence, when you add Name and Primary Diagnosis, Tableau is showing you patients who exist in both the Patient and Hospital Visit tables. This is great if you want to focus on only patients who had visited the hospital.
But what if you truly want to see all patients and a diagnosis where applicable? To accomplish that, simply add a measure from the table of the field where you want to see the full domain. In this case, we could add either the Age at Most Recent Admit or Patient (Count) measures, as both come from the Patient table. Doing so results in the following view:
Figure 14.10: All patients are once again shown
Even though the Age at Most Recent Admit value is NULL
for patients who have never been admitted, simply adding the measure to the view instructs Tableau to show all patients. This demonstrates a third behavior: including a measure from the same table as a dimension will force Tableau to show the full domain of values for that dimension.
Another basic principle of data model behavior is also displayed here. Notice that Age at Most Recent Admit is shown for each patient and each diagnosis. However, Tableau does not incorrectly duplicate the value in totals or subtotals.
If you were to add subtotals for each patient in the Age at Most Recent Admit and Count of Hospital Visit columns, as has been done in the following view, you’ll see that Tableau has the correct values:
Figure 14.11: Tableau calculates the subtotals correctly, even though traditional join behavior would have duplicated the values
This final behavior of the data model can be stated as: aggregates are calculated at the level of detail defined by the logical table of the measure. This is similar to how you might use a Level of Detail (LOD) expression to avoid a LOD duplication, but you didn’t have to write the expression or break your flow of thought to solve the problem. The Tableau data model did the hard work for you!
Take some additional time to build out views and visualizations with the data model you’ve created, and review the following behaviors so you know what to expect and how to control the analysis you want to perform:
With just a bit of practice, you’ll find that the behaviors feel natural, and you’ll especially appreciate Tableau performing aggregations at the correct level of detail.
When you first create a new data model, it is helpful to run through a couple of quick checks similar to the examples in this section. That will help you gain familiarity with the data model, as well as helping you validate that the relationships are working as you expect.
We’ll now turn our focus to learn how to relate data in the physical layer using joins.
A join at the physical level is a row-by-row matching of the data between tables. We’ll look at some different types of joins and then consider how to leverage them in the physical layer of a data model.
In the physical layer, you may specify the following types of joins:
Figure 14.12: Inner join
NULL
values for all fields from the table on the right.
In the following example, the five rows from the left table are kept, with NULL
results for any values in the right table that were not matched:
Figure 14.13: Left join
NULL
values for all fields from the table on the left. Not every data source supports a right join. If it is not supported, the option will be disabled. In the following example, the five rows from the right table are kept, with NULL
results for any values from the left table that were not matched:Figure 14.14: Right join
NULL
values where either the left or the right matching record was not found. Not every data source supports a full outer join.
If it is not supported, the option will be disabled. In the following example, all rows are kept from both sides with NULL
values where matches were not found:
Figure 14.15: Full Outer join
Figure 14.16: Spatial join
When you select spatial objects from the left and right tables, you’ll need to specify Intersects as the operator between the fields to accomplish a spatial join, as shown in Figure 14.17:
Figure 14.17: Assuming the two fields selected represent spatial objects, the Intersects option will be available
With a solid understanding of join types, let’s consider how to use them in the physical layer of Tableau’s data model.
Most databases have multiple tables of data that are related in some way. Additionally, you are able to join together tables of data across various data connections for many different data sources.
For our examples here, let’s once again consider the tables in the hospital database, with a bit of simplification:
Figure 14.18: The primary Hospital Visit table with Patient and Discharge Details as they might exist in a relational database
Let’s consider how we might build a data source using some joins in the physical layer. To follow along, create a new Excel data source in the Chapter 14 Starter.twbx
workbook that references the Hospital Visits (Joins).xlsx
file in the Chapter 14
directory. You may also examine the connection in the Chapter 14 Complete.twbx
workbook.
Just as we did before, we’ll start by dragging the Hospital Visit table onto the data source canvas such that we have a Hospital Visit object in the logical layer, like this:
Figure 14.19: After dragging the table onto the canvas, the Hospital Visit object is created in the logical layer
At this point, the logical layer object simply contains a single physical table. But we’ll extend that next. Double-click on the Hospital Visit object to expand the physical layer. It will look like this:
Figure 14.20: The physical layer, which currently consists of a single physical table
You can extend the physical model by adding additional tables. We’ll do that here, by adding Discharge Detail and Patient. As we add them, Tableau will prompt you with a dialog box to adjust the details of the join.
It will look like this:
Figure 14.21: Joining Discharge Detail to Hospital Visit in the physical layer
The Join dialog allows you to specify the join type (Inner, Left, Right, or Full Outer) and to specify one or more fields on which to join. Between the fields, you may select which kind of operator joins the fields. The default is equality (=
; the fields must be equal), but you may also select inequality (<>
; the fields must not be equal), less than (<
), less than or equal to (<=
), greater than (>
), or greater than or equal to (>=
). The type of join and the field relationships that define the join will determine how many records are returned from the join. We’ll take a look at the details in the next section.
Typically, you’ll want to start by dragging the primary table onto the physical layer canvas. In this case, Hospital Visit contains keys to join additional tables. Additional tables should be dragged and dropped after the primary table.
For now, accept the fields that Tableau automatically detects as shared between the tables (Discharge Details ID for Discharge Details and Patient ID for Patient). Change the join to Discharge Details to a left join. This means that all hospital visits will be included, even if there has not yet been a discharge. Leave Patient as an inner join. This will return only records that are shared between the tables so that only patients with visits will be retained.
Ultimately, the physical layer for Hospital Visit will look like this:
Figure 14.22: The physical layer is made up of three tables joined together
When you close the physical layer, you’ll once again see the logical layer, which contains a single object: Hospital Visit. That object now contains a join icon, indicating that it is made up of joined physical tables. But it remains a single object in the logical layer of the data model and looks like this:
Figure 14.23: The logical layer contains a single object that is made up of three physical tables
All the joins create what you might think of as one flat table, which can be related together with other objects in the data model. Those objects, in turn, might each be made up of a single physical table or multiple physical tables joined together.
If you are following along with the example, rename this data source Hospital Visits (Joins). We’ll leverage this data source for one more example at the end of this chapter. In the meantime, let’s consider a few additional details related to joins.
We conclude this section with some further possibilities to leverage joins, as well as a caution regarding a potential problem that can arise from their use.
In the previous example, we noted that Tableau joins rows in one table to rows in another based on fields in the data. You may come across cases where you need to join based on values that are not present in the data but can be derived from the existing data. For example, imagine that there is a Patient Profile table that would add significant value to your dataset. However, it lacks a Patient ID and only has First Name and Last Name fields.
To join this to our Patient table, we can use a join calculation. This is a calculation that exists only for the purpose of joining tables together. To create a join calculation, use the drop-down list of fields in the Join dialog box and select the final option, Create Join Calculation...:
Figure 14.24: You can create a join calculation to aid in forming the correct joins
Selecting this option allows you to write row-level calculations that can be used in the join. For example, our join calculation might have code like [First Name] + " " + [Last Name]
to return values that match with the Name field.
Try to avoid joining on text fields, especially in larger datasets, for performance reasons. Joining on integers is far more efficient. Also, it is entirely possible for two separate people to share first and last names, so a real-world dataset that followed the structure in this example would be subject to false matches and errors.
You may also leverage the geospatial functions mentioned in Chapter 12, Exploring Mapping and Advanced Geospatial Features, to create a spatial join between two sources, even when one or both lack specific spatial objects on which to join. For example, if you have Latitude
and Longitude
, you might create a join calculation with the code MAKEPOINT([Latitude], [Longitude])
to find the intersection with another spatial object in another table.
Join calculations can also help when you are missing a field for a join. What if the data you want to join is in another database or file completely? In this scenario, we would consider cross-database joins.
With Tableau, you have the ability to join (at the row level) across multiple different data connections. Joining across different data connections is referred to as a cross-database join. For example, you can join SQL Server tables with text files or Excel files, or join tables in one database with tables in another, even if they are on a different server. This opens up all kinds of possibilities for supplementing your data or analyzing data from disparate sources.
Consider the hospital data. Though not part of the data included in the Chapter 14
sample data, it would not be uncommon for billing data to be in a separate system from patient care data. Let’s say you had a file for patient billing that contained data you wanted to include in your analysis of hospital visits. You would be able to accomplish this by adding the text file as a data connection and then joining it to the existing tables, as follows:
Figure 14.25: Joining tables or files based on separate data connections
You’ll notice that the interface on the Data Source screen includes an Add link that allows you to add data connections to a data source. Clicking on each connection will allow you to drag and drop tables from that connection into the Data Source designer and specify the joins as you desire. Each data connection will be color-coded so that you can immediately identify the source of various tables in the designer.
You may also use multiple data sources in the logical layer.
Another consideration with joins is unintentional errors, which we’ll consider next.
Finally, we conclude with a warning about joins—if you are not careful, you could potentially end up with a few extra rows or many times the number of records than you were expecting. Let’s consider a theoretical example.
Let’s say you have a Visit
table like this:
Visit ID |
Patient Name |
Doctor ID |
1 |
Kirk |
1 |
2 |
Picard |
2 |
3 |
Sisko |
3 |
And a Doctor
table like this:
Doctor ID |
Doctor Name |
1 |
McCoy |
2 |
Crusher |
3 |
Bashir |
2 |
Pulaski |
Notice that the value 2
for Doctor ID
occurs twice in the Doctor
table. Joining the table on equality between the Doctor ID
value will result in duplicate records, regardless of which join type is used. Such a join would result in the following dataset:
Visit ID |
Patient Name |
Doctor ID |
Doctor Name |
1 |
Kirk |
1 |
McCoy |
2 |
Picard |
2 |
Crusher |
3 |
Sisko |
3 |
Bashir |
2 |
Picard |
2 |
Pulaski |
This will greatly impact your analysis. For example, if you were counting the number of rows to determine how many patient visits had occurred, you’d overcount. There are times when you may want to intentionally create duplicate records to aid in analysis; however, often, this will appear as an unintentional error.
In addition to the danger of unintentionally duplicating data and ending up with extra rows, there’s also the possibility of losing rows where values you expected to match didn’t match exactly. Get into the habit of verifying the row count of any data sources where you use joins.
A solid understanding of joins will not only help you as you leverage Tableau Desktop and Tableau Server, but it will also give you a solid foundation when we look at Tableau Prep in Chapter 16, Taming Data with Tableau Prep. For now, let’s wrap up this chapter with a brief look at blends.
Data blending allows you to use data from multiple data sources in the same view. Often, these sources may be of different types. For example, you can blend data from Oracle with data from Excel. You can blend Google Analytics data with a spatial file. Data blending also allows you to compare data at different levels of detail. Let’s consider the basics and a simple example.
Data blending is done at an aggregate level and involves different queries sent to each data source, unlike joining, which is done at the row level and (conceptually) involves a single query to a single data source. A simple data blending process involves several steps, as shown in the following diagram:
Figure 14.26: How Tableau accomplishes blending
We can see the following from the preceding diagram:
It is important to note that data blending is different from joining. Joins are accomplished in a single query and results are matched row by row. Data blending occurs by issuing two separate queries and then blending together the aggregate results.
There can only be one primary source, but there can be as many secondary sources as you desire. Steps 3 and 4 are repeated for each secondary source. When all aggregated results have been returned, Tableau matches the aggregated rows based on linking fields.
When you have more than one data source in a Tableau workbook, whichever source you use first in a view becomes the primary source for that view.
Blending is view-specific. You can have one data source as the primary in one view and the same data source as the secondary in another. Any data source can be used in a blend, but OLAP cubes, such as in SQL Server Analysis Services, must be used as the primary source.
In many ways, blending is similar to creating a data model with two or more objects. In many cases, the data model will give you exactly what you need without using blending. However, you have a lot more flexibility with blending because you can change which fields are related at a view level rather than at an object level.
Linking fields are dimensions that are used to match data blended between primary and secondary data sources. Linking fields define the level of detail for the secondary source. Linking fields are automatically assigned if fields match by name and type between data sources.
Otherwise, you can manually assign relationships between fields by selecting, from the menu, Data | Edit Blend Relationships, as shown in Figure 14.27:
Figure 14.27: Defining blending relationships between data sources
The Relationships window will display the relationships recognized between different data sources. You can switch from Automatic to Custom to define your own linking fields.
Linking fields can be activated or deactivated to blend in a view. Linking fields used in the view will usually be active by default, while other fields will not. You can, however, change whether a linking field is active or not by clicking on the link icon next to a linking field in the data pane.
Additionally, use the Edit Data Relationships screen to define the fields that will be used for cross-data source filters. When you use the drop-down menu of a field on Filters in a view, and select Apply to Worksheets | All Using Related Data Sources, the filter works across data sources.
Let’s take this from the conceptual to the practical with an example.
Let’s look at a quick example of blending in action. Let’s say you have the following table representing the service goals of various locations throughout the hospital when it comes to serving patients:
Location |
Avg. Minutes to Service Goal |
Inpatient Surgery |
30 |
Outpatient Surgery |
40 |
ICU |
30 |
OBGYN |
25 |
Lab |
120 |
This data is contained in a simple text file, named Location Goals.txt
, in the Chapter 14
directory. Both the starter and complete workbooks already contain a data source defined for the file.
We’ll start by creating a simple bar chart from the Hospital Visit (Joins) data source you created previously, showing the Average Minutes to Service by Location like so:
Figure 14.28: Average Minutes to Service by Location
Then, in the Data pane, we’ll select the Location Goals data source. Observe the Data pane shown here:
Figure 14.29: Hospital Visit (Joins) is shown as the Primary data source and Location in the Location Goals data source is indicated as a linking field
The blue checkmark on the Hospital Visit (Joins) data source (numbered 1 in Figure 14.29) indicates that the data source is primary. Tableau recognizes Location as a linking field and indicates that it is active with a connected link icon (numbered 2 in Figure 14.29). It is active because you have used Location from the primary data source in the current view. If you had not, Tableau would still show the link, but it would not be active by default. You may click on the link icon to switch from active to inactive or vice versa to control the level of detail at which aggregations are done in the secondary source.
For now, click on Avg. Minutes to Service Goal in the data pane and select Bullet Graph from Show Me, as indicated here:
Figure 14.30: You may drag and drop fields from secondary sources into the view or use Show Me
You may have to right-click on the Avg. Minutes to Service axis in the view and select the Swap Reference Line fields to ensure the goal is the reference line and the bar is the actual metric. Your view should now look like this:
Figure 14.31: A view created from a primary source and a secondary source
Notice that both the Hospital Visits (Joins) data source and the Location Goals data source are used in this view. Hospital Visit (Joins) is the primary data source (indicated by a blue checkmark), while Location Goals is the secondary source (indicated by the orange checkmark). The Avg. Minutes to Service Goal field on Detail in the Marks card is secondary and also indicated by an icon with an orange checkmark.
You may also notice that Main Hospital and Intensive Care do not have goals indicated in the view. Recall that the primary data source is used to determine the full list of values shown in the view. Main Hospital is in the primary source but does not have a match in the secondary source. It is shown in the view, but it does not have a secondary source value.
Intensive Care also does not have a secondary value. This is because the corresponding value in the secondary source is ICU. Values must match exactly between the primary and secondary sources for a blend to find matches. However, blends do also take into account aliases.
An alias is an alternate value for a dimension value that will be used for display and data blending. Aliases for dimensions can be changed by right-clicking on row headers or using the menu on the field in the view or the data pane and selecting the Aliases option.
We can change the alias of a field by right-clicking on the row header in the view and using the Edit Alias… option, as shown here:
Figure 14.32: Using the Edit Alias... option
If we change the alias to ICU, a match is found in the secondary source and our view reflects the secondary value:
Figure 14.33: ICU now finds a match in the secondary source
A final value for Location, Lab, only occurs in the Location Goals.txt
source and is, therefore, not shown in this view. If we were to create a new view and use Location Goals as the primary source, it would show.
We’ve covered quite a few options regarding how to relate data in this chapter. Let’s just take a moment to consider when to use these different techniques.
In one sense, every data source you create using the latest versions of Tableau will use a data model. Even data sources using one physical table will have a corresponding object in the logical layer of a data model. But when should you relate tables using the data model, when should you join them together in the physical layer, and when should you employ blending?
Most of the time, there’s no single right or wrong answer. However, here are some general guidelines to help you think through when it’s appropriate to use a given approach.
In general, use a data model to relate tables:
In general, use joins at the physical level:
In general, use blending when:
As you grow in confidence while using each of these approaches, you’ll be able to better determine which makes sense in a given circumstance.
You now have several techniques to turn to when you need to relate tables of data together. The data model gives a paradigm for relating logical tables of data together. It introduces a few behaviors when it comes to showing the full and partial domains of dimensional values, but it also greatly simplifies aggregations by taking into account the natural level of detail for the aggregation. In the physical layer, you have the option of joining together physical tables.
We covered the various types of joins and discussed possibilities for using join calculations and cross-database joins for ultimate flexibility. We briefly discussed how data blending works and saw a practical example. Finally, you examined a broad outline of when to turn to each approach. You now have a broad toolset to tackle data in different tables or even in different databases or files.
We’ll expand that toolset quite a bit more in the next chapter as we look at Tableau Prep Builder. Tableau Prep gives you incredible power and sophistication, allowing you to bring together data from various sources, clean it, and structure it in any way you like!
Join our community’s Discord space for discussions with the author and other readers: https://packt.link/ips2H