Chapter 42. Documenting Your Data Preparation

Jobs are no longer for life; these days, people look for variety and challenges in their careers. Data roles are no different, as the skills are highly transferable between companies and industries. Therefore, the data sets you prepare are likely to be passed along to others. To ensure the work continues to be understandable, up-to-date, and valuable, it needs to be well documented so you can walk other preppers through your logic once you move on or get promoted. In this chapter, we will cover the fundamentals of documenting your data preparation work, Prep Builder’s built-in functionality to aid basic documentation, and the considerations to keep in mind for each data preparation step in Prep Builder.

Basic Documentation

No matter what tool you use to prepare your data sets, there are a number of aspects you should consider.

Folder Structure

All documentation within the data preparation file is useless if you can’t find the flow file in the first place. Keeping a folder that is available to you and your team for all the flows will help tremendously. My organization uses Google Drive, as it has the benefit of not just controlled sharing but also being available on any computer I log in to. Setting up a standardized structure for those files is also key. In large organizations with more complex data flows, you should consider setting up the following folders:

In production
The holy grail of folders, this folder should be under strict control to avoid changes that break data sets your organization relies on. Flow files should be added to this folder only once they are fully tested.
Development flows
This is the work-in-progress folder, or “sandbox.”
Testing
Once you’ve developed a flow, this is where you lock the version being tested.
Archive
Having a history of key versions can help you track output changes.
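
If you want every project to start from the same layout, the four folders above can be created by a small script. This is only a sketch: the folder names are taken from the list above, while the function name and the idea of passing in a root folder are my own assumptions, not anything Prep Builder provides.

```python
from pathlib import Path

# Folder names taken from the list above; rename to suit your team.
FLOW_FOLDERS = ("In production", "Development flows", "Testing", "Archive")


def create_flow_folders(root: Path) -> list[Path]:
    """Create the standard flow folders under root and return their paths."""
    created = []
    for name in FLOW_FOLDERS:
        folder = root / name
        # exist_ok=True makes the script safe to run more than once.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling `create_flow_folders(Path("Prep Flows"))` would build the structure under a (hypothetical) shared folder called Prep Flows.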

Filenames

The naming convention you use for files is also key. If you can’t locate the correct document in a folder of flow files, there is no point in spending time documenting the work in the first place. As a consultant, I have probably seen every naming convention imaginable. There’s no single optimal solution; the key criterion is, Does everyone understand it? If they don’t, then there is no reason to have a naming convention, as your colleagues will soon start to break it or chase you down to find out where a certain file is. You have data to prepare; you don’t have time for that!
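
Once a convention is agreed, a short check can flag files that drift from it. The sketch below assumes a purely hypothetical convention of team, subject, and a two-digit version (for example, sales_weekly-orders_v03.tfl); swap in whatever pattern your team actually understands. The .tfl and .tflx extensions are the ones Prep Builder uses for flow files.

```python
import re

# Hypothetical convention: <team>_<subject>_v<two-digit version>.tfl or .tflx
FLOW_NAME_PATTERN = re.compile(r"[a-z]+_[a-z0-9-]+_v\d{2}\.tflx?")


def follows_convention(filename: str) -> bool:
    """Return True if the whole filename matches the agreed pattern."""
    return FLOW_NAME_PATTERN.fullmatch(filename) is not None


def find_nonconforming(filenames: list[str]) -> list[str]:
    """Return the filenames that break the convention, in their original order."""
    return [name for name in filenames if not follows_convention(name)]
```

Running `find_nonconforming` over the contents of a shared folder gives you a list of files to rename before your colleagues have to come and ask where something is.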

Data Sources

Within the file, the key piece of documentation that must be crystal clear is where that input data comes from (Figure 42-1). This may sound obvious, but those source files/tables are likely to move over time or change structure. Tableau tools can read only what is in the underlying source. As the source changes, so will the resulting data being used by the flow and, therefore, the output.

Figure 42-1. Documentation of the Input step

Recording the original data source, location, filename or table name, and the frequency of updates will help you and your colleagues understand what should, and should not, change when the flow is next run.
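
One way to keep those four details consistent from flow to flow is to capture them in a small template and paste the result into the Input step’s description. The sketch below is an assumption about how you might do that outside Prep Builder; the class name, field names, and output format are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputSourceNote:
    """The details worth recording for each Input step (names are illustrative)."""

    original_source: str   # system or team the data comes from
    location: str          # folder, server, or database
    name: str              # filename or table name
    update_frequency: str  # how often the source is refreshed

    def as_description(self) -> str:
        """Render the note as a single line of text."""
        return (
            f"Source: {self.original_source} | Location: {self.location} | "
            f"Name: {self.name} | Updated: {self.update_frequency}"
        )
```

For example, `InputSourceNote("CRM export", "shared-drive/inputs", "orders.csv", "weekly").as_description()` produces one line that covers all four points; the values here are made up for illustration.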

Output

Knowing which files it is safe to overwrite is critical in data preparation, as you may not be able to reverse the changes if you need to. The file location should be clear, and the output location shouldn’t be changed, unless the existing output file is also moved. Clear documentation of what the output is and where it is held will help with this (Figure 42-2).

Figure 42-2. Documentation of the Output step
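
Because an overwritten output can’t always be recovered, some teams copy the existing file into the Archive folder before the flow runs. The following is a minimal sketch of that idea, not a Prep Builder feature; the function name, the timestamp format, and the archive location are assumptions.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional


def back_up_output(output_file: Path, archive_dir: Path) -> Optional[Path]:
    """Copy an existing output file into archive_dir before it is overwritten.

    Returns the path of the backup, or None if there was nothing to back up.
    """
    if not output_file.exists():
        return None
    archive_dir.mkdir(parents=True, exist_ok=True)
    # A timestamp in the name keeps earlier backups from being replaced.
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    backup = archive_dir / f"{output_file.stem}_{stamp}{output_file.suffix}"
    shutil.copy2(output_file, backup)
    return backup
```

Run it against the output path recorded in the Output step’s documentation just before the flow is executed.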

There are a number of other stages in Prep Builder where documentation can make the difference between work being easy to access and fix if necessary and it being an absolute nightmare.

Step Names

There are currently only eight different types of steps that can be set up, but clearly documenting them will help with handing over and maintaining the work.

Clean Step

The Clean step is the Swiss Army knife of Prep Builder, as one step can comprise hundreds of different preparation techniques. This step can include calculations, filters, cleaning string values, splitting fields, and renaming or even deleting fields. The Clean step can also contain any combination of these actions. Renaming the step with a description of what you are doing will ensure that the users of the flow—and your future self—can follow along (Figure 42-3).

Figure 42-3. The Clean step icon

The Prep developers have provided icons that show some of the top-level actions happening within the Clean step. For example, the step shown in Figure 42-4 includes both a filter and a calculation, and a field has been removed.

Figure 42-4. Clean step with icons demonstrating different clean operations

These icons give the user a very good quick overview of what is happening or give you a reminder of what happened if you return to the data at a later time. The step names need to be very concise, as only a limited number of characters will show in the Flow pane.

Step Descriptions

Step descriptions, unlike step names, can be much longer—200 characters. They also have the significant benefit that they can be toggled between visible and invisible. In the union shown in Figure 42-5, clicking the gray quotation icon will hide or show the description in the Flow pane.

Figure 42-5. Documented Union step with description

The description allows you (the flow author) to add much more detail about what is happening at each step in the flow while preparing the data for analysis.
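
If you draft descriptions outside Prep Builder, for example in a handover document, it can help to check them against the 200-character limit mentioned above before pasting them in. This is a small sketch; the function name and return shape are my own.

```python
MAX_DESCRIPTION_LENGTH = 200  # the step description limit stated in the text


def check_description(text: str) -> tuple[bool, int]:
    """Return (fits, remaining) where remaining is negative if the text is too long."""
    remaining = MAX_DESCRIPTION_LENGTH - len(text)
    return remaining >= 0, remaining
```

A result of `(False, -5)` tells you the draft needs trimming by five characters.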

Color

One feature that was added very early on in the development of Tableau Prep was the ability to assign colors to steps to add visual documentation to your flow. This is useful both to the person building the flow and to someone picking up that flow for maintenance or further development. There are two key steps where color particularly makes a difference in Prep Builder.

Joins

When joining data sets together, or self-joining data together as in Figure 42-6, it’s helpful to use color to show there are two different data sets coming together. Not only is this very useful when you are picking up someone else’s flow, but it also helps with ensuring you have used the correct data fields from the incoming data sets.

Figure 42-6. A flow showing a well-documented join including color logic

The color logic I like to apply to joins mixes together the yellow and blue inputs to create a green output. This way, I can instantly see the two data fields that are being joined (Figure 42-7). This is useful especially on inner joins, where the fields are normally identical; it’s much easier to see where values have not been joined if you can see which input source has the mismatched fields.

Figure 42-7. Coloring the Join step assists with the setup

The green line running above the Profile pane in Figure 42-7 helps highlight to the flow developer the data fields that the join will be creating.

Unions

In a Union step, you can use color to demonstrate the two flows being stacked together. Unlike my approach to joins, my preference here is that the two flows are two shades of the same color rather than a mix, since the data structure is the same or similar (Figure 42-8).

Figure 42-8. A well-documented Union step within a flow

In the setup of the Union step, the colors also show, as with joins, which input flow each field has come from (Figure 42-9).

Figure 42-9. As with a Join step, the coloring of the flow assists with the Union step setup

In this example, the Time field has come from the Clean Times input, and the 24 Hour Time Format field has come from the Unclean Times input. This helps you consider the different field names and where you may want to go back “upstream” in your preparation flow to either amend the names or investigate why they differ. The absence of color indicates nulls, which occur where a field has no corresponding field in the other data set (Figure 42-10).

Figure 42-10. Absence of color represents nulls in a union

Summary

Although documentation might sound laborious and time-consuming, this isn’t the case in Prep Builder. Editing step names, adding short descriptions, or changing the color of the preparation steps can make development less error-prone and handover much easier.
