Jobs are no longer for life; these days, people look for variety and challenges in their careers. Data roles are no different, as the skills are highly transferable between companies and industries. Therefore, the data sets you prepare are likely to be passed along to others. To ensure the work continues to be understandable, up-to-date, and valuable, it needs to be well documented so that other preppers can follow your logic once you move on or get promoted. In this chapter, we will cover the fundamentals of documenting your data preparation work, Prep Builder’s built-in functionality to aid basic documentation, and the considerations to keep in mind for each data preparation step in Prep Builder.
No matter what tool you use to prepare your data sets, there are a number of aspects you should consider.
All documentation within the data preparation file is useless if you can’t find the flow file in the first place. Keeping a folder that is available to you and your team for all the flows will help tremendously. My organization uses Google Drive, as it has the benefit of not just controlled sharing but also being available on any computer I log in to. Setting up a standardized structure for those files is also key. In large organizations with more complex data flows, you should consider setting up the following folders:
The naming convention you use for files is also key. If you can’t locate the correct document in a folder of flow files, there is no point in spending time documenting the work in the first place. As a consultant, I have probably seen every naming convention imaginable. There’s no single optimal solution; the key criterion is, Does everyone understand it? If they don’t, then there is no reason to have a naming convention, as your colleagues will soon start to break it or chase you down to find out where a certain file is. You have data to prepare; you don’t have time for that!
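Once a team has agreed on a convention, a short script can flag the files that drift from it. The following is a minimal sketch only: the pattern (`<team>_<subject>_<YYYYMMDD>` followed by the `.tfl` or `.tflx` flow extension) and the function names are hypothetical, chosen for illustration rather than recommended as the right convention.

```python
import re

# Hypothetical convention, for illustration only:
#   <team>_<subject>_<YYYYMMDD>.tfl  (or .tflx for a packaged flow)
# Lowercase letters, digits, and hyphens are allowed inside the subject.
FLOW_NAME = re.compile(r"[a-z0-9]+_[a-z0-9-]+_\d{8}\.tflx?")


def follows_convention(filename: str) -> bool:
    """Return True if the filename matches the assumed convention."""
    return FLOW_NAME.fullmatch(filename) is not None


def nonconforming(filenames):
    """Return the filenames that break the assumed convention."""
    return [name for name in filenames if not follows_convention(name)]


if __name__ == "__main__":
    files = ["sales_orders_20240131.tfl", "Final flow v2.tfl"]
    print(nonconforming(files))
```

Whatever pattern you settle on, keeping it simple enough to express in one line like this is a reasonable test of whether everyone will understand it.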
Within the file, the key piece of documentation that must be crystal clear is where the input data comes from (Figure 42-1). This may sound obvious, but those source files/tables are likely to move over time or change structure. Tableau tools can read only what is in the underlying source. As the source changes, so will the resulting data being used by the flow and, therefore, the output.
Recording the original data source, location, filename or table name, and the frequency of updates will help you and your colleagues understand what should, and should not, change when the flow is next run.
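If you keep this information outside the flow as well, a small structured record makes it easy to store alongside the flow file. The sketch below is one possible shape, not a Prep Builder feature: the class name, field names, and all example values are assumptions made up for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class InputSourceRecord:
    """The four details the text suggests recording for each input."""
    source: str            # original data source (system, team, vendor)
    location: str          # where the file or table lives
    name: str              # filename or table name
    update_frequency: str  # how often the source is refreshed


def to_json(record: InputSourceRecord) -> str:
    """Serialize the record so it can be saved next to the flow file."""
    return json.dumps(asdict(record), indent=2)


# Hypothetical example values, for illustration only.
record = InputSourceRecord(
    source="Example sales database",
    location="shared-drive/flows/inputs",
    name="orders.csv",
    update_frequency="weekly",
)

if __name__ == "__main__":
    print(to_json(record))
```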
Knowing which files it is safe to overwrite is critical in data preparation, as you may not be able to reverse the changes if you need to. The file location should be clear, and the output location shouldn’t be changed, unless the existing output file is also moved. Clear documentation of what the output is and where it is held will help with this (Figure 42-2).
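Where an output cannot easily be regenerated, one precaution is to copy the existing file before the flow runs. The following is a hedged sketch of that idea using only the Python standard library; it runs outside Prep Builder, and the function name and timestamp format are assumptions for illustration.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional


def backup_before_overwrite(output_path: Path) -> Optional[Path]:
    """Copy an existing output file to a timestamped sibling file.

    Returns the path of the copy, or None if there was nothing to back up.
    """
    if not output_path.exists():
        return None
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    # e.g. output.csv -> output_20240131-093000.csv in the same folder
    backup = output_path.with_name(
        f"{output_path.stem}_{stamp}{output_path.suffix}"
    )
    shutil.copy2(output_path, backup)  # copy2 also preserves file metadata
    return backup
```

Keeping the copy in the same folder as the output means the documented output location stays accurate for both files.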
There are a number of other stages in Prep Builder where documentation can make the difference between work being easy to access and fix if necessary and it being an absolute nightmare.
The Clean step is the Swiss Army knife of Prep Builder, as one step can comprise hundreds of different preparation techniques. This step can include calculations, filters, cleaning string values, splitting fields, and renaming or even deleting fields. The Clean step can also contain any combination of these actions. Renaming the step with a description of what you are doing will ensure that the users of the flow—and your future self—can follow along (Figure 42-3).
The Prep developers have provided icons that show some of the top-level actions happening within the Clean step. For example, the step shown in Figure 42-4 includes both a filter and a calculation, and a field has been removed.
These icons give a quick overview of what is happening in the step, or a reminder of what happened if you return to the flow at a later time. The step names need to be very concise, as only a limited number of characters will show in the Flow pane.
Step descriptions, unlike step names, can be much longer—200 characters. They also have the significant benefit that they can be toggled between visible and invisible. In the union shown in Figure 42-5, clicking the gray quotation icon will hide or show the description in the Flow pane.
The description allows you (the flow author) to add much more detail about what is happening at each step in the flow while preparing the data for analysis.
One feature that was added very early on in the development of Tableau Prep was the ability to assign colors to steps to add visual documentation to your flow. This is useful both to the person building the flow and to someone picking up that flow for maintenance or further development. There are two key steps where color particularly makes a difference in Prep Builder.
When joining data sets together, or self-joining data together as in Figure 42-6, it’s helpful to use color to show there are two different data sets coming together. Not only is this very useful when you are picking up someone else’s flow, but it also helps with ensuring you have used the correct data fields from the incoming data sets.
The color logic I like to apply to joins mixes together the yellow and blue inputs to create a green output. This way, I can instantly see the two data fields that are being joined (Figure 42-7). This is especially useful on inner joins, where the fields are normally identical; it’s much easier to see where values have not been joined if you can see which input source has the mismatched fields.
The green line running above the Profile pane in Figure 42-7 helps highlight to the flow developer the data fields that the join will be creating.
In a Union step, you can use color to demonstrate the two flows being stacked together. Unlike my approach to joins, my preference here is that the two flows are two shades of the same color rather than a mix, since the data structure is the same or similar (Figure 42-8).
In the setup of the Union step, the input flows’ colors also represent (as with joins) where they have come from (Figure 42-9).
In this example, the Time field has come from the Clean Times input, and the 24 Hour Time Format field has come from the Unclean Times input. This helps you consider the different field names and where you may want to go back “upstream” in your preparation flow to either amend the names or investigate why they differ. The absence of color indicates nulls, which occur where a field has no corresponding field in the other data set (Figure 42-10).
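The same behavior can be sketched outside Prep Builder to show why mismatched field names produce nulls. This is a simplified illustration in plain Python, not how Prep Builder implements unions: the function name is hypothetical and the time values are made up, while the input and field names follow the example above.

```python
def union_rows(inputs):
    """Stack rows from several inputs, tracking where each field came from.

    inputs maps an input name to a list of row dictionaries.
    Returns (fields, stacked_rows, origins).
    """
    fields = []   # every field seen, in first-seen order
    origins = {}  # field name -> list of inputs that supply it
    for name, rows in inputs.items():
        for row in rows:
            for field in row:
                if field not in origins:
                    fields.append(field)
                    origins[field] = []
                if name not in origins[field]:
                    origins[field].append(name)
    # A field missing from a row's own input is filled with None (a null).
    stacked = [
        {field: row.get(field) for field in fields}
        for rows in inputs.values()
        for row in rows
    ]
    return fields, stacked, origins


# Made-up values; input and field names follow the example in the text.
inputs = {
    "Clean Times": [{"Time": "09:30"}],
    "Unclean Times": [{"24 Hour Time Format": "21:45"}],
}
fields, stacked, origins = union_rows(inputs)

if __name__ == "__main__":
    for field in fields:
        print(field, "comes from", origins[field])
    for row in stacked:
        print(row)
```

Each field is supplied by only one input, so every stacked row has a null in the other field. Renaming one field upstream so the names match would let the two columns merge into one.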