Chapter 43. Deciding Where to Prepare Your Data

Like many software providers in the data landscape, Tableau doesn’t just have one tool where you do every task. Frankly, such a tool would be either crazily complex to work with or incomplete in terms of functionality the user needs. Having multiple tools available within the Tableau platform does pose another question, though: Where should you complete certain processes?

Processes to Consider

Data preparation comes down to a number of key steps:

  1. Inputting data

  2. Joining/unioning multiple data sets

  3. Pivoting

  4. Cleaning

  5. Aggregating

  6. Outputting data

Any of these steps could, in some situation, reasonably be done in either tool, but in practice most of them belong in the data preparation tool. Joins, unions, and pivots are common tasks at the data prep stage, and end users of the data set should be spared that complexity in Desktop. While flexibility might be required in some cases (specific visualization styles, for example, demand a different data structure), the majority of data sets have a relatively standard setup for analysis.

This leaves cleaning (including calculations) and aggregations as two processes that may fit in either the data preparation tool or the visualization tool. With small data sets and simple calculations, which approach is “best” is more ambiguous. However, as the size of the data set or the complexity of the calculations grows, your decision here might determine how successfully your organization can utilize the data.

Data Preparation Versus Visual Analytics

Balancing agility and functionality is a key consideration when you are evaluating which tool to use to complete each task. If you sacrifice agility by separately preparing data in a tool like Prep, you eliminate the option for each user to do this individually. Removing that flexibility might actually be useful, however, as it will potentially prevent mistakes, optimize performance, and enable tasks that would be impossible otherwise.

Considering what tasks should be completed in each tool can help you allocate the work of data preparation. To determine how work should be distributed, you will need to evaluate your organization’s sophistication on a number of factors.

Data Literacy

Data literacy—or how well one understands data products like graphs, results, and the like—is a key determinant of where you will conduct the cleaning and aggregation. Making data easy to work with is important, but ensuring the accuracy of the answers derived from your data sets is even more important. If your peers do not have the data literacy to take on these tasks, you’ll need to complete this work before making the data available to them. This means that you will need to understand and/or anticipate what questions those users want to answer and prepare your data set accordingly.

Organization Size

Having a team that is competent and able to complete the tasks is a major asset, but you’ll also need to evaluate the volume of work required to repeat those tasks multiple times across the organization. If you are asking one person to complete a task once, it doesn’t matter much where that task is completed. If that same task would need to be repeated hundreds or thousands of times by multiple individuals across an organization, however, then this task should be performed in the data preparation tool to reduce the amount of duplicated effort. Data preparation tools are designed to automate such tasks once they are set up.

Quality of Technological Hardware

The hardware on which the tasks are processed has a strong impact on the time it takes to complete them. Companies across the world pay people high salaries but then equip them with older laptops or underpowered computers. This situation can hinder people from being able to work with data, and the problem only gets worse with increasing volumes of data. If the data sets for analysis are small, then any basic data preparation task may still perform fine on the individual’s computer. If the data sets are large, though, a data preparation tool might be a better option. Data prep tools can often work with just a sample data set (as Tableau Prep does automatically for large data sets) and process the full data set only when required. This full processing likely takes place on a server (which has more processing power) once the full end-to-end data flow and logic is established.

History of Data Investment

If organizations have historically invested in data solutions and continue to do so as technology advances, the likelihood is that their databases will contain clean, ready-to-use information. When conducting the analysis, you can add any necessary fields to the database for future use. If this isn’t the case, then it’s likely you’ll be wrestling with messier data from multiple sporadic sources. There’s no clear answer as to where you should do your data preparation; you will likely need to switch between the data visualization tool (to find out which data is useful) and the data preparation tool (to set up more strategic data sources for future use).

All of these contextual factors will help guide you to a decision, but it’s only when you begin the actual work that you’ll discover where each task is best completed.

Software Performance

As you’ll see in this section, Prep is specifically designed to optimize the process of building the data preparation flow and then executing it.

Sampling

When you import a data set in Prep during an Input step, the software runs a sampling algorithm that shows you a suitable profile of the data without having to process all the rows of the full data set (Figure 43-1).

Figure 43-1. The default sample in Prep for each Input step

The sample is designed to reveal which steps your data preparation will need to include. Tableau Desktop also shows a small sample of data, but that sample is based only on a fixed number of rows. For many data sources, this will be just the first 1,000 or 10,000 rows of data in the table you’re importing. Therefore, if there are issues in the last rows of a table, you might not see them until much later in the analytical process.
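To make the risk of a first-rows-only sample concrete, here is a minimal Python sketch. The table, its column names, and the evenly spaced sampling rule are all made up for illustration; this is not the algorithm Prep or Desktop actually uses, only a demonstration of why a sample drawn from across the table can surface problems that the first rows hide.

```python
# Hypothetical table: 10,000 rows where only the last 500 have a missing amount.
rows = [{"id": i, "amount": None if i >= 9500 else 100.0} for i in range(10_000)]


def count_missing(sample):
    """Count rows in the sample whose amount is missing."""
    return sum(1 for row in sample if row["amount"] is None)


# A "first N rows" sample, similar in spirit to previewing the top of a table.
head_sample = rows[:1000]

# An evenly spaced sample of the same size (an assumption for illustration,
# not Prep's real sampling method).
spread_sample = rows[::10]

print("Missing values seen in first 1,000 rows:", count_missing(head_sample))
print("Missing values seen in evenly spaced sample:", count_missing(spread_sample))
```

Both samples contain 1,000 rows, but only the one drawn from across the whole table exposes the missing values that sit at the end.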

Functionality

Data preparation functionality was initially built into Tableau Desktop in the Data Connection window (Figure 43-2), but the need for additional features coupled with a desire to keep this screen uncluttered led to Prep.

Figure 43-2. Data preparation options in Desktop

While basic tasks can still be completed in Desktop, it has limitations that Prep does not. Planning the required data preparation steps will often eliminate Desktop as an option for completing them. Multiple pivots, unioning data sets from different sources, and preaggregating tables before joining are just a few of the tasks that you will need to complete in Tableau Prep rather than the visualization tool.
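Preaggregating before joining matters because joining tables at different levels of detail duplicates values. The following Python sketch illustrates the idea with made-up data (order lines per customer and one sales target per customer); the names and figures are hypothetical, and the code only mimics the logic of an aggregate step followed by a join, not how Prep implements it.

```python
from collections import defaultdict

# Hypothetical inputs: several order lines per customer, one target per customer.
orders = [("A", 10), ("A", 20), ("B", 5)]
targets = {"A": 100, "B": 50}

# Joining at order-line level repeats each customer's target on every line,
# so summing the target column overstates it.
naive_join = [(customer, sales, targets[customer]) for customer, sales in orders]
naive_target_total = sum(target for _, _, target in naive_join)

# Aggregating orders to one row per customer first keeps the join one-to-one.
sales_by_customer = defaultdict(int)
for customer, sales in orders:
    sales_by_customer[customer] += sales

clean_join = [
    (customer, sales, targets[customer])
    for customer, sales in sorted(sales_by_customer.items())
]
clean_target_total = sum(target for _, _, target in clean_join)

print("Target total after naive join:", naive_target_total)
print("Target total after preaggregating:", clean_target_total)
```

The naive join counts customer A's target twice, while the preaggregated version returns the true total of the two targets.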

Documentation

Being able to apply a solution not just to the problem at hand but also to future scenarios is a significant reason to do your data preparation in a dedicated tool such as Prep. Documenting your process, from naming the logical steps to describing what happens within them, makes it much more robust and maintainable (Figure 43-3).

Figure 43-3. Step names and descriptions in Prep

To rename a step, double-click its current name. Once you have entered the new name, you will also be given the option to add a description. You can show or hide the description by clicking the speech bubble icon that appears after you’ve written it. This way, you can hide it to keep your screen cleaner, but it’s still available for new users of the flow or for when you need to revise certain steps.

Summary

There is no single answer as to where to prepare your data, but simply considering this question will improve everyone’s ability to use the data well. Your individual situation and how your end users access the data set will determine a lot about where you should process your data. The computing power, frequency of data updates, and the size of the data set are all major factors that will inform which approach you take.
