One of the first steps you will take in any data preparation, visualization, or analytics project is to input data. In this chapter, you will learn how and why you should connect data files to Prep Builder, where to find those files, and what challenges data files may pose for preparing data. Data can come in many different forms, but the most common input for Prep Builder is Excel, CSV, or plain-text files (Figure 5-1).
For many of us, our first experiences of working with data involved a piece of ubiquitous software: Microsoft Excel. The spreadsheet has become the jack-of-all-trades for data storage, manipulation, and basic reporting. Every organization across the world has tens, if not hundreds, of spreadsheets saved on every computer. All of them contain useful information, raw data, or reference tables that can help answer the questions organizations are posing.
However, spreadsheets raise some potential issues (some of which are not unique to spreadsheets) when it comes to data preparation and analysis:
Organizations are complex. Mergers, acquisitions, and splits can create a complex history and network to unpack, and the data often reflects this. Trying to maintain the integrity of the data and your understanding of it is a challenge in itself.
Data files come in many formats besides Excel spreadsheets—CSV, text, statistical, and PDF files are all likely to house data on your computer that could help you answer the questions you have. All these file types are created and used for multiple reasons. Often, they are default outputs from different programs where you’re working with operational or analytical systems in your organization. Let’s take a closer look at these files and where they come from:
These file types can help hold a range of data in lots of different formats. That’s one of the many reasons they are so commonly used, but it creates some issues for us as data preppers. Each file will have its own structure, messiness, and set of challenges that we’ll have to overcome before we can use the data for analysis or augment it with other data sets to add extra value.
Not only that, but the data files could be anywhere and everywhere. It’s key to know the right people in your organization who may have collaborated with others or created the files themselves. Here are some possible sources and some considerations around each:
This flexibility means your files will contain many useful bits of data. Whether it is projected budgets, targets, or the latest reorganization structure to be applied to your earnings, you will need to be comfortable preparing these data sets for analysis. Now, though, rather than using manual manipulation, you will be able to use Prep Builder to separate the useful from the dross and structure the data in preparation for your analysis.
Tableau Prep Builder’s Home screen is very similar to that of Tableau Desktop. To make any data connection, click the plus sign next to the word Connections (Figure 5-2).
As shown in Figure 5-3, this expands the Connections pane to display all the connection types you can make in Prep Builder, including the file types just discussed (except spatial files, as of version 2020.1).
Click the relevant file type to open a File Explorer (Windows) or Finder (Mac) to choose the location of your file. Once you find the file, click Open to connect to that file, after which you’re ready to prepare your data for your purposes.
When saving your Prep Builder flows, you have two main options. First, you can create a Tableau Flow File (.tfl), which just saves the logic of the flow. This means the input file’s location needs to match the location where the current flow is connected. Otherwise, if the flow is rerun then it will throw an error because it can’t find the input data. The second option is to create a Packaged Tableau Flow File (.tflx), which saves both the flow logic and the input data in the same file. This means you can move the input files and the flow will still run. However, if the original input files are updated, this flow will not process the new files; it runs only the file saved within the packaged flow file.
Big data is not always the biggest challenge within the data world. Small data files can be hugely beneficial in terms of containing useful data, but that data is not always easy to extract. Using tools like Prep Builder makes preparing these files much easier and repeatable, unlike manual workarounds to achieve the same result.