Chapter 5. Connecting to Data in Files

Files Upon Files Upon Files

One of the first steps you will take in any data preparation, visualization, or analytics project is to input data. In this chapter, you will learn how and why you should connect data files to Prep Builder, where to find those files, and what challenges data files may pose for preparing data. Data can come in many different forms, but the most common input for Prep Builder is Excel, CSV, or plain-text files (Figure 5-1).

System files generated by software
Figure 5-1. System files generated by software

Spreadsheets

For many of us, our first experiences of working with data involved a piece of ubiquitous software: Microsoft Excel. The spreadsheet has become the jack-of-all-trades for data storage, manipulation, and basic reporting. Every organization across the world has tens, if not hundreds, of spreadsheets saved on every computer. All of them contain useful information, raw data, or reference tables that can help answer the questions organizations are posing.

However, spreadsheets raise some potential issues (some of which are not unique to spreadsheets) when it comes to data preparation and analysis:

Accuracy
The data may have changed since being added to the spreadsheet.
Timelines
The data may have become outdated since being added.
Optimization
Any manipulation or analysis is often manual and nonrepeatable.
Size
With limits on rows and columns, sometimes the data grows beyond spreadsheets.
Sharing
Multiple people using multiple sheets can create confusion and errors.
Single-point solutions
Calculations and analysis are limited to the single workbook in which they’re conducted.

Organizations are complex. Mergers, acquisitions, and splits can create a complex history and network to unpack, and the data often reflects this. Trying to maintain the integrity of the data and your understanding of it is a challenge in itself.

Other File Types

Data files come in many formats besides Excel spreadsheets—CSV, text, statistical, and PDF files are all likely to house data on your computer that could help you answer the questions you have. All these file types are created and used for multiple reasons. Often, they are default outputs from different programs where you’re working with operational or analytical systems in your organization. Let’s take a closer look at these files and where they come from:

CSV (comma-separated value)
These files can be output as a data storage file. Whether you are sharing data for programs that are not in the Microsoft suite or simply moving data into databases, CSVs are very flexible file types. Columns of data are separated by commas (hence the name), with new rows of data listed on separate lines within the file. CSVs are actually a text file but with a fixed delimiter.
Text
Though these files are suitable for holding data in a very simple format, they have the potential to be problematic to work with due to their ability to contain differing structures and data formats. Often, text files are delimited in some way, but this format isn’t always consistent or as easy to use as a CSV file.
Statistical
R and Python are increasingly being taught in universities. Therefore, these file types are being demanded more in the workplace, especially by data scientists and others working with the statistics resulting from R and Python models and packages.
Portable Document Format (PDF)
These commonly used files often contain useful data sets. How a PDF has been created and formatted makes a massive difference as to how easy it is to extract its data using a data preparation flow. For example, if a table is created as an image, its cells can’t be read as data fields and values, but if it’s created as a grid of columns and rows, then they can be.

Where to Find Your Data Files

These file types can help hold a range of data in lots of different formats. That’s one of the many reasons they are so commonly used, but it creates some issues for us as data preppers. Each file will have its own structure, messiness, and set of challenges that we’ll have to overcome before we can use the data for analysis or augment it with other data sets to add extra value.

Not only that, but the data files could be anywhere and everywhere. It’s key to know the right people in your organization who may have collaborated with others or created the files themselves. Here are some possible sources and some considerations around each:

On your computer created by you
Building your own data sets is a common task in the workplace. You might be capturing numbers from reports you read, collecting records, or aggregating other data sets that have been sent to you.
On your computer created by others
Emailed database extracts, survey results, or market research are also common sources of data. The greatest challenge with this data is making sure you’re getting the frequent updates you need to keep your analysis fresh. The hard part isn’t knowing who to ask for current information and insight; it’s waiting for them to reply.
Cloud storage
Team drives held on cloud computers and servers present more of a challenge, as multiple people will be adding files to shared drives or editing existing files, making it a lot tougher to track down the sources for the information. For example, it can be more difficult to learn how data points have been defined, what aggregations have been applied, or what data has been removed. Even the task of refreshing the data set may be challenging due to the difficulty of tracing the origin of the file(s).
On the internet
Many files are now composed of downloads from web pages. The type of file downloaded will affect how well structured it is and hence what challenges you might have to overcome when preparing the data for analysis.

This flexibility means your files will contain many useful bits of data. Whether it is projected budgets, targets, or the latest reorganization structure to be applied to your earnings, you will need to be comfortable preparing these data sets for analysis. Now, though, rather than using manual manipulation, you will be able to use Prep Builder to separate the useful from the dross and structure the data in preparation for your analysis.

How to Connect to Files in Prep

Tableau Prep Builder’s Home screen is very similar to that of Tableau Desktop. To make any data connection, click the plus sign next to the word Connections (Figure 5-2).

Default Connections pane in Prep Builder
Figure 5-2. Default Connections pane in Prep Builder

As shown in Figure 5-3, this expands the Connections pane to display all the connection types you can make in Prep Builder, including the file types just discussed (except spatial files, as of version 2020.1).

Expanded Connections pane
Figure 5-3. Expanded Connections pane

Click the relevant file type to open a File Explorer (Windows) or Finder (Mac) to choose the location of your file. Once you find the file, click Open to connect to that file, after which you’re ready to prepare your data for your purposes.

Considerations for Saving Flows with File Inputs

When saving your Prep Builder flows, you have two main options. First, you can create a Tableau Flow File (.tfl), which just saves the logic of the flow. This means the input file’s location needs to match the location where the current flow is connected. Otherwise, if the flow is rerun then it will throw an error because it can’t find the input data. The second option is to create a Packaged Tableau Flow File (.tflx), which saves both the flow logic and the input data in the same file. This means you can move the input files and the flow will still run. However, if the original input files are updated, this flow will not process the new files; it runs only the file saved within the packaged flow file.

Summary

Big data is not always the biggest challenge within the data world. Small data files can be hugely beneficial in terms of containing useful data, but that data is not always easy to extract. Using tools like Prep Builder makes preparing these files much easier and repeatable, unlike manual workarounds to achieve the same result.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset