Chapter 3. Planning Your Prep

So you know your data isn’t suited to the purpose for which you need it. This is probably because you have tried to analyze it but hit a roadblock early on. Maybe there are multiple data fields containing the information where you would expect just one. Maybe the data you are analyzing has gaps. Or, maybe the data you want to analyze comes from multiple sources.

What do you do after reaching this realization? How do you develop a solution when all you see are the issues in front of you?

This chapter recommends a staged approach to help you plan your data preparation, define the outcome, and build a workflow to solve your challenges. The four stages in the proposed framework are as follows:

  1. Know your data (KYD).

  2. Identify the desired state.

  3. Determine the required transitions from KYD to the desired state.

  4. Build the workflow.

To illustrate this process, we’ll walk through a simple example data set, some sales data from Chin & Beard Suds Co. (Figure 3-1).

Sample data from Chin & Beard Suds Co.
Figure 3-1. Sample data from Chin & Beard Suds Co.

Stage 1: Know Your Data

Without understanding your data set as it currently stands, you will not be able to deliver the results you need. For small data sets, sometimes it’s very easy to develop this understanding and a corresponding plan. With larger data sets, the planning process can take longer, but it is arguably more important since you can hold only so much information in your own memory. Table 3-1 shows what to look for in data sets (using the sample data from Figure 3-1).

Table 3-1. Considerations for your data set
What to look for Our example

Columns, rows, and crosstabs: how is the data structured?

There are two columns of categorical data, with each column representing a month of values.

Headers and columns: are they all there as expected?

Month headers should ideally be one column, with values listed in a separate column.

Data types: what type of data exists in each column?

There should be just one data type in each column. Data types include strings (alphanumeric characters), numbers, dates, and Boolean (true/false) values.

There are two text values, but each subsequent column is a numeric value.

Granularity of rows: what does each row represent?

Each row is a different product sold in a store for a specific date.

Data points: are there any missing?

There are no data points present, so we can disregard this factor for this example.

Number of records: does the data set have the number of observations you expect?

There are three branches and two products, totaling six rows of data. With four months of data, there are 24 values.

Business/organization rules: are there any rules that the data set should adhere to?

In this instance, Lewisham should be recorded as a single branch and not two.

A quick sketch of the data set can often help you think through and answer these questions. Figure 3-2 shows how we’d sketch the sample data.

Sketch of data set highlighting categories and values
Figure 3-2. Sketch of data set highlighting categories and values

By identifying categorical data and the fields that contain the values you need for your analysis, you can see how complete the data set is, why it isn’t ready for analysis yet, and what it might take to prepare the data. Notice that it’s not vital to capture everything. In fact, simplifying the sketch can help you focus on the core issues rather than drowning in the details.

Stage 2: Identify the Desired State

So what is the desired state of your data?

For most modern data analysis tools or visualization software, you will need to structure your data into columns. The first row contains the column heading—the label for the data—and each subsequent row should be an individual record or instance. For example, each row could represent a product purchased from your store or the number of new employees each month (Figure 3-3).

Sample desired state of a data set for analysis
Figure 3-3. Sample desired state of a data set for analysis

Understanding what each row of your data represents is key to preparing and analyzing your data correctly. Most tools require a single input table, so all of the data fields you need for your analysis should go in one table.

Returning to our example, Figure 3-4 shows what we need to output to complete our analysis.

Sketch showing the structure of the data set
Figure 3-4. Sketch showing the structure of the data set

The categorical data fields are Branch, Product, and Date. Note how capturing the data type here is useful too (see the labels above the column headers). The output will have only one field for analysis: a simple sales value.

As you become more experienced in producing your own data sets, this stage becomes a lot easier and almost second nature. You will be able to look at a data set and know what the desired outcome should be. In the meantime, identifying the desired state can seem daunting, so sketch it!

Stage 3: Determine the Required Transitions from KYD to the Desired State

Take your hands away from the keyboard and mouse; for this stage, you won’t need a computer, just your brain. By looking at the original data and identifying the desired end state, you will start to understand some of the transitions you will need—cleaning, pivoting, joining, and aggregating the data—between the two stages. (Don’t worry if these types of transitions within data are completely new concepts for you; upcoming chapters will help you understand and master them to solve your own data challenges.)

Start by making a list of the transitions you think you will need to make. Doing this outside of Prep Builder will help you think through the necessary steps, rather than worrying about how you’ll implement them in the software. If you don’t know how to complete a step, you can deal with that at a later point. You may not end up doing them all anyway, since you’re not yet building the workflow. Here are some of the questions to ask yourself at this stage:

Columns
  • Are there too many? Remove unnecessary fields or pivot columns to turn them into rows if they represent the same thing (e.g., dates).

  • Are there columns missing? Maybe join a secondary data set. A join pulls together two data sets (see Chapters 32 and 33).

  • Are the data field names clear? They should be understandable to the data users. If they’re not, you will need to amend them.

  • Are calculations needed? If the data field you need for your analysis doesn’t exist within your data set, you’ll need to create it. This is possible only if you can work out the logic of how to create the new field based on the existing data within your data set. By completing your calculations in your preparation tool, you free up your analysis tool to focus on forming the data and require fewer skills from the end user. Lots of the calculations in your preparation phase will focus on the categorical data fields, as the actual measures will be aggregated during the analysis phase.

Rows
  • Are there more rows than expected? Filter out unnecessary rows. Aggregate the data to be less granular.

  • Are there fewer than expected? Pivot columns to create more. Add an additional data set through unioning or join a data scaffold.

  • Are the values clean? For example, is there punctuation (or other extraneous characters) where there shouldn’t be any? Take note of individual changes, as they will likely be separate data prep steps.

  • Are there any blanks? Should they be there? If not, you’ll need to filter them or find a way to fill them.

Multiple data sources
  • Do you need to join multiple sources together to add all the data required for the analysis?

  • Do you need to union together multiple sources to add more rows to complete a data set?

Let’s return to our example to see how this works (Figure 3-5).

Sketch planning the transitions required in the sample data set
Figure 3-5. Sketch planning the transitions required in the sample data set

Very quickly we can spot the challenges in this sketch of the shape of the data. For example, we can see:

  • We have two categorical data fields (Branch and Product).

  • The rest of the columns are all headed by dates and contain the sales values.

    • The dates need to be pivoted to form a third categorical field. This will add another row for each month/branch/product combination. The pivoting process will create a column of values.

  • Both the Branch and Product fields contain punctuation (underscores and hyphens, respectively) that will need to be removed.

For the rows of data, the analysis requires a row for each unique combination of branch (town), product, and date. This means we will need to:

  • Aggregate the Lewisham_1 and Lewisham_2 sales together to output the data at the correct granularity. This will also change the number of rows in the desired state data set compared to the original data set.

We also might need to take these steps:

  • Rename fields. Is each column name a clear reference to the data it contains?

  • Change the data types of fields. Changing a string to a number allows for aggregation. (This will be covered in more detail in Part II of the book, Data Types.)

Stage 4: Build the Workflow

OK, back to the mouse and keyboard. Where do you start building out this workflow? Well, making a basic step-by-step plan is a good strategy. With Tableau Prep, you can quickly change the order of the transitions or add forgotten ones to move from the original data to the desired state. You might not get the workflow right the first time, but you will be a lot closer for having planned out these steps.

Also, you might not know which tool, transition, or calculation to use to make the change you require, but you will be able to take a step back to rethink the problem and not lose your place in the process.

So let’s complete our example. With all the steps just captured, I used Tableau Prep Builder to create the workflow shown in Figure 3-6, from inputting the data to outputting it as a CSV (comma-separated value) file. The first step (icon) in Figure 3-6 is the Input step, where the data is imported into Prep Builder for processing. The second icon shows the step of pivoting multiple columns of data to rows of data. The pivoting process will convert all of the different dates into one column, with another column holding the respective value for that date, branch, and product combination.

The Tableau Prep flow from input to output
Figure 3-6. The Tableau Prep flow from input to output

The Clean step (the third icon) contains a lot of detail that is captured in the tools Changes pane (Figure 3-7), namely that we:

  • Created a calculation to change the product from “Liquid-Soap” to “Liquid” and “Soap-Bar” to “Bar.”

  • Grouped two Lewisham stores together.

  • Changed the Pivot1 Names field name to Date.

  • Changed date type for the newly named Date column to a date.

  • Renamed the Pivot1 Values column to Sales.

The Changes pane for the Clean step in #sketch_planning_the_transitions_require
Figure 3-7. The Changes pane for the Clean step in Figure 3-5

This leaves us with a nice clean data set, ready for analysis when we output it in the fourth and final step (Figure 3-8).

We can output the data set in a few ways, but if it will be analyzed in Tableau Desktop, then a Hyper file is a good choice. Output options will be covered in more detail in Part IV.

See the book’s website to access all the files used if you want to have a go at the exercise yourself.

The restructured data
Figure 3-8. The restructured data

Summary

By planning your data preparation, not only will you be more focused on the task of preparing data, but you’ll also have a solid basis to work from when you’re ready to analyze that data. Some of these techniques are challenging, especially when you apply them for the first time, but understanding and practicing them will improve your odds of success. Input and output data sets can be large and complex, so the planning process might require a significant investment of time before you can start making progress on manipulating the data. It’s only normal to want to dive right in, but this planning effort will repay you in the long term by reducing your risk of hitting roadblocks or otherwise getting offtrack in your data analysis.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset