Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 3. Planning Your Prep

So you know your data isn’t suited to the purpose for which you need it. This is probably because you have tried to analyze it but hit a roadblock early on. Maybe there are multiple data fields containing the information where you would expect just one. Maybe the data you are analyzing has gaps. Or, maybe the data you want to analyze comes from multiple sources.

What do you do after reaching this realization? How do you develop a solution when all you see are the issues in front of you?

This chapter recommends a staged approach to help you plan your data preparation, define the outcome, and build a workflow to solve your challenges. The four stages in the proposed framework are as follows:

Know your data (KYD).
Identify the desired state.
Determine the required transitions from KYD to the desired state.
Build the workflow.

To illustrate this process, we’ll walk through a simple example data set, some sales data from Chin & Beard Suds Co. (Figure 3-1).

Stage 1: Know Your Data

Without understanding your data set as it currently stands, you will not be able to deliver the results you need. For small data sets, sometimes it’s very easy to develop this understanding and a corresponding plan. With larger data sets, the planning process can take longer, but it is arguably more important since you can hold only so much information in your own memory. Table 3-1 shows what to look for in data sets (using the sample data from Figure 3-1).

Table 3-1. Considerations for your data set
What to look for	Our example
Columns, rows, and crosstabs: how is the data structured?	There are two columns of categorical data, with each column representing a month of values.
Headers and columns: are they all there as expected?	Month headers should ideally be one column, with values listed in a separate column.
Data types: what type of data exists in each column? There should be just one data type in each column. Data types include strings (alphanumeric characters), numbers, dates, and Boolean (true/false) values.	There are two text values, but each subsequent column is a numeric value.
Granularity of rows: what does each row represent?	Each row is a different product sold in a store for a specific date.
Data points: are there any missing?	There are no data points present, so we can disregard this factor for this example.
Number of records: does the data set have the number of observations you expect?	There are three branches and two products, totaling six rows of data. With four months of data, there are 24 values.
Business/organization rules: are there any rules that the data set should adhere to?	In this instance, Lewisham should be recorded as a single branch and not two.

A quick sketch of the data set can often help you think through and answer these questions. Figure 3-2 shows how we’d sketch the sample data.

Sketch of data set highlighting categories and values

By identifying categorical data and the fields that contain the values you need for your analysis, you can see how complete the data set is, why it isn’t ready for analysis yet, and what it might take to prepare the data. Notice that it’s not vital to capture everything. In fact, simplifying the sketch can help you focus on the core issues rather than drowning in the details.

Stage 2: Identify the Desired State

So what is the desired state of your data?

For most modern data analysis tools or visualization software, you will need to structure your data into columns. The first row contains the column heading—the label for the data—and each subsequent row should be an individual record or instance. For example, each row could represent a product purchased from your store or the number of new employees each month (Figure 3-3).

Sample desired state of a data set for analysis

Understanding what each row of your data represents is key to preparing and analyzing your data correctly. Most tools require a single input table, so all of the data fields you need for your analysis should go in one table.

Sketching Your Data Set

Simple data sets can be very easy to just doodle out. Sketching wide data sets with lots of columns, though, can be a bigger task. Start by listing out your categorical data—that is, subdivide your data into groups. Often there is a limited number of categories compared to the infinite range of numerical values, so categorizing data adds meaning by describing the numerical values in the data set. These categories are parameters that you will analyze the numbers by—for example, region, customer, product, or course. Each unique combination of these categorical data fields will establish the granularity of your desired state data set. In the example data set, the combination of branch, product, and date will set the value being recorded.

Next, add a column for each measure (measurement) you want to analyze. At this point, you should also consider values that might help the analytical tool’s performance. Adding a ranking, market share, or subtotal for each category at this stage could make your data set more accessible to novice users and also more performant, allowing you to complete your analysis more quickly.

Returning to our example, Figure 3-4 shows what we need to output to complete our analysis.

Sketch showing the structure of the data set

The categorical data fields are Branch, Product, and Date. Note how capturing the data type here is useful too (see the labels above the column headers). The output will have only one field for analysis: a simple sales value.

As you become more experienced in producing your own data sets, this stage becomes a lot easier and almost second nature. You will be able to look at a data set and know what the desired outcome should be. In the meantime, identifying the desired state can seem daunting, so sketch it!

Stage 3: Determine the Required Transitions from KYD to the Desired State

Take your hands away from the keyboard and mouse; for this stage, you won’t need a computer, just your brain. By looking at the original data and identifying the desired end state, you will start to understand some of the transitions you will need—cleaning, pivoting, joining, and aggregating the data—between the two stages. (Don’t worry if these types of transitions within data are completely new concepts for you; upcoming chapters will help you understand and master them to solve your own data challenges.)

Start by making a list of the transitions you think you will need to make. Doing this outside of Prep Builder will help you think through the necessary steps, rather than worrying about how you’ll implement them in the software. If you don’t know how to complete a step, you can deal with that at a later point. You may not end up doing them all anyway, since you’re not yet building the workflow. Here are some of the questions to ask yourself at this stage:

Columns

Are there too many? Remove unnecessary fields or pivot columns to turn them into rows if they represent the same thing (e.g., dates).
Are there columns missing? Maybe join a secondary data set. A join pulls together two data sets (see Chapters 32 and 33).
Are the data field names clear? They should be understandable to the data users. If they’re not, you will need to amend them.
Are calculations needed? If the data field you need for your analysis doesn’t exist within your data set, you’ll need to create it. This is possible only if you can work out the logic of how to create the new field based on the existing data within your data set. By completing your calculations in your preparation tool, you free up your analysis tool to focus on forming the data and require fewer skills from the end user. Lots of the calculations in your preparation phase will focus on the categorical data fields, as the actual measures will be aggregated during the analysis phase.

Rows

Are there more rows than expected? Filter out unnecessary rows. Aggregate the data to be less granular.
Are there fewer than expected? Pivot columns to create more. Add an additional data set through unioning or join a data scaffold.
Are the values clean? For example, is there punctuation (or other extraneous characters) where there shouldn’t be any? Take note of individual changes, as they will likely be separate data prep steps.
Are there any blanks? Should they be there? If not, you’ll need to filter them or find a way to fill them.

Multiple data sources

Do you need to join multiple sources together to add all the data required for the analysis?
Do you need to union together multiple sources to add more rows to complete a data set?

Let’s return to our example to see how this works (Figure 3-5).

Sketch planning the transitions required in the sample data set

Very quickly we can spot the challenges in this sketch of the shape of the data. For example, we can see:

We have two categorical data fields (Branch and Product).
The rest of the columns are all headed by dates and contain the sales values.
- The dates need to be pivoted to form a third categorical field. This will add another row for each month/branch/product combination. The pivoting process will create a column of values.
Both the Branch and Product fields contain punctuation (underscores and hyphens, respectively) that will need to be removed.

For the rows of data, the analysis requires a row for each unique combination of branch (town), product, and date. This means we will need to:

Aggregate the Lewisham_1 and Lewisham_2 sales together to output the data at the correct granularity. This will also change the number of rows in the desired state data set compared to the original data set.

We also might need to take these steps:

Rename fields. Is each column name a clear reference to the data it contains?
Change the data types of fields. Changing a string to a number allows for aggregation. (This will be covered in more detail in Part II of the book, Data Types.)

Stage 4: Build the Workflow

OK, back to the mouse and keyboard. Where do you start building out this workflow? Well, making a basic step-by-step plan is a good strategy. With Tableau Prep, you can quickly change the order of the transitions or add forgotten ones to move from the original data to the desired state. You might not get the workflow right the first time, but you will be a lot closer for having planned out these steps.

Also, you might not know which tool, transition, or calculation to use to make the change you require, but you will be able to take a step back to rethink the problem and not lose your place in the process.

So let’s complete our example. With all the steps just captured, I used Tableau Prep Builder to create the workflow shown in Figure 3-6, from inputting the data to outputting it as a CSV (comma-separated value) file. The first step (icon) in Figure 3-6 is the Input step, where the data is imported into Prep Builder for processing. The second icon shows the step of pivoting multiple columns of data to rows of data. The pivoting process will convert all of the different dates into one column, with another column holding the respective value for that date, branch, and product combination.

The Tableau Prep flow from input to output

The Clean step (the third icon) contains a lot of detail that is captured in the tools Changes pane (Figure 3-7), namely that we:

Created a calculation to change the product from “Liquid-Soap” to “Liquid” and “Soap-Bar” to “Bar.”
Grouped two Lewisham stores together.
Changed the Pivot1 Names field name to Date.
Changed date type for the newly named Date column to a date.
Renamed the Pivot1 Values column to Sales.

The Changes pane for the Clean step in #sketch_planning_the_transitions_require

This leaves us with a nice clean data set, ready for analysis when we output it in the fourth and final step (Figure 3-8).

We can output the data set in a few ways, but if it will be analyzed in Tableau Desktop, then a Hyper file is a good choice. Output options will be covered in more detail in Part IV.

See the book’s website to access all the files used if you want to have a go at the exercise yourself.

Summary

By planning your data preparation, not only will you be more focused on the task of preparing data, but you’ll also have a solid basis to work from when you’re ready to analyze that data. Some of these techniques are challenging, especially when you apply them for the first time, but understanding and practicing them will improve your odds of success. Input and output data sets can be large and complex, so the planning process might require a significant investment of time before you can start making progress on manipulating the data. It’s only normal to want to dive right in, but this planning effort will repay you in the long term by reducing your risk of hitting roadblocks or otherwise getting offtrack in your data analysis.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for
3. Planning Your Prep

Chapter 3. Planning Your Prep

Figure 3-1. Sample data from Chin & Beard Suds Co.

Stage 1: Know Your Data

Figure 3-2. Sketch of data set highlighting categories and values

Stage 2: Identify the Desired State

Figure 3-3. Sample desired state of a data set for analysis

Figure 3-4. Sketch showing the structure of the data set

Stage 3: Determine the Required Transitions from KYD to the Desired State

Figure 3-5. Sketch planning the transitions required in the sample data set

Stage 4: Build the Workflow

Figure 3-6. The Tableau Prep flow from input to output

Figure 3-7. The Changes pane for the Clean step in Figure 3-5

Figure 3-8. The restructured data

Summary

Table of Contents for 3. Planning Your Prep

Create new playlist

Sign In

Sign Up

Chapter 3. Planning Your Prep

Figure 3-1. Sample data from Chin & Beard Suds Co.

Stage 1: Know Your Data

Figure 3-2. Sketch of data set highlighting categories and values

Stage 2: Identify the Desired State

Figure 3-3. Sample desired state of a data set for analysis

Figure 3-4. Sketch showing the structure of the data set

Stage 3: Determine the Required Transitions from KYD to the Desired State

Figure 3-5. Sketch planning the transitions required in the sample data set

Stage 4: Build the Workflow

Figure 3-6. The Tableau Prep flow from input to output

Figure 3-7. The Changes pane for the Clean step in Figure 3-5

Figure 3-8. The restructured data

Summary

Table of Contents for
3. Planning Your Prep