Chapter 35. Breaking Down Complex Data Preparation Challenges

Previous chapters have discussed techniques for determining the changes required to prepare a data set for analysis, albeit at a relatively simple level. What about those situations where the path isn’t straightforward; how do you approach the problem then? Complex challenges can include building solutions requiring multiple steps, inputting data sources using multiple join conditions, or having to complete many reshaping steps throughout the data prep process. This chapter will cover this exact scenario by taking on one of the most complicated challenges Preppin’ Data has covered to date: 2020: Week 3. The aim of this challenge, created by Jonathan Allenby (my fellow Dr. Prepper), is to turn the National Basketball Association (NBA) game results into the detailed standings you commonly see on websites or in newspapers.

 

The Challenge

This challenge involves taking the results and building the full conference league tables, including rankings, wins, and losses; recent performance; and even winning streaks (Figure 35-1). This really is a tough challenge with many facets; hence, it’s a great example to use to show you how to break down complex problems into smaller, more manageable pieces.

Challenge post for Preppin’ Data 2020: Week 3
Figure 35-1. Challenge post for Preppin’ Data 2020: Week 3

As covered in Chapter 3, developing an understanding of the input and output will help you to get an overview of the task. On large, complex challenges, planning becomes especially important to ensure you are working toward the desired results.

Where to Begin

Figure 35-2 shows the first few rows of one of the monthly inputs for the challenge, demonstrating the structure of the data set.

Sample input data set for NBA challenge
Figure 35-2. Sample input data set for NBA challenge

To understand where to begin, it’s important to think about what is required from the data set. As this is a Preppin’ Data challenge, this has already been detailed for you (Figure 35-3).

Desired output for the NBA challenge
Figure 35-3. Desired output for the NBA challenge

Here’s my initial scope for this challenge (Figure 35-4).

Sketched plan to solve the challenge
Figure 35-4. Sketched plan to solve the challenge

Let’s revisit some of the steps covered in Chapter 3 and how they apply to this challenge:

  1. Map out your inputs. What does each input contain? What dimensions and measures are there in the data? Are any data fields incorrectly formatted? What is the data’s level of granularity?

    • In this case, only the Date field seems to potentially need cleaning.

  2. Map out your outputs. How many output files will be required? What format will you need certain fields to be? Think about the granularity of the data required.

    • In this case, each team will have a single row within the output. This means there will be a lot of aggregation required to take the game results to the level we need.

  3. Understand the gaps. What fields are missing within the data, and how do we add them? This will give us a list of fields that we need to create through either pivoting, joins, or calculations.

At this stage, you don’t have to solve all the issues that might arise between loading the input data and generating the output data set for analysis. As discussed in Chapter 3, you might not spot all of the challenges in the data set at this point, but as you work with the data, they will emerge.

Logical Steps

Breaking the challenge down into individual chunks makes it a lot easier to work out how best to solve it. If you skip this step, the challenge may seem insurmountable. Creating the calculations you know you’ll need is a good first step, as they will help:

  • Guide next steps toward finding a solution

  • Determine an order in which to proceed

Let’s take wins as an example in this challenge. You have individual game results, and you need to determine who the winner is. This is easier said than done, though, because for each win there is a loss as well. This means two rows are required for each game: one to record the winner, and one for the loser. To ensure we capture every team’s games, first we’ll use the Team List data source and join all game results to that twice—once to match the home team and again to match the away team to the original Team List team names (Figure 35-5).

Join setup to create one set of results per team for home and away games
Figure 35-5. Join setup to create one set of results per team for home and away games

Using two calculations then allows us to calculate each team’s points (Figure 35-6) and their opposition’s points (Figure 35-7) for home games.

Calculating Own Score by returning the home team’s points value
Figure 35-6. Calculating Own Score by returning the home team’s points value
Calculating Opposition Score by returning the away team’s points value
Figure 35-7. Calculating Opposition Score by returning the away team’s points value

These calculations result in the data set shown in Figure 35-8.

Result of Own/Opposition Score calculations
Figure 35-8. Result of Own/Opposition Score calculations

We can then repeat these calculations, but this time testing whether the Team List team name is the away team. Figure 35-7 showed how to calculate the away team’s score where the focus team plays away. Figure 35-9 shows how to calculate the opposition’s score.

Calculating the away team’s Own Score from the second Join step in #join_setup_to_create_one_set_of_results
Figure 35-9. Calculating the away team’s Own Score from the second Join step in Figure 35-5

We can then assess these calculations for who won or lost the game (Figure 35-10). We can take the same approach for games in which the team played away and union the results together.

Calculating the result of each game
Figure 35-10. Calculating the result of each game

A sample of the resulting data set is shown in Figure 35-11.

Sample of the resulting data set to this point
Figure 35-11. Sample of the resulting data set to this point

We can then union together the two flows to create one large data set to record each team’s full set of games in the season (Figure 35-12).

First stage of NBA challenge
Figure 35-12. First stage of NBA challenge

Making Changes

You can repeat this approach to ensure you are tackling each of those calculations in turn. As you determine a solution for each subchallenge, you may need to change the order of the steps, or copy and paste entire sections. This is easy enough. By right-clicking on the linking line between two steps, you can delete it and then drag the step from the “pre-step” to the Add part of the step you wish to link it to in your flow (Figure 35-13).

Reconnecting the steps
Figure 35-13. Reconnecting the steps

Be Ready to Iterate

Often only by working with the data will you arrive at the solution. Otherwise, it can be difficult to imagine how exactly the data will behave during each step and as you use the data when conducting your analysis. When analyzing the data, you may determine that you need to iterate further or remove certain steps. This is a good thing—it is all part of the data prep learning experience, and fortunately, with Prep Builder, you can make these changes quickly.

For example, when creating the Win (W), Loses (L), Home Results (Home), and Away Results (Away) columns, I knew they would involve similar calculations but wasn’t sure in which order I would handle them. These columns represent the total wins and losses as well as the record of wins and losses achieved in home and away games. Figure 35-14 shows the flow I used to approach this task.

Flow to create Wins, Losses, Home Results, and Away Results columns
Figure 35-14. Flow to create Wins, Losses, Home Results, and Away Results columns

The primary data point I needed for each team was whether the team won or lost a game. I already had captured whether the team at the focus of the game had scored more points than the opposition. If they had scored more, I returned a column of W; otherwise, I returned a column of L. This logic was correct but wasn’t ideal for aggregating to calculate all the relevant totals of wins and losses. Therefore, I pivoted this column to create a single column for both.

To make the next counts simple, I created a simple calculation of 1 to add to each win or loss column depending on the result (Figure 35-15).

Pivot to calculate Win/Loss counts
Figure 35-15. Pivot to calculate Win/Loss counts

Knowing that the next steps would require an Aggregate step, I realized I’d need to split the process into two streams at this point. The Aggregate step returns only the aggregated values and the Group By data fields (for a refresher on the Aggregate step in Prep, read Chapter 15). Even with very careful planning, it is unlikely that I would have predicted adding another branch to the flow at this point. Prep is a fantastically agile tool when your logic starts to diverge from your initial plans. Because I needed to aggregate at an overall level as well as to split out the home and away records, I needed two separate Aggregate steps (Figure 35-16).

Aggregating Wins and Losses columns
Figure 35-16. Aggregating Wins and Losses columns

These two flows were eventually joined back together to form one overall view of the team’s records at the end of the flow shown in Figure 35-14. Using the Profile pane allows you to see which fields are required and which are not and remove them through the ellipsis menu. Figure 35-17 shows the resulting data set.

Resulting data set from this subsection of the flow
Figure 35-17. Resulting data set from this subsection of the flow

To practice what this chapter has covered, why not have a go at the challenge yourself? I won’t spoil the whole challenge here and show every step, but this should get you started.

Summary

Being able to iterate your approach is a powerful data preparation skill, as often there are cleaning issues you didn’t spot in your initial approach or your first solution doesn’t work as intended. Enjoying the challenge of data preparation comes down to being able to focus on problem-solving. This is much easier if you break big challenges down into more manageable chunks.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset