Previous chapters have discussed techniques for determining the changes required to prepare a data set for analysis, albeit at a relatively simple level. What about those situations where the path isn’t straightforward; how do you approach the problem then? Complex challenges can include building solutions requiring multiple steps, inputting data sources using multiple join conditions, or having to complete many reshaping steps throughout the data prep process. This chapter will cover this exact scenario by taking on one of the most complicated challenges Preppin’ Data has covered to date: 2020: Week 3. The aim of this challenge, created by Jonathan Allenby (my fellow Dr. Prepper), is to turn the National Basketball Association (NBA) game results into the detailed standings you commonly see on websites or in newspapers.
This challenge involves taking the results and building the full conference league tables, including rankings, wins, and losses; recent performance; and even winning streaks (Figure 35-1). This really is a tough challenge with many facets; hence, it’s a great example to use to show you how to break down complex problems into smaller, more manageable pieces.
As covered in Chapter 3, developing an understanding of the input and output will help you to get an overview of the task. On large, complex challenges, planning becomes especially important to ensure you are working toward the desired results.
Figure 35-2 shows the first few rows of one of the monthly inputs for the challenge, demonstrating the structure of the data set.
To understand where to begin, it’s important to think about what is required from the data set. As this is a Preppin’ Data challenge, this has already been detailed for you (Figure 35-3).
Here’s my initial scope for this challenge (Figure 35-4).
Let’s revisit some of the steps covered in Chapter 3 and how they apply to this challenge:
Map out your inputs. What does each input contain? What dimensions and measures are there in the data? Are any data fields incorrectly formatted? What is the data’s level of granularity?
In this case, only the Date field seems to potentially need cleaning.
Map out your outputs. How many output files will be required? What format will you need certain fields to be? Think about the granularity of the data required.
In this case, each team will have a single row within the output. This means there will be a lot of aggregation required to take the game results to the level we need.
Understand the gaps. What fields are missing within the data, and how do we add them? This will give us a list of fields that we need to create through either pivoting, joins, or calculations.
At this stage, you don’t have to solve all the issues that might arise between loading the input data and generating the output data set for analysis. As discussed in Chapter 3, you might not spot all of the challenges in the data set at this point, but as you work with the data, they will emerge.
Breaking the challenge down into individual chunks makes it a lot easier to work out how best to solve it. If you skip this step, the challenge may seem insurmountable. Creating the calculations you know you’ll need is a good first step, as they will help:
Guide next steps toward finding a solution
Determine an order in which to proceed
Let’s take wins as an example in this challenge. You have individual game results, and you need to determine who the winner is. This is easier said than done, though, because for each win there is a loss as well. This means two rows are required for each game: one to record the winner, and one for the loser. To ensure we capture every team’s games, first we’ll use the Team List data source and join all game results to that twice—once to match the home team and again to match the away team to the original Team List team names (Figure 35-5).
Using two calculations then allows us to calculate each team’s points (Figure 35-6) and their opposition’s points (Figure 35-7) for home games.
These calculations result in the data set shown in Figure 35-8.
We can then repeat these calculations, but this time testing whether the Team List team name is the away team. Figure 35-7 showed how to calculate the away team’s score where the focus team plays away. Figure 35-9 shows how to calculate the opposition’s score.
We can then assess these calculations for who won or lost the game (Figure 35-10). We can take the same approach for games in which the team played away and union the results together.
A sample of the resulting data set is shown in Figure 35-11.
We can then union together the two flows to create one large data set to record each team’s full set of games in the season (Figure 35-12).
You can repeat this approach to ensure you are tackling each of those calculations in turn. As you determine a solution for each subchallenge, you may need to change the order of the steps, or copy and paste entire sections. This is easy enough. By right-clicking on the linking line between two steps, you can delete it and then drag the step from the “pre-step” to the Add part of the step you wish to link it to in your flow (Figure 35-13).
Often only by working with the data will you arrive at the solution. Otherwise, it can be difficult to imagine how exactly the data will behave during each step and as you use the data when conducting your analysis. When analyzing the data, you may determine that you need to iterate further or remove certain steps. This is a good thing—it is all part of the data prep learning experience, and fortunately, with Prep Builder, you can make these changes quickly.
For example, when creating the Win (W), Loses (L), Home Results (Home), and Away Results (Away) columns, I knew they would involve similar calculations but wasn’t sure in which order I would handle them. These columns represent the total wins and losses as well as the record of wins and losses achieved in home and away games. Figure 35-14 shows the flow I used to approach this task.
The primary data point I needed for each team was whether the team won or lost a game. I already had captured whether the team at the focus of the game had scored more points than the opposition. If they had scored more, I returned a column of W; otherwise, I returned a column of L. This logic was correct but wasn’t ideal for aggregating to calculate all the relevant totals of wins and losses. Therefore, I pivoted this column to create a single column for both.
To make the next counts simple, I created a simple calculation of 1 to add to each win or loss column depending on the result (Figure 35-15).
Knowing that the next steps would require an Aggregate step, I realized I’d need to split the process into two streams at this point. The Aggregate step returns only the aggregated values and the Group By data fields (for a refresher on the Aggregate step in Prep, read Chapter 15). Even with very careful planning, it is unlikely that I would have predicted adding another branch to the flow at this point. Prep is a fantastically agile tool when your logic starts to diverge from your initial plans. Because I needed to aggregate at an overall level as well as to split out the home and away records, I needed two separate Aggregate steps (Figure 35-16).
These two flows were eventually joined back together to form one overall view of the team’s records at the end of the flow shown in Figure 35-14. Using the Profile pane allows you to see which fields are required and which are not and remove them through the ellipsis menu. Figure 35-17 shows the resulting data set.
To practice what this chapter has covered, why not have a go at the challenge yourself? I won’t spoil the whole challenge here and show every step, but this should get you started.
Being able to iterate your approach is a powerful data preparation skill, as often there are cleaning issues you didn’t spot in your initial approach or your first solution doesn’t work as intended. Enjoying the challenge of data preparation comes down to being able to focus on problem-solving. This is much easier if you break big challenges down into more manageable chunks.