Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 26. Cleaning by Grouping Data

Data preparation would not be necessary if we always had someone else curating a perfect data set for us. However, we can (and sadly often have to) clean the data ourselves. As mentioned in Chapter 9, one of the most common challenges you’ll face in data prep is cleaning up string data—for example, standardizing the string values enough to be able to count instances of values even when they have typos. One technique can come in especially handy for this scenario: grouping. This chapter will cover what grouping is and how to use the built-in grouping tools in Prep Builder.

What Does Grouping Mean?

Grouping means applying logic to (mostly) string data fields to recognize a common characteristic among them, such as their meaning or intended value. For example, we might expect the following data items to be grouped together:

Edinburgh
Edenburgh
Edinborough
3d!nburgh

As humans, we can recognize that all these different names probably all refer to Edinburgh, Scotland (especially if the column were called City Name). But data software does not see this data the same way, so we have to give it some direction for how to handle these different collections of characters.

Why Use Grouping

Grouping is a technique you need to learn for multiple reasons.

Improving Accuracy

When they hear the term “data,” most people seem to think about system-generated data being fed into databases. After a year of data analysis, most data workers would consider themselves lucky to work with such data! Most data sources people use are still manually compiled. Even leading equity research agencies manually input corporate results from PDFs to build the data sources for their analysis. And therein lies the problem.

Manually entering data increases the risk of mistyping a letter or number. Add in modern-day deadline-driven culture to this mix, and you are looking at a potentially very messy data set. If you haven’t invested in autogenerated data, you probably also don’t have a heavily resourced IT team who will clean up those mistakes. Fortunately, grouping allows you to combine together incorrect and correct values to form a fully correct data set.

Navigating the Data Hierarchy

“Drilling into data” is an expression you are likely to encounter as you work in data. This refers to starting at a high-level (less granular) view of the data and gradually focusing on lower-level (more granular) views, often by filtering out what is not of interest to keep the data set manageable. Hierarchies in tools like Tableau allow you to “drill down” like this, but the hierarchies aren’t always set within the data. You can mitigate this by using groups to build higher or intermediary levels between existing data fields in your data set. Common hierarchies can include:

Time (years, months, weeks, days, etc.)
Geographical (country, state, city, etc.)
Organizational (office, department, team, etc.)
Operational processes (call centers, teams, products)

Groups can be used to pull together lower-level entities to create a new higher level. This differs from aggregation, which is about changing the granularity of records in terms of measures. Grouping, by contrast, is about forming consistent data values within a string data field. This can help make aggregations easier within both Prep and Desktop. The flexibility of grouping techniques means that you can update your analysis quickly in response to organizational changes like a merger or a team/management restructure (Figure 26-1).

Smoothing Reorganizations

Reorganizations used to be a nightmare to handle from a data perspective. Organizations might decide to change management structure or restructure a few teams’ sales territories without looping in the data team. Building significant and robust reference tables to reflect and apply this organizational restructuring often takes too long. Grouping allows us to handle these challenges on an ad hoc basis in the short term (Figure 26-2) until more formal reference tables can be built.

Grouping Techniques

Grouping and replacing values in Tableau Prep is one of its most useful functions, and it’s also very easy to use. As this section will describe, there are three main techniques for grouping values, which allows Prep to handle a large range of use cases.

Manual

Manual grouping is great for a simple, quick fix. However, for a larger, more complex challenge, this technique will soon become frustrating.

Manual grouping allows you to apply your own logic to a data set by selecting data points to group together (Figure 26-3).

Team mismatches that would be resolved by grouping

Clearly, in this example we want to make the second row of data the same as the other instances of York. Each tool has its own method for this, but in Prep Builder the solution is very simple:

Use the Profile pane to quickly see odd values in the data set (Figure 26-4).

Figure 26-4. Viewing an outlier in the Profile pane
Select the values you would like to clean/amend by holding down the Ctrl key or the Command key on Mac (Figure 26-5).

Figure 26-5. Selecting multiple values in Prep Builder
Right-click to bring up the menu (Figure 26-6).

Figure 26-6. Menu of options for grouping multiple values
Select Group to form the new grouping. Keep in mind that the first data value you select will become the name of the group. This is easy to change, though: simply double-click the grouped value to rename it (Figure 26-7).

Figure 26-7. Renaming a group in the Profile pane

When you edit the group (which you can do from the Changes pane), you can see which data values have been grouped together. This view shows you how any data values that match the grouped strings will also be grouped (Figure 26-8).

The easiest way to create a group within Prep Builder is to double-click a value in the Profile pane. This allows you to change all instances of that value in the data field (column). Keep in mind that the value must be the same data type as the data field.

Calculations

There are many ways to clean data values with calculations, but grouping cleaned data values is normally accomplished in a couple of ways.

IF statements

IF statements are a type of logic statement. Logic statements are commands to tell the software to work through a series of steps based on a true/false condition. In the example shown in Figure 26-9, the calculation is searching for the incorrect name 'Y0rk' and converting any matching values to 'York'. If the value isn’t 'Y0rk' then the calculation simply returns the value that already existed in the field.

Remember, if you give the new calculation the same name as the original data field, the results of the calculation will overwrite the old values. The Changes pane can act as an audit trail, but duplicating the field to compare the original values with the output values can be a useful sanity check. Completing the calculation in Prep Builder rather than waiting to complete the task in Desktop means this process has to be processed just once, rather than each time the data set is used. You can future-proof the calculation by adding in potential values and their corresponding conversions before they appear in the data set if you know they may pop up.

REPLACE() functions

REPLACE() functions will replace a character in a string data value with the character captured within the calculation. In Figure 26-10, the REPLACE() function is removing any zeros found in the data values and replacing them with lowercase letter os.

While you can create IF statements with many conditions (the longest I have written has 78 conditions), this is quite an arduous task. Each new variation you find that needs to be replaced requires another condition. REPLACE() statements can help here if only certain characters need to be removed. Any alphanumeric character that needs to be replaced can’t exist in any other string. This technique is fine in our example, as none of the town names needs to include a zero, but if we had an e instead of an o in York (i.e., “Yerk”), we wouldn’t be able to replace it without also accidentally turning Leeds into “Loods.”

Built-in Functionality

Some data preparation tools have their own built-in capabilities for grouping similar strings together. Prep Builder has pulled together many of the best features and made it very easy to use them. In addition to manual and calculation-based grouping, Prep Builder provides these three methods of grouping:

Pronunciation grouping is based on how the string would sound if spoken in English. Similar pronunciations are grouped together.
Common character–based grouping analyzes letters and sorts them into an order. Similar ordered strings are then grouped together. The original position of the characters isn’t important.
Spelling-based grouping evaluates the differences in characters used in strings and determines the changes required to spell all the terms similarly. The difference between the common character- and spelling-based grouping techniques is the position of common characters; the spelling option, unlike the common characters one, requires many letters to be in a similar order.

Using the Preppin’ Data 2019: Week 2 data set, let’s take the City field example (once the Data Interpreter has been used to import the relevant data; see Chapter 24 for more details on the Interpreter). Here the data field contains a number of spellings for each city. In this data set, I was just expecting values of Edinburgh and London, but Figure 26-11 shows the list of values that actually appeared in the column at the start of the exercise.

List of values in the City field in the 2019: Week 2 challenge

Let’s see what happens as we progressively use each of the built-in functions in the Group and Replace menu. First, using Pronunciation cleans up a lot of the data, as you can see in Figure 26-12.

Applying the Pronunciation Group and Replace option

However, not everything has been correctly grouped together. Like the Profile pane, the Group and Replace functions use histograms to show the dominant values. The correct spellings of Edinburgh and London—and the most common in this set—have formed the two majority groups. By selecting one of those two groupings, in this case Edinburgh, you can see the values that Prep Builder has grouped together. If any of these values is incorrect, you have two options:

Change the sensitivity of the Group and Replace function by moving the dot on the plus/minus Grouping scale. Move the dot toward the minus sign to reduce the sensitivity of the algorithms making the groupings. This means more data is likely to be added into one group. Conversely, move the dot toward the plus sign to increase the algorithm’s sensitivity so that less data is included.
Manually deselect values. By unchecking a selection in a grouping, you can remove this value, and all the related records, from the grouping.

Because the pronunciation option works on how the letters are spoken in English, the 3 at the start of 3d!nburgh or the rearranged letters of nodonL are not close enough to be considered a match. So, first save the progress made thus far by clicking Done in the Group and Replace controls, and then apply the Group and Replace functions again to the resulting field, this time selecting Common Characters. The resulting data is one step closer to the desired output (Figure 26-13).

Results of grouping by common characters

Because nodonL contains the same letters as London, the two terms are grouped together. Again, as London has the highest number of records, nodonL is added to the London grouping. To save this further progress, click Done. Finally, let’s address the last mismatch using the Spelling Group and Replace option (Figure 26-14).

Here you can see that 3d!nburgh has been grouped with the rest of the Edinburgh values, as there are only two characters that differ from the correct spelling.

Writing a calculation to cover all of these examples would take a lot of time, so this Group and Replace functionality can save you a lot of time and effort. Even though this data field is now ready for analysis, you must be careful if you plan to join this field back to a source data set where the values are still in their original “messy” state. Joining, covered in Chapter 16 and Chapter 32, must be applied to like-for-like values; therefore, this data set would no longer completely join to its earlier form. Understanding what is needed of the data set will determine where in the workflow to perform each data preparation task. Luckily, with Prep Builder, it is quick and easy to adjust this ordering as you iterate through your solution.

Summary

Before using Tableau Prep, grouping like strings together involved either long IF calculations or a lot of manual selections. Unfortunately, these methods would rarely update with the introduction of varying values. Prep’s use of Group and Replace algorithms to group strings together not only makes the initial preparation easier but also future-proofs the flow.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for
26. Cleaning by Grouping Data

Chapter 26. Cleaning by Grouping Data

What Does Grouping Mean?

Why Use Grouping

Improving Accuracy

Navigating the Data Hierarchy

Figure 26-1. Creating a region hierarchy of teams

Smoothing Reorganizations

Figure 26-2. Reorganization of Region field

Grouping Techniques

Manual

Figure 26-3. Team mismatches that would be resolved by grouping

Figure 26-4. Viewing an outlier in the Profile pane

Figure 26-5. Selecting multiple values in Prep Builder

Figure 26-6. Menu of options for grouping multiple values

Figure 26-7. Renaming a group in the Profile pane

Figure 26-8. Values that are part of the York group

Calculations

IF statements

Figure 26-9. Calculation to correct the `'Y0rk'` value

REPLACE() functions

Figure 26-10. `REPLACE()` function in Prep

Built-in Functionality

Figure 26-11. List of values in the City field in the 2019: Week 2 challenge

Figure 26-12. Applying the Pronunciation Group and Replace option

Figure 26-13. Results of grouping by common characters

Figure 26-14. Results of grouping by spelling

Summary

Table of Contents for 26. Cleaning by Grouping Data

Create new playlist

Sign In

Sign Up

Chapter 26. Cleaning by Grouping Data

What Does Grouping Mean?

Why Use Grouping

Improving Accuracy

Navigating the Data Hierarchy

Figure 26-1. Creating a region hierarchy of teams

Smoothing Reorganizations

Figure 26-2. Reorganization of Region field

Grouping Techniques

Manual

Figure 26-3. Team mismatches that would be resolved by grouping

Figure 26-4. Viewing an outlier in the Profile pane

Figure 26-5. Selecting multiple values in Prep Builder

Figure 26-6. Menu of options for grouping multiple values

Figure 26-7. Renaming a group in the Profile pane

Figure 26-8. Values that are part of the York group

Calculations

IF statements

Figure 26-9. Calculation to correct the 'Y0rk' value

REPLACE() functions

Figure 26-10. REPLACE() function in Prep

Built-in Functionality

Figure 26-11. List of values in the City field in the 2019: Week 2 challenge

Figure 26-12. Applying the Pronunciation Group and Replace option

Figure 26-13. Results of grouping by common characters

Figure 26-14. Results of grouping by spelling

Summary

Table of Contents for
26. Cleaning by Grouping Data

Figure 26-9. Calculation to correct the `'Y0rk'` value

Figure 26-10. `REPLACE()` function in Prep