Chapter 26. Cleaning by Grouping Data

Data preparation would not be necessary if we always had someone else curating a perfect data set for us. However, we can (and sadly often have to) clean the data ourselves. As mentioned in Chapter 9, one of the most common challenges you’ll face in data prep is cleaning up string data—for example, standardizing the string values enough to be able to count instances of values even when they have typos. One technique can come in especially handy for this scenario: grouping. This chapter will cover what grouping is and how to use the built-in grouping tools in Prep Builder.

 

What Does Grouping Mean?

Grouping means applying logic to (mostly) string data fields to recognize a common characteristic among them, such as their meaning or intended value. For example, we might expect the following data items to be grouped together:

  • Edinburgh

  • Edenburgh

  • Edinborough

  • 3d!nburgh

As humans, we can recognize that all these different names probably all refer to Edinburgh, Scotland (especially if the column were called City Name). But data software does not see this data the same way, so we have to give it some direction for how to handle these different collections of characters.

Why Use Grouping

Grouping is a technique you need to learn for multiple reasons.

Improving Accuracy

When they hear the term “data,” most people seem to think about system-generated data being fed into databases. After a year of data analysis, most data workers would consider themselves lucky to work with such data! Most data sources people use are still manually compiled. Even leading equity research agencies manually input corporate results from PDFs to build the data sources for their analysis. And therein lies the problem.

Manually entering data increases the risk of mistyping a letter or number. Add in modern-day deadline-driven culture to this mix, and you are looking at a potentially very messy data set. If you haven’t invested in autogenerated data, you probably also don’t have a heavily resourced IT team who will clean up those mistakes. Fortunately, grouping allows you to combine together incorrect and correct values to form a fully correct data set.

Smoothing Reorganizations

Reorganizations used to be a nightmare to handle from a data perspective. Organizations might decide to change management structure or restructure a few teams’ sales territories without looping in the data team. Building significant and robust reference tables to reflect and apply this organizational restructuring often takes too long. Grouping allows us to handle these challenges on an ad hoc basis in the short term (Figure 26-2) until more formal reference tables can be built.

Reorganization of region field
Figure 26-2. Reorganization of Region field

Grouping Techniques

Grouping and replacing values in Tableau Prep is one of its most useful functions, and it’s also very easy to use. As this section will describe, there are three main techniques for grouping values, which allows Prep to handle a large range of use cases.

Manual

Manual grouping is great for a simple, quick fix. However, for a larger, more complex challenge, this technique will soon become frustrating.

Manual grouping allows you to apply your own logic to a data set by selecting data points to group together (Figure 26-3).

Team mismatches that would be resolved by grouping
Figure 26-3. Team mismatches that would be resolved by grouping

Clearly, in this example we want to make the second row of data the same as the other instances of York. Each tool has its own method for this, but in Prep Builder the solution is very simple:

  1. Use the Profile pane to quickly see odd values in the data set (Figure 26-4).

    Viewing an outlier in the Profile pane
    Figure 26-4. Viewing an outlier in the Profile pane
  2. Select the values you would like to clean/amend by holding down the Ctrl key or the Command key on Mac (Figure 26-5).

    Selecting multiple values in Prep Builder
    Figure 26-5. Selecting multiple values in Prep Builder
  3. Right-click to bring up the menu (Figure 26-6).

  4. Select Group to form the new grouping. Keep in mind that the first data value you select will become the name of the group. This is easy to change, though: simply double-click the grouped value to rename it (Figure 26-7).

    Renaming a group in the Profile pane
    Figure 26-7. Renaming a group in the Profile pane

When you edit the group (which you can do from the Changes pane), you can see which data values have been grouped together. This view shows you how any data values that match the grouped strings will also be grouped (Figure 26-8).

Values that are part of the York group
Figure 26-8. Values that are part of the York group

The easiest way to create a group within Prep Builder is to double-click a value in the Profile pane. This allows you to change all instances of that value in the data field (column). Keep in mind that the value must be the same data type as the data field.

Calculations

There are many ways to clean data values with calculations, but grouping cleaned data values is normally accomplished in a couple of ways.

IF statements

IF statements are a type of logic statement. Logic statements are commands to tell the software to work through a series of steps based on a true/false condition. In the example shown in Figure 26-9, the calculation is searching for the incorrect name 'Y0rk' and converting any matching values to 'York'. If the value isn’t 'Y0rk' then the calculation simply returns the value that already existed in the field.

Calculation to correct the 'Y0rk' value
Figure 26-9. Calculation to correct the 'Y0rk' value

Remember, if you give the new calculation the same name as the original data field, the results of the calculation will overwrite the old values. The Changes pane can act as an audit trail, but duplicating the field to compare the original values with the output values can be a useful sanity check. Completing the calculation in Prep Builder rather than waiting to complete the task in Desktop means this process has to be processed just once, rather than each time the data set is used. You can future-proof the calculation by adding in potential values and their corresponding conversions before they appear in the data set if you know they may pop up.

REPLACE() functions

REPLACE() functions will replace a character in a string data value with the character captured within the calculation. In Figure 26-10, the REPLACE() function is removing any zeros found in the data values and replacing them with lowercase letter os.

REPLACE() function in Prep
Figure 26-10. REPLACE() function in Prep

While you can create IF statements with many conditions (the longest I have written has 78 conditions), this is quite an arduous task. Each new variation you find that needs to be replaced requires another condition. REPLACE() statements can help here if only certain characters need to be removed. Any alphanumeric character that needs to be replaced can’t exist in any other string. This technique is fine in our example, as none of the town names needs to include a zero, but if we had an e instead of an o in York (i.e., “Yerk”), we wouldn’t be able to replace it without also accidentally turning Leeds into “Loods.”

Built-in Functionality

Some data preparation tools have their own built-in capabilities for grouping similar strings together. Prep Builder has pulled together many of the best features and made it very easy to use them. In addition to manual and calculation-based grouping, Prep Builder provides these three methods of grouping:

  • Pronunciation grouping is based on how the string would sound if spoken in English. Similar pronunciations are grouped together.

  • Common character–based grouping analyzes letters and sorts them into an order. Similar ordered strings are then grouped together. The original position of the characters isn’t important.

  • Spelling-based grouping evaluates the differences in characters used in strings and determines the changes required to spell all the terms similarly. The difference between the common character- and spelling-based grouping techniques is the position of common characters; the spelling option, unlike the common characters one, requires many letters to be in a similar order.

Using the Preppin’ Data 2019: Week 2 data set, let’s take the City field example (once the Data Interpreter has been used to import the relevant data; see Chapter 24 for more details on the Interpreter). Here the data field contains a number of spellings for each city. In this data set, I was just expecting values of Edinburgh and London, but Figure 26-11 shows the list of values that actually appeared in the column at the start of the exercise.

List of values in the City field in the 2019: Week 2 challenge
Figure 26-11. List of values in the City field in the 2019: Week 2 challenge

Let’s see what happens as we progressively use each of the built-in functions in the Group and Replace menu. First, using Pronunciation cleans up a lot of the data, as you can see in Figure 26-12.

Applying the Pronunciation Group and Replace option
Figure 26-12. Applying the Pronunciation Group and Replace option

However, not everything has been correctly grouped together. Like the Profile pane, the Group and Replace functions use histograms to show the dominant values. The correct spellings of Edinburgh and London—and the most common in this set—have formed the two majority groups. By selecting one of those two groupings, in this case Edinburgh, you can see the values that Prep Builder has grouped together. If any of these values is incorrect, you have two options:

  • Change the sensitivity of the Group and Replace function by moving the dot on the plus/minus Grouping scale. Move the dot toward the minus sign to reduce the sensitivity of the algorithms making the groupings. This means more data is likely to be added into one group. Conversely, move the dot toward the plus sign to increase the algorithm’s sensitivity so that less data is included.

  • Manually deselect values. By unchecking a selection in a grouping, you can remove this value, and all the related records, from the grouping.

Because the pronunciation option works on how the letters are spoken in English, the 3 at the start of 3d!nburgh or the rearranged letters of nodonL are not close enough to be considered a match. So, first save the progress made thus far by clicking Done in the Group and Replace controls, and then apply the Group and Replace functions again to the resulting field, this time selecting Common Characters. The resulting data is one step closer to the desired output (Figure 26-13).

Results of grouping by common characters
Figure 26-13. Results of grouping by common characters

Because nodonL contains the same letters as London, the two terms are grouped together. Again, as London has the highest number of records, nodonL is added to the London grouping. To save this further progress, click Done. Finally, let’s address the last mismatch using the Spelling Group and Replace option (Figure 26-14).

Results of grouping by spelling
Figure 26-14. Results of grouping by spelling

Here you can see that 3d!nburgh has been grouped with the rest of the Edinburgh values, as there are only two characters that differ from the correct spelling.

Writing a calculation to cover all of these examples would take a lot of time, so this Group and Replace functionality can save you a lot of time and effort. Even though this data field is now ready for analysis, you must be careful if you plan to join this field back to a source data set where the values are still in their original “messy” state. Joining, covered in Chapter 16 and Chapter 32, must be applied to like-for-like values; therefore, this data set would no longer completely join to its earlier form. Understanding what is needed of the data set will determine where in the workflow to perform each data preparation task. Luckily, with Prep Builder, it is quick and easy to adjust this ordering as you iterate through your solution.

Summary

Before using Tableau Prep, grouping like strings together involved either long IF calculations or a lot of manual selections. Unfortunately, these methods would rarely update with the introduction of varying values. Prep’s use of Group and Replace algorithms to group strings together not only makes the initial preparation easier but also future-proofs the flow.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset