When cleaning data, finding unexpected characters in your data fields can cause significant issues. Those issues can occur at multiple points during your data preparation: loading, using, and outputting the data. Therefore, this chapter is focused on building your understanding of what unwanted characters are, the problems they introduce, and how to remove them.
An unwanted character is simply a letter, number, or symbol within your data field that you do not need or that introduces potential problems in your data output. Data software is often very precise about what it is processing, and rightly so; otherwise, it could easily produce erroneous output. For example, if there are different data types within a single data field, it affects whether a field can be aggregated logically. Let’s look at the three main types of data fields that can be affected by unwanted characters when they’re loaded into Prep Builder:
The most common unwanted character in data preparation is the humble space. The space between characters is easy to spot, but that’s not the case with leading or trailing spaces (i.e., a space at the start or end of a string). These spaces still count as characters in string functions like LEFT()
, RIGHT()
, MID()
, and SPLIT()
, so they can cause issues in most of the common string data preparation steps.
The challenge of unwanted characters isn’t very different from many others in data preparation. However, where they do differ is in the fact that they present potentially hard-to-find, individual values to clean rather than entire data fields. Identifying those individual values with the unwanted characters can be a challenge, especially in a string field, which doesn’t simply return a null value on input.
The main issue with unwanted characters is that they can increase the complexity of your data prep by preventing you from being able to apply a single rule to all values in a data field. For example, a numeric field with a hidden non-numeric character lurking in just one value cannot be simply aggregated, preventing many of the common steps from working normally. Fortunately, instead of trying to find the needle in the haystack, you can use some of Prep Builder’s built-in functionality to assist with this task.
Figure 29-1 includes at least one unwanted character in each data field to demonstrate the potential effects. In the Date field, the letter c
has replaced the day. In the Store field, an exclamation mark has replaced the l
in Clapham
. The Type field has two issues: an 8
instead of a B
and a leading space.
The data set has been saved as a comma-separated value (CSV) file. Tableau Prep Builder loads a CSV file’s data fields as strings by default. This means you will always need to set the data types in Tableau Prep before using functions specific to types other than strings (Figure 29-2).
Changing the different fields into the data type required for analysis highlights the unwanted characters shown earlier in Figure 29-1. Date values that do not conform to the expected date format are converted to nulls (Figure 29-3).
The sales figures that are not solely numeric are also converted to nulls (Figure 29-4).
The string values with the exclamation point and number 8
remain as strings, but because they clearly differ they are split apart in the Profile pane (Figure 29-5). The Bar
value with a leading space has been cleaned up naturally by Prep. It can do this for some data sources, but not all.
Therefore, to make the data usable, you have to remove the unwanted characters before completing the data type conversion.
How you remove unwanted characters depends on the character and the data type the field should be.
With free text entry, words and names can easily be mistyped by the person who entered the information. Often, the values with typos are close enough to other similar values that Prep Builder’s Group and Replace functionality can correct the unwanted characters.
To use this technique, open the ellipsis menu of the string data field, select Group and Replace, and then select Spelling (Figure 29-6).
This results in just a single value for Clapham, and the correct spelling is taken from the most common result. You can apply the same technique using numbers instead of letters, as in the 8ar example. After you apply the Spelling grouping technique to both columns, the resulting data set is a single name in each column (Figure 29-7).
Grouping techniques are covered in more detail in Chapter 26.
When you convert the Sales field to a numeric data type, the d50
value results in a null due to the letter d
. To resolve this, first you use Prep Builder’s cleaning functionality in the ellipsis menu to remove letters from the data field (Figure 29-8).
Next, you convert the data field to a numeric data type (Figure 29-9).
This results in a set of values that you want for the analysis (Figure 29-10).
If there are additional unwanted characters, in many cases you can simply repeat the cleaning technique described in the preceding section to remove them. If the character is mistyped, however, it might require some human logic to determine what the actual value should be. You can search manually, but in large data sets this can take a significant amount of time. Instead, regular expressions (regexes), covered in Chapter 31, can help you find the unwanted characters (a letter in this case). The regex calculation shown in Figure 29-11 returns True
when it finds the letter c in the Date field.
Prep Builder clearly indicates when a letter appears somewhere within the string field. The Profile pane highlights the value(s) in question (Figure 29-12).
You can then manually assess how to resolve the issue. In this example data set, we expect to have a record per date, so clearly we need to rename 0c to 03. To do this, simply double-click the record in the Profile pane and update it with the correct value (Figure 29-13).
You can then change the data type to date without converting any values to null.