Chapter 29. Dealing with Unwanted Characters

When cleaning data, finding unexpected characters in your data fields can cause significant issues. Those issues can occur at multiple points during your data preparation: loading, using, and outputting the data. Therefore, this chapter is focused on building your understanding of what unwanted characters are, the problems they introduce, and how to remove them.

What Is an Unwanted Character?

An unwanted character is simply a letter, number, or symbol within your data field that you do not need or that introduces potential problems in your data output. Data software is often very precise about what it is processing, and rightly so; otherwise, it could easily produce erroneous output. For example, if there are different data types within a single data field, it affects whether a field can be aggregated logically. Let’s look at the three main types of data fields that can be affected by unwanted characters when they’re loaded into Prep Builder:

Numeric fields
If a non-numeric character is loaded into a numeric field, the field will be imported as a string and thus can no longer be used in aggregations. For example, what should 10 + 1c3 equal? As you can see, that calculation isn’t possible, and that is why a numeric field must contain only numeric values.
Dates
If a non-numeric character is found in a field expecting only date values, the date value with the unwanted character will appear as a null because the date will be in an invalid format—for example, 30/0/4/2023.
Strings
Strings are very flexible data types and therefore generally won’t cause an error if they’re imported with unwanted characters. The only exception is when a character is used outside those permitted by the data source or Tableau Prep, such as certain characters from non-English alphabets.

The most common unwanted character in data preparation is the humble space. The space between characters is easy to spot, but that’s not the case with leading or trailing spaces (i.e., a space at the start or end of a string). These spaces still count as characters in string functions like LEFT(), RIGHT(), MID(), and SPLIT(), so they can cause issues in most of the common string data preparation steps.

Issues Caused by Unwanted Characters

The challenge of unwanted characters isn’t very different from many others in data preparation. However, where they do differ is in the fact that they present potentially hard-to-find, individual values to clean rather than entire data fields. Identifying those individual values with the unwanted characters can be a challenge, especially in a string field, which doesn’t simply return a null value on input.

The main issue with unwanted characters is that they can increase the complexity of your data prep by preventing you from being able to apply a single rule to all values in a data field. For example, a numeric field with a hidden non-numeric character lurking in just one value cannot be simply aggregated, preventing many of the common steps from working normally. Fortunately, instead of trying to find the needle in the haystack, you can use some of Prep Builder’s built-in functionality to assist with this task.

Figure 29-1 includes at least one unwanted character in each data field to demonstrate the potential effects. In the Date field, the letter c has replaced the day. In the Store field, an exclamation mark has replaced the l in Clapham. The Type field has two issues: an 8 instead of a B and a leading space.

A data set containing unwanted characters
Figure 29-1. A data set containing unwanted characters

The data set has been saved as a comma-separated value (CSV) file. Tableau Prep Builder loads a CSV file’s data fields as strings by default. This means you will always need to set the data types in Tableau Prep before using functions specific to types other than strings (Figure 29-2).

The Input metadata grid of a CSV file showing only string data types
Figure 29-2. The Input metadata grid of a CSV file showing only string data types

Changing the different fields into the data type required for analysis highlights the unwanted characters shown earlier in Figure 29-1. Date values that do not conform to the expected date format are converted to nulls (Figure 29-3).

Null values resulting from converting the string field with unwanted characters to a date
Figure 29-3. Null values resulting from converting the string field with unwanted characters to a date

The sales figures that are not solely numeric are also converted to nulls (Figure 29-4).

Sales records with non-numeric values are also converted to nulls
Figure 29-4. Sales records with non-numeric values are also converted to nulls

The string values with the exclamation point and number 8 remain as strings, but because they clearly differ they are split apart in the Profile pane (Figure 29-5). The Bar value with a leading space has been cleaned up naturally by Prep. It can do this for some data sources, but not all.

Unwanted characters shown in the Profile pane
Figure 29-5. Unwanted characters shown in the Profile pane

Therefore, to make the data usable, you have to remove the unwanted characters before completing the data type conversion.

Removing Unwanted Characters

How you remove unwanted characters depends on the character and the data type the field should be.

Strings with Mistyped Characters

With free text entry, words and names can easily be mistyped by the person who entered the information. Often, the values with typos are close enough to other similar values that Prep Builder’s Group and Replace functionality can correct the unwanted characters.

To use this technique, open the ellipsis menu of the string data field, select Group and Replace, and then select Spelling (Figure 29-6).

Group and Replace menu
Figure 29-6. Group and Replace menu

This results in just a single value for Clapham, and the correct spelling is taken from the most common result. You can apply the same technique using numbers instead of letters, as in the 8ar example. After you apply the Spelling grouping technique to both columns, the resulting data set is a single name in each column (Figure 29-7).

Result of Spelling grouping
Figure 29-7. Result of Spelling grouping
Note

Grouping techniques are covered in more detail in Chapter 26.

Numbers with Unwanted Characters

When you convert the Sales field to a numeric data type, the d50 value results in a null due to the letter d. To resolve this, first you use Prep Builder’s cleaning functionality in the ellipsis menu to remove letters from the data field (Figure 29-8).

Clean menu Remove Letters option
Figure 29-8. Clean menu Remove Letters option

Next, you convert the data field to a numeric data type (Figure 29-9).

Converting a string field to an integer field
Figure 29-9. Converting a string field to an integer field

This results in a set of values that you want for the analysis (Figure 29-10).

Result of type conversion after removing unwanted characters
Figure 29-10. Result of type conversion after removing unwanted characters

Dates with Mistyped Characters

If there are additional unwanted characters, in many cases you can simply repeat the cleaning technique described in the preceding section to remove them. If the character is mistyped, however, it might require some human logic to determine what the actual value should be. You can search manually, but in large data sets this can take a significant amount of time. Instead, regular expressions (regexes), covered in Chapter 31, can help you find the unwanted characters (a letter in this case). The regex calculation shown in Figure 29-11 returns True when it finds the letter c in the Date field.

Regex calculation to identify values containing a letter character
Figure 29-11. Regex calculation to identify values containing a letter character

Prep Builder clearly indicates when a letter appears somewhere within the string field. The Profile pane highlights the value(s) in question (Figure 29-12).

Letter characters highlighted in the Profile pane
Figure 29-12. Letter characters highlighted in the Profile pane

You can then manually assess how to resolve the issue. In this example data set, we expect to have a record per date, so clearly we need to rename 0c to 03. To do this, simply double-click the record in the Profile pane and update it with the correct value (Figure 29-13).

Manually removing unwanted characters
Figure 29-13. Manually removing unwanted characters

You can then change the data type to date without converting any values to null.

Summary

Unwanted characters can be quite challenging, but you can largely handle them with Prep Builder’s built-in functionality. For some typos, however, you’ll need to create more custom solutions to help identify and then manually correct the issue.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset