Chapter 31. Using Regular Expressions

Data can be complicated; it’s not all just adding values together or counting things. For example, string data fields can have a lot of complexity, but using them is often essential in forming your data sets and analysis. Regular expressions (or regexes) are a set of commands that make parsing difficult strings much more achievable. In this chapter, we will cover what regexes are exactly, how they can be used in Prep Builder, and how to use some common regex functions.

What Are Regular Expressions?

Regular expressions are calculations that search for a pattern of characters (e.g., the value population) in a string of characters (e.g., "The population of the United Kingdom is 66,650,000 people"). Regexes are much more flexible in the patterns they can identify compared to the other string functions we’ve seen like LEFT(), MID(), and RIGHT().

One advantage of learning regexes is that they share a common set of commands across most data tools and therefore are highly transferable between data preparation tools. Another is how they can help you avoid building very complex string calculations using the simple string functions this book has covered thus far.

How to Use Regexes in Prep

There are three main ways to use regexes in Prep (Figure 31-1).

Regex functions available in Prep
Figure 31-1. Regex functions available in Prep

To help you understand how to make use of regexes in Prep Builder, we’ll go through each function before addressing specific examples.

REGEXP_EXTRACT() and REGEXP_EXTRACT_NTH()

REGEXP_EXTRACT() allows you to extract part of a string (a substring). The part returned is enclosed by brackets within the regular expression; this is called the capturing group and it appears next to the regex pattern inside parentheses.

REGEXP_EXTRACT([String Field],'regex pattern')

The REGEXP_EXTRACT_NTH() function returns the nth instance of the capturing group if it exists.

REGEXP_MATCH()

The REGEXP_MATCH() function returns a Boolean result of either True or False depending upon whether or not a given string matches the specified regex pattern. The entire string does not have to match the regex pattern to return a True result, only a substring.

REGEXP_MATCH([String Field],'regex pattern')

REGEXP_REPLACE()

The REGEXP_REPLACE() function specifies a part of a string field to replace as well as the value to replace it with.

 REGEXP_REPLACE([String Field],'Substring to replace','Replacement value')

Regex Use Cases

The flexibility offered by regexes allows for a vast array of use cases. This section describes some of the most common scenarios.

Replacing Common Mistakes

Many string fields come from manually entered text. This can mean a data field contains repeated mistakes that will need to be removed before accurate analysis can occur. For example, if you have the times of different events captured but in different formats, you’ll need to extract the relevant values from the strings before you can begin your analysis (Figure 31-2).

Basic Time field
Figure 31-2. Basic Time field

In this instance, the time needs to be extracted from this string. Figure 31-3 shows how you can use REGEXP_EXTRACT() to achieve this.

Using REGEXP_EXTRACT() to extract the time
Figure 31-3. Using REGEXP_EXTRACT() to extract the time

This returns the result you need (Figure 31-4).

Result of REGEXP_EXTRACT() calculation in #using_regexp_extractleft_parenthesisrig
Figure 31-4. Result of REGEXP_EXTRACT() calculation in Figure 31-3

Anonymizing Comments or Feedback

One of the biggest strengths of regexes is finding substrings in a much longer string that match a specified pattern. One case where this functionality can be especially useful is identifying a person’s details and removing them from the data source. Let’s use the sample output from a Chin & Beard Suds Co. feedback form shown in Figure 31-5.

Customer feedback to be anonymized
Figure 31-5. Customer feedback to be anonymized

By identifying a common pattern in the text, you can specify that pattern in a regex to replace the text in question. For this example, we want to identify the pattern [email protected]. We use the encoding [.w]+@[.w]+ to anonymize any word characters before the @ symbol and any word characters after it. A word character is a letter, digit, or underscore. The space at the start and end of the email address indicates that the whole Description field isn’t converted.

In our example, we will replace the email address with [Hidden Details] so it is easy to remove or leave in future analyses. Figure 31-6 shows the calculation to do this within Prep Builder.

Using REGEXP_REPLACE() to anonymize email addresses
Figure 31-6. Using REGEXP_REPLACE() to anonymize email addresses

The resulting data set is shown in Figure 31-7.

Resulting data set from the REGEX_REPLACE() example
Figure 31-7. Resulting data set from the REGEX_REPLACE() example

Common Regex Commands

In order to be able to use regex functions, you will need to be able to specify the patterns you’re looking for using regex commands. In regexes, case sensitivity depends on the data source Prep Builder is connected to, so you will need to take care when entering string names. Table 31-1 lists the key regex commands you’ll use.

Table 31-1. Common regex commands
d Any digit 0–9
d+ One or more digits
a-z Any lowercase letter
A-Z Any uppercase letter
D Any nondigit
s Whitespace
w Any word character
^ Start of a string
$ End of a string
( ) Capture anything within the parentheses, as in REGEXP_EXTRACT() function
{ } Exactly a number of the given substring; set up by adding a number between the brackets (i.e., {3})

For testing, you can use Prep or get instant feedback on websites built for testing regexes (e.g., http://regex101.com).

Summary

Regular expressions allow you to clean complex string fields in a much more simplistic and reusable way. Although there is more to learn with regexes compared to the other string functions used to clean data, the investment of time is worth it, as they make it much easier to prepare your data for analysis.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset