Chapter 25. Splitting Data Fields

One of the most common actions you will take when preparing string data for analysis is to split a string field into its subparts. Splitting was briefly mentioned in Chapter 9, which covered the basics of working with strings if you need to refresh your memory about the data type. Splitting is required for many reasons, such as operational systems picking up data and outputting unique IDs for each record or squeezing records together to fit them into a specified database table. The human brain is fantastic at spotting patterns in data (that’s why we create visual analytics, after all), so you can often spot the need to split data fields (columns) by just looking at the data set.

For example, in Figure 25-1, we can see that we probably need to split the Product Code field on the left into three separate columns (on the right) in order to help us analyze this data set.

Result of splitting Product Code field
Figure 25-1. Result of splitting Product Code field

Basic Splits

Splitting data in most data tools is very easy; Prep Builder is no different. Simply choose the data field you want to split, click the ellipsis in the top right of the field, select Split Values, and then click Automatic Split. Prep Builder will split up the field using what it believes to be the most appropriate logic (Figure 25-2).

Selecting Automatic Split from a data field’s menu
Figure 25-2. Selecting Automatic Split from a data field’s menu

In this case, the automatic split has worked as desired (Figure 25-3).

Result of the automatic split from #selecting_automatic_split_from_a_data_f
Figure 25-3. Result of the automatic split from Figure 25-2

To perform this task, Prep Builder actually writes three Calculated Fields, which you can always edit if you want a slightly different result (Figure 25-4).

Calculations in the Changes pane resulting from an automatic split
Figure 25-4. Calculations in the Changes pane resulting from an automatic split

You will find these calculations in the Changes pane in Prep Builder. You can also edit them from here, or alternatively learn how Prep completed the task you set it (Figure 25-5).

Opening one of the split calculations in the Calculation Editor
Figure 25-5. Opening one of the split calculations in the Calculation Editor

This formula is splitting the Product Code field on the hyphen (-) separator and pulling back the third part (i.e., what comes after the second hyphen but before the fourth, which in this case doesn’t exist). The resulting values are then trimmed, which means any leading or trailing spaces are removed, as these can wreak havoc when you are matching text values.

Advanced Splits: When Automatic Splits Don’t Work as Intended

Let’s tweak the data a little bit so that the pattern of delimiters (the character we split the data field by) is a little less obvious to Prep Builder (Figure 25-6).

Revised data set with unclear delimiter for Prep Builder
Figure 25-6. Revised data set with unclear delimiter for Prep Builder

Now the Product Type (first part of the Product Code) for Soap Bar no longer includes the word Soap and each field now includes spaces around the hyphens. With this irregular pattern, Prep no longer gives us the results we are looking for (Figure 25-7).

Result of an automatic split of the data set in #revised_data_set_with_unclear_delimiter
Figure 25-7. Result of an automatic split of the data set in Figure 25-6

The Product Type has disappeared altogether in the data set resulting from this split. Notice all those blank records? They are the result of the word Soap being removed from the Soap Bar rows; note there are no blanks resulting from the Liquid Soap rows. Let’s look at the calculation Prep is writing for this split (Figure 25-8).

Using the Calculation Editor to understand the split issue
Figure 25-8. Using the Calculation Editor to understand the split issue

Well, that’s not clear, is it?

These situations are where your human eye can take over, and you should choose the Custom Split menu option instead of Automatic Split. By setting up the Custom Split to work on the hyphen, we’re back to getting the results we want from this data set (Figure 25-9).

Setting up a custom split
Figure 25-9. Setting up a custom split

In the Custom Split dialog, you set both the delimiter and what you want returned from the split. In Figure 25-9, selecting All ensures that all the data is returned even if you expect a certain number of columns to be created (Figure 25-10).

At the time of writing, the All option is limited to splitting 150 items.

Result of the custom split
Figure 25-10. Result of the custom split

The Custom Split option can save you from having to write a number of complex calculations. In this example, the third calculation Prep Builder has built is quite complicated, so you could simplify it if you wanted, but there is really no need to unless your flow’s performance is lacking.

When Not to Split Data

Although SPLIT() is a powerful function, it’s not always the right technique for your task. Let’s look at a few scenarios where this might be the case.

Address Data

In Figure 25-11, the address details are separated by commas. However, it is difficult to define an appropriate split because the street part of these addresses is inconsistently formatted, with varying numbers and punctuation. Addresses are typically difficult to split and likely will require more complex calculations and logic to align the correct parts to the correct columns.

Address data separated by commas
Figure 25-11. Address data separated by commas
Note

If you want to tackle this problem, have a go at the 2019: Week 46 Challenge, which you can use splitting to solve. The solution is available on the Preppin’ Data website for you to check your flow, although there is often more than one right answer; we care mainly about whether you’ve achieved the data set in the output given, not how you arrived at it.

No Clear Delimiter

If there is no clear and logical separator to split by, then the split technique is definitely not appropriate. In this circumstance, more advanced string functions like FIND() or FINDNTH() might be a good approach. These functions allow you to find out if a character or substring exists within a string by its position. You can then pair these functions with LEFT(), RIGHT(), or MID() to split out the parts of the field required. These functions were covered in more detail in Chapter 9. Another option in this situation is to use regular expressions (or regexes), which will be covered in Chapter 31.

Summary

Ultimately, SPLIT() is a great first function to investigate when breaking up data columns to aid your analysis. Prep has a couple of great options, but might not always deliver the correct solution, in which case you’ll need to build the logic yourself. Fortunately, SPLIT() generates the calculation logic and makes the calculation available in the Changes pane for editing, so it is easy to make the changes you require.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset