One of the most common actions you will take when preparing string data for analysis is to split a string field into its subparts. Splitting was briefly mentioned in Chapter 9, which covered the basics of working with strings if you need to refresh your memory about the data type. Splitting is required for many reasons, such as operational systems picking up data and outputting unique IDs for each record or squeezing records together to fit them into a specified database table. The human brain is fantastic at spotting patterns in data (that’s why we create visual analytics, after all), so you can often spot the need to split data fields (columns) by just looking at the data set.
For example, in Figure 25-1, we can see that we probably need to split the Product Code field on the left into three separate columns (on the right) in order to help us analyze this data set.
Splitting data in most data tools is very easy; Prep Builder is no different. Simply choose the data field you want to split, click the ellipsis in the top right of the field, select Split Values, and then click Automatic Split. Prep Builder will split up the field using what it believes to be the most appropriate logic (Figure 25-2).
In this case, the automatic split has worked as desired (Figure 25-3).
To perform this task, Prep Builder actually writes three Calculated Fields, which you can always edit if you want a slightly different result (Figure 25-4).
You will find these calculations in the Changes pane in Prep Builder. You can also edit them from here, or alternatively learn how Prep completed the task you set it (Figure 25-5).
This formula is splitting the Product Code field on the hyphen (-) separator and pulling back the third part (i.e., what comes after the second hyphen but before the fourth, which in this case doesn’t exist). The resulting values are then trimmed, which means any leading or trailing spaces are removed, as these can wreak havoc when you are matching text values.
Let’s tweak the data a little bit so that the pattern of delimiters (the character we split the data field by) is a little less obvious to Prep Builder (Figure 25-6).
Now the Product Type (first part of the Product Code) for Soap Bar no longer includes the word Soap and each field now includes spaces around the hyphens. With this irregular pattern, Prep no longer gives us the results we are looking for (Figure 25-7).
The Product Type has disappeared altogether in the data set resulting from this split. Notice all those blank records? They are the result of the word Soap being removed from the Soap Bar rows; note there are no blanks resulting from the Liquid Soap rows. Let’s look at the calculation Prep is writing for this split (Figure 25-8).
Well, that’s not clear, is it?
These situations are where your human eye can take over, and you should choose the Custom Split menu option instead of Automatic Split. By setting up the Custom Split to work on the hyphen, we’re back to getting the results we want from this data set (Figure 25-9).
In the Custom Split dialog, you set both the delimiter and what you want returned from the split. In Figure 25-9, selecting All ensures that all the data is returned even if you expect a certain number of columns to be created (Figure 25-10).
At the time of writing, the All option is limited to splitting 150 items.
The Custom Split option can save you from having to write a number of complex calculations. In this example, the third calculation Prep Builder has built is quite complicated, so you could simplify it if you wanted, but there is really no need to unless your flow’s performance is lacking.
Although SPLIT()
is a powerful function, it’s not always the right technique for your task. Let’s look at a few scenarios where this might be the case.
In Figure 25-11, the address details are separated by commas. However, it is difficult to define an appropriate split because the street part of these addresses is inconsistently formatted, with varying numbers and punctuation. Addresses are typically difficult to split and likely will require more complex calculations and logic to align the correct parts to the correct columns.
If you want to tackle this problem, have a go at the 2019: Week 46 Challenge, which you can use splitting to solve. The solution is available on the Preppin’ Data website for you to check your flow, although there is often more than one right answer; we care mainly about whether you’ve achieved the data set in the output given, not how you arrived at it.
If there is no clear and logical separator to split by, then the split technique is definitely not appropriate. In this circumstance, more advanced string functions like FIND()
or FINDNTH()
might be a good approach. These functions allow you to find out if a character or substring exists within a string by its position. You can then pair these functions with LEFT()
, RIGHT()
, or MID()
to split out the parts of the field required. These functions were covered in more detail in Chapter 9. Another option in this situation is to use regular expressions (or regexes), which will be covered in Chapter 31.
Ultimately, SPLIT()
is a great first function to investigate when breaking up data columns to aid your analysis. Prep has a couple of great options, but might not always deliver the correct solution, in which case you’ll need to build the logic yourself. Fortunately, SPLIT()
generates the calculation logic and makes the calculation available in the Changes pane for editing, so it is easy to make the changes you require.