My favorite data joke is, “What’s a ghost’s favorite data type?”
“BOO-lean.”
Now that that’s out the way, let’s talk about potentially the simplest data type but one that sits at the heart of a lot of what we do in data analytics and therefore is an important part of data preparation: Boolean data. In this chapter, we’ll cover what it is and why it is useful when analyzing data.
The word Boolean comes from the mathematician George Boole. He was Cork University’s first mathematics professor, and his theorems were eventually applied to computing. Boolean data is simply a True
or False
response to a conditional statement or test.
The response of True
or False
is often encoded as 1
or 0
behind the scenes in the software we use. Therefore, the performance of calculations that use Boolean data is very quick. Computing is based on 1
s and 0
s, so Boolean data is easily processed by a computer.
A simple column of 1
or 0
responses is actually amazingly useful in data analysis for many reasons beyond just performance:
Here, an indicator refers to a field or set of fields that indicate whether or not each record fits some criteria. These can be analyzed very simply in most data tools. For example, if you want to count how many customers have a certain product type, it’s very simple to sum a Boolean indicator of 1
s and 0
s in most tools (Figure 10-1).
SUM()
or AVG()
function to form simple summary statistics about the data set without having to build complex calculations or functions.True
/False
responses, allowing you to decide how to proceed. These tests can be useful in cleaning data as well as in developing more complex logic for solving specific difficult problems. For example, using the banking data set in Figure 10-1, if we are analyzing only customers with credit cards, we can easily extract the relevant data by checking if a customer has a 1
value in the Credit Card Ind column (Figure 10-2).One aspect of using Boolean data fields to be mindful of is losing the meaning of the data points. Getting a response of True
/False
or 1
/0
might be useful to the analyst but may not be as meaningful to the user of the resulting analysis. Therefore, it’s a good idea to use an alias to describe what the Boolean data type refers to (Figure 10-3).
In Prep Builder, the Profile pane is the perfect place to change your values, but doing so would negate some of the performance benefits of Boolean values. Therefore, it is often more beneficial to save these changes for later in the flow. Simply double-clicking a value (in this case, 1) allows you to replace it with an alias, such as “Yes.” Remember to change your numeric data type from an integer to a string first, or Prep Builder will not accept the change.
Many calculation functions feature Boolean logic—that is, they assess whether a value in a data field meets a condition to be output as either True
or False
. The following are some of the most useful.
This is short for an “Immediate IF” function. IIF()
functions are the equivalent of choosing what to output instead of just True
or False
when a condition is or isn’t met.
Here’s how to set up this calculation:
IIF(logical test, if True return this, if False return this)
The function returns the True
result (second part of the function) if the logical test is met, and the False
result (third part of the function) if the logical test is not met. Using an IIF()
function can save a lot of typing of traditional IF
statements and thus prevent keying errors. For example, an IIF()
function can be very effective to assess whether sales targets have been met. Here’s how you would use it to return a value of 'Met'
if sales are over 100 and 'Missed'
if they are not:
IIF( [Sales] > 100 , 'Met ', 'Missed' )
This is a great function to test whether certain words, terms, or characters are present in a string. This prevents you from having to split fields up into individual words or characters in order to check whether they contain the search term. It also avoids data being converted into massively long data sets and values being duplicated.
For example, if you’re searching for a single word within a tweet, the contains()
function can tell you if that term exists without splitting each word out into an individual row and having the values associated with that tweet replicated within the data set (e.g., number of retweets or likes).
contains([tweet],'keyword')
Figure 10-4 shows the contains()
function applied to the Preppin’ Data challenge from 2019: Week 9.
For each row of the data set, the function returns True
if it finds the keyword within the given field, and False
if it doesn’t. Figure 10-5 shows the resulting data set using the example from the Preppin’ Data challenge.
This function tests whether a column is a valid date. When you are working with manual date recording (for example, capturing records of events in Excel), it is very useful to be able to identify invalid dates so you can clean the inaccurate records to ensure only valid data is captured for analysis.
IsDate([Date])
The function returns True
if the Date field (shown in square brackets) is in the format yyyy-MM-dd. This means the year is two to four digits, month one or two digits, and day one or two digits. The date parts need to be valid; that is, the day value has to be less than or equal to 31 and the month values have to be below 13. The Date field must separate the date parts with hyphens.
In this example, let’s use the string '2020-02-29'
to test whether Prep Builder recognizes this date as valid. As you can see in Figure 10-6, Prep Builder validates that 2020 did have a leap year.
As Figure 10-7 shows, Prep Builder returns False
for dates that are not possible, like the 30th of February.
Nulls in data can often be valid, but finding them in fields where you don’t expect them can be a sign of issues that need to be resolved. Branching off records that have unexpected nulls so you can either alert an analyst or prepare that data differently can prevent misanalysis.
IsNull([Field])
If the field isn’t null—that is, it contains any value or string (including a space character)—the function will return False
. If the field being assessed returns True
, then the field will be a null.
IF
statements are logical tests that allow you to specify what value should be returned when the conditions set within the statement are met. When an IF
statement is called, Prep Builder will assess each condition in turn. Once a condition is met, the resulting THEN
clause sets the specified value into the calculation’s resulting data field. If the calculation’s first condition isn’t met, the next condition will be tested. This process continues until either:
All records have met a condition.
No condition is met, but the IF
statement contains an ELSE
statement to force a value to be entered.
No condition is met and there is no ELSE
condition, so the calculation’s resulting data field will contain a null for that record.
Using the Tweets data set from the Preppin’ Data 2019: Week 9 challenge, the IF
statement shown in Figure 10-8 tests if the tweet contains the word “soap.”
The resulting column would be Boolean, as the values returned are True
(where the word “soap” is found in the tweet) and False
if not. Notice in Figure 10-9 how Prep Builder changes the case of the True
or False
return value to match its normal output for a Boolean field.
The IF
statement in Figure 10-8 yields the same result we would get if we used a contains()
function. So why use it? The benefit of using an IF
statement is twofold:
You can test more than two conditions. An IF
statement can have as many conditions as you like. My personal record is nearly 80. The order of the conditions is important because as soon a record meets a condition, it is no longer assessed for any of the following conditions.
You can specify different return values. The resulting values of the IF
statement needn’t be Boolean; you can have it return dates, values, or strings as well.
Using the Tweets data set again, Figure 10-10 shows an IF
statement that identifies whether a specific type of soap was mentioned in the tweet.
We add another condition by using the term elseif
. So, for every subsequent condition you want to test, you would just add another elseif
. The results of the statement in Figure 10-10 are shown in Figure 10-11. Notice that the last tweet in the data pane contains the word “liquid” as well as “bar,” but because the “liquid” condition is assessed first, the resulting value is Liquid
rather than Bar
.
IF
statements can contain quite complex conditions. Understanding the impact of using AND or OR conditions, as well as the mathematical order of operations, is important. To return a True
value, AND conditions require every condition specified to be met, whereas OR conditions require only one to be met. Familiarize yourself with the mathematical order of operations, often nicknamed BODMAS (UK) or PEMDAS (US), to ensure your IF
statement returns the values you expect:
Brackets or Parentheses
Of (powers of) or Exponents
Division or Multiplication
Multiplication or Division
Addition
Subtraction
Multiplication and division have the same precedence (they are always done left to right), which is why they’re transposed in the BODMAS and PEMDAS acronyms.
Because of their flexibility, you will find IF
statements very useful in data preparation, whether for returning a simple Boolean value or a field with many differing values based on multiple conditions.
Whereas IF
statements test conditions in Tableau, CASE
statements test values. During data preparation, you can use CASE
statements to change the output value of a field as either a conditional test or simply to rename it to something easier for users to understand. To demonstrate this, let’s use a simple data set on the number of corners found in different shapes (Figure 10-12).
The CASE
statement, with its simple syntax (see Figure 10-13), saves a lot of coding compared to the equivalent IF
statement.
Unlike the SQL language, Tableau does not allow CASE
statements to perform the Boolean algebra required to test conditions.