Chapter 10. Dealing with Boolean Data

My favorite data joke is, “What’s a ghost’s favorite data type?”

“BOO-lean.”

Now that that’s out the way, let’s talk about potentially the simplest data type but one that sits at the heart of a lot of what we do in data analytics and therefore is an important part of data preparation: Boolean data. In this chapter, we’ll cover what it is and why it is useful when analyzing data.

What Is Boolean Data?

The word Boolean comes from the mathematician George Boole. He was Cork University’s first mathematics professor, and his theorems were eventually applied to computing. Boolean data is simply a True or False response to a conditional statement or test.

Why Is It So Useful in Data Analysis?

The response of True or False is often encoded as 1 or 0 behind the scenes in the software we use. Therefore, the performance of calculations that use Boolean data is very quick. Computing is based on 1s and 0s, so Boolean data is easily processed by a computer.

A simple column of 1 or 0 responses is actually amazingly useful in data analysis for many reasons beyond just performance:

Indicators

Here, an indicator refers to a field or set of fields that indicate whether or not each record fits some criteria. These can be analyzed very simply in most data tools. For example, if you want to count how many customers have a certain product type, it’s very simple to sum a Boolean indicator of 1s and 0s in most tools (Figure 10-1).

Indicators demonstrated as 1 for yes and 0 for no
Figure 10-1. Indicators demonstrated as 1 for yes and 0 for no
In Prep Builder or Desktop, you could easily aggregate these values using either a SUM() or AVG() function to form simple summary statistics about the data set without having to build complex calculations or functions.
Logical tests to filter
Determining whether or not something meets a condition is a key part of data analytics. As mentioned, Boolean tests simply result in True/False responses, allowing you to decide how to proceed. These tests can be useful in cleaning data as well as in developing more complex logic for solving specific difficult problems. For example, using the banking data set in Figure 10-1, if we are analyzing only customers with credit cards, we can easily extract the relevant data by checking if a customer has a 1 value in the Credit Card Ind column (Figure 10-2).
Customers with Credit Cards data set
Figure 10-2. Customers with Credit Cards data set

One aspect of using Boolean data fields to be mindful of is losing the meaning of the data points. Getting a response of True/False or 1/0 might be useful to the analyst but may not be as meaningful to the user of the resulting analysis. Therefore, it’s a good idea to use an alias to describe what the Boolean data type refers to (Figure 10-3).

Aliasing Boolean values in Prep
Figure 10-3. Aliasing Boolean values in Prep

In Prep Builder, the Profile pane is the perfect place to change your values, but doing so would negate some of the performance benefits of Boolean values. Therefore, it is often more beneficial to save these changes for later in the flow. Simply double-clicking a value (in this case, 1) allows you to replace it with an alias, such as “Yes.” Remember to change your numeric data type from an integer to a string first, or Prep Builder will not accept the change.

Functions Featuring Boolean Logic

Many calculation functions feature Boolean logic—that is, they assess whether a value in a data field meets a condition to be output as either True or False. The following are some of the most useful.

IIF()

This is short for an “Immediate IF” function. IIF() functions are the equivalent of choosing what to output instead of just True or False when a condition is or isn’t met.

Here’s how to set up this calculation:

IIF(logical test, if True return this, if False return this)

The function returns the True result (second part of the function) if the logical test is met, and the False result (third part of the function) if the logical test is not met. Using an IIF() function can save a lot of typing of traditional IF statements and thus prevent keying errors. For example, an IIF() function can be very effective to assess whether sales targets have been met. Here’s how you would use it to return a value of 'Met' if sales are over 100 and 'Missed' if they are not:

IIF( [Sales] > 100 , 'Met ', 'Missed' )

contains()

This is a great function to test whether certain words, terms, or characters are present in a string. This prevents you from having to split fields up into individual words or characters in order to check whether they contain the search term. It also avoids data being converted into massively long data sets and values being duplicated.

For example, if you’re searching for a single word within a tweet, the contains() function can tell you if that term exists without splitting each word out into an individual row and having the values associated with that tweet replicated within the data set (e.g., number of retweets or likes).

contains([tweet],'keyword')

Figure 10-4 shows the contains() function applied to the Preppin’ Data challenge from 2019: Week 9.

Using contains() to find the substring '@C&BSudsCo'
Figure 10-4. Using contains() to find the substring '@C&BSudsCo'

For each row of the data set, the function returns True if it finds the keyword within the given field, and False if it doesn’t. Figure 10-5 shows the resulting data set using the example from the Preppin’ Data challenge.

Resulting data set for the contains() calculation in #using_containsleft_parenthesisright_par
Figure 10-5. Resulting data set for the contains() calculation in Figure 10-4

IsDate()

This function tests whether a column is a valid date. When you are working with manual date recording (for example, capturing records of events in Excel), it is very useful to be able to identify invalid dates so you can clean the inaccurate records to ensure only valid data is captured for analysis.

IsDate([Date])

The function returns True if the Date field (shown in square brackets) is in the format yyyy-MM-dd. This means the year is two to four digits, month one or two digits, and day one or two digits. The date parts need to be valid; that is, the day value has to be less than or equal to 31 and the month values have to be below 13. The Date field must separate the date parts with hyphens.

In this example, let’s use the string '2020-02-29' to test whether Prep Builder recognizes this date as valid. As you can see in Figure 10-6, Prep Builder validates that 2020 did have a leap year.

Testing the IsDate() function
Figure 10-6. Testing the IsDate() function

As Figure 10-7 shows, Prep Builder returns False for dates that are not possible, like the 30th of February.

IsDate() correctly recognizing an invalid date
Figure 10-7. IsDate() correctly recognizing an invalid date

IsNull()

Nulls in data can often be valid, but finding them in fields where you don’t expect them can be a sign of issues that need to be resolved. Branching off records that have unexpected nulls so you can either alert an analyst or prepare that data differently can prevent misanalysis.

IsNull([Field])

If the field isn’t null—that is, it contains any value or string (including a space character)—the function will return False. If the field being assessed returns True, then the field will be a null.

IF/THEN

IF statements are logical tests that allow you to specify what value should be returned when the conditions set within the statement are met. When an IF statement is called, Prep Builder will assess each condition in turn. Once a condition is met, the resulting THEN clause sets the specified value into the calculation’s resulting data field. If the calculation’s first condition isn’t met, the next condition will be tested. This process continues until either:

  • All records have met a condition.

  • No condition is met, but the IF statement contains an ELSE statement to force a value to be entered.

  • No condition is met and there is no ELSE condition, so the calculation’s resulting data field will contain a null for that record.

Using the Tweets data set from the Preppin’ Data 2019: Week 9 challenge, the IF statement shown in Figure 10-8 tests if the tweet contains the word “soap.”

IF statement assessing whether Tweet contains the word “soap”
Figure 10-8. IF statement assessing whether Tweet contains the word “soap”

The resulting column would be Boolean, as the values returned are True (where the word “soap” is found in the tweet) and False if not. Notice in Figure 10-9 how Prep Builder changes the case of the True or False return value to match its normal output for a Boolean field.

Resulting data set from the IF statement calculation in #if_statement_assessing_whether_tweet_co
Figure 10-9. Resulting data set from the IF statement calculation in Figure 10-8

The IF statement in Figure 10-8 yields the same result we would get if we used a contains() function. So why use it? The benefit of using an IF statement is twofold:

  • You can test more than two conditions. An IF statement can have as many conditions as you like. My personal record is nearly 80. The order of the conditions is important because as soon a record meets a condition, it is no longer assessed for any of the following conditions.

  • You can specify different return values. The resulting values of the IF statement needn’t be Boolean; you can have it return dates, values, or strings as well.

Using the Tweets data set again, Figure 10-10 shows an IF statement that identifies whether a specific type of soap was mentioned in the tweet.

Multiple condition IF statement
Figure 10-10. Multiple condition IF statement

We add another condition by using the term elseif. So, for every subsequent condition you want to test, you would just add another elseif. The results of the statement in Figure 10-10 are shown in Figure 10-11. Notice that the last tweet in the data pane contains the word “liquid” as well as “bar,” but because the “liquid” condition is assessed first, the resulting value is Liquid rather than Bar.

Resulting data set for the IF statement in #multiple_condition_if_statement
Figure 10-11. Resulting data set for the IF statement in Figure 10-10

IF statements can contain quite complex conditions. Understanding the impact of using AND or OR conditions, as well as the mathematical order of operations, is important. To return a True value, AND conditions require every condition specified to be met, whereas OR conditions require only one to be met. Familiarize yourself with the mathematical order of operations, often nicknamed BODMAS (UK) or PEMDAS (US), to ensure your IF statement returns the values you expect:

  1. Brackets or Parentheses

  2. Of (powers of) or Exponents

  3. Division or Multiplication

  4. Multiplication or Division

  5. Addition

  6. Subtraction

Note

Multiplication and division have the same precedence (they are always done left to right), which is why they’re transposed in the BODMAS and PEMDAS acronyms.

Because of their flexibility, you will find IF statements very useful in data preparation, whether for returning a simple Boolean value or a field with many differing values based on multiple conditions.

CASE

Whereas IF statements test conditions in Tableau, CASE statements test values. During data preparation, you can use CASE statements to change the output value of a field as either a conditional test or simply to rename it to something easier for users to understand. To demonstrate this, let’s use a simple data set on the number of corners found in different shapes (Figure 10-12).

Number of Corners data set
Figure 10-12. Number of Corners data set

The CASE statement, with its simple syntax (see Figure 10-13), saves a lot of coding compared to the equivalent IF statement.

CASE statement to name shapes
Figure 10-13. CASE statement to name shapes

Unlike the SQL language, Tableau does not allow CASE statements to perform the Boolean algebra required to test conditions.

Summary

As you have seen in this chapter, Boolean data type fields can be very powerful. Practicing their use is key to becoming skilled at data preparation. The speed with which computers can process Boolean fields makes this data type especially effective in large data sets.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset