Chapter 27. Dealing with Nulls

Nulls, or the absence of data, are a fickle beast within data preparation. Experienced data preppers will know almost instinctively how to deal with them, or at least how to manage the challenges that come with a data set containing them. Novice data preppers do not have the same set of use cases or experience to draw on, though, so this chapter covers the basic considerations for working with a data set containing nulls.

 

What Is a Null?

The absence of data is not the same as a zero, a new row, or a space, all of which are actually values. Nulls appear in data sets for many reasons, including:

  • They are the result of mismatched fields in a union.

  • They are the result of mismatched fields in a left, right, or fullOuter join.

  • There’s no original data entry for that record, but other data points in the set exist (i.e., the other fields are not null).

Prep Builder shows the number of null records for any data field in its Profile pane (Figure 27-1).

Nulls in the Volume data field
Figure 27-1. Nulls in the Volume data field

Now that you know what they are and where to find them, let’s look at when it’s acceptable to have null records in your data set and when you should remove or replace them.

When Is a Null OK?

To understand when a null value is acceptable, let’s assess the impact of one. The most common situation where nulls can affect your analysis is when you are averaging values. Figure 27-2 shows a simple data set that has nulls, and Figure 27-3 shows the same data set but with zero values instead of the nulls.

Soap Scent data set with nulls
Figure 27-2. Soap Scent data set with nulls
Soap Scent data set with zeros
Figure 27-3. Soap Scent data set with zeros

If you use both these tables in Prep Builder in exactly the same way to calculate averages, due to the null values, they will produce two different outputs. For example, adding an Aggregate step after the Input step and calculating the average volume for each type of product generates very different results. When you apply this technique to the data set with nulls, the values returned for Bars and Liquid are 56.6 and 566.6, respectively (Figure 27-4).

Averages resulting from nulls in the data set
Figure 27-4. Averages resulting from nulls in the data set

However, applying the same technique to the table containing zeros gives results of 42.5 for Bars and 425 for Liquid product types (Figure 27-5).

Averages resulting from zeros in the data set
Figure 27-5. Averages resulting from zeros in the data set

So what is the math in this situation and why does the result differ? Well, to calculate the averages, Prep Builder is summing up the volumes and dividing by the number of rows. For both tables, the sum of the volumes is 170 for bars and 1,700 for liquid. However, Prep Builder does not count null values in the row count for the average. This means that volume sum is being divided by 4 for the table with no nulls (as each type of soap has four rows), but being divided only by 3 for the table with nulls (as each type of soap has only three non-null volumes).

This is an important point to consider, because if a null is correct (i.e., a record didn’t occur and shouldn’t have), it is the perfect entry for the record in the respective data field. However, if a null is present but it’s due to a value not being recorded when it should have been, then the null should be replaced with a zero. Situations like this occur when a product is stocked in a shop but not sold. Nulls should be used where a product isn’t stocked within a certain store and, therefore, never had the possibility of being sold.

How to Remove or Replace a Null

As a data prepper, you may have to decide that the null cannot remain in your data set. You have a few options in this case.

ISNULL()

If your analysis doesn’t allow for a null value to be in place, you may need to filter out the row. ISNULL() is a fantastic function that allows you to simply assess whether a record has a null in a given field. ISNULL() returns a Boolean result of True or False so you can filter out just the True values, thereby removing the nulls (Figure 27-6).

Applying the ISNULL() function to the Volume data field
Figure 27-6. Applying the ISNULL() function to the Volume data field

In this example, each row with a null in the Volume data field will return True. Filtering out this data produces the result shown in Figure 27-7.

Result of ISNULL() calculation with null values highlighted
Figure 27-7. Result of ISNULL() calculation with null values highlighted

Removing columns full of nulls is a wise choice too, as it is very unlikely that they are useful for any reason, especially if the column doesn’t have a name and returns Fx (x is a number) when Prep Builder doesn’t find a field name in the data set.

ZN()

ZN() (short for “zero if null”) replaces a null value for the specified data field with a zero. This is useful when you want the row to be considered in aggregations, especially averages as in the previous example (Figure 27-8).

Zero if null calculator setup
Figure 27-8. Zero if null calculator setup

In this example, Prep Builder will overwrite any null values with a zero instead. You can overwrite nulls in Prep Builder by double-clicking the null in the Profile pane and entering 0 (zero) instead. If the data field is non-numeric, you may wish to use IFNULL() instead, as it returns the value given if it finds a null within the data field (Figure 27-9).

Using an IFNULL() function to return 'Unknown' instead of a null
Figure 27-9. Using an IFNULL() function to return 'Unknown' instead of a null

Merge

Not all nulls are supposed to be zeros—they might be due to data entry or system errors. This means that if you have a value within the data set that can replace those nulls, you can use a merge. The situation where this mostly occurs is during a union: if data field names do not exactly match, then the union will create separate columns for the mismatched fields.

Let’s use the Preppin’ Data 2020: Week 8 challenge to show the merge operation in action. This challenge starts with a wildcard union but results in two columns for the volumes (Figure 27-10).

Data set where the Sales Volume and Volume fields need to be merged
Figure 27-10. Data set where the Sales Volume and Volume fields need to be merged

There are a couple of easy ways to merge the columns in the Profile pane. First, drag one of the fields and drop it onto the other field you wish to merge it with. Second, select both columns by holding down the Ctrl key (Command on the Mac), and then click the ellipsis. Select Merge (Figure 27-11).

Merging data fields using the menus in the Profile pane
Figure 27-11. Merging data fields using the menus in the Profile pane

Now you should see those two fields merged together with the null values from one column replaced by the corresponding values from the other. The resulting data set is shown in Figure 27-12.

Result of the merged data set
Figure 27-12. Result of the merged data set

Nulls can be frustrating, but hopefully this chapter has helped you to understand the choices you have, when the nulls can remain, and when and how to remove or replace them.

Summary

Nulls are the absence of values in a data set. There are multiple situations where their presence is necessary for you to properly analyze the data. Where the nulls need to be removed or altered, however, there are lots of different techniques available to you. To choose the right one, you’ll need to think through the impact of your actions on the subsequent analysis.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset