Chapter 6

Cleansing and Profiling Data

This chapter covers Objective 2.2 (Identify common reasons for cleansing and profiling datasets) of the CompTIA Data+ exam and includes the following topics:

  • Duplicate data

  • Redundant data

  • Missing values

  • Invalid data

  • Non-parametric data

  • Data outliers

  • Specification mismatch

  • Data type validation

For more information on the official CompTIA Data+ exam topics, see the Introduction.

This chapter covers topics related to cleansing and profiling data. Some datasets may include duplicate data or data that is otherwise not worth working with. In this chapter, you will learn about redundant data, missing values, invalid data, non-parametric data, and data outliers. This chapter also describes specification mismatches and data type validation.

Profiling and Cleansing Basics

Data plays a vital role in analysis as well as decision making. In fact, the success of analysis and decision making relies heavily on data quality. For a variety of reasons, the quality of data may be lower than expected. Data may include duplicates, or it may be redundant or invalid, as we will discuss in this chapter.

Fun Fact

The phrase “Garbage in, garbage out” applies to data quality. If the data being used for analysis has errors, missing values, and other inconsistencies, working with that data will give you very low-quality or useless outcomes.

Profiling and cleansing the data in a dataset is usually a two-step process involving error detection and correction. Data profiling is carried out to identify records that include corrupted or incomplete data, and verification and validation are conducted to pinpoint missing or incomplete information. The goals of profiling are to spot any errors in a given dataset and to recognize any anomalies that could affect the quality of the final analysis.

Data scrubbing, also known as data cleansing, follows the data profiling phase and is aimed at correcting data quality issues before a dataset is leveraged in another system. The goal is to correct data that is incomplete, is formatted incorrectly, or includes duplicate records or errors, all of which may reduce its usefulness. During the cleansing phase, data engineers may introduce transformations that clean the data and hence enhance its quality. Data cleansing may be performed manually or automatically. Manual cleaning is typically a time-consuming process, whereas automated cleaning is quicker and often more cost-effective.
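
To make the idea of automated cleansing concrete, here is a minimal sketch using the pandas library. The DataFrame, its column names, and the specific rules are hypothetical illustrations of common transformations, not a prescribed workflow.

# Minimal automated cleansing sketch (hypothetical column names and rules).
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [101, 101, 102, 103],
    "email": ["a@example.com", "a@example.com", "b@example.com", None],
    "age": ["34", "34", "41", "not provided"],
})

cleaned = (
    raw
    .drop_duplicates()                                                   # remove exact duplicate rows
    .assign(age=lambda df: pd.to_numeric(df["age"], errors="coerce"))    # coerce invalid ages to NaN
    .dropna(subset=["email"])                                            # drop rows missing a required field
)
print(cleaned)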

Note

Once data extraction is complete, data is typically stored in temporary tables, and data cleansing functions are carried out before the data is stored in the target tables.

Duplicate Data

Duplicate data reduces the quality of a given dataset and makes the knowledge acquired from such sources unreliable. Detecting duplicate records requires determining which records refer to the same unique object or entity.

The process of detecting duplicates starts with the preparation stage, in which data entries are stored uniformly in a database that brings some degree of homogeneity to the datasets. The data preparation stage consists of parsing, transforming, and standardizing the data. The processes of extraction, transformation, and loading (ETL) may be used to enhance the quality and usability of data that is collected.

A few steps can be taken to reduce the likelihood of duplicate entries. You may be able to use logic to check that all entered data (such as physical addresses, email addresses, and phone numbers) is entered correctly and to look for duplicate and misspelled entries. It is important to develop a streamlined and standardized process for data input to reduce duplication at the source.

Note

It can be quite difficult to identify similar duplicates in a database. Objects or entities may superficially look alike but differ by only a character or symbol; without that differing character or symbol, they would be exact duplicates.

You can use similarity measurements to look for similar duplicates that are not exact textual duplicates. Based on these similarity measurements, you can decide whether the entries are duplicates. For example, “neighbor” and “neighbour” are similar duplicates: They are not spelled the same but have the same meaning in US and UK English. If you use a string-matching algorithm that estimates the similarity between “neighbor” and “neighbour,” you can use the result to determine whether the two terms are the same.
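
As an illustration, the following sketch uses Python's standard difflib library to compute a similarity ratio between the two spellings. The 0.85 threshold is an arbitrary assumption for the example, not a standard cutoff.

from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity ratio between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

score = similarity("neighbor", "neighbour")
print(f"similarity: {score:.2f}")                      # roughly 0.94
# Treat the pair as a likely duplicate above a chosen threshold (assumed 0.85 here).
print("likely duplicate" if score > 0.85 else "distinct")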

Another option for eliminating duplicate entries is to leverage a knowledge-based approach called IntelliClean. IntelliClean is a proposed framework that provides a methodical way to deal with duplicate entries by introducing standardization, detecting anomalies, and removing duplication in databases. The IntelliClean framework involves three stages: preprocessing, processing, and validation and verification. Anomalies in the data are identified and cleansed during the preprocessing stage. The output of the preprocessing stage is the input to the next stage, processing. In this stage, a set of rules is applied to identify duplicates: identification, purge or merge, update, and alert rules. These rules are executed by a search and comparison engine that compares the rules collection against the objects collection. The actions taken in the preprocessing and processing stages are logged for the validation and verification stage, which may involve human intervention to assure accuracy and consistency.

Redundant Data

ExamAlert

Redundant data is an important aspect pertinent to data cleansing and is an area of focus in the CompTIA Data+ exam.

Removing the noise introduced by redundant data (which might be introduced during the data acquisition or storage phase) is just as important as cleansing data by identifying and eliminating errors. Data redundancy occurs when the same datasets are stored in multiple data sources, such as data warehouses or data lakes. Data redundancy often starts with poorly designed relational databases whose inadequate structure leads to unnecessary replication of data within the same table. Data redundancy may also be an outcome of backing up data repeatedly, creating multiple backups that contain the same datasets.

For example, say that several business units are collecting customer reviews from the same customer and storing them in different tables. The redundancy can be avoided by leveraging unique IDs or foreign keys to maintain relationships across separate but related tables.

Note

While there are many benefits to reducing data redundancy, one of the key benefits is the reduced storage requirement, which is especially valuable as data volumes grow.

Another common example of data redundancy is repeating a customer's details (such as name, address, and phone number) across multiple rows or tables in a relational database. This duplication can occur as different lines of business (LOBs) input data and end up making multiple entries for the same customer. As a result, the same pieces of data end up existing in multiple places.

Note

In the real world, data redundancy can’t be eliminated; however, it can be reduced and managed well.

From a data analysis perspective, there are two types of data redundancy:

  • Superficial redundancy: This type of redundancy does not impact analysis because the redundant variables are not considered in future analysis. For example, repeated values across two tables in a database can simply be ignored; to resolve superficial redundancy, data engineers ignore the repeated values.

  • Deep redundancy: This type of redundancy exists when you are correlating variables across multiple datasets and/or tables, along with anything derived from these redundant values. With deep redundancy, you need to take a more focused approach to ensure that redundant values across many databases are removed. Data engineers might need to compare the redundant values across multiple tables in multiple databases or data warehouses, which can be done using algorithms.

Before we move on to looking at how to reduce data redundancy, we need to consider the problems that data redundancy can create. Data redundancy can lead to omissions and incorrect measurements, and it can also increase the size of databases and storage requirements. Think about a customer contact database that includes the customer's address as one of the data fields, with multiple entries of the same address spread across the database. If the customer moves to a different location, it will be necessary to update the address in every one of those entries. Making the needed update in multiple places is time-consuming and introduces the possibility of typographical errors.

One way to reduce and remove data redundancies is to leverage the process of normalization. Normalization allows you to efficiently organize data in a database by eliminating redundant data and ensuring that data dependencies make sense. With data normalization, the attributes and relations of a database are organized such that their dependencies are properly enforced by database integrity constraints. Normalization involves the normal forms 1NF through 5NF. Chapter 7, “Understanding and Executing Data Manipulation,” covers the concept of normalization and the various normal forms.

ExamAlert

Data redundancy and normalization are key topics, and you can expect questions along these lines: “What do you need to do to a sample dataset before you can analyze it?” The answer is normalize the data.

To reduce redundancy, organizations often integrate the data from multiple databases into a centralized source, such as a data warehouse or data lake. An organization that does this ends up with less redundancy and also has the opportunity to put limits on data entry and implement validation.
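
As a small illustration of this consolidation, the following pandas sketch combines two hypothetical LOB extracts and keeps a single row per customer. The table names, column names, and values are assumptions for the example.

import pandas as pd

# Two hypothetical LOB tables holding some of the same customers.
sales = pd.DataFrame({"customer_id": [1, 2], "name": ["Ann Lee", "Raj Patel"]})
support = pd.DataFrame({"customer_id": [2, 3], "name": ["Raj Patel", "Mia Chen"]})

# Consolidate into one source and keep one row per customer_id.
customers = (
    pd.concat([sales, support], ignore_index=True)
    .drop_duplicates(subset="customer_id")
)
print(customers)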

Missing Values

In addition to having duplicate values, a dataset may be missing values. If missing values are not handled properly, researchers or analysts may reach incorrect conclusions about the data, which would impact the analysis and possibly skew outcomes.

Missing values can be broadly categorized as random and non-random:

  • Random missing values: Random missing values result when the respondent or the person entering the data inadvertently fails to provide certain responses, skips some data fields, or makes typographical errors while entering the data. An example is consolidating data from multiple databases and mistakenly skipping one or more tuples.

    Random missing values are further classified as missing completely at random (MCAR) and missing at random (MAR). With MCAR, there is no relationship between a missing data point and any other values in the dataset. For example, while responding to a survey, respondents might skip one or more questions at random, and it may be impossible to relate these missing values to the type of question. With MAR, the missing values do have a relationship to other data points; for example, a survey respondent may decline to identify their sex as male, female, or unidentified, and that missingness may be related to other responses they provided.

  • Non-random missing values: These missing values result when the respondent or the person entering the data intentionally leaves fields blank or data entries incomplete. An example would be a respondent choosing to leave a field empty in response to a question about sensitive data such as their marital status.

There are multiple approaches for dealing with missing data. Some approaches involve removing variables or records that have missing values, while others involve imputing or substituting values. Some of the popular approaches to addressing missing data are as follows (a short sketch applying several of them appears after this list):

  • Replace missing values: One way to deal with missing values is to replace all missing values in a dataset with a unique identifier. For example, blank cells may be filled with five zeros (00000) to represent the value “not present” or “N/A.”

  • Do nothing: Another approach is to leave the missing data as it is. Instead of filling in values directly, data engineers let an algorithm take control over how it responds to the missing data.

Note

The do nothing approach does not imply no imputation. Rather, the imputation method/algorithm will determine the best substitution for the missing data.

Different algorithms react to missing data differently; some algorithms identify the most plausible imputation values for missing data based on training from previous datasets, and others just try to substitute based on prebuilt logic rules.

  • Drop if it’s not in use: One way to deal with missing data is to delete all rows that have missing values so that no missing values remain in the dataset. The drawback of this approach is that it can affect how representative the sample is, as the remaining sample might not be sufficiently large.

  • Use imputation with most frequent values: Missing values, whether numeric or string, are substituted with the most frequently occurring value. For example, when measuring speed in miles per hour (MPH), the most frequent recorded speed may be used to replace missing speed values in a dataset.

  • Use imputation with mean or median values: With this method, which works only with numeric data, you start by working out the mean or median of the available values in a column. You then replace the missing values in that column with the calculated value.

  • Use extrapolation and interpolation: This method involves estimating values from other similar observations based on previously known data points. For example, the speed of a cruise ship in a certain area of the ocean can be estimated from previous voyages of other cruise ships in the same region.
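
The following sketch applies several of these approaches with pandas. The column name and values are hypothetical, and the choice among mean, median, most-frequent, or interpolated substitution depends on the data at hand.

import pandas as pd

# Hypothetical speed readings with gaps.
df = pd.DataFrame({"speed_mph": [55.0, None, 60.0, 55.0, 62.0, None]})

df["mean_fill"] = df["speed_mph"].fillna(df["speed_mph"].mean())      # mean imputation
df["median_fill"] = df["speed_mph"].fillna(df["speed_mph"].median())  # median imputation
df["mode_fill"] = df["speed_mph"].fillna(df["speed_mph"].mode()[0])   # most frequent value
df["interp"] = df["speed_mph"].interpolate()                          # linear interpolation
dropped = df.dropna(subset=["speed_mph"])                             # or drop rows with missing values
print(df)
print(dropped)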

Invalid Data

Invalid data contains values that were initially generated inaccurately. In most datasets, invalid data values can be difficult to identify; however, they impact the outcomes of analysis and should therefore be removed from a dataset.

Invalid data may be included in a dataset for a number of reasons, including the following:

  • Incorrect recording of data: Someone may have recorded the data values incorrectly, perhaps simply entering incorrect values or putting values in the wrong fields. For example, an office receptionist might take down the wrong details from a customer for a callback from the marketing team.

  • Generation of invalid data: A tool or software application may generate bad or invalid data, or an observer may incorrectly configure the tool. For example, an engineer might calibrate a meter incorrectly, leading to ambiguous readings.

  • Measurement of incorrect items: An observer might need to measure X but instead measure Y. For example, instead of measuring the value of calcium in a blood sample, a clinician might measure zinc and report it as calcium.

Invalid data can be rectified in a few ways. One way is to remove the invalid data from the dataset altogether. Another is to cross-check the data against standard/similar values.
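
As a small illustration of cross-checking against standard values, the sketch below flags readings that fall outside an assumed valid range. The column names, values, and the 8.5 to 10.5 mg/dL range are assumptions for the example, not clinical guidance.

import pandas as pd

# Hypothetical lab results; valid calcium readings assumed to fall in 8.5-10.5 mg/dL.
labs = pd.DataFrame({"patient": ["A", "B", "C"], "calcium_mg_dl": [9.2, 47.0, 10.1]})

valid_range = (8.5, 10.5)
mask = labs["calcium_mg_dl"].between(*valid_range)

invalid = labs[~mask]   # flag for review or removal
cleaned = labs[mask]    # keep only values that pass the cross-check
print(invalid)
print(cleaned)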

Non-parametric Data

Non-parametric data is data that does not fit a well-defined or well-stated distribution. No fixed parameters are available, and no normal distribution is assumed for non-parametric data. Non-parametric methods are also known as distribution-free or assumption-free methods. Because there are no fixed parameters available for evaluating hypotheses, these methods work with the entire population distribution rather than with summary parameters.

For example, say that in a city, the current population as of 16:00 hours is 1 million. At 17:00 hours, the population becomes 1.1 million. There are no parameters governing this population growth, and so this data is considered non-parametric data. It is important to see here that the data doesn’t fit a defined shape (for example, the population grows by x in z time). Also, it is important to know that the data is aligned with time intervals and may contain outliers.

In order to deal with non-parametric data in datasets, it is important to consider the tests that will help deal with this data and the associated hypotheses. Non-parametric methods include the sign test, the Spearman correlation test, and the Mann-Whitney U test. You can use the sign test, for example, to test a null hypothesis and compare two groups for equality. You can use the Spearman correlation test to measure the strength of association between ordinal variables (expressed as a coefficient r, where r = 1 implies a perfect positive correlation and r = –1 implies a perfect negative correlation). The topic of the null hypothesis and hypothesis testing is covered in Chapter 10, “Descriptive and Inferential Statistical Methods.”
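
The sketch below runs two of these tests with the SciPy library; the rating values are hypothetical and serve only to show the calls.

from scipy import stats

# Hypothetical ordinal ratings from two reviewers.
reviewer_a = [1, 2, 3, 4, 5, 6, 7, 8]
reviewer_b = [2, 1, 4, 3, 6, 5, 8, 7]

# Spearman correlation between the two sets of ordinal ratings.
rho, p_value = stats.spearmanr(reviewer_a, reviewer_b)
print(f"Spearman r = {rho:.2f}, p = {p_value:.3f}")

# Mann-Whitney U test comparing two independent samples.
u_stat, u_p = stats.mannwhitneyu(reviewer_a, reviewer_b)
print(f"U = {u_stat}, p = {u_p:.3f}")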

Data Outliers

An outlier in a dataset is an observation that is inconsistent with or that is very dissimilar to the other observations/information. Outliers are anomalous values that deviate from other data points in a dataset. An example of an outlier would be a credit card transaction that’s much larger than the user’s typical transactions. Some key causes of data outliers are human error, instrument calibration error, and sometimes an environmental change. Outliers may also be present due to the intentional sabotage of data.

Note

Defining an observation as an anomaly depends on what has been defined as normal or the baseline, and that baseline is itself something the data analyst must establish before anomalies can be identified.

Figure 6.1 shows data with outliers that are outside the normal values.

Figure 6.1 Outliers

There are three types of outliers:

  • Type I: Global outliers: These outliers, also known as point anomalies, have values far outside the rest of the dataset. An example would be lightning striking a building in your community, which is a rare event.

  • Type II: Contextual outliers: The value of this type of outlier, also known as a conditional outlier, deviates significantly from the rest of the data points in a particular context. Consider a city getting more than 8 inches of rain in a single day during a wet summer, causing flooding: it isn’t usual to get that much rain at once, but rain during summer is still a normal, contextual phenomenon.

  • Type III: Collective outliers: As the name suggests, a collective outlier is a subset of anomalous data points within a dataset that collectively diverge from the rest of the dataset. For example, say that almost all homes in a community start replacing their regular cable Internet connections with satellite dish antennas. This isn’t a singular event but a collective activity that isn’t expected to happen all at once.

Statistical methods can be used to detect outliers; examples include the standard deviation method, the distance method, and the density method, among other statistical techniques. Some additional ways to detect outliers are as follows (a short sketch of the standard deviation method appears after this list):

  • Visually find outliers by plotting data on histograms for univariate data and scatter plots for multivariate data

  • Use statistical tests such as Chauvenet’s criterion or Grubbs’ test to compare data points within a set and identify outliers

  • Manually look through the data for outliers and remove these anomalies
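
Here is a minimal sketch of the standard deviation method mentioned above, using only the Python standard library. The transaction amounts and the 2-standard-deviation cutoff are assumptions for the example; a 3-standard-deviation cutoff is also common.

import statistics

# Hypothetical credit card transaction amounts; one value is far outside the rest.
amounts = [42.0, 55.5, 61.2, 48.9, 58.3, 2500.0, 50.1]

mean = statistics.mean(amounts)
stdev = statistics.stdev(amounts)

# Flag points more than 2 standard deviations from the mean (assumed cutoff).
outliers = [x for x in amounts if abs(x - mean) > 2 * stdev]
print(outliers)   # the 2500.0 transaction is flagged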

Specification Mismatches

It is quite common for data from source systems to be incompatible with a destination system. This is sometimes due to specification mismatches, such as unrecognized symbols, mismatches of units/labels, or different data formatting. For example, if the formatting of data isn’t the same across two systems, a mismatch may result. Quite commonly, the use of quotes and semicolons in a file causes specification mismatches. Further, specification mismatches may arise due to different terms or abbreviations being used for the same concept (for example, First_Name in one database and F_Name in another or Sex=Male in one database and Sex=M in another). Unless a user curates the data for one-to-one translation or transposition between source and destination systems, specification mismatches may occur among different systems exchanging information.

One way to resolve data specification mismatches is to have algorithms convert incompatible specifications, either during preprocessing or on the fly as the data is transferred from one system to another. Another way to resolve specification mismatches is to enforce data field types so that any data that is moved or entered has to match the field type.
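
A small sketch of the first approach, using pandas to convert the examples above during preprocessing: the column names follow the First_Name/F_Name and Sex=Male/Sex=M examples, and the specific values and mappings are illustrative assumptions.

import pandas as pd

# Hypothetical source extract using its own labels and column names.
source = pd.DataFrame({"F_Name": ["Ann", "Raj"], "Sex": ["M", "F"]})

# Map the source specification onto the destination's expected schema.
column_map = {"F_Name": "First_Name"}
value_map = {"M": "Male", "F": "Female"}

destination = source.rename(columns=column_map)
destination["Sex"] = destination["Sex"].map(value_map)
print(destination)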

Data Type Validation

Data type validation is the process of using one or more checks and/or rules to ensure that the user is only entering a specific data type so that the data is meaningful for later processing.

As discussed in Chapter 3, “Data Types and Types of Data,” data types can be any of the following:

  • Integer (Int)

  • Character (Char)

  • String or text

  • Float

  • Enum

  • Boolean

  • Array

  • Date

  • Time

Data type validation ensures that the data entered is of the correct data type to be leveraged at the destination system. For example, a field might accept only string data and reject floating-point values, which means decimal numbers would be rejected. Similarly, a numeric field should accept the digits 0 through 9 but should not accept characters or symbols.
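
A minimal sketch of such checks in Python follows. The validate helper, the numeric-field rule, and the sample values are hypothetical illustrations of data type validation, not a standard API.

def validate(value, expected_type):
    """Return True only when the value already has the expected data type."""
    if expected_type is int and isinstance(value, bool):
        return False  # bool is a subclass of int in Python; reject it for an Int field
    return isinstance(value, expected_type)

def is_numeric_field(text: str) -> bool:
    """Accept only the digits 0 through 9, rejecting letters and symbols."""
    return text.isdigit()

print(validate(25, int))          # True: an Int field accepts a whole number
print(validate(3.14, int))        # False: an Int field rejects a Float value
print(validate("Smith", str))     # True: a string field accepts text
print(is_numeric_field("2024"))   # True: digits only
print(is_numeric_field("20a4!"))  # False: contains a letter and a symbol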

Note

In addition to data type validation, there are other data validation mechanisms, such as validation of data range, data consistency, and data format. Multiple data type validation rules can be used to ensure that data being entered meets the minimum standard of the correct data type.

What Next?

If you want more practice on this chapter’s exam objective before you move on, remember that you can access all of the Cram Quiz questions on the Pearson Test Prep software online. You can also create a custom exam by objective with the Online Practice Test. Note any objective you struggle with and go to that objective’s material in this chapter.
