Chapter 3. Cleaning Our Datasets

Data originates from various sources: empirical research, historical research, or plain record keeping. At some point, a human has to consolidate that data into a dataset. Humans are far from perfect, and this process of consolidating data inevitably introduces tiny imperfections into our datasets. This chapter looks at the techniques we can use to identify problems in a dataset.

In this chapter, we will cover the following topics:

  • Structured versus unstructured datasets
  • Creating your own structured data
  • Counting the number of fields in each record
  • Filtering data using regular expressions
  • Searching fields based on a regular expression

Structured versus unstructured datasets

In the previous chapter, we navigated data from three different sources: direct keyboard entry, CSV files, and SQLite3 files. Data can originate from many more sources than just these. We typically classify the format of the data into two types: structured and unstructured data. Structured data consists of raw data with a degree of organization in its layout. Common examples of structured data include relational or hierarchical databases and the CSV, XML, JSON, and YAML file formats. Regardless of the format, the data is organized into a pattern that software can understand (that is, our data is machine readable) and meets the criteria set forth in a metadata document.
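To make the machine-readable criterion concrete, here is a minimal sketch in Python (used purely for illustration; the field names and values are made up) showing the same record expressed as CSV and as JSON, each parsed with a standard-library parser:

```python
import csv
import io
import json

# The same record expressed in two structured formats. The field
# names here are illustrative, not from a real dataset.
csv_text = "name,team,assists\nOzzie Smith,Cardinals,8375\n"
json_text = '{"name": "Ozzie Smith", "team": "Cardinals", "assists": 8375}'

# Because the layout follows a known pattern, standard parsers can
# pull out individual values without any guesswork.
csv_record = next(csv.DictReader(io.StringIO(csv_text)))
json_record = json.loads(json_text)

print(csv_record["assists"])   # '8375' (CSV values arrive as strings)
print(json_record["assists"])  # 8375 (JSON preserves the number type)
```

Note that even between structured formats the metadata differs: CSV delivers every value as a string, while JSON distinguishes numbers from strings, which is exactly the kind of detail a metadata document pins down.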

The following sentence is what most would consider unstructured data:

"Nicknamed 'The Wizard' for his defensive brilliance, Smith set major league records for career assists (8,375) and double plays (1,590) by a shortstop."

 ---Wikipedia entry for Ozzie Smith

While there is a healthy debate on what structured data is, I tend to lean toward two requirements: the dataset should be in a machine-readable format, and it should meet the criteria set forth in a metadata document. If there is no machine-readable way to consistently pull values from a dataset, the dataset is classified as unstructured data. Unstructured data represents all other datasets. Examples of unstructured data include values buried in free text, the words spoken in an audio recording, the characters on a page from a scanned image, the people appearing in a video clip, or even structured data (such as a CSV file) that happens to be embedded in unstructured data.
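By contrast with the parsers above, pulling a value such as the assists total out of the Wikipedia sentence requires a one-off pattern. The following Python sketch (illustrative only) shows why this does not count as machine readable in any consistent sense:

```python
import re

sentence = ("Nicknamed 'The Wizard' for his defensive brilliance, Smith set "
            "major league records for career assists (8,375) and double "
            "plays (1,590) by a shortstop.")

# A one-off pattern that happens to match this exact phrasing. Reword
# the sentence even slightly and the pattern silently fails; there is
# no consistent, machine-readable way to pull values from free text.
match = re.search(r"assists \(([\d,]+)\)", sentence)
assists = int(match.group(1).replace(",", "")) if match else None
print(assists)  # 8375
```

The extraction works for this one sentence, but nothing guarantees it works for the next one, which is precisely the accuracy-and-consistency gap discussed in the following section.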

How data analysis differs from pattern recognition

We do have strategies to extract text from images, words from audio recordings, and values from sentences. Each of these examples uses a field of computer science known as pattern recognition, which attempts to automate the process of imposing structure on unstructured data. While there are many successful techniques for this problem in various contexts, each comes with a built-in margin of error. For data to be considered structured, there must be perfect accuracy (we get it right the first time) and consistency (this happens every time) in the manner in which the data is accessed. Data analysis differs from the field of pattern recognition because the structure of the data is assumed not to be the problem that we are trying to solve.

If your primary source of data is structured in your favorite format, error-free, and distilled to just the records needed for your desired problem, then someone has already performed the hardest job in data analysis for you: cleaning the dataset. Cleaning data is the least glamorous part of data analysis, yet it consumes most of our time. Datasets frequently come with their own quirks. The typical oddities to anticipate are missing values, duplicate records, misspelled identifiers, data outliers, and column values that do not have a consistent type. We also frequently need to merge columns when related information is split across them, and occasionally split a column when it expresses multiple pieces of information. During times when you are blessed with too much data, you will have to perform the common task of filtering it.
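As a small illustration of these quirks, the following Python sketch (with made-up records) handles three of them directly: duplicate records, a missing value, and an inconsistent column type:

```python
# Each row is (name, team, games); the data problems are deliberate.
rows = [
    ("Ozzie Smith", "Cardinals", "87"),
    ("Ozzie Smith", "Cardinals", "87"),   # duplicate record
    ("Ozzy Smith", "Cardinals", "92"),    # misspelled identifier
    ("Willie McGee", "Cardinals", None),  # missing value
    ("Vince Coleman", "Cardinals", 151),  # inconsistent type (int, not str)
]

# Drop exact duplicates while preserving the original order.
seen = set()
deduped = [r for r in rows if not (r in seen or seen.add(r))]

# Drop records with missing values and coerce the games column to int.
cleaned = [(name, team, int(games))
           for name, team, games in deduped
           if games is not None]

print(cleaned)
```

Notice that the misspelled identifier ("Ozzy Smith") survives untouched: mechanical rules cannot fix it, because recognizing it as a misspelling takes domain knowledge about who should be in the dataset.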

To perform all these tasks, you will need a scrub brush and a willingness to get the job done. Sometimes, the data is so messy that you will understand why an entire field of science is devoted to organizing data.
