Chapter 2. Preparing Your Data

The French term mise en place is used in professional kitchens to describe the practice of organizing and arranging ingredients up to the point where they are ready to be used. It may be as simple as washing herbs and picking them into individual leaves or chopping vegetables, or as complicated as caramelizing onions or slow cooking meats.

In the same way, before we start cooking the data or building a predictive model, we need to prepare the ingredients: the data. Our preparation covers three different tasks:

  • Loading the data into the analytic tool
  • Exploring the data to understand it and to find quality problems with it
  • Transforming the data to fix the quality problems

We say that the quality of data is high when it's appropriate for a specific use. In this chapter, we'll describe characteristics of data related to its quality.

As we've seen, our mise en place has three steps. After loading the data, we need to explore it and transform it. Exploring and transforming is an iterative process, but in this book, we'll divide it into two different steps for clarity.

In this chapter, we'll discuss the following topics:

  • Datasets and types of variables
  • Data quality
  • Loading data into Rattle
  • Assigning roles to the variables
  • Transforming variables to solve data quality problems and to improve the data format for our predictive model

In Chapter 3, Exploring and Understanding Your Data, we'll cover how we explore the data to understand it and find quality problems.

Datasets, observations, and variables

A dataset is a collection of data that we're going to use to create new predictions. There are different kinds of datasets. When we use a dataset for predictive analytics, we can think of it as a table with columns and rows.

In a real-life problem, our dataset would be related to the problem we want to solve. If we want to predict which customer is most likely to buy a product, our dataset would probably contain customer and historic sales data. When we're learning, we need to find a dataset that suits our learning purposes. You can find a lot of example datasets on the Internet; in this chapter and in the following one, we're going to use the Titanic passenger list, a dataset taken from Kaggle.

Note

Kaggle is the world's largest community of data scientists. On this website, you can even find data science competitions. We're not going to use the term data science in this book, because there are a lot of new terms around analytics and we want to focus on just a few to avoid noise. Currently, the term is used to refer to an engineering area dedicated to collecting, cleaning, and manipulating data to discover new knowledge. On www.kaggle.com, you can find different types of competitions; there are introductory competitions for beginners and competitions with monetary prizes. You can access a competition, download the data and the problem description, and create your own solutions. An example of an introductory Kaggle competition is Titanic: Machine Learning from Disaster. You can download this dataset at https://www.kaggle.com/c/titanic-gettingStarted. We're going to use this dataset in this chapter and in Chapter 3, Exploring and Understanding Your Data.

A dataset is a matrix where each row is an observation or member of the dataset. In the Titanic passenger list, each observation contains the data related to a passenger. In a dataset, each column is a particular variable. In the passenger list, the column Sex is a variable. You can see a part of the Titanic passenger list in the following screenshot:

[Screenshot: part of the Titanic passenger list]
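
The row-and-column view described above can also be sketched in a few lines of Python. This is only an illustration outside Rattle, not part of the Rattle workflow: the three rows are a small hand-typed sample in the layout of the passenger list, and the names SAMPLE_CSV, observations, and sex_column are made up for this sketch.

```python
import csv
import io

# A tiny hand-typed sample in the layout of the Titanic passenger list
# (illustrative only; the real file has many more rows and columns).
SAMPLE_CSV = """Survived,Pclass,Sex,Age
0,3,male,22
1,1,female,38
1,3,female,26
"""

# Each row of the table becomes one observation (a dict keyed by variable name).
observations = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Each column of the table is one variable; collect the values of Sex.
sex_column = [row["Sex"] for row in observations]

print(len(observations))  # number of observations (rows)
print(observations[0])    # the first observation
print(sex_column)         # the values of one variable (column)
```

Note that a CSV reader returns every value as text, so the numbers in the sample are still strings at this point.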

Before we start, we need to understand our dataset. When we download a dataset from the Web, it usually has a variable description document.

The following is the variable description for our dataset:

  • Survived: If the passenger survived, the value of this variable is set to 1, and if the passenger did not survive, it is set to 0.
  • Pclass: This stands for the class in which the passenger was travelling. This variable can have three possible values: 1, 2, and 3 (1 = first class; 2 = second class; 3 = third class).
  • Name: This variable holds the name of the passenger.
  • Sex: This variable has two possible values: male and female.
  • Age: This variable holds the age of the passenger.
  • SibSp: This holds the number of siblings/spouses aboard.
  • Parch: This holds the number of parents/children aboard.
  • Ticket: This holds the ticket number.
  • Fare: This variable holds the passenger's fare.
  • Cabin: This variable holds the cabin number.
  • Embarked: This is the port of embarkation. This variable has three possible values: C, Q, and S (C = Cherbourg; Q = Queenstown; S = Southampton).
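
A variable description like this one can double as a first data quality check. The following is a minimal sketch, assuming the observations have been read as dictionaries of strings (as a CSV reader would produce); ALLOWED_VALUES and find_violations are hypothetical names for this illustration, and the two observations are made up.

```python
# Allowed values taken from the variable description above.
ALLOWED_VALUES = {
    "Survived": {"0", "1"},
    "Pclass": {"1", "2", "3"},
    "Sex": {"male", "female"},
    "Embarked": {"C", "Q", "S"},
}


def find_violations(observation):
    """Return the variables whose value is outside the documented set.

    Empty strings are treated as missing values and are skipped here.
    """
    return [
        name
        for name, allowed in ALLOWED_VALUES.items()
        if observation.get(name, "") != "" and observation[name] not in allowed
    ]


# Two made-up observations: the second one has an undocumented port code.
good = {"Survived": "1", "Pclass": "1", "Sex": "female", "Embarked": "C"}
bad = {"Survived": "0", "Pclass": "3", "Sex": "male", "Embarked": "X"}

print(find_violations(good))  # []
print(find_violations(bad))   # ['Embarked']
```

Variables with open-ended values, such as Name or Ticket, are left out because the description does not restrict them.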

For predictive purposes, there are two kinds of variables:

  • Output variables or target variables: These are the variables we want to predict. In the passenger list, the variable Survived is an output variable. This means that we want to predict if a passenger will survive the sinking.
  • Input variables: These are the variables we'll use to create a prediction. In the passenger list, the variable Sex is an input variable.

Rattle refers to output variables as target variables. To avoid confusion, we're going to use the term target variable throughout this book. In this dataset, we have ten input variables (Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked) that we want to use to predict whether a passenger survived. So in this example, our target variable is Survived.
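
The split between input variables and the target variable can be sketched as follows. This is an illustration only, not how Rattle assigns roles internally; split_observation is a hypothetical helper and the passenger record is made up.

```python
TARGET = "Survived"
INPUTS = ["Pclass", "Name", "Sex", "Age", "SibSp", "Parch",
          "Ticket", "Fare", "Cabin", "Embarked"]


def split_observation(observation):
    """Separate one observation into its input values and its target value."""
    inputs = {name: observation[name] for name in INPUTS}
    target = observation[TARGET]
    return inputs, target


# A made-up passenger record, used only to exercise the function.
passenger = {
    "Survived": "1", "Pclass": "2", "Name": "Example Passenger",
    "Sex": "female", "Age": "30", "SibSp": "0", "Parch": "0",
    "Ticket": "000000", "Fare": "13.0", "Cabin": "", "Embarked": "S",
}

inputs, target = split_observation(passenger)
print(len(inputs))  # 10 input variables
print(target)       # the value we would like a model to predict
```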

In Titanic: Machine Learning from Disaster, the passenger list is divided into two CSV files: train.csv and test.csv. The file train.csv contains 891 observations or passengers; for each observation, we have a value for the variable Survived. This means that we know whether the passenger survived or not. The second file, test.csv, contains only 418 passengers, but in this file, we don't have the variable Survived. This means that we don't know whether the passenger survived or not. The objective of the competition is to use the training file to create a model that predicts the value of the Survived variable in the test file. For this reason, the variable Survived is the target variable.
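
The difference between the two files shows up as soon as their header rows are compared. The sketch below uses two in-memory header lines as stand-ins for train.csv and test.csv; they list only the variables from the description above, so the real files may contain additional columns. The names header_of, TRAIN_HEADER, and TEST_HEADER are assumptions made for this example.

```python
import csv
import io


def header_of(csv_text):
    """Return the column names from the first line of some CSV text."""
    return next(csv.reader(io.StringIO(csv_text)))


# Header lines only, standing in for train.csv and test.csv.
TRAIN_HEADER = "Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
TEST_HEADER = "Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"

train_columns = header_of(TRAIN_HEADER)
test_columns = header_of(TEST_HEADER)

# Variables present in the training file but absent from the test file.
missing_in_test = [c for c in train_columns if c not in test_columns]
print(missing_in_test)  # ['Survived']
```

With the downloaded files, the same comparison could be made by passing the text of each file to header_of.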

Rattle distinguishes two types of variables—numeric and categorical. A numeric variable describes a numerically measured value. In this dataset, Age, SibSp, Parch, and Fare are numeric variables.

A categorical variable is a variable whose values can be grouped into different categories. There are two types of categorical variables: ordinal and nominal. In an ordinal categorical variable, the categories have a natural order and are often represented by a number. In our dataset, Pclass is an ordinal categorical variable with three different categories or possible values: 1, 2, and 3.

In a nominal categorical variable, the categories have no natural order and each group is represented by a word label. In this dataset, Sex is an example of this type. This variable has only two possible values, and the values are the labels, in this case, male and female.
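
One practical consequence is that a variable such as Pclass looks numeric when only its values are inspected, so its type has to be stated explicitly. The following sketch illustrates this with a simple heuristic; it is not the rule Rattle applies, and looks_numeric, CATEGORICAL_OVERRIDES, variable_type, and the sample values are all assumptions made for this example.

```python
def looks_numeric(values):
    """Return True if every non-missing value can be parsed as a number."""
    present = [v for v in values if v != ""]
    if not present:
        return False
    try:
        for v in present:
            float(v)
    except ValueError:
        return False
    return True


# Knowledge from the variable description overrides the guess:
# Pclass uses digits as category labels.
CATEGORICAL_OVERRIDES = {"Pclass"}


def variable_type(name, values):
    """Classify a variable as 'numeric' or 'categorical'."""
    if name in CATEGORICAL_OVERRIDES or not looks_numeric(values):
        return "categorical"
    return "numeric"


# Illustrative sample values; an empty string stands for a missing value.
columns = {
    "Age": ["22", "38", "", "35"],
    "Fare": ["7.25", "71.2833", "7.925", "53.1"],
    "Pclass": ["3", "1", "3", "1"],
    "Sex": ["male", "female", "female", "female"],
}

for name, values in columns.items():
    print(name, variable_type(name, values))
```

Without the override, the heuristic would classify Pclass as numeric, which is why the variable description matters when types are assigned.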
