Prepare the data

After importing the data, we must prepare it before running any kind of analysis; this step is called preprocessing. When data is loaded, some values are often missing, either by mistake or because no observation was recorded. Missing values in rows and columns can distort the whole analysis, so they need to be cleaned up first.

A missing value occurs when no value is stored for a variable in an observation. Missing data is common and can significantly affect the operations that can be performed on the data.
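
Outside Studio, the same idea can be illustrated in a few lines of Python. The following is only a minimal sketch, assuming the pandas and NumPy libraries; the column names and values are hypothetical and are not taken from the dataset used in this chapter. It counts the missing values in each column and the number of rows affected:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; np.nan marks a missing value.
df = pd.DataFrame({
    "thickness": [5.0, 3.0, np.nan, 8.0],
    "nuclei": [1.0, np.nan, np.nan, 4.0],
    "label": [2, 2, 4, 4],
})

# Number of missing values in each column.
missing_per_column = df.isna().sum()

# Number of rows that contain at least one missing value.
rows_with_missing = int(df.isna().any(axis=1).sum())

print(missing_per_column)
print("Rows with missing values:", rows_with_missing)
```

A quick count like this shows how much data is affected before a cleaning method is chosen.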

The following are some of the main tasks in data preprocessing:

  • Data cleaning: Filling missing values; detecting and removing noisy data and outliers
  • Data reduction: Reduction of the number of attributes for easier data management
  • Data discretization: Converting continuous attributes into categorical attributes
  • Data transformation: Normalizing the data so that attributes are on a comparable scale
  • Cleaning text: Removing embedded characters that can cause data misalignments
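
Two of the tasks listed above, discretization and normalization, can be sketched in Python as follows. This is an illustration only, assuming the pandas library; the values, bin edges, and labels are hypothetical, and min-max scaling is just one of several possible normalization schemes:

```python
import pandas as pd

# Hypothetical continuous attribute.
ages = pd.Series([22, 35, 47, 63, 80])

# Discretization: map the continuous values to three categories.
# The intervals are (0, 30], (30, 60], and (60, 120].
age_group = pd.cut(ages, bins=[0, 30, 60, 120],
                   labels=["young", "middle", "senior"])

# Min-max normalization: rescale the values to the range [0, 1].
scaled = (ages - ages.min()) / (ages.max() - ages.min())

print(list(age_group))
print(scaled.round(3).tolist())
```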

For now, we will simply remove the missing values from the dataset. To do this, we will use the Clean Missing Data module in Azure Machine Learning Studio. Note that the choice of cleaning method, like the presence of missing data itself, can significantly influence the results. Experimenting with different methods helps us identify the one that solves the problem while preserving the characteristics of the data.

To start, we add the Clean Missing Data module to the experiment and connect it to the dataset with missing values. The module is found under the Data Transformation | Manipulation path. Remember that to connect two modules, you click the output port of one module, hold the mouse button down, and drag to the input port of the other. Then select the Clean Missing Data module. In the Properties tab to the right of the canvas, set the properties of this operation, as shown in the following screenshot:

By default, the Clean Missing Data module applies cleaning to all columns. To restrict it to specific columns, we can use the column selector. If you need to clean several columns using different methods, use separate instances of the module.

Two boxes follow: Minimum missing value ratio and Maximum missing value ratio. In these boxes, we specify the minimum and maximum ratio of missing values to all values in a column for which the operation is performed. By default, the Minimum missing value ratio property is set to 0, which means that missing values are cleaned even if there is only one missing value. By default, the Maximum missing value ratio property is set to 1, which means that missing values are cleaned even if 100% of the values in the column are missing.
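
The effect of the two ratios can be approximated in Python. This is a rough sketch of the behavior described above, not the module's actual implementation; it assumes the pandas and NumPy libraries, the function name columns_to_clean and the sample data are hypothetical, and treating both bounds as inclusive is an assumption:

```python
import numpy as np
import pandas as pd

def columns_to_clean(df, min_ratio=0.0, max_ratio=1.0):
    """Return the columns whose share of missing values lies in
    [min_ratio, max_ratio] and that have at least one missing value.
    Inclusive bounds are an assumption of this sketch."""
    ratios = df.isna().mean()  # fraction of missing values per column
    mask = (ratios > 0) & (ratios >= min_ratio) & (ratios <= max_ratio)
    return list(ratios[mask].index)

# Hypothetical data: 0%, 25%, and 100% missing values.
example = pd.DataFrame({
    "complete": [1, 2, 3, 4],
    "sparse": [1.0, np.nan, 3.0, 4.0],
    "empty": [np.nan, np.nan, np.nan, np.nan],
})

print(columns_to_clean(example))                 # default ratios 0 and 1
print(columns_to_clean(example, max_ratio=0.5))  # skip mostly empty columns
```

With the default ratios, every column that has a missing value is selected; lowering the maximum ratio leaves out columns that are mostly empty.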

Next comes the Cleaning mode property. In this drop-down menu, we can choose how missing values are replaced or removed. The following options are available:

  • Replace using MICE
  • Custom substitution value
  • Replace with mean
  • Replace with median
  • Replace with mode
  • Remove entire row
  • Remove entire column
  • Replace using Probabilistic PCA
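
To give a feel for what the simpler substitution options do, here is a minimal Python sketch, assuming the pandas and NumPy libraries. The values are hypothetical, and the code only mirrors the idea behind each option rather than the module's implementation; the MICE and Probabilistic PCA options are model-based and are not covered:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value at position 3.
s = pd.Series([1.0, 2.0, 2.0, np.nan, 10.0])

# Custom substitution value: replace with a fixed constant.
filled_custom = s.fillna(0)

# Replace with mean, median, or mode of the observed values.
filled_mean = s.fillna(s.mean())
filled_median = s.fillna(s.median())
filled_mode = s.fillna(s.mode().iloc[0])  # first mode if there are ties

print(filled_mean.tolist())
print(filled_median.tolist())
print(filled_mode.tolist())
```

Note how the single large value (10.0) pulls the mean upward, while the median and the mode are unaffected. This is one reason why the choice of cleaning mode can change the results.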

In our case, since we have a sufficiently large number of observations (699), we choose to delete the entire row (Remove entire row) wherever a value is missing.
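
The idea behind Remove entire row can be sketched in Python as follows. This is an illustration only, assuming the pandas and NumPy libraries; the data is hypothetical and much smaller than the dataset used here:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in two of the three rows.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, np.nan],
    "c": [7, 8, 9],
})

# Drop every row that contains at least one missing value.
cleaned = df.dropna(axis=0, how="any")

rows_dropped = len(df) - len(cleaned)
print("Rows kept:", len(cleaned))
print("Rows dropped:", rows_dropped)
```

Comparing the row counts before and after cleaning is a simple way to check how much data this option discards.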

Once the cleaning options are set, we can launch the procedure by clicking the Run button at the bottom of the canvas.
