Best practices for data handling

Data cleaning and manipulation form the foundation of any analytics project. To carry out this important step efficiently, the following best practices should be followed:

  • After importing the dataset, one should ensure that the whole dataset (all the variables and rows) has been read correctly. This means that every variable has been read in its correct or required format. Sometimes, because of a limitation in the data or in the IDE, some variables are read incorrectly and need to be converted to the correct format.
  • For example, if a variable holds a numerical ID (let's say 10 digits long), it will often be read and displayed in scientific notation. This is wrong, because an ID is a label rather than a quantity and shouldn't be displayed in scientific notation. Sometimes, a variable containing long strings is truncated. These issues should be taken care of before performing any operation on the data.
  • After every data manipulation step, such as transposing a dataset, creating dummy variables and joining them to the dataset, merging two datasets, creating a new variable, or changing the format of a variable, one should look at the resulting dataset to check that the manipulation has taken place correctly.
  • As far as possible, data shouldn't be deleted from a dataset. This should be kept in mind while dealing with missing values. If only some of the values in a row are missing, imputing them should be the preferred choice, and deleting the entire row should be avoided.
  • Basic plots, namely histograms and scatter plots, should be created for all the numerical variables to see the general shape and behavior of each variable. This helps in spotting obvious trends, outliers, potential modifications, and so on. A pair-wise scatter plot of all the numerical variables can also be tried if the number of variables is manageable. This plot is called a scatter plot matrix and is very useful for spotting relationships, if any, between any two variables. Category-wise histograms are also used to get a good sense of the distribution of a variable over different categories.
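
The first few practices above can be illustrated with a minimal sketch that uses only the Python standard library. The sample data, the column names (customer_id, income), and the helper functions load_rows and impute_mean are all hypothetical and chosen only for illustration; mean imputation is just one of several possible imputation strategies. The sketch reads an ID column as text so that it is never shown in scientific notation, fills a missing value instead of deleting the row, and then checks the result of the manipulation.

```python
import csv
import io
from statistics import mean

# Hypothetical sample data: a 10-digit ID column and an income column
# in which the second row has a missing value.
RAW_CSV = """customer_id,income
1234567890,52000
0987654321,
1122334455,61000
"""


def load_rows(text):
    """Read CSV text, keeping IDs as strings and parsing income as float or None."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        income = record["income"].strip()
        rows.append({
            # Kept as a string: an ID is a label, so it should not be
            # converted to a number (leading zeros would also be lost).
            "customer_id": record["customer_id"],
            "income": float(income) if income else None,
        })
    return rows


def impute_mean(rows, column):
    """Return new rows with missing values in `column` replaced by the mean of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


rows = load_rows(RAW_CSV)
imputed = impute_mean(rows, "income")

# Look at the resulting dataset after the manipulation step.
assert len(imputed) == len(rows)  # no rows were deleted
assert all(r["income"] is not None for r in imputed)  # no missing values remain
for r in imputed:
    print(r)
```

In this sketch the original rows list is left unchanged, so the data before and after imputation can be compared side by side.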