Exploring and visualizing the data

Data exploration and visualization, done in close conjunction with parsing and cleaning the data, is an important part of the model-building process. This part of the pipeline is hard to define concretely: what exactly is one looking for when exploring the data? The underlying theory is that humans can do certain things much better than computers can, such as making connections and identifying patterns. The more one looks at and analyzes the data, the more one will discover about how the variables are related and how they can be used to predict the target variable.

A popular exploratory activity in this step is to take stock of all of the predictor variables; that is, their formats (for example, whether they are binary, categorical, or continuous) and how many missing values there are in each. For binary variables, it is helpful to count how many responses are positive and how many are negative; for categorical variables, it is helpful to count how many possible values each variable can take and to plot a frequency histogram for each; and for continuous variables, calculating some measures of central tendency (for example, mean, median, mode) and dispersion (for example, standard deviation, percentiles) is a good idea.
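The inventory described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a prescribed method: the `summarize` function, the column names, the sample values, and the convention that `None` marks a missing value are all assumptions made for the example.

```python
from collections import Counter
from statistics import mean, median, multimode, quantiles, stdev


def summarize(values, kind):
    """Summarize one predictor column; None is assumed to mark a missing value."""
    present = [v for v in values if v is not None]
    summary = {"kind": kind, "missing": len(values) - len(present)}
    if kind == "binary":
        # Count positive and negative responses.
        summary["positive"] = sum(1 for v in present if v)
        summary["negative"] = sum(1 for v in present if not v)
    elif kind == "categorical":
        # Number of distinct values and how often each occurs.
        counts = Counter(present)
        summary["levels"] = len(counts)
        summary["frequencies"] = dict(counts)
    elif kind == "continuous":
        # Central tendency and dispersion.
        summary["mean"] = mean(present)
        summary["median"] = median(present)
        summary["modes"] = multimode(present)
        summary["stdev"] = stdev(present)
        summary["quartiles"] = quantiles(present, n=4)
    else:
        raise ValueError(f"unknown variable kind: {kind}")
    return summary


# Hypothetical data set: column name -> (format, values).
columns = {
    "smoker": ("binary", [True, False, None, False, True, False]),
    "blood_type": ("categorical", ["A", "O", "O", "B", None, "A"]),
    "age": ("continuous", [34.0, 51.0, 47.0, None, 29.0, 51.0]),
}

report = {name: summarize(vals, kind) for name, (kind, vals) in columns.items()}
for name, summary in report.items():
    print(name, summary)
```

In practice, the format of each variable would be recorded while parsing the data rather than typed in by hand, and a data-frame library would usually take the place of plain lists.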

Additional exploratory and visualization activities can be done to elucidate the relationships between selected predictor variables and the target variable. The appropriate plot depends on the formats of the variables involved (binary, categorical, continuous). For example, when both the predictor variable and the target variable are continuous, a scatterplot is a popular visualization; to make a scatterplot, the values of the two variables are plotted on separate axes. If the predictor variable is continuous and the target variable is binary or categorical, a pair of overlapping frequency histograms (one for each target class) is a good tool, as is a box-and-whisker plot.
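A box-and-whisker plot is essentially a drawing of a five-number summary (minimum, first quartile, median, third quartile, maximum), so the comparison it supports can be sketched numerically before any plotting is done. The following is a rough illustration using only the standard library; the function names, the `income` and `defaulted` variables, and their values are hypothetical.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))


def summarize_by_class(predictor, target):
    """Group a continuous predictor by target class and summarize each group."""
    groups = {}
    for x, y in zip(predictor, target):
        groups.setdefault(y, []).append(x)
    return {label: five_number_summary(xs) for label, xs in groups.items()}


# Hypothetical continuous predictor and binary target.
income = [21.0, 35.0, 28.0, 60.0, 75.0, 52.0, 30.0, 66.0]
defaulted = [1, 1, 1, 0, 0, 0, 1, 0]

by_class = summarize_by_class(income, defaulted)
for label, summary in sorted(by_class.items()):
    print(label, summary)
```

If the summaries for the two classes barely overlap, as they do in this made-up data, the predictor is likely to be useful for separating the classes; heavily overlapping summaries suggest the opposite.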

In many cases, there are so many predictor variables that it becomes impractical to inspect and visualize each relationship manually. In these cases, automated analyses that calculate measures and statistics, such as correlation coefficients, become important.
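As one possible automated analysis, the Pearson correlation coefficient between each predictor and the target can be computed and the predictors ranked by its absolute value. The sketch below assumes continuous variables with no missing values; the function names, the predictor names, and the data are invented for illustration, and returning 0.0 for a constant variable (where the coefficient is undefined) is a choice made here for convenience.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Undefined for a constant variable; treated as 0.0 in this sketch.
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_predictors(predictors, target):
    """Rank predictors by the absolute value of their correlation with the target."""
    scores = {name: pearson(vals, target) for name, vals in predictors.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical predictors and target.
predictors = {
    "x_up": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x_down": [10.0, 8.0, 6.0, 4.0, 2.0],
    "x_noise": [3.0, 1.0, 4.0, 1.0, 5.0],
}
target = [2.0, 4.0, 6.0, 8.0, 10.0]

ranking = rank_predictors(predictors, target)
for name, r in ranking:
    print(f"{name}: r = {r:+.3f}")
```

Ranking by absolute value keeps strongly negative relationships near the top alongside strongly positive ones. The Pearson coefficient measures only linear association, so a low score does not by itself rule out a nonlinear relationship.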
