Correlations among input variables

An important step is to identify relationships among input variables. To measure the strength of a linear relationship between two variables, we use the correlation coefficient, a number between -1 and +1. When two variables have a correlation coefficient close to +1, they have a strong positive correlation; a coefficient of exactly +1 indicates a perfect positive linear fit. A positive correlation means that the two variables tend to increase and decrease together. A correlation coefficient close to -1 shows that the two variables have a strong negative correlation: the value of one variable tends to increase when the value of the other decreases. A correlation coefficient close to 0 indicates a weak correlation, meaning that there is little or no linear relationship between the variables; they may still be related in a nonlinear way.
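To make the definition concrete, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The function name `pearson` and the toy sequences are illustrative assumptions, not part of Rattle; the sketch only shows how values near +1, -1, and 0 arise.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

# Hypothetical toy data, chosen only to illustrate the three cases.
base = [1, 2, 3, 4, 5]
rising = [2 * b + 1 for b in base]        # increases with base
falling = [10 - 3 * b for b in base]      # decreases as base increases
symmetric = [(b - 3) ** 2 for b in base]  # related to base, but not linearly

print(pearson(base, rising))     # close to +1
print(pearson(base, falling))    # close to -1
print(pearson(base, symmetric))  # 0: no linear relationship
```

The last case is worth noting: `symmetric` is completely determined by `base`, yet its correlation coefficient is 0, because the coefficient only measures linear relationships.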

Coming back to the Titanic passenger list, I've selected the Explore tab, the Correlation sub-option, and I've clicked on the Execute button, as shown in this screenshot:

[Screenshot: Correlations among input variables]

Of course, each variable has a correlation coefficient of 1.0 with itself. Now look at the variable Pclass (passenger class). This variable has three possible values: 1 (first class), 2 (second class), and 3 (third class). It is a categorical variable because it divides passengers into three groups or categories. Because these categories are ranked, we can treat Pclass as a numeric variable, which lets Rattle compute the correlation between Pclass and the other numeric variables. Look at the correlation coefficient between Fare and Pclass; it is -0.573. Is there any relationship between Fare and Pclass? A correlation coefficient close to -0.6 indicates a moderate negative correlation between the two variables. What does this correlation mean in real life, though? Usually, first class tickets are the most expensive, second class tickets are cheaper, and third class tickets are the cheapest. Still, why is the relationship between Pclass and Fare negative? It is because a higher value of Fare (a higher price) goes with a lower value of Pclass (a higher class).
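The sign of the coefficient depends only on how the classes are numbered. The following sketch uses a handful of made-up passengers (the `pclass` and `fare` values are hypothetical, not taken from the Titanic dataset) and an illustrative `pearson` helper. It shows that the correlation is negative when first class is coded as 1, and that the same data gives the same strength with a positive sign if the coding is reversed.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical passengers: class code (1 = first class) and ticket price.
pclass = [1, 1, 2, 2, 3, 3, 3]
fare = [80.0, 60.0, 26.0, 13.0, 8.0, 7.5, 15.0]

r = pearson(pclass, fare)

# Reverse the coding so that a larger number means a higher class.
class_rank = [4 - p for p in pclass]  # 3 = first class, 1 = third class
r_reversed = pearson(class_rank, fare)

print(r)           # negative: higher fares go with lower Pclass codes
print(r_reversed)  # same magnitude, positive sign
```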

The following chart is a visual representation of the correlation coefficients. By looking at it, you will see that the correlation coefficients are the same as in the previous report. Note that you need to enable the Advanced Graphics option in the Settings menu for this:

[Chart: Correlations among input variables]

The Explore Missing and Hierarchical options

The Explore Missing option will help you to detect relationships between missing values in your dataset, as shown in the following screenshot:

[Screenshot: The Explore Missing and Hierarchical options]

When two variables have a strong correlation in missing values, it means that when the value of one variable is missing, the value of the other variable also tends to be missing.
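One way to picture this is to replace every value with a flag (1 if missing, 0 if present) and then correlate the flags. The sketch below does that in plain Python. The column contents and the helper names `missing_flags` and `pearson` are hypothetical, and this is not necessarily how Rattle computes its report.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def missing_flags(values):
    """Return 1 for each missing value (None) and 0 for each present value."""
    return [1 if v is None else 0 for v in values]

# Hypothetical columns: None marks a missing value.
age = [22, None, 30, None, 41, None]
cabin = ["C85", None, "E46", None, None, None]

r_missing = pearson(missing_flags(age), missing_flags(cabin))
print(r_missing)  # positive: the two columns tend to be missing together
```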

The Hierarchical option uses a tree diagram to represent the correlation between variables graphically.
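The idea behind such a tree can be sketched as follows: start with each variable in its own group, then repeatedly join the two groups that are most strongly correlated. This sketch makes several assumptions that the text does not confirm for Rattle: it uses 1 - |r| as the distance, single linkage between groups, and made-up data. The function `merge_order` is a hypothetical helper that returns the order of the joins, which is what the tree diagram draws.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def merge_order(columns):
    """Join groups of variables step by step, closest pair first.

    Distance between two variables is assumed to be 1 - |r|; distance
    between two groups is the smallest distance between their members
    (single linkage).
    """
    clusters = [(name,) for name in columns]

    def dist(a, b):
        return min(1 - abs(pearson(columns[x], columns[y]))
                   for x in a for y in b)

    merges = []
    while len(clusters) > 1:
        # Find the closest pair of groups.
        d, i, j = min((dist(a, b), i, j)
                      for i, a in enumerate(clusters)
                      for j, b in enumerate(clusters) if i < j)
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical data: Fare and Pclass are strongly related, Age is not.
columns = {
    "Fare": [80.0, 60.0, 26.0, 13.0, 8.0, 7.5],
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [30, 40, 25, 35, 30, 40],
}

merges = merge_order(columns)
for left, right, d in merges:
    print(left, "+", right, "at distance", round(d, 3))
```

With this data, Fare and Pclass are joined first because they are the most strongly correlated pair, and Age joins the tree last, at a much greater distance.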
