Correlations among input variables

An important step is to identify relationships among input variables. To measure the strength of a linear relationship between two variables, we use the correlation coefficient, a number between -1 and +1. When two variables have a correlation coefficient close to +1, they have a strong positive correlation; a coefficient of exactly +1 indicates a perfect positive linear fit. A positive correlation means that the two variables tend to increase and decrease together. A correlation coefficient close to -1 shows that the two variables have a strong negative correlation: the value of one variable tends to increase when the value of the other decreases. A correlation coefficient close to 0 (a weak correlation) means that there is little or no linear relationship between the variables, although they may still be related in a nonlinear way.
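To make these three cases concrete, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The function name pearson and the small input lists are illustrative assumptions, not part of Rattle or of the Titanic dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2 values)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy values chosen only to show the three cases.
x = [1, 2, 3, 4, 5]
r_pos = pearson(x, [2, 4, 6, 8, 10])   # y rises with x: coefficient near +1
r_neg = pearson(x, [10, 8, 6, 4, 2])   # y falls as x rises: coefficient near -1
r_zero = pearson(x, [1, 4, 9, 4, 1])   # symmetric curve: no linear trend

print(r_pos, r_neg, r_zero)
```

The last case is worth noting: the two lists are clearly related, but the relationship is not linear, so the coefficient is zero.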

Coming back to the Titanic passenger list, I've selected the Explore tab, the Correlation sub-option, and I've clicked on the Execute button, as shown in this screenshot:

Of course, each variable has a correlation coefficient of 1.0 with itself. Now look at the variable Pclass (passenger class). This variable has three possible values: 1 (first class), 2 (second class), and 3 (third class). It is a categorical variable because there are three possible groups or categories. Because these categories are ranked, they can be represented by a numeric variable, and in this way Rattle can compute the correlation between Pclass and other numeric variables. Look at the correlation coefficient between Fare and Pclass; it is -0.573. Is there any relationship between Fare and Pclass? A correlation coefficient close to -0.6 indicates a moderate negative correlation between the two variables. What does this correlation between Fare and Pclass mean in real life, though? Usually, first class tickets are the most expensive, second class tickets are cheaper, and third class tickets are the cheapest. Still, why is the relationship between Pclass and Fare negative? It is because a higher value of Fare (a higher price) goes with a lower value of Pclass (a higher class).
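The sign of this coefficient can be reproduced with a small sketch: ranked class labels are mapped to the numbers 1, 2, and 3, and then correlated with a fare. The labels, the class_rank mapping, and the fare values below are invented for illustration; they are not rows from the Titanic passenger list, so the result will not equal -0.573.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical encoding of a ranked categorical variable as numbers.
class_rank = {"first": 1, "second": 2, "third": 3}

# Invented passengers: (class label, fare paid). Not real Titanic data.
passengers = [
    ("first", 80.0),
    ("first", 60.0),
    ("second", 25.0),
    ("second", 20.0),
    ("third", 8.0),
    ("third", 7.5),
    ("third", 10.0),
]

pclass = [class_rank[label] for label, _ in passengers]
fare = [amount for _, amount in passengers]

# Expensive tickets sit at the low class numbers, so the sign is negative.
r_fare_pclass = pearson(pclass, fare)
print(r_fare_pclass)
```

Reversing the encoding (3 for first class, 1 for third class) would flip the sign without changing the strength of the relationship.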

The following chart is a visual representation of the correlation coefficients. By looking at the graph, you will see that the correlation coefficients are the same as in the previous report. Note that you need to enable the Advanced Graphics option inside the Settings menu for this:

The Explore Missing and Hierarchical options

The Explore Missing option will help you to detect relationships between missing values in your dataset, as shown in the following screenshot:

When two variables have a strong correlation in missing values, it means that when the value of a variable is not present, the second variable also tends to have a missing value.
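One way to picture this is to replace each value with an indicator (1 if missing, 0 if present) and correlate the indicators. The sketch below does this in plain Python; it is an assumption about the general idea, not a description of Rattle's internal code, and the column names and values are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def missing_indicator(values):
    """Return 1 for each missing (None) entry and 0 for each present entry."""
    return [1 if value is None else 0 for value in values]


# Invented columns with gaps; None marks a missing value.
age = [22, None, 30, None, 41, 35]
cabin = ["C85", None, "E46", None, None, "B20"]

age_missing = missing_indicator(age)
cabin_missing = missing_indicator(cabin)

# A high positive value means the two columns tend to be missing together.
r_missing = pearson(age_missing, cabin_missing)
print(age_missing, cabin_missing, r_missing)
```

Here every row with a missing age also has a missing cabin, so the coefficient is clearly positive.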

The Hierarchical option uses a tree diagram (a dendrogram) to represent the correlation between variables.
