Chapter 4

Bivariate Descriptive Statistics

Abstract

This chapter discusses the main concepts of bivariate descriptive statistics, which involve two variables. Through tables, charts, and summary measures, it is possible to describe the joint behavior of the variables. In the case of two qualitative variables, the associations between them can be studied through contingency tables and measures of association, such as the chi-square statistic (for nominal and ordinal variables), the Phi coefficient, the contingency coefficient, and Cramer's V coefficient (all of them for nominal variables), and Spearman's coefficient (for ordinal variables). In the case of two quantitative variables, the correlations between them can be studied through joint frequency distribution tables, charts such as scatter plots, and measures of correlation, such as the covariance and Pearson's correlation coefficient. Finally, tables, charts, and summary measures are generated through the IBM SPSS Statistical Software® and the Stata Statistical Software®.

Keywords

Bivariate descriptive statistics; Contingency tables; Frequency distribution tables; Perceptual maps; Scatter plot; Measures of association; Measures of correlation

Numbers rule the world.

Plato

4.1 Introduction

The previous chapter discussed descriptive statistics for a single variable (univariate descriptive statistics). This chapter presents the concepts of descriptive statistics involving two variables (bivariate analysis).

The main objective of a bivariate analysis is, therefore, to study the relationships (associations for qualitative variables and correlations for quantitative variables) between two variables. These relationships can be studied through joint frequency distributions (contingency tables or crossed classification tables, also known as cross tabulations), graphical representations, and summary measures.

Bivariate analysis will be studied in two distinct situations:

  a) When both variables are qualitative;
  b) When both variables are quantitative.

Fig. 4.1 shows the bivariate descriptive statistics that will be studied in this chapter, represented by tables, charts, and summary measures, and presents the following situations:

  a) The descriptive statistics used to represent the joint behavior of two qualitative variables are: (i) joint frequency distribution tables, in this specific case also called contingency tables or crossed classification tables (cross tabulation); (ii) charts, such as the perceptual maps resulting from the correspondence analysis technique (more details can be found in Fávero and Belfiore, 2017); (iii) measures of association, such as the chi-square statistic (used for nominal and ordinal qualitative variables), the Phi coefficient, the contingency coefficient, and Cramer's V coefficient (all of them based on chi-square and used for nominal variables), in addition to Spearman's coefficient (for ordinal qualitative variables).
  b) In the case of two quantitative variables, we are going to use joint frequency distribution tables, graphical representations such as the scatter plot, and measures of correlation, such as the covariance and Pearson's correlation coefficient.
Fig. 4.1 Bivariate descriptive statistics depending on the type of variable.

4.2 Association Between Two Qualitative Variables

The main objective is to assess if there is a relationship between the qualitative or categorical variables studied, in addition to the level of association between them. This can be done through frequency distribution tables, summary measures, such as, the chi-square (used for nominal and ordinal variables), the Phi coefficient, the contingency coefficient, and Cramer's V coefficient (for nominal variables), and Spearman's coefficient (for ordinal variables), in addition to graphical representations, such as, perceptual maps resulting from the correspondence analysis, as presented in Fávero and Belfiore (2017).

4.2.1 Joint Frequency Distribution Tables

The simplest way to summarize a set of data resulting from two qualitative variables is through a joint frequency distribution table, in this specific case called a contingency table, a crossed classification table (cross tabulation), or even a correspondence table. It jointly shows the absolute or relative frequencies of the categories of variable X, represented in the rows, and of variable Y, represented in the columns.

It is common to add the marginal totals to the contingency table; they correspond to the sums over each row (the categories of variable X) and over each column (the categories of variable Y). We are going to illustrate this analysis through an example based on Bussab and Morettin (2011).

Example 4.1

A study was done with 200 individuals trying to analyze the joint behavior of variable X (Health insurance agency) with variable Y (Level of satisfaction). The contingency table showing the variables’ joint absolute frequency distribution, in addition to the marginal totals, is shown in Table 4.E.1. These data are available on the SPSS software in the file HealthInsurance.sav.

Table 4.E.1

Joint Absolute Frequency Distribution of the Variables Being Studied
                Level of Satisfaction
Agency          Dissatisfied   Neutral   Satisfied   Total
Total Health    40             16        12          68
Live Life       32             24        16          72
Mena Health     24             32        4           60
Total           96             72        32          200


The study can also be carried out based on the relative frequencies, as studied in Chapter 3, for univariate problems. Bussab and Morettin (2011) show three ways to illustrate the proportion of each category:

  a) In relation to the general total;
  b) In relation to the total of each row;
  c) In relation to the total of each column.

The choice among these options depends on the objective of the analysis. For example, Table 4.E.2 shows the joint relative frequency distribution of the variables being studied in relation to the general total.

Table 4.E.2

Joint Relative Frequency Distribution of the Variables Being Studied in Relation to the General Total
                Level of Satisfaction
Agency          Dissatisfied   Neutral   Satisfied   Total
Total Health    20%            8%        6%          34%
Live Life       16%            12%       8%          36%
Mena Health     12%            16%       2%          30%
Total           48%            36%       16%         100%


First, we are going to analyze the marginal totals of the rows and columns that provide the unidimensional distributions of each variable. The marginal totals of the rows correspond to the sum of the relative frequencies of each category of the variable Agency and the marginal totals of the columns correspond to the sum of each category of the variable Level of satisfaction. Thus, we can conclude that 34% of the individuals are members of Total Health, 36% of Live Life, and 30% of Mena Health. Analogously, we can conclude that 48% of the individuals are dissatisfied with their health insurance agencies, 36% said they were neutral, and only 16% said they were satisfied.

Regarding the joint relative frequency distribution of the variables being studied (a contingency table), we can state that 20% of the individuals are members of Total Health and are dissatisfied. The same logic is applied to the other categories of the contingency table.

In turn, Table 4.E.3 shows the joint relative frequency distribution of the variables being studied in relation to the total of each row.

Table 4.E.3

Joint Relative Frequency Distribution of the Variables Being Studied in Relation to the Total of Each Row
                Level of Satisfaction
Agency          Dissatisfied   Neutral   Satisfied   Total
Total Health    58.8%          23.5%     17.6%       100%
Live Life       44.4%          33.3%     22.2%       100%
Mena Health     40%            53.3%     6.7%        100%
Total           48%            36%       16%         100%


From Table 4.E.3, we can see that, among the members of Total Health, the proportion of individuals who are dissatisfied is 58.8% (40/68), neutral is 23.5% (16/68), and satisfied is 17.6% (12/68). The proportions in each row add up to 100%. The same logic is applied to the other rows.

Finally, Table 4.E.4 shows the joint relative frequency distribution of the variables being studied in relation to the total of each column.

Therefore, among the dissatisfied individuals, the proportion who are members of Total Health is 41.7% (40/96), of Live Life is 33.3% (32/96), and of Mena Health is 25% (24/96). The proportions in each column add up to 100%. The same logic is applied to the other columns.

Table 4.E.4

Joint Relative Frequency Distribution of the Variables Being Studied in Relation to the Total of Each Column
                Level of Satisfaction
Agency          Dissatisfied   Neutral   Satisfied   Total
Total Health    41.7%          22.2%     37.5%       34%
Live Life       33.3%          33.3%     50%         36%
Mena Health     25%            44.4%     12.5%       30%
Total           100%           100%      100%        100%


Creating Contingency Tables on the SPSS Software

The contingency tables in Example 4.1 will be generated by using SPSS. The use of the images in this chapter has been authorized by the International Business Machines Corporation©.

First, we are going to define the properties of each variable on SPSS. The variables Agency and Level of satisfaction are qualitative, but, initially, they are presented as numbers, as shown in the file HealthInsurance_NoLabel.sav. Thus, labels corresponding to each category of both variables must be created, so that:

Labels of the variable Agency:

  • 1 = Total Health
  • 2 = Live Life
  • 3 = Mena Health

Labels of the variable Level of satisfaction, simply called Satisfaction:

  • 1 = Dissatisfied
  • 2 = Neutral
  • 3 = Satisfied

Therefore, we must click on Data → Define Variable Properties… and select the variables that interest us, as seen in Figs. 4.2 and 4.3.

Fig. 4.2 Defining the properties of the variable on SPSS.
Fig. 4.3 Selecting the variables that interest us.

Next, we must click on Continue. Based on Figs. 4.4 and 4.5, note that the variables Agency and Satisfaction were defined as nominal. This definition can also be made in the Variable View environment. The labels must be created at this point, as shown in Figs. 4.4 and 4.5. After we click on OK, the numbers initially shown in the dataset are replaced by the respective labels. In the file HealthInsurance.sav, the data have already been labeled.

Fig. 4.4 Defining the labels of variable Agency.
Fig. 4.5 Defining the labels of variable Satisfaction.

To create contingency tables (cross tabulation), we are going to click on the menu Analyze → Descriptive Statistics → Crosstabs…, as shown in Fig. 4.6.

Fig. 4.6 Creating contingency tables (cross tabulation) on SPSS.

We are going to select the variable Agency in Row(s) and the variable Satisfaction in Column(s). Next, we must click on Cells, as shown in Fig. 4.7.

Fig. 4.7 Creating a contingency table.

To create contingency tables that represent the joint absolute frequency distribution of the variables observed, as well as the joint relative frequency distributions in relation to the general total, to the total of each row, and to the total of each column (Tables 4.E.1–4.E.4), we must, in the Crosstabs: Cell Display dialog box (opened after we click on Cells…), select the option Observed in Counts and the options Row, Column, and Total in Percentages, as shown in Fig. 4.8. Finally, we are going to click on Continue and OK.

Fig. 4.8 Creating contingency tables from the Crosstabs: Cell Display dialog box.

The contingency table (cross tabulation) generated by SPSS is shown in Fig. 4.9. Note that the data generated are exactly the same as those presented in Tables 4.E.1–4.E.4.

Fig. 4.9 Cross classification table (cross tabulation) generated by SPSS.

Creating Contingency Tables on the Stata Software

In Chapter 3, we learned how to create frequency distribution tables for a single variable on Stata through the command tabulate, or simply tab. In the case of two or more variables, if the objective is to create univariate frequency distribution tables for each variable being analyzed, we must use the command tab1, followed by the list of variables.

The same logic must be applied to create joint frequency distribution tables (contingency tables). To create a contingency table on Stata from the absolute frequencies of the variables being observed, we must use the following syntax:

tabulate variable1* variable2*

or simply:

tab variable1* variable2*

where variable1* and variable2* must be replaced with the names of the respective variables.

If, in addition to the joint absolute frequency distribution of the variables being observed, we want to obtain the joint relative frequency distribution in relation to the total of each row, to the total of each column, and to the general total, we must use the following syntax:

tabulate variable1* variable2*, row column cell

or simply:

tab variable1* variable2*, r co ce

Consider a case with more than two variables being studied, in which the objective is to construct bivariate frequency distribution tables (two-way tables) for all the combinations of variables, two by two. In this case, we must use the command tab2, with the following syntax:

tab2 variables*

where the term variables* should be replaced with the list of variables being considered in the analysis.

Analogously, to obtain both the joint absolute frequency distribution and the joint relative frequency distributions per row, per column, and per general total, we must use the following syntax:

tab2 variables*, r co ce
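
As an illustration, consider a hypothetical dataset containing three categorical variables named agency, satisfaction, and region (the first two exist in the HealthInsurance files used in this chapter; region is assumed here purely for the sake of the example). The command below would then produce the three pairwise contingency tables, each with row, column, and cell percentages:

// Hypothetical example: pairwise (two-way) contingency tables for
// every pair among three categorical variables
tab2 agency satisfaction region, r co ce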

The contingency tables in Example 4.1 will be generated now by using the Stata software. The data are available in the file HealthInsurance.dta.

Hence, to obtain the table of joint absolute frequency distribution, relative frequencies per row, relative frequencies per column, and relative frequencies per general total, the command is:

tab agency satisfaction, r co ce

The results can be seen in Fig. 4.10 and are similar to those presented in Fig. 4.9 (SPSS).

Fig. 4.10 Contingency table constructed on Stata.
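
Incidentally, the same output can be reproduced without opening the data file: Stata's immediate command tabi accepts the cell counts of Table 4.E.1 typed directly, with rows separated by backslashes. A minimal sketch:

// Immediate tabulation from the cell counts of Table 4.E.1,
// with row, column, and cell percentages
tabi 40 16 12 \ 32 24 16 \ 24 32 4, r co ce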

4.2.2 Measures of Association

The main measures that represent the association between two qualitative variables are:

  a) The chi-square statistic (χ2), used for nominal and ordinal qualitative variables;
  b) The Phi coefficient, the contingency coefficient, and Cramer's V coefficient, applied to nominal variables and based on chi-square; and
  c) Spearman's coefficient, used for ordinal variables.

4.2.2.1 Chi-Square Statistic

The chi-square statistic (χ2) measures the discrepancy between the observed contingency table and the expected contingency table, built under the hypothesis that there is no association between the variables studied. If the observed frequency distribution is exactly equal to the expected frequency distribution, the chi-square statistic is zero. Therefore, a low value of χ2 indicates independence between the variables.

The χ2 statistic is given by:

$$\chi^2=\sum_{i=1}^{I}\sum_{j=1}^{J}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}\qquad(4.1)$$

where:

  • $O_{ij}$: observed frequency in the ith category of variable X and the jth category of variable Y;
  • $E_{ij}$: expected frequency in the ith category of variable X and the jth category of variable Y;
  • I: number of categories (rows) of variable X;
  • J: number of categories (columns) of variable Y.

Example 4.2

Calculate the χ2 statistic for Example 4.1.

Solution

Table 4.E.5 shows the observed values of the distribution with the respective relative frequencies in relation to the total of each row. The calculation could also be done in relation to the total of each column, arriving at the same result for the χ2 statistic.

Table 4.E.5

Observed Values of Each Category With the Respective Proportions in Relation to the Total of Each Row
                Level of Satisfaction
Agency          Dissatisfied   Neutral      Satisfied    Total
Total Health    40 (58.8%)     16 (23.5%)   12 (17.6%)   68 (100%)
Live Life       32 (44.4%)     24 (33.3%)   16 (22.2%)   72 (100%)
Mena Health     24 (40%)       32 (53.3%)   4 (6.7%)     60 (100%)
Total           96 (48%)       72 (36%)     32 (16%)     200 (100%)


The data in Table 4.E.5 suggest dependence between the variables. If there were no association between them, we would expect each agency's row to show 48% of its total in the Dissatisfied column, 36% in the Neutral column, and 16% in the Satisfied column. The expected values calculated in this way can be seen in Table 4.E.6. For example, the calculation of the first cell is 0.48 × 68 = 32.64.

Table 4.E.6

Expected Values in Table 4.E.5, Assuming the Nonassociation Between the Variables
                Level of Satisfaction
Agency          Dissatisfied   Neutral      Satisfied    Total
Total Health    32.6 (48%)     24.5 (36%)   10.9 (16%)   68 (100%)
Live Life       34.6 (48%)     25.9 (36%)   11.5 (16%)   72 (100%)
Mena Health     28.8 (48%)     21.6 (36%)   9.6 (16%)    60 (100%)
Total           96 (48%)       72 (36%)     32 (16%)     200 (100%)


To calculate the χ2 statistic, we must apply expression (4.1) to the data in Tables 4.E.5 and 4.E.6. The value of each term $(O_{ij}-E_{ij})^2/E_{ij}$ is shown in Table 4.E.7, jointly with the χ2 measure resulting from the sum over all the categories.

Table 4.E.7

Calculating the χ2 Statistic
                Level of Satisfaction
Agency          Dissatisfied   Neutral   Satisfied
Total Health    1.66           2.94      0.12
Live Life       0.19           0.14      1.74
Mena Health     0.80           5.01      3.27
Total           χ2 = 15.861


As we are going to study in Chapter 9, which discusses hypothesis tests, the significance level α indicates the probability of rejecting a certain hypothesis when it is true. The P-value, on the other hand, represents the probability associated with the observed sample value, indicating the lowest significance level that would lead to the rejection of the assumed hypothesis. In other words, the P-value is a decreasing index of the reliability of a result: the lower the P-value, the less we can believe in the assumed hypothesis.

In the case of the χ2 statistic, whose test assumes no association between the variables being studied, most statistical packages, including SPSS and Stata, calculate the corresponding P-value. Thus, for a confidence level of 95%, if the P-value < 0.05, the hypothesis is rejected, and we can state that there is an association between the variables. On the other hand, if the P-value > 0.05, we conclude that the variables are independent. All of these concepts will be studied in more detail in Chapter 9.

Excel calculates the P-value of the χ2 statistic through the CHITEST or CHISQ.TEST (Excel 2010 and later versions) functions. In order to do that, we just need to select the range of cells corresponding to the observed (real) values and the range of cells with the expected values.

  • Solving the chi-square statistic on the SPSS software

As in Example 4.1, the chi-square statistic (χ2) is calculated on SPSS in the menu Analyze → Descriptive Statistics → Crosstabs…. Once again, we are going to select the variable Agency in Row(s) and the variable Satisfaction in Column(s). Initially, to generate the observed values and the values expected in the case of no association between the variables (data in Tables 4.E.5 and 4.E.6), we must click on Cells… and select the options Observed and Expected in Counts, in the Crosstabs: Cell Display dialog box (Fig. 4.11). In the same box, to generate the adjusted standardized residuals, we must select the option Adjusted standardized in Residuals. The results can be seen in Fig. 4.12.

Fig. 4.11 Creating the contingency table with the observed frequencies, the expected frequencies, and the residuals.
Fig. 4.12 Contingency table with the observed values, the expected values, and the residuals, assuming the nonassociation between the variables.

To calculate the χ2 statistic, in Statistics…, we must select the option Chi-square (Fig. 4.13). Finally, we are going to click on Continue and OK. The result can be seen in Fig. 4.14.

Fig. 4.13 Selecting the χ2 statistic.
Fig. 4.14 Result of the χ2 statistic.

Based on Fig. 4.14, we can see that the value of χ2 is 15.861, the same as the one calculated in Table 4.E.7. We can also observe that the lowest significance level that would lead to the rejection of the hypothesis of no association between the variables (P-value) is 0.003. Since 0.003 < 0.05 (for a confidence level of 95%), the null hypothesis is rejected, which allows us to conclude that there is an association between the variables.

  • Solving the χ2 statistic on the Stata software

In Section 4.2.1, we learned how to create contingency tables on Stata through the command tabulate, or simply tab. Besides the observed frequencies, this command also gives us the expected frequencies through the option expected, or simply exp, as well as the calculation of the χ2 statistic using the option chi2, or simply ch. For the data in Example 4.1 available in the file HealthInsurance.dta, to obtain the observed and expected frequency distribution tables, jointly with the χ2 statistic, we are going to use the following command:

tab agency satisfaction, exp ch

However, the command tab does not allow residuals to be generated in the output. As an alternative, the command tabchi, developed by Nicholas J. Cox as part of the tab_chi module, also calculates the adjusted standardized residuals. In order for this command to be used, we must initially type:

findit tabchi

and install it in the link tab_chi from http://fmwww.bc.edu/RePEc/bocode/t. After doing this, we can type the following command:

tabchi agency satisfaction, a

The result is shown in Fig. 4.15 and is consistent with the SPSS outputs presented in Figs. 4.12 and 4.14. Note that, unlike the command tab, which requires the option exp for the expected frequencies to be generated, the command tabchi reports them automatically.

Fig. 4.15 Result of the χ2 statistic on Stata.
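
Since tabi (introduced in Section 4.2.1) accepts the same options as tabulate, the χ2 statistic and its P-value can also be cross-checked without any data file, directly from the cell counts of Table 4.E.1:

// Chi-square test from the raw cell counts, with expected frequencies;
// this should reproduce chi2 = 15.861 and P-value = 0.003
tabi 40 16 12 \ 32 24 16 \ 24 32 4, chi2 expected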

4.2.2.2 Other Measures of Association Based on Chi-Square

The main measures of association based on the chi-square statistic (χ2) are Phi, Cramer’s V coefficient, and the contingency coefficient (C), all of them applied to nominal qualitative variables.

In general, an association or correlation coefficient is a measure that varies between 0 and 1, presenting value 0 when there is no relationship between the variables, and value 1 when they are perfectly related. We are going to see how each one of the coefficients studied in this section behaves in relation to these characteristics.

  a) Phi Coefficient

The Phi coefficient is the simplest measure of association for nominal variables based on χ2, and it can be expressed as follows:

$$\text{Phi}=\sqrt{\frac{\chi^2}{n}}\qquad(4.2)$$

In order for Phi to vary only between 0 and 1, the contingency table must have a 2 × 2 dimension.

Example 4.3

In order to offer high-quality services and meet its customers' expectations, Ivanblue, a company in the male fashion industry, is investing in market segmentation strategies. Currently, the company has four stores in Campinas, located in the north, center, south, and east regions of the city, and sells four types of clothes: ties, shirts, polo shirts, and pants. Table 4.E.8 shows the purchase data of 20 customers, including the type of clothes purchased and the location of the store. Check if there is an association between the two variables using the Phi coefficient.

Table 4.E.8

Purchase Data of 20 Customers
Customer   Clothes      Region
1          Tie          South
2          Polo shirt   North
3          Shirt        South
4          Pants        North
5          Tie          South
6          Polo shirt   Center
7          Polo shirt   East
8          Tie          South
9          Shirt        South
10         Tie          Center
11         Pants        North
12         Pants        Center
13         Tie          Center
14         Polo shirt   East
15         Pants        Center
16         Tie          Center
17         Pants        South
18         Pants        North
19         Polo shirt   East
20         Shirt        Center

Solution

Using the procedure described in the previous section, the value of the chi-square statistic is χ2 = 18.214. Therefore:

$$\text{Phi}=\sqrt{\frac{\chi^2}{n}}=\sqrt{\frac{18.214}{20}}=0.954$$

Since both variables have four categories (the table is 4 × 4, not 2 × 2), the condition 0 ≤ Phi ≤ 1 is not valid, making it difficult to interpret how strong the association is.

  b) Contingency Coefficient

The contingency coefficient (C), also known as Pearson’s contingency coefficient, is another measure of association for nominal variables based on the χ2 statistic, being represented by the following expression:

$$C=\sqrt{\frac{\chi^2}{n+\chi^2}}\qquad(4.3)$$

where n is the sample size.

The contingency coefficient (C) has as its lowest limit the value 0, indicating that there is no relationship between the variables; however, the highest limit of C varies depending on the number of categories, so:

$$0\le C\le\sqrt{\frac{q-1}{q}}\qquad(4.4)$$

where:

$$q=\min(I,J)\qquad(4.5)$$

where I is the number of rows and J is the number of columns in a contingency table.

When $C=\sqrt{(q-1)/q}$, there is a perfect association between the variables; however, this limit never assumes the value 1. Hence, two contingency coefficients can only be compared if both are defined from tables with the same number of rows and columns.

Example 4.4

Calculate the contingency coefficient (C) for the data in Example 4.3.

Solution

We calculate C as follows:

$$C=\sqrt{\frac{\chi^2}{n+\chi^2}}=\sqrt{\frac{18.214}{20+18.214}}=0.690$$

Since the contingency table is 4 × 4 (q = min(4, 4) = 4), the values that C can assume are in the interval:

$$0\le C\le\sqrt{\tfrac{3}{4}},\quad\text{that is,}\quad 0\le C\le 0.866$$

We can conclude that there is an association between the variables.

  c) Cramer's V Coefficient

Another measure of association for nominal variables based on the χ2 statistic is Cramer's V coefficient, calculated by:

$$V=\sqrt{\frac{\chi^2}{n(q-1)}}\qquad(4.6)$$

where q = min(I, J), as presented in expression (4.5).

For 2 × 2 contingency tables, expression (4.6) reduces to $V=\sqrt{\chi^2/n}$, which corresponds to the Phi coefficient.

Cramer's V coefficient is an alternative to the Phi coefficient and to the contingency coefficient (C), and its value is always limited to the interval [0, 1], regardless of the number of categories in the rows and columns:

$$0\le V\le 1\qquad(4.7)$$

Value 0 indicates that the variables do not have any kind of association and value 1 shows that they are perfectly associated. Therefore, Cramer's V coefficient allows us to compare contingency tables that have different dimensions.

Example 4.5

Calculate Cramer's V coefficient for the data in Example 4.3.

Solution

$$V=\sqrt{\frac{\chi^2}{n(q-1)}}=\sqrt{\frac{18.214}{20\times 3}}=0.551$$

Since 0 ≤ V ≤ 1, there is an association between the variables; however, it is not considered very strong.
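
Since all three coefficients are simple functions of χ2 and n, they can be verified in one step with Stata's display calculator, using the values from Examples 4.3 to 4.5 (χ2 = 18.214, n = 20, q = 4):

// Phi, contingency, and Cramer's V coefficients computed from chi-square
display "Phi        = " sqrt(18.214/20)
display "C          = " sqrt(18.214/(20 + 18.214))
display "Cramer's V = " sqrt(18.214/(20*(4 - 1)))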

  • Solution of Examples 4.3, 4.4, and 4.5 (calculation of the Phi, contingency, and Cramer’s V coefficients) by using SPSS

In Section 4.2.1, we discussed how to create labels that correspond to the variable categories from the menu Data → Define Variable Properties…. The same procedure must be applied to the data in Table 4.E.8 (we cannot forget to define the variables as nominal). The file Market_Segmentation.sav gives us these data already tabulated on SPSS.

Similar to the calculation of the χ2 statistic, calculating the Phi, contingency, and Cramer’s V coefficients on SPSS can also be done on the menu Analyze → Descriptive Statistics → Crosstabs…. We are going to select the variable Clothes in Row(s) and the variable Region in Column(s).

In Statistics…, we are going to select the options Contingency coefficient and Phi and Cramer’s V (Fig. 4.16). Note that these coefficients are calculated for nominal variables. The results of the statistics can be seen in Fig. 4.17.

Fig. 4.16 Selecting the contingency coefficient and Phi and Cramer's V coefficients.
Fig. 4.17 Results of the contingency coefficient and Phi and Cramer's V coefficients.

For all three coefficients, the P-value of 0.033 (0.033 < 0.05) indicates that there is an association between the variables being studied.

  • Solution of Examples 4.3 and 4.5 (calculation of the Phi and Cramer’s V coefficients) by using Stata

Stata calculates the Phi and Cramer’s V coefficients through the command phi. Hence, they are going to be calculated for the data in Example 4.3 available in the file Market_Segmentation.dta.

In order for the phi command to be used, initially, we must type:

findit phi

and install it in the link snp3.pkg from http://www.stata.com/stb/stb3/. After doing this, we can type the following command:

phi clothes region

The results can be seen in Fig. 4.18. Note that the Phi coefficient on Stata is called Cohen’s w. Cramer's V coefficient, on the other hand, is called Cramer's phi-prime.

Fig. 4.18 Calculating the Phi and Cramer's V coefficients on Stata.

4.2.2.3 Spearman’s Coefficient

Spearman's coefficient (rsp) is a measure of association between two ordinal qualitative variables.

Initially, we must sort the values of variable X and of variable Y in ascending order. After sorting the data, it is possible to assign ranks (rankings), denoted by k (k = 1, …, n), separately for each variable. Rank 1 is assigned to the smallest value of the variable, rank 2 to the second smallest value, and so on, up to rank n for the largest value. In the case of a tie between the values in positions k and k + 1, we must assign the mean rank k + 0.5 to both observations.

Calculating Spearman’s coefficient can be done by using the following expression:

$$r_{sp}=1-\frac{6\sum_{k=1}^{n}d_k^2}{n(n^2-1)}\qquad(4.8)$$

where:

  • n: number of observations (pairs of values);
  • $d_k$: difference between the ranks of X and Y for the kth pair of observations.

Spearman's coefficient is a measure that varies between − 1 and 1. If $r_{sp}=1$, all the values of $d_k$ are null, indicating that the rankings of variables X and Y are identical (a perfect positive association). The value $r_{sp}=-1$ is found when $\sum_{k=1}^{n}d_k^2=\frac{n(n^2-1)}{3}$, its maximum value, which happens when the rankings of one variable are the exact reverse of the other's, indicating a perfect negative association. When $r_{sp}=0$, there is no association between variables X and Y. Fig. 4.19 shows a summary of this interpretation.

Fig. 4.19 Interpretation of Spearman's coefficient.

This interpretation is similar to that of Pearson's correlation coefficient, which will be studied in Section 4.3.3.2.

Example 4.6

The coordinator of the Business Administration course is analyzing if there is any kind of association between the grades of 10 students in two different subjects: Simulation and Finance. The data regarding this problem are presented in Table 4.E.9. Calculate Spearman’s coefficient.

Table 4.E.9

Grades in the Subjects Simulation and Finance of the 10 Students Being Analyzed
           Grades
Student    Simulation   Finance
1          4.7          6.6
2          6.3          5.1
3          7.5          6.9
4          5.0          7.1
5          4.4          3.5
6          3.7          4.6
7          8.5          6.8
8          8.2          7.5
9          3.5          4.2
10         4.0          3.3


Solution

To calculate Spearman's coefficient, first, we are going to assign ranks to the values of each variable, as shown in Table 4.E.10.

Table 4.E.10

Ranks in the Subjects Simulation and Finance of the 10 Students
           Rankings
Student    Simulation   Finance   dk    dk²
1          5            6         −1    1
2          7            5         2     4
3          8            8         0     0
4          6            9         −3    9
5          4            2         2     4
6          2            4         −2    4
7          10           7         3     9
8          9            10        −1    1
9          1            3         −2    4
10         3            1         2     4
Sum                                     40


Applying expression (4.8), we have:

$$r_{sp}=1-\frac{6\sum_{k=1}^{n}d_k^2}{n(n^2-1)}=1-\frac{6\times 40}{10\times 99}=0.7576$$

The value of 0.758 indicates a strong positive association between the variables.
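
Expression (4.8) can be verified in the same way, using the sum of the squared rank differences from Table 4.E.10 (sum of dk² = 40, n = 10); a one-line check in Stata:

// Spearman's coefficient from expression (4.8): 1 - 240/990 = 0.7576
display 1 - (6*40)/(10*(10^2 - 1))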

  • Calculating Spearman’s coefficient using SPSS software

The file Grades.sav shows the data from Example 4.6 (the grades in Table 4.E.9) tabulated in an ordinal scale (defined in the Variable View environment).

Similar to the calculation of the χ2 statistic and the Phi, contingency, and Cramer’s V coefficients, Spearman’s coefficient can also be generated by SPSS from the menu Analyze → Descriptive Statistics → Crosstabs…. We are going to select the variable Simulation in Row(s) and the variable Finance in Column(s).

In Statistics…, we are going to select the option Correlations (Fig. 4.20). We are going to click on Continue and then, finally, on OK. The result of Spearman’s coefficient is shown in Fig. 4.21.

Fig. 4.20 Calculating Spearman's coefficient from the Crosstabs: Statistics dialog box.
Fig. 4.21 Result of Spearman's coefficient from the Crosstabs: Statistics dialog box.

The P-value 0.011 < 0.05 (under the hypothesis of nonassociation between the variables) indicates that there is a correlation between the grades in Simulation and Finance, with 95% confidence.

Spearman’s coefficient can also be calculated in the menu Analyze → Correlate → Bivariate…. We must select the variables that interest us, in addition to Spearman’s coefficient, as shown in Fig. 4.22. We are going to click on OK, resulting in Fig. 4.23.

Fig. 4.22 Calculating Spearman's coefficient from the Bivariate Correlations dialog box.
Fig. 4.23 Result of Spearman's coefficient from the Bivariate Correlations dialog box.

  • Calculating Spearman's coefficient by using the Stata software

In Stata, Spearman's coefficient is calculated using the command spearman. Therefore, for the data in Example 4.6, available in the file Grades.dta, we must type the following command:

spearman simulation finance

The results can be seen in Fig. 4.24.

Fig. 4.24 Result of Spearman's coefficient on Stata.

4.3 Correlation Between Two Quantitative Variables

In this section, the main objective is to assess whether there is a relationship between the quantitative variables being studied, as well as the level of correlation between them. This can be done through frequency distribution tables, graphical representations such as scatter plots, and measures of correlation such as the covariance and Pearson's correlation coefficient.

4.3.1 Joint Frequency Distribution Tables

The same procedure presented for qualitative variables can be used to represent the joint distribution of quantitative variables and to analyze the possible relationships between them. Analogous to the univariate case studied in Chapter 3, continuous data whose values rarely repeat themselves can be grouped into class intervals.

4.3.2 Graphical Representation Through a Scatter Plot

The correlation between two quantitative variables can be represented in a graphical way through a scatter plot. It graphically represents the values of variables X and Y in a Cartesian plane. Therefore, a scatter plot allows us to assess:

  a) Whether there is any relationship between the variables being studied or not;
  b) The type of relationship between the two variables, that is, the direction in which variable Y increases or decreases depending on changes in X;
  c) The level of relationship between the variables;
  d) The nature of the relationship (linear, exponential, among others).

Fig. 4.25 shows a scatter plot in which the relationship between variables X and Y is strong positive linear, that is, variations in Y are directly proportional to variations in X. The level of relationship between the variables is strong and the nature is linear.

Fig. 4.25 Strong positive linear relationship.

If all the points are contained in a straight line, we have a case in which the relationship is perfect linear, as shown in Fig. 4.26.

Fig. 4.26 Perfect positive linear relationship.

Figs. 4.27 and 4.28, on the other hand, show a scatter plot in which the relationship between variables X and Y is strong negative linear and perfect negative linear, respectively.

Fig. 4.27 Strong negative linear relationship.
Fig. 4.28 Perfect negative linear relationship.

Finally, we may now have a case in which there is no relationship between variables X and Y, as shown in Fig. 4.29.

Fig. 4.29 There is no relationship between variables X and Y.

  • Constructing a scatter plot on SPSS

Example 4.7

Let us open the file Income_Education.sav on SPSS. The objective is to analyze the correlation between the variables Family Income and Years of Education through a scatter plot. In order to do that, we are going to click on Graphs → Legacy Dialogs → Scatter/Dot… (Fig. 4.30). In the Scatter/Dot window (Fig. 4.31), we are going to select the type of chart (Simple Scatter). Clicking on Define, the Simple Scatterplot dialog box will open, as shown in Fig. 4.32. We are going to select the variable FamilyIncome for the Y-axis and the variable YearsofEducation for the X-axis. Next, we are going to click on OK. The scatter plot created is shown in Fig. 4.33.

Fig. 4.30 Constructing a scatter plot on SPSS.
Fig. 4.31 Selecting the type of chart.
Fig. 4.32 Simple Scatterplot dialog box.
Fig. 4.33 Scatter plot of the variables Family Income and Years of Education.

Based on Fig. 4.33, we can see a strong positive correlation between the variables Family Income and Years of Education. Therefore, the higher the number of years of education, the higher the family income tends to be, even though this alone does not establish a cause and effect relationship.

The scatter plot can also be created in Excel by selecting the option Scatter.

  • Constructing a scatter plot on Stata

The data from Example 4.7 are also available on Stata from the file Income_Education.dta. The variables being studied are called income and education.

The scatter plot on Stata is created using the command twoway scatter (or simply tw sc) followed by the variables we are interested in. Thus, to analyze the correlation between the variables Family Income and Years of Education through a scatter plot on Stata, we must type the following command:

tw sc income education

The resulting scatter plot is shown in Fig. 4.34.

Fig. 4.34 Scatter plot on Stata.
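
If desired, the axes of the chart can be labeled explicitly through standard twoway options (the titles below are merely illustrative):

// Scatter plot with explicit axis titles
tw sc income education, ytitle("Family income") xtitle("Years of education")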

4.3.3 Measures of Correlation

The main measures of correlation, used for quantitative variables, are the covariance and Pearson’s correlation coefficient.

4.3.3.1 Covariance

Covariance measures the joint variation between two quantitative variables X and Y, and it is calculated by using the following expression:

$$\mathrm{cov}_{XY}=\frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{n-1}\qquad(4.9)$$

where:

  • $X_i$: ith value of X;
  • $Y_i$: ith value of Y;
  • $\bar{X}$: mean of the values of $X_i$;
  • $\bar{Y}$: mean of the values of $Y_i$;
  • n: sample size.

One of the limitations of the covariance is that its magnitude depends on the measurement units of the variables, which makes the measure difficult to interpret and to compare across different pairs of variables. Pearson's correlation coefficient is an alternative that overcomes this problem.

Example 4.8

Once again, consider the data in Example 4.7 regarding the variables Family Income and Years of Education. The data are also available in Excel, in the file Income_Education.xls. Calculate the covariance between the two variables.

Solution

Applying expression (4.9), we have:

$$\mathrm{cov}_{XY}=\frac{(7.6-7.08)(1961-1856.22)+\cdots+(5.4-7.08)(775-1856.22)}{95}=\frac{72{,}326.93}{95}=761.336$$

The covariance can be calculated in Excel by using the COVARIANCE.S (sample) function.

In the following section, we are also going to discuss how the covariance can be calculated on SPSS, jointly with Pearson’s correlation coefficient. SPSS considers the same expression presented in this section.
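
On Stata, the command corr with the option cov computes exactly expression (4.9), as we are going to see at the end of this section. A minimal sketch, using a hypothetical three-observation dataset typed by hand, illustrates the mechanics:

// Hypothetical toy data: covariance according to expression (4.9);
// cov(x,y) = [(1-2)(2-11/3) + (2-2)(4-11/3) + (3-2)(5-11/3)]/2 = 1.5
clear
input x y
1 2
2 4
3 5
end
corr x y, cov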

4.3.3.2 Pearson’s Correlation Coefficient

Pearson's correlation coefficient (ρ) is a measure that varies between − 1 and 1. Its sign indicates the type of linear relationship between the two variables analyzed (the direction in which variable Y increases or decreases as X changes), and the closer it is to the extreme values, the stronger the linear correlation between them. Therefore:

  •  If ρ is positive, there is a directly proportional relationship between the variables; if ρ = 1, we have a perfect positive linear correlation.
  •  If ρ is negative, there is an inversely proportional relationship between the variables; if ρ = − 1, we have a perfect negative linear correlation.
  •  If ρ is null, there is no linear correlation between the variables.

Fig. 4.35 shows a summary of the interpretation of Pearson’s correlation coefficient.

Fig. 4.35 Interpretation of Pearson's correlation coefficient.

Pearson’s correlation coefficient (ρ) can be calculated as a ratio between the covariance of two variables and the product of the standard deviations (S) of each one of them:

$$\rho=\frac{\mathrm{cov}_{XY}}{S_X S_Y}=\frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{(n-1)\,S_X S_Y}\qquad(4.10)$$

Since $S_X=\sqrt{\frac{\sum_{i=1}^{n}(X_i-\bar{X})^2}{n-1}}$ and $S_Y=\sqrt{\frac{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}{n-1}}$, as we studied in Chapter 3, expression (4.10) becomes:

$$\rho=\frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2}\sqrt{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}}\qquad(4.11)$$

We are going to make extensive use of Pearson's correlation coefficient in Chapter 12, when studying factor analysis.

Example 4.9

Once again, open the file Income_Education.xls and calculate Pearson’s correlation coefficient between the two variables.

Solution

Calculating Pearson’s correlation coefficient through expression (4.10) is as follows:

$$\rho=\frac{\mathrm{cov}_{XY}}{S_X S_Y}=\frac{761.336}{970.774\times 1.009}=0.777$$

This calculation could also be done directly through expression (4.11), without computing the covariance and the standard deviations separately. The result indicates a strong positive correlation between the variables Family Income and Years of Education.

Excel also calculates Pearson’s correlation coefficient through the PEARSON function.
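
As a final arithmetic check, expression (4.10) can be evaluated directly from the covariance obtained in Example 4.8 and the two standard deviations reported earlier; a one-line check in Stata:

// Pearson's coefficient: covariance over the product of the standard
// deviations, 761.336/(970.774*1.009) = 0.777
display 761.336/(970.774*1.009)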

  • Solution of Examples 4.8 and 4.9 (calculation of the covariance and Pearson’s correlation coefficient) on SPSS

Once again, open the file Income_Education.sav. To calculate the covariance and Pearson's correlation coefficient on SPSS, we are going to click on Analyze → Correlate → Bivariate…. The Bivariate Correlations window will open. We are going to select the variables Family Income and Years of Education, in addition to Pearson's correlation coefficient, as shown in Fig. 4.36. In Options…, we must select the option Cross-product deviations and covariances, according to Fig. 4.37. We are going to click on Continue and then on OK. The results of the statistics are presented in Fig. 4.38.

Fig. 4.36 Bivariate Correlations dialog box.
Fig. 4.37 Selecting the covariance statistic.
Fig. 4.38 Results of the covariance and of Pearson's correlation coefficient on SPSS.

Analogous to Spearman's coefficient, Pearson’s correlation coefficient can also be generated on SPSS from the menu Analyze → Descriptive Statistics → Crosstabs… (option Correlations in the Statistics button).

  • Solution of Examples 4.8 and 4.9 (calculation of the covariance and Pearson's correlation coefficient) on Stata

To calculate Pearson’s correlation coefficient on Stata, we must use the command correlate, or simply corr, followed by the list of variables we are interested in. The result is the correlation matrix between the respective variables.

Once again, open the file Income_Education.dta. Thus, for the data in this file, we can type the following command:

corr income education

The result can be seen in Fig. 4.39.

Fig. 4.39 Calculating Pearson's correlation coefficient on Stata.

To calculate the covariance, we must use the option covariance, or simply cov, at the end of the command correlate (or simply corr). Thus, to generate Fig. 4.40, we must type the following command:

corr income education, cov

Fig. 4.40 Calculating the covariance on Stata.

4.4 Final Remarks

This chapter presented the main concepts of descriptive statistics with greater focus on the study of the relationship between two variables (bivariate analysis). We studied the relationships between two qualitative variables (associations) and between two quantitative variables (correlations). For each situation, several measures, tables, and charts were presented, which allow us to have a better understanding of the data behavior. Fig. 4.1 summarizes this information.

The construction and interpretation of frequency distributions, graphical representations, in addition to summary measures (measures of position or location and measures of dispersion or variability), allow the researcher to have a better understanding and visualization of the data behavior for two variables simultaneously. More advanced techniques can be applied in the future to the same set of data, so that researchers can go deeper in their studies on bivariate analysis, aiming at improving the quality of the decision making process.

4.5 Exercises

  1) Which descriptive statistics can be used (and in which situations) to represent the behavior of two qualitative variables simultaneously?
  2) And to represent the behavior of two quantitative variables?
  3) In what situations should we use contingency tables?
  4) What are the differences between the chi-square statistic (χ2), the Phi coefficient, the contingency coefficient (C), Cramer's V coefficient, and Spearman's coefficient?
  5) What are the main summary measures used to represent the data behavior between two quantitative variables? Describe each one of them.
  6) Aiming at identifying the behavior of customers who are in default on their payments, a survey collecting information on the age and level of default of the respondents was carried out. The objective is to determine if there is an association between the variables. Based on the files Default.sav and Default.dta, we would like you to:
    a) Create the joint frequency distribution tables for the variables age_group and default (absolute frequencies, relative frequencies in relation to the general total, relative frequencies in relation to the total of each row, relative frequencies in relation to the total of each column, and the expected frequencies).
    b) Determine the percentage of individuals who are between 31 and 40 years of age.
    c) Determine the percentage of individuals who are heavily indebted.
    d) Determine the percentage of respondents who are 20 years old or younger and do not have debts.
    e) Determine, among the individuals who are older than 60, the percentage of those who are a little indebted.
    f) Determine, among the individuals who are relatively indebted, the percentage of those who are between 41 and 50 years old.
    g) Verify if there are indications of dependence between the variables.
    h) Confirm the previous item using the χ2 statistic.
    i) Calculate the Phi, contingency, and Cramer's V coefficients, confirming whether there is an association between the variables or not.
  7) The files Motivation_Companies.sav and Motivation_Companies.dta show a database with the variables Company and Level of Motivation (Motivation), obtained through a survey carried out with 250 employees (50 respondents from each of the 5 companies surveyed), aiming at assessing the employees' level of motivation in relation to the companies, considered to be large firms. Hence, we would like you to:
    a) Create the contingency tables of absolute frequencies, relative frequencies in relation to the general total, relative frequencies in relation to the total of each row, relative frequencies in relation to the total of each column, and the expected frequencies.
    b) Calculate the percentage of respondents who are very demotivated.
    c) Calculate the percentage of respondents who are from Company A and are very demotivated.
    d) Calculate the percentage of motivated respondents in Company D.
    e) Calculate the percentage of slightly motivated respondents in Company C.
    f) Among the respondents who are very motivated, determine the percentage of those who work for Company B.
    g) Verify if there are indications of dependence between the variables.
    h) Confirm the previous item using the χ2 statistic.
    i) Calculate the Phi, contingency, and Cramer's V coefficients, confirming whether there is an association between the variables or not.
  8) The files Students_Evaluation.sav and Students_Evaluation.dta show the grades, from 0 to 10, of 100 students from a public university in the following subjects: Operational Research, Statistics, Operations Management, and Finance. Check whether there is a correlation between the following pairs of variables, constructing the scatter plot and calculating Pearson's correlation coefficient:
    a) Operational Research and Statistics;
    b) Operations Management and Finance;
    c) Operational Research and Operations Management.
  9) The files Brazilian_Supermarkets.sav and Brazilian_Supermarkets.dta show revenue data and the number of stores of the 20 largest Brazilian supermarket chains in a given year (source: ABRAS, the Brazilian Association of Supermarkets). We would like you to:
    a) Create the scatter plot for the variables revenue × number of stores.
    b) Calculate Pearson's correlation coefficient between the two variables.
    c) Exclude the four largest supermarket chains in terms of revenue, as well as the chain AM/PM Food and Beverages Ltd., and once again create the scatter plot.
    d) Once again, calculate Pearson's correlation coefficient between the two variables being studied.

References

Bussab W.O., Morettin P.A. Estatística básica. seventh ed. São Paulo: Saraiva; 2011.

Fávero L.P., Belfiore P. Manual de análise de dados: estatística e modelagem multivariada com Excel®, SPSS® e Stata®. Rio de Janeiro: Elsevier; 2017.


"To view the full reference list for the book, click here"

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset