This chapter discusses the main concepts of bivariate descriptive statistics, which involves two variables. Through tables, charts, and/or summary measures, it is possible to describe the behavior of the variables. In the case of two qualitative variables, the associations between them can be studied through contingency tables and measures of association, such as the chi-square statistic (for nominal and ordinal variables), the Phi coefficient, the contingency coefficient, and Cramer's V coefficient (all of them for nominal variables), and Spearman's coefficient (for ordinal variables). In the case of two quantitative variables, the correlations between them can be studied through joint frequency distribution tables, charts such as scatter plots, and measures of correlation, such as the covariance and Pearson's correlation coefficient. Finally, tables, charts, and summary measures are generated through the IBM SPSS Statistical Software® and Stata Statistical Software®.
Bivariate descriptive statistics; Contingency tables; Frequency distribution tables; Perceptual maps; Scatter plot; Measures of association; Measures of correlation
Numbers rule the world.
Plato
The previous chapter discussed descriptive statistics for a single variable (univariate descriptive statistics). This chapter presents the concepts of descriptive statistics involving two variables (bivariate analysis).
Therefore, the main objective of a bivariate analysis is to study the relationships (associations for qualitative variables and correlations for quantitative variables) between two variables. These relationships can be studied through the joint distribution of frequencies (contingency tables or cross-classification tables—cross tabulation), through graphical representations, and through summary measures.
Bivariate analysis will be studied in two distinct situations:
Fig. 4.1 shows the bivariate descriptive statistics that will be studied in this chapter, represented by tables, charts, and summary measures, and presents the following situations:
The main objective is to assess whether there is a relationship between the qualitative or categorical variables studied, in addition to the level of association between them. This can be done through frequency distribution tables; summary measures, such as the chi-square statistic (used for nominal and ordinal variables), the Phi coefficient, the contingency coefficient, and Cramer's V coefficient (for nominal variables), and Spearman's coefficient (for ordinal variables); and graphical representations, such as perceptual maps resulting from correspondence analysis, as presented in Fávero and Belfiore (2017).
The simplest way to summarize a set of data resulting from two qualitative variables is through a joint frequency distribution table, which in this specific case is called a contingency table, a cross-classification table (cross tabulation), or a correspondence table. It jointly shows the absolute or relative frequencies of the categories of variable X, represented in the rows, and of variable Y, represented in the columns.
It is common to add the marginal totals to the contingency table, which correspond to the sum of variable X’s rows and to the sum of variable Y’s columns. We are going to illustrate this analysis through an example based on Bussab and Morettin (2011).
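As an illustration of how such a table can be assembled, the sketch below uses Python with pandas to build a contingency table with marginal totals and a table of relative frequencies. The variable names and values are hypothetical and are not the data from the example mentioned above.

```python
import pandas as pd

# Illustrative data: two qualitative variables observed on the same individuals
# (the categories and values below are hypothetical, not the book's dataset).
data = pd.DataFrame({
    "education": ["Elementary", "High School", "College", "High School",
                  "College", "Elementary", "College", "High School"],
    "region":    ["Capital", "Interior", "Capital", "Capital",
                  "Interior", "Interior", "Capital", "Interior"],
})

# Contingency (cross tabulation) table with marginal totals for rows and columns.
absolute = pd.crosstab(data["education"], data["region"],
                       margins=True, margins_name="Total")

# Relative frequencies (proportions of the grand total).
relative = pd.crosstab(data["education"], data["region"], normalize="all")

print(absolute)
print(relative.round(3))
```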
The main measures that represent the association between two qualitative variables are:
The chi-square statistic (χ2) measures the discrepancy between the observed contingency table and the expected contingency table, built under the hypothesis that there is no association between the variables studied. If the observed frequency distribution is exactly equal to the expected frequency distribution, the chi-square statistic is zero. Therefore, a low value of χ2 indicates independence between the variables.
Statistic χ2 is given by:

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{\left(O_{ij} - E_{ij}\right)^2}{E_{ij}}$$

where:
O_ij is the observed frequency in cell (i, j) of the contingency table;
E_ij is the expected frequency in cell (i, j) under the hypothesis of independence, calculated as the product of the corresponding row and column totals divided by the total number of observations;
r and c are the numbers of rows and columns of the table, respectively.
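The chi-square statistic can be computed directly from an observed contingency table. The sketch below, in Python, derives the expected frequencies under independence and applies the formula above; the table is hypothetical, and scipy.stats.chi2_contingency is used only as a cross-check.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed contingency table (hypothetical 3 x 2 example).
observed = np.array([
    [20, 10],
    [15, 25],
    [ 5, 25],
])

# Expected frequencies under independence: e_ij = (row total * column total) / n.
n = observed.sum()
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells.
chi2 = ((observed - expected) ** 2 / expected).sum()
print(f"chi-square = {chi2:.3f}")

# Cross-check with SciPy (correction=False keeps the plain Pearson chi-square).
chi2_scipy, p_value, dof, expected_scipy = chi2_contingency(observed, correction=False)
print(f"scipy chi-square = {chi2_scipy:.3f}, degrees of freedom = {dof}")
```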
The main measures of association based on the chi-square statistic (χ2) are Phi, Cramer’s V coefficient, and the contingency coefficient (C), all of them applied to nominal qualitative variables.
In general, an association or correlation coefficient is a measure that varies between 0 and 1, presenting value 0 when there is no relationship between the variables, and value 1 when they are perfectly related. We are going to see how each one of the coefficients studied in this section behaves in relation to these characteristics.
The Phi coefficient is the simplest measure of association for nominal variables based on χ2, and it can be expressed as follows:

$$\phi = \sqrt{\frac{\chi^2}{n}}$$

where n is the total number of observations.
In order for Phi to vary only between 0 and 1, it is necessary for the contingency table to have a 2 x 2 dimension.
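Since Phi, the contingency coefficient (C), and Cramer's V are all functions of χ2 and the sample size, they can be obtained in a few lines once the statistic is known. The sketch below uses the standard definitions Phi = √(χ²/n), C = √(χ²/(χ² + n)), and V = √(χ²/(n·(min(r, c) − 1))), applied to a hypothetical 2 × 2 table.

```python
import numpy as np
from scipy.stats import chi2_contingency

def association_measures(observed):
    """Phi, contingency coefficient, and Cramer's V from a contingency table."""
    observed = np.asarray(observed, dtype=float)
    chi2, _, _, _ = chi2_contingency(observed, correction=False)
    n = observed.sum()
    r, c = observed.shape
    phi = np.sqrt(chi2 / n)                            # Phi coefficient
    contingency = np.sqrt(chi2 / (chi2 + n))           # contingency coefficient C
    cramers_v = np.sqrt(chi2 / (n * (min(r, c) - 1)))  # Cramer's V
    return phi, contingency, cramers_v

# Hypothetical 2 x 2 table, for which Phi is bounded between 0 and 1.
table = [[30, 10],
         [15, 45]]
phi, c_coef, v = association_measures(table)
print(f"Phi = {phi:.3f}, C = {c_coef:.3f}, Cramer's V = {v:.3f}")
```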
Spearman's coefficient (rsp) is a measure of association between two ordinal qualitative variables.
Initially, we must sort the data of variable X and of variable Y in ascending order. After sorting, ranks, denoted by k (k = 1, …, n), are assigned separately for each variable: rank 1 is assigned to the smallest value of the variable, rank 2 to the second smallest value, and so on, up to rank n for the largest value. In case of a tie between positions k and k + 1, we assign the rank k + 1/2 (the average of the two tied ranks) to both observations.
Calculating Spearman’s coefficient can be done by using the following expression:

$$r_{sp} = 1 - \frac{6\sum_{k=1}^{n} d_k^2}{n\left(n^2 - 1\right)}$$

where:
d_k is the difference between the ranks assigned to observation k in variables X and Y;
n is the number of observations (pairs of ranks).
Spearman's coefficient is a measure that varies between − 1 and 1. If rsp = 1, all the values of d_k are null, indicating that the rankings of variables X and Y are identical (perfect positive association). The value rsp = − 1 is found when Σ d_k² reaches its maximum value (there is a complete inversion of the variables' rankings), indicating a perfect negative association. When rsp = 0, there is no association between variables X and Y. Fig. 4.19 shows a summary of this interpretation.
This interpretation is similar to that of Pearson's correlation coefficient, which will be studied in Section 4.3.3.2.
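The ranking procedure and the expression above can be reproduced as in the sketch below; the scores are hypothetical, ties are averaged as described, and scipy.stats.spearmanr is used as a cross-check (the rank-difference formula is exact only when there are no ties).

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical ordinal scores given by two evaluators to the same n = 6 items.
x = pd.Series([2, 4, 1, 3, 6, 5])
y = pd.Series([1, 4, 2, 3, 5, 6])

# Ranks in ascending order; method="average" assigns the mean rank to ties,
# as described in the text (rank k + 1/2 for a tie between positions k and k + 1).
rank_x = x.rank(method="average")
rank_y = y.rank(method="average")

# Spearman's coefficient via the rank differences d_k.
d = rank_x - rank_y
n = len(x)
r_sp = 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))
print(f"Spearman (rank-difference formula) = {r_sp:.3f}")

# Cross-check with SciPy.
rho_scipy, _ = spearmanr(x, y)
print(f"Spearman (scipy) = {rho_scipy:.3f}")
```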
In this section, the main objective is to assess whether there is a relationship between the quantitative variables being studied, as well as the level of correlation between them. This can be done through frequency distribution tables, graphical representations such as scatter plots, and measures of correlation such as the covariance and Pearson's correlation coefficient.
The same procedure presented for qualitative variables can be used to represent the joint distribution of quantitative variables and to analyze the possible relationships between them. Analogous to univariate descriptive statistics, continuous data that do not repeat themselves with a certain frequency can be grouped into class intervals, as sketched below.
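A minimal sketch of this grouping, assuming two hypothetical continuous variables, is shown below: each variable is divided into class intervals and the joint frequencies are then cross-tabulated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical continuous variables (e.g., income and spending) for 200 observations.
income = rng.normal(50, 10, size=200)
spending = 0.6 * income + rng.normal(0, 5, size=200)

# Group each variable into class intervals and build the joint frequency table.
income_classes = pd.cut(income, bins=4)
spending_classes = pd.cut(spending, bins=4)
joint_table = pd.crosstab(income_classes, spending_classes,
                          margins=True, margins_name="Total")
print(joint_table)
```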
The correlation between two quantitative variables can be represented in a graphical way through a scatter plot. It graphically represents the values of variables X and Y in a Cartesian plane. Therefore, a scatter plot allows us to assess:
Fig. 4.25 shows a scatter plot in which the relationship between variables X and Y is strong, positive, and linear; that is, variations in Y are directly proportional to variations in X. The level of the relationship between the variables is strong and its nature is linear.
If all the points are contained in a straight line, we have a case in which the relationship is perfect linear, as shown in Fig. 4.26.
Figs. 4.27 and 4.28, on the other hand, show scatter plots in which the relationship between variables X and Y is strong negative linear and perfect negative linear, respectively.
Finally, there may also be a case in which there is no relationship between variables X and Y, as shown in Fig. 4.29.
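The patterns discussed in Figs. 4.25–4.29 can be reproduced with simulated data, as in the sketch below (the data are artificial and serve only to illustrate a strong positive, a strong negative, and a nonexistent relationship).

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Strong positive linear relationship: Y grows roughly proportionally to X.
axes[0].scatter(x, 2 * x + rng.normal(0, 1.5, x.size))
axes[0].set_title("Strong positive linear")

# Strong negative linear relationship: Y decreases as X increases.
axes[1].scatter(x, -2 * x + rng.normal(0, 1.5, x.size))
axes[1].set_title("Strong negative linear")

# No relationship: Y does not vary systematically with X.
axes[2].scatter(x, rng.normal(0, 1.5, x.size))
axes[2].set_title("No relationship")

for ax in axes:
    ax.set_xlabel("X")
    ax.set_ylabel("Y")

plt.tight_layout()
plt.show()
```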
The main measures of correlation, used for quantitative variables, are the covariance and Pearson’s correlation coefficient.
Covariance measures the joint variation between two quantitative variables X and Y, and it is calculated by using the following expression:

$$cov(X, Y) = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{n - 1}$$

where:
x_i and y_i are the values of variables X and Y for observation i;
\bar{x} and \bar{y} are the means of variables X and Y, respectively;
n is the number of observations.
One of the limitations of the covariance is that the measure depends on the sample size, which may lead to a poor estimate in the case of small samples. Pearson's correlation coefficient is an alternative that addresses this problem.
Pearson’s correlation coefficient (ρ) is a measure that varies between − 1 and 1. Through the sign, it is possible to verify the type of linear relationship between the two variables analyzed (the direction in which variable Y increases or decreases depending on how X changes); the closer it is to the extreme values, the stronger the correlation between them. Therefore:
Fig. 4.35 shows a summary of the interpretation of Pearson’s correlation coefficient.
Pearson’s correlation coefficient (ρ) can be calculated as the ratio between the covariance of the two variables and the product of the standard deviations (S) of each one of them:

$$\rho = \frac{cov(X, Y)}{S_X \cdot S_Y} \quad (4.10)$$
Since $cov(X, Y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}$ and $S_X = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}$, $S_Y = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n - 1}}$, as we studied in Chapter 3, expression (4.10) becomes:

$$\rho = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2 \cdot \sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}}$$
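A short numerical sketch of expression (4.10) and of the expanded form above, with hypothetical values for X and Y, follows; np.corrcoef is used only as a check.

```python
import numpy as np

# Hypothetical paired observations of two quantitative variables.
x = np.array([10.0, 12.0, 15.0, 18.0, 20.0, 23.0])
y = np.array([ 8.0, 11.0, 13.0, 17.0, 18.0, 24.0])

n = len(x)
x_mean, y_mean = x.mean(), y.mean()

# Sample covariance: sum of cross-deviations divided by (n - 1).
cov_xy = ((x - x_mean) * (y - y_mean)).sum() / (n - 1)

# Sample standard deviations (ddof=1 for the (n - 1) denominator).
s_x, s_y = x.std(ddof=1), y.std(ddof=1)

# Pearson's correlation coefficient: covariance over the product of standard deviations.
rho = cov_xy / (s_x * s_y)

print(f"cov(X, Y) = {cov_xy:.3f}")
print(f"Pearson rho = {rho:.3f}")
print(f"numpy check = {np.corrcoef(x, y)[0, 1]:.3f}")
```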
In Chapter 12, we are going to make extensive use of Pearson's correlation coefficient when studying factor analysis.
This chapter presented the main concepts of descriptive statistics with greater focus on the study of the relationship between two variables (bivariate analysis). We studied the relationships between two qualitative variables (associations) and between two quantitative variables (correlations). For each situation, several measures, tables, and charts were presented, which allow us to have a better understanding of the data behavior. Fig. 4.1 summarizes this information.
The construction and interpretation of frequency distributions, graphical representations, and summary measures (measures of position or location and measures of dispersion or variability) allow the researcher to better understand and visualize the behavior of the data for two variables simultaneously. More advanced techniques can be applied to the same dataset in the future, so that researchers can deepen their studies on bivariate analysis, aiming at improving the quality of the decision-making process.