This chapter presents the binary and multinomial logistic regression models, establishing the circumstances under which each can be used. The objective is to estimate a model for the probability of occurrence of an event based on the maximum likelihood method. The results of the statistical tests pertinent to logistic models are evaluated. Confidence intervals of the model parameters for the purpose of prediction are also elaborated, as well as sensitivity analysis and the interpretation of the sensitivity curve, the ROC curve and the cutoff concept, overall model efficiency, sensitivity, and specificity. The binary and multinomial regression models are also estimated in Microsoft Office Excel®, Stata Statistical Software®, and IBM SPSS Statistics Software®, and their results are interpreted.
Binary logistic regression; Multinomial logistic regression; Probability of event occurrence; Odds; Estimation by maximum likelihood; Cutoff; Sensitivity analysis; Overall model efficiency; Sensitivity; Specificity; Excel; Stata; SPSS
In the fields of observation, chance favors only the mind that is prepared.
Louis Pasteur
The logistic regression models, even though quite useful and easy to apply, are still little used in many areas of human knowledge. Even though the development of software and the increase in computer processing capability have made their application more direct, many researchers still do not know their usefulness and, above all, the conditions for their correct use.
Unlike the traditional regression technique estimated by the ordinary least squares method, where the dependent variable is quantitative and certain presuppositions must be obeyed, as we studied in the previous chapter, the techniques of logistic regression are used when the phenomenon to be studied (outcome variable) is qualitative and, therefore, is represented by one or more dummy variables, depending on the number of possible answers (categories) of this dependent variable.
Imagine, for example, that a researcher is interested in evaluating the probability of heart attack among financial market executives, based on their physical characteristics (weight, waistline), their eating habits, and their health habits (physical exercise, smoking). A second researcher wants to evaluate the chance that consumers who acquire durable goods in a given period will go into default, based on the income, marital status, and educational level of each. Notice that heart attack and default are the dependent variables in the two cases, events that may or may not occur as a function of the explanatory variables inserted into the respective models; each is therefore represented by a qualitative dichotomous variable. Our intent is to estimate the probability of occurrence of these phenomena and, therefore, we will use the binary logistic regression.
Imagine now that a third researcher is interested in studying the probability of obtaining credit by small- and medium-sized companies, due to their financial and operational characteristics. It is known that each company can receive unrestricted credit, restricted credit, or no credit at all. In this case, the dependent variable that represents the phenomenon is also qualitative, but offers three possible answers (categories). Therefore, to estimate the probability of the alternative proposals occurring, we should use the multinomial logistic regression.
Thus, if the phenomenon under study presents two, and only two, categories, it will be represented by a single dummy variable: the first category will be the reference and indicate the event of noninterest (dummy = 0), and the other category will indicate the event of interest (dummy = 1), in which case we are dealing with the binary logistic regression technique. On the other hand, if the phenomenon under study presents more than two categories as occurrence possibilities, we must first define the reference category and then estimate the multinomial logistic regression model.
With a qualitative variable as the phenomenon to be studied, estimation by the ordinary least squares method, as studied in the previous chapter, is not viable, since the numeric codes of this dependent variable are arbitrary and there is no way to minimize the sum of squared error terms without imposing an incoherent, arbitrary weighting. Since this dependent variable is entered into modeling software by typing in the values that represent each of the answer possibilities, it is common to forget to define the category labels that correspond to each entered value; an unwary or beginning researcher may thus estimate the model by least squares regression, even obtaining outputs, since the software will interpret the dependent variable as quantitative. This serious mistake is, unfortunately, more common than one would think! The binary and multinomial logistic regression techniques are elaborated based on estimation by maximum likelihood, to be studied in Sections 14.2.1 and 14.3.1, respectively.
Analogous to what was discussed in the previous chapter, the logistic regression models are defined based on subjacent theory and the experience of the researcher, in such a way that it is possible to estimate the desired model, analyze the obtained results by means of statistical tests, and prepare predictions.
In this chapter, we will cover the binary and multinomial logistic regression models, with the following objectives: (1) introduce the concepts of logistic regression, (2) present estimation by maximum likelihood, (3) interpret the obtained results and prepare predictions, and (4) present the application of these techniques in Excel, Stata, and SPSS. First, the solution to an example will be worked out in Excel simultaneously with the presentation of the concepts and its manual solution. After introducing the concepts, the procedures for preparing the technique in Stata and SPSS will be presented, maintaining the standard adopted in the book.
The binary logistic regression model has, as its main objective, the study of the probability of occurrence of an event defined by Y, which presents itself in qualitative, dichotomous form (Y = 1 describes the occurrence of the event of interest and Y = 0 the occurrence of the non-event), based on the behavior of explanatory variables. In this way, we can define a vector of explanatory variables, with respective estimated parameters, in the following way:
$$Z_i = \alpha + \beta_1 \cdot X_{1i} + \beta_2 \cdot X_{2i} + \cdots + \beta_k \cdot X_{ki}$$
where Z is known as the logit, α represents the constant, βj (j = 1, 2, …, k) are the parameters estimated for each explanatory variable, Xj are the explanatory variables (metric or dummies), and the subscript i represents each sample observation (i = 1, 2, …, n, where n is the sample size). It is important to highlight that Z does not represent the dependent variable, denominated Y; our present objective is to define the expression of the probability pi of occurrence of the event of interest for each observation as a function of the logit Zi, that is, as a function of the parameters estimated for each explanatory variable. To do this, we must define the concept of an event's chance of occurrence, also known as odds, in the following way:
$$\text{odds}_{Y_i = 1} = \frac{p_i}{1 - p_i}$$
Imagine that we are interested in studying the event "passing the calculus course." If, for example, the probability that a given student passes this course is 80%, their chance of passing will be 4 to 1 (0.8/0.2 = 4). If the probability that another student passes the same course is 25%, given that they studied much less than the first student, their chance of passing will be 1 to 3 (0.25/0.75 = 1/3). Even though in daily language we use the terms chance or odds as synonyms for probability, the concepts are different!
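These conversions between probability and odds are straightforward to verify. The short Python snippet below is merely illustrative (this chapter's solutions use Excel, Stata, and SPSS) and reproduces the two students' odds:

```python
def odds(p):
    """Odds of an event whose occurrence probability is p: odds = p / (1 - p)."""
    return p / (1 - p)

# The two students from the text:
print(odds(0.80))  # 4 to 1
print(odds(0.25))  # 1 to 3
```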
The binary logistic regression defines the Z logit as the natural logarithm of odds, such that:
$$\ln(\text{odds}_{Y_i = 1}) = Z_i$$
from which comes:
$$\ln\left(\frac{p_i}{1 - p_i}\right) = Z_i$$
Since our intent is to define an expression for the probability of occurrence of the event under study as a function of the logit, we can algebraically isolate pi in Expression (14.4) in the following manner:
$$\frac{p_i}{1 - p_i} = e^{Z_i}$$
$$p_i = (1 - p_i) \cdot e^{Z_i}$$
$$p_i \cdot (1 + e^{Z_i}) = e^{Z_i}$$
And, therefore, we have that:
Probability of occurrence of the event:
$$p_i = \frac{e^{Z_i}}{1 + e^{Z_i}} = \frac{1}{1 + e^{-Z_i}}$$
Probability of occurrence of the non-event:
$$1 - p_i = 1 - \frac{e^{Z_i}}{1 + e^{Z_i}} = \frac{1}{1 + e^{Z_i}}$$
Obviously, the sum of Expressions (14.8) and (14.9) is equal to 1.
Based on Expression (14.8), we can elaborate a table with p values in function of the Z values. Being that Z varies from −∞ to +∞, we will, for teaching purposes only, use integer values between − 5 and + 5. Table 14.1 gives these values.
Table 14.1
$p_i = \frac{1}{1+e^{-Z_i}}$ | $Z_i$ |
---|---|
0.0067 | − 5 |
0.0180 | − 4 |
0.0474 | − 3 |
0.1192 | − 2 |
0.2689 | − 1 |
0.5000 | 0 |
0.7311 | 1 |
0.8808 | 2 |
0.9526 | 3 |
0.9820 | 4 |
0.9933 | 5 |
Based on Table 14.1, we can prepare a graph of p = f(Z), as presented in Fig. 14.1. By means of this graph, we see that the estimated probabilities, as a function of the different values assumed by Z, are situated between 0 and 1, which was guaranteed when we imposed that the logit be equal to the natural logarithm of the odds. As such, given the parameters estimated in the model and the value of each of the explanatory variables for a given observation i, we can calculate the value of Zi and, by means of the logistic curve presented in Fig. 14.1 (also known as the S curve, or sigmoid), estimate the probability of occurrence of the event under study for that observation.
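The values in Table 14.1 can be verified directly from Expression (14.8). The short Python sketch below is purely illustrative:

```python
import math

def logistic(z):
    """Probability p = 1 / (1 + e^(-z)) -- the S-shaped (sigmoid) curve."""
    return 1.0 / (1.0 + math.exp(-z))

# Reproduce Table 14.1 for integer Z between -5 and +5:
for z in range(-5, 6):
    print(f"Z = {z:+d}  ->  p = {logistic(z):.4f}")
```

Note that, by construction, the probability of the event and of the non-event sum to 1 for any Z.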
Based on Expressions (14.1) and (14.8), we can define the general expression for the estimated probability of occurrence of an event presented in dichotomous form, for an observation i, in the following way:
$$p_i = \frac{1}{1 + e^{-(\alpha + \beta_1 \cdot X_{1i} + \beta_2 \cdot X_{2i} + \cdots + \beta_k \cdot X_{ki})}}$$
What the binary logistic regression estimates, therefore, is not the predicted values of the dependent variable, but rather the probability of occurrence of the event under study for each observation. We now move on to the estimation of the logit parameters by means of an example prepared initially in Excel.
We will now present the concepts pertinent to estimation by maximum likelihood using an example similar to that developed throughout the previous chapter. However, now the dependent variable will be qualitative and dichotomic.
Imagine that our curious professor, who has already considerably explored the effects of determined explanatory variables on the travel time for a group of students to get to school, by means of the multiple regression technique, is now interested in investigating if these same explanatory variables influence the probability of a student arriving late to class. In other words, the phenomenon in question to be studied only presents two categories (arrive late to class or not) and the event of interest refers to arriving late.
To this end, the professor surveyed 100 students at the school where he teaches, asking each of them whether they had arrived late that day. The professor also asked about the distance traveled (in kilometers), the number of traffic lights each passed through, the time of day when the trip was made (morning or afternoon), and the driving style each considers themselves to have (calm, moderate, or aggressive). Part of the prepared dataset is found in Table 14.2.
Table 14.2
Student | Arrived Late to School (Yi) | Distance Traveled to School (km) (X1i) | Number of Traffic Lights—sem (X2i) | Time of Day (X3i) | Driving Style (X4i) |
---|---|---|---|---|---|
Gabriela | No | 12.5 | 7 | morning | calm |
Patricia | No | 13.3 | 10 | morning | calm |
Gustavo | No | 13.4 | 8 | morning | moderate |
Leticia | No | 23.5 | 7 | morning | calm |
Luiz Ovidio | No | 9.5 | 8 | morning | calm |
Leonor | No | 13.5 | 10 | morning | calm |
Dalila | No | 13.5 | 10 | morning | calm |
Antonio | No | 15.4 | 10 | morning | calm |
Julia | No | 14.7 | 10 | morning | calm |
Mariana | No | 14.7 | 10 | morning | calm |
… | |||||
Filomena | Yes | 12.8 | 11 | afternoon | aggressive |
… | |||||
Estela | Yes | 1.0 | 13 | morning | calm |
For the dependent variable, since the event of interest is arriving late, this category will take values equal to 1, and the category not arriving late values equal to 0.
Following what was defined in the previous chapter in relation to qualitative explanatory variables, the reference category of the variable corresponding to the time of day will be afternoon; that is, the cells in the dataset with this value will take values equal to 0, and the cells with the category morning values equal to 1. The driving style variable, in turn, must be transformed into two dummies (variables style2 for the moderate category and style3 for the aggressive category), since we have defined the calm category as the reference.
As such, Table 14.3 presents part of the final dataset to be used for the estimation of our binary logistic regression model.
Table 14.3
Student | Arrived Late to School (Dummy Yes = 1; No = 0) (Yi) | Distance Traveled to School (km) (X1i) | Number of Traffic Lights—sem (X2i) | Time of Day Dummy per (X3i) | Driving Style Dummy style2 (X4i) | Driving Style Dummy style3 (X5i) |
---|---|---|---|---|---|---|
Gabriela | 0 | 12.5 | 7 | 1 | 0 | 0 |
Patricia | 0 | 13.3 | 10 | 1 | 0 | 0 |
Gustavo | 0 | 13.4 | 8 | 1 | 1 | 0 |
Leticia | 0 | 23.5 | 7 | 1 | 0 | 0 |
Luiz Ovidio | 0 | 9.5 | 8 | 1 | 0 | 0 |
Leonor | 0 | 13.5 | 10 | 1 | 0 | 0 |
Dalila | 0 | 13.5 | 10 | 1 | 0 | 0 |
Antonio | 0 | 15.4 | 10 | 1 | 0 | 0 |
Julia | 0 | 14.7 | 10 | 1 | 0 | 0 |
Mariana | 0 | 14.7 | 10 | 1 | 0 | 0 |
… | ||||||
Filomena | 1 | 12.8 | 11 | 0 | 0 | 1 |
… | ||||||
Estela | 1 | 1.0 | 13 | 1 | 0 | 0 |
The complete dataset can be accessed by means of the Late.xls file.
In this way, the logit whose parameters we wish to estimate is defined in the following way:
$$Z_i = \alpha + \beta_1 \cdot dist_i + \beta_2 \cdot sem_i + \beta_3 \cdot per_i + \beta_4 \cdot style2_i + \beta_5 \cdot style3_i$$
and the estimated probability that a determined student arrives late can be written in the following way:
$$p_i = \frac{1}{1 + e^{-(\alpha + \beta_1 \cdot dist_i + \beta_2 \cdot sem_i + \beta_3 \cdot per_i + \beta_4 \cdot style2_i + \beta_5 \cdot style3_i)}}$$
Since it does not make sense to define an error term for each observation, given that the dependent variable is dichotomous, there is no way to estimate the equation parameters by minimizing the sum of squared residuals, as we did when estimating traditional regression models. In this case, therefore, we will use the likelihood function, from which the maximum likelihood estimation will be elaborated. Maximum likelihood is the most popular parameter estimation technique for logistic regression models.
Because of this, it is also important to mention, in relation to the presuppositions studied for regression models estimated by ordinary least squares, that the researcher need only be concerned with the presupposition of the absence of multicollinearity among the explanatory variables when estimating logistic regression models.
In binary logistic regression, the dependent variable follows a Bernoulli distribution; in other words, whether or not the event of interest occurred for a given observation i can be considered a Bernoulli trial, in which the probability of occurrence of the event is pi and the probability of occurrence of the non-event is (1 − pi). In general, we can write the probability of occurrence of Yi, where Yi equals 1 or 0, as:
$$p(Y_i) = p_i^{Y_i} \cdot (1 - p_i)^{1 - Y_i}$$
For a sample with n observations, we can define the likelihood function as being:
$$L = \prod_{i=1}^{n}\left[p_i^{Y_i} \cdot (1 - p_i)^{1 - Y_i}\right]$$
from which comes, based on Expressions (14.8) and (14.9), that:
$$L = \prod_{i=1}^{n}\left[\left(\frac{e^{Z_i}}{1 + e^{Z_i}}\right)^{Y_i} \cdot \left(\frac{1}{1 + e^{Z_i}}\right)^{1 - Y_i}\right]$$
Since, in practice, it is more convenient to work with logarithms, we arrive at the following function, also known as the log-likelihood function:
$$LL = \sum_{i=1}^{n}\left\{\left[Y_i \cdot \ln\left(\frac{e^{Z_i}}{1 + e^{Z_i}}\right)\right] + \left[(1 - Y_i) \cdot \ln\left(\frac{1}{1 + e^{Z_i}}\right)\right]\right\}$$
And now a question must be asked: what values of the logit parameters maximize the LL value of Expression (14.14)? This important question is the key to the maximum likelihood estimation of binary logistic regression models, and it can be answered using optimization tools, so as to estimate the α, β1, β2, …, βk parameters based on the following objective function:
$$LL = \sum_{i=1}^{n}\left\{\left[Y_i \cdot \ln\left(\frac{e^{Z_i}}{1 + e^{Z_i}}\right)\right] + \left[(1 - Y_i) \cdot \ln\left(\frac{1}{1 + e^{Z_i}}\right)\right]\right\} = \max$$
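The same maximization that Solver performs can be sketched with a general-purpose optimizer. The Python snippet below is merely illustrative: since the Late.xls observations are not reproduced here, it generates a small synthetic dataset with known parameters, maximizes the log-likelihood of Expression (14.14) starting from all parameters at zero, and confirms that the maximized LL exceeds the LL at the zero starting point:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)

# Synthetic illustration (NOT the Late.xls data): an intercept and
# two explanatory variables with known true parameters.
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
true_params = np.array([-0.5, 1.2, -0.8])
p_true = 1.0 / (1.0 + np.exp(-X @ true_params))
y = rng.binomial(1, p_true)          # Bernoulli trials with probability p_i

def neg_log_likelihood(params, X, y):
    """Negative of Expression (14.14): -sum[y*ln(p) + (1-y)*ln(1-p)]."""
    z = X @ params
    log_p = -np.logaddexp(0.0, -z)   # ln(p)   = -ln(1 + e^{-z})
    log_1mp = -np.logaddexp(0.0, z)  # ln(1-p) = -ln(1 + e^{z})
    return -np.sum(y * log_p + (1 - y) * log_1mp)

# Maximizing LL is the same as minimizing -LL; start, as in the text, from zeros.
res = minimize(neg_log_likelihood, x0=np.zeros(3), args=(X, y), method="BFGS")
ll_max = -res.fun
ll_at_zero = -neg_log_likelihood(np.zeros(3), X, y)
print("estimates:", res.x.round(3))
print("LL at zeros:", ll_at_zero, " LL maximized:", ll_max)
```

At the zero starting point every estimated probability is 0.5, so LL equals n·ln(0.5), exactly as in the Solver walkthrough that follows.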
We will solve this problem with the Excel Solver tool using the data from our example. For such, we should open the LateMaximumLikelihood.xls file, which will help in the calculation of the parameters.
In this file, besides the dependent variable and the explanatory variables, three new variables were created, which correspond to the Zi logit, the probability of occurrence for the event of interest pi, and to the LLi logarithmic likelihood function for each observation, respectively. Table 14.4 shows part of the data when the α, β1, β2, β3, β4, and β5 parameters are equal to 0.
Table 14.4
Student | Yi | X1i | X2i | X3i | X4i | X5i | Zi | pi | LLi = (Yi) ⋅ ln(pi) + (1 − Yi) ⋅ ln(1 − pi) |
---|---|---|---|---|---|---|---|---|---|
Gabriela | 0 | 12.5 | 7 | 1 | 0 | 0 | 0 | 0.5 | − 0.69315 |
Patricia | 0 | 13.3 | 10 | 1 | 0 | 0 | 0 | 0.5 | − 0.69315 |
Gustavo | 0 | 13.4 | 8 | 1 | 1 | 0 | 0 | 0.5 | − 0.69315 |
Leticia | 0 | 23.5 | 7 | 1 | 0 | 0 | 0 | 0.5 | − 0.69315 |
Luiz Ovidio | 0 | 9.5 | 8 | 1 | 0 | 0 | 0 | 0.5 | − 0.69315 |
Leonor | 0 | 13.5 | 10 | 1 | 0 | 0 | 0 | 0.5 | − 0.69315 |
Dalila | 0 | 13.5 | 10 | 1 | 0 | 0 | 0 | 0.5 | − 0.69315 |
Antonio | 0 | 15.4 | 10 | 1 | 0 | 0 | 0 | 0.5 | − 0.69315 |
Julia | 0 | 14.7 | 10 | 1 | 0 | 0 | 0 | 0.5 | − 0.69315 |
Mariana | 0 | 14.7 | 10 | 1 | 0 | 0 | 0 | 0.5 | − 0.69315 |
… | |||||||||
Filomena | 1 | 12.8 | 11 | 0 | 0 | 1 | 0 | 0.5 | − 0.69315 |
… | |||||||||
Estela | 1 | 1.0 | 13 | 1 | 0 | 0 | 0 | 0.5 | − 0.69315 |
Sum | $LL = \sum_{i=1}^{100}\left\{[Y_i \cdot \ln(p_i)] + [(1 - Y_i) \cdot \ln(1 - p_i)]\right\}$ | − 69.31472 |
Fig. 14.2 presents part of the data in the LateMaximumLikelihood.xls file, being that some cells were hidden due to the number of observations being equal to 100.
As we can see, when α = β1 = β2 = β3 = β4 = β5 = 0, the sum of the logarithmic likelihood function equals − 69.31472. However, there must be an optimal combination of parameter values such that the objective function presented in Expression (14.15) is satisfied, that is, such that the sum of the logarithmic likelihood function is the maximum possible.
According to the logic proposed by Belfiore and Fávero (2012), we will now open the Excel Solver tool. The objective function is in cell J103, which is our destination cell and which should be maximized. Besides this, parameters α, β1, β2, β3, β4, and β5, whose values are in cells M3, M5, M7, M9, M11, and M13, respectively, are the variable cells. The Solver window will be as shown in Fig. 14.3.
By clicking on Solve and then OK, we will obtain the solution to the maximization problem. Table 14.5 shows part of the obtained data.
Table 14.5
Student | Yi | X1i | X2i | X3i | X4i | X5i | Zi | pi | LLi = (Yi) ⋅ ln(pi) + (1 − Yi) ⋅ ln(1 − pi) |
---|---|---|---|---|---|---|---|---|---|
Gabriela | 0 | 12.5 | 7 | 1 | 0 | 0 | − 11.73478 | 0.00001 | − 0.00001 |
Patricia | 0 | 13.3 | 10 | 1 | 0 | 0 | − 3.25815 | 0.03704 | − 0.03774 |
Gustavo | 0 | 13.4 | 8 | 1 | 1 | 0 | − 7.42373 | 0.00060 | − 0.00060 |
Leticia | 0 | 23.5 | 7 | 1 | 0 | 0 | − 9.31255 | 0.00009 | − 0.00009 |
Luiz Ovidio | 0 | 9.5 | 8 | 1 | 0 | 0 | − 9.62856 | 0.00007 | − 0.00007 |
Leonor | 0 | 13.5 | 10 | 1 | 0 | 0 | − 3.21411 | 0.03864 | − 0.03940 |
Dalila | 0 | 13.5 | 10 | 1 | 0 | 0 | − 3.21411 | 0.03864 | − 0.03940 |
Antonio | 0 | 15.4 | 10 | 1 | 0 | 0 | − 2.79572 | 0.05756 | − 0.05928 |
Julia | 0 | 14.7 | 10 | 1 | 0 | 0 | − 2.94987 | 0.04974 | − 0.05102 |
Mariana | 0 | 14.7 | 10 | 1 | 0 | 0 | − 2.94987 | 0.04974 | − 0.05102 |
… | |||||||||
Filomena | 1 | 12.8 | 11 | 0 | 0 | 1 | 5.96647 | 0.99744 | − 0.00256 |
… | |||||||||
Estela | 1 | 1.0 | 13 | 1 | 0 | 0 | 2.33383 | 0.91164 | − 0.09251 |
Sum | $LL = \sum_{i=1}^{100}\left\{[Y_i \cdot \ln(p_i)] + [(1 - Y_i) \cdot \ln(1 - p_i)]\right\}$ | − 29.06568 |
Then, the maximum possible value of the sum of the logarithmic likelihood function is LLmax = − 29.06568. Based on the parameter estimates generated by this solution, the Zi logit can be written as follows:
$$Z_i = -30.202 + 0.220 \cdot dist_i + 2.767 \cdot sem_i - 3.653 \cdot per_i + 1.346 \cdot style2_i + 2.914 \cdot style3_i$$
Fig. 14.4 presents part of the results obtained by modeling the LateMaximumLikelihood.xls file.
And, therefore, the estimated probability expression for a student to arrive late can be written in the following way:
$$p_i = \frac{1}{1 + e^{-(-30.202 + 0.220 \cdot dist_i + 2.767 \cdot sem_i - 3.653 \cdot per_i + 1.346 \cdot style2_i + 2.914 \cdot style3_i)}}$$
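This estimated probability expression can be evaluated directly for any combination of the explanatory variables. The Python sketch below (illustrative only) does so for a hypothetical student, using the complete-model estimates; remember that the statistical significance of each parameter still needs to be verified before predictions are made:

```python
import math

def p_late(dist, sem, per, style2, style3):
    """Estimated probability of arriving late (complete-model logit)."""
    z = (-30.202 + 0.220 * dist + 2.767 * sem
         - 3.653 * per + 1.346 * style2 + 2.914 * style3)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical student: 17 km, 10 traffic lights, morning trip (per = 1),
# aggressive driving style (style3 = 1, style2 = 0):
print(round(p_late(17, 10, 1, 0, 1), 4))
```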
Thus, some interesting questions can now be posed:
What is the average estimated probability of arriving late to school when traveling 17 kilometers and going through 10 traffic lights, making the trip in the morning with a driving style considered aggressive?
On average, how much does the chance (odds) of arriving late to school change if a route 1 kilometer longer is adopted, maintaining the remaining conditions constant?
Does a student considered aggressive present, on average, a higher chance of arriving late than one considered calm? If so, by how much does this chance increase, maintaining the remaining conditions constant?
Before answering these important questions, we need to verify whether all the estimated parameters are statistically significant at a given confidence level. If this is not the case, we will need to re-estimate the model so that the final model presents only statistically significant parameters, making the elaboration of inferences and predictions possible from then on.
Therefore, having estimated by maximum likelihood the parameters of the event occurrence probability equation, we now begin the study of the general statistical significance of the obtained model, as well as the statistical significance of each parameter, analogous to what was done for the traditional regression models in the previous chapter. It is important to mention that, in the Appendix of this chapter, we briefly present the probit regression models, which can be used as an alternative to the binary logistic regression models in cases where the event occurrence probability curve adjusts more adequately to the cumulative distribution function of the standard normal distribution.
If, for example, we prepare a linear graph of our dependent variable (late) as a function of the variable referring to the number of traffic lights (sem), we notice that the model estimates are unable to adjust satisfactorily to the behavior of the dependent variable, since it is a dummy. The graph in Fig. 14.5A presents this behavior. On the other hand, if the binary logistic regression model is prepared and the estimated probabilities of arriving late for each observation in our sample are plotted, specifically as a function of the number of traffic lights each student passes through, we notice that the adjustment is much more adequate to the behavior of the dependent variable (S curve), with estimated values limited to between 0 and 1 (Fig. 14.5B).
Therefore, since the dependent variable is qualitative, it makes no sense to discuss the percentage of its variance explained by the predictor variables. In other words, in logistic regression models there is no coefficient of determination R2 as in traditional regressions estimated by the ordinary least squares method. However, many researchers present in their work a coefficient known as the McFadden pseudo R2, whose expression is:
$$\text{pseudo } R^2 = \frac{-2 \cdot LL_0 - (-2 \cdot LL_{\max})}{-2 \cdot LL_0}$$
Its usefulness is quite limited and is restricted to cases where the researcher is interested in comparing two or more distinct models, given that one of the many existing criteria for model choice is a higher McFadden pseudo R2.
In our example, as we have already discussed in the previous section and calculated by means of Excel Solver, LLmax, which is the maximum possible value of the sum of the logarithmic likelihood function, is equal to − 29.06568.
Now LL0 represents the maximum possible value of the sum of the logarithmic likelihood function for a model known as the null model, in other words, for a model that only presents constant α and no explanatory variable. By means of the same procedure performed in the previous section, however now using the LateMaximumLikelihoodNullModel.xls file, we will obtain that LL0 = − 67.68585. Figs. 14.6 and 14.7 show the Solver window and part of the results obtained by modeling in this file, respectively.
Then, based on Expression (14.16), we obtain:
$$\text{pseudo } R^2 = \frac{-2 \cdot (-67.68585) - [-2 \cdot (-29.06568)]}{-2 \cdot (-67.68585)} = 0.5706$$
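The same calculation can be reproduced in a few lines of Python (values taken from the Solver results described above):

```python
# McFadden pseudo R-squared from the two log-likelihoods (Expression 14.16).
ll_null = -67.68585   # LL0: null model (constant only)
ll_max = -29.06568    # LLmax: complete model

pseudo_r2 = ((-2 * ll_null) - (-2 * ll_max)) / (-2 * ll_null)
print(round(pseudo_r2, 4))  # 0.5706
```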
As discussed, a higher McFadden pseudo R2 can be used as a criterion to choose one model over another. However, as we will study in Section 14.2.4, there is another, more adequate criterion for choosing the best model, which refers to a greater area under the receiver operating characteristic (ROC) curve.
Many researchers also use the McFadden pseudo R2 as a performance indicator for a chosen model, independent of comparison with other models. However, its interpretation demands much care, and at times there is the temptation to erroneously interpret it as a percentage of explained variance of the dependent variable. As we will study in Section 14.2.4, the best performance indicator for a binary logistic regression model is the overall model efficiency, which is defined based on the determination of a cutoff, concepts that will be studied in that same section.
Even though the usefulness of the McFadden pseudo R2 is limited, software such as Stata and SPSS calculate and present it in their respective outputs, as we will see in Sections 14.4 and 14.5, respectively.
Analogous to the procedure presented in the previous chapter, we will first study the general statistical significance of the proposed model. The χ2 test provides the means to verify the model's significance, since its null and alternative hypotheses, for a general logistic regression model, are:
While the F-test is used for regression models where the dependent variable presents itself quantitatively, which generates decomposition of the variance (ANOVA table), studied in the previous chapter, the χ2 test is more adequate for models estimated by the maximum likelihood method, such as the logistic regression models.
The χ2 test provides the researcher an initial verification of the viability of the proposed model, since, if all the estimated βj (j = 1, 2, …, k) parameters are statistically equal to 0, changes in the X variables will not influence the probability of occurrence of the event under study in any way. The χ2 statistic has the following expression:
$$\chi^2 = -2 \cdot (LL_0 - LL_{\max})$$
Returning to our example, we have that:
$$\chi^2_{5\,\text{d.f.}} = -2 \cdot [-67.68585 - (-29.06568)] = 77.2403$$
For 5 degrees of freedom (the number of explanatory variables considered in the model, that is, the number of β parameters), we have, by means of Table D in the Appendix, that χc2 = 11.070 (critical χ2 for 5 degrees of freedom at the 5% significance level). In this way, since the calculated statistic χcal2 = 77.2403 > χc2 = 11.070, we can reject the null hypothesis that all the βj (j = 1, 2, …, 5) parameters are statistically equal to zero. Thus, at least one X variable is statistically significant in explaining the probability of occurrence of the event under study, and we have a statistically significant binary logistic regression model for the purpose of prediction.
Software such as Stata and SPSS does not report χc2 for the defined degrees of freedom and a given significance level. Instead, they report the significance level (P-value) of χcal2 for these degrees of freedom. As such, instead of analyzing whether χcal2 > χc2, we should check whether the significance level of χcal2 is lower than 0.05 (5%) in order to continue the regression analysis. As such:
The χcal2 significance level can be obtained in Excel by means of the command Formulas → Insert Function → DIST.QUI (CHISQ.DIST.RT in English-language versions of Excel), which will open the dialog box seen in Fig. 14.8.
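The same statistic, critical value, and significance level can also be obtained outside Excel; a Python sketch using scipy (illustrative only):

```python
from scipy.stats import chi2

ll_null = -67.68585   # LL0: null model (constant only)
ll_max = -29.06568    # LLmax: complete model

chi2_stat = -2 * (ll_null - ll_max)   # calculated chi-square statistic
df = 5                                # number of beta parameters
crit = chi2.ppf(0.95, df)             # critical value at the 5% level
p_value = chi2.sf(chi2_stat, df)      # significance level (right tail)

print(round(chi2_stat, 4), round(crit, 3), p_value)
```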
Analogous to the F-test, the χ2 test evaluates the joint significance of the explanatory variables, without defining which of the variables considered in the model are statistically significant in influencing the probability of occurrence of the event.
In this way, it is necessary that the researcher evaluate whether each of the binary logistic regression model parameters is statistically significant; to this end, the Wald z statistic provides the statistical significance of each parameter considered in the model. The z nomenclature refers to the fact that the distribution of this statistic is standard normal. The Wald z test hypotheses for α and for each βj (j = 1, 2, …, k) are:
The expressions for the calculation of the Wald z statistic for each α and βj parameter are given by:
$$z_{\alpha} = \frac{\alpha}{s.e.(\alpha)} \qquad z_{\beta_j} = \frac{\beta_j}{s.e.(\beta_j)}$$
where s.e. refers to the standard error of each parameter under analysis. Given the complexity of the standard error calculation for each parameter, we will not perform it here; however, we recommend reading Engle (1984). The s.e. values for each parameter in our example are:
Then, as we have already calculated the parameter estimates, we have that:
$$z_{\alpha} = \frac{\alpha}{s.e.(\alpha)} = \frac{-30.202}{9.981} = -3.026$$
$$z_{\beta_1} = \frac{\beta_1}{s.e.(\beta_1)} = \frac{0.220}{0.110} = 2.000$$
$$z_{\beta_2} = \frac{\beta_2}{s.e.(\beta_2)} = \frac{2.767}{0.922} = 3.001$$
$$z_{\beta_3} = \frac{\beta_3}{s.e.(\beta_3)} = \frac{-3.653}{0.878} = -4.161$$
$$z_{\beta_4} = \frac{\beta_4}{s.e.(\beta_4)} = \frac{1.346}{0.748} = 1.799$$
$$z_{\beta_5} = \frac{\beta_5}{s.e.(\beta_5)} = \frac{2.914}{1.179} = 2.472$$
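The Wald z statistics and their two-tailed P-values can be reproduced from the estimates and standard errors above; a Python sketch using scipy (illustrative only):

```python
from scipy.stats import norm

# Estimates and standard errors of the complete model (values from the text):
params = {
    "alpha":  (-30.202, 9.981),
    "dist":   (0.220, 0.110),
    "sem":    (2.767, 0.922),
    "per":    (-3.653, 0.878),
    "style2": (1.346, 0.748),
    "style3": (2.914, 1.179),
}

p_values = {}
for name, (b, se) in params.items():
    z = b / se                             # Wald z statistic
    p_values[name] = 2 * norm.sf(abs(z))   # two-tailed P-value
    print(f"{name:7s} z = {z:+.3f}  P-value = {p_values[name]:.4f}")
```

Only style2 yields a P-value above 0.05, matching the conclusion drawn from the ± 1.96 critical values.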
After obtaining the Wald z statistics, the researcher can use the normal curve distribution table to obtain the critical values for a given level of significance and check if such tests reject or do not reject the null hypothesis.
For the 5% level of significance, we have, by means of Table E in the Appendix, that zc = − 1.96 for the lower tail (probability for the lower tail of 0.025 for the two-tailed distribution) and zc = 1.96 for the upper tail (probability for the upper tail also of 0.025 for the two-tailed distribution).
The zc values for the 5% significance level can be obtained in Excel by means of the command Formulas → Insert Function → NORM.S.INV, being that the researcher should type in a probability of 2.5% to obtain zc for each lower tail, and 97.5% to obtain the zc for each upper tail, as shown in Figs. 14.9 and 14.10, respectively.
Only the Wald z statistic for the β4 parameter presented a value between − 1.96 and 1.96, which indicates, at the 5% significance level, that for this parameter the null hypothesis is not rejected; that is, it cannot be considered statistically different from zero.
As with the χ2 test, statistical packages also offer the significance levels (P-values) of the Wald z tests, which facilitates the decision, given that, at the 95% confidence level (5% significance level), we have:
As such, since − 1.96 < zβ4 = 1.799 < 1.96, the P-value of the Wald z statistic for the style2 variable will be greater than 0.05.
The nonrejection of the null hypothesis for the β4 parameter, at the 5% significance level, indicates that the corresponding style2 variable is not statistically significant for increasing or decreasing the probability of arriving late to school in the presence of the other explanatory variables and, therefore, it should be excluded from the final model.
At this time, we will perform a manual exclusion of this variable so as to obtain the final model. However, it is important to remember that the manual exclusion of a variable can cause another, initially significant, variable to come to present a nonsignificant parameter, a problem that tends to worsen as the number of explanatory variables in the dataset grows. The opposite can also occur; that is, it is not recommended to simultaneously exclude two or more variables whose parameters do not, at first sight, show themselves to be statistically different from zero, since, after the exclusion of one of them, a determined β parameter can become statistically different from zero. Fortunately, these phenomena do not occur in this example and, as such, we opt to manually exclude the style2 variable. This will be verified when we estimate the binary logistic regression model by means of the Stepwise procedure in Stata (Section 14.4) and SPSS (Section 14.5).
Therefore, we will open the LateMaximumLikelihoodFinalModel.xls file. Notice that now the calculation of the (Zi) logit no longer takes into account the variable style2 parameter, which was excluded from the model. Figs. 14.11 and 14.12 show the Solver window and part of the results obtained in the modeling by means of this last file, respectively.
Then, for the final model, we have that LLmax = − 30.80079. Before moving on to the final expression of the occurrence probability of the event under study, we need to verify whether the new estimated model (final model) presents a loss in the quality of adjustment in relation to the complete model estimated with all the explanatory variables. To do this, the likelihood-ratio test, which compares the adjustment of the complete model with the adjustment of the final model, can be used, presenting the following expression:
$$\chi^2_{1\text{ d.f.}} = -2 \cdot (LL_{\text{final model}} - LL_{\text{complete model}})$$

For our example data, we have that:

$$\chi^2_{1\text{ d.f.}} = -2 \cdot [-30.80079 - (-29.06568)] = 3.4702$$
Then, for 1 degree of freedom, we have, by means of Table D in the Appendix, that $\chi^2_c = 3.841$ (critical $\chi^2$ for 1 degree of freedom and for the 5% significance level). This way, since the calculated statistic $\chi^2_{cal} = 3.4702 < \chi^2_c = 3.841$, we do not reject the null hypothesis of the likelihood-ratio test, or rather, the estimation of the final model with the exclusion of the style2 variable did not alter the quality of the adjustment, at the 5% significance level, which makes this model preferable to the complete model estimated with all of the explanatory variables.
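This likelihood-ratio comparison can be replicated in a few lines. The sketch below, in Python, assumes only the log-likelihood values reported in the text; the variable names are ours.

```python
# Log-likelihoods reported in the text (Excel Solver estimates)
ll_complete = -29.06568   # complete model, all explanatory variables
ll_final = -30.80079      # final model, style2 excluded

# Likelihood-ratio statistic with 1 degree of freedom
lr_stat = -2 * (ll_final - ll_complete)

# Critical chi-squared value for 1 d.f. at the 5% significance level (Table D)
chi2_crit_1df = 3.841

# lr_stat below the critical value: do not reject H0 (no loss of adjustment)
no_loss_of_fit = lr_stat < chi2_crit_1df
```

Since `no_loss_of_fit` is true, the more parsimonious final model is preferred, exactly as concluded above.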
In Sections 14.4 and 14.5 we will present, by means of Stata and SPSS, respectively, another quite usual test to verify the quality of adjustment for the final model, known as the Hosmer-Lemeshow test. By dividing the dataset into 10 groups per the deciles of the probabilities estimated by the final model for each observation, this test evaluates, by means of a χ2 test, whether there are significant differences between the observed and expected frequencies in each of the 10 groups. In case such differences are not statistically significant, at a determined significance level, the estimated model will not present problems in relation to the quality of the proposed adjustment.
Being as such, we return to the analysis of the final estimated model results. The solution to this new problem generated the following final parameter estimates:
with the respective standard errors:
and the following Wald z statistics:
$$z_\alpha = \frac{\alpha}{s.e.(\alpha)} = \frac{-30.935}{10.636} = -2.909$$

$$z_{\beta_1} = \frac{\beta_1}{s.e.(\beta_1)} = \frac{0.204}{0.101} = 2.020$$

$$z_{\beta_2} = \frac{\beta_2}{s.e.(\beta_2)} = \frac{2.920}{1.011} = 2.888$$

$$z_{\beta_3} = \frac{\beta_3}{s.e.(\beta_3)} = \frac{-3.776}{0.847} = -4.458$$

$$z_{\beta_5} = \frac{\beta_5}{s.e.(\beta_5)} = \frac{2.459}{1.139} = 2.159$$
with all values of zcal < − 1.96 or > 1.96 and, therefore, with P-values for the Wald z statistics < 0.05.
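The Wald z statistics and their two-sided P-values can be checked directly from the reported coefficients and standard errors. A minimal sketch, assuming the values in the text (the dictionary keys are our labels); the normal-distribution P-value uses the standard library's error function:

```python
import math

# Coefficients and standard errors of the final model (from the text)
params = {
    "alpha": (-30.935, 10.636),
    "beta1": (0.204, 0.101),    # dist
    "beta2": (2.920, 1.011),    # sem
    "beta3": (-3.776, 0.847),   # per
    "beta5": (2.459, 1.139),    # style3
}

def two_sided_p(z):
    # P-value from the standard normal distribution
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Wald z = coefficient / standard error
z_stats = {name: c / s for name, (c, s) in params.items()}
p_values = {name: two_sided_p(z) for name, z in z_stats.items()}
```

All five P-values come out below 0.05, confirming that every remaining parameter is statistically different from zero at the 5% significance level.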
The final model also presents the following statistics:
$$\text{pseudo }R^2 = \frac{-2\cdot(-67.68585) - [-2\cdot(-30.80079)]}{-2\cdot(-67.68585)} = 0.5449$$

$$\chi^2_{4\text{ d.f.}} = -2 \cdot [-67.68585 - (-30.80079)] = 73.77012 > \chi^2_{c\,4\text{ d.f.}} = 9.48773$$
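Both statistics above follow directly from the null-model and final-model log-likelihoods. A sketch, assuming the two values reported in the text:

```python
ll_null = -67.68585   # null model (intercept only), reported in the text
ll_max = -30.80079    # final estimated model

# McFadden pseudo R-squared
pseudo_r2 = ((-2 * ll_null) - (-2 * ll_max)) / (-2 * ll_null)

# Overall chi-squared statistic of the model (4 degrees of freedom)
chi2_model = -2 * (ll_null - ll_max)
chi2_crit_4df = 9.48773  # critical value, 5% significance level
```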
As such, we can write the Zi logit as follows:
$$Z_i = -30.935 + 0.204 \cdot dist_i + 2.920 \cdot sem_i - 3.776 \cdot per_i + 2.459 \cdot style3_i$$
with the final estimated probability expression that student i will arrive late to school:
$$p_i = \frac{1}{1 + e^{-(-30.935 + 0.204 \cdot dist_i + 2.920 \cdot sem_i - 3.776 \cdot per_i + 2.459 \cdot style3_i)}}$$
These parameters and respective statistics can also be obtained by means of the Stepwise procedure when estimating the binary logistic regression model in Stata and SPSS.
Based on the estimation of the probability function, a curious researcher could, for example, desire to prepare a graph of the estimated probabilities for each student to arrive late to school (column H in the final model file in Excel) as a function of the number of traffic lights through which each must go on the route (column D in Excel). Fig. 14.13 presents this graph and, contrary to the graph in Fig. 14.5B, which presents a logistic adjustment over only values equal to 0 or 1 of the dependent variable, this new graph presents a logistic probability adjustment.
Based on Fig. 14.13, which also presents the logistic curve adjusted to the cloud of points that represents the estimated probabilities for each observation, we can see that, while the probability of arriving late to school is very low when going through up to 8 traffic lights along the route, the probability becomes quite high when the student is obliged to go through 11 or more traffic lights during the trip.
Deepening the analysis of the probability function, we can return to our three important questions, answering each one at a time:
What is the average estimated probability to arrive late to school when traveling 17 kilometers and going through 10 traffic lights, making the trip in the morning and having what is considered an aggressive driving style?
Using the last probability expression and substituting the provided values in this equation, we will have:
$$p = \frac{1}{1 + e^{-[-30.935 + 0.204\cdot(17) + 2.920\cdot(10) - 3.776\cdot(1) + 2.459\cdot(1)]}} = 0.603$$
Then, the average estimated probability of arriving late to school is, within the provided conditions, equal to 60.3%.
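This substitution can be written as a small reusable function. A sketch, assuming the final-model coefficients from the text (the function name is ours):

```python
import math

def prob_late(dist, sem, per, style3):
    """Estimated probability of arriving late, from the final-model logit."""
    z = -30.935 + 0.204 * dist + 2.920 * sem - 3.776 * per + 2.459 * style3
    return 1 / (1 + math.exp(-z))

# 17 km, 10 traffic lights, morning trip (per = 1), aggressive style (style3 = 1)
p = prob_late(dist=17, sem=10, per=1, style3=1)
```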
On average, how much does the chance of arriving late to school change if a route 1 kilometer longer is adopted while maintaining the remaining conditions constant?
To answer this question, we should resort to Expression (14.3), which can be written as follows:
$$\text{odds}_{Y_i=1} = e^{Z_i}$$
such that, maintaining the remaining conditions constant, the chance of arriving late to school when adopting a route that is 1 kilometer longer is:
$$\text{odds}_{Y=1} = e^{0.204} = 1.226$$
Then, the chance is multiplied by a factor of 1.226, or rather, if the remaining conditions are maintained constant, the chance of arriving late to school when adopting a route that is 1 kilometer longer is, on average, 22.6% higher.
Does a student considered aggressive present, on average, a higher chance of arriving late than another who is considered calm? If yes, how much is this chance increased, maintaining the remaining conditions constant?
Since β5 is positive, we can state that the probability of arriving late to school for a student who is considered aggressive is higher than that for a student who is considered calm, a fact that is also proven when we analyze the chance, given that, if β5 > 0, then eβ5 > 1, or rather, the chance of arriving late will be higher when the student has an aggressive driving style than when the student is calm. This proves, once again, that being aggressive behind the wheel leads nowhere!
Maintaining the remaining conditions constant, the chance to arrive late to school when being aggressive behind the wheel in relation to being calm is given as:
$$\text{odds}_{Y=1} = e^{2.459} = 11.693$$
Then, that chance is multiplied by a factor of 11.693, or rather, maintaining the remaining conditions constant, the chance of arriving late to school when being aggressive behind the wheel in relation to being calm is, on average, 1069.3% higher.
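The two odds multipliers above are simply the exponentials of the estimated coefficients. A minimal sketch, assuming the coefficients reported in the text:

```python
import math

# Estimated coefficients from the final model (from the text)
beta1_dist = 0.204      # one extra kilometer on the route
beta5_style3 = 2.459    # aggressive vs. calm driving style

# Multiplicative change in the odds for a one-unit change, ceteris paribus
odds_factor_dist = math.exp(beta1_dist)
odds_factor_style3 = math.exp(beta5_style3)

# Percentage change in the odds
pct_dist = (odds_factor_dist - 1) * 100      # about 22.6% higher
pct_style3 = (odds_factor_style3 - 1) * 100  # about 1069.3% higher
```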
It is worth commenting that there are no differences in the probability of arriving late to school when one is considered moderate or calm, given that the β4 parameter (referent to the moderate category) presents itself as statistically equal to zero, at the 5% significance level.
As we can see, these calculations always use the average estimates for the parameters. We now embark on the study of the confidence intervals for these parameters.
The confidence intervals for the coefficients of Expression (14.10), for parameters α and βj (j = 1, 2, …, k), at the 95% confidence level, can be written as follows:
$$\alpha \pm 1.96 \cdot [s.e.(\alpha)]$$

$$\beta_j \pm 1.96 \cdot [s.e.(\beta_j)]$$
where, as we have seen, 1.96 is the zc for the 95% confidence level (5% significance level).
As such, we can prepare Table 14.6, which gives the estimated parameter coefficients for the probability expression for the event of interest in our example, with the respective standard errors, the Wald z statistics and the confidence intervals for the 5% significance level.
Table 14.6
| Parameter | Coefficient | Standard Error (s.e.) | z | 95% CI Lower (coefficient − 1.96 · s.e.) | 95% CI Upper (coefficient + 1.96 · s.e.) |
|---|---|---|---|---|---|
| α (constant) | − 30.935 | 10.636 | − 2.909 | − 51.782 | − 10.088 |
| β1 (dist variable) | 0.204 | 0.101 | 2.020 | 0.006 | 0.402 |
| β2 (sem variable) | 2.920 | 1.011 | 2.888 | 0.938 | 4.902 |
| β3 (per variable) | − 3.776 | 0.847 | − 4.458 | − 5.436 | − 2.116 |
| β5 (style3 variable) | 2.459 | 1.139 | 2.159 | 0.227 | 4.691 |
This table is equal to what we will obtain when estimating the model in Stata and SPSS by means of the Stepwise procedure. Based on the parameter confidence intervals, we can write the lower (minimum) and upper (maximum) limits expressions for the estimated probability that a student i will arrive late to school, with 95% confidence. As such, we will have:
$$p_{i\,min} = \frac{1}{1 + e^{-(-51.782 + 0.006 \cdot dist_i + 0.938 \cdot sem_i - 5.436 \cdot per_i + 0.227 \cdot style3_i)}}$$

$$p_{i\,max} = \frac{1}{1 + e^{-(-10.088 + 0.402 \cdot dist_i + 4.902 \cdot sem_i - 2.116 \cdot per_i + 4.691 \cdot style3_i)}}$$
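The interval bounds used in the two expressions above come directly from the coefficient ± 1.96 · s.e. rule. A sketch, assuming the coefficients and standard errors reported in the text (the dictionary keys are our labels):

```python
# Coefficients and standard errors of the final model (from the text)
params = {
    "alpha":  (-30.935, 10.636),
    "dist":   (0.204, 0.101),
    "sem":    (2.920, 1.011),
    "per":    (-3.776, 0.847),
    "style3": (2.459, 1.139),
}

z_crit = 1.96  # critical z for the 95% confidence level

# (lower, upper) confidence interval for each parameter
ci = {name: (c - z_crit * s, c + z_crit * s) for name, (c, s) in params.items()}
```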
Based on Expression (14.20), the confidence interval of the chance an event of interest occurs for each parameter βj (j = 1, 2, …, k), at the 95% confidence level, can be written the following way:
$$e^{\beta_j \pm 1.96 \cdot [s.e.(\beta_j)]}$$
Notice that we did not present the expression of the odds confidence interval for parameter α, since it only makes sense to discuss the change in the chance of occurrence of the event under study when a determined explanatory variable of the model is altered by one unit, maintaining the remaining conditions constant.
For the data in our example and based on the values of Table 14.6, we will, then, prepare Table 14.7, which presents the confidence intervals of the chance (odds) of occurrence for an event of interest for each parameter βj.
Table 14.7
| Parameter | Chance (Odds) e^βj | 95% Odds CI Lower e^(βj − 1.96 · [s.e.(βj)]) | 95% Odds CI Upper e^(βj + 1.96 · [s.e.(βj)]) |
|---|---|---|---|
| β1 (dist variable) | 1.226 | 1.006 | 1.495 |
| β2 (sem variable) | 18.541 | 2.555 | 134.458 |
| β3 (per variable) | 0.023 | 0.004 | 0.120 |
| β5 (style3 variable) | 11.693 | 1.254 | 109.001 |
These values can also be obtained by means of Stata and SPSS, as we will show in Sections 14.4 and 14.5, respectively.
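The odds intervals in the table are simply the exponentials of the coefficient interval bounds. A sketch under the same assumptions as before (coefficients and standard errors from the text; keys are our labels):

```python
import math

# Beta coefficients and standard errors (from the text)
betas = {
    "dist":   (0.204, 0.101),
    "sem":    (2.920, 1.011),
    "per":    (-3.776, 0.847),
    "style3": (2.459, 1.139),
}

z_crit = 1.96  # 95% confidence level

# (lower, upper) odds confidence interval: exponentiate the coefficient bounds
odds_ci = {
    name: (math.exp(c - z_crit * s), math.exp(c + z_crit * s))
    for name, (c, s) in betas.items()
}
```

Note, for instance, that no interval contains 1 (the per interval lies entirely below 1, the others entirely above), which is the odds-scale counterpart of no coefficient interval containing zero.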
According to what was discussed in the previous chapter, if the confidence interval for a determined parameter contains zero (or if the corresponding odds interval contains 1), that parameter will be considered statistically equal to zero at the confidence level with which the researcher is working. If this happens with the parameter α, it is recommended that nothing be altered in the modeling, since such a fact is due to the use of small samples, and a larger sample will solve this problem. On the other hand, if the confidence interval of a parameter βj contains zero, the corresponding variable will be excluded from the final model when the Stepwise procedure is used. Even though it was not shown here, the confidence interval of the parameter estimated for the variable style2 contains zero since, as discussed, its zcal value was situated between − 1.96 and 1.96 and, therefore, such variable was excluded from the final model.
As was also discussed, the rejection of the null hypothesis for a determined β parameter, at a specified significance level, indicates that the corresponding X variable is significant to explain the probability of occurrence for an event of interest and, consequently, should remain in the final model. We can, therefore, conclude that the decision to exclude a determined X variable in a logistic regression model can be done by means of the direct analysis of the Wald z statistic of its respective β parameter (if − zc < zcal < zc → P-value > 0.05 → we cannot reject that the parameter is statistically equal to zero) or by means of the analysis of the confidence interval (if the same contains zero). Box 14.1 presents the criteria of inclusion or exclusion of the βj (j = 1, 2, …, k) parameters in logistic regression models.
Having estimated the probability model for the occurrence of an event, we will now define the concept of cutoff, based on which it will be possible to classify, in our example, the observations based on the estimated probabilities for each of them. We return to the estimated probability expression for the final model:
$$p_i = \frac{1}{1 + e^{-(-30.935 + 0.204 \cdot dist_i + 2.920 \cdot sem_i - 3.776 \cdot per_i + 2.459 \cdot style3_i)}}$$
Having calculated the pi values by means of the LateMaximumLikelihoodFinalModel.xls file, we will prepare a table with some observations from our sample. Table 14.8 gives the pi values for the randomly chosen 10 observations, solely for teaching purposes.
Table 14.8
Observation | pi |
---|---|
Adelino | 0.05444 |
Carolina | 0.67206 |
Cristina | 0.55159 |
Eduardo | 0.81658 |
Cintia | 0.64918 |
Raimundo | 0.05340 |
Emerson | 0.04484 |
Raquel | 0.56702 |
Rita | 0.85048 |
Leandro | 0.46243 |
A cutoff is defined by the researcher so that the observations can be classified in function of their calculated probabilities and, as such, is used when there is the desire to prepare occurrence predictions of the event for observations not present in the sample, based on the probability of observations present in the sample.
Thus, if a determined observation not present in the sample presents an estimated probability of the event higher than the defined cutoff, the incidence of the event is expected and, therefore, the observation will be classified as an event. On the other hand, if its probability is lower than the defined cutoff, the incidence of a non-event is expected and the observation will, therefore, be classified as a non-event.
In general, we can stipulate the following criteria:
Being that the probability expression is estimated based on the observations present in the sample, the classification, for other observations not initially present in the sample, takes into consideration the behavioral consistency of the estimators and, therefore, for inferential effects, the sample should be significant and representative of population behavior, as with any confirmatory model.1
The cutoff serves for the researcher to evaluate the real incidence of the event for each observation and compare it with the expectation that each observation occurs, in fact, in the event. This being done, it will be possible to evaluate the model success rate based on the actual observations present in the sample and, per inference, assume that such success rate is maintained when there is the desire to evaluate the event incidence for other observations not present in the sample (prediction).
Based on the data from the observations presented in Table 14.8, and choosing, for example, a cutoff of 0.5, we can define that:
Table 14.9 gives, for each of the 10 randomly chosen observations, the real occurrence of the event and its respective classification based on the cutoff definition.
Table 14.9
Observation | Event | pi | Classification Cutoff = 0.5 |
---|---|---|---|
Adelino | No | 0.05444 | No |
Carolina | No | 0.67206 | Yes |
Cristina | No | 0.55159 | Yes |
Eduardo | No | 0.81658 | Yes |
Cintia | No | 0.64918 | Yes |
Raimundo | No | 0.05340 | No |
Emerson | No | 0.04484 | No |
Raquel | No | 0.56702 | Yes |
Rita | Yes | 0.85048 | Yes |
Leandro | Yes | 0.46243 | No |
Now we can prepare a new classification table, still based only on these 10 observations, so as to evaluate if the observations were correctly classified with a cutoff of 0.5 (Table 14.10).
Table 14.10
| | Real Occurrence of the Event | Real Occurrence of the Non-event |
|---|---|---|
| Classified as event | 1 | 5 |
| Classified as non-event | 1 | 3 |
In other words, for these 10 observations, only one was an event and presented a probability higher than 0.5, or rather, it was an event and was in fact classified as such (correctly classified). The other three observations were also classified correctly, or rather, they were not an event and were not classified as an event. On the other hand, six observations were classified incorrectly, or rather, while one was an event, even though it presented a probability lower than 0.5 and, therefore, not classified as an event, the other five were not an event but presented estimated probabilities higher than 0.5 and, consequently, were classified as an event.
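The 2 × 2 classification above can be reproduced mechanically from the 10 observations and the cutoff. A sketch, assuming the probabilities listed in the text (the tuple layout is ours):

```python
# (observation, actually an event?, estimated probability), from the text
sample = [
    ("Adelino", False, 0.05444), ("Carolina", False, 0.67206),
    ("Cristina", False, 0.55159), ("Eduardo", False, 0.81658),
    ("Cintia", False, 0.64918), ("Raimundo", False, 0.05340),
    ("Emerson", False, 0.04484), ("Raquel", False, 0.56702),
    ("Rita", True, 0.85048), ("Leandro", True, 0.46243),
]

cutoff = 0.5

# Counts of the four cells of the classification table
event_as_event = sum(1 for _, ev, p in sample if ev and p > cutoff)
nonevent_as_event = sum(1 for _, ev, p in sample if not ev and p > cutoff)
event_as_nonevent = sum(1 for _, ev, p in sample if ev and p <= cutoff)
nonevent_as_nonevent = sum(1 for _, ev, p in sample if not ev and p <= cutoff)
```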
For our sample of 100 observations, we can elaborate Table 14.11, which gives the complete classification for the 0.5 cutoff. This table can also be obtained by modeling in Stata and SPSS.
Table 14.11
| | Real Occurrence of the Event | Real Occurrence of the Non-event |
|---|---|---|
| Classified as event | 56 | 11 |
| Classified as non-event | 3 | 30 |
For the complete sample, we see that 86 observations were correctly classified for a cutoff of 0.5, being that 56 were events and were in fact classified as such, and another 30 were non-events and were not classified as events with this cutoff. However, 14 observations were incorrectly classified, being that 3 were events but were not classified as such and 11 were not events but were classified as if they had been.
This analysis, known as sensitivity analysis, generates classifications that depend on the choice of cutoff. Further ahead, we will alter the cutoff so as to show that the quantities of observations classified as event or non-event change.
At this time, we will define the concepts of overall model efficiency, sensitivity, and specificity.
The overall model efficiency (OME) corresponds to the percentage of classification hits for a determined cutoff. For our example, the overall model efficiency is calculated as follows:
$$OME = \frac{56 + 30}{100} = 0.8600$$
For the 0.5 cutoff, 86.00% of the observations are classified correctly. As was mentioned in Section 14.2.2, the overall model efficiency, for a determined cutoff, is much more adequate to evaluate model performance than the McFadden pseudo R2 since the dependent variable presents itself in a dichotomic qualitative way.
Sensitivity deals with the percentage of hits, for a determined cutoff, considering only the observations that are, in fact, events. Then, in our example, the denominator for calculating sensitivity is 59, and its expression is given as:
$$\text{Sensitivity} = \frac{56}{59} = 0.9492$$
As such, for a cutoff of 0.5, 94.92% of the observations that are events are classified correctly.
Now, specificity, on the other hand, refers to the percentage of hits, for a given cutoff, considering only the observations that are not events. In our example, the expression is given as:
$$\text{Specificity} = \frac{30}{41} = 0.7317$$
As such, for a cutoff of 0.5, 73.17% of the observations that are non-events are classified correctly, or rather, these observations present estimated probabilities of the occurrence of the event lower than 50%.
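The three indicators can be computed from the four cells of the classification table. A sketch, assuming the counts of Table-style data above for the 0.5 cutoff (variable names are ours):

```python
# Classification counts for the full sample of 100 observations, cutoff = 0.5
event_as_event, nonevent_as_event = 56, 11       # classified as event
event_as_nonevent, nonevent_as_nonevent = 3, 30  # classified as non-event

n = event_as_event + nonevent_as_event + event_as_nonevent + nonevent_as_nonevent

# Overall model efficiency: share of correct classifications
ome = (event_as_event + nonevent_as_nonevent) / n

# Sensitivity: hit rate among the 59 actual events
sensitivity = event_as_event / (event_as_event + event_as_nonevent)

# Specificity: hit rate among the 41 actual non-events
specificity = nonevent_as_nonevent / (nonevent_as_nonevent + nonevent_as_event)
```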
Obviously, overall model efficiency, sensitivity, and specificity change when the cutoff value is changed. Table 14.12 presents a new classification for the sample observations, considering a cutoff of 0.3. In this case, we have the following classification criteria:
Table 14.12
| | Real Occurrence of the Event | Real Occurrence of the Non-event |
|---|---|---|
| Classified as event | 57 | 13 |
| Classified as non-event | 2 | 28 |
| Overall model efficiency | 0.8500 | |
| Sensitivity | 0.9661 | |
| Specificity | 0.6829 | |
In comparing the values obtained for a cutoff of 0.5, we see, in this case (cutoff of 0.3), that while sensitivity presents a small increase, specificity is reduced a little more dramatically, which results, in the overall ambit, in a reduction of the overall model efficiency percentage.
Now, let’s alter the cutoff once again, which will be, for our example, 0.7. For this new situation, we have the following classification criteria:
Table 14.13 shows this new classification, with the calculations for overall model efficiency, sensitivity, and specificity.
Table 14.13
| | Real Occurrence of the Event | Real Occurrence of the Non-event |
|---|---|---|
| Classified as event | 47 | 5 |
| Classified as non-event | 12 | 36 |
| Overall model efficiency | 0.8300 | |
| Sensitivity | 0.7966 | |
| Specificity | 0.8780 | |
In this case, we see another behavior, or rather, while sensitivity presents a considerable reduction, specificity increases. We can even see that the rate of hits for those that are events becomes lower than the rate of hits for those that are not events. However, the overall model efficiency, with a 0.7 cutoff, also presents a reduction in percentage in relation to the model with a cutoff of 0.5.
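The tradeoff across the three cutoffs can be summarized in one loop over the three classification tables. A sketch, assuming the counts reported in the text (the tuple layout is ours):

```python
# (cutoff, event-as-event, nonevent-as-event, event-as-nonevent, nonevent-as-nonevent)
tables = [
    (0.3, 57, 13, 2, 28),
    (0.5, 56, 11, 3, 30),
    (0.7, 47, 5, 12, 36),
]

metrics = {}
for cutoff, ee, ne, en, nn in tables:
    metrics[cutoff] = {
        "ome": (ee + nn) / (ee + ne + en + nn),  # overall model efficiency
        "sensitivity": ee / (ee + en),           # hits among actual events
        "specificity": nn / (nn + ne),           # hits among actual non-events
    }
```

The loop makes the pattern of the text explicit: lowering the cutoff raises sensitivity at the cost of specificity, raising it does the opposite, and 0.5 yields the highest overall model efficiency among the three.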
This sensitivity analysis can be done with any cutoff value of between 0 and 1, which allows the researcher to decide regarding defining a cutoff that attends their prediction objectives. If, for example, the objective is to maximize the overall model efficiency, a determined cutoff can be used that, as we know, can generate nonmaximized values of sensitivity or specificity. If, on the other hand, the objective is to maximize sensitivity, or rather, the rate of hits for those that are events, a cutoff can be defined that will not necessarily maximize the overall model efficiency. Finally, if there is the desire to maximize the rate of hits for observations that are not events (specificity), another cutoff can be defined.
In other words, the analysis of sensitivity is prepared based on the subjacent theory for each study and takes into consideration the choices desired by the researcher in terms of event occurrence prediction for observations not present in the sample, being, therefore, a management and strategic analysis of the phenomenon being investigated.
In academic work and in management reports from diverse organizations, it is common that sensitivity analysis graphs be presented and discussed. The most common are those known as the sensitivity curve and the ROC curve, which have distinct ends. While the sensitivity curve is a graph that presents the sensitivity and specificity values in function of the different cutoff values, the ROC curve is a graph that presents the variation in sensitivity in function of (1 − specificity).
We will present the sensitivity curve (Fig. 14.14) and the ROC curve (Fig. 14.15) for the data calculated in our example. Even though not complete, being that three cutoff values have already been used (0.3, 0.5, and 0.7), said curves will allow that some analyses be formed.
By means of the sensitivity curve, we can see that it is possible to define a cutoff that equates sensitivity with specificity, or rather, the cutoff that causes the rate of correct predictions for observations that will be events to equal the rate of correct predictions for those that will not be events. It is important to mention, however, that this cutoff does not guarantee that the overall model efficiency is the maximum possible.
Besides this, the sensitivity curve allows the researcher to evaluate the tradeoff between sensitivity and specificity as the cutoff is altered, since, in many cases, as has been discussed, the objective of the prediction could be to increase the rate of hits for observations that will be events without a considerable loss in the rate of hits for those that will not be events.
The ROC curve shows the actual behavior of the tradeoff between sensitivity and specificity by bringing, on the abscissa axis, the values of (1 − specificity), presenting a convex shape in relation to point (0, 1). As such, a determined model with a greater area below the ROC curve presents greater overall prediction efficiency, combining all of the cutoff possibilities and, as such, should be preferred over another model with a smaller area below the ROC curve. In other words, if a researcher wants, for example, to include new explanatory variables in the model, a comparison of the overall performance of the models can be prepared based on the area below the ROC curve, being that, the greater its convexity in relation to point (0, 1), the greater its area (higher sensitivity and higher specificity) and, consequently, the better the estimated model for the effects of prediction. Fig. 14.16 presents an illustration of this concept.
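The area below the ROC curve has an intuitive interpretation: it equals the probability that a randomly chosen event receives a higher estimated probability than a randomly chosen non-event. The sketch below illustrates this pairwise computation using only the 10 observations shown earlier, so the resulting value is purely illustrative and is not the area for the full sample of 100 observations:

```python
# Estimated probabilities for the 10 illustrative observations (from the text)
event_probs = [0.85048, 0.46243]  # Rita, Leandro (actual events)
non_event_probs = [0.05444, 0.67206, 0.55159, 0.81658,
                   0.64918, 0.05340, 0.04484, 0.56702]  # actual non-events

# Count event/non-event pairs where the event has the higher probability
# (ties count as half)
wins = 0.0
for pe in event_probs:
    for pn in non_event_probs:
        if pe > pn:
            wins += 1.0
        elif pe == pn:
            wins += 0.5

auc = wins / (len(event_probs) * len(non_event_probs))
```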
According to Swets (1996), the ROC curve has this name because it compares the alteration of two model operational characteristics (sensitivity and specificity). It was first used by engineers in the Second World War in the study to detect enemy objects in battle. Next, it was introduced to psychology for the investigation of the perceptual detection of determined stimuli and is, today, widely used in the field of medicine, such as radiology, and in different fields of applied social science, such as economics and finance. In this specific case, it is used considerably in risk and credit management and the probability of default.
In Sections 14.4 and 14.5, we will present the sensitivity and ROC curves by means of Stata and SPSS, respectively, with all cutoff value possibilities between 0 and 1 for the final estimated model, including the calculation of the respective area below the ROC curve.
When the dependent variable that represents the phenomenon under study is qualitative, but offers more than two possible answers (categories), we should use the multinomial logistic regression to estimate the occurrence probabilities for each alternative. To do this, we must first define the reference category.
Imagine a situation where a dependent variable presents itself in a qualitative form with three possible answer categories (0, 1, or 2). If the chosen reference category is category 0, we will have two other event possibilities in relation to this category, which will be represented by categories 1 and 2 and, as such, two explanatory variable vectors will be defined with the respective estimated parameters, or rather, two logits, as follows:
$$Z_{i1} = \alpha_1 + \beta_{11} \cdot X_{1i} + \beta_{21} \cdot X_{2i} + \cdots + \beta_{k1} \cdot X_{ki}$$

$$Z_{i2} = \alpha_2 + \beta_{12} \cdot X_{1i} + \beta_{22} \cdot X_{2i} + \cdots + \beta_{k2} \cdot X_{ki}$$
where the logit number now appears in the subscript of each parameter to be estimated.
Then, generically, if the dependent variable that represents the variable under study presents M answer categories, the number of estimated logits will be (M − 1) and, based on the same, we can estimate the probability of occurrence for each of the categories. The general expression of the logit Zim (m = 0, 1, …, M − 1) for a model where a dependent variable assumes M answer categories is:
$$Z_{im} = \alpha_m + \beta_{1m} \cdot X_{1i} + \beta_{2m} \cdot X_{2i} + \cdots + \beta_{km} \cdot X_{ki}$$

where $Z_{i0} = 0$ and, therefore, $e^{Z_{i0}} = 1$.
Until now, in this chapter, we have been working with two categories and, consequently, only one Zi logit. In this way, the probabilities of the occurrence of a non-event and an event were calculated, respectively, by means of the following expressions:
Probability of occurrence of the non-event:
$$1 - p_i = \frac{1}{1 + e^{Z_i}}$$

Probability of occurrence of the event:

$$p_i = \frac{e^{Z_i}}{1 + e^{Z_i}}$$
Now for three categories, and based on Expressions (14.23) and (14.24), we can estimate the probability of occurrence for reference category 0 and the occurrence probabilities of the two distinct events represented by categories 1 and 2. As such, the expressions for these probabilities can be written in the following way:
Probability of occurrence for category 0 (reference):
$$p_{i0} = \frac{1}{1 + e^{Z_{i1}} + e^{Z_{i2}}}$$

Probability of occurrence for category 1:

$$p_{i1} = \frac{e^{Z_{i1}}}{1 + e^{Z_{i1}} + e^{Z_{i2}}}$$

Probability of occurrence for category 2:

$$p_{i2} = \frac{e^{Z_{i2}}}{1 + e^{Z_{i1}} + e^{Z_{i2}}}$$
such that the sum of the probability of event occurrences, represented by the distinct categories, will always be 1.
In their complete form, Expressions (14.28)–(14.30) can be written as:
$$p_{i0} = \frac{1}{1 + e^{(\alpha_1 + \beta_{11} \cdot X_{1i} + \beta_{21} \cdot X_{2i} + \cdots + \beta_{k1} \cdot X_{ki})} + e^{(\alpha_2 + \beta_{12} \cdot X_{1i} + \beta_{22} \cdot X_{2i} + \cdots + \beta_{k2} \cdot X_{ki})}}$$

$$p_{i1} = \frac{e^{(\alpha_1 + \beta_{11} \cdot X_{1i} + \beta_{21} \cdot X_{2i} + \cdots + \beta_{k1} \cdot X_{ki})}}{1 + e^{(\alpha_1 + \beta_{11} \cdot X_{1i} + \beta_{21} \cdot X_{2i} + \cdots + \beta_{k1} \cdot X_{ki})} + e^{(\alpha_2 + \beta_{12} \cdot X_{1i} + \beta_{22} \cdot X_{2i} + \cdots + \beta_{k2} \cdot X_{ki})}}$$

$$p_{i2} = \frac{e^{(\alpha_2 + \beta_{12} \cdot X_{1i} + \beta_{22} \cdot X_{2i} + \cdots + \beta_{k2} \cdot X_{ki})}}{1 + e^{(\alpha_1 + \beta_{11} \cdot X_{1i} + \beta_{21} \cdot X_{2i} + \cdots + \beta_{k1} \cdot X_{ki})} + e^{(\alpha_2 + \beta_{12} \cdot X_{1i} + \beta_{22} \cdot X_{2i} + \cdots + \beta_{k2} \cdot X_{ki})}}$$
In general, for a model where the dependent variable assumes M answer categories, we can write the probability expression $p_{im}$ (m = 0, 1, …, M − 1) as follows:

$$p_{im} = \frac{e^{Z_{im}}}{\sum_{m=0}^{M-1} e^{Z_{im}}}$$
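This general expression can be sketched as a small function that maps the (M − 1) estimated logits to the M category probabilities, with the reference-category convention Z₀ = 0 built in (the function name and the sample logit values are ours, chosen only for illustration):

```python
import math

def multinomial_probs(logits):
    """Probabilities for M categories given the (M - 1) estimated logits;
    the reference category has logit Z_0 = 0 by convention, so e^{Z_0} = 1."""
    exps = [1.0] + [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Arbitrary illustrative logit values (not estimated from the example data)
p = multinomial_probs([1.0, -0.5])  # returns [p_0, p_1, p_2]
```

Whatever the logit values, the M probabilities always sum to 1, as the text states.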
Analogous to the procedure developed in Sections 14.2.1–14.2.3, we will now estimate the parameters for Expressions (14.23) and (14.24) by using an example. We will also evaluate the general statistical significance of the model and parameters, as well as estimate their confidence intervals at a determined significance level. As such, we will again use, at this time, Excel.
We will present the concepts pertinent to estimation of a multinomial logistic regression by maximum likelihood using an example similar to that developed in the previous section.
Now, imagine that our tireless professor is not only interested in studying what causes students to arrive late to school or not. He now wants to know if the students arrive late to their first or second class. In other words, the professor is now interested in investigating if some variables relative to the route taken influence the probability of arriving or not arriving late to the first class or the second class. Now, the dependent variable comes to have three categories: not arrive late, arrive late to the first class, and arrive late to the second class.
Being thus, the professor researched the same 100 students in the school where he lectures; however, the research was done on another day. Being that some students were a little tired of answering so many questions as of late, the professor, besides the variable referent to the phenomenon under study, decided to ask only regarding the distance (dist) and the number of traffic lights (sem) each went through that day on their way to school. Part of the dataset can be found in Table 14.14.
Table 14.14
Student | Arrived Late to School (No = 0; Yes to First Class = 1; Yes to Second Class = 2) (Yi) | Distance Traveled to School (km) (X1i) | Number of Traffic Lights—sem (X2i) |
---|---|---|---|
Gabriela | 2 | 20.5 | 15 |
Patricia | 2 | 21.3 | 18 |
Gustavo | 2 | 21.4 | 16 |
Leticia | 2 | 31.5 | 15 |
Luiz Ovidio | 2 | 17.5 | 16 |
Leonor | 2 | 21.5 | 18 |
Dalila | 2 | 21.5 | 18 |
Antonio | 2 | 23.4 | 18 |
Julia | 2 | 22.7 | 18 |
Mariana | 2 | 22.7 | 18 |
… | |||
Rodrigo | 1 | 16.0 | 16 |
… | |||
Estela | 0 | 1.0 | 13 |
As we can see, the dependent variable now has three distinct values, which are nothing more than labels referent to each of the three answer categories (M = 3). It is unfortunately common for beginning researchers to prepare multiple regression models, for example, assuming that the dependent variable is quantitative just because it presents numbers in its column. As we discussed in the previous section, this is a serious mistake!
The complete dataset for this new example can be found in the LateMultinomial.xls file.
The expressions for the logits we wish to estimate are, therefore:
$$Z_{i1} = \alpha_1 + \beta_{11} \cdot dist_i + \beta_{21} \cdot sem_i$$

$$Z_{i2} = \alpha_2 + \beta_{12} \cdot dist_i + \beta_{22} \cdot sem_i$$
which refer to events 1 and 2, respectively, presented in Table 14.14. Notice that the event represented by the label 0 refers to the reference category.
Then, based on Expressions (14.31)–(14.33), we can write the estimated occurrence probability expressions for each event corresponding to each category of the dependent variable. Being thus, we have:
$$p_{i0} = \frac{1}{1 + e^{(\alpha_1 + \beta_{11} \cdot dist_i + \beta_{21} \cdot sem_i)} + e^{(\alpha_2 + \beta_{12} \cdot dist_i + \beta_{22} \cdot sem_i)}}$$

$$p_{i1} = \frac{e^{(\alpha_1 + \beta_{11} \cdot dist_i + \beta_{21} \cdot sem_i)}}{1 + e^{(\alpha_1 + \beta_{11} \cdot dist_i + \beta_{21} \cdot sem_i)} + e^{(\alpha_2 + \beta_{12} \cdot dist_i + \beta_{22} \cdot sem_i)}}$$

$$p_{i2} = \frac{e^{(\alpha_2 + \beta_{12} \cdot dist_i + \beta_{22} \cdot sem_i)}}{1 + e^{(\alpha_1 + \beta_{11} \cdot dist_i + \beta_{21} \cdot sem_i)} + e^{(\alpha_2 + \beta_{12} \cdot dist_i + \beta_{22} \cdot sem_i)}}$$
where pi0, pi1, and pi2 represent the probability that a student i will not arrive late (category 0), the probability that a student i will arrive late to the first class (category 1), and the probability that a student i will arrive late to the second class (category 2), respectively.
To estimate the parameters of the probability expressions, we will again use estimation by maximum likelihood. Generically, in the multinomial logistic regression, where the dependent variable follows a multinomial distribution, an observation i can occur in a determined event of interest, given M possible events and, therefore, the occurrence probability $p_{im}$ (m = 0, 1, …, M − 1) for this specific event can be written in the following manner:

$$p(Y_{im}) = \prod_{m=0}^{M-1} (p_{im})^{Y_{im}}$$
For a sample with n observations, we can define the likelihood function in the following way:
$$L = \prod_{i=1}^{n} \prod_{m=0}^{M-1} (p_{im})^{Y_{im}}$$
from which comes, based on Expression (14.34), that:
$$L = \prod_{i=1}^{n} \prod_{m=0}^{M-1} \left( \frac{e^{Z_{im}}}{\sum_{m=0}^{M-1} e^{Z_{im}}} \right)^{Y_{im}}$$
Analogous to the procedure adopted when studying the binary logistic regression, we will here work with the logarithmic likelihood function, which leads us to the following function, also known as log likelihood function:
$$LL = \sum_{i=1}^{n} \sum_{m=0}^{M-1} \left[ Y_{im} \cdot \ln \left( \frac{e^{Z_{im}}}{\sum_{m=0}^{M-1} e^{Z_{im}}} \right) \right]$$
And, therefore, we can ask an important question: given M categories of the dependent variable, what are the values of the parameters of the logits Z_{im} (m = 0, 1, …, M − 1), represented by Expression (14.25), that cause the LL value of Expression (14.38) to be maximized? This fundamental question is the key to the estimation of the parameters of the multinomial logistic regression model by the maximum likelihood method, and it can be answered with optimization tools, by solving the problem with the following objective function:
$$LL = \sum_{i=1}^{n} \sum_{m=0}^{M-1} \left[ Y_{im} \cdot \ln \left( \frac{e^{Z_{im}}}{\sum_{m=0}^{M-1} e^{Z_{im}}} \right) \right] = \max$$
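This maximization can also be sketched outside a spreadsheet with a numerical optimizer, such as `scipy.optimize.minimize` applied to the negative of the log-likelihood function. The tiny sample of (dist, sem, category) triples below is invented purely for illustration and is not the book's dataset:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

# Illustrative (made-up) sample of (dist, sem, category) triples.
sample = [(5.0, 10, 0), (8.0, 12, 0), (20.0, 16, 0),
          (12.0, 13, 1), (18.0, 15, 1), (6.0, 11, 1),
          (22.0, 17, 2), (14.0, 13, 2), (25.0, 18, 2)]

def neg_log_likelihood(theta):
    """Negative of the LL in Expression (14.38); category 0 is the reference."""
    a1, b11, b21, a2, b12, b22 = theta
    nll = 0.0
    for dist, sem, y in sample:
        z = np.array([0.0,
                      a1 + b11 * dist + b21 * sem,
                      a2 + b12 * dist + b22 * sem])
        # ln p_iy = z_y - ln(sum_m e^{z_m}); logsumexp avoids overflow
        nll -= z[y] - logsumexp(z)
    return nll

result = minimize(neg_log_likelihood, x0=np.zeros(6), method="BFGS")
ll_max = -result.fun                        # maximized log likelihood
ll_zero = -neg_log_likelihood(np.zeros(6))  # LL with all parameters at 0
```

Maximizing LL is equivalent to minimizing −LL, which is what `minimize` does; the maximized value can never be worse than the log likelihood at the all-zero starting point.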
Returning to our example, we will solve this problem using the Excel Solver tool. To do this, we should open the LateMultinomialMaximumLikelihood.xls file, which will help in the parameter calculation.
In this file, besides the dependent and explanatory variables, three variables Y_{im} (m = 0, 1, 2) were created, referring to the three categories of the dependent variable. This procedure is necessary to operationalize Expression (14.35). These variables were created based on the criteria presented in Table 14.15.
Besides this, six other new variables were also created, corresponding to the logits Z_{i1} and Z_{i2}, the probabilities p_{i0}, p_{i1}, and p_{i2}, and the log-likelihood term LL_i for each observation, respectively. Table 14.16 shows part of the data when all parameters are equal to 0.
Table 14.16
Student | Yi | Yi0 | Yi1 | Yi2 | X1i | X2i | Zi1 | Zi2 | pi0 | pi1 | pi2 | $LL_i = \sum_{m=0}^{2} Y_{im} \cdot \ln(p_{im})$ |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Gabriela | 2 | 0 | 0 | 1 | 20.5 | 15 | 0 | 0 | 0.33 | 0.33 | 0.33 | − 1.09861 |
Patricia | 2 | 0 | 0 | 1 | 21.3 | 18 | 0 | 0 | 0.33 | 0.33 | 0.33 | − 1.09861 |
Gustavo | 2 | 0 | 0 | 1 | 21.4 | 16 | 0 | 0 | 0.33 | 0.33 | 0.33 | − 1.09861 |
Leticia | 2 | 0 | 0 | 1 | 31.5 | 15 | 0 | 0 | 0.33 | 0.33 | 0.33 | − 1.09861 |
Luiz Ovidio | 2 | 0 | 0 | 1 | 17.5 | 16 | 0 | 0 | 0.33 | 0.33 | 0.33 | − 1.09861 |
Leonor | 2 | 0 | 0 | 1 | 21.5 | 18 | 0 | 0 | 0.33 | 0.33 | 0.33 | − 1.09861 |
Dalila | 2 | 0 | 0 | 1 | 21.5 | 18 | 0 | 0 | 0.33 | 0.33 | 0.33 | − 1.09861 |
Antonio | 2 | 0 | 0 | 1 | 23.4 | 18 | 0 | 0 | 0.33 | 0.33 | 0.33 | − 1.09861 |
Julia | 2 | 0 | 0 | 1 | 22.7 | 18 | 0 | 0 | 0.33 | 0.33 | 0.33 | − 1.09861 |
Mariana | 2 | 0 | 0 | 1 | 22.7 | 18 | 0 | 0 | 0.33 | 0.33 | 0.33 | − 1.09861 |
… | ||||||||||||
Rodrigo | 1 | 0 | 1 | 0 | 16.0 | 16 | 0 | 0 | 0.33 | 0.33 | 0.33 | − 1.09861 |
… | ||||||||||||
Estela | 0 | 1 | 0 | 0 | 1.0 | 13 | 0 | 0 | 0.33 | 0.33 | 0.33 | − 1.09861 |
Sum | $LL = \sum_{i=1}^{100} \sum_{m=0}^{2} Y_{im} \cdot \ln(p_{im})$ | − 109.86123 |
Exclusively for teaching purposes, we present the calculation of LL for an observation where Yi = 2 and where all parameters are equal to zero:
$$LL_1 = \sum_{m=0}^{2} [Y_{1m} \cdot \ln(p_{1m})] = Y_{10} \cdot \ln(p_{10}) + Y_{11} \cdot \ln(p_{11}) + Y_{12} \cdot \ln(p_{12}) = (0) \cdot \ln(0.33) + (0) \cdot \ln(0.33) + (1) \cdot \ln(0.33) = -1.09861$$
Fig. 14.17 presents part of the data present in the LateMultinomialMaximumLikelihood.xls file.
As we discussed in Section 14.2.1, here we should also find the optimum combination of parameter values, such that the objective function presented in Expression (14.39) is satisfied, that is, such that the sum of the log-likelihood function is the maximum possible. We again resort to Excel Solver to solve this problem.
The objective function is in cell M103, which will be our destination cell and should be maximized. The parameters α1, β11, β21, α2, β12, and β22, whose values are in cells P3, P5, P7, P9, P11, and P13, respectively, are the variables cells. The Solver window will be as shown in Fig. 14.18.
Clicking on Solve and then OK, we obtain the optimal solution to the optimization problem. Table 14.17 shows part of the results obtained.
Table 14.17
Student | Yi | Yi0 | Yi1 | Yi2 | X1i | X2i | Zi1 | Zi2 | pi0 | pi1 | pi2 | $LL_i = \sum_{m=0}^{2} Y_{im} \cdot \ln(p_{im})$ |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Gabriela | 2 | 0 | 0 | 1 | 20.5 | 15 | 3.37036 | 3.23816 | 0.01799 | 0.52341 | 0.45860 | − 0.77959 |
Patricia | 2 | 0 | 0 | 1 | 21.3 | 18 | 8.82883 | 12.78751 | 0.00000 | 0.01873 | 0.98127 | − 0.01891 |
Gustavo | 2 | 0 | 0 | 1 | 21.4 | 16 | 5.54391 | 7.10441 | 0.00068 | 0.17346 | 0.82586 | − 0.19133 |
Leticia | 2 | 0 | 0 | 1 | 31.5 | 15 | 9.51977 | 15.10301 | 0.00000 | 0.00375 | 0.99625 | − 0.00375 |
Luiz Ovidio | 2 | 0 | 0 | 1 | 17.5 | 16 | 3.36367 | 2.89778 | 0.02082 | 0.60162 | 0.37756 | − 0.97402 |
Leonor | 2 | 0 | 0 | 1 | 21.5 | 18 | 8.94064 | 13.00323 | 0.00000 | 0.01691 | 0.98308 | − 0.01706 |
Dalila | 2 | 0 | 0 | 1 | 21.5 | 18 | 8.94064 | 13.00323 | 0.00000 | 0.01691 | 0.98308 | − 0.01706 |
Antonio | 2 | 0 | 0 | 1 | 23.4 | 18 | 10.00281 | 15.05262 | 0.00000 | 0.00637 | 0.99363 | − 0.00639 |
Julia | 2 | 0 | 0 | 1 | 22.7 | 18 | 9.61149 | 14.29758 | 0.00000 | 0.00914 | 0.99086 | − 0.00918 |
Mariana | 2 | 0 | 0 | 1 | 22.7 | 18 | 9.61149 | 14.29758 | 0.00000 | 0.00914 | 0.99086 | − 0.00918 |
… | ||||||||||||
Rodrigo | 1 | 0 | 1 | 0 | 16.0 | 16 | 2.52511 | 1.27985 | 0.05852 | 0.73104 | 0.21044 | − 0.31329 |
… | ||||||||||||
Estela | 0 | 1 | 0 | 0 | 1.0 | 13 | − 10.87168 | − 23.58594 | 0.99998 | 0.00002 | 0.00000 | − 0.00002 |
Sum | $LL = \sum_{i=1}^{100} \sum_{m=0}^{2} Y_{im} \cdot \ln(p_{im})$ | − 24.51180 |
The maximum possible value of the log-likelihood function is LL_max = − 24.51180. The solution to this problem generated the following parameter estimates: α1 = − 33.135, β11 = 0.559, β21 = 1.670, α2 = − 62.292, β12 = 1.078, and β22 = 2.895. In this way, the logits Z_{i1} and Z_{i2} can be written as follows:
$$Z_{i1} = -33.135 + 0.559 \cdot dist_i + 1.670 \cdot sem_i$$

$$Z_{i2} = -62.292 + 1.078 \cdot dist_i + 2.895 \cdot sem_i$$
Fig. 14.19 presents part of the results obtained by modeling the LateMultinomialMaximumLikelihood.xls file.
Based on the expressions of the logits Zi1 and Zi2, we can write the expressions of the occurrence probabilities for each of the categories of the dependent variable as follows:
Probability of a student i not arriving late (category 0):
$$p_{i0} = \frac{1}{1 + e^{(-33.135 + 0.559 \cdot dist_i + 1.670 \cdot sem_i)} + e^{(-62.292 + 1.078 \cdot dist_i + 2.895 \cdot sem_i)}}$$
Probability of a student i arriving late to the first class (category 1):
$$p_{i1} = \frac{e^{(-33.135 + 0.559 \cdot dist_i + 1.670 \cdot sem_i)}}{1 + e^{(-33.135 + 0.559 \cdot dist_i + 1.670 \cdot sem_i)} + e^{(-62.292 + 1.078 \cdot dist_i + 2.895 \cdot sem_i)}}$$
Probability of a student i arriving late to the second class (category 2):
$$p_{i2} = \frac{e^{(-62.292 + 1.078 \cdot dist_i + 2.895 \cdot sem_i)}}{1 + e^{(-33.135 + 0.559 \cdot dist_i + 1.670 \cdot sem_i)} + e^{(-62.292 + 1.078 \cdot dist_i + 2.895 \cdot sem_i)}}$$
Having estimated, by maximum likelihood, the parameters of the occurrence probability equations for each category of the dependent variable, we can classify the observations and define the overall efficiency of the multinomial logistic regression model. Unlike the binary logistic regression, where the classification is prepared based on the definition of a cutoff, in the multinomial logistic regression the classification of each observation is based on the highest probability among those calculated (p_{i0}, p_{i1}, or p_{i2}). For example, since observation 1 (Gabriela) presented p_{i0} = 0.018, p_{i1} = 0.523, and p_{i2} = 0.459, we classify her in category 1; that is, the model predicts that Gabriela will arrive late to the first class. However, this student actually arrived late to the second class and, therefore, for this case, the model did not score a hit.
Table 14.18 presents the classification for our complete sample, with emphasis on the hits for each category of the dependent variable, highlighting as well the overall model efficiency (overall percentage of hits).
Table 14.18
Observed | Classification | |||
---|---|---|---|---|
Did Not Arrive Late | Arrived Late to First Class | Arrived Late to Second Class | Percentage of Hits (%) |
Did not arrive late | 47 | 2 | 0 | 95.9 |
Arrived late to first class | 1 | 12 | 3 | 75.0 |
Arrived late to second class | 0 | 5 | 30 | 85.7 |
Overall model efficiency | 89.0 |
From the analysis of Table 14.18, we can see that the model presents an overall percentage of hits of 89.0%. The model achieves its highest percentage of hits (95.9%) for the cases of students who did not arrive late to class. On the other hand, for students who arrived late to the first class, the model has its lowest percentage of hits (75.0%).
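The highest-probability classification rule can be sketched as follows, using the fitted probabilities for three of the students taken from Table 14.17:

```python
# (p_i0, p_i1, p_i2) from Table 14.17 and the observed category.
observations = {
    "Gabriela": ((0.01799, 0.52341, 0.45860), 2),
    "Rodrigo":  ((0.05852, 0.73104, 0.21044), 1),
    "Estela":   ((0.99998, 0.00002, 0.00000), 0),
}

def classify(probs):
    """Assign the category with the highest estimated probability."""
    return max(range(len(probs)), key=lambda m: probs[m])

# A hit occurs when the predicted category equals the observed one.
hits = {name: classify(probs) == observed
        for name, (probs, observed) in observations.items()}
```

Gabriela is classified in category 1 (her highest probability, 0.523) but was observed in category 2, so she is a miss; Rodrigo and Estela are hits.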
We now go on to the study of the general statistical significance of the obtained model, as well as the statistical significance of the actual parameters, as we did in Section 14.2.
As in the binary logistic regression studied in Section 14.2, multinomial logistic regression modeling also offers statistics referent to the McFadden pseudo R2 and to χ2, whose calculations are given based on Expressions (14.16) and (14.17), respectively, given again here:
$$\text{pseudo } R^2 = \frac{-2 \cdot LL_0 - (-2 \cdot LL_{max})}{-2 \cdot LL_0}$$

$$\chi^2 = -2 \cdot (LL_0 - LL_{max})$$
While the McFadden pseudo R², as discussed in Section 14.2.2, conveys limited information about model fit and is mainly useful when the researcher is interested in comparing distinct models, the χ² statistic allows a significance test of the proposed model: if all estimated parameters β_{jm} (j = 1, 2, …, k; m = 1, 2, …, M − 1) are statistically equal to 0, changes in the explanatory variables will not influence the occurrence probabilities of the events represented by the categories of the dependent variable. The null and alternative hypotheses of the χ² test for a general multinomial logistic regression model are, therefore, H₀: all β_{jm} parameters are statistically equal to zero, and H₁: at least one β_{jm} parameter is statistically different from zero.
Returning to our example, we have that LLmax, which is the maximum possible value of the sum of the logarithmic likelihood function, is equal to − 24.51180. To calculate LL0, which represents the maximum possible value of the sum of the logarithmic likelihood function for a model that only presents the constants α1 and α2 and no explanatory variable, we will again use Solver, by means of the LateMultinomialMaximumLikelihoodNullModel.xls file. Figs. 14.20 and 14.21 show the Solver window and part of the results obtained by modeling in this file, respectively.
Based on the null model, we have LL0 = − 101.01922 and, as such, we can calculate the following statistics:
$$\text{pseudo } R^2 = \frac{-2 \cdot (-101.01922) - [-2 \cdot (-24.51180)]}{-2 \cdot (-101.01922)} = 0.7574$$

$$\chi^2_{4\ \text{d.f.}} = -2 \cdot [-101.01922 - (-24.51180)] = 153.0148$$
For 4 degrees of freedom (the number of β parameters, since there are two explanatory variables and two logits), we have, by means of Table D in the Appendix, that χ²_c = 9.488 (the critical χ² for 4 degrees of freedom at the 5% significance level). Since the calculated χ²_cal = 153.0148 > χ²_c = 9.488, we can reject the null hypothesis that all β_{jm} (j = 1, 2; m = 1, 2) parameters are statistically equal to zero. Thus, at least one X variable is statistically significant to explain the probability of occurrence of at least one of the events under study. In the same way as discussed in Section 14.2.2, we can define the following criteria:
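These two statistics, and the critical value read from Table D, can be checked numerically with `scipy.stats.chi2` (a sketch; the LL values are those reported in the text):

```python
from scipy.stats import chi2

ll_0 = -101.01922    # null model (constants only)
ll_max = -24.51180   # full model

# McFadden pseudo R-squared, Expression (14.16)
pseudo_r2 = (-2 * ll_0 - (-2 * ll_max)) / (-2 * ll_0)

# Chi-squared statistic, Expression (14.17)
chi2_stat = -2 * (ll_0 - ll_max)

# Critical value at the 5% significance level with 4 degrees of freedom
chi2_crit = chi2.ppf(0.95, df=4)
reject_h0 = chi2_stat > chi2_crit
```

The computed values match the hand calculation: pseudo R² ≈ 0.7574, χ² ≈ 153.0148, and the critical value ≈ 9.488, so H₀ is rejected.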
Besides the general statistical significance of the model, it is necessary to verify the statistical significance of each parameter by means of the respective Wald z statistics. The null and alternative hypotheses, for the parameters α_m (m = 1, 2, …, M − 1) and β_{jm} (j = 1, 2, …, k; m = 1, 2, …, M − 1), are H₀: the parameter is statistically equal to zero, and H₁: the parameter is statistically different from zero.
The Wald z statistics are obtained based on Expression (14.18); however, maintaining the pattern presented in Section 14.2.2, we will not derive the standard error of each parameter. For our example, they are:
Then, as we have already estimated the parameters, we have that:
$$z_{\alpha_1} = \frac{\alpha_1}{s.e.(\alpha_1)} = \frac{-33.135}{12.183} = -2.720$$

$$z_{\beta_{11}} = \frac{\beta_{11}}{s.e.(\beta_{11})} = \frac{0.559}{0.243} = 2.300$$

$$z_{\beta_{21}} = \frac{\beta_{21}}{s.e.(\beta_{21})} = \frac{1.670}{0.577} = 2.894$$

$$z_{\alpha_2} = \frac{\alpha_2}{s.e.(\alpha_2)} = \frac{-62.292}{14.675} = -4.244$$

$$z_{\beta_{12}} = \frac{\beta_{12}}{s.e.(\beta_{12})} = \frac{1.078}{0.302} = 3.570$$

$$z_{\beta_{22}} = \frac{\beta_{22}}{s.e.(\beta_{22})} = \frac{2.895}{0.686} = 4.220$$
As we can see, all calculated Wald z statistics present values lower than z_c = − 1.96 or greater than z_c = 1.96 (the critical values at the 5% significance level, with probability 0.025 in each of the lower and upper tails).
As such, we can see that, for our example, the criterion |z| > 1.96 is satisfied for every parameter. In other words, the dist and sem variables are statistically significant, at the 95% confidence level, to explain the differences in the probabilities of arriving late to the first class and to the second class in relation to not arriving late. The expressions for these probabilities are those estimated in Section 14.3.1 and presented at its end.
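The Wald z statistics are simply each coefficient divided by its standard error; a sketch reproducing the significance check with the values reported in the text:

```python
# (coefficient, standard error) pairs reported in the text.
estimates = {
    "alpha1": (-33.135, 12.183), "beta11": (0.559, 0.243),
    "beta21": (1.670, 0.577),    "alpha2": (-62.292, 14.675),
    "beta12": (1.078, 0.302),    "beta22": (2.895, 0.686),
}

# Wald z = coefficient / standard error, as in Expression (14.18)
z_stats = {name: coef / se for name, (coef, se) in estimates.items()}

# At the 5% significance level, a parameter is significant if |z| > 1.96.
all_significant = all(abs(z) > 1.96 for z in z_stats.values())
```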
As such, based on the final estimated probability models, we can propose three interesting questions, as we did in Section 14.2.2:
What is the average estimated probability of arriving late to the first class after traveling 17 kilometers and going through 15 traffic lights?
Since arriving late to the first class is category 1, we should use the estimated probability expression p_{i1}. As such, for this situation, we have that:
$$p_1 = \frac{e^{[-33.135 + 0.559 \cdot (17) + 1.670 \cdot (15)]}}{1 + e^{[-33.135 + 0.559 \cdot (17) + 1.670 \cdot (15)]} + e^{[-62.292 + 1.078 \cdot (17) + 2.895 \cdot (15)]}} = 0.722$$
Then, the average estimated probability of arriving late to the first class is, under the informed conditions, equal to 72.2%.
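This calculation can be verified numerically with the estimated parameters (the function name below is ours; any small difference from the hand calculation is only rounding of the reported coefficients):

```python
import math

def p_late_first_class(dist, sem):
    """Estimated probability of arriving late to the first class (category 1)."""
    z1 = -33.135 + 0.559 * dist + 1.670 * sem
    z2 = -62.292 + 1.078 * dist + 2.895 * sem
    return math.exp(z1) / (1 + math.exp(z1) + math.exp(z2))

# 17 kilometers and 15 traffic lights, as in the question above.
p = p_late_first_class(17, 15)
```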
On average, how much does one alter the chance of arriving late to the first class, in relation to not arriving late to school, in adopting a route that is 1 kilometer longer, maintaining the remaining conditions constant?
To answer this question, we will again resort to Expression (14.3), which can be written as follows:
$$odds_{Y_i = 1} = e^{Z_{i1}}$$
such that, maintaining the remaining conditions constant, the chance to arrive late to the first class in relation to not arriving late to school, by adopting a route that is 1 kilometer longer, is:
$$odds_{Y_i = 1} = e^{0.559} = 1.749$$
Then, the chance is multiplied by a factor of 1.749, or rather, maintaining the other conditions constant, the chance of arriving late to the first class in relation to not arriving late, by adopting a route that is 1 kilometer longer, is, on average, 74.9% higher. In multinomial logistic regression models, chance (odds ratio) is also known as relative risk ratio.
On average, how much does one alter the chance of arriving late to the second class, in relation to not arriving late to school, in going through 1 more traffic light, maintaining the remaining conditions constant?
In this case, since the event of interest refers to the category of arriving late to the second class, the chance expression becomes:

$$odds_{Y_i = 2} = e^{2.895} = 18.081$$
Then, the chance is multiplied by a factor of 18.081, or rather, maintaining the remaining conditions constant, the chance of arriving late to the second class in relation to not arriving late to school, in going through 1 more traffic light in the route to school, is, on average, 1708.1% higher.
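Both relative risk ratios can be reproduced directly by exponentiating the corresponding estimated coefficients:

```python
import math

# Chance multiplier for 1 km more (late to 1st class vs. not late)
rrr_dist_first = math.exp(0.559)   # ~1.749

# Chance multiplier for 1 more traffic light (late to 2nd class vs. not late)
rrr_sem_second = math.exp(2.895)   # ~18.081
```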
As we can see, these calculations always use the average parameter estimates. As we did in Section 14.2, we will now go on to the study of the confidence intervals for these parameters.
The confidence intervals for the estimated parameters in a multinomial logistic regression are also calculated by means of Expression (14.21) presented in Section 14.2.3. Then, at the 95% confidence level, they can be defined, for parameters αm (m = 1, 2, …, M − 1) and βjm (j = 1, 2, …, k; m = 1, 2, …, M − 1) in the following way:
$$\alpha_m \pm 1.96 \cdot [s.e.(\alpha_m)] \qquad \beta_{jm} \pm 1.96 \cdot [s.e.(\beta_{jm})]$$

where 1.96 is the critical z (z_c) for the 5% significance level.
For the data in our example, Table 14.19 presents the estimated coefficients of parameters αm (m = 1, 2) and βjm (j = 1, 2; m = 1, 2) of the occurrence probability expressions in the events of interest, with the respective standard errors, the Wald z statistics, and the confidence intervals for the 5% significance level.
Table 14.19
Parameter | Coefficient | Standard Error (s.e.) | z | 95% CI Lower Bound ($\alpha_m$ or $\beta_{jm}$ − 1.96 · s.e.) | 95% CI Upper Bound ($\alpha_m$ or $\beta_{jm}$ + 1.96 · s.e.) |
---|---|---|---|---|---|
α1 (constant) | − 33.135 | 12.183 | − 2.720 | − 57.014 | − 9.256 |
β11 (dist variable) | 0.559 | 0.243 | 2.300 | 0.082 | 1.035 |
β21 (sem variable) | 1.670 | 0.577 | 2.894 | 0.539 | 2.800 |
α2 (constant) | − 62.292 | 14.675 | − 4.244 | − 91.055 | − 33.529 |
β12 (dist variable) | 1.078 | 0.302 | 3.570 | 0.486 | 1.671 |
β22 (sem variable) | 2.895 | 0.686 | 4.220 | 1.550 | 4.239 |
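The interval bounds in Table 14.19 come directly from coefficient ± 1.96 · s.e.; a sketch:

```python
# (coefficient, standard error) pairs from Table 14.19.
estimates = {
    "alpha1": (-33.135, 12.183), "beta11": (0.559, 0.243),
    "beta21": (1.670, 0.577),    "alpha2": (-62.292, 14.675),
    "beta12": (1.078, 0.302),    "beta22": (2.895, 0.686),
}

# 95% confidence interval: coefficient +/- 1.96 * standard error
ci = {name: (coef - 1.96 * se, coef + 1.96 * se)
      for name, (coef, se) in estimates.items()}

# No interval contains zero: lower and upper bounds share the same sign.
none_contains_zero = all(lo * hi > 0 for lo, hi in ci.values())
```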
As we already know, no confidence interval contains zero and, based on their values, we can write the lower (minimum) and upper (maximum) limits of the estimated occurrence probabilities for each of the categories of the dependent variable.
Confidence Interval (95%) of estimated probability of student i to not arrive late (category 0):
$$p_{i0_{min}} = \frac{1}{1 + e^{(-57.014 + 0.082 \cdot dist_i + 0.539 \cdot sem_i)} + e^{(-91.055 + 0.486 \cdot dist_i + 1.550 \cdot sem_i)}}$$

$$p_{i0_{max}} = \frac{1}{1 + e^{(-9.256 + 1.035 \cdot dist_i + 2.800 \cdot sem_i)} + e^{(-33.529 + 1.671 \cdot dist_i + 4.239 \cdot sem_i)}}$$
Confidence Interval (95%) of the estimated probability student i arrives late to first class (category 1):
$$p_{i1_{min}} = \frac{e^{(-57.014 + 0.082 \cdot dist_i + 0.539 \cdot sem_i)}}{1 + e^{(-57.014 + 0.082 \cdot dist_i + 0.539 \cdot sem_i)} + e^{(-91.055 + 0.486 \cdot dist_i + 1.550 \cdot sem_i)}}$$

$$p_{i1_{max}} = \frac{e^{(-9.256 + 1.035 \cdot dist_i + 2.800 \cdot sem_i)}}{1 + e^{(-9.256 + 1.035 \cdot dist_i + 2.800 \cdot sem_i)} + e^{(-33.529 + 1.671 \cdot dist_i + 4.239 \cdot sem_i)}}$$
Confidence Interval (95%) of the estimated probability student i arrives late to second class (category 2):
$$p_{i2_{min}} = \frac{e^{(-91.055 + 0.486 \cdot dist_i + 1.550 \cdot sem_i)}}{1 + e^{(-57.014 + 0.082 \cdot dist_i + 0.539 \cdot sem_i)} + e^{(-91.055 + 0.486 \cdot dist_i + 1.550 \cdot sem_i)}}$$

$$p_{i2_{max}} = \frac{e^{(-33.529 + 1.671 \cdot dist_i + 4.239 \cdot sem_i)}}{1 + e^{(-9.256 + 1.035 \cdot dist_i + 2.800 \cdot sem_i)} + e^{(-33.529 + 1.671 \cdot dist_i + 4.239 \cdot sem_i)}}$$
Analogous to what was done in Section 14.2.3, we can define the confidence interval expression for the chances (odds or relative risk ratios) of occurrence of each of the events represented by the subscript m (m = 1, 2, …, M − 1), in relation to the occurrence of the event represented by category 0 (reference), for each parameter β_{jm} (j = 1, 2, …, k; m = 1, 2, …, M − 1), at the 95% confidence level, in the following way:
$$e^{\beta_{jm} \pm 1.96 \cdot [s.e.(\beta_{jm})]}$$
For the data in our example, and based on the values calculated in Table 14.19, we will prepare Table 14.20, which represents the confidence intervals for the chances (odds or relative risk ratios) of occurrence for each of the events in relation to the reference event for each parameter βjm (j = 1, 2; m = 1, 2).
Table 14.20
Event | Parameter | Chance (Odds): $e^{\beta_{jm}}$ | 95% CI Lower Bound: $e^{\beta_{jm} - 1.96 \cdot s.e.(\beta_{jm})}$ | 95% CI Upper Bound: $e^{\beta_{jm} + 1.96 \cdot s.e.(\beta_{jm})}$ |
---|---|---|---|---|
Arrive late to first class | β11 (dist variable) | 1.749 | 1.085 | 2.817 |
β21 (sem variable) | 5.312 | 1.715 | 16.453 | |
Arrive late to second class | β12 (dist variable) | 2.939 | 1.625 | 5.318 |
β22 (sem variable) | 18.081 | 4.713 | 69.363 |
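The chance intervals in Table 14.20 are just the exponentiated bounds of the coefficient intervals; a sketch:

```python
import math

# (coefficient, standard error) for the beta parameters (Table 14.19).
betas = {
    "beta11": (0.559, 0.243), "beta21": (1.670, 0.577),
    "beta12": (1.078, 0.302), "beta22": (2.895, 0.686),
}

# Exponentiate the bounds of each coefficient's 95% interval to obtain
# the confidence interval for the chance (relative risk ratio).
odds_ci = {name: (math.exp(b - 1.96 * se), math.exp(b + 1.96 * se))
           for name, (b, se) in betas.items()}
```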
These values will also be obtained by means of Stata modeling, to be presented in the next section.