This chapter presents the binary and multinomial logistic regression models, establishing the circumstances under which each can be used. The objective is to estimate a model for the probability of occurrence of an event based on the maximum likelihood method. The results of the statistical tests pertinent to logistic models are evaluated. Confidence intervals of the model parameters for the purpose of prediction are also elaborated, as well as sensitivity analysis and the interpretation of the sensitivity curve, the ROC curve and the cutoff concept, overall model efficiency, sensitivity, and specificity. The binary and multinomial regression models are also estimated in Microsoft Office Excel®, Stata Statistical Software®, and IBM SPSS Statistics Software®, and their results are interpreted.
Binary logistic regression; Multinomial logistic regression; Probability of event occurrence; Odds; Estimation by maximum likelihood; Cutoff; Sensitivity analysis; Overall model efficiency; Sensitivity; Specificity; Excel; Stata; SPSS
In the fields of observation, chance favors only the mind that is prepared.
Louis Pasteur
The logistic regression models, even though quite useful and easy to apply, are still little used in many areas of human knowledge. Even though the development of software and the increase in computer processing capability have made their application more direct, many researchers still do not know their usefulness and, above all, the conditions for their correct use.
Unlike the traditional regression technique estimated by the ordinary least squares method, where the dependent variable is quantitative and certain presuppositions must be obeyed, as we studied in the previous chapter, the techniques of logistic regression are used when the phenomenon to be studied (outcome variable) is qualitative and, therefore, is represented by one or more dummy variables, depending on the number of possible answers (categories) of this dependent variable.
Imagine, for example, that a researcher is interested in evaluating the probability of heart attack among financial market executives, based on their physical characteristics (weight, waistline), their eating habits, and their health habits (physical exercise, smoking). A second researcher wants to evaluate the chance that consumers who acquire durable goods in a given period will go into default, based on the income, marital status, and educational level of each. Notice that heart attack and default are the dependent variables in the two cases, events that may or may not occur as a function of the explanatory variables inserted into the respective models; each is therefore represented by a qualitative dichotomous variable. Our intent is to estimate the probability of occurrence of these phenomena and, therefore, we will use the binary logistic regression.
Imagine now that a third researcher is interested in studying the probability of obtaining credit by small- and medium-sized companies, due to their financial and operational characteristics. It is known that each company can receive unrestricted credit, restricted credit, or no credit at all. In this case, the dependent variable that represents the phenomenon is also qualitative, but offers three possible answers (categories). Therefore, to estimate the probability of the alternative proposals occurring, we should use the multinomial logistic regression.
Thus, if the phenomenon under study presents two, and only two, categories, it will be represented by a single dummy variable: the first category will be the reference and indicate the event of noninterest (dummy = 0), and the other category will indicate the event of interest (dummy = 1), in which case we are dealing with the binary logistic regression technique. On the other hand, if the phenomenon under study presents more than two categories as occurrence possibilities, we must first define the reference category and then estimate the multinomial logistic regression model.
With a qualitative variable as the phenomenon to be studied, estimation by the ordinary least squares method, as studied in the previous chapter, is not viable, since the numeric codes of this dependent variable are arbitrary and there is no way to minimize the sum of squared error terms without imposing an incoherent, arbitrary weighting. Since this dependent variable is entered into modeling software by typing in the values that represent each of the answer possibilities, it is common to forget to define the category labels that correspond to each entered value; an unwary or beginning researcher may thus estimate the model by least squares regression, even obtaining outputs, since the software will interpret the dependent variable as quantitative. This serious mistake is, unfortunately, more common than one would think! The binary and multinomial logistic regression techniques are elaborated based on estimation by maximum likelihood, to be studied in Sections 14.2.1 and 14.3.1, respectively.
Analogous to what was discussed in the previous chapter, the logistic regression models are defined based on subjacent theory and the experience of the researcher, in such a way that it is possible to estimate the desired model, analyze the obtained results by means of statistical tests, and prepare predictions.
In this chapter, we will cover the binary and multinomial logistic regression models, with the following objectives: (1) introduce the concepts of logistic regression, (2) present estimation by maximum likelihood, (3) interpret the obtained results and prepare predictions, and (4) present the application of these techniques in Excel, Stata, and SPSS. First, the solution to an example will be worked out in Excel simultaneously with the presentation of the concepts and its manual solution. After introducing the concepts, the procedures for preparing the technique in Stata and SPSS will be presented, maintaining the standard adopted in the book.
The binary logistic regression model has, as its main objective, the study of the probability of occurrence of an event defined by Y, which presents itself in qualitative, dichotomous form (Y = 1 describes the occurrence of the event of interest and Y = 0 the occurrence of the non-event), based on the behavior of explanatory variables. In this way, we can define a vector of explanatory variables, with respective estimated parameters, in the following way:
$$Z_i = \alpha + \beta_1 \cdot X_{1i} + \beta_2 \cdot X_{2i} + \cdots + \beta_k \cdot X_{ki}$$
where Z is known as the logit, α represents the constant, βj (j = 1, 2, …, k) are the parameters estimated for each explanatory variable, Xj are the explanatory variables (metric or dummies), and the subscript i represents each sample observation (i = 1, 2, …, n, where n is the sample size). It is important to highlight that Z does not represent the dependent variable, denominated Y; our present objective is to define the expression of the probability pi of occurrence of the event of interest for each observation as a function of the logit Zi, that is, as a function of the parameters estimated for each explanatory variable. To do this, we must define the concept of an event's chance of occurrence, also known as odds, in the following way:
$$\text{odds}_{Y_i = 1} = \frac{p_i}{1 - p_i}$$
Imagine that we are interested in studying the event "passing the calculus course." If, for example, the probability that a given student passes this course is 80%, their chance of passing will be 4 to 1 (0.8/0.2 = 4). If the probability that another student passes the same course is 25%, given that they studied much less than the first student, their chance of passing will be 1 to 3 (0.25/0.75 = 1/3). Even though in daily language we use the terms chance or odds as synonyms for probability, the concepts are different!
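These conversions between probability and odds are straightforward to verify. The short Python snippet below is merely illustrative (this chapter's solutions use Excel, Stata, and SPSS) and reproduces the two students' odds:

```python
def odds(p):
    """Odds of an event whose occurrence probability is p: odds = p / (1 - p)."""
    return p / (1 - p)

# The two students from the text:
print(odds(0.80))  # 4 to 1
print(odds(0.25))  # 1 to 3
```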
The binary logistic regression defines the Z logit as the natural logarithm of odds, such that:
$$\ln(\text{odds}_{Y_i = 1}) = Z_i$$
from which comes:
$$\ln\left(\frac{p_i}{1 - p_i}\right) = Z_i$$
Since our intent is to define an expression for the probability of occurrence of the event under study as a function of the logit, we can algebraically isolate pi in Expression (14.4) in the following manner:
$$\frac{p_i}{1 - p_i} = e^{Z_i}$$
$$p_i = (1 - p_i) \cdot e^{Z_i}$$
$$p_i \cdot (1 + e^{Z_i}) = e^{Z_i}$$
And, therefore, we have that:
Probability of occurrence of the event:
$$p_i = \frac{e^{Z_i}}{1 + e^{Z_i}} = \frac{1}{1 + e^{-Z_i}}$$
Probability of occurrence of the non-event:
$$1 - p_i = 1 - \frac{e^{Z_i}}{1 + e^{Z_i}} = \frac{1}{1 + e^{Z_i}}$$
Obviously, the sum of Expressions (14.8) and (14.9) is equal to 1.
Based on Expression (14.8), we can elaborate a table with p values in function of the Z values. Being that Z varies from −∞ to +∞, we will, for teaching purposes only, use integer values between − 5 and + 5. Table 14.1 gives these values.
Table 14.1
$p_i = \frac{1}{1+e^{-Z_i}}$ | $Z_i$ |
---|---|
0.0067 | − 5 |
0.0180 | − 4 |
0.0474 | − 3 |
0.1192 | − 2 |
0.2689 | − 1 |
0.5000 | 0 |
0.7311 | 1 |
0.8808 | 2 |
0.9526 | 3 |
0.9820 | 4 |
0.9933 | 5 |
Based on Table 14.1, we can prepare a graph of p = f(Z), as presented in Fig. 14.1. By means of this graph, we see that the estimated probabilities, as a function of the different values assumed by Z, are situated between 0 and 1, which was guaranteed when we imposed that the logit be equal to the natural logarithm of the odds. As such, given the parameters estimated in the model and the value of each of the explanatory variables for a given observation i, we can calculate the value of Zi and, by means of the logistic curve presented in Fig. 14.1 (also known as the S curve, or sigmoid), estimate the probability of occurrence of the event under study for that observation.
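The values in Table 14.1 can be verified directly from Expression (14.8). The short Python sketch below is purely illustrative:

```python
import math

def logistic(z):
    """Probability p = 1 / (1 + e^(-z)) -- the S-shaped (sigmoid) curve."""
    return 1.0 / (1.0 + math.exp(-z))

# Reproduce Table 14.1 for integer Z between -5 and +5:
for z in range(-5, 6):
    print(f"Z = {z:+d}  ->  p = {logistic(z):.4f}")
```

Note that, by construction, the probability of the event and of the non-event sum to 1 for any Z.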
Based on Expressions (14.1) and (14.8), we can define the general expression for the estimated probability of occurrence of an event presented in dichotomous form, for an observation i, in the following way:
$$p_i = \frac{1}{1 + e^{-(\alpha + \beta_1 \cdot X_{1i} + \beta_2 \cdot X_{2i} + \cdots + \beta_k \cdot X_{ki})}}$$
What the binary logistic regression estimates, therefore, is not the predicted values of the dependent variable, but rather the probability of occurrence of the event under study for each observation. We now move on to the estimation of the logit parameters by means of an example prepared initially in Excel.
We will now present the concepts pertinent to estimation by maximum likelihood using an example similar to that developed throughout the previous chapter. However, now the dependent variable will be qualitative and dichotomic.
Imagine that our curious professor, who has already considerably explored the effects of determined explanatory variables on the travel time for a group of students to get to school, by means of the multiple regression technique, is now interested in investigating if these same explanatory variables influence the probability of a student arriving late to class. In other words, the phenomenon in question to be studied only presents two categories (arrive late to class or not) and the event of interest refers to arriving late.
To this end, the professor surveyed 100 students at the school where he teaches, asking each of them whether they had arrived late that day. The professor also asked about the distance traveled (in kilometers), the number of traffic lights each passed through, the time of day when the trip was made (morning or afternoon), and the driving style each considers themselves to have (calm, moderate, or aggressive). Part of the prepared dataset is found in Table 14.2.
Table 14.2
Student | Arrived Late to School (Yi) | Distance Traveled to School (km) (X1i) | Number of Traffic Lights—sem (X2i) | Time of Day (X3i) | Driving Style (X4i) |
---|---|---|---|---|---|
Gabriela | No | 12.5 | 7 | morning | calm |
Patricia | No | 13.3 | 10 | morning | calm |
Gustavo | No | 13.4 | 8 | morning | moderate |
Leticia | No | 23.5 | 7 | morning | calm |
Luiz Ovidio | No | 9.5 | 8 | morning | calm |
Leonor | No | 13.5 | 10 | morning | calm |
Dalila | No | 13.5 | 10 | morning | calm |
Antonio | No | 15.4 | 10 | morning | calm |
Julia | No | 14.7 | 10 | morning | calm |
Mariana | No | 14.7 | 10 | morning | calm |
… | |||||
Filomena | Yes | 12.8 | 11 | afternoon | aggressive |
… | |||||
Estela | Yes | 1.0 | 13 | morning | calm |
For the dependent variable, since the event of interest is arriving late, this category will take values equal to 1, and the category not arriving late values equal to 0.
Following what was defined in the previous chapter in relation to qualitative explanatory variables, the reference category of the variable corresponding to the time of day will be afternoon; that is, the cells in the dataset with this value will take values equal to 0, and the cells with the category morning values equal to 1. The driving style variable, in turn, must be transformed into two dummies (variables style2 for the moderate category and style3 for the aggressive category), since we have defined the calm category as the reference.
As such, Table 14.3 presents part of the final dataset to be used for the estimation of our binary logistic regression model.
Table 14.3
Student | Arrived Late to School (Dummy Yes = 1; No = 0) (Yi) | Distance Traveled to School (km) (X1i) | Number of Traffic Lights—sem (X2i) | Time of Day Dummy per (X3i) | Driving Style Dummy style2 (X4i) | Driving Style Dummy style3 (X5i) |
---|---|---|---|---|---|---|
Gabriela | 0 | 12.5 | 7 | 1 | 0 | 0 |
Patricia | 0 | 13.3 | 10 | 1 | 0 | 0 |
Gustavo | 0 | 13.4 | 8 | 1 | 1 | 0 |
Leticia | 0 | 23.5 | 7 | 1 | 0 | 0 |
Luiz Ovidio | 0 | 9.5 | 8 | 1 | 0 | 0 |
Leonor | 0 | 13.5 | 10 | 1 | 0 | 0 |
Dalila | 0 | 13.5 | 10 | 1 | 0 | 0 |
Antonio | 0 | 15.4 | 10 | 1 | 0 | 0 |
Julia | 0 | 14.7 | 10 | 1 | 0 | 0 |
Mariana | 0 | 14.7 | 10 | 1 | 0 | 0 |
… | ||||||
Filomena | 1 | 12.8 | 11 | 0 | 0 | 1 |
… | ||||||
Estela | 1 | 1.0 | 13 | 1 | 0 | 0 |
The complete dataset can be accessed by means of the Late.xls file.
In this way, the logit whose parameters we wish to estimate is defined in the following way:
$$Z_i = \alpha + \beta_1 \cdot dist_i + \beta_2 \cdot sem_i + \beta_3 \cdot per_i + \beta_4 \cdot style2_i + \beta_5 \cdot style3_i$$
and the estimated probability that a determined student arrives late can be written in the following way:
$$p_i = \frac{1}{1 + e^{-(\alpha + \beta_1 \cdot dist_i + \beta_2 \cdot sem_i + \beta_3 \cdot per_i + \beta_4 \cdot style2_i + \beta_5 \cdot style3_i)}}$$
Since it does not make sense to define an error term for each observation, given that the dependent variable is dichotomous, there is no way to estimate the equation parameters by minimizing the sum of squared residuals, as we did when estimating traditional regression models. In this case, therefore, we will use the likelihood function, from which the maximum likelihood estimation will be elaborated. Maximum likelihood is the most popular parameter estimation technique for logistic regression models.
Because of this, it is also important to mention, in relation to the presuppositions studied for regression models estimated by ordinary least squares, that the researcher need only be concerned with the presupposition of the absence of multicollinearity among the explanatory variables when estimating logistic regression models.
In binary logistic regression, the dependent variable follows a Bernoulli distribution; in other words, whether or not the event of interest occurred for a given observation i can be considered a Bernoulli trial, in which the probability of occurrence of the event is pi and the probability of occurrence of the non-event is (1 − pi). In general, we can write the probability of occurrence of Yi, where Yi equals 1 or 0, as:
$$p(Y_i) = p_i^{Y_i} \cdot (1 - p_i)^{1 - Y_i}$$
For a sample with n observations, we can define the likelihood function as being:
$$L = \prod_{i=1}^{n}\left[p_i^{Y_i} \cdot (1 - p_i)^{1 - Y_i}\right]$$
from which comes, based on Expressions (14.8) and (14.9), that:
$$L = \prod_{i=1}^{n}\left[\left(\frac{e^{Z_i}}{1 + e^{Z_i}}\right)^{Y_i} \cdot \left(\frac{1}{1 + e^{Z_i}}\right)^{1 - Y_i}\right]$$
Since, in practice, it is more convenient to work with logarithms, we arrive at the following function, also known as the log-likelihood function:
$$LL = \sum_{i=1}^{n}\left\{\left[Y_i \cdot \ln\left(\frac{e^{Z_i}}{1 + e^{Z_i}}\right)\right] + \left[(1 - Y_i) \cdot \ln\left(\frac{1}{1 + e^{Z_i}}\right)\right]\right\}$$
And now a question must be asked: what values of the logit parameters maximize the LL value of Expression (14.14)? This important question is the key to the maximum likelihood estimation of binary logistic regression models, and it can be answered using optimization tools, so as to estimate the α, β1, β2, …, βk parameters based on the following objective function:
$$LL = \sum_{i=1}^{n}\left\{\left[Y_i \cdot \ln\left(\frac{e^{Z_i}}{1 + e^{Z_i}}\right)\right] + \left[(1 - Y_i) \cdot \ln\left(\frac{1}{1 + e^{Z_i}}\right)\right]\right\} = \max$$
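The same maximization that Solver performs can be sketched with a general-purpose optimizer. The Python snippet below is merely illustrative: since the Late.xls observations are not reproduced here, it generates a small synthetic dataset with known parameters, maximizes the log-likelihood of Expression (14.14) starting from all parameters at zero, and confirms that the maximized LL exceeds the LL at the zero starting point:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)

# Synthetic illustration (NOT the Late.xls data): an intercept and
# two explanatory variables with known true parameters.
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
true_params = np.array([-0.5, 1.2, -0.8])
p_true = 1.0 / (1.0 + np.exp(-X @ true_params))
y = rng.binomial(1, p_true)          # Bernoulli trials with probability p_i

def neg_log_likelihood(params, X, y):
    """Negative of Expression (14.14): -sum[y*ln(p) + (1-y)*ln(1-p)]."""
    z = X @ params
    log_p = -np.logaddexp(0.0, -z)   # ln(p)   = -ln(1 + e^{-z})
    log_1mp = -np.logaddexp(0.0, z)  # ln(1-p) = -ln(1 + e^{z})
    return -np.sum(y * log_p + (1 - y) * log_1mp)

# Maximizing LL is the same as minimizing -LL; start, as in the text, from zeros.
res = minimize(neg_log_likelihood, x0=np.zeros(3), args=(X, y), method="BFGS")
ll_max = -res.fun
ll_at_zero = -neg_log_likelihood(np.zeros(3), X, y)
print("estimates:", res.x.round(3))
print("LL at zeros:", ll_at_zero, " LL maximized:", ll_max)
```

At the zero starting point every estimated probability is 0.5, so LL equals n·ln(0.5), exactly as in the Solver walkthrough that follows.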
We will solve this problem with the Excel Solver tool using the data from our example. For such, we should open the LateMaximumLikelihood.xls file, which will help in the calculation of the parameters.
In this file, besides the dependent variable and the explanatory variables, three new variables were created, which correspond to the Zi logit, the probability of occurrence for the event of interest pi, and to the LLi logarithmic likelihood function for each observation, respectively. Table 14.4 shows part of the data when the α, β1, β2, β3, β4, and β5 parameters are equal to 0.
Table 14.4
Student | Yi | X1i | X2i | X3i | X4i | X5i | Zi | pi | LLi = (Yi) ⋅ ln(pi) + (1 − Yi) ⋅ ln(1 − pi) |
---|---|---|---|---|---|---|---|---|---|
Gabriela | 0 | 12.5 | 7 | 1 | 0 | 0 | 0 | 0.5 | − 0.69315 |
Patricia | 0 | 13.3 | 10 | 1 | 0 | 0 | 0 | 0.5 | − 0.69315 |
Gustavo | 0 | 13.4 | 8 | 1 | 1 | 0 | 0 | 0.5 | − 0.69315 |
Leticia | 0 | 23.5 | 7 | 1 | 0 | 0 | 0 | 0.5 | − 0.69315 |
Luiz Ovidio | 0 | 9.5 | 8 | 1 | 0 | 0 | 0 | 0.5 | − 0.69315 |
Leonor | 0 | 13.5 | 10 | 1 | 0 | 0 | 0 | 0.5 | − 0.69315 |
Dalila | 0 | 13.5 | 10 | 1 | 0 | 0 | 0 | 0.5 | − 0.69315 |
Antonio | 0 | 15.4 | 10 | 1 | 0 | 0 | 0 | 0.5 | − 0.69315 |
Julia | 0 | 14.7 | 10 | 1 | 0 | 0 | 0 | 0.5 | − 0.69315 |
Mariana | 0 | 14.7 | 10 | 1 | 0 | 0 | 0 | 0.5 | − 0.69315 |
… | |||||||||
Filomena | 1 | 12.8 | 11 | 0 | 0 | 1 | 0 | 0.5 | − 0.69315 |
… | |||||||||
Estela | 1 | 1.0 | 13 | 1 | 0 | 0 | 0 | 0.5 | − 0.69315 |
Sum | $LL = \sum_{i=1}^{100}\left\{[Y_i \cdot \ln(p_i)] + [(1 - Y_i) \cdot \ln(1 - p_i)]\right\}$ | − 69.31472 |
Fig. 14.2 presents part of the data in the LateMaximumLikelihood.xls file, being that some cells were hidden due to the number of observations being equal to 100.
As we can see, when α = β1 = β2 = β3 = β4 = β5 = 0, the sum of the logarithmic likelihood function equals − 69.31472. However, there must be an optimal combination of parameter values such that the objective function presented in Expression (14.15) is satisfied, that is, such that the sum of the logarithmic likelihood function is the maximum possible.
According to the logic proposed by Belfiore and Fávero (2012), we will now open the Excel Solver tool. The objective function is in cell J103, which is our destination cell and which should be maximized. Besides this, parameters α, β1, β2, β3, β4, and β5, whose values are in cells M3, M5, M7, M9, M11, and M13, respectively, are the variable cells. The Solver window will be as shown in Fig. 14.3.
By clicking on Solve and then OK, we will obtain the solution to the maximization problem. Table 14.5 shows part of the obtained data.
Table 14.5
Student | Yi | X1i | X2i | X3i | X4i | X5i | Zi | pi | LLi = (Yi) ⋅ ln(pi) + (1 − Yi) ⋅ ln(1 − pi) |
---|---|---|---|---|---|---|---|---|---|
Gabriela | 0 | 12.5 | 7 | 1 | 0 | 0 | − 11.73478 | 0.00001 | − 0.00001 |
Patricia | 0 | 13.3 | 10 | 1 | 0 | 0 | − 3.25815 | 0.03704 | − 0.03774 |
Gustavo | 0 | 13.4 | 8 | 1 | 1 | 0 | − 7.42373 | 0.00060 | − 0.00060 |
Leticia | 0 | 23.5 | 7 | 1 | 0 | 0 | − 9.31255 | 0.00009 | − 0.00009 |
Luiz Ovidio | 0 | 9.5 | 8 | 1 | 0 | 0 | − 9.62856 | 0.00007 | − 0.00007 |
Leonor | 0 | 13.5 | 10 | 1 | 0 | 0 | − 3.21411 | 0.03864 | − 0.03940 |
Dalila | 0 | 13.5 | 10 | 1 | 0 | 0 | − 3.21411 | 0.03864 | − 0.03940 |
Antonio | 0 | 15.4 | 10 | 1 | 0 | 0 | − 2.79572 | 0.05756 | − 0.05928 |
Julia | 0 | 14.7 | 10 | 1 | 0 | 0 | − 2.94987 | 0.04974 | − 0.05102 |
Mariana | 0 | 14.7 | 10 | 1 | 0 | 0 | − 2.94987 | 0.04974 | − 0.05102 |
… | |||||||||
Filomena | 1 | 12.8 | 11 | 0 | 0 | 1 | 5.96647 | 0.99744 | − 0.00256 |
… | |||||||||
Estela | 1 | 1.0 | 13 | 1 | 0 | 0 | 2.33383 | 0.91164 | − 0.09251 |
Sum | $LL = \sum_{i=1}^{100}\left\{[Y_i \cdot \ln(p_i)] + [(1 - Y_i) \cdot \ln(1 - p_i)]\right\}$ | − 29.06568 |
Then, the maximum possible value of the sum of the logarithmic likelihood function is LLmax = − 29.06568. Based on the parameter estimates generated by this solution, the Zi logit can be written as follows:
$$Z_i = -30.202 + 0.220 \cdot dist_i + 2.767 \cdot sem_i - 3.653 \cdot per_i + 1.346 \cdot style2_i + 2.914 \cdot style3_i$$
Fig. 14.4 presents part of the results obtained by modeling the LateMaximumLikelihood.xls file.
And, therefore, the estimated probability expression for a student to arrive late can be written in the following way:
$$p_i = \frac{1}{1 + e^{-(-30.202 + 0.220 \cdot dist_i + 2.767 \cdot sem_i - 3.653 \cdot per_i + 1.346 \cdot style2_i + 2.914 \cdot style3_i)}}$$
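This estimated probability expression can be evaluated directly for any combination of the explanatory variables. The Python sketch below (illustrative only) does so for a hypothetical student, using the complete-model estimates; remember that the statistical significance of each parameter still needs to be verified before predictions are made:

```python
import math

def p_late(dist, sem, per, style2, style3):
    """Estimated probability of arriving late (complete-model logit)."""
    z = (-30.202 + 0.220 * dist + 2.767 * sem
         - 3.653 * per + 1.346 * style2 + 2.914 * style3)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical student: 17 km, 10 traffic lights, morning trip (per = 1),
# aggressive driving style (style3 = 1, style2 = 0):
print(round(p_late(17, 10, 1, 0, 1), 4))
```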
Thus, some interesting questions can now be posed:
What is the average estimated probability of arriving late to school when traveling 17 kilometers and going through 10 traffic lights, making the trip in the morning with a driving style considered aggressive?
On average, how much does the chance (odds) of arriving late to school change if a route 1 kilometer longer is adopted, maintaining the remaining conditions constant?
Does a student considered aggressive present, on average, a higher chance of arriving late than one considered calm? If so, by how much does this chance increase, maintaining the remaining conditions constant?
Before answering these important questions, we need to verify whether all the estimated parameters are statistically significant at a given confidence level. If this is not the case, we will need to re-estimate the model so that the final model presents only statistically significant parameters, making the elaboration of inferences and predictions possible from then on.
Therefore, having estimated by maximum likelihood the parameters of the event occurrence probability equation, we now begin the study of the general statistical significance of the obtained model, as well as the statistical significance of each parameter, analogous to what was done for the traditional regression models in the previous chapter. It is important to mention that, in the Appendix of this chapter, we briefly present the probit regression models, which can be used as an alternative to the binary logistic regression models in cases where the event occurrence probability curve adjusts more adequately to the cumulative distribution function of the standard normal distribution.
If, for example, we prepare a linear graph of our dependent variable (late) as a function of the variable referring to the number of traffic lights (sem), we notice that the model estimates are unable to adjust satisfactorily to the behavior of the dependent variable, since it is a dummy. The graph in Fig. 14.5A presents this behavior. On the other hand, if the binary logistic regression model is prepared and the estimated probabilities of arriving late for each observation in our sample are plotted, specifically as a function of the number of traffic lights each student passes through, we notice that the adjustment is much more adequate to the behavior of the dependent variable (S curve), with estimated values limited to between 0 and 1 (Fig. 14.5B).
Therefore, since the dependent variable is qualitative, it makes no sense to discuss the percentage of its variance explained by the predictor variables. In other words, in logistic regression models there is no coefficient of determination R2 as in traditional regressions estimated by the ordinary least squares method. However, many researchers present in their work a coefficient known as the McFadden pseudo R2, whose expression is:
$$\text{pseudo } R^2 = \frac{-2 \cdot LL_0 - (-2 \cdot LL_{\max})}{-2 \cdot LL_0}$$
Its usefulness is quite limited and is restricted to cases where the researcher is interested in comparing two or more distinct models, given that one of the many existing criteria for model choice is a higher McFadden pseudo R2.
In our example, as we have already discussed in the previous section and calculated by means of Excel Solver, LLmax, which is the maximum possible value of the sum of the logarithmic likelihood function, is equal to − 29.06568.
Now LL0 represents the maximum possible value of the sum of the logarithmic likelihood function for a model known as the null model, in other words, for a model that only presents constant α and no explanatory variable. By means of the same procedure performed in the previous section, however now using the LateMaximumLikelihoodNullModel.xls file, we will obtain that LL0 = − 67.68585. Figs. 14.6 and 14.7 show the Solver window and part of the results obtained by modeling in this file, respectively.
Then, based on Expression (14.16), we obtain:
$$\text{pseudo } R^2 = \frac{-2 \cdot (-67.68585) - [-2 \cdot (-29.06568)]}{-2 \cdot (-67.68585)} = 0.5706$$
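The same calculation can be reproduced in a few lines of Python (values taken from the Solver results described above):

```python
# McFadden pseudo R-squared from the two log-likelihoods (Expression 14.16).
ll_null = -67.68585   # LL0: null model (constant only)
ll_max = -29.06568    # LLmax: complete model

pseudo_r2 = ((-2 * ll_null) - (-2 * ll_max)) / (-2 * ll_null)
print(round(pseudo_r2, 4))  # 0.5706
```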
As discussed, a higher McFadden pseudo R2 can be used as a criterion to choose one model over another. However, as we will study in Section 14.2.4, there is another, more adequate criterion for choosing the best model, which refers to a greater area under the receiver operating characteristic (ROC) curve.
Many researchers also use the McFadden pseudo R2 as a performance indicator for a chosen model, independent of comparison with other models. However, its interpretation demands much care, and at times there is the temptation to erroneously interpret it as a percentage of explained variance of the dependent variable. As we will study in Section 14.2.4, the best performance indicator for a binary logistic regression model is the overall model efficiency, which is defined based on the determination of a cutoff, concepts that will be studied in that same section.
Even though the usefulness of the McFadden pseudo R2 is limited, software such as Stata and SPSS calculate and present it in their respective outputs, as we will see in Sections 14.4 and 14.5, respectively.
Analogous to the procedure presented in the previous chapter, we will first study the general statistical significance of the proposed model. The χ2 test provides the means to verify the model's significance, since its null and alternative hypotheses, for a general logistic regression model, are:
While the F-test is used for regression models where the dependent variable presents itself quantitatively, which generates decomposition of the variance (ANOVA table), studied in the previous chapter, the χ2 test is more adequate for models estimated by the maximum likelihood method, such as the logistic regression models.
The χ2 test provides the researcher an initial verification of the viability of the proposed model, since, if all the estimated βj (j = 1, 2, …, k) parameters are statistically equal to 0, changes in the X variables will not influence the probability of occurrence of the event under study in any way. The χ2 statistic has the following expression:
$$\chi^2 = -2 \cdot (LL_0 - LL_{\max})$$
Returning to our example, we have that:
$$\chi^2_{5\,\text{d.f.}} = -2 \cdot [-67.68585 - (-29.06568)] = 77.2403$$
For 5 degrees of freedom (the number of explanatory variables considered in the model, that is, the number of β parameters), we have, by means of Table D in the Appendix, that χc2 = 11.070 (critical χ2 for 5 degrees of freedom at the 5% significance level). In this way, since the calculated statistic χcal2 = 77.2403 > χc2 = 11.070, we can reject the null hypothesis that all the βj (j = 1, 2, …, 5) parameters are statistically equal to zero. Thus, at least one X variable is statistically significant in explaining the probability of occurrence of the event under study, and we have a statistically significant binary logistic regression model for the purpose of prediction.
Software such as Stata and SPSS does not report χc2 for the defined degrees of freedom and a given significance level. Instead, they report the significance level (P-value) of χcal2 for these degrees of freedom. As such, instead of analyzing whether χcal2 > χc2, we should check whether the significance level of χcal2 is lower than 0.05 (5%) in order to continue the regression analysis. As such:
The χcal2 significance level can be obtained in Excel by means of the command Formulas → Insert Function → DIST.QUI (CHISQ.DIST.RT in English-language versions of Excel), which will open the dialog box seen in Fig. 14.8.
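The same statistic, critical value, and significance level can also be obtained outside Excel; a Python sketch using scipy (illustrative only):

```python
from scipy.stats import chi2

ll_null = -67.68585   # LL0: null model (constant only)
ll_max = -29.06568    # LLmax: complete model

chi2_stat = -2 * (ll_null - ll_max)   # calculated chi-square statistic
df = 5                                # number of beta parameters
crit = chi2.ppf(0.95, df)             # critical value at the 5% level
p_value = chi2.sf(chi2_stat, df)      # significance level (right tail)

print(round(chi2_stat, 4), round(crit, 3), p_value)
```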
Analogous to the F-test, the χ2 test evaluates the joint significance of the explanatory variables, without defining which of the variables considered in the model are statistically significant in influencing the probability of occurrence of the event.
In this way, it is necessary that the researcher evaluate whether each of the binary logistic regression model parameters is statistically significant; to this end, the Wald z statistic provides the statistical significance of each parameter considered in the model. The z nomenclature refers to the fact that the distribution of this statistic is standard normal. The Wald z test hypotheses for α and for each βj (j = 1, 2, …, k) are:
The expressions for the calculation of the Wald z statistic for each α and βj parameter are given by:
$$z_{\alpha} = \frac{\alpha}{s.e.(\alpha)} \qquad z_{\beta_j} = \frac{\beta_j}{s.e.(\beta_j)}$$
where s.e. refers to the standard error of each parameter under analysis. Given the complexity of the standard error calculation for each parameter, we will not perform it here; however, we recommend reading Engle (1984). The s.e. values for each parameter in our example are:
Then, as we have already calculated the parameter estimates, we have that:
$$z_{\alpha} = \frac{\alpha}{s.e.(\alpha)} = \frac{-30.202}{9.981} = -3.026$$
$$z_{\beta_1} = \frac{\beta_1}{s.e.(\beta_1)} = \frac{0.220}{0.110} = 2.000$$
$$z_{\beta_2} = \frac{\beta_2}{s.e.(\beta_2)} = \frac{2.767}{0.922} = 3.001$$
$$z_{\beta_3} = \frac{\beta_3}{s.e.(\beta_3)} = \frac{-3.653}{0.878} = -4.161$$
$$z_{\beta_4} = \frac{\beta_4}{s.e.(\beta_4)} = \frac{1.346}{0.748} = 1.799$$
$$z_{\beta_5} = \frac{\beta_5}{s.e.(\beta_5)} = \frac{2.914}{1.179} = 2.472$$
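The Wald z statistics and their two-tailed P-values can be reproduced from the estimates and standard errors above; a Python sketch using scipy (illustrative only):

```python
from scipy.stats import norm

# Estimates and standard errors of the complete model (values from the text):
params = {
    "alpha":  (-30.202, 9.981),
    "dist":   (0.220, 0.110),
    "sem":    (2.767, 0.922),
    "per":    (-3.653, 0.878),
    "style2": (1.346, 0.748),
    "style3": (2.914, 1.179),
}

p_values = {}
for name, (b, se) in params.items():
    z = b / se                             # Wald z statistic
    p_values[name] = 2 * norm.sf(abs(z))   # two-tailed P-value
    print(f"{name:7s} z = {z:+.3f}  P-value = {p_values[name]:.4f}")
```

Only style2 yields a P-value above 0.05, matching the conclusion drawn from the ± 1.96 critical values.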
After obtaining the Wald z statistics, the researcher can use the normal curve distribution table to obtain the critical values for a given level of significance and check if such tests reject or do not reject the null hypothesis.
For the 5% level of significance, we have, by means of Table E in the Appendix, that zc = − 1.96 for the lower tail (probability for the lower tail of 0.025 for the two-tailed distribution) and zc = 1.96 for the upper tail (probability for the upper tail also of 0.025 for the two-tailed distribution).
The zc values for the 5% significance level can be obtained in Excel by means of the command Formulas → Insert Function → NORM.S.INV, being that the researcher should type in a probability of 2.5% to obtain zc for each lower tail, and 97.5% to obtain the zc for each upper tail, as shown in Figs. 14.9 and 14.10, respectively.
Only the Wald z statistic for the β4 parameter presented a value between − 1.96 and 1.96, which indicates, at the 5% significance level, that for this parameter the null hypothesis is not rejected; that is, it cannot be considered statistically different from zero.
As with the χ2 test, statistical packages also offer the significance levels (P-values) of the Wald z tests, which facilitates the decision, given that, at the 95% confidence level (5% significance level), we have:
As such, since − 1.96 < zβ4 = 1.799 < 1.96, the P-value of the Wald z statistic for the style2 variable will be greater than 0.05.
The nonrejection of the null hypothesis for the β4 parameter, at the 5% significance level, indicates that the corresponding style2 variable is not statistically significant for increasing or decreasing the probability of arriving late to school in the presence of the other explanatory variables and, therefore, it should be excluded from the final model.
At this time, we will perform a manual exclusion of this variable so as to obtain the final model. However, it is important to remember that the manual exclusion of a variable can cause another, initially significant, variable to come to present a nonsignificant parameter, a problem that tends to worsen as the number of explanatory variables in the dataset grows. The opposite can also occur; that is, it is not recommended to simultaneously exclude two or more variables whose parameters do not, at first sight, show themselves to be statistically different from zero, since, after the exclusion of one of them, a determined β parameter can become statistically different from zero. Fortunately, these phenomena do not occur in this example and, as such, we opt to manually exclude the style2 variable. This will be verified when we estimate the binary logistic regression model by means of the Stepwise procedure in Stata (Section 14.4) and SPSS (Section 14.5).
Therefore, we will open the LateMaximumLikelihoodFinalModel.xls file. Notice that now the calculation of the (Zi) logit no longer takes into account the variable style2 parameter, which was excluded from the model. Figs. 14.11 and 14.12 show the Solver window and part of the results obtained in the modeling by means of this last file, respectively.
Then, for the final model, we have that LLmax = − 30.80079. Before moving on to the final expression of the occurrence probability of the event under study, we need to verify whether the new estimated model (final model) presents a loss in the quality of adjustment in relation to the complete model estimated with all the explanatory variables. To do this, the likelihood-ratio test, which compares the adjustment of the complete model with the adjustment of the final model, can be used, presenting the following expression:
$$\chi^2_{1\text{ d.f.}} = -2 \cdot (LL_{\text{final model}} - LL_{\text{complete model}})$$

For our example data, we have that:

$$\chi^2_{1\text{ d.f.}} = -2 \cdot [-30.80079 - (-29.06568)] = 3.4702$$
Then, for 1 degree of freedom, we have, by means of Table D in the Appendix, that $\chi^2_c = 3.841$ (critical $\chi^2$ for 1 degree of freedom and for the 5% significance level). This way, since the calculated statistic $\chi^2_{cal} = 3.4702 < \chi^2_c = 3.841$, we do not reject the null hypothesis of the likelihood-ratio test, or rather, the estimation of the final model with the exclusion of the style2 variable did not alter the quality of the adjustment, at the 5% significance level, which makes this model preferable to the complete model estimated with all of the explanatory variables.
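This likelihood-ratio comparison can be replicated in a few lines. The sketch below, in Python, assumes only the log-likelihood values reported in the text; the variable names are ours.

```python
# Log-likelihoods reported in the text (Excel Solver estimates)
ll_complete = -29.06568   # complete model, all explanatory variables
ll_final = -30.80079      # final model, style2 excluded

# Likelihood-ratio statistic with 1 degree of freedom
lr_stat = -2 * (ll_final - ll_complete)

# Critical chi-squared value for 1 d.f. at the 5% significance level (Table D)
chi2_crit_1df = 3.841

# lr_stat below the critical value: do not reject H0 (no loss of adjustment)
no_loss_of_fit = lr_stat < chi2_crit_1df
```

Since `no_loss_of_fit` is true, the more parsimonious final model is preferred, exactly as concluded above.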
In Sections 14.4 and 14.5 we will present, by means of Stata and SPSS, respectively, another quite usual test to verify the quality of adjustment for the final model, known as the Hosmer-Lemeshow test. By dividing the dataset into 10 groups per the deciles of the probabilities estimated by the final model for each observation, this test evaluates, by means of a χ2 test, whether there are significant differences between the observed and expected frequencies in each of the 10 groups. In case such differences are not statistically significant, at a determined significance level, the estimated model will not present problems in relation to the quality of the proposed adjustment.
Being as such, we return to the analysis of the final estimated model results. The solution to this new problem generated the following final parameter estimates:
with the respective standard errors:
and the following Wald z statistics:
$$z_\alpha = \frac{\alpha}{s.e.(\alpha)} = \frac{-30.935}{10.636} = -2.909$$

$$z_{\beta_1} = \frac{\beta_1}{s.e.(\beta_1)} = \frac{0.204}{0.101} = 2.020$$

$$z_{\beta_2} = \frac{\beta_2}{s.e.(\beta_2)} = \frac{2.920}{1.011} = 2.888$$

$$z_{\beta_3} = \frac{\beta_3}{s.e.(\beta_3)} = \frac{-3.776}{0.847} = -4.458$$

$$z_{\beta_5} = \frac{\beta_5}{s.e.(\beta_5)} = \frac{2.459}{1.139} = 2.159$$
with all values of zcal < − 1.96 or > 1.96 and, therefore, with P-values for the Wald z statistics < 0.05.
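The Wald z statistics and their two-sided P-values can be checked directly from the reported coefficients and standard errors. A minimal sketch, assuming the values in the text (the dictionary keys are our labels); the normal-distribution P-value uses the standard library's error function:

```python
import math

# Coefficients and standard errors of the final model (from the text)
params = {
    "alpha": (-30.935, 10.636),
    "beta1": (0.204, 0.101),    # dist
    "beta2": (2.920, 1.011),    # sem
    "beta3": (-3.776, 0.847),   # per
    "beta5": (2.459, 1.139),    # style3
}

def two_sided_p(z):
    # P-value from the standard normal distribution
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Wald z = coefficient / standard error
z_stats = {name: c / s for name, (c, s) in params.items()}
p_values = {name: two_sided_p(z) for name, z in z_stats.items()}
```

All five P-values come out below 0.05, confirming that every remaining parameter is statistically different from zero at the 5% significance level.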
The final model also presents the following statistics:
$$\text{pseudo }R^2 = \frac{-2\cdot(-67.68585) - [-2\cdot(-30.80079)]}{-2\cdot(-67.68585)} = 0.5449$$

$$\chi^2_{4\text{ d.f.}} = -2 \cdot [-67.68585 - (-30.80079)] = 73.77012 > \chi^2_{c\,4\text{ d.f.}} = 9.48773$$
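Both statistics above follow directly from the null-model and final-model log-likelihoods. A sketch, assuming the two values reported in the text:

```python
ll_null = -67.68585   # null model (intercept only), reported in the text
ll_max = -30.80079    # final estimated model

# McFadden pseudo R-squared
pseudo_r2 = ((-2 * ll_null) - (-2 * ll_max)) / (-2 * ll_null)

# Overall chi-squared statistic of the model (4 degrees of freedom)
chi2_model = -2 * (ll_null - ll_max)
chi2_crit_4df = 9.48773  # critical value, 5% significance level
```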
As such, we can write the Zi logit as follows:
$$Z_i = -30.935 + 0.204 \cdot dist_i + 2.920 \cdot sem_i - 3.776 \cdot per_i + 2.459 \cdot style3_i$$
with the final estimated probability expression that student i will arrive late to school:
$$p_i = \frac{1}{1 + e^{-(-30.935 + 0.204 \cdot dist_i + 2.920 \cdot sem_i - 3.776 \cdot per_i + 2.459 \cdot style3_i)}}$$
These parameters and respective statistics can also be obtained by means of the Stepwise procedure when estimating the binary logistic regression model in Stata and SPSS.
Based on the estimation of the probability function, a curious researcher could, for example, desire to prepare a graph of the estimated probabilities for each student to arrive late to school (column H in the final model file in Excel) as a function of the number of traffic lights through which each must go on the route (column D in Excel). Fig. 14.13 presents this graph and, contrary to the graph in Fig. 14.5B, which presents a logistic adjustment over only values equal to 0 or 1 of the dependent variable, this new graph presents a logistic probability adjustment.
Based on Fig. 14.13, which also presents the logistic curve adjusted to the cloud of points that represents the estimated probabilities for each observation, we can see that, while the probability of arriving late to school is very low when going through up to 8 traffic lights along the route, the probability becomes quite high when the student is obliged to go through 11 or more traffic lights during the trip.
Deepening the analysis of the probability function, we can return to our three important questions, answering each one at a time:
What is the average estimated probability to arrive late to school when traveling 17 kilometers and going through 10 traffic lights, making the trip in the morning and having what is considered an aggressive driving style?
Using the last probability expression and substituting the provided values in this equation, we will have:
$$p = \frac{1}{1 + e^{-[-30.935 + 0.204\cdot(17) + 2.920\cdot(10) - 3.776\cdot(1) + 2.459\cdot(1)]}} = 0.603$$
Then, the average estimated probability of arriving late to school is, within the provided conditions, equal to 60.3%.
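This substitution can be written as a small reusable function. A sketch, assuming the final-model coefficients from the text (the function name is ours):

```python
import math

def prob_late(dist, sem, per, style3):
    """Estimated probability of arriving late, from the final-model logit."""
    z = -30.935 + 0.204 * dist + 2.920 * sem - 3.776 * per + 2.459 * style3
    return 1 / (1 + math.exp(-z))

# 17 km, 10 traffic lights, morning trip (per = 1), aggressive style (style3 = 1)
p = prob_late(dist=17, sem=10, per=1, style3=1)
```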
On average, how much does the chance of arriving late to school change if a route 1 kilometer longer is adopted while maintaining the remaining conditions constant?
To answer this question, we should resort to Expression (14.3), which can be written as follows:
$$\text{odds}_{Y_i=1} = e^{Z_i}$$
such that, maintaining the remaining conditions constant, the chance of arriving late to school when adopting a route that is 1 kilometer longer is:
$$\text{odds}_{Y=1} = e^{0.204} = 1.226$$
Then, the chance is multiplied by a factor of 1.226, or rather, if the remaining conditions are maintained constant, the chance of arriving late to school when adopting a route that is 1 kilometer longer is, on average, 22.6% higher.
Does a student considered aggressive present, on average, a higher chance of arriving late than another who is considered calm? If yes, how much is this chance increased, maintaining the remaining conditions constant?
Since β5 is positive, we can state that the probability of arriving late to school for a student who is considered aggressive is higher than that for a student who is considered calm, a fact that is also proven when we analyze the chance, given that, if β5 > 0, then eβ5 > 1, or rather, the chance of arriving late will be higher when the student has an aggressive driving style than when the student is calm. This proves, once again, that being aggressive behind the wheel leads nowhere!
Maintaining the remaining conditions constant, the chance to arrive late to school when being aggressive behind the wheel in relation to being calm is given as:
$$\text{odds}_{Y=1} = e^{2.459} = 11.693$$
Then, that chance is multiplied by a factor of 11.693, or rather, maintaining the remaining conditions constant, the chance of arriving late to school when being aggressive behind the wheel in relation to being calm is, on average, 1069.3% higher.
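The two odds multipliers above are simply the exponentials of the estimated coefficients. A minimal sketch, assuming the coefficients reported in the text:

```python
import math

# Estimated coefficients from the final model (from the text)
beta1_dist = 0.204      # one extra kilometer on the route
beta5_style3 = 2.459    # aggressive vs. calm driving style

# Multiplicative change in the odds for a one-unit change, ceteris paribus
odds_factor_dist = math.exp(beta1_dist)
odds_factor_style3 = math.exp(beta5_style3)

# Percentage change in the odds
pct_dist = (odds_factor_dist - 1) * 100      # about 22.6% higher
pct_style3 = (odds_factor_style3 - 1) * 100  # about 1069.3% higher
```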
It is worth commenting that there are no differences in the probability of arriving late to school when one is considered moderate or calm, given that the β4 parameter (referent to the moderate category) presents itself as statistically equal to zero, at the 5% significance level.
As we can see, these calculations always use the average estimates for the parameters. We now embark on the study of the confidence intervals for these parameters.
The confidence intervals for the coefficients of Expression (14.10), for parameters α and βj (j = 1, 2, …, k), at the 95% confidence level, can be written as follows:
$$\alpha \pm 1.96 \cdot [s.e.(\alpha)]$$

$$\beta_j \pm 1.96 \cdot [s.e.(\beta_j)]$$
where, as we have seen, 1.96 is the zc for the 95% confidence level (5% significance level).
As such, we can prepare Table 14.6, which gives the estimated parameter coefficients for the probability expression for the event of interest in our example, with the respective standard errors, the Wald z statistics and the confidence intervals for the 5% significance level.
Table 14.6
| Parameter | Coefficient | Standard Error (s.e.) | z | 95% CI Lower (coefficient − 1.96 · s.e.) | 95% CI Upper (coefficient + 1.96 · s.e.) |
|---|---|---|---|---|---|
| α (constant) | − 30.935 | 10.636 | − 2.909 | − 51.782 | − 10.088 |
| β1 (dist variable) | 0.204 | 0.101 | 2.020 | 0.006 | 0.402 |
| β2 (sem variable) | 2.920 | 1.011 | 2.888 | 0.938 | 4.902 |
| β3 (per variable) | − 3.776 | 0.847 | − 4.458 | − 5.436 | − 2.116 |
| β5 (style3 variable) | 2.459 | 1.139 | 2.159 | 0.227 | 4.691 |
This table is equal to what we will obtain when estimating the model in Stata and SPSS by means of the Stepwise procedure. Based on the parameter confidence intervals, we can write the lower (minimum) and upper (maximum) limits expressions for the estimated probability that a student i will arrive late to school, with 95% confidence. As such, we will have:
$$p_{i\,min} = \frac{1}{1 + e^{-(-51.782 + 0.006 \cdot dist_i + 0.938 \cdot sem_i - 5.436 \cdot per_i + 0.227 \cdot style3_i)}}$$

$$p_{i\,max} = \frac{1}{1 + e^{-(-10.088 + 0.402 \cdot dist_i + 4.902 \cdot sem_i - 2.116 \cdot per_i + 4.691 \cdot style3_i)}}$$
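The interval bounds used in the two expressions above come directly from the coefficient ± 1.96 · s.e. rule. A sketch, assuming the coefficients and standard errors reported in the text (the dictionary keys are our labels):

```python
# Coefficients and standard errors of the final model (from the text)
params = {
    "alpha":  (-30.935, 10.636),
    "dist":   (0.204, 0.101),
    "sem":    (2.920, 1.011),
    "per":    (-3.776, 0.847),
    "style3": (2.459, 1.139),
}

z_crit = 1.96  # critical z for the 95% confidence level

# (lower, upper) confidence interval for each parameter
ci = {name: (c - z_crit * s, c + z_crit * s) for name, (c, s) in params.items()}
```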
Based on Expression (14.20), the confidence interval of the chance an event of interest occurs for each parameter βj (j = 1, 2, …, k), at the 95% confidence level, can be written the following way:
$$e^{\beta_j \pm 1.96 \cdot [s.e.(\beta_j)]}$$
Notice that we did not present the expression of the odds confidence interval for parameter α, since it only makes sense to discuss the change in the chance of occurrence of the event under study when a determined explanatory variable of the model is altered by one unit, maintaining the remaining conditions constant.
For the data in our example and based on the values of Table 14.6, we will, then, prepare Table 14.7, which presents the confidence intervals of the chance (odds) of occurrence for an event of interest for each parameter βj.
Table 14.7
| Parameter | Chance (Odds) e^βj | 95% Odds CI Lower e^(βj − 1.96 · [s.e.(βj)]) | 95% Odds CI Upper e^(βj + 1.96 · [s.e.(βj)]) |
|---|---|---|---|
| β1 (dist variable) | 1.226 | 1.006 | 1.495 |
| β2 (sem variable) | 18.541 | 2.555 | 134.458 |
| β3 (per variable) | 0.023 | 0.004 | 0.120 |
| β5 (style3 variable) | 11.693 | 1.254 | 109.001 |
These values can also be obtained by means of Stata and SPSS, as we will show in Sections 14.4 and 14.5, respectively.
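The odds intervals in the table are simply the exponentials of the coefficient interval bounds. A sketch under the same assumptions as before (coefficients and standard errors from the text; keys are our labels):

```python
import math

# Beta coefficients and standard errors (from the text)
betas = {
    "dist":   (0.204, 0.101),
    "sem":    (2.920, 1.011),
    "per":    (-3.776, 0.847),
    "style3": (2.459, 1.139),
}

z_crit = 1.96  # 95% confidence level

# (lower, upper) odds confidence interval: exponentiate the coefficient bounds
odds_ci = {
    name: (math.exp(c - z_crit * s), math.exp(c + z_crit * s))
    for name, (c, s) in betas.items()
}
```

Note, for instance, that no interval contains 1 (the per interval lies entirely below 1, the others entirely above), which is the odds-scale counterpart of no coefficient interval containing zero.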
According to what was discussed in the previous chapter, if the confidence interval for a determined parameter contains zero (or if the corresponding odds interval contains 1), that parameter will be considered statistically equal to zero at the confidence level with which the researcher is working. If this happens with the parameter α, it is recommended that nothing be altered in the modeling, since such a fact is due to the use of small samples, and a larger sample will solve this problem. On the other hand, if the confidence interval of a parameter βj contains zero, the corresponding variable will be excluded from the final model when the Stepwise procedure is used. Even though it was not shown here, the confidence interval of the parameter estimated for the variable style2 contains zero since, as discussed, its zcal value was situated between − 1.96 and 1.96 and, therefore, such variable was excluded from the final model.
As was also discussed, the rejection of the null hypothesis for a determined β parameter, at a specified significance level, indicates that the corresponding X variable is significant to explain the probability of occurrence for an event of interest and, consequently, should remain in the final model. We can, therefore, conclude that the decision to exclude a determined X variable in a logistic regression model can be done by means of the direct analysis of the Wald z statistic of its respective β parameter (if − zc < zcal < zc → P-value > 0.05 → we cannot reject that the parameter is statistically equal to zero) or by means of the analysis of the confidence interval (if the same contains zero). Box 14.1 presents the criteria of inclusion or exclusion of the βj (j = 1, 2, …, k) parameters in logistic regression models.
Having estimated the probability model for the occurrence of an event, we will now define the concept of cutoff, based on which it will be possible to classify, in our example, the observations based on the estimated probabilities for each of them. We return to the estimated probability expression for the final model:
$$p_i = \frac{1}{1 + e^{-(-30.935 + 0.204 \cdot dist_i + 2.920 \cdot sem_i - 3.776 \cdot per_i + 2.459 \cdot style3_i)}}$$
Having calculated the pi values by means of the LateMaximumLikelihoodFinalModel.xls file, we will prepare a table with some observations from our sample. Table 14.8 gives the pi values for the randomly chosen 10 observations, solely for teaching purposes.
Table 14.8
Observation | pi |
---|---|
Adelino | 0.05444 |
Carolina | 0.67206 |
Cristina | 0.55159 |
Eduardo | 0.81658 |
Cintia | 0.64918 |
Raimundo | 0.05340 |
Emerson | 0.04484 |
Raquel | 0.56702 |
Rita | 0.85048 |
Leandro | 0.46243 |
A cutoff is defined by the researcher so that the observations can be classified in function of their calculated probabilities and, as such, is used when there is the desire to prepare occurrence predictions of the event for observations not present in the sample, based on the probability of observations present in the sample.
Thus, if a determined observation not present in the sample presents an estimated probability of the event higher than the defined cutoff, the incidence of the event is expected and, therefore, the observation will be classified as an event. On the other hand, if its probability is lower than the defined cutoff, the incidence of a non-event is expected and the observation will, therefore, be classified as a non-event.
In general, we can stipulate the following criteria:
Being that the probability expression is estimated based on the observations present in the sample, the classification, for other observations not initially present in the sample, takes into consideration the behavioral consistency of the estimators and, therefore, for inferential effects, the sample should be significant and representative of population behavior, as with any confirmatory model.1
The cutoff serves for the researcher to evaluate the real incidence of the event for each observation and compare it with the expectation that each observation occurs, in fact, in the event. This being done, it will be possible to evaluate the model success rate based on the actual observations present in the sample and, per inference, assume that such success rate is maintained when there is the desire to evaluate the event incidence for other observations not present in the sample (prediction).
Based on the data from the observations presented in Table 14.8, and choosing, for example, a cutoff of 0.5, we can define that:
Table 14.9 gives, for each of the 10 randomly chosen observations, the real occurrence of the event and its respective classification based on the cutoff definition.
Table 14.9
Observation | Event | pi | Classification Cutoff = 0.5 |
---|---|---|---|
Adelino | No | 0.05444 | No |
Carolina | No | 0.67206 | Yes |
Cristina | No | 0.55159 | Yes |
Eduardo | No | 0.81658 | Yes |
Cintia | No | 0.64918 | Yes |
Raimundo | No | 0.05340 | No |
Emerson | No | 0.04484 | No |
Raquel | No | 0.56702 | Yes |
Rita | Yes | 0.85048 | Yes |
Leandro | Yes | 0.46243 | No |
Now we can prepare a new classification table, still based only on these 10 observations, so as to evaluate if the observations were correctly classified with a cutoff of 0.5 (Table 14.10).
Table 14.10
| | Real Occurrence of the Event | Real Occurrence of the Non-event |
|---|---|---|
| Classified as event | 1 | 5 |
| Classified as non-event | 1 | 3 |
In other words, for these 10 observations, only one was an event and presented a probability higher than 0.5, or rather, it was an event and was in fact classified as such (correctly classified). The other three observations were also classified correctly, or rather, they were not an event and were not classified as an event. On the other hand, six observations were classified incorrectly, or rather, while one was an event, even though it presented a probability lower than 0.5 and, therefore, not classified as an event, the other five were not an event but presented estimated probabilities higher than 0.5 and, consequently, were classified as an event.
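The 2 × 2 classification above can be reproduced mechanically from the 10 observations and the cutoff. A sketch, assuming the probabilities listed in the text (the tuple layout is ours):

```python
# (observation, actually an event?, estimated probability), from the text
sample = [
    ("Adelino", False, 0.05444), ("Carolina", False, 0.67206),
    ("Cristina", False, 0.55159), ("Eduardo", False, 0.81658),
    ("Cintia", False, 0.64918), ("Raimundo", False, 0.05340),
    ("Emerson", False, 0.04484), ("Raquel", False, 0.56702),
    ("Rita", True, 0.85048), ("Leandro", True, 0.46243),
]

cutoff = 0.5

# Counts of the four cells of the classification table
event_as_event = sum(1 for _, ev, p in sample if ev and p > cutoff)
nonevent_as_event = sum(1 for _, ev, p in sample if not ev and p > cutoff)
event_as_nonevent = sum(1 for _, ev, p in sample if ev and p <= cutoff)
nonevent_as_nonevent = sum(1 for _, ev, p in sample if not ev and p <= cutoff)
```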
For our sample of 100 observations, we can elaborate Table 14.11, which gives the complete classification for the 0.5 cutoff. This table can also be obtained by modeling in Stata and SPSS.
Table 14.11
| | Real Occurrence of the Event | Real Occurrence of the Non-event |
|---|---|---|
| Classified as event | 56 | 11 |
| Classified as non-event | 3 | 30 |
For the complete sample, we see that 86 observations were correctly classified for a cutoff of 0.5, being that 56 were events and were in fact classified as such, and another 30 were non-events and were not classified as events with this cutoff. However, 14 observations were incorrectly classified, being that 3 were events but were not classified as such and 11 were not events but were classified as if they had been.
This analysis, known as sensitivity analysis, generates classifications that depend on the choice of cutoff. Further ahead, we will alter the cutoff so as to show that the quantities of observations classified as event or non-event change.
At this time, we will define the concepts of overall model efficiency, sensitivity, and specificity.
The overall model efficiency (OME) corresponds to the percentage of classification hits for a determined cutoff. For our example, the overall model efficiency is calculated as follows:
$$OME = \frac{56 + 30}{100} = 0.8600$$
For the 0.5 cutoff, 86.00% of the observations are classified correctly. As was mentioned in Section 14.2.2, the overall model efficiency, for a determined cutoff, is much more adequate to evaluate model performance than the McFadden pseudo R2 since the dependent variable presents itself in a dichotomic qualitative way.
Sensitivity deals with the percentage of hits, for a determined cutoff, considering only the observations that are, in fact, events. Then, in our example, the denominator for calculating sensitivity is 59, and its expression is given as:
$$\text{Sensitivity} = \frac{56}{59} = 0.9492$$
As such, for a cutoff of 0.5, 94.92% of the observations that are events are classified correctly.
Now, specificity, on the other hand, refers to the percentage of hits, for a given cutoff, considering only the observations that are not events. In our example, the expression is given as:
$$\text{Specificity} = \frac{30}{41} = 0.7317$$
As such, for a cutoff of 0.5, 73.17% of the observations that are non-events are classified correctly, or rather, these observations present estimated probabilities of the occurrence of the event lower than 50%.
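The three indicators can be computed from the four cells of the classification table. A sketch, assuming the counts of Table-style data above for the 0.5 cutoff (variable names are ours):

```python
# Classification counts for the full sample of 100 observations, cutoff = 0.5
event_as_event, nonevent_as_event = 56, 11       # classified as event
event_as_nonevent, nonevent_as_nonevent = 3, 30  # classified as non-event

n = event_as_event + nonevent_as_event + event_as_nonevent + nonevent_as_nonevent

# Overall model efficiency: share of correct classifications
ome = (event_as_event + nonevent_as_nonevent) / n

# Sensitivity: hit rate among the 59 actual events
sensitivity = event_as_event / (event_as_event + event_as_nonevent)

# Specificity: hit rate among the 41 actual non-events
specificity = nonevent_as_nonevent / (nonevent_as_nonevent + nonevent_as_event)
```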
Obviously, overall model efficiency, sensitivity, and specificity change when the cutoff value is changed. Table 14.12 presents a new classification for the sample observations, considering a cutoff of 0.3. In this case, we have the following classification criteria:
Table 14.12
| | Real Occurrence of the Event | Real Occurrence of the Non-event |
|---|---|---|
| Classified as event | 57 | 13 |
| Classified as non-event | 2 | 28 |
| Overall model efficiency | 0.8500 | |
| Sensitivity | 0.9661 | |
| Specificity | 0.6829 | |
In comparing the values obtained for a cutoff of 0.5, we see, in this case (cutoff of 0.3), that while sensitivity presents a small increase, specificity is reduced a little more dramatically, which results, in the overall ambit, in a reduction of the overall model efficiency percentage.
Now, let’s alter the cutoff once again, which will be, for our example, 0.7. For this new situation, we have the following classification criteria:
Table 14.13 shows this new classification, with the calculations for overall model efficiency, sensitivity, and specificity.
Table 14.13
| | Real Occurrence of the Event | Real Occurrence of the Non-event |
|---|---|---|
| Classified as event | 47 | 5 |
| Classified as non-event | 12 | 36 |
| Overall model efficiency | 0.8300 | |
| Sensitivity | 0.7966 | |
| Specificity | 0.8780 | |
In this case, we see another behavior, or rather, while sensitivity presents a considerable reduction, specificity increases. We can even see that the rate of hits for those that are events becomes lower than the rate of hits for those that are not events. However, the overall model efficiency, with a 0.7 cutoff, also presents a reduction in percentage in relation to the model with a cutoff of 0.5.
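The tradeoff across the three cutoffs can be summarized in one loop over the three classification tables. A sketch, assuming the counts reported in the text (the tuple layout is ours):

```python
# (cutoff, event-as-event, nonevent-as-event, event-as-nonevent, nonevent-as-nonevent)
tables = [
    (0.3, 57, 13, 2, 28),
    (0.5, 56, 11, 3, 30),
    (0.7, 47, 5, 12, 36),
]

metrics = {}
for cutoff, ee, ne, en, nn in tables:
    metrics[cutoff] = {
        "ome": (ee + nn) / (ee + ne + en + nn),  # overall model efficiency
        "sensitivity": ee / (ee + en),           # hits among actual events
        "specificity": nn / (nn + ne),           # hits among actual non-events
    }
```

The loop makes the pattern of the text explicit: lowering the cutoff raises sensitivity at the cost of specificity, raising it does the opposite, and 0.5 yields the highest overall model efficiency among the three.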
This sensitivity analysis can be done with any cutoff value of between 0 and 1, which allows the researcher to decide regarding defining a cutoff that attends their prediction objectives. If, for example, the objective is to maximize the overall model efficiency, a determined cutoff can be used that, as we know, can generate nonmaximized values of sensitivity or specificity. If, on the other hand, the objective is to maximize sensitivity, or rather, the rate of hits for those that are events, a cutoff can be defined that will not necessarily maximize the overall model efficiency. Finally, if there is the desire to maximize the rate of hits for observations that are not events (specificity), another cutoff can be defined.
In other words, the analysis of sensitivity is prepared based on the subjacent theory for each study and takes into consideration the choices desired by the researcher in terms of event occurrence prediction for observations not present in the sample, being, therefore, a management and strategic analysis of the phenomenon being investigated.
In academic work and in management reports from diverse organizations, it is common that sensitivity analysis graphs be presented and discussed. The most common are those known as the sensitivity curve and the ROC curve, which have distinct ends. While the sensitivity curve is a graph that presents the sensitivity and specificity values in function of the different cutoff values, the ROC curve is a graph that presents the variation in sensitivity in function of (1 − specificity).
We will present the sensitivity curve (Fig. 14.14) and the ROC curve (Fig. 14.15) for the data calculated in our example. Even though not complete, being that three cutoff values have already been used (0.3, 0.5, and 0.7), said curves will allow that some analyses be formed.
By means of the sensitivity curve, we can see that it is possible to define a cutoff that equates sensitivity with specificity, or rather, the cutoff that causes the rate of correct predictions for observations that will be events to equal the rate of correct predictions for those that will not be events. It is important to mention, however, that this cutoff does not guarantee that the overall model efficiency is the maximum possible.
Besides this, the sensitivity curve allows the researcher to evaluate the tradeoff between sensitivity and specificity as the cutoff is altered, since, in many cases, as has been discussed, the objective of the prediction could be to increase the rate of hits for observations that will be events without a considerable loss in the rate of hits for those that will not be events.
The ROC curve shows the actual behavior of the tradeoff between sensitivity and specificity by bringing, on the abscissa axis, the values of (1 − specificity), presenting a convex shape in relation to point (0, 1). As such, a determined model with a greater area below the ROC curve presents greater overall prediction efficiency, combining all of the cutoff possibilities and, as such, should be preferred over another model with a smaller area below the ROC curve. In other words, if a researcher wants, for example, to include new explanatory variables in the model, a comparison of the overall performance of the models can be prepared based on the area below the ROC curve, being that, the greater its convexity in relation to point (0, 1), the greater its area (higher sensitivity and higher specificity) and, consequently, the better the estimated model for the effects of prediction. Fig. 14.16 presents an illustration of this concept.
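The area below the ROC curve has an intuitive interpretation: it equals the probability that a randomly chosen event receives a higher estimated probability than a randomly chosen non-event. The sketch below illustrates this pairwise computation using only the 10 observations shown earlier, so the resulting value is purely illustrative and is not the area for the full sample of 100 observations:

```python
# Estimated probabilities for the 10 illustrative observations (from the text)
event_probs = [0.85048, 0.46243]  # Rita, Leandro (actual events)
non_event_probs = [0.05444, 0.67206, 0.55159, 0.81658,
                   0.64918, 0.05340, 0.04484, 0.56702]  # actual non-events

# Count event/non-event pairs where the event has the higher probability
# (ties count as half)
wins = 0.0
for pe in event_probs:
    for pn in non_event_probs:
        if pe > pn:
            wins += 1.0
        elif pe == pn:
            wins += 0.5

auc = wins / (len(event_probs) * len(non_event_probs))
```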
According to Swets (1996), the ROC curve has this name because it compares the alteration of two model operational characteristics (sensitivity and specificity). It was first used by engineers in the Second World War in the study to detect enemy objects in battle. Next, it was introduced to psychology for the investigation of the perceptual detection of determined stimuli and is, today, widely used in the field of medicine, such as radiology, and in different fields of applied social science, such as economics and finance. In this specific case, it is used considerably in risk and credit management and the probability of default.
In Sections 14.4 and 14.5, we will present the sensitivity and ROC curves by means of Stata and SPSS, respectively, with all cutoff value possibilities between 0 and 1 for the final estimated model, including the calculation of the respective area below the ROC curve.
When the dependent variable that represents the phenomenon under study is qualitative, but offers more than two possible answers (categories), we should use the multinomial logistic regression to estimate the occurrence probabilities for each alternative. To do this, we must first define the reference category.
Imagine a situation where a dependent variable presents itself in a qualitative form with three possible answer categories (0, 1, or 2). If the chosen reference category is category 0, we will have two other event possibilities in relation to this category, which will be represented by categories 1 and 2 and, as such, two explanatory variable vectors will be defined with the respective estimated parameters, or rather, two logits, as follows:
$$Z_{i1} = \alpha_1 + \beta_{11} \cdot X_{1i} + \beta_{21} \cdot X_{2i} + \cdots + \beta_{k1} \cdot X_{ki}$$

$$Z_{i2} = \alpha_2 + \beta_{12} \cdot X_{1i} + \beta_{22} \cdot X_{2i} + \cdots + \beta_{k2} \cdot X_{ki}$$
where the logit number now appears in the subscript of each parameter to be estimated.
Then, generically, if the dependent variable that represents the variable under study presents M answer categories, the number of estimated logits will be (M − 1) and, based on the same, we can estimate the probability of occurrence for each of the categories. The general expression of the logit Zim (m = 0, 1, …, M − 1) for a model where a dependent variable assumes M answer categories is:
$$Z_{im} = \alpha_m + \beta_{1m} \cdot X_{1i} + \beta_{2m} \cdot X_{2i} + \cdots + \beta_{km} \cdot X_{ki}$$

where $Z_{i0} = 0$ and, therefore, $e^{Z_{i0}} = 1$.
Until now, in this chapter, we have been working with two categories and, consequently, only one Zi logit. In this way, the probabilities of the occurrence of a non-event and an event were calculated, respectively, by means of the following expressions:
Probability of occurrence of the non-event:
$$1 - p_i = \frac{1}{1 + e^{Z_i}}$$

Probability of occurrence of the event:

$$p_i = \frac{e^{Z_i}}{1 + e^{Z_i}}$$
Now for three categories, and based on Expressions (14.23) and (14.24), we can estimate the probability of occurrence for reference category 0 and the occurrence probabilities of the two distinct events represented by categories 1 and 2. As such, the expressions for these probabilities can be written in the following way:
Probability of occurrence for category 0 (reference):
$$p_{i0} = \frac{1}{1 + e^{Z_{i1}} + e^{Z_{i2}}}$$

Probability of occurrence for category 1:

$$p_{i1} = \frac{e^{Z_{i1}}}{1 + e^{Z_{i1}} + e^{Z_{i2}}}$$

Probability of occurrence for category 2:

$$p_{i2} = \frac{e^{Z_{i2}}}{1 + e^{Z_{i1}} + e^{Z_{i2}}}$$
such that the sum of the probability of event occurrences, represented by the distinct categories, will always be 1.
In their complete form, Expressions (14.28)–(14.30) can be written as:
$$p_{i0} = \frac{1}{1 + e^{(\alpha_1 + \beta_{11} \cdot X_{1i} + \beta_{21} \cdot X_{2i} + \cdots + \beta_{k1} \cdot X_{ki})} + e^{(\alpha_2 + \beta_{12} \cdot X_{1i} + \beta_{22} \cdot X_{2i} + \cdots + \beta_{k2} \cdot X_{ki})}}$$

$$p_{i1} = \frac{e^{(\alpha_1 + \beta_{11} \cdot X_{1i} + \beta_{21} \cdot X_{2i} + \cdots + \beta_{k1} \cdot X_{ki})}}{1 + e^{(\alpha_1 + \beta_{11} \cdot X_{1i} + \beta_{21} \cdot X_{2i} + \cdots + \beta_{k1} \cdot X_{ki})} + e^{(\alpha_2 + \beta_{12} \cdot X_{1i} + \beta_{22} \cdot X_{2i} + \cdots + \beta_{k2} \cdot X_{ki})}}$$

$$p_{i2} = \frac{e^{(\alpha_2 + \beta_{12} \cdot X_{1i} + \beta_{22} \cdot X_{2i} + \cdots + \beta_{k2} \cdot X_{ki})}}{1 + e^{(\alpha_1 + \beta_{11} \cdot X_{1i} + \beta_{21} \cdot X_{2i} + \cdots + \beta_{k1} \cdot X_{ki})} + e^{(\alpha_2 + \beta_{12} \cdot X_{1i} + \beta_{22} \cdot X_{2i} + \cdots + \beta_{k2} \cdot X_{ki})}}$$
In general, for a model where the dependent variable assumes M answer categories, we can write the probability expression $p_{im}$ (m = 0, 1, …, M − 1) as follows:

$$p_{im} = \frac{e^{Z_{im}}}{\sum_{m=0}^{M-1} e^{Z_{im}}}$$
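This general expression can be sketched as a small function that maps the (M − 1) estimated logits to the M category probabilities, with the reference-category convention Z₀ = 0 built in (the function name and the sample logit values are ours, chosen only for illustration):

```python
import math

def multinomial_probs(logits):
    """Probabilities for M categories given the (M - 1) estimated logits;
    the reference category has logit Z_0 = 0 by convention, so e^{Z_0} = 1."""
    exps = [1.0] + [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Arbitrary illustrative logit values (not estimated from the example data)
p = multinomial_probs([1.0, -0.5])  # returns [p_0, p_1, p_2]
```

Whatever the logit values, the M probabilities always sum to 1, as the text states.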
Analogous to the procedure developed in Sections 14.2.1–14.2.3, we will now estimate the parameters for Expressions (14.23) and (14.24) by using an example. We will also evaluate the general statistical significance of the model and parameters, as well as estimate their confidence intervals at a determined significance level. As such, we will again use, at this time, Excel.
We will present the concepts pertinent to estimation of a multinomial logistic regression by maximum likelihood using an example similar to that developed in the previous section.
Now, imagine that our tireless professor is not only interested in studying what causes students to arrive late to school or not. He now wants to know if the students arrive late to their first or second class. In other words, the professor is now interested in investigating if some variables relative to the route taken influence the probability of arriving or not arriving late to the first class or the second class. Now, the dependent variable comes to have three categories: not arrive late, arrive late to the first class, and arrive late to the second class.
Being thus, the professor researched the same 100 students in the school where he lectures; however, the research was done on another day. Being that some students were a little tired of answering so many questions as of late, the professor, besides the variable referent to the phenomenon under study, decided to ask only regarding the distance (dist) and the number of traffic lights (sem) each went through that day on their way to school. Part of the dataset can be found in Table 14.14.
Table 14.14
Student | Arrived Late to School (No = 0; Yes to First Class = 1; Yes to Second Class = 2) (Yi) | Distance Traveled to School (km) (X1i) | Number of Traffic Lights—sem (X2i) |
---|---|---|---|
Gabriela | 2 | 20.5 | 15 |
Patricia | 2 | 21.3 | 18 |
Gustavo | 2 | 21.4 | 16 |
Leticia | 2 | 31.5 | 15 |
Luiz Ovidio | 2 | 17.5 | 16 |
Leonor | 2 | 21.5 | 18 |
Dalila | 2 | 21.5 | 18 |
Antonio | 2 | 23.4 | 18 |
Julia | 2 | 22.7 | 18 |
Mariana | 2 | 22.7 | 18 |
… | |||
Rodrigo | 1 | 16.0 | 16 |
… | |||
Estela | 0 | 1.0 | 13 |
As we can see, the dependent variable now has three distinct values, which are nothing more than labels referent to each of the three answer categories (M = 3). It is unfortunately common for beginning researchers to prepare multiple regression models, for example, assuming that the dependent variable is quantitative just because it presents numbers in its column. As we discussed in the previous section, this is a serious mistake!
The complete dataset for this new example can be found in the LateMultinomial.xls file.
The expressions for the logits we wish to estimate are, therefore:
$$Z_{i1} = \alpha_1 + \beta_{11} \cdot dist_i + \beta_{21} \cdot sem_i$$

$$Z_{i2} = \alpha_2 + \beta_{12} \cdot dist_i + \beta_{22} \cdot sem_i$$
which refer to events 1 and 2, respectively, presented in Table 14.14. Notice that the event represented by the label 0 refers to the reference category.
Then, based on Expressions (14.31)–(14.33), we can write the estimated occurrence probability expressions for each event corresponding to each category of the dependent variable. Being thus, we have:
$$p_{i0} = \frac{1}{1 + e^{(\alpha_1 + \beta_{11} \cdot dist_i + \beta_{21} \cdot sem_i)} + e^{(\alpha_2 + \beta_{12} \cdot dist_i + \beta_{22} \cdot sem_i)}}$$

$$p_{i1} = \frac{e^{(\alpha_1 + \beta_{11} \cdot dist_i + \beta_{21} \cdot sem_i)}}{1 + e^{(\alpha_1 + \beta_{11} \cdot dist_i + \beta_{21} \cdot sem_i)} + e^{(\alpha_2 + \beta_{12} \cdot dist_i + \beta_{22} \cdot sem_i)}}$$

$$p_{i2} = \frac{e^{(\alpha_2 + \beta_{12} \cdot dist_i + \beta_{22} \cdot sem_i)}}{1 + e^{(\alpha_1 + \beta_{11} \cdot dist_i + \beta_{21} \cdot sem_i)} + e^{(\alpha_2 + \beta_{12} \cdot dist_i + \beta_{22} \cdot sem_i)}}$$
where pi0, pi1, and pi2 represent the probability that a student i will not arrive late (category 0), the probability that a student i will arrive late to the first class (category 1), and the probability that a student i will arrive late to the second class (category 2), respectively.
To estimate the parameters of the probability expressions, we will again use estimation by maximum likelihood. Generically, in the multinomial logistic regression, where the dependent variable follows a multinomial distribution, an observation i can occur in a determined event of interest, given M possible events and, therefore, the occurrence probability $p_{im}$ (m = 0, 1, …, M − 1) for this specific event can be written in the following manner:

$$p(Y_{im}) = \prod_{m=0}^{M-1} (p_{im})^{Y_{im}}$$
For a sample with n observations, we can define the likelihood function in the following way:
$$L = \prod_{i=1}^{n} \prod_{m=0}^{M-1} (p_{im})^{Y_{im}}$$
from which comes, based on Expression (14.34), that:
$$L = \prod_{i=1}^{n} \prod_{m=0}^{M-1} \left( \frac{e^{Z_{im}}}{\sum_{m=0}^{M-1} e^{Z_{im}}} \right)^{Y_{im}}$$
Analogous to the procedure adopted when studying the binary logistic regression, we will here work with the logarithmic likelihood function, which leads us to the following function, also known as log likelihood function:
$$LL = \sum_{i=1}^{n} \sum_{m=0}^{M-1} \left[ Y_{im} \cdot \ln \left( \frac{e^{Z_{im}}}{\sum_{m=0}^{M-1} e^{Z_{im}}} \right) \right]$$
And, therefore, we can ask an important question: given M categories of the dependent variable, what are the values of the parameters of the logits Z_{im} (m = 0, 1, …, M − 1), represented by Expression (14.25), that cause the LL value of Expression (14.38) to be maximized? This fundamental question is the key to the estimation of the parameters of the multinomial logistic regression model by the maximum likelihood method, and it can be answered with optimization tools, by solving the problem with the following objective function:
$$LL = \sum_{i=1}^{n} \sum_{m=0}^{M-1} \left[ Y_{im} \cdot \ln \left( \frac{e^{Z_{im}}}{\sum_{m=0}^{M-1} e^{Z_{im}}} \right) \right] = \max$$
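This maximization can also be sketched outside a spreadsheet with a numerical optimizer, such as `scipy.optimize.minimize` applied to the negative of the log-likelihood function. The tiny sample of (dist, sem, category) triples below is invented purely for illustration and is not the book's dataset:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

# Illustrative (made-up) sample of (dist, sem, category) triples.
sample = [(5.0, 10, 0), (8.0, 12, 0), (20.0, 16, 0),
          (12.0, 13, 1), (18.0, 15, 1), (6.0, 11, 1),
          (22.0, 17, 2), (14.0, 13, 2), (25.0, 18, 2)]

def neg_log_likelihood(theta):
    """Negative of the LL in Expression (14.38); category 0 is the reference."""
    a1, b11, b21, a2, b12, b22 = theta
    nll = 0.0
    for dist, sem, y in sample:
        z = np.array([0.0,
                      a1 + b11 * dist + b21 * sem,
                      a2 + b12 * dist + b22 * sem])
        # ln p_iy = z_y - ln(sum_m e^{z_m}); logsumexp avoids overflow
        nll -= z[y] - logsumexp(z)
    return nll

result = minimize(neg_log_likelihood, x0=np.zeros(6), method="BFGS")
ll_max = -result.fun                        # maximized log likelihood
ll_zero = -neg_log_likelihood(np.zeros(6))  # LL with all parameters at 0
```

Maximizing LL is equivalent to minimizing −LL, which is what `minimize` does; the maximized value can never be worse than the log likelihood at the all-zero starting point.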
Returning to our example, we will solve this problem using the Excel Solver tool. To do this, we should open the LateMultinomialMaximumLikelihood.xls file, which will help in the parameter calculation.
In this file, besides the dependent and explanatory variables, three variables Y_{im} (m = 0, 1, 2) were created, referring to the three categories of the dependent variable. This procedure is necessary to operationalize Expression (14.35). These variables were created based on the criteria presented in Table 14.15.
Besides this, six other new variables were also created, corresponding to the logits Z_{i1} and Z_{i2}, the probabilities p_{i0}, p_{i1}, and p_{i2}, and the log-likelihood term LL_i for each observation, respectively. Table 14.16 shows part of the data when all parameters are equal to 0.
Table 14.16
Student | Yi | Yi0 | Yi1 | Yi2 | X1i | X2i | Zi1 | Zi2 | pi0 | pi1 | pi2 | $LL_i = \sum_{m=0}^{2} Y_{im} \cdot \ln(p_{im})$ |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Gabriela | 2 | 0 | 0 | 1 | 20.5 | 15 | 0 | 0 | 0.33 | 0.33 | 0.33 | − 1.09861 |
Patricia | 2 | 0 | 0 | 1 | 21.3 | 18 | 0 | 0 | 0.33 | 0.33 | 0.33 | − 1.09861 |
Gustavo | 2 | 0 | 0 | 1 | 21.4 | 16 | 0 | 0 | 0.33 | 0.33 | 0.33 | − 1.09861 |
Leticia | 2 | 0 | 0 | 1 | 31.5 | 15 | 0 | 0 | 0.33 | 0.33 | 0.33 | − 1.09861 |
Luiz Ovidio | 2 | 0 | 0 | 1 | 17.5 | 16 | 0 | 0 | 0.33 | 0.33 | 0.33 | − 1.09861 |
Leonor | 2 | 0 | 0 | 1 | 21.5 | 18 | 0 | 0 | 0.33 | 0.33 | 0.33 | − 1.09861 |
Dalila | 2 | 0 | 0 | 1 | 21.5 | 18 | 0 | 0 | 0.33 | 0.33 | 0.33 | − 1.09861 |
Antonio | 2 | 0 | 0 | 1 | 23.4 | 18 | 0 | 0 | 0.33 | 0.33 | 0.33 | − 1.09861 |
Julia | 2 | 0 | 0 | 1 | 22.7 | 18 | 0 | 0 | 0.33 | 0.33 | 0.33 | − 1.09861 |
Mariana | 2 | 0 | 0 | 1 | 22.7 | 18 | 0 | 0 | 0.33 | 0.33 | 0.33 | − 1.09861 |
… | ||||||||||||
Rodrigo | 1 | 0 | 1 | 0 | 16.0 | 16 | 0 | 0 | 0.33 | 0.33 | 0.33 | − 1.09861 |
… | ||||||||||||
Estela | 0 | 1 | 0 | 0 | 1.0 | 13 | 0 | 0 | 0.33 | 0.33 | 0.33 | − 1.09861 |
Sum | $LL = \sum_{i=1}^{100} \sum_{m=0}^{2} Y_{im} \cdot \ln(p_{im})$ | − 109.86123 |
Exclusively for teaching purposes, we present the calculation of LL for an observation where Yi = 2 and where all parameters are equal to zero:
$$LL_1 = \sum_{m=0}^{2} [Y_{1m} \cdot \ln(p_{1m})] = Y_{10} \cdot \ln(p_{10}) + Y_{11} \cdot \ln(p_{11}) + Y_{12} \cdot \ln(p_{12}) = (0) \cdot \ln(0.33) + (0) \cdot \ln(0.33) + (1) \cdot \ln(0.33) = -1.09861$$
Fig. 14.17 presents part of the data present in the LateMultinomialMaximumLikelihood.xls file.
As we discussed in Section 14.2.1, here we should also find the optimum combination of parameter values, such that the objective function presented in Expression (14.39) is satisfied, that is, such that the sum of the log-likelihood function is the maximum possible. We again resort to Excel Solver to solve this problem.
The objective function is in cell M103, which will be our destination cell and should be maximized. The parameters α1, β11, β21, α2, β12, and β22, whose values are in cells P3, P5, P7, P9, P11, and P13, respectively, are the variables cells. The Solver window will be as shown in Fig. 14.18.
Clicking on Solve and then OK, we obtain the optimal solution to the optimization problem. Table 14.17 shows part of the results obtained.
Table 14.17
Student | Yi | Yi0 | Yi1 | Yi2 | X1i | X2i | Zi1 | Zi2 | pi0 | pi1 | pi2 | $LL_i = \sum_{m=0}^{2} Y_{im} \cdot \ln(p_{im})$ |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Gabriela | 2 | 0 | 0 | 1 | 20.5 | 15 | 3.37036 | 3.23816 | 0.01799 | 0.52341 | 0.45860 | − 0.77959 |
Patricia | 2 | 0 | 0 | 1 | 21.3 | 18 | 8.82883 | 12.78751 | 0.00000 | 0.01873 | 0.98127 | − 0.01891 |
Gustavo | 2 | 0 | 0 | 1 | 21.4 | 16 | 5.54391 | 7.10441 | 0.00068 | 0.17346 | 0.82586 | − 0.19133 |
Leticia | 2 | 0 | 0 | 1 | 31.5 | 15 | 9.51977 | 15.10301 | 0.00000 | 0.00375 | 0.99625 | − 0.00375 |
Luiz Ovidio | 2 | 0 | 0 | 1 | 17.5 | 16 | 3.36367 | 2.89778 | 0.02082 | 0.60162 | 0.37756 | − 0.97402 |
Leonor | 2 | 0 | 0 | 1 | 21.5 | 18 | 8.94064 | 13.00323 | 0.00000 | 0.01691 | 0.98308 | − 0.01706 |
Dalila | 2 | 0 | 0 | 1 | 21.5 | 18 | 8.94064 | 13.00323 | 0.00000 | 0.01691 | 0.98308 | − 0.01706 |
Antonio | 2 | 0 | 0 | 1 | 23.4 | 18 | 10.00281 | 15.05262 | 0.00000 | 0.00637 | 0.99363 | − 0.00639 |
Julia | 2 | 0 | 0 | 1 | 22.7 | 18 | 9.61149 | 14.29758 | 0.00000 | 0.00914 | 0.99086 | − 0.00918 |
Mariana | 2 | 0 | 0 | 1 | 22.7 | 18 | 9.61149 | 14.29758 | 0.00000 | 0.00914 | 0.99086 | − 0.00918 |
… | ||||||||||||
Rodrigo | 1 | 0 | 1 | 0 | 16.0 | 16 | 2.52511 | 1.27985 | 0.05852 | 0.73104 | 0.21044 | − 0.31329 |
… | ||||||||||||
Estela | 0 | 1 | 0 | 0 | 1.0 | 13 | − 10.87168 | − 23.58594 | 0.99998 | 0.00002 | 0.00000 | − 0.00002 |
Sum | $LL = \sum_{i=1}^{100} \sum_{m=0}^{2} Y_{im} \cdot \ln(p_{im})$ | − 24.51180 |
The maximum possible value of the log-likelihood function is LL_max = − 24.51180. The solution to this problem generated the following parameter estimates: α1 = − 33.135, β11 = 0.559, β21 = 1.670, α2 = − 62.292, β12 = 1.078, and β22 = 2.895. In this way, the logits Z_{i1} and Z_{i2} can be written as follows:
$$Z_{i1} = -33.135 + 0.559 \cdot dist_i + 1.670 \cdot sem_i$$

$$Z_{i2} = -62.292 + 1.078 \cdot dist_i + 2.895 \cdot sem_i$$
Fig. 14.19 presents part of the results obtained by modeling the LateMultinomialMaximumLikelihood.xls file.
Based on the expressions of the logits Zi1 and Zi2, we can write the expressions of the occurrence probabilities for each of the categories of the dependent variable as follows:
Probability of a student i not arriving late (category 0):
$$p_{i0} = \frac{1}{1 + e^{(-33.135 + 0.559 \cdot dist_i + 1.670 \cdot sem_i)} + e^{(-62.292 + 1.078 \cdot dist_i + 2.895 \cdot sem_i)}}$$
Probability of a student i arriving late to the first class (category 1):
$$p_{i1} = \frac{e^{(-33.135 + 0.559 \cdot dist_i + 1.670 \cdot sem_i)}}{1 + e^{(-33.135 + 0.559 \cdot dist_i + 1.670 \cdot sem_i)} + e^{(-62.292 + 1.078 \cdot dist_i + 2.895 \cdot sem_i)}}$$
Probability of a student i arriving late to the second class (category 2):
$$p_{i2} = \frac{e^{(-62.292 + 1.078 \cdot dist_i + 2.895 \cdot sem_i)}}{1 + e^{(-33.135 + 0.559 \cdot dist_i + 1.670 \cdot sem_i)} + e^{(-62.292 + 1.078 \cdot dist_i + 2.895 \cdot sem_i)}}$$
Having estimated, by maximum likelihood, the parameters of the occurrence probability equations for each category of the dependent variable, we can classify the observations and define the overall efficiency of the multinomial logistic regression model. Unlike the binary logistic regression, where the classification is prepared based on the definition of a cutoff, in the multinomial logistic regression the classification of each observation is based on the highest probability among those calculated (p_{i0}, p_{i1}, or p_{i2}). For example, since observation 1 (Gabriela) presented p_{i0} = 0.018, p_{i1} = 0.523, and p_{i2} = 0.459, we classify her in category 1; that is, the model predicts that Gabriela will arrive late to the first class. However, this student actually arrived late to the second class and, therefore, for this case, the model did not score a hit.
Table 14.18 presents the classification for our complete sample, with emphasis on the hits for each category of the dependent variable, highlighting as well the overall model efficiency (overall percentage of hits).
Table 14.18
Observed | Classification | |||
---|---|---|---|---|
Did Not Arrive Late | Arrived Late to First Class | Arrived Late to Second Class | Percentage of Hits (%) |
Did not arrive late | 47 | 2 | 0 | 95.9 |
Arrived late to first class | 1 | 12 | 3 | 75.0 |
Arrived late to second class | 0 | 5 | 30 | 85.7 |
Overall model efficiency | 89.0 |
From the analysis of Table 14.18, we can see that the model presents an overall percentage of hits of 89.0%. The model achieves its highest percentage of hits (95.9%) for the cases of students who did not arrive late to class. On the other hand, for students who arrived late to the first class, the model has its lowest percentage of hits (75.0%).
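The highest-probability classification rule can be sketched as follows, using the fitted probabilities for three of the students taken from Table 14.17:

```python
# (p_i0, p_i1, p_i2) from Table 14.17 and the observed category.
observations = {
    "Gabriela": ((0.01799, 0.52341, 0.45860), 2),
    "Rodrigo":  ((0.05852, 0.73104, 0.21044), 1),
    "Estela":   ((0.99998, 0.00002, 0.00000), 0),
}

def classify(probs):
    """Assign the category with the highest estimated probability."""
    return max(range(len(probs)), key=lambda m: probs[m])

# A hit occurs when the predicted category equals the observed one.
hits = {name: classify(probs) == observed
        for name, (probs, observed) in observations.items()}
```

Gabriela is classified in category 1 (her highest probability, 0.523) but was observed in category 2, so she is a miss; Rodrigo and Estela are hits.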
We now go on to the study of the general statistical significance of the obtained model, as well as the statistical significance of the actual parameters, as we did in Section 14.2.
As in the binary logistic regression studied in Section 14.2, multinomial logistic regression modeling also offers statistics referent to the McFadden pseudo R2 and to χ2, whose calculations are given based on Expressions (14.16) and (14.17), respectively, given again here:
$$\text{pseudo } R^2 = \frac{-2 \cdot LL_0 - (-2 \cdot LL_{max})}{-2 \cdot LL_0}$$

$$\chi^2 = -2 \cdot (LL_0 - LL_{max})$$
While the McFadden pseudo R², as discussed in Section 14.2.2, conveys limited information about model fit and is mainly useful when the researcher is interested in comparing distinct models, the χ² statistic allows a significance test of the proposed model: if all estimated parameters β_{jm} (j = 1, 2, …, k; m = 1, 2, …, M − 1) are statistically equal to 0, changes in the explanatory variables will not influence the occurrence probabilities of the events represented by the categories of the dependent variable. The null and alternative hypotheses of the χ² test for a general multinomial logistic regression model are, therefore, H₀: all β_{jm} parameters are statistically equal to zero, and H₁: at least one β_{jm} parameter is statistically different from zero.
Returning to our example, we have that LLmax, which is the maximum possible value of the sum of the logarithmic likelihood function, is equal to − 24.51180. To calculate LL0, which represents the maximum possible value of the sum of the logarithmic likelihood function for a model that only presents the constants α1 and α2 and no explanatory variable, we will again use Solver, by means of the LateMultinomialMaximumLikelihoodNullModel.xls file. Figs. 14.20 and 14.21 show the Solver window and part of the results obtained by modeling in this file, respectively.
Based on the null model, we have LL0 = − 101.01922 and, as such, we can calculate the following statistics:
$$\text{pseudo } R^2 = \frac{-2 \cdot (-101.01922) - [-2 \cdot (-24.51180)]}{-2 \cdot (-101.01922)} = 0.7574$$

$$\chi^2_{4\ \text{d.f.}} = -2 \cdot [-101.01922 - (-24.51180)] = 153.0148$$
For 4 degrees of freedom (the number of β parameters, since there are two explanatory variables and two logits), we have, by means of Table D in the Appendix, that χ²_c = 9.488 (the critical χ² for 4 degrees of freedom at the 5% significance level). Since the calculated χ²_cal = 153.0148 > χ²_c = 9.488, we can reject the null hypothesis that all β_{jm} (j = 1, 2; m = 1, 2) parameters are statistically equal to zero. Thus, at least one X variable is statistically significant to explain the probability of occurrence of at least one of the events under study. In the same way as discussed in Section 14.2.2, we can define the following criteria:
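These two statistics, and the critical value read from Table D, can be checked numerically with `scipy.stats.chi2` (a sketch; the LL values are those reported in the text):

```python
from scipy.stats import chi2

ll_0 = -101.01922    # null model (constants only)
ll_max = -24.51180   # full model

# McFadden pseudo R-squared, Expression (14.16)
pseudo_r2 = (-2 * ll_0 - (-2 * ll_max)) / (-2 * ll_0)

# Chi-squared statistic, Expression (14.17)
chi2_stat = -2 * (ll_0 - ll_max)

# Critical value at the 5% significance level with 4 degrees of freedom
chi2_crit = chi2.ppf(0.95, df=4)
reject_h0 = chi2_stat > chi2_crit
```

The computed values match the hand calculation: pseudo R² ≈ 0.7574, χ² ≈ 153.0148, and the critical value ≈ 9.488, so H₀ is rejected.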
Besides the general statistical significance of the model, it is necessary to verify the statistical significance of each parameter by means of the respective Wald z statistics. The null and alternative hypotheses, for the parameters α_m (m = 1, 2, …, M − 1) and β_{jm} (j = 1, 2, …, k; m = 1, 2, …, M − 1), are H₀: the parameter is statistically equal to zero, and H₁: the parameter is statistically different from zero.
The Wald z statistics are obtained based on Expression (14.18); however, maintaining the pattern presented in Section 14.2.2, we will not derive the standard error of each parameter. For our example, they are:
Then, as we have already estimated the parameters, we have that:
$$z_{\alpha_1} = \frac{\alpha_1}{s.e.(\alpha_1)} = \frac{-33.135}{12.183} = -2.720$$

$$z_{\beta_{11}} = \frac{\beta_{11}}{s.e.(\beta_{11})} = \frac{0.559}{0.243} = 2.300$$

$$z_{\beta_{21}} = \frac{\beta_{21}}{s.e.(\beta_{21})} = \frac{1.670}{0.577} = 2.894$$

$$z_{\alpha_2} = \frac{\alpha_2}{s.e.(\alpha_2)} = \frac{-62.292}{14.675} = -4.244$$

$$z_{\beta_{12}} = \frac{\beta_{12}}{s.e.(\beta_{12})} = \frac{1.078}{0.302} = 3.570$$

$$z_{\beta_{22}} = \frac{\beta_{22}}{s.e.(\beta_{22})} = \frac{2.895}{0.686} = 4.220$$
As we can see, all calculated Wald z statistics present values lower than z_c = − 1.96 or greater than z_c = 1.96 (the critical values at the 5% significance level, with probability 0.025 in each of the lower and upper tails).
As such, we can see that, for our example, the criterion |z| > 1.96 is satisfied for every parameter. In other words, the dist and sem variables are statistically significant, at the 95% confidence level, to explain the differences in the probabilities of arriving late to the first class and to the second class in relation to not arriving late. The expressions for these probabilities are those estimated in Section 14.3.1 and presented at its end.
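The Wald z statistics are simply each coefficient divided by its standard error; a sketch reproducing the significance check with the values reported in the text:

```python
# (coefficient, standard error) pairs reported in the text.
estimates = {
    "alpha1": (-33.135, 12.183), "beta11": (0.559, 0.243),
    "beta21": (1.670, 0.577),    "alpha2": (-62.292, 14.675),
    "beta12": (1.078, 0.302),    "beta22": (2.895, 0.686),
}

# Wald z = coefficient / standard error, as in Expression (14.18)
z_stats = {name: coef / se for name, (coef, se) in estimates.items()}

# At the 5% significance level, a parameter is significant if |z| > 1.96.
all_significant = all(abs(z) > 1.96 for z in z_stats.values())
```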
As such, based on the final estimated probability models, we can propose three interesting questions, as we did in Section 14.2.2:
What is the average estimated probability of arriving late to the first class after traveling 17 kilometers and going through 15 traffic lights?
Since arriving late to the first class is category 1, we should use the estimated probability expression p_{i1}. As such, for this situation, we have that:
$$p_1 = \frac{e^{[-33.135 + 0.559 \cdot (17) + 1.670 \cdot (15)]}}{1 + e^{[-33.135 + 0.559 \cdot (17) + 1.670 \cdot (15)]} + e^{[-62.292 + 1.078 \cdot (17) + 2.895 \cdot (15)]}} = 0.722$$
Then, the average estimated probability of arriving late to the first class is, under the informed conditions, equal to 72.2%.
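This calculation can be verified numerically with the estimated parameters (the function name below is ours; any small difference from the hand calculation is only rounding of the reported coefficients):

```python
import math

def p_late_first_class(dist, sem):
    """Estimated probability of arriving late to the first class (category 1)."""
    z1 = -33.135 + 0.559 * dist + 1.670 * sem
    z2 = -62.292 + 1.078 * dist + 2.895 * sem
    return math.exp(z1) / (1 + math.exp(z1) + math.exp(z2))

# 17 kilometers and 15 traffic lights, as in the question above.
p = p_late_first_class(17, 15)
```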
On average, how much does one alter the chance of arriving late to the first class, in relation to not arriving late to school, in adopting a route that is 1 kilometer longer, maintaining the remaining conditions constant?
To answer this question, we will again resort to Expression (14.3), which can be written as follows:
$$odds_{Y_i = 1} = e^{Z_{i1}}$$
such that, maintaining the remaining conditions constant, the chance to arrive late to the first class in relation to not arriving late to school, by adopting a route that is 1 kilometer longer, is:
$$odds_{Y_i = 1} = e^{0.559} = 1.749$$
Then, the chance is multiplied by a factor of 1.749, or rather, maintaining the other conditions constant, the chance of arriving late to the first class in relation to not arriving late, by adopting a route that is 1 kilometer longer, is, on average, 74.9% higher. In multinomial logistic regression models, chance (odds ratio) is also known as relative risk ratio.
On average, how much does one alter the chance of arriving late to the second class, in relation to not arriving late to school, in going through 1 more traffic light, maintaining the remaining conditions constant?
In this case, since the event of interest refers to the category of arriving late to the second class, the chance expression becomes:

$$odds_{Y_i = 2} = e^{2.895} = 18.081$$
Then, the chance is multiplied by a factor of 18.081, or rather, maintaining the remaining conditions constant, the chance of arriving late to the second class in relation to not arriving late to school, in going through 1 more traffic light in the route to school, is, on average, 1708.1% higher.
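Both relative risk ratios can be reproduced directly by exponentiating the corresponding estimated coefficients:

```python
import math

# Chance multiplier for 1 km more (late to 1st class vs. not late)
rrr_dist_first = math.exp(0.559)   # ~1.749

# Chance multiplier for 1 more traffic light (late to 2nd class vs. not late)
rrr_sem_second = math.exp(2.895)   # ~18.081
```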
As we can see, these calculations always use the average parameter estimates. As we did in Section 14.2, we will now go on to the study of the confidence intervals for these parameters.
The confidence intervals for the estimated parameters in a multinomial logistic regression are also calculated by means of Expression (14.21) presented in Section 14.2.3. Then, at the 95% confidence level, they can be defined, for parameters αm (m = 1, 2, …, M − 1) and βjm (j = 1, 2, …, k; m = 1, 2, …, M − 1) in the following way:
$$\alpha_m \pm 1.96 \cdot [s.e.(\alpha_m)] \qquad \beta_{jm} \pm 1.96 \cdot [s.e.(\beta_{jm})]$$

where 1.96 is the critical z (z_c) for the 5% significance level.
For the data in our example, Table 14.19 presents the estimated coefficients of parameters αm (m = 1, 2) and βjm (j = 1, 2; m = 1, 2) of the occurrence probability expressions in the events of interest, with the respective standard errors, the Wald z statistics, and the confidence intervals for the 5% significance level.
Table 14.19
Parameter | Coefficient | Standard Error (s.e.) | z | 95% CI Lower Bound ($\alpha_m$ or $\beta_{jm}$ − 1.96 · s.e.) | 95% CI Upper Bound ($\alpha_m$ or $\beta_{jm}$ + 1.96 · s.e.) |
---|---|---|---|---|---|
α1 (constant) | − 33.135 | 12.183 | − 2.720 | − 57.014 | − 9.256 |
β11 (dist variable) | 0.559 | 0.243 | 2.300 | 0.082 | 1.035 |
β21 (sem variable) | 1.670 | 0.577 | 2.894 | 0.539 | 2.800 |
α2 (constant) | − 62.292 | 14.675 | − 4.244 | − 91.055 | − 33.529 |
β12 (dist variable) | 1.078 | 0.302 | 3.570 | 0.486 | 1.671 |
β22 (sem variable) | 2.895 | 0.686 | 4.220 | 1.550 | 4.239 |
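The interval bounds in Table 14.19 come directly from coefficient ± 1.96 · s.e.; a sketch:

```python
# (coefficient, standard error) pairs from Table 14.19.
estimates = {
    "alpha1": (-33.135, 12.183), "beta11": (0.559, 0.243),
    "beta21": (1.670, 0.577),    "alpha2": (-62.292, 14.675),
    "beta12": (1.078, 0.302),    "beta22": (2.895, 0.686),
}

# 95% confidence interval: coefficient +/- 1.96 * standard error
ci = {name: (coef - 1.96 * se, coef + 1.96 * se)
      for name, (coef, se) in estimates.items()}

# No interval contains zero: lower and upper bounds share the same sign.
none_contains_zero = all(lo * hi > 0 for lo, hi in ci.values())
```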
As we already know, no confidence interval contains zero and, based on their values, we can write the lower (minimum) and upper (maximum) limits of the estimated occurrence probabilities for each of the categories of the dependent variable.
Confidence Interval (95%) of estimated probability of student i to not arrive late (category 0):
$$p_{i0_{min}} = \frac{1}{1 + e^{(-57.014 + 0.082 \cdot dist_i + 0.539 \cdot sem_i)} + e^{(-91.055 + 0.486 \cdot dist_i + 1.550 \cdot sem_i)}}$$

$$p_{i0_{max}} = \frac{1}{1 + e^{(-9.256 + 1.035 \cdot dist_i + 2.800 \cdot sem_i)} + e^{(-33.529 + 1.671 \cdot dist_i + 4.239 \cdot sem_i)}}$$
Confidence Interval (95%) of the estimated probability student i arrives late to first class (category 1):
$$p_{i1_{min}} = \frac{e^{(-57.014 + 0.082 \cdot dist_i + 0.539 \cdot sem_i)}}{1 + e^{(-57.014 + 0.082 \cdot dist_i + 0.539 \cdot sem_i)} + e^{(-91.055 + 0.486 \cdot dist_i + 1.550 \cdot sem_i)}}$$

$$p_{i1_{max}} = \frac{e^{(-9.256 + 1.035 \cdot dist_i + 2.800 \cdot sem_i)}}{1 + e^{(-9.256 + 1.035 \cdot dist_i + 2.800 \cdot sem_i)} + e^{(-33.529 + 1.671 \cdot dist_i + 4.239 \cdot sem_i)}}$$
Confidence Interval (95%) of the estimated probability student i arrives late to second class (category 2):
$$p_{i2_{min}} = \frac{e^{(-91.055 + 0.486 \cdot dist_i + 1.550 \cdot sem_i)}}{1 + e^{(-57.014 + 0.082 \cdot dist_i + 0.539 \cdot sem_i)} + e^{(-91.055 + 0.486 \cdot dist_i + 1.550 \cdot sem_i)}}$$

$$p_{i2_{max}} = \frac{e^{(-33.529 + 1.671 \cdot dist_i + 4.239 \cdot sem_i)}}{1 + e^{(-9.256 + 1.035 \cdot dist_i + 2.800 \cdot sem_i)} + e^{(-33.529 + 1.671 \cdot dist_i + 4.239 \cdot sem_i)}}$$
Analogous to what was done in Section 14.2.3, we can define the confidence interval expression for the chances (odds or relative risk ratios) of occurrence of each of the events represented by the subscript m (m = 1, 2, …, M − 1), in relation to the occurrence of the event represented by category 0 (reference), for each parameter β_{jm} (j = 1, 2, …, k; m = 1, 2, …, M − 1), at the 95% confidence level, in the following way:
$$e^{\beta_{jm} \pm 1.96 \cdot [s.e.(\beta_{jm})]}$$
For the data in our example, and based on the values calculated in Table 14.19, we will prepare Table 14.20, which represents the confidence intervals for the chances (odds or relative risk ratios) of occurrence for each of the events in relation to the reference event for each parameter βjm (j = 1, 2; m = 1, 2).
Table 14.20
Event | Parameter | Chance (Odds): $e^{\beta_{jm}}$ | 95% CI Lower Bound: $e^{\beta_{jm} - 1.96 \cdot s.e.(\beta_{jm})}$ | 95% CI Upper Bound: $e^{\beta_{jm} + 1.96 \cdot s.e.(\beta_{jm})}$ |
---|---|---|---|---|
Arrive late to first class | β11 (dist variable) | 1.749 | 1.085 | 2.817 |
β21 (sem variable) | 5.312 | 1.715 | 16.453 | |
Arrive late to second class | β12 (dist variable) | 2.939 | 1.625 | 5.318 |
β22 (sem variable) | 18.081 | 4.713 | 69.363 |
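The chance intervals in Table 14.20 are just the exponentiated bounds of the coefficient intervals; a sketch:

```python
import math

# (coefficient, standard error) for the beta parameters (Table 14.19).
betas = {
    "beta11": (0.559, 0.243), "beta21": (1.670, 0.577),
    "beta12": (1.078, 0.302), "beta22": (2.895, 0.686),
}

# Exponentiate the bounds of each coefficient's 95% interval to obtain
# the confidence interval for the chance (relative risk ratio).
odds_ci = {name: (math.exp(b - 1.96 * se), math.exp(b + 1.96 * se))
           for name, (b, se) in betas.items()}
```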
These values will also be obtained by means of Stata modeling, to be presented in the next section.