14.4 Estimation of Binary and Multinomial Logistic Regression Models in Stata

The objective of this section is not to discuss once again all of the concepts inherent to binary and multinomial logistic regression, but to provide the researcher with an opportunity to prepare the same examples explored throughout the chapter by means of the Stata Statistical Software. The reproduction of the images in this section has been authorized by StataCorp LP©.

14.4.1 Binary Logistic Regression in Stata

Returning to the first example, we recall that the professor was interested in evaluating whether the distance traveled, the number of traffic lights, the time of day the trip was taken, and the driving style of the students influenced whether or not they arrived late to school. We now return to the final dataset constructed by the professor by means of the questionnaires given to the group of 100 students. The dataset can be found in the Late.dta file and is exactly equal to that which was partially presented in Table 14.2.

At first, we should type in the command desc, which allows us to analyze the dataset characteristics, such as the number of observations, the number of variables, and a description of each of them. Fig. 14.22 presents this first output in Stata.
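For instance, assuming the Late.dta file is saved in the current working directory (a minimal sketch; the path may need to be adjusted), the dataset can be opened and described with the following sequence:

* open the dataset, clearing any data currently in memory
use "Late.dta", clear
* describe the number of observations, variables, and their labels
desc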

Fig. 14.22 Description of the Late.dta dataset.

The dependent variable, which refers to whether or not the student arrives late to school, is qualitative and has only two categories, already labeled in the dataset as a dummy (No = 0; Yes = 1). The command tab offers the frequency distribution of a qualitative variable, highlighting the number of categories. If the researcher has any doubt regarding the number of categories, they can easily resort to this command. Fig. 14.23 presents the frequency distribution of the late dependent variable.

Fig. 14.23 Distribution of frequencies for the late variable.

It is common to discuss the need for equal frequencies between the reference category and the category that represents the event of interest when estimating binary logistic regression models. Unequal frequencies will indeed affect the probability of event occurrence of each observation in the sample, presented by means of Expression (14.11), and, consequently, the respective logarithmic likelihood function. However, since our objective is to estimate the probability of occurrence of an event of interest based on the maximization of the sum of the logarithmic likelihood function over the whole sample, respecting the dataset, there is no need for the frequencies of the two categories to be equal.

In relation to the qualitative explanatory variables, the per variable also has only two categories, already labeled in the dataset as a dummy (morning = 1; afternoon = 0). The style variable, on the other hand, has three categories and, therefore, it will be necessary to create (n − 1 = 2) dummies. The command xi i.style will provide these two dummies, named _Istyle_2 and _Istyle_3 by Stata. While Figs. 14.24 and 14.25 present the frequency distributions of the per and style variables, respectively, Fig. 14.26 presents the procedure to create the two dummies based on the style variable.

Fig. 14.24 Distribution of frequencies for the per variable.
Fig. 14.25 Distribution of frequencies for the style variable.
Fig. 14.26 Creation of two dummies based on style variable.
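As a side note, researchers using Stata 11 or later can skip the xi prefix altogether: factor-variable notation generates the dummies implicitly at estimation time. A minimal sketch (note that the coefficient labels in the output will differ from _Istyle_2 and _Istyle_3):

* i.style instructs Stata to treat style as categorical,
* automatically taking its first category as the reference
logit late dist sem per i.style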

Let’s now go to the modeling. To do this, we should type in the following command:

logit late dist sem per _Istyle_2 _Istyle_3

The command logit prepares a binary logistic regression estimated by maximum likelihood. If the researcher does not inform the confidence level desired to define the intervals of the estimated parameters, the default will be 95%. However, if the researcher wishes to alter the confidence level of the parameters to, for example, 90%, the following command should be typed:

logit late dist sem per _Istyle_2 _Istyle_3, level(90)

We now continue with the analysis, maintaining the default 95% confidence level for the parameter intervals. The outputs are displayed in Fig. 14.27 and are exactly equal to those calculated in Section 14.2.

Fig. 14.27 Binary logistic regression outputs in Stata.

Since binary logistic regression is part of a group of models known as Generalized Linear Models, and since the dependent variable presents a Bernoulli distribution, as discussed in Section 14.2.1, the estimation presented in Fig. 14.27 can equally be obtained by typing the following command:

glm late dist sem per _Istyle_2 _Istyle_3, family(binomial)

First, we can see that the maximum values of the logarithmic likelihood function for the complete model and for the null model are −29.06568 and −67.68585, respectively, exactly those calculated and presented in Figs. 14.4 and 14.7. Then, making use of Expression (14.17), we have that:

$\chi^2_{5\ \mathrm{d.f.}} = -2 \cdot [-67.68585 - (-29.06568)] = 77.24$, with $P$-value (or Prob. $\chi^2_{cal}$) $= 0.000$

Based on the χ2 test, then, we can reject the null hypothesis that all βj (j = 1, 2, …, 5) parameters are equal to zero at the 5% significance level; that is, at least one X variable is statistically significant to explain the probability of arriving late to school.
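This calculation can be verified directly with the display command, assuming the log-likelihood values shown in Fig. 14.27:

* chi-square statistic: -2 times the difference between the logarithmic
* likelihood values of the null and complete models
display -2*(-67.68585 - (-29.06568))
* corresponding P-value, with 5 degrees of freedom
display chi2tail(5, 77.24034)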

Even though the McFadden pseudo R2, as discussed, is quite limited in terms of interpretation, Stata calculates it based on Expression (14.16), exactly as we did in Section 14.2.2.

$\text{pseudo } R^2 = \dfrac{-2 \cdot (-67.68585) - [-2 \cdot (-29.06568)]}{-2 \cdot (-67.68585)} = 0.5706$
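Since the −2 factors cancel, McFadden's statistic reduces to 1 − LL/LL0 and can be checked directly:

* McFadden pseudo R2 = 1 - LL(complete model)/LL(null model)
display 1 - (-29.06568)/(-67.68585)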

By means of the maximization of the logarithmic likelihood function, we estimate the model parameters, which are exactly equal to those presented in Fig. 14.4. However, as discussed in Section 14.2.2, the variable _Istyle_2 (parameter β4) is not statistically significant to increase or decrease the probability of arriving late to school in the presence of the remaining explanatory variables, at the 5% significance level, since −1.96 < zβ4 = 1.80 < 1.96 and, therefore, the P-value of its Wald z statistic is greater than 0.05.

The nonrejection of the null hypothesis for parameter β4, at the 5% significance level, obliges us to reestimate the binary logistic regression model by means of the Stepwise procedure. However, before performing this procedure, we should save the results of the complete model by typing the following command:

lrtest, saving(0)

This command saves the parameter estimates of the complete model so that, further ahead, we can test the adjustment of the complete model against that of the final model estimated by means of the Stepwise procedure.
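In more recent Stata releases, the same workflow can be reproduced with the estimates store and lrtest commands; a minimal sketch, assuming the complete model has just been estimated (the name full is arbitrary):

* store the estimates of the complete model under the name full
estimates store full
* after estimating the restricted model, the comparison would be run as:
* lrtest full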

Let's now perform the Stepwise procedure by means of the following command, in which pr(0.05) defines the significance level of the Wald z test based on which explanatory variables will be excluded from the final model:

stepwise, pr(0.05): logit late dist sem per _Istyle_2 _Istyle_3

Our final model outputs are found in Fig. 14.28.

Fig. 14.28 Binary logistic regression outputs with stepwise procedure in Stata.

Analogously, the estimation presented in Fig. 14.28 can also be obtained by means of the following command:

stepwise, pr(0.05): glm late dist sem per _Istyle_2 _Istyle_3, family(binomial)

Before analyzing these new outputs, we can perform the likelihood-ratio test which, according to what was discussed in Section 14.2.2, compares the adjustment of the complete model with that of the final model estimated by the Stepwise procedure. To do this, we should type the following command:

lrtest

The result found in Fig. 14.29 is exactly equal to that calculated manually by means of Expression (14.19).

Fig. 14.29 Likelihood-ratio test for verification of the final model adjustment quality.

$\chi^2_{1\ \mathrm{d.f.}} = -2 \cdot [-30.80079 - (-29.06568)] = 3.47$, with $P$-value (or Prob. $\chi^2_{cal}$) $> 0.05$

By analyzing the likelihood-ratio test, we can see that the exclusion of _Istyle_2 from the final estimated model did not alter the adjustment quality, at the 5% significance level, making the model estimated by means of the Stepwise procedure preferable to the complete model estimated with all explanatory variables.

Another quite common test to verify the quality of the final adjusted model is the Hosmer-Lemeshow test. Its principle consists in dividing the dataset into 10 parts based on the deciles of the probabilities estimated by the last generated model and, from that point, performing a χ2 test to verify whether there are significant differences between the observed and expected frequencies in each of the 10 groups. To perform this test in Stata, we should type in the following command:

estat gof, group(10) table

where the term gof refers to the goodness-of-fit of the model, that is, the quality of the model adjustment.

The output for this test is found in Fig. 14.30.

Fig. 14.30 Hosmer-Lemeshow test for verification of the goodness-of-fit of the final model.

The results presented in Fig. 14.30 show the groups formed by the estimated probability deciles and the numbers of observed and expected observations per group, as well as the result of the χ2 test which, with 8 degrees of freedom, does not reject the null hypothesis that the expected and observed frequencies are equal, at the 5% significance level. Therefore, the final estimated model does not present problems in relation to the quality of the proposed adjustment.

In relation to this final estimated model (Fig. 14.28), all statistics presented, namely the estimated parameters with their respective confidence intervals, the standard errors, and the Wald z statistics, are exactly equal to those calculated for the final model in Sections 14.2.2 and 14.2.3. Therefore, for this model, we have that LLmax = −30.80079 and, consequently:

$\text{pseudo } R^2 = \dfrac{-2 \cdot (-67.68585) - [-2 \cdot (-30.80079)]}{-2 \cdot (-67.68585)} = 0.5449$

$\chi^2_{4\ \mathrm{d.f.}} = -2 \cdot [-67.68585 - (-30.80079)] = 73.77$, with $P$-value (or Prob. $\chi^2_{cal}$) $= 0.000$

Since the final model was estimated using the Stepwise procedure at the 5% significance level, all Wald z statistics are obviously lower than −1.96 or greater than 1.96 and, therefore, all their P-values are lower than 0.05.

As such, based on the outputs in Fig. 14.28, we can write the final estimated probability expression that a student i arrives late to school in the following way:

$p_i = \dfrac{1}{1 + e^{-(-30.933 + 0.204 \cdot dist_i + 2.920 \cdot sem_i - 3.776 \cdot per_i + 2.459 \cdot \_Istyle\_3_i)}}$

and, in this way, we can return to the first question asked at the end of Section 14.2.2:

What is the average estimated probability of arriving late to school when traveling 17 kilometers and going through 10 traffic lights, making the trip in the morning, and having a driving style considered aggressive?

The command mfx allows the researcher to answer this question directly. To do this, we should type the following command:

mfx, at(dist = 17 sem = 10 per = 1 _Istyle_3 = 1)

Obviously, the term _Istyle_2 = 0 does not need to be included in the mfx command, since the _Istyle_2 variable is not present in the final model. The output is given in Fig. 14.31, by means of which we arrive at the answer of 0.603 (60.3%), exactly the value calculated manually in Section 14.2.2.

Fig. 14.31 Calculation of the estimated probability for specified values of explanatory variables—command mfx.
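The same figure can be verified manually from the final estimated expression or, in more recent Stata versions, obtained with the margins command; a sketch assuming the rounded parameters presented above:

* manual calculation of the estimated probability for the specified profile
display 1/(1 + exp(-(-30.933 + 0.204*17 + 2.920*10 - 3.776*1 + 2.459*1)))
* equivalent calculation with margins (Stata 11 and later)
margins, at(dist = 17 sem = 10 per = 1 _Istyle_3 = 1)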

Also by means of Fig. 14.28, we can write the expressions for the lower (minimum) and upper (maximum) limits of the estimated probability that a student i arrives late to school, with 95% confidence. As such, we have:

$p_i^{min} = \dfrac{1}{1 + e^{-(-51.780 + 0.006 \cdot dist_i + 0.938 \cdot sem_i - 5.436 \cdot per_i + 0.226 \cdot \_Istyle\_3_i)}}$

$p_i^{max} = \dfrac{1}{1 + e^{-(-10.087 + 0.402 \cdot dist_i + 4.901 \cdot sem_i - 2.116 \cdot per_i + 4.692 \cdot \_Istyle\_3_i)}}$

Small differences in the third decimal place in relation to the parameters presented in Section 14.2.2 are due to rounding criteria.

While the logit command directs Stata to present the estimated parameters of the event occurrence probability expression, the logistic command makes the software present the odds ratios, that is, how the odds of occurrence of the event of interest change when the corresponding explanatory variable is altered by one unit, maintaining the remaining conditions constant. To do this, we will type in the following command:

logistic late dist sem per _Istyle_2 _Istyle_3

The outputs are presented in Fig. 14.32.

Fig. 14.32 Binary logistic regression outputs in Stata—command logistic to obtain odds ratios.

The only difference between the outputs in Fig. 14.32 (logistic command) and those presented in Fig. 14.27 (logit command) is that Stata now presents the odds ratios of each explanatory variable, calculated based on Expression (14.3). As for the rest, we can see that the Wald z statistics and their respective P-values are exactly the same as those presented in Fig. 14.27; in this way, it makes sense to prepare the Stepwise procedure for the logistic command as well. To do this, we will type the following command:

stepwise, pr(0.05): logistic late dist sem per _Istyle_2 _Istyle_3

The outputs are found in Fig. 14.33.

Fig. 14.33 Binary logistic regression with stepwise procedure in Stata—logistic command to obtain odds ratios.

Analogously, the outputs in Fig. 14.33 are the same as those presented in Fig. 14.28, with the exception of the odds ratios.

The estimations presented in Figs. 14.32 and 14.33 can also be obtained by means of the following commands:

glm late dist sem per _Istyle_2 _Istyle_3, family(binomial) eform
stepwise, pr(0.05): glm late dist sem per _Istyle_2 _Istyle_3, family(binomial) eform

in which the term eform of the glm command makes the output equivalent to that of the logistic command.

We can thus return to the last two questions asked at the end of Section 14.2.2:

On average, how much does the chance of arriving late to school change if a route 1 km longer is adopted, maintaining the remaining conditions constant?

Does a student considered aggressive present, on average, a higher chance of arriving late than another who is considered calm? If so, by how much is this chance increased, maintaining the remaining conditions constant?

The answers can now be given directly. The chance of arriving late to school when adopting a route 1 km longer is, on average and maintaining the remaining conditions constant, multiplied by a factor of 1.226 (22.6% higher). The chance of arriving late to school when a student is considered an aggressive driver, in relation to a calm one, is, on average and also maintaining the remaining conditions constant, multiplied by a factor of 11.693 (1069.3% higher). These values are exactly the same as those calculated manually at the end of Section 14.2.2.
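Both factors are simply the exponentials of the corresponding parameters of the final model in Fig. 14.28, as can be checked with:

* odds multiplier for one additional kilometer traveled
display exp(0.204)
* odds multiplier for an aggressive driver relative to a calm one
display exp(2.459)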

Having estimated the probability model, we can, by means of the predict phat command, generate a new variable (phat) in the dataset. This new variable corresponds to the expected (predicted) probability of event occurrence of each observation, calculated based on the parameters estimated in the last executed model.

For teaching purposes, we will prepare three distinct graphs that relate the dependent variable and the sem variable. These graphs are presented in Figs. 14.34–14.36. The commands to obtain each of them are the following:

Fig. 14.34 Linear adjustment between dependent and sem variables.
Fig. 14.35 Deterministic logistic adjustment between dependent and sem variables.
Fig. 14.36 Probabilistic logistic adjustment between dependent and sem variables.
graph twoway scatter late sem || lfit phat sem
graph twoway scatter late sem || mspline phat sem
graph twoway scatter phat sem || mspline phat sem

While the graph in Fig. 14.34 only presents the linear adjustment between the dependent and sem variables, which does not help the analysis much, the graph in Fig. 14.35 shows the logistic adjustment based on the estimated probabilities while still presenting the dependent variable in dichotomous form, which is why this graph is called the deterministic logistic adjustment. Finally, the graph in Fig. 14.36, even though similar to the previous one, shows how the probabilities of occurrence of the event of interest behave as a function of alterations in the sem variable, and it is therefore called the probabilistic logistic adjustment.

Based on the final estimated model, we can now prepare the sensitivity analysis for the proposed model, in accordance with what was presented in Section 14.2.4. To do this, we should type in the following command:

estat class

We will begin the sensitivity analysis with a cutoff of 0.5. We point out that the estat class command already uses, by default, a cutoff of 0.5. The generated output is found in Fig. 14.37, which corresponds exactly to Table 14.11.

Fig. 14.37 Sensitivity analysis (cutoff = 0.5).

As discussed in Section 14.2.4, we see that 86 observations were classified correctly for a cutoff of 0.5: 56 were events and were in fact classified as such, and another 30 were not events and were not classified as events. However, 14 observations were classified incorrectly: 3 were events but were not classified as such, and 11 were not events but were classified as being so.

Stata also offers the overall model efficiency (OME) in its outputs, also called Correctly Classified (overall percentage of hits), as well as the Sensitivity (percentage of hits considering only the observations that were actually events) and the Specificity (percentage of hits considering only the observations that were not events), for a 0.5 cutoff. Thus, we have, respectively:

$OME = \dfrac{56 + 30}{100} = 0.8600$

$\text{Sensitivity} = \dfrac{56}{59} = 0.9492$

$\text{Specificity} = \dfrac{30}{41} = 0.7317$

The table in Fig. 14.37 can also be obtained by means of typing the following sequence of commands. The outputs can be found in Fig. 14.38.

Fig. 14.38 Obtaining the classification table by a sequence of commands (cutoff = 0.5).
gen classlate = 1 if phat >=0.5
replace classlate = 0 if classlate ==.
tab classlate late
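The same sequence extends to any other cutoff; for instance, a sketch for a 0.3 cutoff (the variable name classlate3 is arbitrary):

* classify as event every observation whose estimated probability
* reaches the 0.3 cutoff
gen classlate3 = 1 if phat >= 0.3
replace classlate3 = 0 if classlate3 == .
tab classlate3 late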

Figs. 14.39 and 14.40 present the sensitivity analyses of the model for cutoff values equal to 0.3 and 0.7. Their classification tables correspond to Tables 14.12 and 14.13, respectively, presented in Section 14.2.4. The commands to obtain Figs. 14.39 and 14.40 are, respectively:

Fig. 14.39 Sensitivity analysis (cutoff = 0.3).
Fig. 14.40 Sensitivity analysis (cutoff = 0.7).
estat class, cutoff(0.3)
estat class, cutoff(0.7)

Since cutoff values vary between 0 and 1, it is operationally impossible to prepare a sensitivity analysis for each possible cutoff. It therefore makes sense, at this point, to prepare the sensitivity curve and the ROC curve for all cutoff possibilities. The commands to prepare each of them are, respectively:

lsens
lroc

While Figs. 14.14 and 14.15 (Section 14.2.4) present only part of the complete sensitivity and ROC curves (at that time, they were plotted considering only three cutoff values), Figs. 14.41 and 14.42 present the complete curves, respectively.

Fig. 14.41 Sensitivity curve.
Fig. 14.42 ROC curve.

An analysis of the sensitivity curve (Fig. 14.41) allows us to arrive at an approximate cutoff value that equates sensitivity and specificity. This cutoff, for our example, is approximately equal to 0.67. The biggest problem revealed by the graph concerns the behavior of the specificity curve. While the sensitivity curve presents high percentages of hits for most cutoff values (up to about 0.65), the same cannot be said of the specificity curve, which presents high percentages of hits only for a very small interval of cutoffs (only for cutoffs larger than about 0.75). In other words, while the percentage of hits for observations that will be events is high almost independently of the cutoff used, the percentage of hits for observations that will be nonevents is high only for a few cutoff values, which can hinder the overall efficiency of the model for prediction. This model, therefore, is good at indicating whether a student will in fact arrive late to school, but it does not present the same performance in predicting the nonevent, that is, in indicating that a student will not arrive late to school. In this last case, the model will commit more prediction errors for most cutoff values!
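If the researcher wishes to locate this equalizing cutoff more precisely, the lsens command can save the values underlying the curve in new variables; a minimal sketch, assuming the variable names cut, sens, and spec are not yet in use:

* save the cutoff, sensitivity, and specificity values used in the curve
lsens, genprob(cut) gensens(sens) genspec(spec)
* list the cutoffs where sensitivity and specificity nearly coincide
list cut sens spec if abs(sens - spec) < 0.02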

Thus, even though we have a model with high overall efficiency and explanatory variables that are statistically significant to compose the probability of occurrence of the event and nonevent, we could suggest the inclusion of new explanatory variables. This would eventually improve the prediction of those who will not arrive late to school and, in this way, the overall model efficiency, with a consequent increase in the area under the ROC curve. Even so, it is important to underscore that, for our example, the area under the ROC curve is 0.9378 (Fig. 14.42), which is considered very good for prediction!

14.4.2 Multinomial Logistic Regression in Stata

The example from Section 14.3 has, as the phenomenon to be studied, a qualitative variable with three categories (did not arrive late, arrived late to the first class, or arrived late to the second class). The dataset is found in the LateMultinomial.dta file and is exactly equal to that which was partially presented in Table 14.14. Following the same procedure adopted in Section 14.4.1, we will first type in the desc command so as to analyze the dataset characteristics, such as the number of observations, the number of variables, and the description of each. Fig. 14.43 presents these characteristics.

Fig. 14.43 Description of the LateMultinomial.dta dataset.

In this example, only two explanatory variables were considered (dist and sem), both quantitative. Fig. 14.44 presents the frequency distribution of the late dependent variable, which was obtained by typing the following command:

Fig. 14.44 Frequency distribution for variable late.

tab late

Having made these initial considerations, we now begin the actual modeling of the multinomial logistic regression. To do this, we type the following command:

mlogit late dist sem

The outputs are found in Fig. 14.45.

Fig. 14.45 Multinomial logistic regression outputs in Stata.

As we can see by looking at Fig. 14.45, the category adopted as reference by Stata is that with the highest frequency, that is, the category did not arrive late, as shown in Fig. 14.44. Coincidentally, this is exactly the category that we want as the reference and, therefore, nothing needs to be done in relation to an eventual change of reference category before estimating the model. However, a researcher interested in altering the reference category to, for example, the category arrived late to second class, should type the following command:

mlogit late dist sem, b(2)

We will continue with the analysis of the outputs obtained in Fig. 14.45.

First, we see that the maximum values of the logarithmic likelihood function for the complete model and the null model are −24.51180 and −101.01922, respectively, exactly those calculated and presented in Figs. 14.19 and 14.21. Thus, using Expression (14.41), we have that:

$\chi^2_{4\ \mathrm{d.f.}} = -2 \cdot [-101.01922 - (-24.51180)] = 153.01$, with $P$-value (or Prob. $\chi^2_{cal}$) $= 0.000$

Based on the χ2 test, then, we can reject the null hypothesis that all βjm (j = 1, 2; m = 1, 2) parameters are statistically equal to zero at the 5% significance level; that is, at least one X variable is statistically significant to explain the probability of occurrence of at least one of the events under study.

Stata also presents the McFadden pseudo R2, which is calculated based on Expression (14.40), exactly as we did in Section 14.3.2.

$\text{pseudo } R^2 = \dfrac{-2 \cdot (-101.01922) - [-2 \cdot (-24.51180)]}{-2 \cdot (-101.01922)} = 0.7574$
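As before, both statistics can be verified with display, assuming the log-likelihood values shown in Fig. 14.45:

* chi-square statistic and its P-value, with 4 degrees of freedom
display -2*(-101.01922 - (-24.51180))
display chi2tail(4, 153.01484)
* McFadden pseudo R2
display 1 - (-24.51180)/(-101.01922)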

As we can see, all Wald z statistics present values lower than zc = −1.96 or greater than zc = 1.96, according to what was already discussed in Section 14.3.2. Thus, still based on the outputs in Fig. 14.45, we can write the final expressions for the average estimated probabilities of occurrence of each of the three categories of the dependent variable, as well as the respective expressions of the lower (minimum) and upper (maximum) limits of these estimated probabilities, with 95% confidence.

Probability that student i does not arrive late (category 0):

$p_{i0} = \dfrac{1}{1 + e^{-33.135 + 0.559 \cdot dist_i + 1.670 \cdot sem_i} + e^{-62.292 + 1.078 \cdot dist_i + 2.895 \cdot sem_i}}$

Confidence interval (95%) of the estimated probability that student i does not arrive late (category 0):

$p_{i0}^{min} = \dfrac{1}{1 + e^{-57.014 + 0.082 \cdot dist_i + 0.539 \cdot sem_i} + e^{-91.055 + 0.486 \cdot dist_i + 1.550 \cdot sem_i}}$

$p_{i0}^{max} = \dfrac{1}{1 + e^{-9.256 + 1.035 \cdot dist_i + 2.800 \cdot sem_i} + e^{-33.529 + 1.671 \cdot dist_i + 4.239 \cdot sem_i}}$

Probability that student i arrives late to first class (category 1):

$p_{i1} = \dfrac{e^{-33.135 + 0.559 \cdot dist_i + 1.670 \cdot sem_i}}{1 + e^{-33.135 + 0.559 \cdot dist_i + 1.670 \cdot sem_i} + e^{-62.292 + 1.078 \cdot dist_i + 2.895 \cdot sem_i}}$

Confidence interval (95%) of the estimated probability that student i arrives late to first class (category 1):

$p_{i1}^{min} = \dfrac{e^{-57.014 + 0.082 \cdot dist_i + 0.539 \cdot sem_i}}{1 + e^{-57.014 + 0.082 \cdot dist_i + 0.539 \cdot sem_i} + e^{-91.055 + 0.486 \cdot dist_i + 1.550 \cdot sem_i}}$

$p_{i1}^{max} = \dfrac{e^{-9.256 + 1.035 \cdot dist_i + 2.800 \cdot sem_i}}{1 + e^{-9.256 + 1.035 \cdot dist_i + 2.800 \cdot sem_i} + e^{-33.529 + 1.671 \cdot dist_i + 4.239 \cdot sem_i}}$

Probability that student i arrives late to second class (category 2):

$p_{i2} = \dfrac{e^{-62.292 + 1.078 \cdot dist_i + 2.895 \cdot sem_i}}{1 + e^{-33.135 + 0.559 \cdot dist_i + 1.670 \cdot sem_i} + e^{-62.292 + 1.078 \cdot dist_i + 2.895 \cdot sem_i}}$

Confidence interval (95%) of the estimated probability that student i arrives late to second class (category 2):

$p_{i2}^{min} = \dfrac{e^{-91.055 + 0.486 \cdot dist_i + 1.550 \cdot sem_i}}{1 + e^{-57.014 + 0.082 \cdot dist_i + 0.539 \cdot sem_i} + e^{-91.055 + 0.486 \cdot dist_i + 1.550 \cdot sem_i}}$

$p_{i2}^{max} = \dfrac{e^{-33.529 + 1.671 \cdot dist_i + 4.239 \cdot sem_i}}{1 + e^{-9.256 + 1.035 \cdot dist_i + 2.800 \cdot sem_i} + e^{-33.529 + 1.671 \cdot dist_i + 4.239 \cdot sem_i}}$

Having estimated the probability expressions, we will generate, in the dataset, three variables corresponding to the expressions of the average probability of occurrence of each of the events, by typing in the following commands:

Generation of the variable referring to the probability that student i does not arrive late (category 0):

     gen pi0 = (1) / (1 + (exp(-33.13523 + .558829*dist + 1.669908*sem)) + (exp(-62.29224 + 1.078369*dist + 2.894861*sem)))

Generation of the variable referring to the probability that student i arrives late to the first class (category 1):

     gen pi1 = (exp(-33.13523 + .558829*dist + 1.669908*sem)) / (1 + (exp(-33.13523 + .558829*dist + 1.669908*sem)) + (exp(-62.29224 + 1.078369*dist + 2.894861*sem)))

Generation of the variable referring to the probability that student i arrives late to the second class (category 2):

     gen pi2 = (exp(-62.29224 + 1.078369*dist + 2.894861*sem)) / (1 + (exp(-33.13523 + .558829*dist + 1.669908*sem)) + (exp(-62.29224 + 1.078369*dist + 2.894861*sem)))

We can see that these new variables (pi0, pi1, and pi2) are identical to those obtained when we elaborated Fig. 14.19 (Excel Solver); in that case, they were presented in columns J, K, and L, respectively. Variables pi0, pi1, and pi2 could also be generated in the dataset directly with the command predict pi0 pi1 pi2, typed after the model estimation (command mlogit).

Having generated the new variables, we are now able to construct two interesting graphs, from which some conclusions can be drawn. While the first graph (Fig. 14.46) shows the behavior of the occurrence probabilities of each of the events as a function of the distance traveled to school, the second graph (Fig. 14.47) shows the behavior of these probabilities as a function of the number of traffic lights each student goes through. The commands to generate these graphs are, respectively:

Fig. 14.46 Occurrence probabilities for each event × distance traveled.
 graph twoway mspline pi0 dist || mspline pi1 dist || mspline pi2 dist ||, legend(label(1 "did not arrive late") label(2 "arrived late to first class") label(3 "arrived late to second class"))
 graph twoway mspline pi0 sem || mspline pi1 sem || mspline pi2 sem ||, legend(label(1 "did not arrive late") label(2 "arrived late to first class") label(3 "arrived late to second class"))

The graph in Fig. 14.46 shows that there are differences in the probabilities of arriving late to the first or second class in relation to not arriving late when the distance traveled to school varies. We can also see that, up to a distance of about 20 km, the differences between the probabilities of arriving late to the first or second class are small; the highest probability by far is that of not arriving late. On the other hand, for travel distances greater than about 20 km, the probability of arriving late to the second class increases considerably in relation to the probability of arriving late to the first class. Besides this, from this distance on, the probability of not arriving late to school falls considerably. This explains the fact that the dist variable is statistically significant, at the 5% significance level, for both model logits, with the category of not arriving late considered as reference. We can also note that, independent of the distance traveled, the probability of arriving late to the first class is always the lowest among the three occurrence possibilities and hardly presents considerable alterations with changes in distance. As such, if, for example, we prepared a logistic regression with only two categories (binary), with the event of interest represented by the category of arriving late to the first class (dummy = 1), we would see that the dist variable would not be statistically significant, at the 5% significance level, to explain the probability of arriving late to the first class, as suggested by the analysis of the graph in Fig. 14.46.

The analysis of Fig. 14.47, which shows the differences in the probabilities of arriving late to the first or second class in relation to not arriving late when varying the number of traffic lights gone through on the trip to school, reveals that, up to approximately 12 traffic lights, the probability of arriving late to school is practically null. Beyond this number, however, the probability of arriving late rises considerably, especially the probability of arriving late to the first class. Nonetheless, for quantities above approximately 17 traffic lights, the probability of arriving late to the second class becomes the highest among the three possibilities of event occurrence, becoming almost absolute for quantities greater than 18 traffic lights. The behavior of these probabilities explains the fact that the sem variable is statistically significant, at the 5% significance level, for both model logits, with the category of not arriving late considered as reference, that is, to explain the behavior of the occurrence probabilities of each of the three categories of the dependent variable.

Fig. 14.47 Occurrence probabilities for each event × number of traffic lights.

Last, but not least, we will, as we did in Section 14.4.1, estimate the model requesting the chances of occurrence of each of the events of interest when the corresponding explanatory variable is altered by one unit, maintaining all remaining conditions constant. In multinomial logistic regression models, as discussed in Section 14.3.2, the odds ratio is also called the relative risk ratio. As such, we should type the following command:

mlogit late dist sem, rrr

where the term rrr refers exactly to the term relative risk ratio. The outputs are presented in Fig. 14.48.

Fig. 14.48 Multinomial logistic regression outputs in Stata—relative risk ratios.

The outputs in Fig. 14.48 are the same as those presented in Fig. 14.45, with exception of the relative risk ratios. Therefore, we can return to the last two questions asked at the end of Section 14.3.2:

On average, how much does the chance of arriving late to the first class, in relation to not arriving late to school, change when adopting a route that is 1 kilometer longer, maintaining the remaining conditions constant?

On average, how much does the chance of arriving late to the second class, in relation to not arriving late to school, change when going through 1 more traffic light, maintaining the remaining conditions constant?

The answers can now be given directly. The chance of arriving late to the first class, in relation to not arriving late to school, when taking a route that is 1 kilometer longer is, on average and maintaining the remaining conditions constant, multiplied by a factor of 1.749 (74.9% higher). The chance of arriving late to the second class, in relation to not arriving late, when going through 1 more traffic light on the route to school is, on average, multiplied by a factor of 18.081 (1708.1% higher), also maintaining the remaining conditions constant. These values are exactly the same as those calculated manually at the end of Section 14.3.2.
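Once again, these factors correspond to the exponentials of the estimated parameters in Fig. 14.45, as can be checked with:

* relative risk ratio of dist for the category arrived late to first class
display exp(0.559)
* relative risk ratio of sem for the category arrived late to second class
display exp(2.895)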

Stata's capability to estimate models and execute statistical tests is enormous. We believe that what was presented here can be considered essential for researchers who intend to apply binary and multinomial logistic regression techniques correctly.

We will now go on to the solution of the same examples using SPSS.

14.5 Estimation of Binary and Multinomial Logistic Regression Models in SPSS

We will now present the step-by-step process for preparing our examples by means of the IBM SPSS Statistics Software. The reproduction of the images used in this section has the full authorization of the International Business Machines Corporation©.

Our objective is neither to discuss again the concepts inherent to the techniques nor to repeat what was explored in the previous sections. The main objective of this section is to provide the researcher with the opportunity to execute binary and multinomial logistic regressions in SPSS, given the user friendliness and practicality with which this software performs its operations and presents itself to the user. With each presentation of an output, we will mention the respective result obtained when the techniques were prepared in Excel and Stata, so as to allow the researcher to compare them and, in this way, decide which software to use as a function of its individual characteristics and availability.

14.5.1 Binary Logistic Regression in SPSS

Following the same logic proposed when applying the models in Stata, we now go to the dataset built by our professor based on the questionnaires given to the 100 students. The data can be found in the Late.sav file and, after opening it, we will first click on Analyze → Regression → Binary Logistic …. The dialog box pictured in Fig. 14.49 will open.

Fig. 14.49 Dialog box for estimation of binary logistic regression in SPSS.

We should choose the late variable and include it in the Dependent box. The remaining variables should be chosen and inserted into the Covariates box. We will maintain, at this time, the Method: Enter option. The Enter procedure, contrary to the Stepwise procedure (for binary logistic regression, SPSS offers the analogous procedure known as Forward Wald), includes all variables in the estimation, even those whose parameters are statistically equal to zero, and corresponds exactly to the standard procedure prepared in Excel (complete model presented in Fig. 14.4) and in Stata when the logit command is applied directly. Fig. 14.50 presents the SPSS dialog box with the definition of the dependent variable and of the explanatory variables to be inserted into the model.

Fig. 14.50 Dialog box for estimation of binary logistic regression in SPSS with inclusion of dependent variable and explanatory variables and selecting the enter procedure.

In case the dataset did not present the dummy variables corresponding to the categories of the style variable, we could select the Categorical … button and include the original variable (style) in this option, together with the definition of the reference category. Since we already have the two dummies (style2 and style3), there is no need for this procedure.

Under the Options … button, we will select only the Iteration history and CI for exp(B) options, which correspond, respectively, to the iteration history of the maximization of the sum of the logarithmic likelihood function and to the confidence intervals of the odds ratios of each parameter. The dialog box opened by clicking on this option is presented in Fig. 14.51, with the mentioned options already selected.

Fig. 14.51 Options for estimation of binary logistic regression in SPSS.

We see, by means of Fig. 14.51, that the standard cutoff used by SPSS is 0.5; it is in this dialog box that the researcher can alter this value in order to elaborate classifications of the observations in the dataset and predictions for other observations. In the Options … dialog box, we can also impose that parameter α be equal to zero (by disabling the Include constant in equation option) and alter the significance level based on which the parameter of a given explanatory variable is considered statistically equal to zero (Wald z test) and, therefore, the variable is excluded from the final model when using the Stepwise procedure. We will maintain the 5% standard for all significance levels and keep the constant in the model (Include constant in equation option selected).

We can now select Continue and OK. The generated outputs are presented in Fig. 14.52.

Fig. 14.52 Binary logistic regression outputs in SPSS—enter procedure.

Fig. 14.52 gives only the most important results for the analysis of the binary logistic regression. We will not analyze all the generated outputs again, since they are exactly equal to those obtained when estimating the binary logistic regression in Excel and Stata. It is worth mentioning that, while Stata presents the maximum value of the sum of the logarithmic likelihood function, SPSS presents twice this value with the sign inverted. So, while we obtained an LL of −67.68585 for the null model (as can be verified in Figs. 14.7 and 14.27) and of −29.06568 for the complete model (Figs. 14.4 and 14.27), SPSS presents a −2LL value equal to 135.372 for the null (initial) model and a −2LL equal to 58.131 for the complete model.

The other difference between the outputs generated by Stata and SPSS concerns the pseudo R². While Stata presents the McFadden pseudo R² already calculated, SPSS presents the Cox & Snell pseudo R² and the Nagelkerke pseudo R², whose calculations can be obtained by means of Expressions (14.45) and (14.46).

$\text{pseudo } R^2_{Cox\,\&\,Snell} = 1 - \left(\dfrac{e^{LL_0}}{e^{LL}}\right)^{\frac{2}{N}}$  (14.45)

$\text{pseudo } R^2_{Nagelkerke} = \dfrac{1 - \left(\dfrac{e^{LL_0}}{e^{LL}}\right)^{\frac{2}{N}}}{1 - \left(e^{LL_0}\right)^{\frac{2}{N}}} = \dfrac{\text{pseudo } R^2_{Cox\,\&\,Snell}}{1 - \left(e^{LL_0}\right)^{\frac{2}{N}}}$  (14.46)

Therefore, for our example, we have that:

$\text{pseudo } R^2_{Cox\,\&\,Snell} = 1 - \left(\dfrac{e^{-67.68585}}{e^{-29.06568}}\right)^{\frac{2}{100}} = 0.538$

$\text{pseudo } R^2_{Nagelkerke} = \dfrac{0.538}{1 - \left(e^{-67.68585}\right)^{\frac{2}{100}}} = 0.725$
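Since these two statistics depend only on LL0, LL, and N, they can also be verified in Stata (used in the previous section), assuming the values above:

* Cox & Snell pseudo R2
display 1 - exp(2*(-67.68585 - (-29.06568))/100)
* Nagelkerke pseudo R2
display (1 - exp(2*(-67.68585 - (-29.06568))/100)) / (1 - exp(2*(-67.68585)/100))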

Similar to the McFadden pseudo R², these two statistics are limited for the analysis of the model's predictive power and, therefore, as has been discussed, a sensitivity analysis should be prepared with this end in mind.

The remaining results are equal to those obtained in Excel (Section 14.2) and Stata (Section 14.4). However, since the parameter of the style2 variable did not show itself to be statistically different from zero, at the 5% significance level, we will go on to the final model estimation by means of the Forward Wald (Stepwise) procedure. To execute it, we should select Method: Forward: Wald in the main binary logistic regression dialog box in SPSS, as shown in Fig. 14.53.

Fig. 14.53 Dialog box with forward Wald procedure selection.

In the Options … button, besides the options previously marked, let’s now select the Hosmer-Lemeshow goodness-of-fit option, as shown in Fig. 14.54. This being done, we should click on Continue.

Fig. 14.54 Selection of Hosmer-Lemeshow test for verification of the final model adjustment quality.

The Save … button, finally, allows the variables referring to the estimated probability of event occurrence and to the classification of each observation, based on the estimated probability and the previously defined cutoff, to be generated in the actual dataset. By clicking on this option, a dialog box will open, as shown in Fig. 14.55. We should choose the Probabilities and Group membership options (in Predicted Values).

Fig. 14.55 Dialog box for generation of variables referent to estimated probability of event occurrence and classification of each observation.

By clicking on Continue and then OK, new outputs will be generated, as shown in Fig. 14.56. Notice that, besides the outputs, two new variables, named PRE_1 and PGR_1, have been created in the original dataset. They correspond, respectively, to the estimated event occurrence probabilities and to the respective classifications based on the 0.5 cutoff. Notice that the PRE_1 variable is exactly equal to that presented in the pi column of Fig. 14.12 generated in Excel and to the phat variable generated in Stata after the model estimation presented in Fig. 14.28.

Fig. 14.56 Binary logistic regression outputs in SPSS–Forward Wald procedure (Stepwise).

The first output generated (Iteration History) presents the values of the likelihood function in each step of the model prepared by the Forward Wald procedure, which is equivalent to the Stepwise procedure. We see that the final value of −2LL is equal to 61.602, that is, LL = −30.801, exactly equal to the value obtained in the Excel (Fig. 14.12) and Stata (Fig. 14.28) modeling. The Model Summary output also presents this statistic, based on which it is possible to calculate the χ2 statistic that evaluates the existence of at least one statistically significant parameter to explain the probability of occurrence of the event under study. The Omnibus Tests of Model Coefficients output presents this statistic (χ2 = 73.77, Sig. χ2 = 0.000 < 0.05), already calculated manually in Section 14.2.2 and also presented in Fig. 14.28, by means of which we can reject the null hypothesis that all βj (j = 1, 2, …, 5) parameters are statistically equal to zero, at the 5% significance level. Hence, at least one X variable is statistically significant to explain the probability of arriving late to school and, therefore, we have a binary logistic regression model that is statistically significant for prediction purposes.

Next, the results of the Hosmer and Lemeshow Test are presented, as well as the respective contingency table that shows, based on the groups formed by the estimated probability deciles, the expected and observed frequencies of observations per group. By analyzing the test result (for step 4, χ2 = 6.341, Sig. χ2 = 0.609 > 0.05), already presented by means of Fig. 14.30 when prepared in Stata, we cannot reject the null hypothesis that the expected and observed frequencies are equal, at the 5% significance level, and, therefore, the final estimated model does not present problems in relation to the quality of the proposed adjustment.

The Classification Table presents the step-by-step evolution of the observation classifications. For the final model (step 4), we obtained a specificity of 73.2%, a sensitivity of 94.9%, and an overall model efficiency of 86.0%, for a 0.5 cutoff. These values correspond to those obtained in Table 14.11 and presented in Fig. 14.37. The crosstabulation can also be obtained by clicking on Analyze → Descriptive Statistics → Crosstabs …. In the dialog box that opens, we should insert the PGR_1 variable (predicted group) in Row(s) and the late variable in Column(s) and then click on OK. While Fig. 14.57 shows the dialog box, Fig. 14.58 presents the actual crosstabulation.

Fig. 14.57 Dialog box for elaboration of crosstabulation.
Fig. 14.58 Crosstabulation.

Returning to the analysis of the outputs in Fig. 14.56, the Forward Wald (Stepwise) procedure prepared by SPSS shows, step by step, the models that were developed, beginning with the inclusion of the most significant variable (largest Wald z statistic among the explanatory variables) up to the inclusion of that with the smallest Wald z statistic, yet still with Sig. z < 0.05. As important as the analysis of the variables included in the final model is the list of excluded variables (Variables not in the Equation). We can see that, with only the explanatory variable per included in model 1, the list of excluded variables includes all the others. If, at a given step, there is any excluded explanatory variable that presents itself as significant (Sig. z < 0.05), as occurs, for example, with the sem variable, this variable will be included in the next step (model 2). The variable remaining on this list, for our example, is the style2 variable, as discussed when preparing the regression in Excel and Stata, and the final model (model 4 of the Forward Wald procedure), which is exactly that presented in Figs. 14.12 and 14.28, includes the dist, sem, per, and style3 explanatory variables. As such, based on the Variables in the Equation output (step 4) of Fig. 14.56, we can write the final estimated expression of the probability that a student i arrives late to school:

$p_i = \dfrac{1}{1 + e^{-(-30.933 + 0.204 \cdot dist_i + 2.920 \cdot sem_i - 3.776 \cdot per_i + 2.459 \cdot style3_i)}}$

The Variables in the Equation output also presents the odds ratios of each estimated parameter (Exp(B)), which correspond to those obtained with the Stata logistic command (Fig. 14.33), with the respective confidence intervals. In case we wanted to obtain the confidence intervals of the parameters, instead of those referring to the odds ratios, we should not have selected the CI for exp(B) option in the Options … dialog box (Fig. 14.54).

Finally, we can construct the ROC curve in SPSS. To do this, after the final model estimation, we should click on Analyze → ROC Curve …. A dialog box like the one presented in Fig. 14.59 will open. We should insert the PRE_1 variable (predicted probability) in Test Variable and the late variable in State Variable, with a value equal to 1 in Value of State Variable. Besides this, in Display, we should select the ROC Curve and With diagonal reference line options. Next, we should click on OK.

Fig. 14.59 Dialog box for construction of ROC curve.

The ROC curve can be seen in Fig. 14.60.

Fig. 14.60 ROC curve.

As we discussed when analyzing Fig. 14.42, the area under the ROC curve, of 0.938, is considered very good in terms of the quality of the model for predicting the occurrence of the event for new observations.

14.5.2 Multinomial Logistic Regression in SPSS

We will now prepare the multinomial logistic regression model in SPSS using the same example of Sections 14.3 and 14.4.2. The data can be found in the LateMultinomial.sav file and, after opening it, we will click on Analyze → Regression → Multinomial Logistic …. The dialog box in Fig. 14.61 will open.

Fig. 14.61 Dialog box for estimation of multinomial logistic regression in SPSS.

Let's include the late variable in Dependent and the quantitative explanatory variables dist and sem in the Covariate(s) box. The Factor(s) box should be filled in with qualitative explanatory variables, which does not apply to our example. Fig. 14.62 presents this dialog box duly completed.

Fig. 14.62 Dialog box for estimation of multinomial logistic regression in SPSS with inclusion of dependent variable and explanatory variables.

Notice that we should define the reference category of the dependent variable. As such, in Reference Category …, we should select the First Category option, since the did not arrive late category presents values equal to zero in the dataset (Fig. 14.63). We could also select the Custom option, with Value equal to 0. This last option is most useful when the researcher is interested in making a given intermediate category of the dependent variable the model's reference category.

Fig. 14.63 Definition of dependent variable reference category.

By clicking on Continue, we can proceed with the model estimation. Under the Statistics … button, we should click on Case processing summary and, in Model, select the Pseudo R-Square, Step summary, Model fitting information, and Classification table options. Finally, in Parameters, let's select the Estimates option. Fig. 14.64 shows this dialog box.

Fig. 14.64 Dialog box for selection of multinomial logistic regression statistics.

Finally, after clicking on Continue, we should select Save …. In this dialog box, let's select the Estimated response probabilities and Predicted category options, as shown in Fig. 14.65. This procedure generates, for each sample observation, the occurrence probabilities of each of the three dependent variable categories and the predicted classification of each observation, defined based on these probabilities. Four new variables will then be generated in the dataset (EST1_1, EST2_1, EST3_1, and PRE_1).

Fig. 14.65 Dialog box for generation of variables referent to the estimated occurrence probabilities for each category and the classification of each observation.

Next, let’s click on Continue and then OK. The generated outputs can be found in Fig. 14.66.

Fig. 14.66 Multinomial logistic regression outputs in SPSS.

By means of the outputs in Fig. 14.66, we can see, based on the χ2 test (χ2 = 153.01, Sig. χ2 = 0.000 < 0.05, presented in the Model Fitting Information output), that the null hypothesis that all βjm (j = 1, 2; m = 1, 2) parameters are statistically equal to zero can be rejected at the 5% significance level; that is, at least one X variable is statistically significant to explain the probability of occurrence of at least one of the events under study. The Pseudo R-Square output now presents, unlike in the binary logistic regression, the McFadden pseudo R². The value of this statistic, just like the χ2 statistic, is exactly equal to that calculated manually in Section 14.3.2 and presented in Fig. 14.45 when estimating the model in Stata.

The final model can be obtained by means of the Parameter Estimates output and is exactly the same as that presented in Fig. 14.19 and obtained by means of the Stata mlogit command (Fig. 14.45). Based on this output, we can write the average estimated occurrence probability for each of the events represented by the dependent variable categories, namely:

Probability of a student i not arriving late (category 0):

$p_{i0} = \dfrac{1}{1 + e^{-33.135 + 0.559 \cdot dist_i + 1.670 \cdot sem_i} + e^{-62.292 + 1.078 \cdot dist_i + 2.895 \cdot sem_i}}$

Probability of a student i arriving late to the first class (category 1):

$p_{i1} = \dfrac{e^{-33.135 + 0.559 \cdot dist_i + 1.670 \cdot sem_i}}{1 + e^{-33.135 + 0.559 \cdot dist_i + 1.670 \cdot sem_i} + e^{-62.292 + 1.078 \cdot dist_i + 2.895 \cdot sem_i}}$

Probability of a student i arriving late to the second class (category 2):

$p_{i2} = \dfrac{e^{-62.292 + 1.078 \cdot dist_i + 2.895 \cdot sem_i}}{1 + e^{-33.135 + 0.559 \cdot dist_i + 1.670 \cdot sem_i} + e^{-62.292 + 1.078 \cdot dist_i + 2.895 \cdot sem_i}}$

This same output also presents the relative risk ratios (Exp(B)) of each estimated parameter, which correspond to those obtained by means of the rrr option of the Stata mlogit command (Fig. 14.48), with the respective confidence intervals.

Finally, the classification table (Classification output) shows, based on the highest estimated probability (pi0, pi1, or pi2) of each observation, the predicted and observed classifications for each dependent variable category. As such, in accordance with what was presented in Table 14.18, we arrive at a model that presents an overall percentage of hits of 89.0% (overall efficiency), with a percentage of hits of 95.9% when there is an indication of not arriving late to school, of 75.0% when there is an indication of arriving late to the first class, and of 85.7% when the model indicates a late arrival to the second class.

14.6 Final Remarks

Estimation by maximum likelihood, even though little known by a great number of researchers, is quite useful for estimating parameters when a given dependent variable presents itself, for example, in a qualitative manner.

The most adequate situation for applying binary logistic regression is when the phenomenon to be studied presents itself in a dichotomous way and the researcher intends to estimate an expression of the probability of occurrence of an event defined between two possibilities as a function of given explanatory variables. The binary logistic regression model can be considered a particular case of the multinomial logistic regression model, whose dependent variable is also qualitative, however with more than two event categories, and an occurrence probability expression is estimated for each category.

The development of any confirmatory model should be done through the correct and conscientious use of the chosen modeling software, based on the underlying theory and on the experience and intuition of the researcher.

14.7 Exercises

  (1) A lending institution that provides credit to individuals wants to evaluate the probability that its clients default on their payment obligations (probability of default). By means of a dataset of 2000 observations of clients who have recently received credit, the institution intends to estimate a binary logistic regression model using, as explanatory variables, the age, gender (female = 0; male = 1), and monthly income ($) of each individual. The dependent variable refers to the actual default (not default = 0; default = 1). The Default.sav and Default.dta files contain these data. By means of the estimation of a binary logistic regression model, the following is asked:
    (a) Analyze the significance level of the χ2 test. Is at least one of the variables (age, gender, and income) statistically significant to explain the probability of default, at the 5% significance level?
    (b) If the answer to the previous item is yes, analyze the significance level of each explanatory variable (Wald z tests). Is each of them statistically significant to explain the probability of default, at the 5% significance level?
    (c) What is the final estimated equation for the average probability of default?
    (d) On average, do individuals of the male gender present a higher probability of default on credit obtained for personal consumption, maintaining the remaining conditions constant?
    (e) On average, do older individuals tend to present a higher probability of default on credit acquired for consumption, maintaining the remaining conditions constant?
    (f) What is the average estimated probability of default for a 37-year-old male with a monthly income of $6850.00?
    (g) On average, by how much does the chance of default change when each explanatory variable is increased by one unit, maintaining the remaining conditions constant?
    (h) What is the overall model efficiency for a cutoff of 0.5? What are the sensitivity and specificity for this same cutoff?
  (2) With the idea of studying client fidelity, a supermarket group conducted research with 3000 consumers at the time they were paying for their purchases. Since the fidelity of a given customer can be measured based on their return to the establishment, with items paid for, within one year of the date of the previous purchase, monitoring is easy by means of their Social Security number. As such, if the SS number of a given customer is in the store dataset, but no purchase is made under this same SS number within the period of a year, this customer will be classified as having no fidelity to the establishment. On the other hand, if the SS number of another customer is in the store dataset and is identified in another purchase within the period of one year, this customer will be classified in the establishment fidelity category. So as to stipulate the criteria that increase the probability that a customer presents fidelity to the establishment, the supermarket group collected the following variables for each of the 3000 customers and then followed them for a year from the date of that specific purchase.
id: variable that substitutes the SS due to security measures; it is a string variable, varies between 0001 and 3000, and will not be used in the modeling
fidelity: dependent binary variable that corresponds to the fact of the customer returning (or not) to the store to effect a new purchase within a period of less than a year (No = 0; Yes = 1)
gender: customer gender (female = 0; male = 1)
age: customer age (years)
service: qualitative variable with five categories corresponding to the perceived level of service provided by the establishment at the time of the original purchase (terrible = 1; bad = 2; regular = 3; good = 4; excellent = 5)
assortment: qualitative variable with five categories corresponding to the perceived quality of the assortment of goods offered by the establishment at the time of the original purchase (terrible = 1; bad = 2; regular = 3; good = 4; excellent = 5)
accessibility: qualitative variable with five categories corresponding to the perceived quality of access to the establishment, such as parking and access to the sales area (terrible = 1; bad = 2; regular = 3; good = 4; excellent = 5)
price: qualitative variable with five categories corresponding to the perception of product prices in relation to the competition at the time of the original purchase (terrible = 1; bad = 2; regular = 3; good = 4; excellent = 5)

By means of an analysis of the dataset found in the Fidelity.sav and Fidelity.dta files, answer the following (a sketch of supporting Stata commands appears after the list):

  (a) Estimate the complete binary logistic regression model with the explanatory variables for the individual (gender and age) and the (n − 1) dummies corresponding to the n categories of each qualitative variable. Do any of these variables or categories show themselves to be statistically nonsignificant in explaining the probability of the event occurrence (fidelity to the establishment), at the 5% significance level?
  (b) If the answer to the previous item is yes, estimate the probability expression for the event occurrence by means of the Stepwise procedure.
  (c) What is the overall model efficiency with a 0.5 cutoff?
  (d) Wanting to establish a criterion that equates the probability of correctly classifying customers who show fidelity to the establishment with the probability of correctly classifying those who do not, the company's marketing director analyzed the model sensitivity curve. What is the approximate cutoff that equates these two probabilities?
  (e) For the final estimated model, how do the chances of fidelity behave, on average, for customers who rated service as bad, regular, good, and excellent, in relation to service considered terrible, maintaining the remaining conditions constant?
  (f) Repeat the previous item, now separately for the assortment, accessibility, and price variables.
  (g) Based on the analysis of the chances, the establishment desires to invest in a single perceptible variable to increase the probability that customers become faithful, causing them to change negative perceptions into positive ones. Which variable would this be?
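A minimal sketch of supporting Stata commands, assuming the variable names from the table above appear as-is in Fidelity.dta; the xi: prefix expands each i. term into the corresponding (n − 1) dummies:

use "Fidelity.dta", clear
xi: logit fidelity gender age i.service i.assortment i.accessibility i.price                 // (a) complete model
xi: stepwise, pr(0.05): logit fidelity gender age i.service i.assortment i.accessibility i.price   // (b) Stepwise procedure
estat classification    // (c) overall efficiency at the 0.5 cutoff
lsens                   // (d) sensitivity and specificity as a function of the cutoff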
(3) The Health Department of a determined country wants to launch a campaign to improve citizens' LDL cholesterol (mg/dL) levels by encouraging the practice of physical activities and the reduction of tobacco use. To this end, research on 2304 individuals was conducted, from which the following variables were collected.
Variable      Description
cholesterol   LDL cholesterol index (mg/dL)
cigarette     Dummy variable corresponding to whether or not the individual smokes (nonsmoker = 0; smoker = 1)
sport         Number of times physical activity is practiced weekly

Since the cholesterol index is subsequently classified according to reference values, the Health Department intends to inform the population about the benefits of practicing physical activities and of tobacco abstinence in improving this classification. As such, the cholesterol variable will be transformed into the colestquali variable, described below, which presents five categories and will be the dependent variable of the model whose results will be published by the Health Department.

Variable      Description
colestquali   Classification of the LDL cholesterol (mg/dL) level:
              Very high: above 189 mg/dL (reference category)
              High: 160–189 mg/dL
              Borderline: 130–159 mg/dL
              Near optimal: 100–129 mg/dL
              Optimal: below 100 mg/dL

The dataset for this research is found in the Colestquali.sav and Colestquali.dta files. By means of a multinomial logistic regression model estimation with cigarette and sport as explanatory variables, answer the following (a sketch of supporting Stata commands appears after the list):

  (a) Present the table of frequencies for the dependent variable categories.
  (b) By means of the multinomial logistic regression model estimation, is it possible to state that at least one of the explanatory variables is statistically significant to compose the probability expression of the occurrence of at least one of the proposed LDL cholesterol index classifications, at the 5% significance level?
  (c) What are the final estimated equations for the average occurrence probabilities of the proposed LDL cholesterol index classifications?
  (d) What are the occurrence probabilities of each of the proposed classifications for an individual who does not smoke and who practices sporting activities once a week?
  (e) Based on the estimated model, prepare a graph of the occurrence probability of each event represented by the dependent variable as a function of the number of times physical activities are practiced weekly. From what weekly frequency of physical activity onward does the probability that the LDL cholesterol index reaches the near optimal or optimal levels increase considerably?
  (f) On average, how much does the chance of having a cholesterol index considered high, in relation to a level considered very high, change when the number of weekly physical activities is increased by one unit, maintaining all remaining conditions constant?
  (g) On average, how much does the chance of having a cholesterol index considered optimal, in relation to a level considered near optimal, change when an individual quits smoking, maintaining all remaining conditions constant?
  (h) Prepare the classification table based on the estimated probabilities for each observation in the sample (predicted and observed classifications for each category of the dependent variable).
  (i) What is the overall model efficiency? What is the percentage of hits for each category of the dependent variable?
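A minimal sketch of supporting Stata commands, assuming colestquali in Colestquali.dta is coded from 1 (very high) to 5 (optimal); the baseoutcome() option should be adjusted if the actual coding differs:

use "Colestquali.dta", clear
tab colestquali                                      // (a) table of frequencies
mlogit colestquali cigarette sport, baseoutcome(1)   // (b), (c): "very high" as reference category
mlogit, rrr                                          // (f), (g): redisplay with relative risk ratios (chances)
predict p1 p2 p3 p4 p5                               // (d), (h): estimated probabilities per category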

Appendix: Probit Regression Models

A.1 A Brief Introduction

Probit regression models, whose name is a contraction of probability unit, can be used as an alternative to binary logistic regression models in cases where the probability curve of occurrence of the event of interest adheres more adequately to the cumulative distribution function of the standard normal distribution.

The idea of probit regression was initially conceived by Bliss (1934a,b) who, in performing experiments with the goal of discovering the effectiveness of pesticides against insects that fed on grape leaves, graphically represented the insects' response for different levels of pesticide concentration. Since the relation discovered between the pesticide dose and the response followed a sigmoid function (or S curve), Bliss opted, at that time, to transform the sigmoid dose-response curve into a linear expression, following the well-known linear regression model. Two decades later, Finney (1952), building on the Bliss experiments, made relevant contributions in a book titled Probit Analysis. Still today, probit regression models are widely used to understand dose-response relations in which the probability curve for the occurrence of the event of interest, represented by a binary variable, follows a sigmoid function.

The dependent variable follows a Bernoulli distribution and, therefore, the objective function (logarithmic likelihood function) used to estimate the α, β1, β2, …, βk parameters of a determined probit regression model is exactly the same as Expression (14.15), derived in this chapter for the binary logistic regression model, and given as:

LL = \sum_{i=1}^{n} \left[ Y_i \cdot \ln(p_i) + (1 - Y_i) \cdot \ln(1 - p_i) \right] = \max \qquad (14.47)

What varies between the binary logistic regression and probit regression models, however, is the expression of the probabilities p_i of occurrence of the event of interest. As we have studied, the p_i expression in the binary logistic regression, which corresponds to the logistic distribution, is given as:

p_i = \frac{1}{1 + e^{-Z_i}} = \frac{1}{1 + e^{-(\alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki})}} \qquad (14.48)

In the probit regression, the probabilities of occurrence of the event of interest, which follow the standard normal distribution, can be expressed as:

p_i = \Phi(Z_i) = \Phi(\alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki}) \qquad (14.49)

where Φ represents the cumulative distribution function of the standard normal distribution. In this sense, Expression (14.49) can be written as follows:

p_i = \int_{-\infty}^{Z_i} \frac{1}{\sqrt{2\pi}} \, e^{-\frac{1}{2}Z^2} \, dZ \qquad (14.50)

which, to facilitate calculation, can be rewritten in the following way:

p_i = \frac{1}{2} + \frac{1}{2} \left( 1 - e^{-\frac{2 Z_i^2}{\pi}} \right)^{\frac{1}{2}} \quad \text{for } Z_i \geq 0 \qquad (14.51)

and

p_i = 1 - \left[ \frac{1}{2} + \frac{1}{2} \left( 1 - e^{-\frac{2 Z_i^2}{\pi}} \right)^{\frac{1}{2}} \right] \quad \text{for } Z_i < 0 \qquad (14.52)

Based on Expressions (14.48), (14.51), and (14.52), we can construct Table 14.21, which presents the p values as a function of Z values varying from − 5 to + 5 and makes possible the comparison of the logistic (logit) and probit probability curves. Notice that the p values in the column referring to the logit regression are exactly equal to those already calculated and presented in Table 14.1. In case the researcher opts to prepare this table in Excel, the p values in the column referring to the probit regression model can be obtained with the = NORM.S.DIST(Z; 1) function. Alternatively, the table can be generated directly in Stata, as sketched after the table.

Table 14.21

Probability of Occurrence of an Event (p) as a Function of Z for the Logit and Probit Regression Models

Z_i    p_i (Logit Regression)    p_i (Probit Regression)
− 5    0.01                      0.00
− 4    0.02                      0.00
− 3    0.05                      0.00
− 2    0.12                      0.02
− 1    0.27                      0.16
  0    0.50                      0.50
  1    0.73                      0.84
  2    0.88                      0.98
  3    0.95                      1.00
  4    0.98                      1.00
  5    0.99                      1.00

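For researchers who prefer to generate Table 14.21 directly in Stata instead of Excel, a minimal sketch:

clear
set obs 11
generate Z = _n - 6                   // Z values from -5 to +5
generate p_logit = 1/(1 + exp(-Z))    // logistic probabilities, Expression (14.48)
generate p_probit = normal(Z)         // standard normal CDF, Expressions (14.49)-(14.52)
list Z p_logit p_probit, noobs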

Based on Table 14.21, we can construct a graph of p = f(Z), as presented in Fig. 14.67. By means of this graph, we can see that, even though the probabilities estimated as a function of the different values of Z are situated between 0 and 1 in both cases, distinct parameters will be estimated for the logit and probit models, since different Z values are necessary to arrive at the same probability of occurrence of the event of interest for a determined observation i.

Fig. 14.67
Fig. 14.67 Graph of p = f(Z) for logit and probit models.

As we can see in the graph in Fig. 14.67, the logit and probit functions are considerably distinct, especially for Z values near zero, and the parameters estimated in each case approximately follow the relation [α, β]_logit ≈ 1.6 ⋅ [α, β]_probit, as discussed by Amemiya (1981). We will also verify this relation in the example considered in the next section.

In this sense, for a determined dataset, which is better: the logit or the probit? As Finney (1952) points out, the choice of the probit model over the logit model is justified, theoretically, by the adherence of the probability curve of occurrence of the event of interest to the cumulative distribution function of the standard normal distribution. In practice, the decision can be made based on four criteria, whose concepts have already been discussed throughout this chapter:

  •  model with the highest logarithmic likelihood function value;
  •  model with the highest McFadden pseudo R2;
  •  model with the highest significance level in the Hosmer-Lemeshow test (lowest χ2 statistic for this test);
  •  model with the greatest area below the ROC curve.

Next, we will present an example of an estimated probit regression model, whose results are compared with those obtained by a binary logistic regression model.

A.2 Example: Probit Regression Model in Stata

We will use the Triathlon.dta dataset, which contains data obtained by means of research conducted on 100 amateur athletes who participated in a triathlon event known as sprint. The research sought to determine whether a determined athlete completed the race or not, with the idea of evaluating if this fact was related to the amount of carbohydrates, in grams per kilo of body weight, ingested on the previous day. Since the event of interest for the dependent variable refers to Yes (race finished), this category presents values equal to 1 in the dataset, leaving the category No (race not finished) with values equal to 0. Our idea, therefore, is to estimate the Z parameters, which are given, for each athlete i, as:

Z_i = \alpha + \beta_1 \cdot carbohydrates_i

based on the maximization of the logarithmic likelihood function presented in Expression (14.47), where:

p_i = \Phi(Z_i) = \Phi(\alpha + \beta_1 \cdot carbohydrates_i)

The model proposed for this example can be considered a dose-response relation, since the quantity, or dose, of carbohydrates ingested on the day prior to the triathlon race may be related to finishing it.

In Stata, we can estimate the parameters for our probit regression model by means of typing in the following command:

probit triathlon carbohydrates

the outputs of which are found in Fig. 14.68.

Fig. 14.68
Fig. 14.68 Probit regression outputs in Stata.

As an alternative to this command, we could have typed:

glm triathlon carbohydrates, family(binomial) link(probit)

which generates exactly the same parameter estimates, since probit regression models are also part of the Generalized Linear Models group.

It is important to mention that a more curious researcher could obtain these same outputs by means of the Triathlon Probit Maximum Likelihood.xls file, using the Excel Solver tool, according to the standard adopted throughout this chapter and book. In this file, the Solver criteria have already been defined.

Based on the outputs in Fig. 14.68, we see that the estimated parameters are statistically different from zero, at 95% confidence, and the final estimated expression for the probability that an athlete i completes the race is given as:

p_i = \Phi(-1.642 + 0.379 \cdot carbohydrates_i)

In this sense, the average estimated probability of finishing the triathlon race for, for example, a participant who ingested 10 grams of carbohydrates per kilo of body weight on the day prior to the race can be obtained by typing the following command:

mfx, at(carbohydrates = 10)

The output is presented in Fig. 14.69, by means of which we can arrive at the answer of 0.984 (98.4%). This answer can also be obtained by means of the following expression:

p_i = \Phi(-1.642 + 0.379 \cdot 10) = \Phi(2.148)

Fig. 14.69
Fig. 14.69 Calculation of estimated probability when carbohydrates = 10 - mfx command.

where the value of 2.148 represents the Z-score for the cumulative distribution function of the standard normal distribution, which results in a probability value of 0.984. To verify this, the researcher can type the display normal(2.148) command in Stata, or the = NORM.S.DIST(2.148; 1) function in any Excel cell.
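In more recent versions of Stata, the mfx command has been superseded by margins, which returns the same estimated probability; a minimal sketch:

quietly probit triathlon carbohydrates
margins, at(carbohydrates = 10)    // average predicted probability when carbohydrates = 10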

Besides this, we can see that, as with the binary logistic regression models, Stata also presents the McFadden pseudo R2 in its probit regression outputs. Its calculation can also be done based on Expression (14.16), and its use is restricted solely to cases where the researcher is interested in comparing two or more distinct models (higher McFadden pseudo R2 criterion).
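Following the logic of Expression (14.16), the McFadden pseudo R2 can also be recovered manually from the results stored by Stata after the estimation; a minimal sketch:

quietly probit triathlon carbohydrates
display 1 - e(ll)/e(ll_0)    // 1 - LL(full model)/LL(null model)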

If the researcher also desires to estimate the parameters of the corresponding binary logistic regression model, so as to compare them with those obtained by the probit regression, the following sequence of commands can be typed:

eststo: quietly logit triathlon carbohydrates
predict prob1
eststo: quietly probit triathlon carbohydrates
predict prob2
esttab, scalars(ll) se pr2
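It is worth noting that eststo and esttab are not built-in Stata commands: they belong to the user-written estout package. If these commands are not yet available, the package can be installed once by typing:

ssc install estout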

Fig. 14.70 presents the main results obtained for each estimation.

Fig. 14.70
Fig. 14.70 Main results obtained in logit and probit estimations.

Based on the consolidated outputs, it is possible to see that, even though there are differences between the parameter estimates in each case, the values obtained for the logarithmic likelihood function (ll, or log likelihood) and for the McFadden pseudo R2 are slightly higher for the probit model (model 2 in Fig. 14.70), which makes it preferable to the logit model for the data in our example.

In relation to the actual estimated parameters, we can also arrive at the following relations:

\frac{\alpha_{logit}}{\alpha_{probit}} = \frac{-2.767}{-1.642} = 1.69

\frac{\beta_{logit}}{\beta_{probit}} = \frac{0.642}{0.380} = 1.69

which are in agreement with that discussed by Amemiya (1981).
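These ratios can be verified directly from the coefficients stored by each estimation, in a quick sketch:

quietly logit triathlon carbohydrates
scalar a_logit = _b[_cons]
scalar b_logit = _b[carbohydrates]
quietly probit triathlon carbohydrates
display a_logit/_b[_cons]            // approximately 1.69
display b_logit/_b[carbohydrates]    // approximately 1.69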

In terms of interpretation, we can state that, while the ingestion of 1 more gram of carbohydrates per kilo of body weight increases the natural logarithm of the odds of completing the triathlon race, on average, by 0.642 (logit model), the same fact increases the Z-score of the standard normal cumulative distribution, on average, by 0.380 (probit model).

Next, we will study and compare the significance levels of the Hosmer-Lemeshow tests and the areas below the ROC curves for both models. To do this, we should type the following commands:

quietly logit triathlon carbohydrates
estat gof, group(10)
lroc, nograph
quietly probit triathlon carbohydrates
estat gof, group(10)
lroc, nograph

The new outputs are found in Fig. 14.71.

Fig. 14.71
Fig. 14.71 Hosmer-Lemeshow tests and areas below the ROC curve obtained in the logit and probit estimations.

Based on these outputs, we can see that the areas below the ROC curve are equal in both models. Although neither estimation presents problems in relation to the quality of the proposed adjustment, since the null hypothesis that the expected and observed frequencies are equal is not rejected at the 95% confidence level, the significance level of the Hosmer-Lemeshow test in the probit model (χ2 = 8.93, Sig. χ2 = 0.3479) is slightly higher than in the logit model (χ2 = 9.14, Sig. χ2 = 0.3305), which suggests that the former (probit) presents a slightly better adjustment.

Finally, we can construct a graph that relates the expected (predicted) probabilities of each athlete finishing the triathlon race (the prob1 and prob2 variables already generated for the logit and probit models, respectively) to the carbohydrates variable. This graph is presented in Fig. 14.72, and the command to generate it is:

Fig. 14.72
Fig. 14.72 Probabilities of event occurrence (finish the race) in function of the carbohydrates variable, with logit and probit adjustments.

graph twoway scatter triathlon carbohydrates || mspline prob1 carbohydrates || mspline prob2 carbohydrates ||, legend(label(2 "LOGIT") label(3 "PROBIT"))

Even though this graph shows, for the data in our example, that no significant differences exist between the logit and probit adjustments, the criteria discussed favor the adoption of the latter.

It is recommended, for models where the dependent variable is binary, that the researcher justify the adopted estimation choice, or at least investigate whether the probability curve of occurrence of the event under analysis adheres to the cumulative distribution function of the standard normal distribution. If this is the case, probit regression models can be more adequate for the generation of predicted probabilities compatible with the phenomenon being studied.

References

Amemiya T. Qualitative response models: a survey. J. Econ. Lit. 1981;19(4):1483–1536.

Anderson J.A. Logistic discrimination. In: Krishnaiah P.R., Kanal L.N., eds. Handbook of Statistics. Amsterdam: North Holland; 1982:169–191.

Belfiore P., Fávero L.P. Pesquisa operacional: para cursos de administração, contabilidade e economia. Rio de Janeiro: Campus Elsevier; 2012.

Bliss C.I. The method of probits. Science. 1934a;79(2037):38–39.

Bliss C.I. The method of probits – a correction. Science. 1934b;79(2053):409–410.

Engle R.F. Wald, likelihood ratio, and lagrange multiplier tests in econometrics. In: Griliches Z., Intriligator M.D., eds. Handbook of Econometrics II. Amsterdam: North Holland; 1984:796–801.

Finney D.J. Probit Analysis. Cambridge: Cambridge University Press; 1952.

Swets J.A. Signal Detection Theory and ROC Analysis in Psychology and Diagnostics: Collected Papers. Mahwah: Lawrence Erlbaum Associates; 1996.


1 It is worth mentioning that, throughout this chapter, we are considering that the relation between the proportion of observations defined as event and the proportion of observations defined as nonevent in the sample under study is identical to the corresponding relation existing in the population. If, however, this relation is known in the population and is significantly different from that found in the sample under analysis, the estimated probability of occurrence of the event under study for a determined sample observation can be considerably different from that observed in the population in general. In this sense, so that the model can be applied to a population whose proportion of observations defined as event is substantially different from that used in its estimation, a correction must be applied to the intercept estimated in the sample model. Following what Anderson (1982) suggests, the following expression can be used to correct the intercept:

\alpha_{corrected} = \alpha_{estimated} + \ln\left( \frac{\Pi_1}{\Pi_0} \cdot \frac{n_0}{n_1} \right)

where Π1 and Π0 represent the proportions of observations defined as event and as nonevent in the general population, respectively, and n1 and n0 represent the quantities of observations defined as event and as nonevent in the sample under study, respectively, with n0 + n1 = n (sample size).
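As a purely illustrative calculation with hypothetical values: if the event represents half of the sample (n1 = n0 = 1000) but only 5% of the general population (Π1 = 0.05; Π0 = 0.95), the correction term can be computed directly in Stata:

display ln((0.05/0.95)*(1000/1000))    // approximately -2.944, to be added to the estimated intercept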

"To view the full reference list for the book, click here"

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset