15.5 Regression Model Estimation for Count Data in SPSS

We now present the step-by-step procedure to elaborate our examples using the IBM SPSS Statistics Software®. The reproduction of the images in this section has the authorization of the International Business Machines Corporation©.

As was done in previous chapters, our objective is neither to present again the concepts inherent to the techniques, nor to repeat what has already been explored in previous sections. The main objective of this section is to give the researcher the opportunity to estimate regression models for count data in SPSS, given the software's ease of use and the way it performs its operations. With each output presented, we will mention the corresponding result obtained when performing the techniques in Excel and Stata, so as to allow the researcher to compare them and decide which software to use based on its characteristics and accessibility.

15.5.1 Poisson Regression Model in SPSS

Following the same logic proposed in the application of the models in Stata, we will use the dataset built by the professor by means of questionnaires given to 100 students. The data can be found in the HowLatePoisson.sav file and, after opening it, we will first click on Analyze → Descriptive Statistics → Frequencies … so as to prepare the first diagnostic of the dependent variable distribution. The dialog box in Fig. 15.47 will be opened.

Fig. 15.47
Fig. 15.47 Dialog box for elaboration of table of frequencies of the dependent variable.

As shown in Fig. 15.47, we should insert the dependent variable late (number of late arrivals to school in the last week) in Variable(s). In the button Statistics …, we should mark the options Mean and Variance, as is shown in Fig. 15.48.

Fig. 15.48
Fig. 15.48 Options for calculation of mean and variance of the dependent variable.

By clicking on Continue, we will return to the previous dialog box. On the Charts … button we will mark the Histograms option, as is shown in Fig. 15.49.

Fig. 15.49
Fig. 15.49 Dialog box for elaboration of the dependent variable histogram.

Next, let’s click on Continue and then OK. The outputs can be found in Fig. 15.50.

Fig. 15.50
Fig. 15.50 Mean, variance, table of frequencies, and histogram of the dependent variable.

These outputs are the same as those presented in Table 15.3 and Fig. 15.3 in Section 15.2.1 and also in Figs. 15.18–15.20 in Section 15.4.1. By means of them we can see, even if only in a preliminary way, that there are no signs of overdispersion in the data, since the mean and variance are very close. We now go on to the estimation of the Poisson regression model and, based on its results, we will prepare the test to verify the existence of overdispersion.
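This preliminary diagnostic simply compares the sample mean and variance of the dependent count variable. A minimal sketch of the idea in Python (the counts below are hypothetical, not the HowLatePoisson.sav data):

```python
import statistics

# Hypothetical weekly late-arrival counts for illustration only;
# the actual diagnostic uses the late variable in HowLatePoisson.sav.
late = [0, 0, 1, 1, 1, 2, 2, 2, 3, 4]

mean = statistics.mean(late)
var = statistics.variance(late)  # sample variance, as reported by SPSS

# A variance much larger than the mean is a preliminary sign of
# overdispersion; values close together suggest equidispersion.
print(f"mean = {mean:.3f}, variance = {var:.3f}, ratio = {var / mean:.3f}")
```

For this particular illustrative sample the two statistics coincide, suggesting equidispersion; a formal test, as shown later in the section, is still required.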

We will then click on Analyze → Generalized Linear Models → Generalized Linear Models … . A dialog box will be opened and we should select, in the Type of Model folder, the Poisson loglinear option (in Counts), as is shown in Fig. 15.51.

Fig. 15.51
Fig. 15.51 Initial dialog box for estimation of Poisson models in SPSS.

It is important to remember that the researcher can use this same dialog box to estimate, for example, a multiple regression model or a logistic regression model, since these are also part of the Generalized Linear Models.

In the Response folder, we should include the variable late in the Dependent Variable box, as is shown in Fig. 15.52.

Fig. 15.52
Fig. 15.52 Dialog box for selection of the dependent variable.

In the Predictors folder, we should include the variables dist, sem, and per in the Covariates box and, in the Model folder, insert the same three variables in the Model box, as shown in Figs. 15.53 and 15.54, respectively.

Fig. 15.53
Fig. 15.53 Dialog box for selection of explanatory variables.
Fig. 15.54
Fig. 15.54 Dialog box for inclusion of explanatory variables in the model estimation.

In the Statistics folder, besides the options already selected per the SPSS standard, we also choose the option Include exponential parameter estimates, as shown in Fig. 15.55.

Fig. 15.55
Fig. 15.55 Dialog box for selection of Poisson regression model statistics.

Finally, as shown in Fig. 15.56, we choose, in the Save folder, only the first option, or rather, Predicted value of mean response, which will create a variable in the dataset corresponding to λi (predicted amount of weekly late arrivals per student).

Fig. 15.56
Fig. 15.56 Dialog box for generation of variable λi referent to the number of weekly late arrivals per student.

Next, we should click on OK. Fig. 15.57 presents the main estimation outputs.

Fig. 15.57
Fig. 15.57 Poisson regression model outputs in SPSS.

The first output in Fig. 15.57 (Goodness of Fit) presents the maximum value of the log-likelihood function for the proposed estimation, which is −107.615 and is exactly equal to the value obtained when modeling in Excel (Table 15.5 and Fig. 15.6) and in Stata (Figs. 15.21 and 15.27). By means of the same output, we can also verify that the quality of the fit of the estimated model is adequate, since, for χ² = 67.717 (called Deviance in SPSS) with 96 degrees of freedom, we have Sig. χ² > 0.05. That is, there are no statistically significant differences between the predicted and observed values of the number of late arrivals that occur weekly. This part of the output corresponds to what was presented in Fig. 15.25 when estimating the model in Stata.
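SPSS reports the Sig. value of this deviance test directly. As a rough cross-check outside the software, the upper-tail χ² p-value can be approximated with the Wilson-Hilferty cube-root transformation; the sketch below is an approximation for large degrees of freedom, not SPSS's exact computation.

```python
import math

def chi2_sf_wh(x, df):
    """Approximate upper-tail p-value P(chi2_df > x) using the
    Wilson-Hilferty cube-root normal approximation (good for large df)."""
    z = ((x / df) ** (1 / 3) - (1 - 2 / (9 * df))) / math.sqrt(2 / (9 * df))
    return 0.5 * math.erfc(z / math.sqrt(2))

# Deviance reported by SPSS for the weekly Poisson model: 67.717 with 96 df.
p = chi2_sf_wh(67.717, 96)
print(f"p = {p:.3f}")  # well above 0.05: the fit is adequate
```

The same function applied to the deviance of 145.295 (with 96 df) obtained later for the monthly data returns a p-value below 0.05, matching the conclusion drawn there.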

We can also see, based on the χ² test (likelihood-ratio Chi-square = 51.015, Sig. χ² = 0.000 < 0.05, presented in the Omnibus Test output), that the null hypothesis that all βj (j = 1, 2, 3) parameters are statistically equal to zero can be rejected at the 5% significance level; that is, at least one X variable is statistically significant to explain the occurrence of late arrivals per week.

The estimated parameters are found in the Parameter Estimates output and are exactly equal to those calculated manually and presented in Fig. 15.6 (Excel) and also obtained by means of the poisson command in Stata (Fig. 15.21). This same output also presents the incidence rate ratios (or irr) for each explanatory variable, which SPSS calls Exp(B), as is also presented by means of Fig. 15.27. Since none of the confidence intervals for the estimated parameters (95% Wald Confidence Interval) contains zero and, consequently, none of those for Exp(B) contains 1, we have arrived at the final Poisson regression model (all the Sig. Wald Chi-Square < 0.05).

Therefore, the expression for the estimated mean amount of late arrivals per week for a determined student i can be written as:

λ_i = e^(−4.380 + 0.222·dist_i + 0.165·sem_i − 0.573·per_i)

with minimum and maximum expressions, at the 95% confidence level, equal to:

λ_i^min = e^(−6.654 + 0.093·dist_i + 0.075·sem_i − 1.086·per_i)

λ_i^max = e^(−2.106 + 0.351·dist_i + 0.254·sem_i − 0.060·per_i)
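As a quick sanity check outside SPSS, the fitted equation can be evaluated directly. The sketch below (in Python, with hypothetical covariate values) computes λ_i and the incidence rate ratios Exp(B) from the coefficients reported in the text.

```python
import math

# Coefficients of the final weekly Poisson model reported in the text.
B = {"const": -4.380, "dist": 0.222, "sem": 0.165, "per": -0.573}

def predict_lambda(dist, sem, per):
    """Estimated mean number of weekly late arrivals (lambda_i)."""
    return math.exp(B["const"] + B["dist"] * dist + B["sem"] * sem + B["per"] * per)

# Incidence rate ratios, which SPSS reports as Exp(B).
irr = {k: math.exp(v) for k, v in B.items() if k != "const"}
print(irr)

# Hypothetical student: dist = 12, sem = 8, per = 0 (values chosen for
# illustration only; they are not taken from the dataset).
print(predict_lambda(12, 8, 0))
```

Note, for example, that Exp(B) for per is below 1, so the estimated incidence rate of weekly late arrivals is lower when per = 1, holding the other variables constant.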

After estimating the Poisson regression, we need to execute the test to verify for the presence of overdispersion in the data. To do this, we will follow the same procedure studied in Sections 15.2.4 and 15.4.1. Thus, we first create a new variable, which we will call yasterisk. Therefore, in Transform → Compute Variable …, we should proceed as shown in Fig. 15.58. Notice that the expression to be typed in the Numeric Expression box refers to Expression (15.14) and, in SPSS, the double asterisk refers to the operating exponent. The MeanPredicted variable generated in the dataset after estimating the model refers to the predicted amount of weekly late arrivals for each student (λi).

Fig. 15.58
Fig. 15.58 Generation of yasterisk variable to test overdispersion in the data.

After clicking on OK, the new variable, yasterisk, will appear in the dataset. We should now regress it on the MeanPredicted variable, according to Expression (15.15). To do this, we will click on Analyze → Regression → Linear …, and insert the yasterisk variable in the Dependent box and the MeanPredicted variable in Independent(s), as is shown in Fig. 15.59.

Fig. 15.59
Fig. 15.59 Auxiliary regression to test overdispersion in the data.

On the Options … button, we should unmark the Include constant in equation option, as is shown in Fig. 15.60. Next, we can click on Continue and then OK.

Fig. 15.60
Fig. 15.60 Exclusion of constant for estimation of the auxiliary regression.

The output of interest to us is found in Fig. 15.61.

Fig. 15.61
Fig. 15.61 Result of the overdispersion test in SPSS.

Being that the P-value (Sig.) of the t-test corresponding to the β parameter of the MeanPredicted (Predicted Value of Mean of Response) variable is greater than 0.05, we can state that the data for the dependent variable do not present overdispersion at the 5% significance level, meaning that the estimated Poisson regression model is adequate for the presence of equidispersion in the data. The output in Fig. 15.61 is equal to the outputs of Fig. 15.10 (Excel) and Fig. 15.23 (Stata).
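The auxiliary regression behind this overdispersion test can be sketched in a few lines. The function below builds y* following Expression (15.14) and runs the no-intercept OLS of Expression (15.15), returning the slope and its t statistic; the data passed to it here are hypothetical, purely for illustration.

```python
import math

def overdispersion_test(y, lam):
    """Auxiliary overdispersion test: build y* = ((y - lam)^2 - y) / lam
    and regress it on lam with no constant; a significantly positive
    slope indicates overdispersion in the count data."""
    ystar = [((yi - li) ** 2 - yi) / li for yi, li in zip(y, lam)]
    sxy = sum(li * ysi for li, ysi in zip(lam, ystar))
    sxx = sum(li * li for li in lam)
    beta = sxy / sxx                      # no-intercept OLS slope
    resid = [ysi - beta * li for li, ysi in zip(lam, ystar)]
    s2 = sum(e * e for e in resid) / (len(y) - 1)
    t = beta / math.sqrt(s2 / sxx)        # t statistic with n - 1 df
    return beta, t

# Tiny illustrative data (hypothetical counts and fitted means lambda_i).
beta, t = overdispersion_test([1, 2, 3, 2], [1.5, 2.0, 2.5, 2.0])
print(beta, t)
```

In SPSS the same computation is carried out through Transform → Compute Variable … followed by the no-constant linear regression described above.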

Next, as was done in Section 15.4.1, we will compare the results of the Poisson regression model estimated by maximum likelihood with those obtained by a multiple log-linear regression model estimated by ordinary least squares (OLS). To do this, we will first generate an lnlate variable that corresponds to the natural logarithm of the late dependent variable, by clicking on Transform → Compute Variable …, as shown in Fig. 15.62.

Fig. 15.62
Fig. 15.62 Generation of lnlate variable for estimation of a log-linear regression model.

As such, the ln(latei) = α + β1 ⋅ disti + β2 ⋅ semi + β3 ⋅ peri model can be estimated by OLS. To do this, we will click on Analyze → Regression → Linear …, and insert the lnlate variable in the Dependent box and the dist, sem, and per variables in the Independent(s) box, as is shown in Fig. 15.63.

Fig. 15.63
Fig. 15.63 Dialog box for estimation of the log-linear regression model.

In the Save … button, we should mark the option Unstandardized, in Predicted Values, as is shown in Fig. 15.64. Next, we can click on Continue and then OK. This procedure will create a new variable in the dataset called, by SPSS, PRE_1, which corresponds to the yhat variable generated when estimating in Stata (predicted values of the natural logarithm for the number of weekly late arrivals per student).

Fig. 15.64
Fig. 15.64 Procedure to create PRE_1 variable.

We will not present the results of this multiple regression estimated by SPSS since what interests us, at this time, is only generating another variable, based on the PRE_1 variable, which will represent the predicted values of the actual weekly number of late arrivals per student. This variable, which we will call eyhat, can be created by once again clicking on Transform → Compute Variable …, as is shown in Fig. 15.65.

Fig. 15.65
Fig. 15.65 Generation of eyhat variable based on PRE_1 variable.

So as to prepare a graph similar to that presented in Fig. 15.30, that is, a graph that allows the comparison, for each of the estimations, of the predicted values and the real values of the number of weekly late arrivals, we will now click on Graphs → Legacy Dialogs → Line … and, next, on the options Multiple and Summaries of separate variables, as presented in Fig. 15.66.

Fig. 15.66
Fig. 15.66 Dialog box for construction of a graph to compare estimates.

By clicking on Define, another dialog box will appear like that presented in Fig. 15.67. We should insert the MeanPredicted (predicted amount of weekly late arrivals for each student estimated by maximum likelihood for the Poisson regression model) and eyhat (predicted amount of weekly late arrivals for each student estimated by OLS for the multiple log-linear regression model) variables in the Lines Represent box and the late variable in the Category Axis. Next, we can click on OK.

Fig. 15.67
Fig. 15.67 Selection of variables to be inserted in the graph.

The graph in Fig. 15.68 allows us to compare the behavior of the predicted values against the real values of the dependent variable for each of the prepared estimations, from which we can see that they are different. As has been discussed, the fact that the dependent variable is quantitative is not a sufficient condition to elaborate a multiple regression model with OLS estimation. Count data present unique distributions, and the researcher must always be aware of this fact, so as to estimate adequate and consistent models for diagnosis and prediction.

Fig. 15.68
Fig. 15.68 Predicted × observed values of weekly late arrivals for Poisson and multiple log-linear (OLS) regression models.

15.5.2 Negative Binomial Regression Model in SPSS

Following the same logic proposed in the previous section, we will now open the HowLateBNeg.sav file, which gives the data regarding the monthly number of late arrivals for the 100 students, the distance traveled, and the time of day when each student usually travels to school (morning or afternoon).

By clicking on Analyze → Descriptive Statistics → Frequencies …, we can first elaborate the diagnostic regarding the dependent variable distribution. In this dialog box, not presented again here, we should insert the late variable (number of late arrivals to school in the last month) in Variable(s) and, in the Statistics … button, mark the options Mean and Variance. In the Charts … button, we will check the Histograms option and then click on Continue and OK. The outputs are found in Fig. 15.69.

Fig. 15.69
Fig. 15.69 Mean, variance, table of frequencies, and histogram of the dependent variable.

These outputs are the same as those presented in Table 15.11 and Fig. 15.12 in Section 15.3.1, as well as in Figs. 15.32–15.34 in Section 15.4.2, and, by means of them, we can see, even if only in a preliminary way, that there are signs of overdispersion in the data, since the variance is higher than the mean of the dependent variable.

It is recommended, therefore, that a Poisson regression model be first estimated to then, based on its results, prepare a test to verify for the presence of overdispersion in the data. We will not again show the windows for the estimation of this model in SPSS, as was done in the previous section. However, the steps to create it will be described.

Being thus, we will first click on Analyze → Generalized Linear Models → Generalized Linear Models … . In the dialog box that will be opened, we should select, in the Type of Model folder, the Poisson loglinear option (in Counts). Then in the Response folder, we should include the late variable in the Dependent Variable box. While in the Predictors folder, we should include the dist, sem, and per variables in the Covariates box, in the Model folder, we should insert these same three variables in the Model box. In the Statistics folder, besides the options already selected per SPSS default, we should also select the Include exponential parameter estimates option and, finally, in the Save folder, select only the option Predicted value of mean response. By clicking on OK, the outputs of the estimation of the Poisson regression model will be generated. These outputs will not be shown in their entirety here.

Fig. 15.70 presents only the output of interest to us at this time (Goodness of Fit) and, by means of it, we can see that the quality of the fit of the estimated model is not adequate, since, for χ² = 145.295 (Deviance) with 96 degrees of freedom, we have Sig. χ² < 0.05; that is, there are statistically significant differences between the values predicted by the Poisson model and the observed numbers of late arrivals that occur per month. This very important part of the output corresponds to what was presented in Fig. 15.36 when estimating the model in Stata.

Fig. 15.70
Fig. 15.70 Goodness-of-fit for the initially estimated Poisson regression model.

The goodness of fit of the estimated Poisson regression model may not have been adequate due to the presence of overdispersion in the dependent variable data and, therefore, we will now perform the test to verify the existence of this phenomenon. According to what was seen in the previous section, we need to create a new variable, which we will also call yasterisk. To do this, we will click on Transform → Compute Variable … . The expression to be typed in the Numeric Expression box refers to Expression (15.14) and, in SPSS, will be the same as that presented in Fig. 15.58, that is, (((late-MeanPredicted)**2)-late)/MeanPredicted, where the MeanPredicted variable, generated in the dataset after the Poisson regression model estimation, refers to the predicted amount of monthly late arrivals for each student. We will not present again the figures already shown in the previous section.

After clicking on OK, the new variable, yasterisk, will appear in the dataset. Let's now regress yasterisk on the MeanPredicted variable, according to Expression (15.15). To do this, we should click on Analyze → Regression → Linear …, and insert the yasterisk variable in the Dependent box and the MeanPredicted variable in Independent(s). Finally, in the Options … button, we should unmark the Include constant in equation option and, next, click on Continue and then OK. The output that interests us can be found in Fig. 15.71.

Fig. 15.71
Fig. 15.71 Result of the overdispersion test in SPSS.

Being that the P-value (Sig.) of the t-test corresponding to the β parameter of the MeanPredicted (Predicted Value of Mean of Response) variable is lower than 0.05, we can state that the data of the dependent variable present overdispersion at the 5% significance level, causing the estimated Poisson regression model to be inadequate. The output in Fig. 15.71 is equal to that of Fig. 15.35 (estimated by Stata).

We will now move on to the estimation of the negative binomial regression model. To do this, we should click on Analyze → Generalized Linear Models → Generalized Linear Models … and, in the dialog box that will open, select, in the Type of Model folder, the Custom option. In this same folder, we should also select the Negative binomial (in Distribution), Log (in Link function), and Estimate value (in Parameter) options. This last option refers to the ϕ parameter and, therefore, an NB2 regression model will be estimated. Fig. 15.72 shows how this folder will look after selecting the options.

Fig. 15.72
Fig. 15.72 Initial dialog box for estimation of NB2 models in SPSS.

For the remaining folders, the researcher can opt to maintain the same options that had already been selected when estimating the initial Poisson regression model. The outputs generated by means of the present negative binomial regression model are found in Fig. 15.73.

Fig. 15.73
Fig. 15.73 Negative binomial regression model (NB2) outputs in SPSS.

The first output in Fig. 15.73 (Goodness of Fit) presents the value of the log-likelihood function for the NB2 model, which is −151.012 and is exactly equal to the value obtained when modeling in Excel (Table 15.12 and Fig. 15.14) and Stata (Figs. 15.37, 15.39, and 15.41). By means of the same output, we can also see that the quality of the fit of the estimated model is now adequate, since, for χ² = 105.025 (Deviance) with 96 degrees of freedom, we have Sig. χ² > 0.05 (the critical value being χ² = 119.871 for 96 degrees of freedom at the 5% significance level); that is, there are no statistically significant differences between the predicted and observed values for the number of monthly late arrivals to school. This part of the output corresponds to the Deviance presented by Stata when estimating the negative binomial regression model obtained by the glm..., family(nbinomial ml) command (Fig. 15.39).

We can also see, based on the χ² test (likelihood-ratio Chi-square = 63.249, Sig. χ² = 0.000 < 0.05, presented in the Omnibus Test output), that the null hypothesis that all βj (j = 1, 2, 3) parameters are statistically equal to zero can be rejected at the 5% significance level; that is, at least one X variable is statistically significant to explain the occurrence of monthly late arrivals.

The estimated parameters are found in the Parameter Estimates output and are exactly equal to those calculated manually and presented in Fig. 15.14 (Excel) and also obtained by means of the nbreg or glm..., family(nbinomial ml) commands in Stata (Figs. 15.37 and 15.39, respectively). This same output also presents the incidence rate ratios (or irr) for each explanatory variable, which SPSS calls Exp(B), as was also presented by means of Fig. 15.41. Since none of the confidence intervals for the estimated parameters (95% Wald Confidence Interval) contains zero and, consequently, none of those for Exp(B) contains 1, we have arrived at the final negative binomial regression model (all the Sig. Wald Chi-Square < 0.05).

Then, the mean estimated amount of monthly late arrivals for a determined student can be written as:

u_i = e^(−4.997 + 0.308·dist_i + 0.197·sem_i − 0.927·per_i)

Besides this, also based on the final output in Fig. 15.73, the estimated number of monthly late arrivals presents, at the 95% confidence level, minimum and maximum expressions equal to:

u_i^min = e^(−7.446 + 0.168·dist_i + 0.100·sem_i − 1.431·per_i)

u_i^max = e^(−2.549 + 0.447·dist_i + 0.294·sem_i − 0.424·per_i)
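Since dist and sem are nonnegative and per is binary, the coefficient-wise bounds bracket the point prediction. A small sketch (with hypothetical covariate values, not drawn from the dataset) evaluating the three expressions:

```python
import math

# Point estimates and 95% Wald interval bounds reported in the text
# for the monthly NB2 model: (intercept, dist, sem, per).
POINT = (-4.997, 0.308, 0.197, -0.927)
LOWER = (-7.446, 0.168, 0.100, -1.431)
UPPER = (-2.549, 0.447, 0.294, -0.424)

def u_hat(coef, dist, sem, per):
    """Estimated mean number of monthly late arrivals under a
    coefficient vector (point estimate or an interval bound)."""
    a, b1, b2, b3 = coef
    return math.exp(a + b1 * dist + b2 * sem + b3 * per)

# Hypothetical student: dist = 10, sem = 5, per = 1.
u = u_hat(POINT, 10, 5, 1)
umin = u_hat(LOWER, 10, 5, 1)
umax = u_hat(UPPER, 10, 5, 1)
print(umin, u, umax)
```

The ordering u_min < u < u_max holds here because every covariate enters with a nonnegative value, so each bound coefficient moves the linear predictor in one direction.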

Finally, the lower part of the final output in Fig. 15.73 presents the estimation of ϕ (Negative binomial). As we can see, the confidence interval for ϕ does not contain zero; that is, at the 95% confidence level, we can state that ϕ is statistically different from zero, with an estimated value equal to 0.255, as had already been calculated in Section 15.3.1 by means of Excel Solver (Fig. 15.14) and in Section 15.4.2 by means of Stata (Figs. 15.37, 15.39, and 15.41). This confirms the existence of overdispersion in the data, with the variance of the dependent variable given by the following expression:

Var[Y] = u + 0.255·u²
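This variance expression can be contrasted with the Poisson assumption Var[Y] = u. A minimal sketch using the ϕ = 0.255 estimated in the text shows how the NB2 variance pulls away from the mean as u grows:

```python
def poisson_var(mu):
    # Under equidispersion, the Poisson model imposes Var[Y] = mu.
    return mu

def nb2_var(mu, phi=0.255):
    # NB2 variance with the phi estimated in the text: Var[Y] = mu + phi*mu^2.
    return mu + phi * mu * mu

# The quadratic term makes the gap widen with the mean, which is why
# overdispersion tends to appear for counts with wider ranges.
for mu in (1, 2, 4, 8):
    print(mu, poisson_var(mu), nb2_var(mu))
```
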

Finally, we will prepare a graph similar to that presented in Fig. 15.45, however also including the values estimated by OLS for a multiple log-linear regression model. In other words, we will prepare a graph that allows the comparison of the predicted values and real values for monthly late arrivals for each of the estimated models (negative binomial, Poisson, and log-linear regression by OLS).

Being that the predicted values for the estimated Poisson and negative binomial models are already in the dataset (MeanPredicted and MeanPredicted_1 variables, respectively), at this time we need to estimate the multiple log-linear regression by OLS, whose results will not be presented here, even though the procedure will be.

Therefore, let’s generate a variable called lnlate that corresponds to the natural logarithm of the late dependent variable, by clicking on Transform → Compute Variable … . The expression that should be typed in the Numeric Expression box is ln(late) so that, in this way, the model ln(latei) = α + β1 ⋅ disti + β2 ⋅ semi + β3 ⋅ peri can be estimated by OLS.

Next, we will click on Analyze → Regression → Linear …, and insert the variable lnlate in the Dependent box and the variables dist, sem, and per in the Independent(s) box. In the Save … button, we should mark the option Unstandardized, under Predicted Values and, finally, we can click on Continue and OK. This procedure will create a new variable called, by SPSS, PRE_1 (predicted values of the natural logarithm for the number of monthly late arrivals).

However, the variable that we want to create refers to the predicted number of monthly late arrivals, not the predicted values of the natural logarithm of that number. Therefore, we need to again click on Transform → Compute Variable … and create a variable called eyhat, whose expression to be typed into the Numeric Expression box is exp(PRE_1).

In this way we can produce the desired graph, clicking on Graphs → Legacy Dialogs → Line … and, next, on the options Multiple and Summaries of separate variables. By clicking on Define, a dialog box will appear where we should enter the MeanPredicted variables (values predicted by the Poisson model), MeanPredicted_1 (values predicted by the negative binomial model), and eyhat (predicted values for the log-linear regression estimated by OLS) in the Lines Represent box and the late variable in Category Axis. Next, we can click on OK.

The generated graph can be edited by means of a double click, where we opt for the Spline type of interpolation, as shown in Fig. 15.74. The final graph is seen in Fig. 15.75.

Fig. 15.74
Fig. 15.74 Definition of spline type interpolation for graph construction.
Fig. 15.75
Fig. 15.75 Predicted × observed values of monthly late arrivals for negative binomial, Poisson, and multiple log-linear (OLS) regression models.

By analyzing the graph in Fig. 15.75, we can see that the variance in the predicted amounts of monthly late arrivals is much greater for the negative binomial regression model, whose estimation actually succeeds in capturing the existence of overdispersion in the data, especially for a greater number of monthly late arrivals.

This confirms the fact that count data distributions with greater observed value amplitudes can increase the variance of the variable under study in a higher proportion than its mean, which can cause overdispersion in the data. While we did not verify the presence of overdispersion for the weekly count data, with a narrower range of possible occurrences, this phenomenon did present itself when the count data came to be measured on a monthly scale, that is, with a wider range of possible occurrences. As we studied in this chapter, while the first case was approached by means of a Poisson regression model estimation, the data from the second case came to present a better fit when a negative binomial regression model was estimated.

15.6 Final Remarks

The estimation of regression models in which the dependent variable is composed of count data presents innumerable applications. Such models are, however, little explored, due either to a lack of knowledge of the existing models or to the common, though incorrect, belief that a quantitative dependent variable can always be handled by OLS estimation, regardless of its distribution.

The Poisson and negative binomial regression models are log-linear models (or semilogarithmic to the left) and represent the best-known count data models, being estimated by maximum likelihood. While the correct estimation of a Poisson regression model demands the nonexistence of the overdispersion phenomenon in the dependent variable data, estimation by a negative binomial regression model allows the variance of the dependent variable to be statistically higher than its mean.1

It is recommended that, before defining the most adequate and consistent regression model when there is count data, a diagnostic regarding the dependent variable distribution and the estimation of a Poisson regression model be performed and, from that point, execute a test to verify for the existence of overdispersion in the data. Should this be the case, a negative binomial regression model should be estimated, being recommended the NB2 type model.

The Poisson and negative binomial regression models should be estimated by means of the correct use of the chosen software. The inclusion of potential explanatory variables for the phenomenon under study should always be done based on the subjacent theory and the researcher’s intuition.

15.7 Exercises

  1. (1) The finance department for a large appliance retailer wants to know if consumer income and age explain the use of financing, by means of installment closed-end credit, when purchasing goods such as cellular telephones, tablets, laptops, televisions, videogames, DVD/Blu-ray players, etc., so as to develop a marketing campaign for this form of financing based on client profile. To do this, the finance department marketing area randomly chose a sample of 200 consumers from its total client base, with the following variables:
    Variable — Description
    id — A string variable that varies between 001 and 200 and that identifies the consumer
    purchases — Dependent variable corresponding to the amount of durable good purchases made using installment closed-end credit in the last year per consumer (count data)
    income — Monthly consumer income (US$)
    age — Consumer age (years)

    By means of analyzing the dataset present in the Financing.sav and Financing.dta files, the following is requested:
    1. (a) Prepare a preliminary diagnostic regarding the existence of overdispersion in the purchases variable data. Present its mean and variance, prepare its histogram.
    2. (b) Estimate a Poisson regression model and, based on the results, execute the test to verify for the existence of overdispersion in the data. What was the test result at the 5% significance level?
    3. (c) Execute a χ² test to compare the distribution of observed and predicted probabilities for the incidence of the annual use of installment closed-end credit. Does the test result, at the 5% significance level, indicate an adequate quality of fit for the Poisson regression model?
    4. (d) If the answer to the previous question is yes, present the final expression for the mean estimated quantity of annual use of installment closed-end credit when purchasing durable goods, in function of the explanatory variables that show themselves to be statistically significant, at the 95% confidence level.
    5. (e) What is the expected average quantity of installment closed-end credit use per year for a consumer with a monthly income of US$2600.00 and who is 47 years old?
    6. (f) On average, how much does the annual incidence rate of installment closed-end credit use change by increasing the average consumer monthly income by US$100.00, maintaining the remaining conditions constant?
    7. (g) On average, how does the annual incidence rate of installment closed-end credit use change by increasing the average consumer age by 1 year, maintaining the remaining conditions constant?
    8. (h) Construct a graph (mspline in Stata or Spline in SPSS) that shows the predicted value for the annual incidence of installment closed-end credit use in function of consumer monthly income. Provide a brief discussion.
    9. (i) Estimate a multiple log-linear regression model by OLS and compare the results predicted in this model with those estimated by the Poisson model.
    10. (j) In the case there is interest in increasing financing by means of installment closed-end credit, what target public needs to be targeted in this financing market campaign?
  2. (2) With the idea of studying whether the proximity to parks and green areas and to malls and shopping centers causes a reduction in the intent to sell an apartment, a real-estate agency decided to mark the location of each one of 276 pieces of property for sale in a determined municipality, as shown in the following figure.
Unlabelled Image

Source: Google Maps.

To facilitate the study, the real-estate agency elaborated a grid over the municipality map, with the intent of identifying the characteristics of each microregion. By means of the grid, 100 squares (10 × 10) with equal dimensions were identified according to the following figure.

Unlabelled Image

Source: Google Maps.

To better see the quantity of property for sale in each microregion, it was decided to hide the municipality map in the following figure.

Unlabelled Image

The following variables from each of the municipality microregions were therefore developed, defined by the squares:

Variable — Description
square — A string variable that identifies the microregion (square). It is named by a number i followed by a letter j, where the number i varies from 1 to 10 and the letter j, from A to J
property — Dependent variable corresponding to the amount of residential property for sale per square (count data)
distpark — Distance from the square to the main municipal park (in meters)
mall — Binary variable that indicates if there are malls or shopping centers in the square (No = 0; Yes = 1)

The data can be found in the Realestate.sav and Realestate.dta files. Answer the following:

  (a) Execute a preliminary diagnostic regarding the existence of overdispersion in the data for the property variable. Present its mean, its variance, and its histogram.
  (b) Next, estimate the following Poisson regression model and, based on its results, prepare the test to verify the existence of overdispersion in the data. What is the test conclusion at the 5% significance level? Also, prepare a χ² test to compare the distributions of observed and predicted probabilities for the amount of property for sale in each square. Does the test result, at the 5% significance level, indicate quality of adjustment for the Poisson regression model? Justify your answer.

$$property_{ij} = e^{\alpha + \beta_1 \cdot distpark_{ij} + \beta_2 \cdot mall_{ij}}$$

  (c) Estimate an NB2-type negative binomial regression model.
  (d) Can it be said, at the 95% confidence level, that the ϕ parameter (the inverse of the shape parameter of the Gamma distribution) is statistically different from zero? If so, should one opt for the negative binomial estimation?

The next seven items refer to the NB2 negative binomial regression model.

  (e) What is the expression for the estimated mean amount of property for sale in a determined square ij?
  (f) What is the expected mean amount of property for sale in a microregion (square) that is 820 m from the park and has no shopping centers?
  (g) On average, how much does the incidence rate of property for sale per square change as the distance from the park increases, maintaining the remaining conditions constant?
  (h) On average, how much does the incidence rate of property for sale change when there is a shopping center or mall in the microregion (square), maintaining the remaining conditions constant?
  (i) Construct a graph (mspline in Stata or Spline in SPSS) that shows the behavior of the predicted amount of property for sale per square as a function of the distance to the park.
  (j) Construct the same graph, however now stratifying the squares that have shopping centers from those that do not.
  (k) Can it be said that proximity to parks and green areas and to malls and shopping centers hinders the intent to place residential property up for sale?

Also, answer the following:

  (l) Compare the Poisson and negative binomial regression models by means of a graph that presents the distributions of observed and predicted probabilities for the incidents of property for sale per square.
  (m) Also, compare the quality of adjustment of both models (Poisson and negative binomial) by analyzing the maximum differences between the observed and predicted probability distributions that occur in both cases. Besides this, prepare this analysis comparing the total Pearson values for both estimations.
  (n) Estimate a multiple log-linear regression model by OLS and compare the values predicted by this model with those estimated by the Poisson and negative binomial models.

Appendix: Zero-Inflated Regression Models

A.1 Brief Introduction

As part of the Generalized Linear Models, the regression models for count data are used when the phenomenon under study presents itself in the form of a quantitative variable that takes only discrete and non-negative values, as we studied throughout the chapter. However, it is common for count variables to present an excessive amount of zeros, which can bias the parameters estimated by the traditional Poisson and negative binomial regression models, since these models cannot capture the exacerbated presence of null counts. In these situations, zero-inflated regression models can be used and, in this Appendix, we will study these models, focusing on the Poisson and negative binomial types.2

The zero-inflated regression models, according to Lambert (1992), are considered a combination of a model for count data with a model for binary data, since they are used to investigate the reasons that lead to a determined number of occurrences (count) of a phenomenon, as well as those that lead (or not) to the actual occurrence of this phenomenon, independent of the observed count.

In this sense, while the zero-inflated Poisson model is estimated based on the combination of a Bernoulli distribution with a Poisson distribution, the zero-inflated negative binomial model is estimated by means of the combination of a Bernoulli distribution with a Poisson-Gamma distribution. The choice between one and the other follows what we studied throughout the chapter, or rather, it depends on the existence of overdispersion in the data, verified by the analysis of the inverse of the shape parameter of the Gamma distribution and the corresponding likelihood-ratio test for this parameter. We will return to this question later, when preparing an example in Stata.

The actual verification of the existence (or not) of an excessive amount of zeros in the Y dependent variable is prepared by means of a specific test, known as the Vuong (1989) test, which represents the first output to be analyzed when estimating zero-inflated regression models.

Specifically in relation to the zero-inflated Poisson regression models, the probability of no count occurring for a given observation i (i = 1, 2, …, n, where n is the sample size), or rather, p(Y_i = 0), is calculated taking into consideration the sum of a dichotomic component, which defines the probability plogit_i of no count occurring due exclusively to this component, with a count component. The probability of a determined count m occurring (m = 1, 2, …), or rather, p(Y_i = m), follows the same expression as the Poisson distribution, multiplied by (1 − plogit_i).

Therefore, using Expressions (14.10) and (15.1), we have that:

$$p(Y_i = 0) = plogit_i + (1 - plogit_i) \cdot e^{-\lambda_i}$$

$$p(Y_i = m) = (1 - plogit_i) \cdot \frac{e^{-\lambda_i} \cdot \lambda_i^m}{m!}, \quad m = 1, 2, \ldots \qquad (15.32)$$

being Y ~ ZIP(λ, plogit_i), where ZIP stands for zero-inflated Poisson, and knowing that:

$$plogit_i = \frac{1}{1 + e^{-(\gamma + \delta_1 W_{1i} + \delta_2 W_{2i} + \ldots + \delta_q W_{qi})}} \qquad (15.33)$$

and

$$\lambda_i = e^{\alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_k X_{ki}} \qquad (15.34)$$

We can see that, if plogiti = 0, the distribution of probabilities for Expression (15.32) is clearly summarized in the Poisson distribution, including for cases where Yi = 0. In other words, the zero-inflated Poisson regression models present two processes generating zeros, being one due to the binary distribution (in this case, the so-called structural zeros are generated) and the other due to the Poisson distribution (in this case count data is generated, among which are the so-called sampling zeros).3
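To make Expression (15.32) concrete, the ZIP probability function can be sketched in a few lines of Python (a minimal sketch; the function name and the parameter values are ours, not the book's):

```python
import math

def zip_pmf(m, lam, p_logit):
    """Zero-inflated Poisson probability p(Y = m) of Expression (15.32):
    structural zeros occur with probability p_logit, while counts
    (including the sampling zeros) follow Poisson(lam)."""
    poisson = math.exp(-lam) * lam ** m / math.factorial(m)
    if m == 0:
        return p_logit + (1.0 - p_logit) * poisson
    return (1.0 - p_logit) * poisson

# With p_logit = 0 the ZIP pmf collapses to the ordinary Poisson pmf;
# with p_logit > 0 the probability of a zero count is inflated:
p0 = zip_pmf(0, 2.0, 0.3)                               # 0.3 + 0.7 * e^{-2}
total = sum(zip_pmf(m, 2.0, 0.3) for m in range(60))    # sums to 1
```

Note that p0 exceeds the pure Poisson probability e^{-2} precisely because of the structural-zero component.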

Based on Expressions (15.33) and (15.34), we can, therefore, define that, while the occurrence of structural zeros is influenced by a vector of explanatory variables W1, W2, …, Wq, the occurrence of a determined count m is influenced by a vector of explanatory variables X1, X2, …, Xk. In some cases, the researcher may insert the same variable in the two vectors, in case there is the desire to investigate if this variable simultaneously influences the occurrence of the event and, if so, the quantity of occurrences (counts) of the referred phenomenon.

Based on Expression (15.32), and following the logic defined for the log-likelihood function presented in Expression (15.7), we arrive at the following objective function, whose purpose is to estimate the α, β1, β2, …, βk and γ, δ1, δ2, …, δq parameters of a determined zero-inflated Poisson regression model:

$$LL = \sum_{Y_i = 0} \ln\!\left[plogit_i + (1 - plogit_i) \cdot e^{-\lambda_i}\right] + \sum_{Y_i > 0} \left[\ln(1 - plogit_i) - \lambda_i + Y_i \cdot \ln(\lambda_i) - \ln(Y_i!)\right] = \max \qquad (15.35)$$

whose solution, as has been presented throughout the chapter, can be obtained by means of nonlinear programming (optimization) tools, such as the Excel Solver.
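Since the maximization of Expression (15.35) is a nonlinear optimization problem, it can also be carried out with a general-purpose optimizer. The sketch below, on simulated data with our own (hypothetical) variable names and true parameters, recovers the ZIP parameters with SciPy's quasi-Newton routine; it illustrates the estimation logic and is not the book's Solver spreadsheet:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, gammaln, logsumexp

# Simulated data: W drives the structural zeros (dichotomic component),
# X drives the counts (Poisson component). True parameters are
# gamma = -1, delta1 = 1 (logit part) and alpha = 0.5, beta1 = 0.5 (count part).
rng = np.random.default_rng(42)
n = 1000
W = rng.normal(size=n)
X = rng.normal(size=n)
p_true = expit(-1.0 + 1.0 * W)           # plogit_i
lam_true = np.exp(0.5 + 0.5 * X)         # lambda_i
Y = np.where(rng.random(n) < p_true, 0, rng.poisson(lam_true))

def neg_ll(theta):
    """Negative of the ZIP log-likelihood of Expression (15.35)."""
    gamma0, delta1, alpha, beta1 = theta
    z = gamma0 + delta1 * W
    lam = np.exp(np.clip(alpha + beta1 * X, -30.0, 30.0))
    log_p = -np.logaddexp(0.0, -z)       # ln(plogit_i), numerically stable
    log_1mp = -np.logaddexp(0.0, z)      # ln(1 - plogit_i)
    zero = Y == 0
    # Y_i = 0 term: ln[plogit_i + (1 - plogit_i) e^{-lambda_i}]
    ll0 = logsumexp(np.stack([log_p[zero], log_1mp[zero] - lam[zero]]), axis=0)
    # Y_i > 0 term: ln(1 - plogit_i) - lambda_i + Y_i ln(lambda_i) - ln(Y_i!)
    pos = ~zero
    llp = log_1mp[pos] - lam[pos] + Y[pos] * np.log(lam[pos]) - gammaln(Y[pos] + 1)
    return -(ll0.sum() + llp.sum())

res = minimize(neg_ll, x0=np.zeros(4), method="BFGS")
```

With a sample of this size, the four estimates typically land close to the true values used in the simulation.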

In relation to the zero-inflated negative binomial regression models, the probability of no count occurring for a given observation i, or rather, p(Y_i = 0), is also calculated taking into consideration the sum of a dichotomic component with a count component, while the probability of a determined count m occurring (m = 1, 2, …), or rather, p(Y_i = m), now follows the probability expression of the Poisson-Gamma distribution. In this sense, using Expressions (14.10) and (15.25), we have that:

$$p(Y_i = 0) = plogit_i + (1 - plogit_i) \cdot \left(\frac{1}{1 + \phi u_i}\right)^{\frac{1}{\phi}}$$

$$p(Y_i = m) = (1 - plogit_i) \cdot \binom{m + \phi^{-1} - 1}{\phi^{-1} - 1} \cdot \left(\frac{1}{1 + \phi u_i}\right)^{\frac{1}{\phi}} \cdot \left(\frac{\phi u_i}{\phi u_i + 1}\right)^{m}, \quad m = 1, 2, \ldots \qquad (15.36)$$

being Y ~ ZINB(ϕ, u, plogit_i), where ZINB stands for zero-inflated negative binomial and ϕ represents the inverse of the shape parameter of a determined Gamma distribution. Analogous to what was presented for the zero-inflated Poisson regression models, we have that:

$$plogit_i = \frac{1}{1 + e^{-(\gamma + \delta_1 W_{1i} + \delta_2 W_{2i} + \ldots + \delta_q W_{qi})}} \qquad (15.37)$$

and

$$u_i = e^{\alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_k X_{ki}} \qquad (15.38)$$

We can again see that, if plogit_i = 0, the probability distribution in Expression (15.36) reduces to the Poisson-Gamma distribution, including in cases where Y_i = 0. Hence, the zero-inflated negative binomial regression models also present two zero-generating processes, resulting from the binary distribution and from the Poisson-Gamma distribution.
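Expression (15.36) can be sketched in Python in the same spirit as the ZIP case; the generalized binomial coefficient is evaluated through log-gamma functions for numerical stability (the function name and the parameter values below are ours, for illustration only):

```python
import math

def zinb_pmf(m, u, phi, p_logit):
    """Zero-inflated negative binomial probability p(Y = m) of Expression
    (15.36); phi is the inverse of the shape parameter of the Gamma
    distribution. The generalized binomial coefficient
    C(m + 1/phi - 1, 1/phi - 1) is computed via log-gamma functions."""
    inv_phi = 1.0 / phi
    base = (1.0 / (1.0 + phi * u)) ** inv_phi      # (1/(1 + phi*u))^{1/phi}
    if m == 0:
        return p_logit + (1.0 - p_logit) * base
    log_coef = (math.lgamma(m + inv_phi) - math.lgamma(inv_phi)
                - math.lgamma(m + 1))
    ratio = (phi * u / (1.0 + phi * u)) ** m
    return (1.0 - p_logit) * math.exp(log_coef) * base * ratio

# Sanity check: the pmf sums to (practically) 1 over a wide range of counts
total = sum(zinb_pmf(m, u=2.0, phi=1.5, p_logit=0.25) for m in range(400))
```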

Therefore, based on Expression (15.36), and on the log-likelihood function defined in Expression (15.29), we arrive at the following objective function, whose purpose is to estimate the ϕ, α, β1, β2, …, βk and γ, δ1, δ2, …, δq parameters of a determined zero-inflated negative binomial regression model:

$$LL = \sum_{Y_i = 0} \ln\!\left[plogit_i + (1 - plogit_i) \cdot \left(\frac{1}{1 + \phi u_i}\right)^{\frac{1}{\phi}}\right] + \sum_{Y_i > 0} \left[\ln(1 - plogit_i) + Y_i \cdot \ln\!\left(\frac{\phi u_i}{1 + \phi u_i}\right) - \frac{\ln(1 + \phi u_i)}{\phi} + \ln\Gamma(Y_i + \phi^{-1}) - \ln\Gamma(Y_i + 1) - \ln\Gamma(\phi^{-1})\right] = \max \qquad (15.39)$$

whose solution can also be obtained by means of nonlinear programming tools.

Next, we will present an example prepared in Stata in which the parameters of a Poisson and of a negative binomial regression model, both zero-inflated, are estimated. First, the significance of the amount of zeros in the Y dependent variable will be verified (Vuong test) and, after that, the significance of the ϕ parameter (likelihood-ratio test for ϕ), or rather, the existence of overdispersion in the data. Box 15.2 presents the relation between the regression models for count data and the existence of overdispersion and of excess zeros in the data of the dependent variable.

Box 15.2

Regression Models for Count Data, Overdispersion, and Excess of Zeros in the Data of the Dependent Variable

Verification                                                    | Poisson | Negative Binomial | Zero-Inflated Poisson (ZIP) | Zero-Inflated Negative Binomial (ZINB)
Overdispersion in the data of the dependent variable            | No      | Yes               | No                          | Yes
Excessive amount of zeros in the data of the dependent variable | No      | No                | Yes                         | Yes

In this way, while the zero-inflated models of the Poisson and negative binomial types are more appropriate when there is an excessive amount of zeros in the dependent variable, the zero-inflated negative binomial model is even more recommended when, besides the excess of zeros, there is overdispersion in the data.

A.2 Example: Zero-Inflated Poisson Regression Model in Stata

So as to prepare the zero-inflated regression models, we will use the Accidents.dta dataset, which records the amount of traffic accidents that occurred in one week in 100 municipalities of a determined country (the dependent variable with count data). Besides the urban population, the average age of the inhabitants with a current driver's license and the fact that the municipality had adopted a dry law for after 10:00 p.m. were inserted into the dataset. The desc command allows us to study the dataset characteristics, as is shown in Fig. 15.76.

Fig. 15.76
Fig. 15.76 Description of the Accidents.dta dataset.

In this example, we will define the pop variable as the X variable, and the age and drylaw variables as the W1 and W2 variables. In other words, our goal is to verify if the occurrence or not of accidents, or rather, the occurrence of structural zeros, is influenced by the average age of drivers and by the existence of a dry law after 10:00 p.m. in each municipality and, besides this, if the occurrence of a determined accident count in the week under study is influenced by the population of each municipality i (i = 1, …, 100). Therefore, for the zero-inflated Poisson regression model, the parameters of the following expressions should be estimated:

$$plogit_i = \frac{1}{1 + e^{-(\gamma + \delta_1 \cdot age_i + \delta_2 \cdot drylaw_i)}}$$

and

$$\lambda_i = e^{\alpha + \beta \cdot pop_i}$$

First, let’s analyze the distribution of the accidents variable, typing in the following commands:

tab accidents
hist accidents, discrete freq

Figs. 15.77 and 15.78 present the table of frequencies and the histogram, respectively, by means of which it is possible to see that 58% of the municipalities analyzed did not present any traffic accident in the week researched, which indicates, even if only preliminarily, the existence of an excessive amount of zeros in the dependent variable.

Fig. 15.77
Fig. 15.77 Frequency distribution for count data of accidents variable.
Fig. 15.78
Fig. 15.78 Histogram of accidents dependent variable.

To elaborate the zero-inflated Poisson regression model, we should type in the following command:

zip accidents pop, inf(age drylaw) vuong nolog

where the X explanatory variable (pop) should come immediately after the dependent variable (accidents) and the W1 and W2 variables (age and drylaw) should come in parentheses, immediately after the term inf, which means inflate and corresponds to the inflation of structural zeros. The term vuong causes the Vuong (1989) test to be executed, which verifies the adequacy of the zero-inflated model in relation to the corresponding traditional model (in this case, Poisson), or rather, its goal is to verify the existence of an excessive amount of zeros in the dependent variable. The term nolog omits the outputs referring to the estimation iterations, so that only the maximum value of the log-likelihood function is presented.

Besides this, it is important to mention that the command presented implicitly adopts, as standard, the logit model probability expression to verify the existence of structural zeros referring to the Bernoulli distribution. However, in case the researcher opts to work with the probit model probability expression, studied in the Appendix of Chapter 14, the term probit should be added to the end of the command.

The outputs are found in Fig. 15.79.

Fig. 15.79
Fig. 15.79 Zero-inflated Poisson regression model outputs in Stata.

The first result that should be analyzed refers to the Vuong test, whose statistic is normally distributed, with positive and significant values indicating the adequacy of the zero-inflated Poisson model and negative and significant values indicating the adequacy of the traditional Poisson model. For the data in our example, the Vuong test indicates the better adequacy of the zero-inflated model over the traditional model, with z = 4.19 and Pr > z = 0.000.
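The core of the Vuong statistic can be sketched directly from its definition: it is the standardized mean of the pointwise log-likelihood ratios between the two competing models. In the sketch below the fitted probabilities are hypothetical; in a real application they would be each observation's fitted probability under the ZIP and under the traditional Poisson model:

```python
import math

def vuong_z(f_model1, f_model2):
    """Vuong (1989) statistic: m_i = ln(f1_i / f2_i) is the pointwise
    log-likelihood ratio between two non-nested models, and z is the
    standardized mean of the m_i. Positive, significant z favors model 1
    (here, the zero-inflated one); negative z favors model 2."""
    n = len(f_model1)
    m = [math.log(a / b) for a, b in zip(f_model1, f_model2)]
    mean_m = sum(m) / n
    sd_m = math.sqrt(sum((x - mean_m) ** 2 for x in m) / n)
    return math.sqrt(n) * mean_m / sd_m

# Hypothetical fitted probabilities of each observed count under a ZIP
# model (f1) and under a traditional Poisson model (f2):
f1 = [0.30, 0.20, 0.25, 0.15, 0.35]
f2 = [0.25, 0.18, 0.20, 0.14, 0.30]
z = vuong_z(f1, f2)   # positive: the zero-inflated model fits better pointwise
```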

Before analyzing the remaining outputs, it is important to mention that Desmarais and Harden (2013) propose a correction to the Vuong test, based on the Akaike information criterion (AIC) and the Bayesian (Schwarz) information criterion (BIC), which should be elaborated so as to eliminate eventual biases that can affect the choice of the more adequate model. To do this, one need only substitute the term zip with zipcv (which means zero-inflated Poisson with corrected Vuong), and the new command will be as follows:

zipcv accidents pop, inf(age drylaw) vuong nolog

However, before its elaboration in Stata, we should install the zipcv command by typing findit zipcv and clicking on the link st0319 from http://www.stata-journal.com/software/sj13-4. Next, we should click on click here to install.

The new outputs are found in Fig. 15.80.

Fig. 15.80
Fig. 15.80 Zero-inflated Poisson regression model with Vuong test correction outputs in Stata.

For the data in our example, while the Vuong test statistic is z = 4.19, the AIC- and BIC-corrected statistics are z = 4.13 and z = 4.04, respectively, all of them with Pr > z = 0.000. In other words, the results of the Vuong test with AIC and BIC correction continue to allow us, in this case, to state that the zero-inflated model is the most appropriate.

Notice that the remaining outputs presented in Figs. 15.79 and 15.80 are exactly the same. Based on them, we can see that the estimated parameters are statistically different from zero at the 95% confidence level, and the final expressions of plogit_i and of λ_i are given by:

$$plogit_i = \frac{1}{1 + e^{-(-11.729 + 0.225 \cdot age_i + 1.726 \cdot drylaw_i)}}$$

and

$$\lambda_i = e^{0.933 + 0.504 \cdot pop_i}$$

A more curious researcher could obtain these same outputs by means of the Accidents ZIP Maximum Likelihood.xls file, using the Excel Solver tool, as has been the standard adopted throughout this chapter and book. In this file, the Solver criteria have already been defined.

Therefore, using Expression (15.32) and the estimated parameters, we can calculate algebraically, as follows, the average expected number of weekly traffic accidents in a municipality of 700,000 inhabitants, with an average driver age of 40, and that does not adopt a dry law after 10:00 p.m.:

$$\lambda_{inflate} = \left[1 - \frac{1}{1 + e^{-(-11.729 + 0.225 \cdot 40 + 1.726 \cdot 0)}}\right] \cdot e^{0.933 + 0.504 \cdot 0.700} = 3.39$$
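This algebraic result can be verified numerically with the estimated coefficients reported above (a quick sketch):

```python
import math

# ZIP expected weekly accidents for pop = 0.7 (700,000 inhabitants),
# age = 40 and drylaw = 0, using the estimated parameters above
plogit = 1.0 / (1.0 + math.exp(-(-11.729 + 0.225 * 40 + 1.726 * 0)))
lam = math.exp(0.933 + 0.504 * 0.700)
lam_inflate = (1.0 - plogit) * lam     # approximately 3.39
```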

The researcher can find the same result by typing the following command, whose output is found in Fig. 15.81:

mfx, at(pop = 0.7 age = 40 drylaw = 0)

Fig. 15.81
Fig. 15.81 Calculation of expected amount of weekly traffic accidents for values of explanatory variables—mfx command.

Finally, by means of a graph, we can compare the predicted values for the mean number of weekly traffic accidents obtained by the zero-inflated Poisson regression model with those obtained by a traditional Poisson regression model, without considering, therefore, the variables that influence the occurrence of structural zeros, or rather, the dichotomic component (age and drylaw variables). To do this, we can type in the following sequence of commands:

quietly zipcv accidents pop, inf(age drylaw) vuong nolog
predict lambda_inf
quietly poisson accidents pop
predict lambda
graph twoway scatter accidents pop || mspline lambda_inf pop || mspline lambda pop ||, legend(label(2 "ZIP") label(3 "Poisson"))

The generated graph is found in Fig. 15.82 and, by its means, we can see that the predicted values for the zero-inflated Poisson regression model (ZIP) were adjusted more adequately to the excessive amount of zeros in the dependent variable.

Fig. 15.82
Fig. 15.82 Expected number of weekly traffic accidents × municipality population (pop) for the ZIP and Poisson regression models.

Next, we will analyze, based on the same dataset, the results obtained by means of the zero-inflated negative binomial regression model.

A.3 Example: Zero-Inflated Negative Binomial Regression Model in Stata

Following the same logic, we will again use the Accidents.dta dataset; however, we will now focus on the estimation of a zero-inflated negative binomial model. Therefore, the parameters for the following expressions will be estimated.

$$plogit_i = \frac{1}{1 + e^{-(\gamma + \delta_1 \cdot age_i + \delta_2 \cdot drylaw_i)}}$$

and

$$u_i = e^{\alpha + \beta \cdot pop_i}$$

As has been done throughout the chapter, we first analyze the mean and variance of the accidents variable, typing in the following command.

tabstat accidents, stats(mean var)

Fig. 15.83 presents the generated result.

Fig. 15.83
Fig. 15.83 Mean and variance of the accidents dependent variable.
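This mean-variance diagnostic is simple to reproduce outside Stata; in the sketch below the counts are illustrative, not the Accidents.dta data:

```python
import statistics

# Illustrative weekly accident counts for 20 hypothetical municipalities
accidents = [0, 0, 0, 5, 0, 12, 0, 3, 0, 0, 7, 1, 0, 0, 9, 2, 0, 0, 4, 0]

mean = statistics.mean(accidents)
var = statistics.variance(accidents)   # sample variance, as tabstat reports
ratio = var / mean                     # equals 1 under a pure Poisson process
```

A ratio well above 1, as in this toy sample, is the preliminary sign of overdispersion discussed throughout the chapter.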

As we can see, the variance of the dependent variable is about 14 times greater than its mean, which gives a strong indication of the existence of overdispersion in the data. Let's, therefore, estimate the zero-inflated negative binomial regression model. To do this, we should type in the following command:

zinbcv accidents pop, inf(age drylaw) vuong nolog zip

which follows the same logic as the command used to estimate the ZIP model. Notice that we opted to use the term zinbcv (zero-inflated negative binomial with corrected Vuong) instead of zinb since, even though the estimated parameters are exactly the same, the former also presents the Vuong test with AIC and BIC correction. Besides this, the term zip at the end of the command causes the likelihood-ratio test for the ϕ parameter (alpha in Stata) to be carried out, or rather, it provides a comparison of the adequacy of the ZINB model in relation to the ZIP model. The outputs are presented in Fig. 15.84.

Fig. 15.84
Fig. 15.84 Zero-inflated negative binomial regression model outputs in Stata.

First, we can see that the confidence interval for the ϕ parameter, which is the inverse of the shape parameter ψ of the Gamma distribution and which Stata calls alpha, does not contain zero, or rather, at the 95% confidence level we can state that ϕ is statistically different from zero, with an estimated value equal to 1.271. By means of the likelihood-ratio test for the ϕ parameter, we can reject, at the 5% significance level, the null hypothesis that this parameter is statistically equal to zero (Sig. χ² = 0.000 < 0.05), which proves the existence of overdispersion in the data and indicates that the ZINB model is preferable to the ZIP model.

Besides this, the Vuong test with AIC and BIC correction, by presenting significant z statistics at the 95% confidence level, indicates that the zero-inflated negative binomial regression model is preferable to the traditional negative binomial model, since it confirms the existence of an excessive amount of zeros.

We can also see that the estimated parameter of the pop variable is statistically different from zero at the 95% confidence level, or rather, this variable is significant to explain the behavior of the weekly amount of traffic accidents (count component). In the same way, the age and drylaw variables are statistically significant to explain the excessive amount of zeros (structural zeros) in the accidents variable (dichotomic component).

Based on these outputs, we come to the final expressions for plogiti and for ui, given by:

$$plogit_i = \frac{1}{1 + e^{-(-16.237 + 0.288 \cdot age_i + 2.859 \cdot drylaw_i)}}$$

and

$$u_i = e^{0.025 + 0.866 \cdot pop_i}$$

Once again, a curious researcher can obtain these same outputs by means of the Accidents ZINB Maximum Likelihood.xls file, using the Excel Solver tool, according to the standard adopted throughout this chapter and book. In this file, the Solver criteria have been previously defined.

Using Expression (15.36) and the estimated parameters, we can again calculate, algebraically, the average expected amount of weekly traffic accidents for a municipality of 700,000 inhabitants, with an average driver age of 40, and that does not have a dry law after 10:00 p.m., as follows:

$$u_{inflate} = \left[1 - \frac{1}{1 + e^{-(-16.237 + 0.288 \cdot 40 + 2.859 \cdot 0)}}\right] \cdot e^{0.025 + 0.866 \cdot 0.700} = 1.86$$
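As before, the result can be verified numerically with the estimated coefficients (a quick sketch):

```python
import math

# ZINB expected weekly accidents for pop = 0.7, age = 40 and drylaw = 0,
# using the estimated parameters above
plogit = 1.0 / (1.0 + math.exp(-(-16.237 + 0.288 * 40 + 2.859 * 0)))
u = math.exp(0.025 + 0.866 * 0.700)
u_inflate = (1.0 - plogit) * u         # approximately 1.86
```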

The researcher can also find the same result by typing in the following command, whose output is presented in Fig. 15.85:

mfx, at(pop = 0.7 age = 40 drylaw = 0)

Fig. 15.85
Fig. 15.85 Calculation of expected amount of weekly traffic accidents for values of explanatory variables—mfx command.

Theoretically, the modeling could be finalized at this point. However, if the researcher is also interested in estimating the parameters of a ZIP model, so as to compare them with those obtained by the ZINB model, the following sequence of commands can be typed:

eststo: quietly zip accidents pop, inf(age drylaw) vuong
prcounts lambda_inflate, plot
eststo: quietly zinb accidents pop, inf(age drylaw) vuong
prcounts u_inflate, plot
esttab, scalars(ll) se

which generates the outputs presented in Fig. 15.86.

Fig. 15.86
Fig. 15.86 Main results obtained in ZIP and ZINB estimations.

These consolidated outputs allow us to see, besides the differences between the parameters estimated by the two models, that the value obtained for the log-likelihood function (ll) is considerably higher for the ZINB model (model 2 in Fig. 15.86), which is another indication of the better adequacy of this model over the ZIP model for the data in our example.

Another way to compare the ZINB and ZIP estimations is by analyzing the distributions of the observed and predicted probabilities of weekly accident occurrences for the two estimations, analogous to what we discussed throughout the chapter, using the variables generated by the prcounts commands. To do this, we must enter the following command, which will generate the graph in Fig. 15.87:

graph twoway (scatter u_inflateobeq u_inflatepreq lambda_inflatepreq u_inflateval, connect(l l l))

Fig. 15.87
Fig. 15.87 Observed and predicted probability distributions of weekly traffic accidents for the ZINB and ZIP models.

where the variables u_inflatepreq and lambda_inflatepreq correspond to the predicted probabilities of occurrence of 0 to 9 accidents obtained, respectively, by the ZINB and ZIP models. Besides this, while the variable u_inflateobeq corresponds to the observed probabilities of the dependent variable and, therefore, presents the same probability distribution shown in Fig. 15.77 for up to 9 traffic accidents, the variable u_inflateval presents the actual count values from 0 to 9, to which the observed probabilities are related.

By means of analyzing the graph in Fig. 15.87, we see that the estimated distribution (predicted) for the ZINB model probabilities is better adjusted to the observed distribution than the estimated probability distribution for the ZIP model, for a count of up to 9 traffic accidents per week.

Alternatively, as we have discussed throughout the chapter, this fact can also be verified by applying the countfit command, which offers, besides the observed and predicted probabilities for each count (from 0 to 9) of the dependent variable, the error terms resulting from the difference between the probabilities obtained by the ZINB and ZIP models. To do this, we can type the following command:

countfit accidents pop, zip zinb noestimates

which generates the outputs in Fig. 15.88 and the graph in Fig. 15.89.

Fig. 15.88
Fig. 15.88 Observed and predicted probabilities for each count of the dependent variable and the respective error terms.
Fig. 15.89
Fig. 15.89 Error terms resulting from the difference between the observed and predicted probabilities (ZINB and ZIP models).

Figs. 15.88 and 15.89 show us, once again, that the ZINB adjustment is better than the ZIP model adjustment, for the following reasons:

  •  While the maximum difference between the observed and predicted probabilities for the ZIP model is, in absolute value, equal to 0.070, for the ZINB model it is equal to 0.016.
  •  The average of these differences is 0.024 for the ZIP model and 0.006 for the ZINB model.
  •  The total Pearson value is lower in the ZINB model (1.789) than in the ZIP model (61.233).
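The first two summary measures above can be reproduced directly from the two probability distributions; the numbers below are illustrative, not the actual countfit output:

```python
# Observed vs predicted probability distributions for counts 0 through 9
# (illustrative values, not the countfit output):
obs  = [0.58, 0.12, 0.10, 0.06, 0.05, 0.04, 0.03, 0.01, 0.01, 0.00]
pred = [0.55, 0.15, 0.11, 0.07, 0.05, 0.03, 0.02, 0.01, 0.01, 0.00]

diffs = [abs(o - p) for o, p in zip(obs, pred)]
max_diff = max(diffs)                  # analogue of the 0.070 vs 0.016 comparison
mean_diff = sum(diffs) / len(diffs)    # analogue of the 0.024 vs 0.006 comparison
```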

The graph in Fig. 15.89 allows a visual comparison of the generated error terms, highlighting the superior adjustment of the ZINB model, whose error curve is consistently closer to zero.

As was done previously, we can also graphically compare the predicted values of the mean quantity of weekly traffic accidents obtained by the ZIP and ZINB models with those obtained by the corresponding traditional Poisson and negative binomial regression models (nbreg command), without consideration of the variables that influence the occurrence of structural zeros (age and drylaw variables). To do this, we should type in the following sequence of commands.

quietly poisson accidents pop
predict lambda
quietly nbreg accidents pop
predict u
graph twoway mspline lambda_inflaterate pop || mspline u_inflaterate pop || mspline lambda pop || mspline u pop ||, legend(label(1 "ZIP") label(2 "ZINB") label(3 "Poisson") label(4 "Negative Binomial"))

The generated graph is found in Fig. 15.90.

Fig. 15.90
Fig. 15.90 Expected number of weekly traffic accidents × municipality population (pop) for the ZIP, ZINB, Poisson, and negative binomial regression models.

Two considerations can be made in relation to this graph. The first refers to the variance of the predicted weekly amount of traffic accidents, which causes the ZINB and negative binomial curves to be more elongated at the upper right side of the graph than those generated by the corresponding ZIP and Poisson models, which are not able to capture the existence of overdispersion in the data. Besides this, we can also see that the predicted values generated by the ZINB and ZIP models are better adjusted to the excessive amount of zeros than those of the Poisson and negative binomial models, since they present smaller inclinations, especially for a lower number of expected accidents.

As such, it is important for the researcher to have a complete notion of the regression models for count data, so as to estimate, in the best manner possible, the model parameters while always considering the nature and behavior of the dependent variable that represents the phenomenon under study.

References

Belfiore P., Fávero L.P. Pesquisa operacional: para cursos de administração, contabilidade e economia. Rio de Janeiro: Campus Elsevier; 2012.

Cameron A.C., Trivedi P.K. Microeconometrics Using Stata. Revised edition. College Station: Stata Press; 2009.

Cameron A.C., Trivedi P.K. Regression-based tests for overdispersion in the Poisson model. J. Econometrics. 1990;46(3):347–364.

Desmarais B.A., Harden J.J. Testing for zero inflation in count models: bias correction for the Vuong test. Stata J. 2013;13(4):810–835.

Greenwood M., Yule G.U. An inquiry into the nature of frequency distributions representative of multiple happenings with particular reference to the occurrence of multiple attacks of disease or of repeated accidents. J. Roy. Stat. Soc. Ser. A. 1920;83(2):255–279.

Lambert D. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics. 1992;34(1):1–14.

Lord D., Park P.Y.J. Investigating the effects of the fixed and varying dispersion parameters of Poisson-Gamma models on empirical Bayes estimates. Accid. Anal. Prevent. 2008;40(4):1441–1457.

McCullagh P., Nelder J.A. Generalized Linear Models. second ed. London: Chapman & Hall; 1989.

Tadano Y.S., Ugaya C.M.L., Franco A.T. Método de regressão de Poisson: metodologia para avaliação do impacto da poluição atmosférica na saúde populacional. Ambiente & Sociedade. 2009;XII(2):241–255.

Vuong Q.H. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica. 1989;57(2):307–333.

Wedderburn R.W.M. Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika. 1974;61(3):439–447.


1 Even though not within the scope of this book, many authors compare maximum likelihood estimations for Poisson and negative binomial regression models with maximum likelihood estimations for models that consider the dependent variable as censored, based on the development of what are known as Tobit models. For more information, we recommend a study of Cameron and Trivedi (2009).

2 It is important to mention that, as an alternative to the zero-inflated Poisson and negative binomial regression models, the researcher can also opt to estimate hurdle models when studying the behavior of a determined dependent variable with count data and an excessive amount of zeros. The hurdle models, even though not covered in the present edition of this book, can be studied in Cameron and Trivedi (2009).

3 Note that Expression (15.33) refers to the logit model studied in Chapter 14. The researcher can, however, opt to use the probability expression of the probit model, studied in the Appendix of the same chapter, to investigate the existence of structural zeros referring to the Bernoulli distribution.

"To view the full reference list for the book, click here"
