3.2. Logistic Models for Dichotomous Data with Two Observations Per Person

We begin with the relatively simple situation in which the response variable is a dichotomy and there are exactly two observations for each individual. As in chapter 2, we let yit be the value of the response variable for individual i on occasion t, but now y is constrained to have a value of either 0 or 1. In this section, t = 1 or 2. Let pit be the probability that yit = 1. It is convenient to assume that the dependence of pit on possible predictor variables is described by a logistic regression model

log[pit/(1 − pit)] = μt + βxit + γzi + αi,   (3.1)

where zi is a column vector of variables that describe the individuals but do not vary over time, and xit is a column vector of variables that vary both over individuals and over time for each individual. In this equation, μt is an intercept that is allowed to vary with time, and β and γ are row vectors of coefficients. As in chapter 2, αi represents all differences between persons that are stable over time and not otherwise accounted for by zi. Again, we regard these as fixed parameters, one per person. Additionally, we assume that for a given individual i (and hence a given value of αi), yi1 and yi2 are independent. That is,

Pr(yi1 = r, yi2 = s | αi) = Pr(yi1 = r | αi) Pr(yi2 = s | αi),  r, s = 0, 1.   (3.2)

Our goal is to estimate μt and β while controlling for all time-invariant covariates (both measured and unmeasured). To accomplish that, we use only variation within individuals to estimate these parameters. When there are two occasions per individual, we can use a method that is very similar to the difference score method used for quantitative response variables. Let's first consider those individuals who do not change from time 1 to time 2—that is, yi1 = 0 and yi2 = 0, or yi1 = 1 and yi2 = 1. Because there is no within-individual variation on the response variable, such observations contain no information about the parameters μ and β and thus can be discarded from the analysis. That leaves individuals who change from 0 to 1 and those who change from 1 to 0. According to equations (3.2), the probability of those two outcomes is

Pr(yi1 = 0, yi2 = 1 | αi) = (1 − pi1)pi2
Pr(yi1 = 1, yi2 = 0 | αi) = pi1(1 − pi2).

We then take the logarithm of the ratio of these probabilities to get

log{[(1 − pi1)pi2] / [pi1(1 − pi2)]} = log[pi2/(1 − pi2)] − log[pi1/(1 − pi1)].

Substituting from equation (3.1) and rearranging terms gives

log[pi2/(1 − pi2)] − log[pi1/(1 − pi1)] = (μ2 − μ1) + β(xi2 − xi1).   (3.3)

As we found for the linear model, both zi and αi have been "differenced out" of the equation. This result suggests the following method for estimating the parameters:

  • Eliminate all individuals who do not change on the response variable.

  • Create difference scores for all the time-varying predictors.

  • Use maximum likelihood to estimate the logistic regression predicting yi2, with the difference scores as predictor variables.

This procedure is called conditional logistic regression. I'll have more to say about its properties and justification in the next section.
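Since the derivation is somewhat abstract, a small simulation can make the three-step recipe concrete. The Python sketch below is only an illustration, not the book's SAS code: the sample size, true coefficient values, and Newton-Raphson fitting loop (standing in for PROC LOGISTIC's maximum likelihood routine) are all invented for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(42)
n, beta, mu = 4000, 1.0, np.array([0.0, 0.5])   # invented truth: mu2 - mu1 = 0.5

alpha = rng.normal(0.0, 1.0, n)                 # stable person effects, never observed
x = rng.normal(0.0, 1.0, (n, 2))                # one time-varying predictor
p = 1.0 / (1.0 + np.exp(-(mu + beta * x + alpha[:, None])))
y = (rng.random((n, 2)) < p).astype(float)      # dichotomous responses, two per person

changed = y[:, 0] != y[:, 1]                    # step 1: keep only the changers
d = (x[:, 1] - x[:, 0])[changed]                # step 2: difference scores
y2 = y[changed, 1]

# Step 3: ML logistic regression of y2 on an intercept and the difference score
X = np.column_stack([np.ones(d.size), d])
b = np.zeros(2)
for _ in range(25):                             # Newton-Raphson iterations
    pr = 1.0 / (1.0 + np.exp(-X @ b))
    W = pr * (1.0 - pr)
    b += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y2 - pr))

print(b)   # intercept estimates mu2 - mu1, slope estimates beta
```

Even though the αi are never observed, both the slope and the intercept are recovered, because everything stable about a person cancels out of the within-person comparison.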

If there are no covariates, we have the sort of data that we saw in Table 3.1. Here's how to estimate the logistic regression model for those tabular data:

PROC LOGISTIC DATA=smoking DESC;
   WHERE baseline NE oneyear;
   FREQ count;
   MODEL oneyear= / EXPB;
RUN;

The WHERE statement eliminates those observations that did not change from time 1 to time 2. The DESC option (short for "descending") forces PROC LOGISTIC to model the probability that the dependent variable is equal to "yes" rather than equal to "no." Although the formulas used in this chapter assume that the dependent variable y is either 1 or 0, PROC LOGISTIC can actually handle any two values, whether numeric or character. The default is to model the probability of the lower value, in this case lower in the alphabet. The DESC option reverses that to model the higher value. The EXPB option computes the exponentiated value of the coefficients, which can be interpreted as odds ratios.

Results in Output 3.2 are consistent with what we saw in the previous section. The odds ratio of 2.423 is just the simple ratio of the off-diagonal counts. The chi-square of 14.42 is close to, but not identical with, the McNemar statistic of 15.38. While both statistics test the same null hypothesis, McNemar's test is a traditional Pearson chi-square calculated under the assumption that the two expected frequencies are the same, whereas PROC LOGISTIC reports a Wald chi-square, which is the squared ratio of the coefficient to its estimated standard error.
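As a check, every number in Output 3.2 can be reproduced from the two discordant counts alone. A short Python calculation (counts taken from the response profile):

```python
from math import log, sqrt

n01, n10 = 63, 26                          # discordant counts from Output 3.2

or_hat = n01 / n10                         # odds ratio: simple ratio of the counts
b = log(or_hat)                            # intercept of the no-covariate model
se = sqrt(1 / n01 + 1 / n10)               # its estimated standard error
wald = (b / se) ** 2                       # Wald chi-square reported by PROC LOGISTIC
mcnemar = (n01 - n10) ** 2 / (n01 + n10)   # Pearson-type McNemar statistic

print(round(or_hat, 3), round(b, 4), round(se, 4), round(wald, 2), round(mcnemar, 2))
```

The two test statistics differ (14.42 versus 15.38) even though both are built from the same pair of counts.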

Table 3.3. Output 3.2 PROC LOGISTIC Estimates for Smoking Data

Response Profile

Ordered Value   oneyear   Total Frequency
1               yes                    63
2               no                     26

Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq   Exp(Est)
Intercept    1     0.8850           0.2331           14.4161       0.0001      2.423

Now let's consider an example with predictor variables. The sample consists of 1151 girls from the National Longitudinal Survey of Youth (www.bls.gov/nls) who were interviewed annually for nine years, beginning in 1979. For this example, we'll use data only from year 1 and year 5. The response variable POV has a value of 1 if the girl's household was in poverty (as defined by U.S. federal standards) in a given year, and otherwise has a value of 0. The predictor variables are:

AGE        Age in years at the first interview
BLACK      1 if respondent is black, otherwise 0
MOTHER     1 if respondent currently has at least one child, otherwise 0
SPOUSE     1 if respondent is currently living with a spouse, otherwise 0
INSCHOOL   1 if respondent is currently enrolled in school, otherwise 0
HOURS      Hours worked during the week of the survey

The first two variables are time-invariant, whereas the last four may differ at each interview.

The data set MY.TEENPOV has one record for each of the 1151 respondents, with different variable names for the same variable measured in different years. For simplicity, the data set contains only respondents who have no missing data on any of the variables. Let's first check the joint distribution of the dependent variables:

PROC FREQ DATA=my.teenpov;
   TABLES pov1*pov5 / NOROW NOCOL NOPCT AGREE;
RUN;

We see from Output 3.3 that although 445 girls changed status during the five-year period, there was only a slight increase in the proportion in poverty. This increase is not statistically significant, according to the McNemar statistic.

Table 3.4. Output 3.3 Contingency Table for Poverty in Years 1 and 5

Table of pov1 by pov5

Frequency        pov5
pov1            0      1   Total
0             516    234     750
1             211    190     401
Total         727    424    1151

McNemar's Test

Statistic (S)   1.1888
DF                   1
Pr > S          0.2756
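The quantities just discussed can be verified directly from the cell counts in Output 3.3; for instance, in Python:

```python
# Cell counts (pov1, pov5) from Output 3.3
counts = {(0, 0): 516, (0, 1): 234, (1, 0): 211, (1, 1): 190}

total = sum(counts.values())                           # 1151 girls
pov_year1 = (counts[(1, 0)] + counts[(1, 1)]) / total  # proportion poor in year 1
pov_year5 = (counts[(0, 1)] + counts[(1, 1)]) / total  # proportion poor in year 5
changers = counts[(0, 1)] + counts[(1, 0)]             # girls who changed status
mcnemar = (counts[(0, 1)] - counts[(1, 0)]) ** 2 / changers

print(round(pov_year1, 3), round(pov_year5, 3), changers, round(mcnemar, 4))
```

The proportion in poverty rises only from about .348 to .368, and the McNemar statistic of 1.1888 confirms that this shift is not statistically significant.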

To do the logistic regression analysis, the first step is to create a new data set that excludes those girls whose poverty status was the same in years 1 and 5, and defines new variables that are differences between the values for year 5 and for year 1:

DATA teendif;
   SET my.teenpov;
   IF pov1=pov5 THEN DELETE;
   mother=mother5-mother1;
   spouse=spouse5-spouse1;
   inschool=inschool5-inschool1;
   hours=hours5-hours1;
RUN;

Next, we estimate a logistic regression with POV5 as the dependent variable, and difference scores and time-invariant predictors as independent variables:

PROC LOGISTIC DATA=teendif DESC;
   MODEL pov5=mother spouse inschool hours black age;
RUN;

Output 3.4 gives the results. Although the time-varying predictors are expressed as difference scores, their coefficients should be interpreted as they appear in equation (3.1)—that is, as the effect of the value of the variable in a given year on the probability of poverty in that same year. Thus, the odds ratio for MOTHER tells us that the odds of being in poverty were twice as high in years when girls had children as compared with years in which they did not have children (controlling for other variables). On the other hand, when girls lived with husbands, their odds of poverty were only 35% as large as when they did not live with husbands. Each additional hour of work per week reduced the odds of poverty by 100(1 − .967) = 3.3%.
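These odds ratios are just the exponentiated coefficients from Output 3.4, as a quick Python check confirms:

```python
from math import exp

# Coefficients from Output 3.4
coef = {"mother": 0.7436, "spouse": -1.0317, "inschool": 0.3394,
        "hours": -0.0339, "black": -0.5263, "age": -0.2577}

odds_ratios = {k: round(exp(b), 3) for k, b in coef.items()}
pct_change = {k: round(100 * (exp(b) - 1), 1) for k, b in coef.items()}

print(odds_ratios["mother"])   # odds roughly doubled in years with a child
print(pct_change["hours"])     # each extra work hour cuts the odds by about 3.3%
```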

Table 3.5. Output 3.4 PROC LOGISTIC Output for Regression on Difference Scores

Response Profile

Ordered Value   pov5   Total Frequency
1               1                  234
2               0                  211

Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept    1     4.8993          1.6438            8.8829       0.0029
mother       1     0.7436          0.2538            8.5862       0.0034
spouse       1    −1.0317          0.2918           12.5014       0.0004
inschool     1     0.3394          0.2178            2.4287       0.1191
hours        1    −0.0339          0.00623          29.7027       <.0001
black        1    −0.5263          0.2164            5.9154       0.0150
age          1    −0.2577          0.1029            6.2739       0.0123

Odds Ratio Estimates

Effect     Point Estimate   95% Wald Confidence Limits
mother              2.103        1.279   3.459
spouse              0.356        0.201   0.631
inschool            1.404        0.916   2.152
hours               0.967        0.955   0.978
black               0.591        0.387   0.903
age                 0.773        0.632   0.945

The coefficients (and odds ratios) for BLACK and AGE must be interpreted somewhat differently. According to equation (3.3), as time-invariant predictors these variables shouldn't even be in the model. In fact, they represent interactions between time-invariant predictor variables and time itself, so that the rate of change in the odds of poverty depends on the value of these variables. More specifically, for a girl whose predictor variables did not change from year 1 to year 5, the change in the log-odds of poverty over the five-year period can be expressed as

4.8993 − .5263×BLACK − .2577×AGE.

Thus, for a 14-year-old girl who was not black and who did not change on any of the other predictors, the predicted change in the log-odds is +1.29. Equivalently, her odds of being in poverty increase by a factor of exp(1.29) = 3.63. We conclude that blacks and girls who were older at year 1 had a lower rate of increase in poverty.
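The arithmetic behind that prediction can be sketched in Python (coefficients from Output 3.4; the function name is mine, not SAS's):

```python
from math import exp

intercept, b_black, b_age = 4.8993, -0.5263, -0.2577   # from Output 3.4

def five_year_change(black, age):
    """Predicted change in the log-odds of poverty from year 1 to year 5
    for a girl whose time-varying predictors did not change."""
    return intercept + b_black * black + b_age * age

delta = five_year_change(black=0, age=14)
print(round(delta, 2), round(exp(delta), 2))   # log-odds change and odds multiplier
```

A black girl of the same age would instead have a predicted change of 1.29 − .5263 = .77, which is why the rate of increase is lower for blacks.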

As in the linear difference score model of chapter 2, we can also test for constancy in the effects of the time-varying predictors by including their year 1 values in the model. The coefficient for each year 1 variable represents the difference between that variable's effect in year 5 and its effect in year 1. We see in Output 3.5 that the only time-varying predictor whose effect changes significantly from year 1 to year 5 is INSCHOOL. The implied coefficient for year 5 is .6389, and the implied coefficient for year 1 is .6389 − 1.1838 = −.5449. It therefore appears that school enrollment is associated with an increased risk of poverty in the later year and a reduced risk in the earlier year. Incidentally, the proportion of girls attending school is about 89% in year 1 and 80% in year 5.

Table 3.6. Output 3.5 Difference Regression with Variables for Year 1 Added

Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept    1     3.0518          1.8261            2.7928       0.0947
mother       1     0.9093          0.2696           11.3770       0.0007
mother1      1     0.4565          0.4596            0.9868       0.3205
spouse       1    −1.0220          0.3009           11.5391       0.0007
spouse1      1     0.4422          0.7260            0.3710       0.5425
inschool     1     0.6389          0.2511            6.4722       0.0110
inschool1    1     1.1838          0.4707            6.3254       0.0119
hours        1    −0.0339          0.00677          25.0932       <.0001
hours1       1    −0.00238         0.0128            0.0343       0.8531
black        1    −0.6617          0.2264            8.5442       0.0035
age          1    −0.1961          0.1106            3.1440       0.0762

In chapter 2, we saw that the results from a difference score regression could be replicated by creating a separate record for each person at each point in time, and then estimating a regression model that includes a dummy variable for every person (except one). Let's try that for the logistic regression. The first step is to restructure the data set so there's a separate record for each person in each year:

DATA teenyrs2;
   SET my.teenpov;
     year=1;
     pov=pov1;
     mother=mother1;
     spouse=spouse1;
     inschool=inschool1;
     hours=hours1;
     OUTPUT;
     year=2;
     pov=pov5;
     mother=mother5;
     spouse=spouse5;
     inschool=inschool5;
     hours=hours5;
     OUTPUT;
   KEEP id year black age pov mother spouse inschool hours;
RUN;

The TEENYRS2 data set has 2,302 records, two for each of the 1151 girls. The time-varying covariates are given the same names for each of the two records.
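For readers more comfortable outside SAS, the same wide-to-long restructuring can be sketched in Python. The single record and its values below are invented for illustration; the field names mirror the SAS variables:

```python
# One hypothetical wide-format record per girl, as in MY.TEENPOV
wide = [
    {"id": 1, "black": 0, "age": 14, "pov1": 0, "pov5": 1,
     "mother1": 0, "mother5": 1, "spouse1": 0, "spouse5": 0,
     "inschool1": 1, "inschool5": 0, "hours1": 0, "hours5": 20},
]

long = []
for row in wide:
    # year 1 is coded 1 and year 5 is coded 2, as in the DATA step
    for year, suffix in ((1, "1"), (2, "5")):
        long.append({
            "id": row["id"], "year": year,
            "black": row["black"], "age": row["age"],   # time-invariant: repeated
            **{v: row[v + suffix]
               for v in ("pov", "mother", "spouse", "inschool", "hours")},
        })

print(len(long))   # two records per person, with common variable names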

Now we're ready to estimate the logistic regression model:

PROC LOGISTIC DATA=teenyrs2 DESC;
   CLASS id / PARAM=REF;
   MODEL pov=year mother spouse inschool hours year*black
         year*age id;
RUN;

The CLASS statement tells PROC LOGISTIC to create a set of dummy variables, one for each value of ID except for the highest. The PARAM=REF option says to make one of the ID numbers the reference category, by default the highest ID number. Note that in the MODEL statement, BLACK and AGE are entered as interactions with YEAR, but with no corresponding main effects.

This model took about 1.5 minutes to estimate on my PC. The printed output was extremely voluminous because LOGISTIC reported (a) a 1,150 × 1,150 matrix describing the coding of the dummy variables, (b) coefficients for the 1,150 dummy variables, and (c) odds ratios contrasting each person with the reference person. In Output 3.6, I've excluded everything but the coefficient information for the other predictor variables.

Not only is this method cumbersome, but it also gives the wrong results. In Output 3.6, we find that every coefficient is exactly twice as large as the corresponding coefficient in Output 3.4, obtained with conditional logistic regression. This is a quite general result (Abrevaya 1997): whenever you do logistic regression with dummy variables for individuals and exactly two observations for each individual, the coefficients will be twice as large as the coefficients from conditional logistic regression. The chi-squares and standard errors in Output 3.6 are also incorrect. The chi-squares are exactly twice as large as those in Output 3.4, and the standard errors are √2 times those in Output 3.4.

Table 3.7. Output 3.6 Logistic Regression Estimates with Dummy Variables for Persons

Analysis of Maximum Likelihood Estimates

Parameter    DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept     1   −21.3931          2293.3            0.0001       0.9926
year          1     9.7990          2.3248           17.7667       <.0001
mother        1     1.4872          0.3589           17.1744       <.0001
spouse        1    −2.0635          0.4126           25.0059       <.0001
inschool      1     0.6789          0.3080            4.8577       0.0275
hours         1    −0.0679          0.00881          59.4107       <.0001
year*black    1    −1.0526          0.3060           11.8321       0.0006
year*age      1    −0.5154          0.1455           12.5485       0.0004

When there are more than two observations per person and/or varying numbers of observations per person, there won't be such a neat scaling of the coefficients and chi-squares. In any case, logistic regression with dummy variables for individuals will generally give biased coefficient estimates. I'll have more to say about the reasons for this in the next section.
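The doubling result is easy to verify numerically in the no-covariate case. The Python sketch below fits the dummy-variable model by Newton-Raphson to the smoking data's 89 discordant pairs (non-changers are omitted because their person dummies have no finite ML estimate) and compares the estimated time effect with the conditional estimate log(63/26):

```python
import numpy as np

# Discordant pairs from the smoking data: 63 change 0->1, 26 change 1->0
n01, n10 = 63, 26
n = n01 + n10
y = np.array([[0, 1]] * n01 + [[1, 0]] * n10, dtype=float)   # (persons, occasions)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Model: logit Pr(y_it = 1) = alpha_i + delta * I(t = 2); one dummy per person
alpha, delta = np.zeros(n), 0.0
for _ in range(50):                                # Newton-Raphson iterations
    p = sigmoid(np.stack([alpha, alpha + delta], axis=1))
    r, w = y - p, p * (1.0 - p)
    g_a, g_d = r.sum(axis=1), r[:, 1].sum()        # scores for alpha_i and delta
    h_aa, h_ad, h_dd = w.sum(axis=1), w[:, 1], w[:, 1].sum()
    # Solve the Newton system via the Schur complement on delta
    step_d = (g_d - np.sum(h_ad * g_a / h_aa)) / (h_dd - np.sum(h_ad ** 2 / h_aa))
    alpha += (g_a - h_ad * step_d) / h_aa
    delta += step_d

print(delta, 2 * np.log(n01 / n10))   # dummy-variable estimate vs. twice conditional
```

The dummy-variable estimate converges to 2·log(63/26) ≈ 1.770, exactly twice the conditional estimate of 0.885.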
