3.2. Logistic Models for Dichotomous Data with Two Observations Per Person

We begin with the relatively simple situation in which the response variable is a dichotomy and there are exactly two observations for each individual. As in chapter 2, we let yit be the value of the response variable for individual i on occasion t, but now y is constrained to have a value of either 0 or 1. In this section, t = 1 or 2. Let pit be the probability that yit = 1. It is convenient to assume that the dependence of pit on possible predictor variables is described by a logistic regression model

log[pit/(1 − pit)] = μt + βxit + γzi + αi,   (3.1)

where zi is a column vector of variables that describe the individuals but do not vary over time, and xit is a column vector of variables that vary both over individuals and over time for each individual. In this equation, μt is an intercept that is allowed to vary with time, and β and γ are row vectors of coefficients. As in chapter 2, αi represents all differences between persons that are stable over time and not otherwise accounted for by zi. Again, we regard these as fixed parameters, one per person. Additionally, we assume that for a given individual i (and hence a given value of αi), yi1 and yi2 are independent. That is,

Pr(yi1 = r, yi2 = s | αi) = Pr(yi1 = r | αi) Pr(yi2 = s | αi),  r, s = 0, 1.   (3.2)

Our goal is to estimate μt and β while controlling for all time-invariant covariates (both measured and unmeasured). To accomplish that, we use only variation within individuals to estimate these parameters. When there are two occasions per individual, we can use a method that is very similar to the difference score method used for quantitative response variables. Let's first consider those individuals who do not change from time 1 to time 2—that is, yi1 = 0 and yi2 = 0, or yi1 = 1 and yi2 = 1. Because there is no within-individual variation on the response variable, such observations contain no information about the parameters μ and β and thus can be discarded from the analysis. That leaves individuals who change from 0 to 1 and those who change from 1 to 0. According to equations (3.2), the probability of those two outcomes is

Pr(yi1 = 0, yi2 = 1 | αi) = (1 − pi1)pi2
Pr(yi1 = 1, yi2 = 0 | αi) = pi1(1 − pi2).

We then take the logarithm of the ratio of these probabilities to get

log{[(1 − pi1)pi2] / [pi1(1 − pi2)]} = log[pi2/(1 − pi2)] − log[pi1/(1 − pi1)].

Substituting from equation (3.1) and rearranging terms gives

log[pi2/(1 − pi2)] − log[pi1/(1 − pi1)] = (μ2 − μ1) + β(xi2 − xi1).   (3.3)

As we found for the linear model, both zi and αi have been "differenced out" of the equation. This result suggests the following method for estimating the parameters:

  • Eliminate all individuals who do not change on the response variable.

  • Create difference scores for all the time-varying predictors.

  • Use maximum likelihood to estimate the logistic regression predicting yi2, with the difference scores as predictor variables.

This procedure is called conditional logistic regression. I'll have more to say about its properties and justification in the next section.
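Since the derivation is somewhat abstract, a small simulation can make the three-step recipe concrete. The Python sketch below is only an illustration, not the book's SAS code: the sample size, true coefficient values, and Newton-Raphson fitting loop (standing in for PROC LOGISTIC's maximum likelihood routine) are all invented for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(42)
n, beta, mu = 4000, 1.0, np.array([0.0, 0.5])   # invented truth: mu2 - mu1 = 0.5

alpha = rng.normal(0.0, 1.0, n)                 # stable person effects, never observed
x = rng.normal(0.0, 1.0, (n, 2))                # one time-varying predictor
p = 1.0 / (1.0 + np.exp(-(mu + beta * x + alpha[:, None])))
y = (rng.random((n, 2)) < p).astype(float)      # dichotomous responses, two per person

changed = y[:, 0] != y[:, 1]                    # step 1: keep only the changers
d = (x[:, 1] - x[:, 0])[changed]                # step 2: difference scores
y2 = y[changed, 1]

# Step 3: ML logistic regression of y2 on an intercept and the difference score
X = np.column_stack([np.ones(d.size), d])
b = np.zeros(2)
for _ in range(25):                             # Newton-Raphson iterations
    pr = 1.0 / (1.0 + np.exp(-X @ b))
    W = pr * (1.0 - pr)
    b += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y2 - pr))

print(b)   # intercept estimates mu2 - mu1, slope estimates beta
```

Even though the αi are never observed, both the slope and the intercept are recovered, because everything stable about a person cancels out of the within-person comparison.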

If there are no covariates, we have the sort of data that we saw in Table 3.1. Here's how to estimate the logistic regression model for those tabular data:

PROC LOGISTIC DATA=smoking DESC;
   WHERE baseline NE oneyear;
   FREQ count;
   MODEL oneyear= / EXPB;
RUN;

The WHERE statement eliminates those observations that did not change from time 1 to time 2. The DESC option (short for "descending") forces PROC LOGISTIC to model the probability that the dependent variable is equal to "yes" rather than equal to "no." Although the formulas used in this chapter assume that the dependent variable y is either 1 or 0, PROC LOGISTIC can actually handle any two values, whether numeric or character. The default is to model the probability of the lower value, in this case lower in the alphabet. The DESC option reverses that to model the higher value. The EXPB option computes the exponentiated value of the coefficients, which can be interpreted as odds ratios.

Results in Output 3.2 are consistent with what we saw in the previous section. The odds ratio of 2.423 is just the simple ratio of the off-diagonal counts. The chi-square of 14.42 is close to, but not identical with, the McNemar statistic of 15.38. While both statistics test the same null hypothesis, McNemar's test is a traditional Pearson chi-square calculated under the assumption that the two expected frequencies are the same, whereas PROC LOGISTIC reports a Wald chi-square, which is the squared ratio of the coefficient to its estimated standard error.
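As a check, every number in Output 3.2 can be reproduced from the two discordant counts alone. A short Python calculation (counts taken from the response profile):

```python
from math import log, sqrt

n01, n10 = 63, 26                          # discordant counts from Output 3.2

or_hat = n01 / n10                         # odds ratio: simple ratio of the counts
b = log(or_hat)                            # intercept of the no-covariate model
se = sqrt(1 / n01 + 1 / n10)               # its estimated standard error
wald = (b / se) ** 2                       # Wald chi-square reported by PROC LOGISTIC
mcnemar = (n01 - n10) ** 2 / (n01 + n10)   # Pearson-type McNemar statistic

print(round(or_hat, 3), round(b, 4), round(se, 4), round(wald, 2), round(mcnemar, 2))
```

The two test statistics differ (14.42 versus 15.38) even though both are built from the same pair of counts.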

Table 3.3. Output 3.2 PROC LOGISTIC Estimates for Smoking Data

Response Profile

Ordered Value   oneyear   Total Frequency
1               yes                    63
2               no                     26

Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq   Exp(Est)
Intercept    1     0.8850           0.2331           14.4161       0.0001      2.423

Now let's consider an example with predictor variables. The sample consists of 1151 girls from the National Longitudinal Survey of Youth (www.bls.gov/nls) who were interviewed annually for nine years, beginning in 1979. For this example, we'll use data only from year 1 and year 5. The response variable POV has a value of 1 if the girl's household was in poverty (as defined by U.S. federal standards) in a given year, and otherwise has a value of 0. The predictor variables are:

AGE        Age in years at the first interview
BLACK      1 if respondent is black, otherwise 0
MOTHER     1 if respondent currently has at least one child, otherwise 0
SPOUSE     1 if respondent is currently living with a spouse, otherwise 0
INSCHOOL   1 if respondent is currently enrolled in school, otherwise 0
HOURS      Hours worked during the week of the survey

The first two variables are time-invariant, whereas the last four may differ at each interview.

The data set MY.TEENPOV has one record for each of the 1151 respondents, with different variable names for the same variable measured in different years. For simplicity, the data set contains only respondents who have no missing data on any of the variables. Let's first check the joint distribution of the dependent variables:

PROC FREQ DATA=my.teenpov;
   TABLES pov1*pov5 / NOROW NOCOL NOPCT AGREE;
RUN;

We see from Output 3.3 that although 445 girls changed status during the five-year period, there was only a slight increase in the proportion in poverty. This increase is not statistically significant, according to the McNemar statistic.

Table 3.4. Output 3.3 Contingency Table for Poverty in Years 1 and 5

Table of pov1 by pov5

Frequency        pov5
pov1            0      1   Total
0             516    234     750
1             211    190     401
Total         727    424    1151

McNemar's Test

Statistic (S)   1.1888
DF                   1
Pr > S          0.2756
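The quantities just discussed can be verified directly from the cell counts in Output 3.3; for instance, in Python:

```python
# Cell counts (pov1, pov5) from Output 3.3
counts = {(0, 0): 516, (0, 1): 234, (1, 0): 211, (1, 1): 190}

total = sum(counts.values())                           # 1151 girls
pov_year1 = (counts[(1, 0)] + counts[(1, 1)]) / total  # proportion poor in year 1
pov_year5 = (counts[(0, 1)] + counts[(1, 1)]) / total  # proportion poor in year 5
changers = counts[(0, 1)] + counts[(1, 0)]             # girls who changed status
mcnemar = (counts[(0, 1)] - counts[(1, 0)]) ** 2 / changers

print(round(pov_year1, 3), round(pov_year5, 3), changers, round(mcnemar, 4))
```

The proportion in poverty rises only from about .348 to .368, and the McNemar statistic of 1.1888 confirms that this shift is not statistically significant.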

To do the logistic regression analysis, the first step is to create a new data set that excludes those girls whose poverty status was the same in years 1 and 5, and defines new variables that are differences between the values for year 5 and for year 1:

DATA teendif;
   SET my.teenpov;
   IF pov1=pov5 THEN DELETE;
   mother=mother5-mother1;
   spouse=spouse5-spouse1;
   inschool=inschool5-inschool1;
   hours=hours5-hours1;
RUN;

Next, we estimate a logistic regression with POV5 as the dependent variable, and difference scores and time-invariant predictors as independent variables:

PROC LOGISTIC DATA=teendif DESC;
   MODEL pov5=mother spouse inschool hours black age;
RUN;

Output 3.4 gives the results. Although the time-varying predictors are expressed as difference scores, their coefficients should be interpreted as they appear in equation (3.1)—that is, as the effect of the value of the variable in a given year on the probability of poverty in that same year. Thus, the odds ratio for MOTHER tells us that the odds of being in poverty were twice as high in years when girls had children as compared with years in which they did not have children (controlling for other variables). On the other hand, when girls lived with husbands, their odds of poverty were only 35% as large as when they did not live with husbands. Each additional hour of work per week reduced the odds of poverty by 100(1 − .967) = 3.3%.
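These odds ratios are just the exponentiated coefficients from Output 3.4, as a quick Python check confirms:

```python
from math import exp

# Coefficients from Output 3.4
coef = {"mother": 0.7436, "spouse": -1.0317, "inschool": 0.3394,
        "hours": -0.0339, "black": -0.5263, "age": -0.2577}

odds_ratios = {k: round(exp(b), 3) for k, b in coef.items()}
pct_change = {k: round(100 * (exp(b) - 1), 1) for k, b in coef.items()}

print(odds_ratios["mother"])   # odds roughly doubled in years with a child
print(pct_change["hours"])     # each extra work hour cuts the odds by about 3.3%
```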

Table 3.5. Output 3.4 PROC LOGISTIC Output for Regression on Difference Scores

Response Profile

Ordered Value   pov5   Total Frequency
1               1                  234
2               0                  211

Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept    1     4.8993          1.6438            8.8829       0.0029
mother       1     0.7436          0.2538            8.5862       0.0034
spouse       1    −1.0317          0.2918           12.5014       0.0004
inschool     1     0.3394          0.2178            2.4287       0.1191
hours        1    −0.0339          0.00623          29.7027       <.0001
black        1    −0.5263          0.2164            5.9154       0.0150
age          1    −0.2577          0.1029            6.2739       0.0123

Odds Ratio Estimates

Effect     Point Estimate   95% Wald Confidence Limits
mother              2.103        1.279   3.459
spouse              0.356        0.201   0.631
inschool            1.404        0.916   2.152
hours               0.967        0.955   0.978
black               0.591        0.387   0.903
age                 0.773        0.632   0.945

The coefficients (and odds ratios) for BLACK and AGE must be interpreted somewhat differently. According to equation (3.3), as time-invariant predictors these variables shouldn't even be in the model. In fact, they represent interactions between time-invariant predictor variables and time itself, so that the rate of change in the odds of poverty depends on the value of these variables. More specifically, for a girl whose predictor variables did not change from year 1 to year 5, the change in the log-odds of poverty over the five-year period can be expressed as

4.8993 − .5263×BLACK − .2577×AGE.

Thus, for a 14-year-old girl who was not black and who did not change on any of the other predictors, the predicted change in the log-odds is +1.29. Equivalently, her odds of being in poverty increase by a factor of exp(1.29) = 3.63. We conclude that blacks and girls who were older at year 1 had a lower rate of increase in poverty.
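The arithmetic behind that prediction can be sketched in Python (coefficients from Output 3.4; the function name is mine, not SAS's):

```python
from math import exp

intercept, b_black, b_age = 4.8993, -0.5263, -0.2577   # from Output 3.4

def five_year_change(black, age):
    """Predicted change in the log-odds of poverty from year 1 to year 5
    for a girl whose time-varying predictors did not change."""
    return intercept + b_black * black + b_age * age

delta = five_year_change(black=0, age=14)
print(round(delta, 2), round(exp(delta), 2))   # log-odds change and odds multiplier
```

A black girl of the same age would instead have a predicted change of 1.29 − .5263 = .77, which is why the rate of increase is lower for blacks.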

As in the linear difference score model of chapter 2, we can also test for constancy in the effects of the time-varying predictors by including their year 1 values in the model. The coefficient for each year 1 variable represents the difference between that variable's effect in year 5 and its effect in year 1. We see in Output 3.5 that the only time-varying predictor whose effect changes significantly from year 1 to year 5 is INSCHOOL. The implied coefficient for year 5 is .6389, and the implied coefficient for year 1 is .6389 − 1.1838 = −.5449. It therefore appears that school enrollment is associated with an increased risk of poverty in the later year and a reduced risk in the earlier year. Incidentally, the proportion of girls attending school is about 89% in year 1 and 80% in year 5.

Table 3.6. Output 3.5 Difference Regression with Variables for Year 1 Added

Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept    1     3.0518          1.8261            2.7928       0.0947
mother       1     0.9093          0.2696           11.3770       0.0007
mother1      1     0.4565          0.4596            0.9868       0.3205
spouse       1    −1.0220          0.3009           11.5391       0.0007
spouse1      1     0.4422          0.7260            0.3710       0.5425
inschool     1     0.6389          0.2511            6.4722       0.0110
inschool1    1     1.1838          0.4707            6.3254       0.0119
hours        1    −0.0339          0.00677          25.0932       <.0001
hours1       1    −0.00238         0.0128            0.0343       0.8531
black        1    −0.6617          0.2264            8.5442       0.0035
age          1    −0.1961          0.1106            3.1440       0.0762

In chapter 2, we saw that the results from a difference score regression could be replicated by creating a separate record for each person at each point in time, and then estimating a regression model that includes a dummy variable for every person (except one). Let's try that for the logistic regression. The first step is to restructure the data set so there's a separate record for each person in each year:

DATA teenyrs2;
   SET my.teenpov;
     year=1;
     pov=pov1;
     mother=mother1;
     spouse=spouse1;
     inschool=inschool1;
     hours=hours1;
     OUTPUT;
     year=2;
     pov=pov5;
     mother=mother5;
     spouse=spouse5;
     inschool=inschool5;
     hours=hours5;
     OUTPUT;
   KEEP id year black age pov mother spouse inschool hours;
RUN;

The TEENYRS2 data set has 2,302 records, two for each of the 1151 girls. The time-varying covariates are given the same names for each of the two records.
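For readers more comfortable outside SAS, the same wide-to-long restructuring can be sketched in Python. The single record and its values below are invented for illustration; the field names mirror the SAS variables:

```python
# One hypothetical wide-format record per girl, as in MY.TEENPOV
wide = [
    {"id": 1, "black": 0, "age": 14, "pov1": 0, "pov5": 1,
     "mother1": 0, "mother5": 1, "spouse1": 0, "spouse5": 0,
     "inschool1": 1, "inschool5": 0, "hours1": 0, "hours5": 20},
]

long = []
for row in wide:
    # year 1 is coded 1 and year 5 is coded 2, as in the DATA step
    for year, suffix in ((1, "1"), (2, "5")):
        long.append({
            "id": row["id"], "year": year,
            "black": row["black"], "age": row["age"],   # time-invariant: repeated
            **{v: row[v + suffix]
               for v in ("pov", "mother", "spouse", "inschool", "hours")},
        })

print(len(long))   # two records per person, with common variable names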

Now we're ready to estimate the logistic regression model:

PROC LOGISTIC DATA=teenyrs2 DESC;
   CLASS id / PARAM=REF;
   MODEL pov=year mother spouse inschool hours year*black
         year*age id;
RUN;

The CLASS statement tells PROC LOGISTIC to create a set of dummy variables, one for each value of ID except for the highest. The PARAM=REF option says to make one of the ID numbers the reference category, by default the highest ID number. Note that in the MODEL statement, BLACK and AGE are entered as interactions with YEAR, but with no corresponding main effects.

This model took about 1.5 minutes to estimate on my PC. The printed output was extremely voluminous because LOGISTIC reported (a) a 1,150 × 1,150 matrix describing the coding of the dummy variables, (b) coefficients for the 1,150 dummy variables, and (c) odds ratios contrasting each person with the reference person. In Output 3.6, I've excluded everything but the coefficient information for the other predictor variables.

Not only is this method cumbersome, but it also gives the wrong results. In Output 3.6, we find that every coefficient is exactly twice as large as the corresponding coefficient in Output 3.4, obtained with conditional logistic regression. This is a quite general result (Abrevaya 1997): whenever you do logistic regression with dummy variables for individuals and exactly two observations for each individual, the coefficients will be twice as large as the coefficients from conditional logistic regression. The chi-squares and standard errors in Output 3.6 are also incorrect. The chi-squares are exactly twice as large as those in Output 3.4, and the standard errors are √2 times those in Output 3.4.

Table 3.7. Output 3.6 Logistic Regression Estimates with Dummy Variables for Persons

Analysis of Maximum Likelihood Estimates

Parameter    DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept     1   −21.3931          2293.3            0.0001       0.9926
year          1     9.7990          2.3248           17.7667       <.0001
mother        1     1.4872          0.3589           17.1744       <.0001
spouse        1    −2.0635          0.4126           25.0059       <.0001
inschool      1     0.6789          0.3080            4.8577       0.0275
hours         1    −0.0679          0.00881          59.4107       <.0001
year*black    1    −1.0526          0.3060           11.8321       0.0006
year*age      1    −0.5154          0.1455           12.5485       0.0004

When there are more than two observations per person and/or varying numbers of observations per person, there won't be such a neat scaling of the coefficients and chi-squares. In any case, logistic regression with dummy variables for individuals will generally give biased coefficient estimates. I'll have more to say about the reasons for this in the next section.
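The doubling result is easy to verify numerically in the no-covariate case. The Python sketch below fits the dummy-variable model by Newton-Raphson to the smoking data's 89 discordant pairs (non-changers are omitted because their person dummies have no finite ML estimate) and compares the estimated time effect with the conditional estimate log(63/26):

```python
import numpy as np

# Discordant pairs from the smoking data: 63 change 0->1, 26 change 1->0
n01, n10 = 63, 26
n = n01 + n10
y = np.array([[0, 1]] * n01 + [[1, 0]] * n10, dtype=float)   # (persons, occasions)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Model: logit Pr(y_it = 1) = alpha_i + delta * I(t = 2); one dummy per person
alpha, delta = np.zeros(n), 0.0
for _ in range(50):                                # Newton-Raphson iterations
    p = sigmoid(np.stack([alpha, alpha + delta], axis=1))
    r, w = y - p, p * (1.0 - p)
    g_a, g_d = r.sum(axis=1), r[:, 1].sum()        # scores for alpha_i and delta
    h_aa, h_ad, h_dd = w.sum(axis=1), w[:, 1], w[:, 1].sum()
    # Solve the Newton system via the Schur complement on delta
    step_d = (g_d - np.sum(h_ad * g_a / h_aa)) / (h_dd - np.sum(h_ad ** 2 / h_aa))
    alpha += (g_a - h_ad * step_d) / h_aa
    delta += step_d

print(delta, 2 * np.log(n01 / n10))   # dummy-variable estimate vs. twice conditional
```

The dummy-variable estimate converges to 2·log(63/26) ≈ 1.770, exactly twice the conditional estimate of 0.885.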
