5.6. Fixed Effects Event History Methods for Nonrepeated Events

Fixed effects Cox regression requires that at least some of the individuals in a sample experience more than one event, so that within-individual comparisons are possible. Obviously, then, the method cannot be applied to a nonrepeatable event like death. Nevertheless, under certain conditions, it may be possible to do a fixed effects analysis for nonrepeatable events by treating time as discrete and applying conditional logistic regression. In the epidemiological literature, this type of analysis is called a case-crossover study (Maclure 1991), although the implementation I describe here is a little different from the way that epidemiologists usually do it.

As usual, I begin with an empirical example. Suppose we want to answer the following question: Does the death of a wife increase the hazard for the death of her husband? That's a difficult question to answer with confidence, because any association between husband's death and wife's death could be due to the effects of common environmental characteristics. Most husbands and wives will have lived in the same house in the same neighborhood for substantial periods of time. Moreover, they will tend to have come from similar social and economic backgrounds and have similar lifestyles. Unless we can control for those commonalities, any observed association between the death of one spouse and the death of the other could be spurious. Hence, a fixed effects analysis is highly desirable as a way to control for all stable unmeasured covariates.

To answer our question, we have data on 49,990 married couples in which both spouses were alive and at least 68 years old on January 1, 1993. Death dates for both spouses are available through May 30, 1994. During that 17-month interval, there were 5,769 deaths of the husband and 1,918 deaths of the wife. We regard time as consisting of discrete units, in this case days, which we can enumerate as t = 1, 2, 3, .... Let pit be the probability that husband i dies on day t, given that he was still alive on the preceding day, and let Wit = 1 if the wife of husband i was dead on day t, and 0 otherwise.

We'll represent the effect of the wife's vital status on the probability of the husband's death by a logistic regression model

$$\log\left(\frac{p_{it}}{1-p_{it}}\right) = \alpha_i + \gamma t + \beta W_{it}$$

where γt represents a linear effect of time on the log-odds of death, β is the coefficient for the wife's vital status, and αi represents the fixed effects of all unmeasured variables that are constant over time. Note that no time-invariant covariates are included in the model because their effects are absorbed into the αi term.

We will estimate the model by the method of conditional maximum likelihood, described in chapter 3, which eliminates the αi terms from the estimating equations. Here's how it's done. For men who died, a separate observational record is created for each day that the couple is observed, from day 1 (January 1, 1993) until the day of death or the day of censoring. For each of these couple-days, the dependent variable Yit is coded 0 if the man remained alive on that day, and coded 1 if he died on that day. Thus a man who died on June 1, 1993, would contribute 152 couple-days; 151 of those would have a value of 0 on Yit, while the last would have a value of 1. The predictor variable Wit is coded 0 for all days on which the wife was alive and 1 for all days on which she was dead. No observations are created for men who did not die because, in a fixed effects analysis for dichotomous outcomes, individuals who do not change contribute nothing to the likelihood function. The model can then be estimated with PROC LOGISTIC as described in chapter 3.
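The way conditioning removes αi can be written out explicitly. As a sketch, assume the model has the form logit(pit) = αi + γt + βWit, where β is a label introduced here for the coefficient on Wit, and let Ti (also introduced here) denote the day on which husband i died. Each stratum contains exactly one event, so the conditional likelihood contribution is the probability that the event falls on day Ti rather than on any earlier day:

```latex
L_i(\beta,\gamma)
  = \frac{\exp(\alpha_i + \gamma T_i + \beta W_{iT_i})}
         {\sum_{t=1}^{T_i} \exp(\alpha_i + \gamma t + \beta W_{it})}
  = \frac{\exp(\gamma T_i + \beta W_{iT_i})}
         {\sum_{t=1}^{T_i} \exp(\gamma t + \beta W_{it})}
```

The factor exp(αi) cancels from numerator and denominator. For a man who did not die, conditioning on zero events leaves only the all-zero sequence, which has conditional probability 1 whatever the parameter values, so he adds nothing to the likelihood.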

Below is a SAS program to create the couple-day data set and estimate the model with PROC LOGISTIC. The original data set (MY.COUPLE) has one record per couple, with the following variables:


HDEAD      1 if husband died, otherwise 0
WDEAD      1 if wife died, otherwise 0
HDTIME     day of husband's death or day of censoring
WDTIME     day of wife's death or day of censoring
COUPLEID   a unique ID number for each couple

Here is the code for constructing the couple-day data set:

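DATA coupleday;
   SET my.couple;
   WHERE hdead=1;                      /* keep only couples in which the husband died */
   DO day=1 TO hdtime;                 /* one record per day through the day of death */
      IF day = hdtime THEN husdead=1;  /* the event falls on the last day */
      ELSE husdead=0;
      IF wdtime<day THEN wifedead=1;   /* wife's death day precedes this day */
      ELSE wifedead=0;
      OUTPUT;
   END;
   KEEP husdead wifedead day coupleid;
RUN;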

The new data set, COUPLEDAY, has one record per couple per day, for a total of 1,377,282 records. Note that couples in which the husband did not die are excluded (as explained above). Next, we estimate a conditional logistic regression with PROC LOGISTIC:

PROC LOGISTIC DATA=coupleday DESC;
   MODEL husdead=wifedead day;
   STRATA coupleid;
RUN;

Unfortunately, this program produces the following warning messages:

WARNING: NRRIDG Optimization cannot be completed.

WARNING: The LOGISTIC procedure continues in spite of the above
         warning. Results shown are based on the last maximum
         likelihood iteration. Validity of the model fit is
         questionable.

WARNING: The information matrix is singular and thus the
         convergence is questionable

The reason for the convergence failure is that each couple's sequence of observations consists of a string of 0's on the dependent variable, followed by a 1. That is, the event always occurs at the last observation unit. As a consequence, any monotonically increasing function of time will perfectly predict the outcome for that person, making it impossible to get maximum likelihood estimates for that covariate or any other covariate in the model. In the logistic regression literature, this problem is known as complete separation (Albert and Anderson 1984; Allison 2003). Obviously, the problem would also occur if the covariate were the square root of time or the logarithm of time. On the other hand, it is possible to include non-monotonic functions of time such as sin(2πt/365), which would vary periodically over the course of a year.
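The failure can be seen directly in the conditional likelihood. As a sketch, consider a model in which time is the only covariate, with coefficient γ, and let Ti (notation introduced here) be the day of death for husband i. Because the event always falls on the last day, the contribution of couple i is

```latex
L_i(\gamma)
  = \frac{\exp(\gamma T_i)}{\sum_{t=1}^{T_i} \exp(\gamma t)}
  = \frac{1}{1 + e^{-\gamma} + e^{-2\gamma} + \cdots + e^{-(T_i-1)\gamma}}
```

Every term after the first in the denominator shrinks as γ grows, so for any couple with Ti > 1 the contribution rises toward 1 as γ goes to infinity, and no finite γ maximizes the likelihood. The same argument applies to any strictly increasing f(t) in place of t, because f(Ti) − f(t) > 0 for every earlier day.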

Actually, for our mortality example, the problem of nonconvergence is not confined to the DAY variable. If we remove DAY from the model, we get the results in Output 5.5.

Output 5.5 Fixed Effects Event History Analysis with No Dependence on Time

Testing Global Null Hypothesis: BETA=0

Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio      259.4514     1        <.0001
Score                 190.1588     1        <.0001
Wald                    0.0436     1        0.8346

Analysis of Maximum Likelihood Estimates

Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
wifedead      1     15.3441           73.5073             0.0436        0.8346

This time we don't get a warning message that the model has not converged, but that's misleading. The coefficient for WIFEDEAD is extremely large, with an even larger standard error, a telltale indication of convergence problems. Another warning sign is the huge disparity between the likelihood ratio chi-square and the Wald chi-square. The reason for these problems is the same as before. Because WIFEDEAD may increase with time but never decrease, it perfectly predicts the occurrence of a death on the last day. Consequently, the coefficient for WIFEDEAD gets larger at each iteration of the estimation algorithm.
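The same point can be made with a short worked expression. As a sketch, let β be the coefficient for WIFEDEAD, and suppose couple i contributes n0 days with WIFEDEAD = 0 and n1 days with WIFEDEAD = 1 (n0 and n1 are notation introduced here). If the wife died first, the husband's last day has WIFEDEAD = 1, and the conditional likelihood contribution is

```latex
L_i(\beta)
  = \frac{e^{\beta}}{n_0 + n_1 e^{\beta}}
  = \frac{1}{n_0 e^{-\beta} + n_1}
```

This increases in β for every such couple and approaches 1/n1 only in the limit. If the wife was alive on every observed day, the contribution is 1/n0, which does not involve β at all. No couple has WIFEDEAD equal to 1 on an early day and 0 on the day of death, so nothing in the data pulls β back toward a finite value.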

One way to circumvent this problem is to redefine WIFEDEAD to be an indicator of whether the wife died within, say, the previous 60 days. This covariate increases from 0 to 1 when the wife dies, but then goes back to 0 after 60 days (if the husband is still alive). Estimating the model with varying windows of time can give useful information about how the effect of the wife's death starts, peaks and stops.

Here's the new code for a window of 60 days:

DATA coupleday;
   SET my.couple;
   WHERE hdead=1;
   DO day=1 TO hdtime;
      IF day = hdtime THEN husdead=1;
      ELSE husdead=0;
      IF 0<day-wdtime<60 THEN wifedead60=1;
      ELSE wifedead60=0;
      OUTPUT;
   END;
   KEEP husdead wifedead60 day coupleid;
RUN;

PROC LOGISTIC DATA=coupleday DESC;
   MODEL husdead=wifedead60;
   STRATA coupleid;
RUN;

Based on the output from this code using several different windows of time, Table 5.1 gives estimated odds ratios for the effect of the wife's death on the husband's death. In all cases, the odds ratios exceed 1.0, and they are statistically significant for the 60-day interval and the 30-day interval. For the latter, the odds of the husband's death on a day in which the wife died during the previous 30 days are nearly double the odds if the wife did not die during that interval.

Table 5.1 Odds Ratios for Predicting Husband's Death from Wife's Death within Varying Intervals of Time

                             Wife Died Within
              15 days    30 days    60 days    90 days    120 days
Odds Ratio       1.26       1.96       1.61       1.27        1.26
p-value           .54       .006        .03        .24         .25
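The runs behind Table 5.1 differ only in the length of the window, so they could be scripted with a macro rather than edited by hand. What follows is a minimal sketch, not the program actually used for the table: the macro name CCWINDOW, its parameter W, and the generated names COUPLEDAYw and WIFEDEADw are all hypothetical, and the logic simply mirrors the 60-day DATA step shown earlier.

```sas
/* Hypothetical macro: case-crossover model for a window of &w days. */
%MACRO ccwindow(w);
   DATA coupleday&w;
      SET my.couple;
      WHERE hdead=1;                    /* husbands who died only        */
      DO day=1 TO hdtime;
         IF day = hdtime THEN husdead=1;
         ELSE husdead=0;
         /* 1 if the wife died within the &w days before this day */
         IF 0<day-wdtime<&w THEN wifedead&w=1;
         ELSE wifedead&w=0;
         OUTPUT;
      END;
      KEEP husdead wifedead&w day coupleid;
   RUN;

   PROC LOGISTIC DATA=coupleday&w DESC;
      MODEL husdead=wifedead&w;
      STRATA coupleid;
   RUN;
%MEND ccwindow;

/* One call per column of the table */
%ccwindow(15)
%ccwindow(30)
%ccwindow(60)
%ccwindow(90)
%ccwindow(120)
```

Each call writes its own data set, so the five sets of output can be compared side by side.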

Although these results are certainly intriguing, the danger is that there is no control for change over time. This is not merely a technical problem, but one that can seriously compromise any conclusions drawn from a case-crossover study (Suissa 1995; Greenland 1996). For our example, if there is any tendency for the incidence of wives' deaths to increase over the period of observation, this can produce a spurious relationship between the wife's death (however coded) and the husband's death. Intuitively, the reason is that the husband's death always occurs at the end of the sequence of observations for each couple, so any variable that tends to increase over time will appear to increase the hazard of the husband's death.

We now consider an alternative fixed effects method that appears to solve the problems that arise from uncontrolled dependence on time. Introduced by Suissa (1995), who called it the case-time-control design, the key innovation in this approach is the computational device of reversing the dependent and independent variables in the estimation of the conditional logit model. This makes it possible to introduce a control for time, something that cannot be done with the case-crossover method.

As is well known, when both the dependent and independent variables are dichotomous, the odds ratio is symmetric: reversing the dependent and independent variables yields the same result, even when there are other covariates in the model. (This symmetry is exact when the model is saturated in the control covariates, but only approximate for unsaturated models.) In the case-time-control method, the working dependent variable is the dichotomous covariate, in our case whether or not the wife died during the preceding specified number of days. Independent variables are the dummy variable for the occurrence of an event (the husband's death) on a given day and some appropriate representation of time, such as a linear function. Again a conditional logistic regression is estimated with each couple treated as a separate stratum. Under this formulation there is no problem with including time as a covariate, because the working dependent variable is not a monotonic function of time.
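The symmetry is easiest to see in the simplest case, with no other covariates. As an illustration, let a, b, c, and d (labels introduced here) be the counts in a 2 × 2 table of two dichotomous variables X and Y, with a for (X=1, Y=1), b for (X=1, Y=0), c for (X=0, Y=1), and d for (X=0, Y=0). Then

```latex
\mathrm{OR}_{Y \mid X} = \frac{a/b}{c/d} = \frac{ad}{bc}
  = \frac{a/c}{b/d} = \mathrm{OR}_{X \mid Y}
```

Treating Y as the outcome or X as the outcome gives the same cross-product ratio, which is what allows the roles of the two variables to be exchanged.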

In Suissa's formulation of the method, it is critically important to include data from all individuals, both those who experienced the event and those who are censored. However, his model was developed for data with only two points in time for each individual, an event period and a control period. In that scenario, the covariate effect and the time effect are perfectly confounded if the sample is restricted to those who experienced events. On the other hand, censored individuals provide information about the dependence of the covariate on time, information that is not confounded with the occurrence of the event.

By contrast, our data set (and presumably many others) has multiple controls at different points in time for each individual. That eliminates the complete confounding of time with the occurrence of the event (the husband's death), making it possible to apply the case-time-control method to uncensored cases only. That's a real boon in situations where it is difficult or impossible to get information for those who did not experience the event. The only restriction is that when the model is estimated without the censored cases, one cannot estimate a model with a completely arbitrary dependence on time—that is, with dummy variables for every point in time.

Of course, if the censored cases are available (as in our data set), more precise estimates of the dependence on time can be obtained by including them. But even if censored cases are available, there is a potential advantage to limiting the analysis to those who experienced the event. The case-time-control method has been criticized for assuming that the dependence of the covariate on time is the same among those who did and did not experience the event (Greenland 1996). This criticism has no force if the data are limited to those individuals who experience events.

For the mortality data, the working data set is the same as before, with one record for each day of observation from the origin until the time of the husband's death or censoring. Because conditional logistic regression requires variation on the dependent variable for each conditioning stratum, we can eliminate couples in which the wife did not die before the husband, with no loss of information. Here is the DATA step to produce the observations for a 60-day window:

DATA coupleday2;
   SET my.couple;
   /* keep couples in which both died and the wife died first */
   WHERE hdead=1 AND wdead=1 AND wdtime<hdtime;
   DO day=1 TO hdtime;
      IF day = hdtime THEN husdead=1;
      ELSE husdead=0;
      IF 0<day-wdtime<60 THEN wifedead60=1;
      ELSE wifedead60=0;
      OUTPUT;
   END;
   KEEP husdead wifedead60 day coupleid;
RUN;

This DATA step produced 39,942 couple-days, which came from only 126 couples. This is the number of couples in which the husband died and the wife died before the husband. Although this is a tiny fraction of the original sample of 49,990 couples, it's the only group that contains information about the effect of the wife's death on the husband's death using a fixed effects approach.

The working model is defined as follows. Let Hit be a dummy variable for the death of husband i on day t, and let Pit be the probability that the wife's death occurred within a specified number of days prior to day t. The logistic regression model is

$$\log\left(\frac{P_{it}}{1-P_{it}}\right) = \alpha_i + \beta H_{it} + \gamma_1 t + \gamma_2 t^2 \qquad (5.6)$$

This model allows for a quadratic dependence on time, although other functions could be used instead. Here is the program to do the estimation:

PROC LOGISTIC DATA=coupleday2 DESC;
   MODEL wifedead60=husdead day day*day;
   STRATA coupleid;
RUN;

Table 5.2 gives estimates of the odds ratios for varying windows of time. Results are quite similar to those in Table 5.1, which used the case-crossover method. Again, the evidence suggests that the effects of the wife's death on the hazard of the husband's death are limited in time, with considerable fading after about two months.

Although our working dependent variable is the wife's death, the odds ratios must be interpreted as the effect of the wife's death on the odds of the husband's death. That's because of the time ordering of the observations—the wife's death always precedes the husband's death. If the goal were to estimate the effect of the husband's death on the wife's mortality, we would have to construct a completely different data set that would include couple-days prior to the wife's death, but not thereafter.

Table 5.2 Odds Ratios for Predicting Husband's Death from Wife's Death within Varying Intervals of Time, Case-Time-Control Method

              15 days    30 days    60 days    90 days    120 days
Odds Ratio       1.26       2.08       1.74       1.28        1.11
p-value           .54      <.004        .01        .25         .63
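As noted earlier, censored cases can be included when they are available, which should sharpen the estimate of the dependence on time. The following is only a sketch of that variant for the 60-day window, and it is not the analysis reported in Table 5.2. The data set name COUPLEDAY3 is hypothetical, and the step assumes that HDTIME holds the censoring day for husbands who did not die, as in the variable list given earlier.

```sas
/* Hypothetical variant: case-time-control with censored husbands included. */
DATA coupleday3;
   SET my.couple;
   /* the wife must die before the husband's death or censoring day,  */
   /* otherwise wifedead60 would not vary within the couple           */
   WHERE wdead=1 AND wdtime<hdtime;
   DO day=1 TO hdtime;
      /* an event is recorded only for husbands who actually died */
      IF hdead=1 AND day = hdtime THEN husdead=1;
      ELSE husdead=0;
      IF 0<day-wdtime<60 THEN wifedead60=1;
      ELSE wifedead60=0;
      OUTPUT;
   END;
   KEEP husdead wifedead60 day coupleid;
RUN;

PROC LOGISTIC DATA=coupleday3 DESC;
   MODEL wifedead60=husdead day day*day;
   STRATA coupleid;
RUN;
```

Couples with a censored husband have HUSDEAD equal to 0 on every day, so they inform only the time coefficients. Including them also brings back the assumption, criticized by Greenland (1996), that the covariate depends on time in the same way for those who did and did not experience the event.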

In this example, we estimated the effect of a single dichotomous covariate (the wife's death within a specified number of days) on the occurrence of a nonrepeated event (the husband's death). The method enabled us to control for all stable covariates. But suppose we want to control for time-varying covariates, like smoking status. Simulation studies (Allison and Christakis 2000) indicate that additional covariates can simply be included as predictor variables in the logistic regression model specified in equation (5.6). Although the coefficients for any additional covariates would not be unbiased estimates of their effects on the husband's death, the introduction of such covariates will yield approximately unbiased estimates for the effect of the wife's death on the husband's death (β in equation (5.6)). If we want to estimate the effect of smoking status on the husband's death, then we must make the probability of smoking be the dependent variable in equation (5.6), possibly including the wife's vital status as a covariate. This procedure could work even if smoking status had more than two categories, in which case equation (5.6) would need to be specified as a multinomial logistic regression. However, I know of no way to generalize the case-time-control method to quantitative covariates (except as control variables).
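Adding a time-varying control to the working model requires only one more term in the MODEL statement. In the sketch below, SMOKER is a hypothetical couple-day indicator of the husband's smoking status and COUPLEDAY2S is a hypothetical data set that carries it. Neither exists in the data described above, so this is an illustration of the form of the program rather than something that can be run on MY.COUPLE.

```sas
/* Hypothetical: coupleday2s is coupleday2 plus a day-specific SMOKER variable. */
PROC LOGISTIC DATA=coupleday2s DESC;
   /* husdead carries the effect of interest; smoker is a control only */
   MODEL wifedead60=husdead smoker day day*day;
   STRATA coupleid;
RUN;
```

Consistent with the discussion above, the coefficient for HUSDEAD would still be read as the effect of the wife's death on the husband's death, while the coefficient for SMOKER should not be interpreted as the effect of smoking on the husband's death.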
