Chapter 1
Missing Data Concepts and Motivating Examples

1.1 Overview of the Missing Data Problem

Data are the fundamental building blocks of valid statistical inference in biomedical and social sciences research. Unfortunately, for many reasons, more often than not some observations will be missing. Data are sometimes missing by design, such as in two-stage case-cohort designs. There are also situations in which missing data are not relevant to the analysis and can therefore be safely ignored. So, it is important to understand what we mean by missing data in this book. According to Little et al. (2012b), missing data are defined as values that are not available but would be meaningful for analysis if they were observed. Even with missing data, the goal remains to make inferences about the population targeted by the complete sample. Unfortunately, there is no universal method for handling a missing data problem. This is because the selection of subjects for a study is usually known, but the process by which observations on those subjects become missing (the missingness mechanism) is usually unknown, and the data alone cannot definitively inform us about this process. Therefore, with missing data, additional assumptions are required in order to proceed with analysis, and the validity of these assumptions cannot be determined from the observed data alone. For this reason, assessing the sensitivity of conclusions to the assumptions should play a central role in any analysis of data with missing values. In fact, any analysis should principally include the hypothesis under investigation, the information on the observed data, and the reasons for the missing data. When data are missing, information is lost, and the value of what remains depends on whether we can identify plausible reasons for the missingness and on the sensitivity of the study conclusions to different assumptions about the missingness mechanism.

Over the years, our view on missing data problems has evolved substantially as we have obtained new insights into the problem and learned to deal with it. In the beginning, missing values were merely an inconvenience: holes in a data matrix that caused statistical software to crash. Therefore, early work on the problem was largely on computation (Afifi and Elashoff, 1966; Dempster et al., 1977). Later we realized that even if data were not there, information could still be obtained from the way data were missing or from the relationship between the missing and observed values. This led to the categorization of missingness mechanisms and to imputation based on auxiliary variables. We found that augmenting the observed data with the extracted additional information, along with appropriate inference adjustment, can improve the statistical analysis (Little and Rubin, 2002; Schafer, 1997b). Later researchers further realized that missing values are part of a more general concept of coarsened data, which include numbers that have been grouped, aggregated, rounded, censored, or truncated, resulting in partial loss of information (Heitjan and Rubin, 1991). Missing data methods may be extended to broader research on the analysis of partially observed data that encompasses a wide range of advanced statistical topics such as double robustness, causal inference, and semiparametric theory (Robins and Rotnitzky, 1995; Tsiatis, 2006).

The statistical literature on the missing data problem is now quite extensive, with many excellent textbooks available. Little and Rubin (2002) provided a good overview of likelihood methods and an introduction to multiple imputation. Allison (2001) presented a less technical overview intended for a less technically sophisticated audience. Schafer (1997b) covered the EM algorithm and data augmentation for multivariate normal and general location models from a Bayesian perspective. Molenberghs and Kenward (2007) focused on intervention-based clinical studies, whereas Daniels and Hogan (2008) focused on longitudinal studies with a Bayesian emphasis. Two recent textbooks provide excellent discussions of a particular statistical method for analyzing and drawing inferences from incomplete data, namely, multiple imputation (MI) (Carpenter and Kenward, 2013; van Buuren, 2013).

In this book, we try to strike a balance between theory and application. The book is intended for researchers and graduate students in the biomedical fields who are moderately sophisticated in terms of statistical skills. For those who are interested in theory, we cover a wide range of statistical methods for dealing with missing data problems. For those who wish to analyze incomplete data, we provide detailed working examples, using mostly the freely available statistical software R (R Core Team, 2013) or the commercial software Stata, although many approaches are readily available in other statistical software packages.

1.2 Patterns and Mechanisms of Missing Data

Before the explosion of data in the modern age of the Internet and informatics and before the age of clinical trials, data mainly came from agriculture and survey studies. Such studies measured all variables of interest for each unit (or case, observation, subject, depending on the context). Measured variables may have included continuous variables (e.g., weight and height), categorical variables (e.g., gender, race, and level of education), and/or repeated measures on the same variable over time. Such data are often organized in a rectangular format, where the rows of the data matrix represent the units or subjects and the columns represent variables measured for each unit. Since statistical methods developed for the missing data setting depend on the shape of the data, we will refer repeatedly to the data structure, or the data matrix, of the study as well.

In the initial stages of development, most missing data techniques were developed specifically for the missing data in this type of study with data in a rectangular form; see, for example, the seminal work of Little and Rubin (2002). But missing data may also occur in studies where, by design, not all variables are measured on each subject, or where subjects are not measured at the same time points, giving this type of data a different shape. For example, in studies involving survival following transplantation, patients often have unequal length of follow-up and a different number of follow-up visits. So for today's ever-growing complex study designs, special efforts are required to reshape the data before we can apply the missing data techniques.

Central to the analysis of incomplete data is an understanding of why the data are missing and its implications for the analysis. In the context of the missing data problem, it is necessary to distinguish between patterns of missing data and mechanisms of missing data. Missing data patterns describe which values are observed in the data matrix and which values are missing, whereas the missing data mechanism describes how the process underlying the missingness is related to the values in the data matrix. As the readers will see later, certain missing data patterns such as a monotone pattern may lead to more efficient algorithms for imputation. On the other hand, many statistical techniques rely on correct assumptions regarding the missing data mechanisms, which can sometimes be inferred by examining missing data patterns.

1.2.1 Missing Data Patterns

We illustrate typical missing data patterns using some examples. The first four examples come from the R package mice (van Buuren and Groothuis-Oudshoorn, 2011).

1.2.2 Missing Data Mechanisms

Understanding the missing data mechanism is important for the analysis of missing data. The justification for any method of handling missing data rests on the assumed missingness mechanism. To explain the missing data mechanisms, we first define a vector of indicator variables R that identify which variables are observed and which are missing. For example, R = (0, 1) indicates that the first variable is missing and the second variable is observed. There are two underlying requirements for missing data indicators. First, missingness indicators contain information about the missing values. Clearly, they are useless otherwise. Second, the values that the indicators hide must be meaningful. Sometimes the missingness indicators hide undefined values, in which case the problem should not be considered a missing data problem.
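As a minimal sketch of the indicator R described above, the following Python snippet derives the indicators from a small data matrix. The matrix Y and its values are hypothetical and serve only as an illustration; the coding (1 for observed, 0 for missing) follows the example in the text.

```python
import numpy as np

# Hypothetical data matrix: rows are subjects, columns are variables.
# np.nan marks a value that was not observed.
Y = np.array([
    [np.nan, 2.5],
    [1.2,    3.1],
    [0.7,    np.nan],
])

# Coding used in the text: 1 = observed, 0 = missing.
R = (~np.isnan(Y)).astype(int)

# First subject: first variable missing, second variable observed.
print(R[0])
```

Each row of R is the indicator vector for one subject, so the collection of distinct rows describes the missing data patterns present in the data matrix.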

The missing data mechanism describes how the missingness indicator vector R is related to the vector of study variables Y, or in other words, the conditional distribution of R given Y, say f(R|Y). Here, we regard R as a random vector, which formalizes the role of the missing data mechanism in modern missing data analysis. It is best to view the distribution of R as a mathematical device to describe the rates and patterns of missing values and to capture roughly possible relationships between the missingness and the values of the missing items themselves (Schafer and Graham, 2002).

Since some of the variables Y may be missing, we partition Y into two parts, Yobs and Ymis, the observed part and the missing part, respectively. Below we introduce the three missing data mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).

The data are called MCAR if missingness does not depend on the values of the data Y, missing or observed. Statistically,

f(R | Yobs, Ymis) = f(R) for all values of Yobs and Ymis.

It means that the missingness indicator is independent of the study variables, so the values of the study variables, whether missing or observed, have nothing to do with the mechanism by which the data become missing. Consider an experiment involving random sampling. Under MCAR, people with missing values are really just a random sample of the study population. As a result, people with observed data still make up a valid random sample from the study population for the ultimate inference, though with some loss in efficiency because of the smaller sample size. Therefore, when data are MCAR, we can ignore the units with missing values and proceed with the complete-case analysis, and still make valid inferences.

MAR is a less restrictive assumption on the missing data mechanism than MCAR. Data are called MAR when missingness depends only on the observed components Yobs of Y, but not on the missing components Ymis. Statistically,

f(R | Yobs, Ymis) = f(R | Yobs) for all values of Ymis.

Despite its name, MAR does not mean that the missing data are a simple random sample of all data values, which is the case for MCAR. MAR relaxes MCAR in that the missing values need only behave like a random sample within subpopulations defined by the observed data, rather than within the data as a whole.

The mechanism is called missing not at random if the distribution of R depends on the missing values in Y. Strictly speaking, MNAR comes in two forms: (i) missingness depends on the missing value itself or (ii) missingness depends on an unobserved variable. The latter is common in latent variable models or hidden Markov models.

Notably, the classification has been criticized because of the confusing nature of its terminology (Schafer and Graham, 2002). We encourage readers to focus more on the implications than on the names. Practically, MNAR can be difficult to distinguish from MAR owing to the lack of information from missing values. This in part explains the popularity of the MAR assumption in the development of missing data techniques. But the situation can become really complex if a multivariate data set is considered, even more so if different variables in the data set have missing observations that are variously MCAR, MAR, or MNAR.

We demonstrate the three mechanisms and their consequences through the following simple example.

The scatter plots of the three samples are presented in Figure 1.1a–c, respectively. We conduct the linear regression analysis based on three subsets of each sample: the full data, the complete-case data (i.e., only the subjects with y observed), and the missing-case data (i.e., only the subjects with y missing, under the pretense that their unobserved y values are known to us). The regression lines are presented in Figure 1.1, and the regression results for the full data and the complete-case data are summarized in Table 1.1. It can be seen that under MAR or MNAR, results based only on the complete-case data may be biased.

Figure 1.1 The solid circles are the observed sample points, and the x symbols are sample points with y values missing. The solid line is the regression line of y on x estimated from the full data, while the dotted and dashed regression lines are estimated from the subjects with completely observed and missing values, respectively.

Table 1.1 Regression Results Under Different Missing Data Mechanisms: Estimates (with Standard Deviations) of the Intercept and the Coefficient of x, the R-Squared and Adjusted R-Squared, and the Number of Observations

                         Full Data      MCAR           MAR            MNAR
(Intercept)              0.08 (0.06)    0.01 (0.08)    −0.07 (0.13)   0.44 (0.09)
x                        1.02 (0.06)    1.01 (0.08)    1.15 (0.12)    0.70 (0.09)
R²                       0.53           0.54           0.36           0.23
Adjusted R²              0.53           0.53           0.35           0.23
Number of observations   300            148            163            201
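The comparison between full-data and complete-case regressions can be explored with a short simulation. The sketch below is not the example used to produce Figure 1.1 or Table 1.1: the data-generating model (y = x + noise), the logistic missingness probabilities, the seed, and all function names are assumptions chosen for illustration, so its numbers will differ from those in the table.

```python
import numpy as np

rng = np.random.default_rng(2024)
n = 300

# Assumed data-generating model (not taken from the text).
x = rng.normal(size=n)
y = x + rng.normal(size=n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed probability that y is missing under each mechanism.
p_missing = {
    "MCAR": np.full(n, 0.5),  # constant: unrelated to x or y
    "MAR": sigmoid(x),        # depends only on the always-observed x
    "MNAR": sigmoid(y),       # depends on the possibly missing y itself
}

def summarize(xv, yv):
    """Least-squares fit of yv on xv, plus the mean of yv and the sample size."""
    slope, intercept = np.polyfit(xv, yv, 1)
    return {"intercept": intercept, "slope": slope,
            "mean_y": yv.mean(), "n": len(yv)}

results = {"Full data": summarize(x, y)}
for name, p in p_missing.items():
    observed = rng.random(n) >= p  # True where y is observed
    results[name] = summarize(x[observed], y[observed])

for name, r in results.items():
    print(f"{name:10s} intercept={r['intercept']: .2f} slope={r['slope']: .2f} "
          f"mean_y={r['mean_y']: .2f} n={r['n']}")
```

Under these particular assumptions, MAR missingness depends only on the regressor x, so one would expect the complete-case mean of y to shift even if the fitted line of y on x changes little, whereas MNAR selection on y itself would typically distort the fitted line as well. Rerunning with other seeds or missingness models gives a feel for how sensitive complete-case results are to the mechanism.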

1.3 Data Examples

To illustrate the different analytic approaches and to assist the readers in applying the methods described in the book, we use data from a wide range of clinical studies, including cross-sectional data, longitudinal data, and survival data from both randomized clinical trials and observational cohort studies. In the following, we provide some detailed description of the studies and briefly outline some of the missing data problems. Detailed missing data patterns will be examined in the subsequent chapters.

1.3.1 Improving Mood and Promoting Access to Collaborative Treatment (IMPACT) Study

1.3.1.1 Background

In one of the largest treatment trials for depression in adults to date, a team of researchers supported by the Hartford Foundation followed 1801 depressed older adults from 18 diverse primary care clinics across the United States from 1998 to 2003 (Unützer et al., 2002). The 18 participating clinics were associated with eight health care organizations in the states of Washington, California, Texas, Indiana, and North Carolina. The clinics included diverse health care systems: several Health Maintenance Organizations (HMOs), traditional fee-for-service clinics, an Independent Provider Association (IPA), an inner-city public health clinic, and two Veterans Administration (VA) clinics.

One half of the enrolled study participants were randomized to receive IMPACT collaborative care management, and the other half received the care normally available in their primary care clinics (including referral to specialty mental health care). The study examined the effects of the IMPACT intervention on depression, quality of life, costs, physical functioning, and other areas of physical and mental health.

IMPACT care is characterized by five essential elements: (i) it is collaborative care involving primary care physicians, a care manager, and a psychiatrist; (ii) the depression care manager plays a crucial role in coordinating and implementing the treatment plan; (iii) there is a designated psychiatrist; (iv) outcome measures are frequently assessed and monitored; and (v) stepped care allows us to adjust the treatment according to clinical outcomes.

Since the end of the trial, a number of organizations in the United States and abroad have adopted and implemented the IMPACT program with diverse patient populations. As reported in the December 11, 2002 issue of the Journal of the American Medical Association, the IMPACT model of depression care more than doubles the effectiveness of depression treatment for older adults in primary care settings.

At 12 months, about half of the patients receiving IMPACT care reported at least a 50% reduction in depressive symptoms, compared with only 19% of those in usual care. Analysis of data from the survey conducted 1 year after IMPACT care ended shows that the benefits of the intervention persisted. IMPACT patients experienced more than 100 additional depression-free days over a 2-year period than those treated with usual care.

1.3.1.2 Missing Data Problem

For the IMPACT study, one of the primary outcomes is depression as measured by the Symptom Checklist (SCL-20) score, assessed at 0 (baseline), 3, 6, 12, 18, and 24 months. Table 1.2 describes the missing data pattern of SCL-20 scores. Of the total subjects, 70% completed all six assessments (pattern 1). A monotone missing data pattern was observed in 20% of the subjects (patterns 2–6), and the remaining 10% of subjects had intermittent missing data (pattern 7).

Table 1.2 Missing Data Pattern for SCL-20 Scores Over Time

                  Month
Pattern   0   3   6   12   18   24   Number   Percent
1         ✓   ✓   ✓   ✓    ✓    ✓    1269     70
2         ✓   ✓   ✓   ✓    ✓    ·    60       3
3         ✓   ✓   ✓   ✓    ·    ·    60       3
4         ✓   ✓   ✓   ·    ·    ·    64       4
5         ✓   ✓   ·   ·    ·    ·    69       4
6         ✓   ·   ·   ·    ·    ·    117      6
7         ?   ?   ?   ?    ?    ?    162      10

✓ = observed; · = missing; ? = observed or missing (intermittent pattern).
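The three kinds of pattern in Table 1.2 (complete, monotone, and intermittent) can be identified programmatically from each subject's sequence of visits. The following sketch assumes the visits are ordered in time and coded as booleans; the function name classify_pattern and the example sequences are hypothetical and are not taken from the IMPACT data.

```python
from collections import Counter

def classify_pattern(observed):
    """Classify one subject's visit pattern.

    observed: booleans ordered by visit time (True = score observed).
    """
    observed = list(observed)
    if all(observed):
        return "complete"
    first_missing = observed.index(False)
    # Monotone: once a visit is missed, no later visit is observed.
    if not any(observed[first_missing:]):
        return "monotone"
    return "intermittent"

# Hypothetical subjects, visits at months 0, 3, 6, 12, 18, 24.
subjects = [
    [True, True, True, True, True, True],    # like pattern 1
    [True, True, True, True, True, False],   # like pattern 2
    [True, False, False, False, False, False],  # like pattern 6
    [True, False, True, True, False, True],  # like pattern 7
]

counts = Counter(classify_pattern(s) for s in subjects)
print(counts)
```

Note that this sketch labels a subject with no observed visits as monotone; whether such subjects belong in the analysis at all is a separate decision.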

1.3.2 National Alzheimer's Coordinating Center Minimum Data Set

1.3.2.1 Background

The National Alzheimer's Coordinating Center (NACC) was established by the National Institute on Aging (NIA, U01 AG016976) in 1999 to facilitate collaborative research. Collecting data from the 29 NIA-funded Alzheimer's Disease Centers (ADCs) across the United States, NACC has developed and maintained a large relational database of standardized clinical and neuropathological research data (Beekly et al., 2004, 2007). Two large databases are maintained by NACC: one cross-sectional data collection (MDS) and one longitudinal data collection (UDS). More information about the NACC study can be found at the web site http://www.alz.washington.edu.

The NACC Minimum Data Set (MDS) consists of all subjects enrolled in the ADCs' Clinical and Satellite Cores (1984–2005). The MDS contains one record for every subject enrolled and is analyzed as cross-sectional data. To demonstrate the missing covariate problem in a cross-sectional data analysis, we included a total of 17,403 deceased patients. The outcome variable is the score on the mini-mental state examination (MMSE), a brief 30-point questionnaire test used to screen for cognitive impairment. The MMSE score can range from 0 to 30, with a lower score indicating more severe impairment. We would like to examine the risk factors that affect the mean MMSE score, and examine whether the effects are modified by the patients' true Alzheimer's disease (AD) status. However, the gold standard ascertainment of AD, based on brain autopsy, is available only for about 31% of the sample. The missingness may be due to the decision of the patients or their families not to consent to autopsy, or to the unavailability of such a test. We believe that the decision about disease verification may be associated with demographic characteristics (e.g., age, gender, and race), but that it is unlikely to be correlated with the true AD status. So the MAR assumption seems reasonable. The risk factors we extracted from the database are age (continuous variable indicating age at the MMSE test), gender (binary variable with 1 indicating male gender), race (binary variable with 1 indicating white race), and marital status (binary variable with 1 indicating married); and clinical diagnosis of AD (binary variable with 1 indicating clinically diagnosed with AD), stroke (binary variable with 1 indicating a previous stroke), Parkinson's disease (binary variable with 1 indicating presence of the disease), and depression (binary variable with 1 indicating presence of the disease). We included clinical diagnosis of AD as a risk factor and view it as a variable that captures patient history, clinical test results, and the physician's clinical decision.

1.3.2.2 Missing Data Problem

In this example, only one covariate (AD status) has missing values, whereas all other variables, including the outcome, are completely observed. AD status, which is considered an important predictor of cognitive impairment, was missing for 11,945 (68.6%) of the 17,403 subjects.

1.3.3 National Alzheimer's Coordinating Center Uniform Data Set

1.3.3.1 Background

The Uniform Data Set (UDS) at NACC is a multicenter longitudinal study that began in September 2005. Participants came from 29 ADCs across North America. The participants were referred or self-referred for evaluation of possible dementia, or recruited specifically for clinical research. Most subjects underwent clinical assessment and neuropsychological tests for cognitive impairment upon enrollment. Approximately once each year, the patients received periodical reevaluation and cognitive tests until their death or dropout from the study. Among these cognitive tests, the MMSE is a brief 30-point questionnaire test that is widely used to screen for cognitive impairment.

We used the same data set as in Monsell et al. (2012), who compared the decline of different measures of cognitive function in the NACC UDS data. Our focus is only on the MMSE score, and the aim is to investigate the risk factors affecting its rate of decline. Subjects satisfying all the following criteria were included in the analysis: (i) aged 65 years or older, (ii) with a clinical diagnosis of amnestic mild cognitive impairment (aMCI), and (iii) with at least one follow-up visit after the first diagnosis of aMCI. As a result, the analyzed data set had a total of 2830 subjects with 8752 visits. The covariates included age at the first diagnosis of aMCI, gender, clinical dementia rating sum of boxes (CDRSB), education (categorized as 0–12, 13–16, and 17+ years), Hachinski ischemic score (HIS) (categorized as 0, 1, and 2+ points), and apolipoprotein E (APOE) e4 alleles (yes/no). CDRSB and HIS are time-varying covariates, whereas the others are baseline covariates. More detailed background information can be found in Monsell et al. (2012).

1.3.3.2 Missing Data Problem

In this example, missingness occurs at both the subject and observation levels: 3.4% of the visits had missing HIS; 4.6% of the visits had missing MMSE; 33.5% of the subjects had missing APOE e4; and 0.1% of the subjects had missing education.

1.3.4 The Pathways Study

1.3.4.1 Background

These data are used to illustrate missing data methods when dealing with missing covariates in survival analysis.

To study the association of depression with diabetes outcomes, investigators from the University of Washington and the Group Health Research Institute developed the Pathways study, a prospective, population-based, observational study (Lin et al., 2004). Subjects were recruited into the study in March 2001. Baseline surveys were conducted, and the subjects were followed up for 10 years. The data set contains 4128 subjects. For illustrative purposes, we included only the 1437 female non-Hispanic white patients in our analysis data set. The outcome of interest is time to death, which may be censored due to disenrollment or study termination.

1.3.4.2 Missing Data Problem

Eight covariates are considered in the analysis: age, education level, smoking, duration of diabetes, hemoglobin A1c, major depression, estimated glomerular filtration rate (eGFR), and microalbuminuria. The first six covariates are completely observed, whereas the eGFR and microalbuminuria have some missing values. Among the 1437 selected patients, there are 94 patients with missing eGFR and 400 with missing microalbuminuria. In total, there are 437 patients with missing eGFR or microalbuminuria.

1.3.5 Randomized Trial on Vitamin A Supplement

1.3.5.1 Background

Vitamin A is a crucial nutrient required for maintaining immune function, eye health, growth, and survival in human beings. Lack of vitamin A can lead to death. Sommer and Zeger (1997) reported a randomized clinical trial on the effectiveness of a vitamin A supplement in reducing mortality in children in rural Indonesia. In this trial, villages were randomized to either the treatment or the control group. All children in a village assigned to the treatment group were to receive vitamin A supplements, and no children in a control village received any vitamin A supplements. For various reasons, however, not all children in the treatment villages received a vitamin A supplement. The outcome of interest is mortality within 12 months of the start of the study; that is, whether the children were alive at month 12 after the start of the study. In this example, we ignore any clustering within villages.

1.3.5.2 Missing Data Problem

When estimating the causal effect of the vitamin A supplement, we do not observe which treatment the patients would have taken, or what outcomes they would have had, if they had been given treatment assignments different from the ones they actually received. We observe only each patient's outcome and actual treatment received under the treatment assignment that the patient actually received.
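One way to see why this is a missing data problem is to lay out both potential outcomes for each subject and mask the one that the assignment hides. The sketch below uses simulated values only: the sample size, the random assignment z, and the potential outcomes y0 and y1 are hypothetical and are not drawn from the vitamin A trial.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

# Hypothetical potential outcomes (1 = alive at month 12).
# In a real study these are never both known for the same subject.
y0 = rng.integers(0, 2, size=n)  # outcome if assigned to control
y1 = rng.integers(0, 2, size=n)  # outcome if assigned to treatment
z = rng.integers(0, 2, size=n)   # randomized assignment (1 = treatment)

# The observed outcome is the potential outcome under the actual assignment.
y_obs = np.where(z == 1, y1, y0)

# What the analyst sees: the other potential outcome is missing.
y0_seen = np.where(z == 0, y0.astype(float), np.nan)
y1_seen = np.where(z == 1, y1.astype(float), np.nan)

for i in range(n):
    print(f"subject {i}: z={z[i]} Y(0)={y0_seen[i]} Y(1)={y1_seen[i]}")
```

Every row has exactly one missing entry, which is why comparisons of the two columns require assumptions beyond the observed data. The same layout applies to the treatment a subject would have received under each assignment.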

1.3.6 Randomized Trial on Effectiveness of Flu Shot

1.3.6.1 Background

McDonald et al. (1992) studied influenza vaccine efficacy in reducing morbidity in high-risk adults, using a computer-generated reminder for flu shots. The study was conducted over a 3-year period (1978–1980) in an academic primary care practice affiliated with a large urban public teaching hospital. Physicians in the practice were randomly assigned to either an intervention or a control group at the beginning of the study. Since physicians at the clinic each care for a fixed group of patients, their patients were similarly classified. During the study period, physicians in the intervention group received a computer-generated reminder when a patient with a scheduled appointment was eligible for a flu shot. Since a patient whose physician received a reminder was not obliged to get a flu shot, some patients in the intervention group did not receive one. Similarly, some patients in the control group got their flu shots, even though their doctors did not receive a reminder. Hence, noncompliance exists in this data set. This data set has the additional problem of missing data for the outcome variable.

1.3.6.2 Missing Data Problem

The reminder information is available for everyone, and we know whether each patient had a flu shot. However, we do not know what would have happened if the patient had been given a treatment assignment different from the one actually received. In addition, flu-related hospitalizations of some patients were not observed.
