Chapter 10
Imputation and Adjustment

10.1 Missing Data

Missing values may occur in a dataset either because they were not measured or because they got removed somewhere during data processing, for example, because values were deemed erroneous and deleted. In either case, the term ‘imputation’ is used to indicate the practice of completing a dataset that contains empty values.

At first sight, imputation is not all that different from any predictive modeling task. One constructs a model of an incomplete dataset, trains it on a subset of reliable data, and predicts unobserved values. There are some elements that set imputation apart from ‘general’ predictive modeling, however. The first is that imputation tasks are often specifically done with the purpose of inference in mind. That is, one is usually less interested in the specific value for a single record than in the properties of the mean, variance, or covariance structure of the whole dataset or population. Secondly, and this is arguably related to the first issue, there are a number of methods that are commonly used in imputation but rarely in predictive modeling. Examples include nearest-neighbor imputation or expectation–maximization-based techniques.

The literature on imputation methodology is extensive, and excellent textbooks and review papers are available [see, e.g., Anderson (2002); Andridge and Little (2010); Donders et al. (2006); Kalton and Kasprzyk (1986); Schafer and Graham (2002); Zhang (2003)]. The current chapter therefore summarizes some of the most common issues and imputation methods and focuses on methods available in R.

10.1.1 Missing Data Mechanisms

One commonly distinguishes three missing data mechanisms [attributed to Rubin (1976), but see Little and Rubin (2002) or Schafer and Graham (2002)] that determine the basic probability structure of the missing value locations in a data record. To set up the mechanisms, consider three random variables: the variable of interest $Y$, an auxiliary variable $X$, and a missing value indicator $M$. Here, $M$ is a binary variable that indicates whether a realization of $Y$ is observed or not. One can imagine a dataset with realizations of these variables and model the joint probability $P(Y, X, M)$. Three models are distinguished, each of which we state in two equivalent representations as follows:

$$\text{MCAR:}\quad P(Y, X, M) = P(Y, X)\,P(M) = P(Y \mid X)\,P(X)\,P(M) \tag{10.1}$$

$$\text{MAR:}\quad P(Y, X, M) = P(Y, X)\,P(M \mid X) = P(Y \mid X)\,P(M \mid X)\,P(X) \tag{10.2}$$

$$\text{NMAR:}\quad P(Y, X, M) = P(M \mid Y, X)\,P(Y, X) = P(M \mid Y, X)\,P(Y \mid X)\,P(X) \tag{10.3}$$

In the first model (MCAR: missing completely at random), it is assumed that the distribution of the missing value indicator is independent of both $Y$ and $X$. If the dataset is created as a simple random sample, the missing values merely make that sample smaller, possibly with a different fraction for $Y$ and $X$.

In the second model, labeled MAR (missing at random), it is assumed that the value of $Y$ gives no information about whether it is observed or not. However, the distribution of $M$ depends on the value of $X$ (and $Y$'s distribution could depend on the value of $X$). Gelman and Hill (2006) distinguish further between dependence of $M$ on observed and unobserved variables $X$.

The last case, labeled NMAR (not missing at random), is not really a model at all. The two expressions on the right-hand side are simply rewrites of $P(Y, X, M)$ using the expression for conditional probabilities. In this case, the distribution of the missing value indicator may depend on both the values of $Y$ and $X$.

Although these models give a clear classification of probability models for the missing/observed status of a variable, it is in practice not always possible to distinguish between them based on observed data alone. Suppose that one constructs a dataset with $Y$ and $X$ independent (so $P(Y, X) = P(Y)P(X)$). Now, remove all values of $Y$ above a certain threshold. Clearly, this is the not missing at random case, since the distribution of $M$ depends on the value of $Y$. However, an analyst who has no access to the missing realizations of $Y$ will not be able to detect the correlation between the values of $M$ and $Y$. Indeed, it is impossible for the analyst to distinguish the NMAR situation from MCAR. Now suppose that $Y$ and $X$ are correlated so that larger values of $Y$ co-occur with larger values of $X$. In that case, $M$ will be correlated with $X$, and the analyst observes a MAR situation.
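The thought experiment above can be mimicked with a small simulation. The following is a minimal sketch in base R, not part of the original example: the sample size, the threshold (the 80% quantile of the variable of interest), the correlation of 0.8, and the helper name simulate_m_x_cor are all arbitrary assumptions chosen for illustration.

```r
# Sketch: values of y above a threshold are deleted (an NMAR mechanism).
# The analyst can only relate the missingness indicator m to the auxiliary x.
set.seed(1)

simulate_m_x_cor <- function(rho, n = 10000) {
  x <- rnorm(n)
  # y has unit variance and correlation rho with x
  y <- rho * x + sqrt(1 - rho^2) * rnorm(n)
  # m = TRUE marks a value of y that is deleted
  m <- y > quantile(y, 0.8)
  cor(as.numeric(m), x)
}

cor_independent <- simulate_m_x_cor(rho = 0)    # indistinguishable from MCAR
cor_correlated  <- simulate_m_x_cor(rho = 0.8)  # presents itself as MAR
round(c(independent = cor_independent, correlated = cor_correlated), 2)
```

With independent variables the correlation between the indicator and the auxiliary variable should be close to zero, while with correlated variables it should be clearly positive, even though the deletion rule is the same in both runs.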

The approach that is commonly taken is rather practical. To accommodate MCAR and MAR situations, many popular imputation methods simply attempt to leverage all the auxiliary information available (e.g., MICE, missForest, to be discussed later). To correct for the NMAR case, one needs to make assumptions about or somehow model the data collection process. To assess, in approximation, the effect of a possible NMAR missing data mechanism on outcomes, one may have to resort to sensitivity analyses using a simulation of the missing value mechanism.

10.1.2 Visualizing and Testing for Patterns in Missing Data Using R

Effective summarization of missing values across a multivariate dataset can provide incentives for investigating the missing value mechanism. Ideally (apart from having complete data), missing values are distributed as MCAR. If there are patterns in missing data pointing to a MAR situation, those patterns require an explanation or, when relevant and possible, the data collection process can be altered to prevent such patterns from occurring.

The VIM package (Templ et al., 2012, 2016) offers a number of visualizations and aggregations for pattern discovery in missing data. As an example, we will use the retailers dataset that comes with the validate package. The VIM function aggr aggregates missing value patterns per variable and per record.

  data("retailers",package="validate")
  VIM::aggr(retailers[3:9], sortComb=TRUE, sortVar=TRUE, only.miss=TRUE)
  ##
  ##  Variables sorted by number of missings:
  ##     Variable      Count
  ##    other.rev 0.60000000
  ##  staff.costs 0.16666667
  ##        staff 0.10000000
  ##  total.costs 0.08333333
  ##       profit 0.08333333
  ##     turnover 0.06666667
  ##    total.rev 0.03333333

The result is an overview of the fraction of missing values per variable. Here, the option to sort variables from high to low fractions of missing values is used (sortVar=TRUE). As a side effect, a plot of the aggregates, shown in Figure 10.1, is created. The visualization contains a barplot showing fractions of missing values per variable in panel (a) and a rectangular, space-filling plot indicating the occurrence of missing value combinations in panel (b). In the latter plot, each column in a space-filling grid of squares represents a variable, and each row represents a single occurring missing data pattern. A value that is missing is colored gray, and the observed values are colored light gray (by default). On the right there is a vertical bar chart that indicates for each row how often that pattern occurs. In this example, we also sort the graph by variables (decreasing left to right in fraction of missing values) and combination (sortComb=TRUE). Also, we specify that the bar chart heights should be relative to the number of patterns that contain at least one missing value (only.miss=TRUE). The total fraction of complete records represented is printed at the right. The graph shows that the variable other revenue is missing most often, while the combination other revenue and staff is the most often missing combination in this dataset.

**Figure 10.1** Percentages of missing values per variable (a) and occurrence of missing data patterns (b) in the `retailers` dataset, plotted with `VIM::aggr`.
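The two aggregates described above (fraction missing per variable and frequency of each missingness pattern) can also be computed directly. The sketch below uses base R only and a small made-up data frame; the values and the pattern encoding (1 for missing, 0 for observed) are assumptions for illustration and are not how VIM computes or displays its results.

```r
# Hypothetical toy data; NA marks a missing value.
dat <- data.frame(
  staff     = c(75, 9, NA, 3, 14, NA),
  turnover  = c(NA, 1607, 6886, 3861, 5565, 25),
  other.rev = c(NA, NA, -33, 13, 37, NA)
)

is_missing <- is.na(dat)

# Fraction of missing values per variable, sorted from high to low
frac_missing <- sort(colMeans(is_missing), decreasing = TRUE)
frac_missing

# Encode each record's missingness pattern as a string and count how
# often every pattern occurs
pattern <- apply(is_missing, 1,
                 function(r) paste(as.integer(r), collapse = ""))
pattern_counts <- sort(table(pattern), decreasing = TRUE)
pattern_counts
```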

To detect whether a variable's missing value mechanism is MAR with respect to a second variable, the missingness indicator of the first variable can be used to split the dataset into two groups. The observed distributions of the second variable for the two groups can then be compared to distinguish between MCAR and MAR. In the case of MCAR, one expects the distributions to be similar. With VIM::pbox, one chooses a single numerical variable and compares its distribution with respect to the missingness indicator of all other variables. Here, we compare the distribution of staff against the status (missing or present) of other variables.

  VIM::pbox(retailers[3:9], pos=1)

The result is shown in Figure 10.2. The leftmost boxplot shows the distribution of staff, with numbers indicating that there are 60 observations of which 6 are missing. The other boxplots, occurring in pairs, compare the distributions of staff, split according to the missingness of another variable. For example, the distribution of staff in the case where other.rev is observed (shown in light gray) appears to differ from the case where other.rev is missing. This indicates a possible MAR situation for other.rev with respect to staff. The widths of the boxplots indicate the number of observations used in producing the boxplot: a (very) thin boxplot indicates that the difference in distributions is supported by little evidence. The number of observations (top) and missing values (bottom) per group are printed below the boxes.

**Figure 10.2** Parallel boxplots, comparing the distribution of *staff* conditional on the missingness of other variables.

To confirm or reject our suspicion that the locations of the distributions differ significantly, we perform a two-sample $t$-test (R's t.test uses the Welch variant by default, as the output below shows). Here, we use log-transformed data since economic data tend to follow highly skewed distributions (typically close to log-normal).

  t.test(log(staff) ~ is.na(other.rev), data=retailers)
  ##
  ##   Welch Two Sample t-test
  ##
  ## data:  log(staff) by is.na(other.rev)
  ## t = 2.7464, df = 46.014, p-value = 0.008572
  ## alternative hypothesis: true difference in means is not equal to 0
  ## 95 percent confidence interval:
  ##  0.1985149 1.2880867
  ## sample estimates:
  ## mean in group FALSE  mean in group TRUE
  ##            2.329996            1.586695

The low $p$-value indicates that the null hypothesis (means are equal across groups) may be rejected with a low probability ($p \approx 0.009$) of error. The conclusion is that missingness of other.rev is to be treated as MAR with respect to staff.

One may wonder whether the reverse is also true: is the missingness of staff MAR with respect to the values observed in other.rev? A quick insight into this question can be obtained by drawing a so-called marginplot (here, using log-transformed variables to accommodate their skewed distributions).

  dat <- log10(abs(retailers[c(3,5)]))
  VIM::marginplot(dat, las=1, pch=16)

The marginplot (Figure 10.3) shows a scatterplot of cases where both variables are observed. The margins show, in light gray, boxplots of the variable depicted on the respective axis. These are contrasted with boxplots (in dark gray) of the same variable, but for the case where the other variable is missing. So from the boxplots in the margins of the $x$-axis, we read off that other.rev may be MAR with respect to staff. Similarly, from the boxplots in the $y$-axis, we read off that staff might be MAR with respect to other.rev. The actual values used to produce the dark gray boxplots are also represented in the margins as dark gray dots. The number of missing values per variable and the number of co-occurring missing values are denoted in the margins as well.

**Figure 10.3** Marginplot of *other.rev* against *staff*.

Exercises for Section 10.1

10.2 Model-Based Imputation

The literature on imputation methodology is extensive. Besides a broad range of general imputation methods, many methods have been developed with specific applications in mind. Examples include methodology for longitudinal data (Fitzmaurice et al., 2008) or social network data (see Huisman (2009) and references therein). Here, we focus on a number of well-established methods covering a broad range of applications that are readily available in R.

In a predictive model, the target variable $y$ is described by an estimating function or algorithm $f$ depending on one or more predictors $\boldsymbol{x}$ and one or more parameters $\boldsymbol{\theta}$:

$$y = f(\boldsymbol{x}, \boldsymbol{\theta}) + \varepsilon, \tag{10.4}$$

where $\varepsilon$ is the residual of the model, the part of variation in $y$ that is not described by $f$. The predictor variables may be real-valued or categorical. In the latter case, a categorical variable taking $c$ values is represented by $c - 1$ binary dummy variables. Likewise, the predicted variable $y$ can be real-valued or categorical. If $y$ is a binary variable, the possible values can be coded as 0 and 1. The form of $f$ can be chosen to estimate the probability of $y$ taking the value 1, so it only takes values in the range $[0, 1]$. If $y$ can take $K$ different values, we may label them as $1, 2, \ldots, K$. One then determines $K$ model functions $f_k$, each estimating the probability $P(y = k \mid \boldsymbol{x})$.

The values for $\boldsymbol{\theta}$ are estimated by minimizing a loss function such as the negative log-likelihood over known values of $y$. Once the estimates $\hat{\boldsymbol{\theta}}$ are obtained, imputed values $\hat{y}$ for numerical $y$ are determined as

$$\hat{y} = f(\boldsymbol{x}, \hat{\boldsymbol{\theta}}) + \hat{\varepsilon}, \tag{10.5}$$

with $\boldsymbol{x}$ a vector of observed values for the predictors and $\hat{\varepsilon}$ a chosen residual value. A common choice is to set $\hat{\varepsilon} = 0$, so the imputed value is the best estimate of $y$ given $\boldsymbol{x}$, $f$, and the loss function. If one is interested in individual predictions only, this is the usual choice. When dealing with imputation problems, however, one is often interested in reconstructing the (co)variance structure of a dataset. Other options therefore include sampling $\hat{\varepsilon}$ from $N(0, \hat{\sigma})$, where $\hat{\sigma}^2$ is the estimated variance of $\varepsilon$, or sampling $\hat{\varepsilon}$ (uniformly) from the observed set of residuals. Methods where $\hat{\varepsilon}$ is sampled are referred to as stochastic imputation methods. If $y$ is a categorical variable, the actual predicted value may be the one that is assigned the highest probability, so $\hat{y} = \arg\max_k f_k(\boldsymbol{x}, \hat{\boldsymbol{\theta}})$. Alternatively, one can sample a value from $\{1, 2, \ldots, K\}$ assuming the $f_k$ as probability distribution over the domain of $y$.

If $y$ is a real-valued variable, a common model is the linear model

$$f(\boldsymbol{x}, \boldsymbol{\beta}) = \boldsymbol{x} \cdot \boldsymbol{\beta} = \beta_0 x_0 + \beta_1 x_1 + \cdots + \beta_k x_k.$$

Here, $\beta_0$ usually (but not necessarily) represents the intercept, so $x_0 = 1$. Given a set of $n$ observations $(y_i, \boldsymbol{x}_i)$, the most popular loss function to estimate $\boldsymbol{\beta}$ is the sum of squares, so

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} (y_i - \boldsymbol{x}_i \cdot \boldsymbol{\beta})^2, \tag{10.6}$$

where $\boldsymbol{x}_i$ represents the $i$th row in the matrix $\boldsymbol{X}$ of predictor values. The resulting estimator can be interpreted as the conditional expectation of $y$ given $\boldsymbol{x}$ or $E(y \mid \boldsymbol{x})$. There are several variations on the quadratic loss function including ridge regression (Hoerl and Kennard, 1970), lasso regression (Tibshirani, 1996), and their generalization: elasticnet regression (Zou and Hastie, 2005). Each of these methods forms an attempt to cope with high variability in the training data by adding terms that penalize the size of the $\beta_j$ ($j \neq 0$). Other robust alternatives include the class of $M$-estimators. There, the loss function is adapted to decrease the contribution of highly influential records in the training set [see, e.g., Huber (2011) or Maronna et al. (2006)].

If we denote the matrix $\boldsymbol{X} = [\boldsymbol{X}_0, \boldsymbol{X}_1, \ldots, \boldsymbol{X}_k]$, where the $\boldsymbol{X}_j$ are columns of observed values (possibly including the intercept ‘variable’ $\boldsymbol{X}_0$), the imputed values can be written as

$$\hat{\boldsymbol{y}} = \boldsymbol{X}\hat{\boldsymbol{\beta}} + \hat{\boldsymbol{\varepsilon}}. \tag{10.7}$$

It was demonstrated by Kalton and Kasprzyk (1986) and more extensively by de Waal et al. (2011, Chapter 7) that a surprising number of common imputation methods can be written in this form when the choices for the predictors and the residuals are appropriately adapted. Methods that can be written in this form include (group) mean imputation, linear regression and ratio imputation, nearest-neighbor imputation, the deductive imputation method of Section 9.3.2, and several forms of stochastic imputation. If the (regularized) quadratic loss function is also allowed to vary, even more imputation methods can be written in this form. In particular, if the quadratic loss function is replaced with the least absolute deviation (LAD), we get

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} |y_i - \boldsymbol{x}_i \cdot \boldsymbol{\beta}|. \tag{10.8}$$

One can show that in this case [see, e.g., Koenker (2005); Chen et al. (2008)]

$$\boldsymbol{x} \cdot \hat{\boldsymbol{\beta}} = \operatorname{median}(y \mid \boldsymbol{x}).$$

Thus, imputing the conditional (group-wise) median can also be summarized under this notation.
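The link between the choice of loss function and the imputed statistic can be checked numerically in the simplest case, an intercept-only model. The sketch below is an illustration under assumptions: the five data values are made up, and a generic one-dimensional optimizer (optimize from base R) stands in for a proper regression routine.

```r
# Intercept-only model: every record gets the same fitted value b.
y <- c(2, 3, 5, 8, 40)   # hypothetical values with one outlier

sum_of_squares <- function(b) sum((y - b)^2)   # quadratic loss
sum_of_abs_dev <- function(b) sum(abs(y - b))  # least absolute deviation

beta_ls  <- optimize(sum_of_squares, interval = range(y))$minimum
beta_lad <- optimize(sum_of_abs_dev, interval = range(y))$minimum

# The quadratic loss recovers the mean, the LAD loss the median
c(least_squares = beta_ls, mean = mean(y))
c(lad = beta_lad, median = median(y))
```

The outlier pulls the least-squares solution upward, while the LAD solution stays at the middle observation, which is why median imputation is less sensitive to extreme values.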

Exercises for Section 10.2

10.3 Model-Based Imputation in R

The large amount of literature on imputation methodology is reflected in the large number of R packages implementing them. At the time of writing there are dozens of packages mentioning ‘impute’ or ‘imputation’ in their description. Here, we will demonstrate a number of imputation methods using the simputation package. The reason for choosing this particular package is that it offers a consistent and (to R-users) familiar interface to many imputation models. The package relies mostly on other packages for computing the models and generating predictions. In some cases the backend can be chosen (e.g., one can make simputation use VIM for certain types of hotdeck imputations).

10.3.1 Specifying Imputation Methods with `simputation`

With the simputation package the specification of an imputation method always has the following form:

  impute_<model-abbreviation>(dat, formula, [model-specific options], ...)

where <model-abbreviation> is replaced with an abbreviated name for the predictive model to be used (e.g., lm for linear models), dat is the dataset to be imputed, and formula specifies the relation between imputed and predicting variables. Depending on the method there may be some simputation-specific options, and all extra arguments (...) are passed to the underlying modeling functions.

The formula object is an expression of the form

 imputed_variables ~ predicting_variables [ | grouping_variables ]

where imputed_variables specifies what variables should be imputed and predicting_variables specifies the combination of variables to be used as predictors. The terms enclosed in brackets are optional. The grouping_variables term can be used to specify a split-apply-combine strategy for imputation. The dataset is split according to the value combinations of grouping variables, the imputation model is estimated for each subset, values are imputed, and the dataset recombined.

Contrary to most modeling functions in R, the specification of imputed (dependent) variables is flexible and can contain multiple variables. The simputation package will simply loop over all variables to be imputed, estimating models as needed. For example, the specification

  y ~ foo + bar

specifies that variable y should be imputed, using foo and bar as predictors. To impute multiple variables, one can just add variables on the left-hand side.

 y1 + y2 + y3 ~ foo + bar

Here, y1, y2, and y3 are imputed using foo and bar as predictors. The dot (.) stands for ‘every variable not mentioned earlier’, so

 . ~ foo + bar

is to be interpreted as impute every variable, using foo and bar as predictors. It depends on the imputation method whether that means that foo and bar can also be imputed. The simputation package will remove predictors from the list of imputed variables when necessary. Finally, it is also possible to remove variables. The formula

 . - x ~ foo + bar

means impute every variable except x using foo and bar as predictors.

The form that predicting_variables can take depends on the chosen imputation model. For example, in linear modeling, the option to model interaction effects (e.g., foo:bar) is relevant, while for other models it is not.

10.3.2 Linear Regression-Based Imputation

Linear regression imputation can be applied to impute numerical variables, using numerical and/or categorical variables and possibly their interaction effects as predictors. The model function is given by $f(\boldsymbol{x}, \boldsymbol{\beta}) = \boldsymbol{x} \cdot \boldsymbol{\beta}$, where $\boldsymbol{\beta}$ is estimated with Eq. (10.6). In particular, we can partition the vector of $y$-values as $\boldsymbol{y} = (\boldsymbol{y}_o, \boldsymbol{y}_m)$, where $o$ indicates where $y$ is observed and $m$ indicates where $y$ is missing. Accordingly, the matrix $\boldsymbol{X}$ with predictor values can be partitioned in rows $\boldsymbol{X}_o$ where $y$ is observed and rows $\boldsymbol{X}_m$ where $y$ is missing. The value of $\boldsymbol{\beta}$ is then estimated over the observed values of $y$

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \|\boldsymbol{y}_o - \boldsymbol{X}_o\boldsymbol{\beta}\|^2, \tag{10.9}$$

after which the missing values can be estimated as

$$\hat{\boldsymbol{y}}_m = \boldsymbol{X}_m\hat{\boldsymbol{\beta}}.$$
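The two steps (estimate over the observed rows, then predict for the missing rows) can be written out in a few lines of base R. This is a sketch on simulated data, not the way simputation is implemented: the data-generating model, the number of deleted values, and the variable names X_o, X_m, and beta_hat are assumptions for illustration. The least-squares minimizer is computed from the normal equations and cross-checked against lm.

```r
set.seed(1)
n <- 100
x <- runif(n, 1, 10)
y <- 3 + 2 * x + rnorm(n)
y[sample(n, 20)] <- NA          # delete 20 values of y at random

obs <- !is.na(y)
X   <- cbind(1, x)              # first column is the intercept "variable"
X_o <- X[obs, , drop = FALSE]   # rows where y is observed
X_m <- X[!obs, , drop = FALSE]  # rows where y is missing

# Least-squares estimate over the observed part (normal equations)
beta_hat <- solve(crossprod(X_o), crossprod(X_o, y[obs]))

# Impute the missing part with the fitted values
y_imp <- y
y_imp[!obs] <- drop(X_m %*% beta_hat)

# Cross-check against lm(), which drops incomplete rows by default
fit  <- lm(y ~ x)
same <- all.equal(unname(y_imp[!obs]),
                  unname(predict(fit, newdata = data.frame(x = x[!obs]))))
same
```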

With the simputation package, linear model imputation can be performed with the impute_lm function. In the following paragraphs a few columns of the retailers dataset from the validate package will be used.

  library(simputation)
  library(magrittr) # for convenience
  data(retailers, package="validate")
  retl <- retailers[c(1,3:6,10)]
  head(retl, n=3)
  ##   size staff turnover other.rev total.rev vat
  ## 1  sc0    75       NA        NA      1130  NA
  ## 2  sc3     9     1607        NA      1607  NA
  ## 3  sc3    NA     6886       -33      6919  NA

We will be interested in imputing the values for turnover, other.rev, and total.rev. The simputation package relies on lm for estimating linear models, which means that several classes of imputation methods can be specified with ease.

In mean imputation, missing values are replaced by the column mean. To impute turnover, other.rev, and total.rev with their respective means, we specify

  impute_lm(retl, turnover + other.rev + total.rev ~ 1) %>% head(3)
  ##   size staff turnover other.rev total.rev vat
  ## 1  sc0    75 20279.48  4218.292      1130  NA
  ## 2  sc3     9  1607.00  4218.292      1607  NA
  ## 3  sc3    NA  6886.00   -33.000      6919  NA

It is well known that mean imputation leads to a gross underestimation of the variance of estimated means (when computed over the imputed dataset).
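The effect on the variance is easy to reproduce. The following sketch uses base R and simulated data; the log-normal distribution, the sample size of 200, and the 30% missing fraction are assumptions made only for this illustration.

```r
set.seed(1)
y <- rlnorm(200, meanlog = 7, sdlog = 1)   # hypothetical skewed variable
y_obs <- y
y_obs[sample(200, 60)] <- NA               # 30% missing completely at random

# Mean imputation: every missing value receives the observed mean
y_imp <- y_obs
y_imp[is.na(y_imp)] <- mean(y_obs, na.rm = TRUE)

# Sample variance before and after mean imputation
c(observed = var(y_obs, na.rm = TRUE), imputed = var(y_imp))

# Naive standard error of the mean, treating imputed values as real data
c(observed = sd(y_obs, na.rm = TRUE) / sqrt(sum(!is.na(y_obs))),
  imputed  = sd(y_imp) / sqrt(length(y_imp)))
```

The mean itself is unchanged by the imputation, but the imputed records add no spread while they do increase the apparent sample size, so both the variance and the naive standard error shrink.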

A slightly better procedure is to impute the group mean. Here, we impute missing variables using size (a size classification) as grouping variable.

  impute_lm(retl, turnover + other.rev + total.rev ~ 1 | size) %>% head(3)
  ##   size staff turnover other.rev total.rev vat
  ## 1  sc0    75 1420.375    315.50      1130  NA
  ## 2  sc3     9 1607.000   6169.25      1607  NA
  ## 3  sc3    NA 6886.000    -33.00      6919  NA

By specifying size after the vertical bar, we make sure that simputation does the split-apply-combine work over the grouping variable. The same result can be achieved as follows:

  impute_lm(retl, turnover + other.rev + total.rev ~ size) %>% head(3)
  ##   size staff turnover other.rev total.rev vat
  ## 1  sc0    75 1420.375    315.50      1130  NA
  ## 2  sc3     9 1607.000   6169.25      1607  NA
  ## 3  sc3    NA 6886.000    -33.00      6919  NA

where we let lm estimate a model with size as predictor. The latter method is slightly less robust. When one of the groups contains only missing values for one of the predicted variables, lm will stop, while the split-apply-combine procedure of simputation can handle such cases.
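The equivalence of the two specifications can be illustrated without simputation. The sketch below is a base R approximation of the idea, not the package's internal code: the toy data frame and the names imp_split and imp_lm are hypothetical. The first route splits by group and imputes the group mean; the second takes predictions from a linear model with the group as its only predictor.

```r
# Hypothetical data: a size class and a numeric variable with missing values
dat <- data.frame(
  size     = factor(c("sc0", "sc0", "sc0", "sc1", "sc1", "sc1", "sc1")),
  turnover = c(10, 14, NA, 100, NA, 120, 110)
)

# Split-apply-combine: replace missing values by the mean of their group
group_mean <- ave(dat$turnover, dat$size,
                  FUN = function(v) mean(v, na.rm = TRUE))
imp_split <- ifelse(is.na(dat$turnover), group_mean, dat$turnover)

# The same imputations from a linear model with the group as predictor
fit    <- lm(turnover ~ size, data = dat)
imp_lm <- as.numeric(ifelse(is.na(dat$turnover),
                            predict(fit, newdata = dat), dat$turnover))

cbind(dat, imp_split, imp_lm)
```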

Ratio imputation uses the model $\hat{y} = \hat{R}x$, where $\hat{R}$ is the ratio of the mean of $y$ and the mean of $x$. It is equivalent to a linear model with a single predictor, no intercept, weighted according to the reciprocal of the predictor. Here, the three variables are imputed with staff (the number of employees) as predictor.

  impute_lm(retl, turnover + other.rev + total.rev ~ staff - 1
   , weight=1/retl$staff) %>% head(3)
  ##   size staff turnover other.rev total.rev vat
  ## 1  sc0    75 26187.55 26426.132      1130  NA
  ## 2  sc3     9  1607.00  3171.136      1607  NA
  ## 3  sc3    NA  6886.00   -33.000      6919  NA

Ratio imputation is often used as a growth estimate, that is, in cases where a current value as well as a past value is known.
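That the ratio of means coincides with a weighted regression through the origin follows from the first-order condition: minimizing the sum of $(y_i - \beta x_i)^2 / x_i$ over $\beta$ requires the sum of $(y_i - \beta x_i)$ to be zero, so $\beta$ equals the sum of the $y_i$ divided by the sum of the $x_i$. A quick numerical check in base R, on simulated data whose distribution and size are assumptions for illustration (the predictor must be strictly positive for the weights to make sense):

```r
set.seed(1)
x <- runif(50, 1, 100)                   # hypothetical predictor, e.g. staff
y <- 300 * x * exp(rnorm(50, sd = 0.3))  # hypothetical target, e.g. turnover

# Ratio of means
R_hat <- mean(y) / mean(x)

# Weighted least squares through the origin with weights 1/x
fit <- lm(y ~ x - 1, weights = 1 / x)

c(ratio = R_hat, wls = unname(coef(fit)))
```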

In linear regression imputation, one or more predictors may be used to impute a value based on a linear model. Below, the number of staff and turnover reported for value-added tax (vat) are used as predictors.

  impute_lm(retl, turnover + other.rev + total.rev ~ staff + vat
            )%>% head(3)
  ##   size staff turnover other.rev total.rev vat
  ## 1  sc0    75       NA        NA      1130  NA
  ## 2  sc3     9     1607        NA      1607  NA
  ## 3  sc3    NA     6886       -33      6919  NA

Observe that in the first three rows, nothing is imputed. The reason is that in those cases, the predictor vat is missing. This illustrates a general property of simputation. The package will leave values untouched when one of the predictors is missing and return the partially imputed dataset.

Each of these models imputes the expected value, given zero or more predictors. They can be made stochastic by adding to each estimated value a random residual $\hat{\varepsilon}$, as denoted in Eq. (10.5). For model-based imputation methods, simputation supports three options: $\hat{\varepsilon} = 0$, which is the default; $\hat{\varepsilon} \sim N(0, \hat{\sigma})$, with $\hat{\sigma}^2$ the estimated variance of the residuals; and $\hat{\varepsilon}$ sampled from the observed residuals. They can be specified with the add_residual option.

  # make results reproducible
  set.seed(1)
  # add normal residual
  impute_lm(retl
     , turnover + other.rev + total.rev ~ staff
     , add_residual = "normal") %>% head(3)
  ##   size staff turnover other.rev total.rev vat
  ## 1  sc0    75 1100.659  11737.95      1130  NA
  ## 2  sc3     9 1607.000 -12914.62      1607  NA
  ## 3  sc3    NA 6886.000    -33.00      6919  NA
  # add observed residual
  impute_lm(retl
     , turnover + other.rev + total.rev ~ staff
     , add_residual = "observed") %>% head(3)
  ##   size staff turnover other.rev total.rev vat
  ## 1  sc0    75 7439.802  -659.214      1130  NA
  ## 2  sc3     9 1607.000  1015.605      1607  NA
  ## 3  sc3    NA 6886.000   -33.000      6919  NA
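To see why one would add a residual at all, the three options can be imitated in base R on simulated data where the deleted values are known. This is a sketch under assumptions (the data-generating model, the missing fraction, and the names imp_none, imp_normal, and imp_observed are made up here), and it does not call simputation.

```r
set.seed(1)
n <- 2000
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n, sd = 2)
miss  <- sample(n, 600)                # positions deleted completely at random
y_obs <- replace(y, miss, NA)

fit  <- lm(y_obs ~ x)                  # estimated on observed records only
pred <- predict(fit, newdata = data.frame(x = x[miss]))

# Residual set to zero: impute the expected value
imp_none <- pred
# Residual drawn from a normal distribution with the estimated residual sd
imp_normal <- pred + rnorm(length(miss), sd = sigma(fit))
# Residual resampled from the observed residuals
imp_observed <- pred + sample(residuals(fit), length(miss), replace = TRUE)

# Variance of the imputed values compared with the true (deleted) values
c(truth = var(y[miss]), none = var(imp_none),
  normal = var(imp_normal), observed = var(imp_observed))
```

Imputing the expected value reproduces only the part of the variance explained by the predictor, while both stochastic variants should bring the variance of the imputed values close to that of the deleted values.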

10.3.3 $M$-Estimation

The $M$-estimation method aims to reduce the influence of outliers on linear model coefficients by replacing the loss function of Eq. (10.9) with a suitable function $\rho$ so that

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} \rho(y_i - \boldsymbol{x}_i \cdot \boldsymbol{\beta}),$$

where $n$ is the number of observed records $\boldsymbol{y}_o$. To find $\hat{\boldsymbol{\beta}}$, one solves the system of equations for $\boldsymbol{\beta}$

$$\sum_{i=1}^{n} \psi(y_i - \boldsymbol{x}_i \cdot \boldsymbol{\beta})\, x_{ij} = 0, \quad j = 0, 1, \ldots, k.$$

Here, we defined the so-called influence function $\psi = \rho'$, the derivative of $\rho$. It determines the relative influence of each observation on the solution. If we set $\psi(\varepsilon) = 2\varepsilon$, then $\rho(\varepsilon) = \varepsilon^2$ (up to an unimportant additive constant) and Eq. (10.9) is returned.

The influence function is chosen so that it is less sensitive to increasing values of $|\varepsilon|$ than that of the standard quadratic loss function. A few popular choices are proposals by Huber et al. (1964), Hampel et al. (1986), and Tukey's bisquare function, which are also available in R through the MASS package.

The parameters $c010-math-151$ , $c010-math-152$ , $c010-math-153$ , $c010-math-154$ , and $c010-math-155$ are tuning parameters that determine the rate of increase as a function of $c010-math-156$ and the locations where $c010-math-157$ levels off. Figure 10.4 shows the shape of the $c010-math-158$ and $c010-math-159$ functions of Huber, Tukey, and Hampel. The constants were chosen so that the regression estimators have an efficiency of 95% as described, for example, by Koller and Mächler (2016), that is, $c010-math-160$ , $c010-math-161$ , $c010-math-162$ , $c010-math-163$ , and $c010-math-164$ , where $c010-math-165$ .

With the simputation package, imputation based on $M$-estimated linear regression parameters can be done with the impute_rlm function. It uses the rlm function of the MASS package for coefficient estimation.

  impute_rlm(retl, turnover + other.rev + total.rev ~ staff) %>% head(3)
  ##   size staff turnover other.rev total.rev vat
  ## 1  sc0    75 12927.56 159.16574      1130  NA
  ## 2  sc3     9  1607.00  16.29018      1607  NA
  ## 3  sc3    NA  6886.00 -33.00000      6919  NA

The default is to use Huber's $\psi$ function with $k = 1.345$. Extra arguments are passed through to rlm. For example, rlm has the option to set method="MM". This sets a number of options ensuring that the regression estimator has a high breakdown point (qualitatively, the fraction of outliers that may be present in the data before the estimator gives unacceptable results).

  impute_rlm(retl, turnover + other.rev + total.rev ~ staff
   , method="MM") %>% head(3)
  ##   size staff turnover other.rev total.rev vat
  ## 1  sc0    75 13529.59  75.34900      1130  NA
  ## 2  sc3     9  1607.00  13.24304      1607  NA
  ## 3  sc3    NA  6886.00 -33.00000      6919  NA
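The estimating equations of this section are commonly solved by iteratively reweighted least squares: each record receives a weight equal to the influence function divided by the residual, and a weighted least-squares fit is repeated until the coefficients stabilize. The sketch below illustrates this in base R for Huber's function. It is a simplified illustration, not the algorithm as implemented in MASS: the simulated data, the five planted outliers, the use of mad as scale estimate in every iteration, and the stopping rule are all assumptions.

```r
set.seed(42)
n <- 50
x <- seq_len(n)
y <- 1 + 2 * x + rnorm(n)
y[46:50] <- y[46:50] + 100           # five gross outliers

X <- cbind(1, x)
beta_ols <- drop(solve(crossprod(X), crossprod(X, y)))

# Huber weights: psi(r) / r = min(1, k / |r|) for a scaled residual r
huber_weight <- function(r, k = 1.345) pmin(1, k / abs(r))

beta_huber <- beta_ols               # start from the least-squares fit
for (iter in 1:100) {
  res <- drop(y - X %*% beta_huber)
  s   <- mad(res)                    # robust estimate of the residual scale
  w   <- huber_weight(res / s)
  # Weighted least squares: solve (X' W X) beta = X' W y
  beta_new  <- drop(solve(crossprod(X, w * X), crossprod(X, w * y)))
  converged <- max(abs(beta_new - beta_huber)) < 1e-8
  beta_huber <- beta_new
  if (converged) break
}

rbind(ols = beta_ols, huber = beta_huber)
```

The data were generated with slope 2, so the comparison shows how far the outliers drag the least-squares slope and how much of that the bounded influence function undoes.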

10.3.4 Lasso, Ridge, and Elasticnet Regression

With impute_en, elasticnet regression is used to compute imputations. In linear elasticnet regression, the parameters are estimated as

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \frac{1}{2n}\|\boldsymbol{y}_o - \boldsymbol{X}_o\boldsymbol{\beta}\|_2^2 + \lambda\left[\frac{1 - \alpha}{2}\|\boldsymbol{\beta}^*\|_2^2 + \alpha\|\boldsymbol{\beta}^*\|_1\right],$$

where $\|\cdot\|_2$ is the familiar Euclidean norm and $\|\cdot\|_1$ the $L_1$-norm or the sum over absolute values of the coefficients of its argument. The penalty term is defined in terms of $\boldsymbol{\beta}^*$, which denotes all coefficients except the intercept (when present). The parameter $\alpha$ allows one to shift smoothly from ridge regression ($\alpha = 0$) to lasso regression ($\alpha = 1$), while $\lambda$ determines the overall strength of the penalty. The characteristic difference between lasso and ridge regression is that in the case of correlated predictors, lasso regression tends to push one or more coefficients to zero, while ridge regression spreads the value of coefficients over multiple correlated variables.

The simputation package implements elasticnet imputation through the impute_en function, which depends on the glmnet package of Friedman et al. (2010).

  impute_en(retl, turnover + other.rev + total.rev ~ staff + size
     , s=0.005, alpha=0.5) %>% head(3)
  ##   size staff turnover other.rev total.rev vat
  ## 1  sc0    75 13360.82 -8354.709      1130  NA
  ## 2  sc3     9  1607.00 10476.443      1607  NA
  ## 3  sc3    NA  6886.00   -33.000      6919  NA

Here, the predictor size is added since the glmnet package does not accept models with fewer than two predictors. The parameter s determines the size of the overall penalty factor $\lambda$ used in predictions, and we (arbitrarily) set alpha=0.5.
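The claim that ridge regression spreads coefficient values over correlated predictors can be illustrated with the closed-form ridge solution. The sketch below uses base R only and covers the ridge case alone; it does not use glmnet, and its penalty is not scaled the way glmnet scales it, so the value of lambda here is not comparable to the s argument above. The simulated data (two nearly identical predictors, no intercept) and lambda = 1 are assumptions for illustration.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)   # almost a copy of x1
y  <- x1 + x2 + rnorm(n)
X  <- cbind(x1, x2)              # no intercept, to keep the sketch short

# Ordinary least squares: unstable when predictors are nearly collinear
beta_ols <- drop(solve(crossprod(X), crossprod(X, y)))

# Ridge regression: minimize ||y - X b||^2 + lambda * ||b||^2
lambda <- 1
beta_ridge <- drop(solve(crossprod(X) + lambda * diag(2), crossprod(X, y)))

rbind(ols = beta_ols, ridge = beta_ridge)
```

Both fits should agree on the sum of the two coefficients, which the data determine well, but the penalty pulls the two ridge coefficients toward each other instead of letting them drift apart.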

10.3.5 Classification and Regression Trees

Decision tree models can be used in situations where the predicted variable is numeric and its dependence on the predictors is highly nonlinear, or when the predicted variable is categorical with probabilities that have complex dependence on predictor variables. In both the cases, the predictors can be numeric or categorical. Because decision tree models can be used to predict quantitative as well as qualitative variables, they are often referred to as classification and regression trees or CART, for short. The term was introduced by Breiman et al. (1984), and it also refers to a specific method for setting up the decision tree (which will be treated below).

Consider again a predicted variable $c010-math-184$ (to be imputed) and a set of predictors $c010-math-185$ . The idea is to partition the set of all possible value combinations $c010-math-186$ into disjoint regions, such that the corresponding value of $c010-math-187$ is as homogeneous as possible in each region. For numeric $c010-math-188$ , homogeneous usually means ‘a small variance’; for categorical data, it means ‘a high proportion of a single category’. Given a particular record $c010-math-189$ , one follows a binary decision tree $c010-math-190$ to see in what region the record falls. The predicted value for numeric $c010-math-191$ is then the mean value within the region (although more robust estimators are sometimes used as well), and the predicted value for categorical $c010-math-192$ is the category with the highest prevalence.

Before considering how such a decision tree is constructed, consider the example of Figure 10.5. Here, we used the rpart package to build a predictive model for the staff variable in the retailers dataset. Depicted are two representations of the resulting model. In Figure 10.5(a), the decision tree is shown. Each nonterminal node contains two numbers and a decision rule. The root node represents 100% of the records, and the mean value for staff over all those records equals 12 (rounded). If we partition the dataset according to the rule total.rev < 3464, we get 83% of the records for which this rule holds, with a mean number of staff of 7.7, and 17% of the records for which total.rev >= 3464, with a mean number of staff of 31. The latter case ends in a terminal node and thus corresponds with a single partition of the feature space. The records falling into this category are on the right of the second vertical line in Figure 10.5(b). The records on the left (with total.rev < 3464) are subdivided according to whether total.rev is smaller or larger than 968. The latter are subdivided once more based on whether the size variable equals "sc2" (depicted in white in panel (b)). The terminal nodes of the tree contain the actual predictions made for each partition along with the size of the partition.

When comparing this procedure with Eq. (10.5), we see that here, the model function is a procedure $c010-math-193$ , that is parameterized by a decision tree $c010-math-194$ . Based on the values of $c010-math-195$ , the tree is traversed until a terminal node is reached and the prediction returned.

The tree itself is built up iteratively. Given a $c010-math-196$ dataset $c010-math-197$ and a set of values $c010-math-198$ , the optimal split based on each variable $c010-math-199$ is computed. Of those $c010-math-200$ splits, the one resulting in the lowest error is chosen. This process is then repeated for each partition recursively. Note that for any dataset, it is in principle possible to create a perfect partition by growing the tree until each leaf has a single record in it (or only records with equal values for $c010-math-201$ ). In practice, one stops at some minimum number of records. The resulting (still large) tree is then pruned by removing leaf nodes from the bottom up. The final result is determined by a trade-off between error minimization and tree size $c010-math-202$ :

Here, $c010-math-203$ is the set of subtrees that can be obtained by pruning the initial tree. The function ‘error’ records the mismatch between prediction and observation, appropriate for the variable $c010-math-204$ (e.g., standard deviation for numerical variables and mismatch ratio for categorical variables). The term $c010-math-205$ penalizes the number of nodes $c010-math-206$ . Here, $c010-math-207$ is referred to as the cost-complexity parameter. It is determined automatically by computing $c010-math-208$ for a series of values of $c010-math-209$ and applying cross-validation to select the best one [see also James et al. (2013, Chapter 8) or Hastie et al. (2001, Chapter 9)].
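The grow-then-prune procedure can be tried directly with the rpart package. The sketch below is only an illustration under stated assumptions: it uses R's built-in airquality data rather than the retailers data, the control settings (cp = 0, minsplit = 10) are arbitrary choices, and selecting the row with the smallest cross-validated error is one common rule among several. Note that rpart reports its complexity parameter on a rescaled (relative) scale.

```r
library(rpart)

set.seed(42)  # the cross-validation folds are drawn at random

# Grow a deliberately large regression tree: cp = 0 switches off early
# stopping on complexity, minsplit bounds the size of nodes that may be split.
big_tree <- rpart(Ozone ~ Solar.R + Wind + Temp, data = airquality,
                  method = "anova",
                  control = rpart.control(cp = 0, minsplit = 10, xval = 10))

# The cptable holds one row per candidate complexity parameter, with the
# number of splits and the cross-validated error ("xerror") of that subtree.
cp_table <- big_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Prune back to the subtree that belongs to the selected value.
pruned <- prune(big_tree, cp = best_cp)
```

Printing cp_table shows how the cross-validated error first drops and then levels off or rises as the tree grows, which is the trade-off described above.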

With the simputation package, CART-based imputation can be performed with the impute_cart function. The specification of predictor variables tells impute_cart what variables can be used in the tree. In many cases, one can choose all variables except the predicted variable since decision trees have variable selection built-in.

  impute_cart(retl, staff ~ .) %>% head(3)
  ##   size    staff turnover other.rev total.rev vat
  ## 1  sc0 75.00000       NA        NA      1130  NA
  ## 2  sc3  9.00000     1607        NA      1607  NA
  ## 3  sc3 30.66667     6886       -33      6919  NA

Imputation took place in the third record. The imputation value can be traced by following the decision tree of Figure 10.5. The total revenue in the third record equals 6919. Since this is larger than 3464, we end in a leaf node immediately and predict a value of 30.67.

One advantage of CART models over linear models for imputation is their resilience against missing values in predictors. For a linear model, the imputation $c010-math-210$ cannot be estimated when any of the $c010-math-211$ happens to be missing unless it is somehow imputed. In a CART model, the actual decision tree contains more information than shown in Figure 10.5. Anticipating possible missing predictors, each node stores one or more backup split rules based on other predictors present in the dataset. If during prediction some value $c010-math-212$ is found to be missing, its first so-called surrogate variable is used to decide the split. If the surrogate is also missing, the next one is used, and so on, until an observed surrogate is found, or no surrogates are left. In the latter case, the most populated child node is chosen. The loss of quality of prediction is smaller when surrogates are highly correlated with the primary splitting variables.
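The difference in behavior can be checked with a small experiment. The following sketch assumes the built-in airquality data and a hypothetical new record whose Temp value is missing; it is not taken from the retailers example. A linear model returns NA for such a record, while an rpart tree still returns a prediction.

```r
library(rpart)

model <- Ozone ~ Solar.R + Wind + Temp

# Fit a regression tree and a linear model on the same (built-in) data.
tree_fit <- rpart(model, data = airquality, method = "anova")
lm_fit   <- lm(model, data = airquality)

# A hypothetical record with one predictor missing.
new_rec <- data.frame(Solar.R = 190, Wind = 7.4, Temp = NA_real_)

p_lm   <- predict(lm_fit,   newdata = new_rec)  # NA: the linear predictor cannot be evaluated
p_tree <- predict(tree_fit, newdata = new_rec)  # a number: splits on Temp are decided by
                                                # surrogates or by the majority direction
```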

10.3.6 Random Forest

Random forest (Breiman, 2001) is an ensemble-based improvement over CART. It can be used to predict both qualitative and quantitative variables. The idea is to take bootstrap samples from the original data, and grow a decision tree for each sample. Moreover, at each split, a subset of the $c010-math-213$ available predictors (typically about $c010-math-214$ ) is randomly chosen as possible splitting variables. Randomizing the available splitting variables is used to decrease correlation between the trees.

Training a random forest model thus results in a set of $c010-math-215$ trees $c010-math-216$ called a forest. If the predicted variable is numerical, the prediction is an aggregate over the individual predictions such as the mean,

but in principle, it is possible to use a robust aggregate such as the median as well. For categorical variables, the majority vote over the trees is taken.
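The bootstrap-and-aggregate part of this idea can be sketched with rpart alone. Note the swapped-in technique: the code below is plain bagging of regression trees, not a full random forest, because rpart does not draw a random subset of predictors at each split. The data (airquality), the number of trees, and the new record are assumptions made for illustration.

```r
library(rpart)

set.seed(1)
dat     <- airquality[!is.na(airquality$Ozone), ]   # records with observed response
n_trees <- 50                                       # arbitrary ensemble size
new_rec <- data.frame(Solar.R = 190, Wind = 7.4, Temp = 67)  # hypothetical record

# One prediction per tree, each tree grown on its own bootstrap sample.
tree_preds <- vapply(seq_len(n_trees), function(b) {
  boot <- dat[sample.int(nrow(dat), replace = TRUE), ]
  fit  <- rpart(Ozone ~ Solar.R + Wind + Temp, data = boot, method = "anova")
  unname(predict(fit, newdata = new_rec))
}, numeric(1))

pred_mean   <- mean(tree_preds)    # the usual aggregate for numeric variables
pred_median <- median(tree_preds)  # a more robust alternative
```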

With the simputation package, random forest models can be employed for imputation as follows:

  impute_rf(retl, staff ~ .) %>% head(3)
  ##   size staff turnover other.rev total.rev vat
  ## 1  sc0    75       NA        NA      1130  NA
  ## 2  sc3     9     1607        NA      1607  NA
  ## 3  sc3    NA     6886       -33      6919  NA

Random forest models are somewhat less resilient against missing predictors than the CART models discussed in the previous section. The underlying Fortran code by Breiman and Cutler (2004) is able to use a rough imputation scheme on the training set to generate the forest (this can be set by passing na_action=na.roughfix; this will impute medians for numeric data and modes for categorical data). For prediction, however, not every variable needs to be present: the average can be taken over the subset of trees that do return a value. However, for small datasets such as in this example, it may occur that all trees return NA, so no prediction is possible. Here, the problem can be partially resolved by removing a variable with few observations from the list of predictors.

  impute_rf(retl, staff ~ . - vat) %>% head(3)
  ##   size    staff turnover other.rev total.rev vat
  ## 1  sc0 75.00000       NA        NA      1130  NA
  ## 2  sc3  9.00000     1607        NA      1607  NA
  ## 3  sc3 20.23483     6886       -33      6919  NA

Besides simputation, there are other R packages implementing imputation based on random forests. The missForest package by Stekhoven and Bühlmann (2012) implements an iterative imputation procedure. To initialize, missing values are imputed using a simple rule. Next, a random forest is trained on the completed dataset, yielding updated imputation values. This second step is repeated until a convergence criterion has been satisfied. When installed, the missForest package can be interfaced via simputation using impute_mf.

  impute_mf(retl, staff ~ .) %>% head(3)
  ##   missForest iteration 1 in progress…done!
  ##   missForest iteration 2 in progress…done!
  ##   size staff turnover other.rev total.rev vat
  ## 1  sc0 75.00       NA        NA      1130  NA
  ## 2  sc3  9.00     1607        NA      1607  NA
  ## 3  sc3 18.15     6886       -33      6919  NA

Here, all variables are imputed (iteratively) and used as predictors, but only the variables on the left-hand side of the formula are copied to the resulting dataset. With the formula . ~ . all variables can be imputed.

10.4 Donor Imputation with R

In donor imputation, a missing value in one record is replaced with an observed value that is copied from another, otherwise similar record. The record from which the value is copied is referred to as the ‘donor’ record; hence, the name of the method. Donor imputation is also referred to as hot deck imputation. The term derives from the state of the art in computing when the method was first applied. Researchers would ‘hot deck impute’ by drawing from a deck of computer punch cards representing records (Andridge and Little, 2010; Cranmer and Gill, 2013).

When compared to model-based imputation, the advantage of donor imputation is that the imputed value is always an actually existing (observed) value. Statistical models always run the risk of predicting a value that is not (physically) possible, especially when extrapolating beyond the observed range of values. The downside of donor imputation is that in spite of its wide application, its theoretical underpinning is not as strong as for model-based methods. Moreover, Andridge and Little (2010) conclude in their extensive review that no consensus exists on the best way to apply hot deck imputation methods, and note that ‘many multivariate hot deck methods seem relatively ad hoc’. Nevertheless, hot deck methods have been commonly applied for a long time in areas related to official statistics [see Ono and Miller (1969); Bailar and Bailar (1979); and Cox (1980) for some early applications and method comparisons] and to a lesser extent in medical or epidemiological settings. Method comparisons are given in, for example, Barzi and Woodward (2004); Engels and Diehr (2003); Perez et al. (2002); Reilly and Pepe (1997); Tang et al. (2005), and Twisk and de Vente (2002).

Hot deck imputation methods are commonly categorized along two dimensions. The first dimension distinguishes between methods where multiple missing values in a record are imputed from the same donor (multivariate donor imputation) and methods where a separate donor may be appointed for each missing variable. The main advantage of multivariate donor imputation is that one only imputes valid and existing value combinations, so that imputed values cannot introduce inconsistencies. The downside is that the number of possible donors may be greatly reduced as the number of missing values in a record increases. The hot deck donor imputation routines in simputation have an option called pool, which controls this behavior. Its possible values are

`"complete"`:	Use only complete records as donor pool and perform multivariate imputation. This is the default.
`"univariate"`:	A new donor is sought for each variable.
`"multivariate"`:	For each occurring pattern of missingness find suitable donors and perform multivariate imputation.

The second dimension distinguishes between the various ways donor records are determined, and each of these methods (discussed next) can be executed in a univariate or multivariate fashion.

10.4.1 Random and Sequential Hot Deck Imputation

In random hot deck imputation, a donor is sampled from a donor pool. Often, a dataset is separated into imputation cells for which one or more auxiliary variables have the same values. With the simputation package, the imputation cells are determined by the right-hand side of the formula object specifying the model (we continue with the retl dataset constructed in the previous paragraph).

  set.seed(1) # make reproducible
  # random hot deck imputation (multivariate; complete cases are donor)
  impute_rhd(retl, turnover + other.rev + total.rev ~ size) %>% head(3)
  ##   size staff turnover other.rev total.rev vat
  ## 1  sc0    75      359         9      1130  NA
  ## 2  sc3     9     1607     98350      1607  NA
  ## 3  sc3    NA     6886       -33      6919  NA

In the above example, the dataset is split according to the size class label, and data are imputed in a multivariate manner using the default pool="complete". That is, for each record with missing values, a donor is sampled from the complete records within the same size class, and all missing values are copied from that donor. If multiple categorical variables are used to define imputation cells, the donor pools can quickly decrease in size, possibly leading to many imputations of the same value. Setting pool="univariate" can alleviate this issue to a small extent since per-variable donor pools are generally larger than multivariate donor pools.

  # random hot deck imputation (univariate)
  impute_rhd(retl, turnover + other.rev + total.rev ~ size
   , pool="univariate") %>% head(3)
  ##   size staff turnover other.rev total.rev vat
  ## 1  sc0    75      197       622      1130  NA
  ## 2  sc3     9     1607        33      1607  NA
  ## 3  sc3    NA     6886       -33      6919  NA

By default, records are drawn uniformly from the pool, but one can pass a numeric vector prob assigning a probability to each record in the imputed data. Probabilities will be rescaled as necessary, depending on grouping and donor pool specification.
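To make the univariate variant concrete, here is a minimal base R sketch of random hot deck imputation within imputation cells. It is not the simputation implementation; the function name random_hot_deck and the toy data are hypothetical, donors are drawn uniformly, and the cell variable is assumed to contain no missing values.

```r
# Univariate random hot deck: within each imputation cell, replace every
# missing value by a value drawn (with replacement) from the observed
# values in the same cell.
random_hot_deck <- function(x, cell) {
  for (g in unique(cell)) {
    in_cell    <- cell == g
    donors     <- x[in_cell & !is.na(x)]
    recipients <- which(in_cell & is.na(x))
    if (length(donors) == 0 || length(recipients) == 0) next
    # sample.int on positions avoids the pitfall of sample() when there
    # is only a single donor value
    pick <- sample.int(length(donors), length(recipients), replace = TRUE)
    x[recipients] <- donors[pick]
  }
  x
}

set.seed(1)
toy <- data.frame(
  size     = c("sc1", "sc1", "sc1", "sc2", "sc2", "sc2"),
  turnover = c(100, NA, 120, 900, 950, NA)
)
toy$turnover_imp <- random_hot_deck(toy$turnover, toy$size)
```

Each imputed value is by construction one of the observed values from the same size class; a cell without any observed value is left unimputed.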

In sequential hot deck, one sorts the dataset using one or more variables, and missing values in a record are taken from the first preceding or ensuing record that has a value. If values are taken from preceding records, the method is referred to as last observation carried forward (LOCF); if values are taken from ensuing records, the method is referred to as next observation carried backward (NOCB). With the simputation package, sequential hot deck is executed with the impute_shd function. The ‘predictor variables’ in the formula are used to sort the data (with every variable after the first used as tie-breaker for the previous one).

  impute_shd(retl, turnover + other.rev + total.rev ~ staff) %>% head(3)
  ##   size staff turnover other.rev total.rev vat
  ## 1  sc0    75     9067       622      1130  NA
  ## 2  sc3     9     1607        38      1607  NA
  ## 3  sc3    NA     6886       -33      6919  NA

The sort order is always ‘increasing’, but with the argument order, one can choose between "nocb" (the default) and "locf" imputation.
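The two sequential variants are easy to express in base R. The sketch below is an illustration only: the helper names locf and nocb and the toy data are hypothetical, and the code is not how simputation implements impute_shd.

```r
# Last observation carried forward: position i receives the most recent
# observed value at or before i (NA if nothing has been observed yet).
locf <- function(x) {
  seen <- cumsum(!is.na(x))          # number of observed values so far
  c(NA, x[!is.na(x)])[seen + 1]
}

# Next observation carried backward is LOCF applied to the reversed vector.
nocb <- function(x) rev(locf(rev(x)))

toy <- data.frame(
  staff    = c(9, 75, 20, 3, 41),
  turnover = c(1607, NA, NA, 359, 9067)
)

# Sort on the auxiliary variable first, then impute along that order.
sorted <- toy[order(toy$staff), ]
sorted$locf <- locf(sorted$turnover)  # 359 1607 1607 9067 9067
sorted$nocb <- nocb(sorted$turnover)  # 359 1607 9067 9067   NA
```

The last record stays missing under NOCB because no record follows it, which shows why the choice of direction and sort variable matters at the edges of the sorted data.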

Both random and sequential hot decks are implemented in the simputation source code. Especially for large datasets and many groups, it is beneficial to use the faster implementation provided by the VIM package. This is possible for both impute_rhd and impute_shd by setting backend="VIM". Options specific to simputation (such as the order argument) will be ignored, but one can pass any argument of VIM::hotdeck to impute_rhd or impute_shd for detailed control over the imputation method.

10.4.2 $k$ Nearest Neighbors and Predictive Mean Matching

In the $k$-nearest-neighbor (knn) method, a similarity measure is used to find the $k$ nearest neighbors of a record containing missing values. Next, a donor value is determined. Donor value determination can be done by randomly selecting from the $k$ neighbors or, for example (in the case of categorical data), by choosing the majority value. A particularly popular similarity measure is that of Gower (1971). Given two records $c010-math-220$ and $c010-math-221$ , each with $c010-math-222$ variables that may be numeric, categorical, or missing, Gower's similarity measure $c010-math-223$ can be written as

The values of $c010-math-224$ and $c010-math-225$ depend on the variable type. If the $c010-math-226$ th variable is numeric, then

Here, $c010-math-227$ is the observed range of the $c010-math-228$ th variable. If the $c010-math-229$ th variable is categorical, then $c010-math-230$ is defined as

For numerical and categorical variables, the importance weights $c010-math-231$ may be chosen at will, but they are usually set to 1 or 0, where setting $c010-math-232$ amounts to excluding the $c010-math-233$ th variable from the similarity calculation. For a dichotomous (logical, in R) variable, $c010-math-234$ is defined differently again, namely,

while the weights are defined as

Here, we identify the logical outputs true with 1 and false with 0. The rationale is that a dichotomous variable only adds to the similarity when both values are true. If either of the two values is true, the variable adds to the weight in the denominator.
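The rules above can be collected into a small function. The sketch below is one reading of Gower's measure as described in the text, with the simplifying assumptions that all user-chosen weights equal 1 and that a variable with a missing value in either record gets weight 0. The names gower_similarity, donors, and recipient and the toy values are hypothetical; this is not the code used by simputation or VIM.

```r
# Gower similarity between two one-row records x and y (lists or data frames
# with the same variable names). 'ranges' holds the observed range of each
# numeric variable.
gower_similarity <- function(x, y, ranges) {
  num <- 0
  den <- 0
  for (v in names(x)) {
    a <- x[[v]]
    b <- y[[v]]
    if (is.na(a) || is.na(b)) next               # missing: weight 0
    if (is.logical(a)) {
      if (!a && !b) next                         # both FALSE: weight 0
      s <- as.numeric(a && b)                    # 1 only when both are TRUE
    } else if (is.numeric(a)) {
      s <- 1 - abs(a - b) / ranges[[v]]          # scaled absolute difference
    } else {
      s <- as.numeric(as.character(a) == as.character(b))  # categorical match
    }
    num <- num + s
    den <- den + 1
  }
  if (den == 0) NA_real_ else num / den
}

donors <- data.frame(
  size      = c("sc1", "sc2", "sc1"),
  staff     = c(12, 10, 40),
  vat_filed = c(TRUE, TRUE, FALSE)
)
recipient <- data.frame(size = "sc1", staff = 10, vat_filed = TRUE)
ranges    <- list(staff = diff(range(c(donors$staff, recipient$staff))))

sims <- vapply(seq_len(nrow(donors)), function(i) {
  gower_similarity(recipient, donors[i, ], ranges)
}, numeric(1))

nearest <- which.max(sims)   # with k = 1, this record would act as donor
```

In this toy example the first donor agrees on size and vat_filed and is close in staff, so it is selected as the nearest neighbor.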

In the simputation package, knn imputation based on Gower's similarity is performed with the impute_knn function. The predictor variables in the formula argument specify which variables are used to determine Gower's similarity. Below, we use all variables by specifying the dot (.). The default value is $k=5$, but by setting $k=1$, values are copied directly from the nearest neighbor.

  impute_knn(retl, turnover + other.rev + total.rev ~ ., k=1) %>% head(3)
  ##   size staff turnover other.rev total.rev vat
  ## 1  sc0    75     9067       622      1130  NA
  ## 2  sc3     9     1607        13      1607  NA
  ## 3  sc3    NA     6886       -33      6919  NA

As with the random and sequential hot deck imputation procedures, the donor pool can be specified ("complete", "univariate", or "multivariate") and the VIM package can be used as computational backend.

Predictive mean matching (PMM) is a nearest-neighbor imputation method where the donor is determined by comparing predicted donor values with model-based predictions for the recipient's missing values. It can therefore be seen as a method that lies between the purely model-based and purely donor-based imputation methods. On one hand, it partially shares the benefits of both approaches, utilizing the power of predictive modeling while making sure only observed values are imputed. On the other hand, it inherits some of the intricacies of both worlds, such as issues with model selection and the possibility of small donor pools. In practice, PMM has become a popular method, and the popular mice package for multiple imputation (van Buuren and Groothuis-Oudshoorn, 2011) uses it as default imputation method.

In simputation, PMM is achieved with the impute_pmm function. Besides the data to be imputed and a predictive model-specifying formula, it takes one of the impute_ functions as an argument to preimpute the recipients with a chosen model. By default, impute_lm is used so the formula object must specify a linear model.

  impute_pmm(retl, turnover + other.rev + total.rev ~ staff) %>% head(3)
  ##   size staff turnover other.rev total.rev vat
  ## 1  sc0    75     7271        30      1130  NA
  ## 2  sc3     9     1607      1831      1607  NA
  ## 3  sc3    NA     6886       -33      6919  NA

However, one can switch to a robust linear model based on the $c010-math-237$ -estimator as follows:

  impute_pmm(retl, turnover + other.rev + total.rev ~ staff
   , predictor=impute_rlm, method="MM") %>% head(3)
  ##   size staff turnover other.rev total.rev vat
  ## 1  sc0    75     9067       622      1130  NA
  ## 2  sc3     9     1607        13      1607  NA
  ## 3  sc3    NA     6886       -33      6919  NA
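The matching step behind PMM can be sketched in a few lines of base R. This is a simplified illustration under stated assumptions: a single imputed variable, a linear model fitted with lm, and the single closest donor chosen deterministically. The function name pmm_impute and the toy data are hypothetical, and the sketch does not reproduce the internals of impute_pmm.

```r
# Predictive mean matching for one variable: predict the target for all
# records, then give each recipient the observed value of the donor whose
# prediction is closest to the recipient's prediction.
pmm_impute <- function(dat, formula) {
  yname <- all.vars(formula)[1]
  fit   <- lm(formula, data = dat)        # fitted on records with observed y
  pred  <- predict(fit, newdata = dat)    # NA where a predictor is missing

  donors     <- which(!is.na(dat[[yname]]) & !is.na(pred))
  recipients <- which( is.na(dat[[yname]]) & !is.na(pred))

  for (i in recipients) {
    j <- donors[which.min(abs(pred[donors] - pred[i]))]
    dat[[yname]][i] <- dat[[yname]][j]    # copy an observed value
  }
  dat
}

toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8),
  y = c(2.1, 3.9, NA, 8.2, 9.8, NA, 14.1, 16.0)
)
out <- pmm_impute(toy, y ~ x)
```

Because values are copied from donors, every imputed value in out$y is one of the originally observed values, even though a model was used to decide which donor to take.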

10.5 Other Methods in the `simputation` Package

There are a few other imputation approaches supported by the simputation package that facilitate certain types of imputation.

The first method is a utility function that replaces missing values with a constant. For example,

  impute_const(retl, other.rev ~ 0) %>% head(3)
  ##   size staff turnover other.rev total.rev vat
  ## 1  sc0    75       NA         0      1130  NA
  ## 2  sc3     9     1607         0      1607  NA
  ## 3  sc3    NA     6886       -33      6919  NA

Such an imputation strategy implies strong assumptions that in some cases may nevertheless be reasonable. For example, one could assume that values that are not submitted by a respondent can be interpreted as ‘not applicable’ or, in this case, zero. Obviously, one should very carefully test such an assumption since it can introduce severe bias in estimates based on the data.

The second method is referred to by de Waal et al. (2011) as proxy imputation. Here, the missing value is estimated by copying a value from the same record, but from another variable. For example, we may estimate the variable total turnover in the retailers dataset by copying the amount of turnover reported to the tax office for vat.

  impute_proxy(retl, total.rev ~ vat) %>% head(3)
  ##   size staff turnover other.rev total.rev vat
  ## 1  sc0    75       NA        NA      1130  NA
  ## 2  sc3     9     1607        NA      1607  NA
  ## 3  sc3    NA     6886       -33      6919  NA

This method is useful only when the economic definition of turnover and the legal definition used by the tax authorities coincide or are very close.

The simputation package is more flexible than the original definition of de Waal et al. (2011) and also allows for imputing functions of variables.

  impute_proxy(retl
   , turnover ~ mean(turnover/total.rev,na.rm=TRUE) * total.rev) %>%
    head(3)
  ##   size staff turnover other.rev total.rev vat
  ## 1  sc0    75 1351.159        NA      1130  NA
  ## 2  sc3     9 1607.000        NA      1607  NA
  ## 3  sc3    NA 6886.000       -33      6919  NA

Here, the right-hand side of the formula may evaluate to a vector of unit length or to a vector with length equal to the number of input rows. Proxy imputation also accepts grouping, so imputing the group mean can be done as follows:

  impute_proxy(retl, turnover ~ mean(turnover, na.rm=TRUE) | size) %>%
    head(3)
  ##   size staff turnover other.rev total.rev vat
  ## 1  sc0    75 1420.375        NA      1130  NA
  ## 2  sc3     9 1607.000        NA      1607  NA
  ## 3  sc3    NA 6886.000       -33      6919  NA
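For comparison, group-mean imputation can also be written in base R without any imputation package. The sketch assumes a hypothetical toy data frame; stats::ave computes the statistic within each group and returns it aligned with the input rows.

```r
toy <- data.frame(
  size     = c("sc1", "sc1", "sc1", "sc2", "sc2"),
  turnover = c(100, NA, 140, 900, NA)
)

# Mean of the observed turnover values within each size class,
# repeated for every row of that class.
group_mean <- ave(toy$turnover, toy$size,
                  FUN = function(v) mean(v, na.rm = TRUE))

# Use the group mean only where turnover is missing.
toy$turnover_imp <- ifelse(is.na(toy$turnover), group_mean, toy$turnover)
# toy$turnover_imp: 100 120 140 900 900
```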

10.6 Imputation Based on the EM Algorithm

The purpose of the Expectation–Maximization algorithm (Dempster et al., 1977) is to compute the maximum-likelihood estimate for parameters of a multivariate probability distribution in the presence of missing data. As such, it is not an imputation method. Imputations can be generated by computing the expected value for missing items conditional on the observed data or by sampling them from the conditional multivariate distribution.

Contrary to the model-based imputation methods discussed up until now, in the EM algorithm, there is no fixed distinction between predicting and predicted variables—one simply uses everything available to estimate the parameters of some model distribution. This also means that one needs to assume a distributional form (e.g., multivariate normal) for the combined variables in the dataset.

10.6.1 The EM Algorithm

Consider again a set of random variables $c010-math-238$ with a joint probability distribution $c010-math-239$ parameterized by a numeric vector $c010-math-240$ . We denote a realization of $c010-math-241$ as a row vector $c010-math-242$ and a set of $c010-math-243$ realizations as a $c010-math-244$ matrix $c010-math-245$ . A common way to estimate the parameters of $c010-math-246$ given a set of observations is to assume a distributional form for $c010-math-247$ (say, multinormal) and then solve the following maximization problem.

where $c010-math-248$ is the space of possible values for $c010-math-249$ . Furthermore, we used Bayes' rule in the second line, and in the third line, we used that taking the logarithm does not alter the location of the maximum. If there is no prior knowledge about $c010-math-250$ , we may assume that $c010-math-251$ is uniformly distributed over $c010-math-252$ , so $c010-math-253$ is constant. Since $c010-math-254$ is also independent of $c010-math-255$ , the maximization problem can be simplified to

10.10

where we introduced the notation $c010-math-257$ , which is referred to as the maximum-likelihood function. Correspondingly, Eq. (10.10) is referred to as the maximum-likelihood estimator. It returns the value of $c010-math-258$ such that $c010-math-259$ is the most likely observed data. To actually solve this problem, one substitutes a distribution (e.g., multinormal), equates the derivative of the right-hand side with respect to $c010-math-260$ to zero, and solves for $c010-math-260$ .

If only part of the realizations $c010-math-261$ are actually observed, this maximization problem cannot be solved. To move forward, we partition $c010-math-262$ into the observed values $c010-math-263$ and missing values $c010-math-264$ . Using Bayes' rule twice, we can write

Again, dropping terms not depending on $c010-math-265$ (and we already assumed that $c010-math-266$ is constant), we find that maximizing the likelihood function of Eq. (10.10) can equivalently be written as the maximization over the likelihood function

10.11

Since $c010-math-268$ is unknown, this expression cannot be maximized. The idea of Dempster et al. (1977) is to replace the maximum-likelihood function with its expected value with regard to the missing values.

Here, integration is over all unobserved variables, where $c010-math-269$ indicates the domain of possible values for the variables in $c010-math-270$ . In the second line, Eq. (10.11) was substituted. The integral in the second line is the negative entropy of the distribution $c010-math-271$ . It is therefore usually denoted $c010-math-272$ . In principle, one can numerically maximize the above expression to obtain $c010-math-273$ . However, if $c010-math-274$ has a simple form, and we choose some value $c010-math-275$ , the integral

can often be worked out explicitly. Next, an updated value for $c010-math-276$ can be found by maximizing $c010-math-277$ as a function of $c010-math-278$ . The Expectation–Maximization algorithm is indeed an iteration over this procedure. The crucial result of Dempster et al. (1977) is that the value of $c010-math-279$ must increase at each iteration and converge to a maximum under mild conditions, including cases where $c010-math-280$ is a member of the regular exponential family.

If $c010-math-285$ is a member of the regular exponential family, the expressions to be evaluated by the algorithm can be simplified by expressing them in terms of (expected values of) sufficient statistics [e.g., de Waal et al. (2011, Chapter 8) or Schafer (1997, Chapter 5)]. The regular exponential family includes a wide range of commonly applied distributions including the (multivariate) normal, Bernoulli, exponential, (negative) binomial, Poisson, and gamma distributions [see, e.g., Brown (1986)]. Indeed, many implementations of the EM algorithm demand that $c010-math-286$ is from the regular exponential family. The Amelia package discussed below is restricted to models based on the multivariate normal distribution.

Advantages of the EM algorithm include that it is a simple and well-understood algorithm that is guaranteed to converge in principle. Given a distribution from the exponential family, the E and M steps can be computed very quickly. Also, since the algorithm provides an estimate for the full multivariate distribution, it has a good chance of correcting for the missing at random (MAR) mechanism. A disadvantage is that computation may take many iterations since the convergence criterion (measured in terms of the difference in $c010-math-287$ between iterations) decreases approximately linearly with the number of iterations (Schafer, 1997). Convergence can be especially slow when the fraction of missing values is high or when the model distribution is a poor description of the actual data distribution. Furthermore, the EM algorithm does not immediately provide a variance estimate for $c010-math-288$ .

10.6.2 EM Imputation Assuming the Multivariate Normal Distribution

Probably one of the most widely implemented models for EM estimation is the model where numeric variables are distributed according to the multivariate normal distribution. This distribution is parameterized by the mean vector $c010-math-289$ and covariance matrix $c010-math-290$ , with the probability density function given by

Because of its relatively simple form, the update rules for the E and M steps can be worked out, yielding expressions that can eventually be evaluated using simple matrix algebra and linear system solving.

Before stating the algorithm, consider a record $c010-math-291$ with observed values $c010-math-292$ and missing values $c010-math-293$ . Given an estimate for the mean vector and covariance matrix, these can be rearranged accordingly, so

Here, $c010-math-294$ is the estimated covariance matrix for observed variables, $c010-math-295$ the estimated covariance matrix between observed and missing variables, and $c010-math-296$ the estimated covariance matrix for the missing variables. After a fair amount of tedious algebra and integration, one can show that the expected value(s) for the missing part of $c010-math-297$ conditional on $c010-math-298$ , $c010-math-299$ , and $c010-math-300$ is given by

10.12

This equation can be used to impute a record once the parameters have been estimated (Procedure 10.6.2). The imputed dataset is obtained by applying Eq. (10.12) one last time on every record in the dataset.

  impute_em(retl, ~ . - size) %>% head(3)
  ##   size    staff  turnover other.rev total.rev      vat
  ## 1  sc0 75.00000  893.2151  1168.523      1130 7917.634
  ## 2  sc3  9.00000 1607.0000  4373.896      1607 1749.444
  ## 3  sc3 11.84416 6886.0000   -33.000      6919 1991.730
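To show what such an estimation loop can look like, here is a compact base R sketch of EM for the multivariate normal model with conditional-mean imputation. It is a textbook-style illustration under stated assumptions, not the code behind impute_em: the function name em_mvn, the starting values, the stopping rule, and the use of the built-in airquality data are all choices made here. The returned imputations are those of the last E step, which at convergence agree with applying the conditional-mean formula with the final parameters up to the tolerance.

```r
em_mvn <- function(X, max_iter = 500, tol = 1e-6) {
  X    <- as.matrix(X)
  n    <- nrow(X)
  p    <- ncol(X)
  miss <- is.na(X)
  # Starting values: available-case means and a diagonal covariance matrix.
  mu    <- colMeans(X, na.rm = TRUE)
  Sigma <- diag(apply(X, 2, var, na.rm = TRUE), p)

  for (iter in seq_len(max_iter)) {
    # E step: fill in conditional means and accumulate the conditional
    # covariance of the missing parts.
    Xhat <- X
    C    <- matrix(0, p, p)
    for (i in seq_len(n)) {
      m <- miss[i, ]
      if (!any(m)) next
      if (all(m)) {                       # nothing observed in this record
        Xhat[i, ] <- mu
        C <- C + Sigma
        next
      }
      o <- !m
      # B = Sigma_oo^{-1} Sigma_om, obtained by solving a linear system
      B <- solve(Sigma[o, o, drop = FALSE], Sigma[o, m, drop = FALSE])
      Xhat[i, m] <- mu[m] + drop((X[i, o] - mu[o]) %*% B)
      C[m, m] <- C[m, m] + Sigma[m, m, drop = FALSE] -
        Sigma[m, o, drop = FALSE] %*% B
    }
    # M step: update the mean vector and covariance matrix.
    mu_new    <- colMeans(Xhat)
    R         <- sweep(Xhat, 2, mu_new)
    Sigma_new <- (crossprod(R) + C) / n

    delta <- max(abs(mu_new - mu), abs(Sigma_new - Sigma))
    mu    <- mu_new
    Sigma <- Sigma_new
    if (delta < tol) break
  }
  list(mu = mu, Sigma = Sigma, imputed = Xhat, iterations = iter)
}

res <- em_mvn(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
head(res$imputed, 5)
```

Observed values are never altered; only the missing cells are replaced by their conditional expectations given the observed cells in the same record.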

10.7 Sampling Variance under Imputation

The goal of data analysis is often to infer the value of a population parameter, say $c010-math-313$ . If the data is obtained by sampling from the population, an estimate $c010-math-314$ is obtained by applying some procedure to the observed dataset. In such cases, one is often interested in how much the estimator $c010-math-315$ would vary if the sampling-and-estimation procedure were to be repeated on the same population. The variation over all possible samples is usually measured by the sampling variance.

Suppose that the parameter of interest is the population mean $c010-math-316$ of a variable $c010-math-317$ . Given a sample obtained using simple random sampling without replacement, an unbiased estimate $c010-math-318$ can be obtained by computing the sample mean. The sampling variance of $c010-math-319$ , estimated from the same sample, is given by the well-known expression

10.13

where $c010-math-321$ and $c010-math-322$ are the sample and population size and $c010-math-323$ the observed sample values.

The important thing to realize here is that the expression for sampling variance depends on the sampling scheme (including size) and the expression or procedure used to obtain the estimator. This means that if part of the sampled data must be imputed, this imputation procedure should be considered part of the estimation procedure. This is easy to see from the abovementioned example. Suppose that some fraction of the observations $c010-math-324$ are missing completely at random, and we impute them with the estimated mean (ignoring missing values). This clearly leads to negatively biased variance estimation since the terms in the sum of Eq. (10.13) corresponding to missing $c010-math-325$ are zero by definition.
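The effect is easy to reproduce numerically. The sketch below assumes the standard variance estimator for the sample mean under simple random sampling without replacement, (1 - n/N) s^2 / n, together with a hypothetical population size, a simulated sample, and 30 values removed completely at random. The function name var_of_mean_srswor is also hypothetical.

```r
# Estimated sampling variance of the sample mean under simple random
# sampling without replacement: (1 - n/N) * s^2 / n.
var_of_mean_srswor <- function(y, N) {
  n <- length(y)
  (1 - n / N) * var(y) / n
}

set.seed(1)
N <- 1000                                 # hypothetical population size
y <- rnorm(100, mean = 50, sd = 10)       # hypothetical sample of size 100
y_obs <- y
y_obs[sample.int(100, 30)] <- NA          # 30 values missing completely at random

# Impute the mean of the observed values.
y_imp <- ifelse(is.na(y_obs), mean(y_obs, na.rm = TRUE), y_obs)

v_naive <- var_of_mean_srswor(y_imp, N)                 # imputations treated as observed
v_resp  <- var_of_mean_srswor(y_obs[!is.na(y_obs)], N)  # respondents only

v_naive < v_resp
```

The naive estimate is smaller because the imputed records contribute nothing to the sum of squares while still counting toward the sample size. Using only the respondents is one way to avoid this under the completely-at-random assumption, since they then form a smaller simple random sample.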

More generally, suppose that we are interested in some population parameter $c010-math-326$ . Suppose further that estimation of $c010-math-327$ includes some model-based imputation procedure depending on model parameters $c010-math-328$ that are estimated from the same sample as $c010-math-329$ . Using the rule of conditional variances (sometimes called Eve's law), we can write

10.14

If the sampling variance of $c010-math-331$ is assumed to be zero or very small, the second term vanishes, and we get $c010-math-332$ . The first term can therefore be interpreted as the sampling variance of $c010-math-333$ that is intrinsic to the sampling scheme, where the procedure to estimate $c010-math-334$ includes imputation. The second term corresponds to variance added by uncertainty in the imputation model parameters.
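In symbols, the rule of conditional variances can be written as below. The notation is an assumption chosen for illustration: $\hat{\theta}$ stands for the estimator of the population parameter and $\hat{\psi}$ for the estimated parameters of the imputation model.

```latex
V(\hat{\theta})
  = E\left[ V(\hat{\theta} \mid \hat{\psi}) \right]
  + V\left[ E(\hat{\theta} \mid \hat{\psi}) \right] .
```

The first term is the expected conditional variance and the second the variance of the conditional expectation. If $\hat{\psi}$ does not vary between samples, the conditional expectation is constant and the second term is zero.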

The assumption that $c010-math-335$ does not vary is valid in the case of imputations that do not vary with sample composition. This is the case, for example, when imputing with a fixed value or when using a deductive method such as those described in Section 9.3.2. Recall that in deductive methods one derives imputed values from conditions on the data and observed values that are deemed valid. It is also valid when the model used to impute the data was fitted on a dataset that is different from the imputed dataset (but note that this may bias the estimator since one assumes that model parameters are constant across datasets).

In many common cases the assumption of zero variation in imputation model parameters will be invalid, and hence the second term in Eq. (10.14) will be positive. In particular, any procedure where a dataset is imputed using a predictive model, which is subsequently ignored in variance estimation, will underestimate the sampling variance.

Over the past decades, two contrasting views of variance estimation over partially imputed datasets have developed. The first, historically, is due to Rubin, who published and refined methodology for multiple imputation over several publications [see Rubin (1978); Rubin (1996); and the basic reference Rubin (1987)]. Rubin (1978) proposes to impute each missing value multiple times so as to ‘reflect variation within a model as well as [ $c010-math-336$ ] due to a variety of reasonable models’. An analyst, who is assumed not to be involved in the imputation process, can get an idea of the variance due to imputation by computing parameters of interest over several copies of the imputed dataset.

The second view was proposed first by Rao and Shao (1992) in the context of hot deck imputation, but the idea applies more generally. In their view, one considers a population parameter $c010-math-337$ , which is in the complete data case estimated by $c010-math-338$ . As usual, the estimated quantity has an associated (estimated) sampling variance, denoted $c010-math-339$ (recall that the sampling variance is caused by variation of $c010-math-340$ over all possible same-sized samples from a population). In the case of missing data, the estimator $c010-math-341$ is replaced with $c010-math-342$ , which is similar to $c010-math-343$ except that it is applied to the imputed dataset. Based on a resampling formalism (the jackknife), they are able to define and analytically approximate an estimator for the sampling variance $c010-math-344$ .

An account of both methodologies and arguments for and against these approaches is given in Rao (1996), Rubin (1996), and Fay (1996) and the ensuing comments by Judkins (1996), Binder (1996), and Eltinge (1996). In the following paragraphs, both methods are discussed in some detail.

10.8 Multiple Imputations

Multiple imputation is a computational method that allows for direct estimation of the two variance components of Eq. (10.14). A schematic of the main idea is shown in Figure 10.6. We start on the left-hand side with a (possibly multivariate) dataset $c010-math-345$ of which some fraction of values are missing. This dataset is imputed $c010-math-346$ times, independently using an imputation model where both the model parameters and the imputed values are drawn from a distribution that is conditioned on the model's predictor variables. Each dataset is then used to estimate a population parameter $c010-math-347$ using the same procedure one would use on $c010-math-348$ if it were complete to begin with. This yields an ensemble of estimates $c010-math-349$ . The final estimate is then computed as the average over the ensemble:

10.15

**Figure 10.6** Estimation using multiple imputation.

Similarly, one can estimate the variance for each estimate $c010-math-351$ as one would for a complete sample. For example, if $c010-math-352$ is the mean of a variable, and $c010-math-353$ is obtained from simple random sampling without replacement, one could use Expression (10.13). This exercise then yields an ensemble of estimated variances $c010-math-354$ . Averaging over this ensemble estimates the first term in Eq. (10.14), the expected value of the variance of $c010-math-355$ . The second term of Eq. (10.14)—the variance of the expected value—is estimated as the variance over the estimates $c010-math-356$ . Combining the two estimates, we get (Rubin, 1987)

10.16

where the term $c010-math-358$ is a correction for the finite number of multiple imputations. In the context of multiple imputation, the first term is often referred to as within-imputation variance and the second term as the between-imputation variance.
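The pooling rules are simple enough to write out by hand. The sketch below assumes that the per-imputation estimates and their estimated variances are available as numeric vectors `est` and `v`; the function name `pool_rubin` and its return format are made up for illustration. With $M$ imputations, within-imputation variance $\bar{U}$ and between-imputation variance $B$, the total variance is taken to be $\bar{U} + (1 + 1/M)B$, following Rubin (1987).

```r
# Pool M estimates and their variances with Rubin's rules.
# 'est' and 'v' are numeric vectors of length M (one entry per imputed dataset).
pool_rubin <- function(est, v) {
  stopifnot(length(est) == length(v), length(est) > 1)
  M <- length(est)
  within  <- mean(v)    # average of the per-dataset variance estimates
  between <- var(est)   # variance of the estimates across imputations
  c(estimate = mean(est),
    within   = within,
    between  = between,
    total    = within + (1 + 1 / M) * between)
}

# Artificial example: five estimates with their estimated variances
pool_rubin(est = c(10.2, 9.8, 10.5, 10.1, 9.9),
           v   = c(0.40, 0.38, 0.42, 0.41, 0.39))
```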

An advantage of the multiple imputation method is that it separates the imputation problem from later analysis and estimation. In fact, one of the aims of its development was to provide a way for data providers to impute a dataset such that analysts can use it for any statistical inference without knowing the details of the imputation method. The trade-off is then that an analyst must repeat the analysis $c010-math-359$ times and appropriately combine the results.

This luxury of separating imputation from analysis does not come for free, however. As it turns out, it is not possible to choose the imputation method completely independently of the parameters being estimated. In essence, the imputation method must be such that $c010-math-360$ and $c010-math-361$ (computed from imputed datasets) are unbiased for the estimates one would obtain if $c010-math-362$ were observed completely. Furthermore, the between-imputation variance of $c010-math-363$ must be an unbiased estimator of the variance caused by uncertainty in the model parameters (Rubin, 1987, 1996). Imputations that satisfy these demands are called proper imputations [for a proper technical definition, cf. Rubin (1987)]. The demand for proper imputations therefore implies that imputation is not fully decoupled from analysis and that one needs to carefully design the imputation methods.

A typical improper imputation method is imputation of the mean. Although this does yield an unbiased point estimate of the population mean, estimation of the sampling variance will be negatively biased. However, if the imputed variable is strongly correlated with one of the observed variables, this bias can be reduced with regression. The typical approach to multiple imputation is therefore an attempt to utilize as much as possible all relations between the variables so as to approximate the full multivariate distribution. Two such approaches will be discussed in the following sections: a bootstrapped version of the EM algorithm in Sections 10.8.1 and 10.8.2 and multivariate imputation by chained equations in Sections 10.8.3 and 10.8.4.

A natural question to ask is how large a value of $c010-math-364$ is necessary. A value often quoted in the literature is $c010-math-365$ when the fraction of missing information $c010-math-366$ is a few percent at most. This recommendation is based on an analysis of the efficiency¹ of an estimator, which Rubin (1987) shows to be approximately equal to $c010-math-367$ relative to the $c010-math-368$ case. Graham et al. (2007) performed extensive simulations to study the effect of $c010-math-369$ on the statistical power of an estimator rather than its efficiency. Based on simulations in which the parameters of interest were regression coefficients between a constructed pair of random normal variables, they conclude that the drop-off in statistical power² as $c010-math-370$ decreases is much faster than one would expect from the drop-off in efficiency. They therefore recommend using much higher numbers of imputations.
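To get a feeling for the efficiency argument, the relative efficiency can be tabulated for a few numbers of imputations and fractions of missing information. The sketch below assumes the variance-scale approximation $(1 + \gamma/M)^{-1}$ that is commonly quoted with reference to Rubin (1987); both this form and the function name `relative_efficiency` are assumptions made for illustration.

```r
# Approximate efficiency of M imputations relative to infinitely many,
# on the variance scale (assumed form: 1 / (1 + gamma / M)).
relative_efficiency <- function(M, gamma) 1 / (1 + gamma / M)

M     <- c(3, 5, 10, 20, 100)
gamma <- c(0.1, 0.3, 0.5, 0.9)

# Rows: number of imputations; columns: fraction of missing information
eff <- outer(M, gamma, relative_efficiency)
dimnames(eff) <- list(M = M, gamma = gamma)
round(eff, 3)
```

On this scale the efficiency is already close to one for small numbers of imputations, which is why power-based recommendations such as those of Graham et al. (2007) lead to larger values.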

Table 10.1 reproduces the recommendations of Graham et al. (2007). In the case of $c010-math-371$ power loss compared to $c010-math-372$ , one would need to impute 20 times. A comparison with the full information maximum-likelihood model (FIML, which in their simulation is equivalent to the multivariate normal EM algorithm) is also made. The latter model may be considered a benchmark for their simulation when $c010-math-373$ . When interpreting Table 10.1, one should keep in mind that it was produced based on artificial multivariate normal data. The number of necessary imputations may vary depending on the shape of the (multivariate) distribution and (possibly nonlinear) relations between the variables. In practice, one will therefore often need to experiment and gain experience with values for $c010-math-374$ that are appropriate for a particular imputation problem.

c010-math-376 — **Table 10.1** Number of multiple imputations $c010-math-375$ , as recommended by Graham *et al*. (2007)

$c010-math-382$ is the fraction of missing information, which equals the fraction of missing data when variables are independent.

10.8.1 Multiple Imputation Based on the EM Algorithm

Honaker et al. (2011) propose a bootstrapping scheme to randomize the parameters for a multivariate normal model distribution. The idea, summarized in Procedure 10.8.1, is to create an ensemble of $c010-math-383$ datasets by resampling from the original dataset with replacement, such that each dataset in the ensemble has the same number of records as the original. Next, the EM algorithm is used on each of the $c010-math-384$ datasets to find maximum-likelihood estimates of the distribution parameters (mean vector and covariance matrix). Conditional on the validity of the bootstrapping scheme, the resulting ensemble of multivariate normal parameters estimates their sampling distribution. For each dataset in the ensemble, each record can be completed by sampling from the multivariate normal distribution conditional on the observed values in the record.
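The final step, completing a record by drawing from the conditional multivariate normal distribution, can be sketched in a few lines of base R. This is not Amelia's implementation. The function name `draw_conditional_mvn` is hypothetical, and `mu` and `Sigma` stand for one set of bootstrapped EM estimates of the mean vector and covariance matrix, which are assumed here to be given and positive definite. The parameter values in the example are made up.

```r
# Complete one record x (NA marks missing entries) by drawing the missing
# part from the multivariate normal distribution N(mu, Sigma), conditional
# on the observed part of the record.
draw_conditional_mvn <- function(x, mu, Sigma) {
  mis <- is.na(x)
  if (!any(mis)) return(x)
  if (all(mis)) {
    cond_mean <- mu
    cond_cov  <- Sigma
  } else {
    obs <- !mis
    # Regression coefficients of the missing part on the observed part
    B <- Sigma[mis, obs, drop = FALSE] %*% solve(Sigma[obs, obs, drop = FALSE])
    cond_mean <- mu[mis] + B %*% (x[obs] - mu[obs])
    cond_cov  <- Sigma[mis, mis, drop = FALSE] - B %*% Sigma[obs, mis, drop = FALSE]
  }
  # t(chol(.)) is a lower-triangular factor L with L %*% t(L) = cond_cov
  L <- t(chol(cond_cov))
  x[mis] <- as.vector(cond_mean + L %*% rnorm(sum(mis)))
  x
}

set.seed(1)
mu    <- c(staff = 2.5, turnover = 8)            # made-up values (log scale)
Sigma <- matrix(c(1.0, 0.8, 0.8, 1.5), 2, 2)     # made-up covariance matrix
draw_conditional_mvn(c(staff = NA, turnover = 9), mu, Sigma)
```

Repeating this for every incomplete record, once for each bootstrapped parameter set, yields the ensemble of imputed datasets.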

Regarding the ‘properness’ of this imputation method for estimating population parameters and their variances from the ensemble, one should realize that the imputation method is based on estimates of the means and covariance matrix. This means that only linear relationships between the variables and their variances can be trusted to be properly represented in the ensemble of imputed datasets. For example, suppose that we have a dataset containing realizations of variables $c010-math-390$ , $c010-math-391$ , and $c010-math-392$ , with some of them missing. After creating an ensemble of imputed datasets using the EMB algorithm, we can safely estimate $c010-math-393$ , and $c010-math-394$ and their variances in the model $c010-math-395$ . However, if the model was to be augmented with a term $c010-math-396$ , the value and variance of this interaction effect can in principle not be trusted since the imputation model did not include it.

10.8.2 The `Amelia` Package

The Amelia package³ of the same authors implements the Expectation Maximization with Bootstrapping (EMB) algorithm. The core function is amelia, which multiply imputes a multivariate dataset based on Procedure 10.8.1.

Here, we use two variables from the retailers dataset to demonstrate amelia.

  data(retailers, package="validate")
  dat <- retailers[c("staff","turnover")]

Visual inspection (histograms) of both staff and turnover reveals that they follow highly skewed distributions, but both distributions are more symmetric when viewed on a log scale. Even though the multivariate normal distribution is not a very good approximation of the transformed data, we will model the data as such. The amelia function has the logs option, allowing one to indicate which variables should be modeled on a logarithmic scale. Both transformation and back-transformation are then taken care of by amelia. To get an indication of the number of imputations needed, compute the fraction of missing data.

  colSums(is.na(dat))/nrow(dat)
  ##      staff   turnover
  ## 0.10000000 0.06666667

We see that the maximum fraction of missing values is 0.1. Taking the smallest value for loss of statistical power in Table 10.1, this means we should take at least 20 imputations.

  out <- Amelia::amelia(dat, m=20, logs=c("staff","turnover"), p2s=0)

Here, we also set p2s=0 (p2s = print to screen) to prevent amelia from writing output to screen during the EM iterations.

The object out is a list (with class attribute amelia) containing the copies of the imputed dataset (here, 20 copies), the statistics resulting from each EM optimization, and some information on convergence of each EM run. Individual components can be accessed with the $ operator. For example, out$imputations is a list of imputed datasets.

Figure 10.7 summarizes the object. Here, we plotted the original data and an estimate of its bivariate normal density on a log scale and added the randomly imputed values as colored points. Observe that the original data (black points) contain some outliers that may influence the estimated probability density. The imputed turnover values (white points) are plotted on horizontal lines since they are estimated conditional on a fixed observed value of staff (and vice versa for imputed staff numbers). Both in the case of turnover and staff some extrapolation outside of the observed data ranges was necessary to complete the records.

**Figure 10.7** Observed and multiple imputed staff numbers and turnovers. The data is taken from the `retailers` dataset of the `validate` package. Imputations are generated with `Amelia::amelia`.

For cases with multiple variables, a plot such as Figure 10.7 cannot be created. Amelia includes a plot function that creates plots for each variable, comparing the distribution of the original data with that of the imputed data. To create such a plot (not shown here) simply pass an amelia object to plot.

  plot(out)

With the function overimpute, one can create a per-variable comparison between values observed versus values estimated by the model. To create such a plot for the variable staff (see Figure 10.8) do the following.

  Amelia::overimpute(out, var="staff")

The graph shows observed versus predicted values for all observations of staff and their 90% confidence intervals. The color of the intervals codes the fraction of missing values in the record containing the plotted point.

In some cases, the likelihood function may have local maxima or be otherwise ‘badly behaved’. Problems may surface, for example, when some variables are strongly collinear. The disperse function reruns EM optimization from several random starting points. For stable solutions, one expects that each optimization ends in the same optimum. Using disperse, a graph of the optimization paths is created so that one can visually check whether all optimizations yield the same (or very close) result.
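A call could look as follows. The sketch repeats the imputation step so that it is self-contained. The choice of five starting points (`m = 5`) and a one-dimensional projection of the parameter paths (`dims = 1`) is arbitrary.

```r
data(retailers, package = "validate")
dat <- retailers[c("staff", "turnover")]
out <- Amelia::amelia(dat, m = 20, logs = c("staff", "turnover"), p2s = 0)

# Rerun EM from 5 overdispersed starting points and plot the paths
# of the parameter estimates, projected onto one dimension.
Amelia::disperse(out, m = 5, dims = 1)
```

If all paths end in the same point, there is no indication that the EM runs got stuck in different local maxima.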

Analyzing the multiply imputed dataset is most conveniently done with the Zelig package. The zelig function accepts an object of class amelia and can run many different models. Below, we compute the coefficients of a linear least squares ("ls") regression of turnover against staff.

  mod <- Zelig::zelig(turnover ~ staff, model="ls", data=out, cite=FALSE)

The argument cite=FALSE suppresses printing of citation information when running the model. The output object is of class zelig and can be summarized like any model object in R.

  # summary of model based on original data
  summary( lm(turnover ~ staff, data=dat) )
  ##
  ## Call:
  ## lm(formula = turnover ~ staff, data = dat)
  ##
  ## Residuals:
  ##    Min     1Q Median     3Q    Max
  ##  -3397  -2648  -1930  -1048  76827
  ##
  ## Coefficients:
  ##             Estimate Std. Error t value Pr(>|t|)
  ## (Intercept)  2969.53    2024.67   1.467    0.149
  ## staff          67.67     121.43   0.557    0.580
  ##
  ## Residual standard error: 11200 on 49 degrees of freedom
  ##   (9 observations deleted due to missingness)
  ## Multiple R-squared:  0.006298,  Adjusted R-squared:  -0.01398
  ## F-statistic: 0.3105 on 1 and 49 DF,  p-value: 0.5799
  # summary of model based on multiple imputed data
  summary(mod)
  ## Model: Combined Imputations
  ##
  ##             Estimate Std.Error z value Pr(>|z|)
  ## (Intercept)   -11795     26341   -0.45     0.65
  ## staff           2308      1775    1.30     0.19
  ##
  ## For results from individual imputed datasets, use
  ##   summary(x, subset = i:j)
  ## Statistical Warning: The GIM test suggests this model
  ##   is misspecified
  ##  (based on comparisons between classical and robust SE's;
  ##   see http://j.mp/GIMtest).
  ##  We suggest you run diagnostics to ascertain the cause,
  ##   respecify the model
  ##  and run it again.
  ##
  ## Next step: Use 'setx' method

Observe that the standard errors computed from the multiply imputed datasets exceed those from the original data. A full description of Zelig's capabilities is beyond the scope of this work, and the reader is kindly referred to Imai et al. (2008) for an introduction and overview.


10.8.3 Multivariate Imputation with Chained Equations (Mice)

Imputation methods based on the EM algorithm, be it single or multiple imputation, depend on the ability to formulate a multivariate distribution for the treated data from which to sample imputations. As an alternative, one may formulate a separate probability model for each variable to be imputed. Model parameters and imputation values are generated sequentially and randomly conditional on known (possibly previously imputed) variables. In the so-called fully conditional specification, one allows every variable except the one that is currently imputed to be part of the predictive model. The sequence of imputations is repeated until certain distributional properties have converged. Procedure 10.8.2 contains the synopsis of a single such sequence of imputations. Multiple imputation is then achieved by repeating this whole procedure $c010-math-398$ times to create $c010-math-399$ imputed datasets.

It was pointed out by van Buuren and Groothuis-Oudshoorn (2011) that the general approach of iterating over a sequence of probability models has been reinvented several times under different names. Authors reporting such approaches include Kennickell (1991), Brand (1999), Oudshoorn et al. (1999), Raghunathan et al. (2001), Heckerman et al. (2000), Rubin (2003), and Gelman (2004). Here, we follow the terminology of van Buuren and Groothuis-Oudshoorn (2011), who introduced the term chained equations. The term refers to the fact that in a sequence where variables are imputed randomly, one by one, and conditional on the previously imputed variables, this conditioning introduces a chain of dependencies on the probability distributions.

The procedures reported by various authors differ both in the chosen sequence of probability models and details of the initialization and iteration over imputations. In a linear model as in Brand (1999); Oudshoorn et al. (1999), one assumes

and $c010-math-415$ a design matrix derived from $c010-math-416$ : all variables of the current dataset, except the imputed variable (this does not mean that all variables are necessarily part of the model: the predictors may differ between imputed variables). If $c010-math-417$ is determined by minimizing $c010-math-418$ (ordinary least squares), a classic result from regression theory shows that this implies

which then determines the distribution from which to sample $c010-math-419$ in Procedure 10.8.2. Kennickell (1991), Gelman (2004), and Raghunathan et al. (2001) discuss generalized regression models with explicit transformations of predictor and/or imputed variables, and for each model one needs to determine how to sample the parameters conditionally on the predictor variables. For example, the latter authors give explicit recipes for sampling coefficients and imputations in the case of numeric data (normal linear regression), count data (Poisson regression), binary or general categorical data [(generalized) logistic regression], and mixed data (a two-step procedure). One group differing significantly from (generalized) linear modeling is Heckerman et al. (2000), who use probabilistic decision trees.
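For the normal linear case, a single run of such a sequence can be sketched in base R. This is a simplified illustration, not the algorithm of any particular author or package. It assumes a data frame with at least two numeric columns, models each incomplete variable as a linear function of all other variables with normal errors, and draws the regression parameters from their approximate posterior under a standard noninformative prior before drawing the imputations. The function name `chained_equations` and the artificial data are made up.

```r
# One run of chained equations with Bayesian normal linear models.
chained_equations <- function(dat, n_iter = 10) {
  miss <- is.na(dat)
  vars <- names(dat)[colSums(miss) > 0]
  # Initialise: fill each gap with a randomly drawn observed value.
  for (v in vars) {
    observed <- dat[!miss[, v], v]
    idx <- sample.int(length(observed), sum(miss[, v]), replace = TRUE)
    dat[miss[, v], v] <- observed[idx]
  }
  for (iter in seq_len(n_iter)) {
    for (v in vars) {
      obs <- !miss[, v]
      # Fit the model for v on records where v was actually observed.
      f   <- reformulate(setdiff(names(dat), v), response = v)
      fit <- lm(f, data = dat[obs, , drop = FALSE])
      # Draw the residual standard deviation and the coefficients.
      sigma_star <- sqrt(sum(residuals(fit)^2) / rchisq(1, fit$df.residual))
      V <- summary(fit)$cov.unscaled
      beta_star <- coef(fit) + sigma_star * drop(t(chol(V)) %*% rnorm(nrow(V)))
      # Draw imputations around the predictions from the drawn coefficients.
      X_mis <- model.matrix(delete.response(terms(fit)), dat[!obs, , drop = FALSE])
      dat[!obs, v] <- drop(X_mis %*% beta_star) + rnorm(sum(!obs), sd = sigma_star)
    }
  }
  dat
}

set.seed(123)
d <- data.frame(x = rnorm(100))
d$y <- 1 + 2 * d$x + rnorm(100)
d$x[1:10]  <- NA
d$y[21:35] <- NA
completed <- chained_equations(d, n_iter = 10)
colSums(is.na(completed))
```

Calling the function several times, each time starting from the incomplete data, gives the multiple imputed datasets.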

The general chained equation approach is thus very flexible and even allows conflicting models to be specified (Arnold et al., 1999; Rubin, 2003). For example, consider two variables $c010-math-420$ and $c010-math-421$ and the models $c010-math-422$ and $c010-math-423$ . Clearly, this pair of models makes no sense structurally, and it is not possible for both $c010-math-424$ and $c010-math-425$ to be normally distributed ( $c010-math-426$ and $c010-math-427$ are not connected by a linear transformation). A second way of stating this is that the two models, with $c010-math-428$ and $c010-math-429$ assumed to be normally distributed, do not permit an implicit joint distribution for $c010-math-430$ and $c010-math-431$ .

10.8.4 Imputation with the `mice` Package

The mice package supports a number of model types for multiple imputation with chained equations. The imputation and analyses are set up to closely reflect the states of data in Figure 10.6. The package's core function is called mice. In this example we use the same data as for Section 10.8.2.

  # create a 2-variable dataset
  data(retailers, package="validate")
  dat <- retailers[c("staff","turnover")]

Imputation using the default method (predictive mean matching, PMM, based on linear modeling) works as follows:

  library(mice)
  out <- mice::mice(dat, m=20, printFlag=FALSE)

We set printFlag=FALSE to avoid printing iteration info to the screen during computation, and as before (Section 10.8.2) we set the number of imputations m=20. The output, stored in out, is an object of class mids, which stands for ‘multiple imputed datasets’. In fact, to save memory, only the imputed values are stored to be used when needed during analyses. The imputed values can be obtained by selecting the appropriate elements of out$imp. For example,

  out$imp$staff[, 1:6]
  ##     1  2  3  4  5  6
  ## 3   6  1 13 75 52 60
  ## 4   5  6  5 24 53 53
  ## 5   3  3  6 52 13  1
  ## 14 52 29 53 75  3  3
  ## 40  1  1 60 52 24 24
  ## 43  6  1  6 29 52 52

selects the first six imputations for each missing value of staff. The row numbers indicate the record number in the original dataset.
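Because only the imputed values are stored, a completed dataset is obtained by filling them into the original data. The sketch below does this by hand for the first imputation of `staff`, assuming that the row names of `out$imp$staff` match the row names of `dat`, as the output above suggests. The `seed` value is arbitrary, so the imputed values will differ from those printed above.

```r
library(mice)
data(retailers, package = "validate")
dat <- retailers[c("staff", "turnover")]
out <- mice(dat, m = 20, printFlag = FALSE, seed = 1)

# Fill the first set of imputations for 'staff' into a copy of the data
filled <- dat
rows <- rownames(out$imp$staff)
filled[rows, "staff"] <- out$imp$staff[, 1]

# Compare with the completed data returned by mice for the first imputation
all.equal(filled$staff, complete(out, 1)$staff)
```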

Analyzing the imputed datasets can be done using mice's version of with.

  fits <- with(out, lm(staff ~ turnover))

The object fits contains the results of fitting $c010-math-432$ complete data linear models based on the imputed datasets. One should think of fits as resembling the ensemble of $c010-math-433$ values in Figure 10.6. The class of fits is mira, which stands for ‘multiply imputed repeated analyses’.

The final step is to combine (pool) the analyses to the final estimates using the pool function.

  est <- pool(fits)
  summary(est)
  ##                      est           se         t        df     Pr(>|t|)
  ## (Intercept) 1.251705e+01 2.562367e+00 4.8849567 32.380998 2.701511e-05
  ## turnover    2.285708e-05 3.985103e-05 0.5735632  7.613212 5.828042e-01
  ##                     lo 95        hi 95 nmis       fmi    lambda
  ## (Intercept)  7.300088e+00 1.773401e+01   NA 0.3488676 0.3098555
  ## turnover    -6.985859e-05 1.155728e-04    4 0.8499097 0.8150586

The object est is of class mipo, meaning ‘multiply imputed pooled outcomes’. Its printed output resembles the output of an lm object, but note that its contents are different: pool gathers the data in mira objects in a way that makes summarizing the statistics using summary easier. One can therefore not use residuals or predict to obtain residuals or predictions from the final estimated model. On the other hand, the output of summary applied to a mipo object is a simple R matrix, which can easily be processed further for predictive purposes with a bit of programming.

The mice package comes with some diagnostic tools as well. The first we have already seen: by inspecting the imputed values for staff, we can already see that the method used here gives quite a large variance. A visual comparison of the imputed and observed values can be created using stripplot(out) (not shown here—make sure to first load the lattice package). As an illustration, let us compute the ranges of imputed values per record for staff and compare it with the range in the original data.

  range(dat$staff,na.rm=TRUE)
  ## [1]  1 75
  apply(out$imp$staff, 1 , range)
  ##       3  4  5 14 40 43
  ## [1,]  1  1  1  1  1  1
  ## [2,] 75 75 75 75 60 53

For each record, the range of the imputations, which are randomly selected conditional on the other variable in the dataset (turnover), is as large, or nearly as large, as the range across all records(!). The problem in this example is that the variables are better modeled on the log–log scale, as shown in Figure 10.7.

Unlike Amelia, mice has no built-in facilities to perform transformations on both predictor and predicted variables (there are facilities to transform the predictor variables using the so-called passive imputation). The solution in our case would be to transform the dataset, create the mids object, and to implement the back-transformation as part of the call to with when creating the mira object. The latter object can then be pooled normally.
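A sketch of this workflow is given below. It assumes that all observed values of `staff` and `turnover` are strictly positive, so that the logarithm is defined; the code checks this explicitly. The `seed` value and the object names are arbitrary.

```r
library(mice)
data(retailers, package = "validate")
dat <- retailers[c("staff", "turnover")]
stopifnot(all(dat > 0, na.rm = TRUE))   # assumption: logs are defined

# 1. Transform, then impute on the log scale
logdat  <- log(dat)
out_log <- mice(logdat, m = 20, printFlag = FALSE, seed = 1)

# 2. Back-transform inside the call to 'with', so that each analysis
#    is run on the original scale
fits_log <- with(out_log, lm(exp(staff) ~ exp(turnover)))

# 3. Pool as usual
summary(pool(fits_log))

# Range of the imputed staff numbers per record, on the original scale
apply(exp(out_log$imp$staff), 1, range)
```

The last line allows a comparison with the ranges computed above for the untransformed model.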

Finally, we note that it is possible to obtain the imputed datasets using the complete function. For example, to obtain the complete dataset from the 15th imputation, do the following.

  imp_15 <- complete(out, 15)
  head(imp_15, 3)
  ##   staff turnover
  ## 1    75     3571
  ## 2     9     1607
  ## 3     1     6886

Unfortunately, the Zelig package (see Section 10.8.2) does not work directly with objects of class mids since the zelig function expects a list of imputed datasets. However, it is not difficult to create an appropriate object from a mice output.

  M <- 20
  # Complete the original data, M times
  completed <- lapply(seq_len(M), function(i) mice::complete(out, i))
  # pass the data frames to 'Zelig::mi'
  completed_mi <- do.call(Zelig::mi, completed)
  # use the created object as data for 'zelig'
  Zelig::zelig(staff ~ turnover, data=completed_mi, model='ls',
     cite=FALSE)
  ## Model: Combined Imputations
  ##
  ##             Estimate Std.Error z value Pr(>|z|)
  ## (Intercept) 1.25e+01  2.56e+00    4.88    1e-06
  ## turnover    2.29e-05  3.99e-05    0.57     0.57
  ##
  ## For results from individual imputed datasets, use
  ##   summary(x, subset = i:j)
  ## Next step: Use 'setx' method

Here, we pass the list of completed data frames to Zelig::mi to create an object of class mi. This object is accepted by zelig, which runs the analysis on each completed dataset and combines the results.

A full tutorial on mice can be found in van Buuren and Groothuis-Oudshoorn (2011), and for a full tutorial on zelig, the reader is referred to Imai et al. (2008).

10.9 Analytic Approaches to Estimate Variance of Imputation

10.9.1 Imputation as Part of the Estimator

We assume that $c010-math-434$ is obtained by sampling without replacement from a population of size $c010-math-435$ , with the aim of estimating the population parameter $c010-math-436$ . Using the standard estimator of the mean $c010-math-437$ , the sampling variance can be estimated with Eq. (10.13).

A second way to estimate the sampling variance of an estimator is to compute the jackknife variance. This variance is computed by estimating a population parameter $c010-math-438$ times over the samples created by leaving out one observation at a time. It can be shown that the jackknife variance is given by (Rao, 1996)

where $c010-math-439$ is the average over every observed value in $c010-math-440$ except for the $c010-math-441$ th element. A nice property of the jackknife estimator is that it extends to general estimators $c010-math-442$ , and in fact, if we choose $c010-math-443$ , we have $c010-math-444$ .

In the case of missing values, it is not unusual to compute estimates based on the imputed dataset (so for the estimator of the mean, we have $c010-math-445$ ). In the case of a general parameter, the jackknife variance is given by

10.17

where $c010-math-447$ is computed in the same way as $c010-math-448$ , except that all imputed values are replaced by the values that would be imputed when $c010-math-449$ is left out. Rao and Shao (1992) and Rao (1996) have derived analytical approximations of Eq. (10.17) when $c010-math-450$ under random hot deck or regression imputation. They also show that for these cases $c010-math-451$ is a consistent estimator. In the case of a general estimator $c010-math-452$ , the jackknife variance is not very difficult to determine computationally by explicitly evaluating the sum in Eq. (10.17), possibly in parallel. Chen and Shao (2001) show that the jackknife estimator can overestimate the variance for nearest-neighbor hot deck imputation and propose improved nonparametric estimates.
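For a deterministic imputation method, the sum in Eq. (10.17) can be evaluated by brute force: leave out one unit, re-impute the remaining sample, and recompute the estimate. The sketch below assumes the usual jackknife form $\frac{n-1}{n}\sum_{j}(\hat{\theta}_I^{(j)} - \hat{\theta}_I)^2$, where $\hat{\theta}_I$ is the estimate on the imputed sample and $\hat{\theta}_I^{(j)}$ the estimate after leaving out unit $j$. The notation, the function names, the mean-imputation helper, and the artificial data are all made up for illustration.

```r
# Mean imputation (deterministic), used here as a simple example
impute_mean <- function(y) {
  y[is.na(y)] <- mean(y, na.rm = TRUE)
  y
}

# Jackknife variance where imputation is redone for every
# leave-one-out sample, so imputation is part of the estimator.
jackknife_var <- function(y, impute, estimator = mean) {
  n <- length(y)
  theta_full <- estimator(impute(y))
  theta_loo  <- vapply(seq_len(n),
                       function(j) estimator(impute(y[-j])),
                       numeric(1))
  (n - 1) / n * sum((theta_loo - theta_full)^2)
}

set.seed(7)
y <- rnorm(100, mean = 50, sd = 10)
y[sample(100, 25)] <- NA

jackknife_var(y, impute_mean)     # imputation treated as part of the estimator
var(impute_mean(y)) / length(y)   # naive estimate that ignores the imputation
```

For complete data and the sample mean, the function reduces to `var(y) / length(y)`, that is, the familiar variance estimate without the finite population correction. The `vapply` loop is trivially parallelizable, since each leave-one-out estimate is computed independently.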

10.10 Choosing an Imputation Method

The best method to use for a particular imputation problem generally depends on statistical as well as practical considerations. The latter may include factors such as the level of knowledge and experience of the statistician, the availability of a particular package, restrictions on allowed methodology in regulated environments, performance, and general requirements on the quality of the result. From a statistical point of view, the data type (numerical, categorical, and mixed), the observed data distribution, the missing data mechanism, and the relevance of maintaining distributional properties after imputation all play a role. In the following, some of the main differentiating characteristics of the methods described earlier will be summarized. An overview of the pros and cons of various approaches to imputation can also be found in the paper by Schafer and Graham (2002).

Regarding the chosen methodology, we distinguish the following four main characteristics to be determined:

1. parametric or nonparametric models
2. univariate or multivariate imputation
3. donor imputation or estimated imputation
4. single imputation or multiple imputation.

In Table 10.2, examples of imputation methods corresponding to these choices are given (where possible).

The advantage of parametric models over nonparametric models is their interpretability. The coefficients of, say, a linear model have meaningful units of measure, which means that their values can be assessed based on domain knowledge. In many cases, the statistical properties of the parameters (variance, confidence intervals, and $c010-math-453$ -values) are readily available. Do note, however, that these inferential properties always depend on the assumption that data is obtained by a proper randomization procedure. In the case of data coming from administrative databases or typical ‘big data’ sources, such conditions are rarely met in practice. Parametric models allow for extrapolation beyond observed values of predictors, which may be an advantage if predictors are of high quality. It can also be a nuisance, for example, when predictors contain outliers. The presence of influential outliers can be mitigated by applying robust parameterization, such as $c010-math-454$ -estimation or elastic net. Nonparametric models tend to yield lower error of prediction within the range of observed values than parametric models, but the trade-off is that extrapolation beyond the observed predictor range is usually not meaningful. Also, nonparametric models such as random forest can to an extent model interactions between predictors automatically, while for parametric models, incorporating such effects requires manual configuration.

Table 10.2 A classification of imputation methods with examples

Type	Example
`pues`	(robust) Regression imputation
`puem`	Multiple (robust) regression imputation
`puds`	(robust) Regression-based predictive mean matching
`pudm`	Multiple (robust) regression-based knn-predictive mean matching
`pmes`	Multivariate normal EM-based imputation
`pmem`	Multivariate normal EM-based multiple imputation (EMB)
`pmds`	Mice-pmm, avoiding the multiple imputations
`pmdm`	Mice-pmm
`nues`	CART-based imputation
`nuem`	Multiple CART-based imputation
`nuds`	CART-based predictive mean matching
`nudm`	CART-based knn-predictive mean matching
`nmes`	Iterative random forest (missForest)
`nmem`	—
`nmds`	Iterative random forest-based pmm
`nmdm`	—

The first column indicates whether an imputation method is parametric or nonparametric, univariate or multivariate, estimation or donor-based, or single or multiple imputation. For example, the first row concerns parametric univariate estimation-based single imputation.

One advantage of multivariate imputation over univariate imputation is that the relations between all variables are modeled all at once. In the parametric case, the most common approach is to assume the multivariate normal distribution. This then models linear correlations between variables (no interactions). The missForest algorithm is a nonparametric counterpart that makes no distributional assumptions and moreover can handle nonnumeric data. Combinations of EM-based and nonparametric modeling can be devised as well. For example, Rahman and Islam (2011) propose to use a tree-based algorithm to split a dataset into highly correlated subsets and subsequently use the EM algorithm to impute each section. Applying multiple univariate imputations is attractive mainly because of its simplicity and the freedom of choosing a model for each variable separately. This may, however, lead implicitly to incompatible assumptions about the nature of the multivariate distribution (see also Section 10.8.3). Regarding the choice between parametric and nonparametric multivariate methods, it is worth noting that Shah et al. (2014) compared the performance of parametric MICE methodology with that of the missForest algorithm using simulated missing values on a medical dataset. Comparing some standard parameters such as bias and confidence intervals of estimated values, they found no large differences.

The main advantage of donor-based imputation over estimated imputation is that one can be sure that the imputed value is plausible, in the sense that it is something that has been observed. In the case of multivariate donor imputation, where multiple values are copied from a single record, this means that restrictions on the relations between those variables will be satisfied (we assume that only records that do not violate such restrictions will be used as donors). Multivariate restrictions concerning imputed values and values already present in the imputed record may still be violated, unless these are taken into account during donor selection. In comparison, predictive models usually do not take account of any restrictions on the data and therefore may (and often will) impute unacceptable values or value combinations that require further processing. A risk of donor-based imputation is that a certain group of donors is used so often that it significantly influences the (multivariate) distribution of the imputed dataset. This risk is often mitigated by restricting the number of times a donor record may be used, although a study by Joenssen and Bankhofer (2012) shows that whether this improves the result really depends on the chosen hot deck method and data type. Given also the lack of foundational theory on hot deck methodology (Andridge and Little, 2010), some simulations are probably always necessary to fine-tune donor-based imputation methods.
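
The difference between estimated and donor-based imputation can be illustrated with a small base R sketch. The toy data and all variable names below are hypothetical, and the donor step is a bare-bones predictive mean matching with a single nearest donor, not the implementation of any particular package.

```r
# Toy data (hypothetical): y has two missing values, x is fully observed
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, 3.9, NA, 8.2, 9.8, NA, 14.1, 16.0)
miss <- is.na(y)

# Fit a linear model on the complete cases (lm drops rows with NA by default)
fit  <- lm(y ~ x)
pred <- predict(fit, newdata = data.frame(x = x))

# Estimated imputation: plug in the model prediction itself
y_est <- y
y_est[miss] <- pred[miss]

# Donor imputation: copy the observed y of the donor record whose
# predicted value is closest to the prediction for the recipient
donors <- which(!miss)
y_don <- y
for (i in which(miss)) {
  j <- donors[which.min(abs(pred[donors] - pred[i]))]
  y_don[i] <- y[j]
}

y_est[miss]  # model-based values, generally not among the observed values
y_don[miss]  # values copied from observed records
```

By construction, every value in `y_don[miss]` has actually been observed, whereas the values in `y_est[miss]` lie on the regression line.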

10.11 Constraint Value Adjustment

Imputation methods in general cannot take account of logical or mathematical relations imposed on the data, although some approaches exist for specific cases (see de Waal (2017) for a recent overview). A viable and generic approach is therefore to choose a suitable imputation method and to adjust the imputed values afterward so that conditions can be met.

Here, we discuss a method that can be used to adjust numerical data under linear equality and/or inequality restrictions. The method has been discussed recently in relation to constrained imputation by Pannekoek and Zhang (2015). The underlying algorithm has probably been reinvented many times. The earliest reference known to these authors is that of Hildreth (1957). In the following, we first focus on the general method. After this, we point out some subtleties that arise in the context of data cleaning and imputation, and we finish with a practical example using the rspa package.

10.11.1 Formal Description

From a mathematical point of view, the problem is the following. We are given a set of equality and inequality constraints represented by a system of linear equations and inequalities:

10.18

Suppose that we are presented with a vector $c010-math-456$ not satisfying these restrictions. In our applications, these will be values that have been imputed into a numerical record. We wish to move away from $c010-math-457$ minimally, so that we obtain a new vector, say $c010-math-458$ , such that $c010-math-459$ does satisfy the restrictions. Now the term ‘minimally’ needs to be interpreted by defining a distance function. A generic choice is to use the (weighted) Euclidean distance, which yields the following minimization problem.

where $c010-math-460$ is by definition a diagonal matrix with positive diagonal elements. A special case occurs when all the restrictions are equalities such as those arising from accounting balance restrictions. In that case, one can apply the Lagrange multiplier method. That is, one defines the function

where $c010-math-461$ is a vector of dual variables (Lagrange multipliers). The solution (if any exists) is obtained by equating derivatives of $c010-math-462$ with respect to the components of $c010-math-463$ and $c010-math-464$ to zero and solving for $c010-math-465$ . The solution to this problem is
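
Written out in one common notation, the equality-constrained problem, its Lagrangian, and its solution read as follows. The symbols are assumptions chosen here for illustration: $x^{0}$ is the initial vector, $x$ the adjusted vector, $W$ the diagonal weight matrix, $Ax = b$ the equality restrictions (with $A$ of full row rank), and $\lambda$ the vector of multipliers. The factor 2 in front of $\lambda$ is only a convenience that simplifies the derivatives.

```latex
\begin{aligned}
&\min_{x}\; (x - x^{0})^{T} W (x - x^{0})
   \quad \text{subject to } A x = b, \\
&L(x, \lambda) = (x - x^{0})^{T} W (x - x^{0}) + 2\lambda^{T}(A x - b), \\
&\frac{\partial L}{\partial x} = 2W(x - x^{0}) + 2A^{T}\lambda = 0
   \;\Longrightarrow\; x = x^{0} - W^{-1}A^{T}\lambda, \\
&A x = b
   \;\Longrightarrow\; \lambda = \left(A W^{-1} A^{T}\right)^{-1}\left(A x^{0} - b\right), \\
&x = x^{0} - W^{-1} A^{T}\left(A W^{-1} A^{T}\right)^{-1}\left(A x^{0} - b\right).
\end{aligned}
```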

When the set of restrictions also contains inequality restrictions, such as nonnegativity demands on certain variables, the Lagrange multiplier method no longer applies. The algorithm proposed by Hildreth (1957) exploits the fact that although satisfying all restrictions at once is difficult, it is not so difficult to find a solution that satisfies a single restriction. So the idea is to solve for one restriction at a time, iterating (possibly multiple times) over the restrictions until a satisfactory solution is found. A synopsis is given in Procedure 10.11.1, but it is most easily understood using a simple example.

Let us consider a two-vector $c010-math-482$ , subject to the constraints

10.19

10.20

In Figure 10.9, the valid regions for each of these restrictions are shown. To interpret this figure, first note that the equalities $c010-math-485$ and $c010-math-486$ define lines in the $c010-math-487$ plane that border on the valid regions. The valid regions are half-planes shown as shaded areas. The region that satisfies both constraints is the overlap of both regions, which in Figure 10.9 is doubly shaded.

**Figure 10.9** Valid regions (shaded) for restrictions (10.19) and (10.20). The black dot represents $c010-math-488$ .

In matrix notation, the restrictions are defined as

Here, we have $c010-math-489$ . We can verify that for $c010-math-490$ , the restrictions do not hold. Indeed, we have

The idea of the successive projection algorithm is to iteratively project $c010-math-491$ onto the borders. In Figure 10.9 this is indicated with arrows. First, $c010-math-492$ is updated by projection on the line $c010-math-493$ . The coordinates of the new point are computed by following Procedure 10.11.1. We get:

This step is depicted as the arrow from $c010-math-494$ to the line $c010-math-495$ in Figure 10.9. One can check that the first restriction is now satisfied by filling in the conditions in matrix notation. In the second step, we project onto the line $c010-math-496$ . This yields the following:

This vector satisfies both restrictions, so we are done.

Finally, let us summarize the procedure. In the first step, the signed distance $c010-math-497$ between the current value of $c010-math-498$ and the border of the half-space defined by the $c010-math-499$ th constraint is computed. We then distinguish the following three cases. In the first case, the current value of $c010-math-500$ violates the current constraint. In that case $c010-math-501$ and the updated vector will be the projection onto the half-space defined by the current restriction. In the second and third cases, $c010-math-502$ already satisfies the current constraint. By comparing the current distance with the accumulated (nonnegative) distance $c010-math-503$ already traveled in the direction of this half-space, it is checked whether we can reduce the distance traveled (hence, the computation of $c010-math-504$ ) without violating the restriction. In the second case, $c010-math-505$ , and the result is a projection of $c010-math-506$ on the border of the half-space from within the valid region. In the third case, $c010-math-507$ , we get $c010-math-508$ , and the accumulated distance traveled will be reduced to zero.
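
The update summarized above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the implementation used by lintools or rspa: the function name `successive_projection`, its arguments, the stopping rule, and the example constraints are all hypothetical, and only inequality constraints of the form $Ax \leq b$ are handled.

```r
# Sketch of a Hildreth-style successive projection for
#   minimize (x - x0)' diag(w) (x - x0)  subject to  A x <= b
successive_projection <- function(x0, A, b, w = rep(1, length(x0)),
                                  tol = 1e-8, maxiter = 1000L) {
  x <- x0
  alpha <- numeric(nrow(A))  # accumulated nonnegative step per constraint
  for (iter in seq_len(maxiter)) {
    x_old <- x
    for (i in seq_len(nrow(A))) {
      a <- A[i, ]
      # signed, scaled distance to the border of the i-th half-space;
      # positive when the constraint is violated
      step <- (sum(a * x) - b[i]) / sum(a^2 / w)
      # never undo more than was previously traveled for this constraint
      alpha_new <- max(alpha[i] + step, 0)
      x <- x - (a / w) * (alpha_new - alpha[i])
      alpha[i] <- alpha_new
    }
    # stop when a full sweep over the constraints no longer moves x
    if (max(abs(x - x_old)) < tol) break
  }
  x
}

# Hypothetical example: x1 >= 0 (written as -x1 <= 0) and x1 + x2 <= 1
A <- rbind(c(-1, 0),
           c( 1, 1))
b <- c(0, 1)
x0 <- c(-1, 3)

x_adj <- successive_projection(x0, A, b)
x_adj  # approximately c(0, 1), the closest point satisfying both constraints
```

An equality constraint would be treated in the same loop by applying the full step without the truncation at zero. Note that the sweep is repeated until the vector stops changing, since stopping as soon as all constraints hold does not guarantee the minimal adjustment.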

10.11.2 Application to Imputed Data

When applying the successive projection algorithm to imputed values, there are a few subtleties to keep in mind. In general, the pattern of missing values varies from record to record. This means that for each record a new set of constraints is derived by substituting the observed (nonimputed) values into the overall set of constraints. This then yields the constraints that guide adjustment of the imputed values. However, the successive projection algorithm only converges when an actual solution exists. That is, one must ensure that the variables that have been imputed actually permit a solution. This can be done by preceding the imputation procedure with an appropriate error localization step (see Section 7.1).
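
The substitution of observed values into the constraints can be made concrete with a small base R sketch. The record, the single balance rule, and all object names are hypothetical and serve only to show the mechanics.

```r
# Balance rule turnover + other.rev - total.rev = 0, written as A x = b
A <- matrix(c(1, 1, -1), nrow = 1,
            dimnames = list(NULL, c("turnover", "other.rev", "total.rev")))
b <- 0

# One record; other.rev was imputed, the other two values were observed
x       <- c(turnover = 100,   other.rev = 12,   total.rev = 120)
imputed <- c(turnover = FALSE, other.rev = TRUE, total.rev = FALSE)

# Move the observed part to the right-hand side:
#   A_imp x_imp = b - A_obs x_obs
A_imp <- A[, imputed, drop = FALSE]
b_imp <- b - A[, !imputed, drop = FALSE] %*% x[!imputed]

A_imp  # coefficient of the imputed variable
b_imp  # right-hand side for this record: other.rev must equal 20
```

The reduced system involves only the imputed variables, so the adjustment leaves the observed values untouched. A record with a different missing data pattern yields a different reduced system.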

The second subtlety is related to the chosen weights. Different variables often vary on different scales. Choosing a simple Euclidean distance between the original and adjusted value sets is likely to disturb ratios between the variables. This might render the resulting adjusted values implausible from the domain knowledge perspective (Pannekoek and Zhang, 2015). They show that when the weights are chosen as

the adjusted values will in the first order preserve the ratios observed in the initial values.
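
The effect of the weights can be seen in a small base R sketch with a single equality constraint, for which the adjustment has a closed form. The helper `adjust_equalities` and the numbers are hypothetical. The weight choice shown, weights inversely proportional to the initial values, is an assumption based on the cited result rather than a formula taken from the text above.

```r
# Closed-form weighted adjustment under equality constraints A x = b
# (w holds the diagonal of the weight matrix W)
adjust_equalities <- function(x0, A, b, w = rep(1, length(x0))) {
  Winv_At <- t(A) / w                              # W^{-1} A^T for diagonal W
  lambda  <- solve(A %*% Winv_At, A %*% x0 - b)    # Lagrange multipliers
  drop(x0 - Winv_At %*% lambda)
}

x0 <- c(100, 10)              # initial (imputed) values, ratio 10 : 1
A  <- matrix(c(1, 1), nrow = 1)
b  <- 121                     # the two values must add up to 121

x_plain <- adjust_equalities(x0, A, b)              # unweighted Euclidean distance
x_ratio <- adjust_equalities(x0, A, b, w = 1 / x0)  # weights 1 / initial value

x_plain  # 105.5 15.5: both values shift by the same amount
x_ratio  # 110 11: both values scale by the same factor
```

With unit weights the discrepancy of 11 is split equally, which changes the ratio between the variables considerably. With weights inversely proportional to the initial values, the adjustment is proportional to the size of each value, so the ratio of 10 to 1 is retained in this example.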

10.11.3 Adjusting Imputed Values with the `rspa` Package

The successive projection algorithm is implemented in the lintools package. A convenient wrapper that allows for adjusting values stored in a data.frame is available through the rspa package, which we will use in the following example.

We will continue using the retailers dataset from the validate package and also use this package to define the restrictions on its values. The simputation package is used to compute imputations.

  library(validate)
  library(simputation)
  library(errorlocate)
  library(rspa)
  data(retailers)

We define a number of balance restrictions and nonnegativity rules for variables in the retailers dataset (see also Chapter 6).

  v <- validator(
    staff >= 0
    , turnover >= 0
    , other.rev >= 0
    , turnover + other.rev == total.rev
    , total.rev - total.costs == profit
    , total.costs >= 0
  )

To ensure that we are able to get values that satisfy all rules, we perform error localization and set all erroneous values to NA with the replace_errors function of the errorlocate package.

  d1 <- replace_errors(retailers,v)
  miss <- is.na(d1)

A logical matrix that signals which values are to be imputed (and thus later adjusted) is stored as well. We use the missForest algorithm to impute every missing value, not taking into account the validation rules.

  d2 <- impute_mf(d1, . ~ .)
  ##   missForest iteration 1 in progress…done!
  ##   missForest iteration 2 in progress…done!
  ##   missForest iteration 3 in progress…done!
  ##   missForest iteration 4 in progress…done!
  sum(is.na(d2))
  ## [1] 0

Finally, we use the match_restrictions function of the rspa package to adjust the imputed values. Here, we use the simple Euclidean distance, but it is possible to pass weights that define a weight vector for each record.

  # adjust only the values that were imputed (miss was stored as is.na(d1))
  d3 <- match_restrictions(d2, v, adjust=miss)

Let us inspect the results. We allow for an error of $10^{-2}$ since that is the default convergence parameter defined by rspa. We use the compare function of the validate package to summarize the changes with respect to rule violations in consecutive versions of the data.

  # set sensitivity to violations of linear (in)equalities to
  # less than 1e-2
  voptions(x=v, lin.eqeps=1e-2, lin.ineqeps=1e-2)
  compare(v, start=retailers,locate=d1, imputed=d2, adjusted=d3)
  ## Object of class validatorComparison:
  ##
  ##                     Version
  ## Status               start locate imputed adjusted
  ##   validations          360    360     360      360
  ##   verifiable           265    238     360      360
  ##   unverifiable          95    122       0        0
  ##   still_unverifiable    95     95       0        0
  ##   new_unverifiable       0     27       0        0
  ##   satisfied            246    238     298      360
  ##   still_satisfied      246    238     246      246
  ##   new_satisfied          0      0      52      114
  ##   violated              19      0      62        0
  ##   still_violated        19      0      18        0
  ##   new_violated           0      0      44        0

The compare function shows that after setting the erroneous values to missing, no violations occur anymore, while the number of unverifiable checks increased by 27. These correspond to rules that could not be checked because of extra missing values in the data. After imputation, all 360 checks (60 records times 6 record-wise restrictions) can be verified. However, 62 of them are violated. These violations disappear (to within $10^{-2}$) after imputed values are adjusted to match the restrictions.
