Multiple imputation FAQ page

This page provides basic information about multiple imputation (MI) in the form of answers to Frequently Asked Questions (FAQs). For more extensive, non-technical overviews of MI, see the articles by Schafer and Olsen (1998) and Schafer (1999). Answers to FAQs regarding MI in large public-use data sets (e.g. from surveys and censuses) are given by Rubin (1996).


What is multiple imputation?

Imputation, the practice of 'filling in' missing data with plausible values, is an attractive approach to analyzing incomplete data. It apparently solves the missing-data problem at the beginning of the analysis. However, a naive or unprincipled imputation method may create more problems than it solves, distorting estimates, standard errors and hypothesis tests, as documented by Little and Rubin (1987) and others.

The question of how to obtain valid inferences from imputed data was addressed by Rubin's (1987) book on multiple imputation (MI). MI is a Monte Carlo technique in which the missing values are replaced by m > 1 simulated versions, where m is typically small (e.g. 3-10). In Rubin's method for 'repeated imputation' inference, each of the simulated complete datasets is analyzed by standard methods, and the results are combined to produce estimates and confidence intervals that incorporate missing-data uncertainty. Rubin (1987) addresses potential uses of MI primarily for large public-use data files from sample surveys and censuses. With the advent of new computational methods and software for creating MI's, however, the technique has become increasingly attractive for researchers in the biomedical, behavioral, and social sciences whose investigations are hindered by missing data. These methods are documented in a recent book by Schafer (1997) on incomplete multivariate data.

Back to top


Is MI the only principled way to handle missing data?

MI is not the only principled method for handling missing values, nor is it necessarily the best for any given problem. In some cases, good estimates can be obtained through weighted estimation procedures. In fully parametric models, maximum-likelihood estimates can often be calculated directly from the incomplete data by specialized numerical methods, such as the EM algorithm. Those procedures may be somewhat more efficient than MI because they involve no simulation. Given sufficient time and resources, one could perhaps derive a better statistical procedure than MI for any particular problem. In real-life applications, however, where missing data are a nuisance rather than the primary focus, an easy, approximate solution with good properties can be preferable to one that is more efficient but problem-specific and complicated to implement.

Back to top


Why are only a few imputations needed?

Many are surprised by the claim that only 3-10 imputations may be needed. Rubin (1987, p. 114) shows that the efficiency of an estimate based on m imputations is approximately

    (1 + \lambda/m)^{-1},

where \lambda is the rate of missing information for the quantity being estimated. The efficiencies achieved for various values of m and rates of missing information are shown below.

    m \ \lambda    0.1      0.3      0.5      0.7      0.9
       3          0.968    0.909    0.857    0.811    0.769
       5          0.980    0.943    0.909    0.877    0.847
      10          0.990    0.971    0.952    0.935    0.917
      20          0.995    0.985    0.976    0.966    0.957

Unless the rate of missing information is very high, there is little advantage to producing and analyzing more than a few imputed datasets.
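As a quick way to reproduce the efficiencies above, or to evaluate them for other choices of m and \lambda, the following minimal Python snippet (not part of the original page) tabulates the relative efficiency (1 + \lambda/m)^{-1}:

```python
# Relative efficiency (1 + lambda/m)^(-1) of an MI point estimate,
# tabulated for a few numbers of imputations m and rates of missing
# information lambda.
for lam in (0.1, 0.3, 0.5, 0.7, 0.9):
    row = [round(1.0 / (1.0 + lam / m), 3) for m in (3, 5, 10, 20)]
    print(f"lambda={lam}: m=3,5,10,20 -> {row}")
```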

See how the rate of missing information is estimated.

Back to top


How does one create multiple imputations?

Except in trivial settings, the probability distributions that one must draw from to produce proper MI's tend to be complicated and intractable. Recently, however, a variety of new simulation methods have appeared in the statistical literature. These methods, known as Markov chain Monte Carlo (MCMC), have spawned a small revolution in Bayesian analysis and applied parametric modeling (Gilks, Richardson & Spiegelhalter, 1996). Schafer (1997) has adapted and implemented MCMC methods for the purpose of multiple imputation. In particular, he has written general-purpose MI software for incomplete multivariate data. These programs may be downloaded free of charge from our website.

Back to top


The imputation model

In order to generate imputations for the missing values, one must impose a probability model on the complete data (observed and missing values). Each of our software packages applies a different class of multivariate complete-data models. NORM uses the multivariate normal distribution. CAT is based on loglinear models, which have been traditionally used by social scientists to describe associations among variables in cross-classified data. The MIX program relies on the general location model, which combines a loglinear model for the categorical variables with a multivariate normal regression for the continuous ones. Details of these models are given by Schafer (1997). The newer package PAN uses a multivariate extension of a popular two-level linear regression model commonly applied to multilevel data (e.g. Bryk & Raudenbush, 1992). The PAN model is appropriate for describing multiple variables collected on a sample of individuals over time, or multiple variables collected on individuals who are grouped together into larger units (e.g. students within classrooms).

Back to top


What if the imputation model is wrong?

Experienced analysts know that real data rarely conform to convenient models such as the multivariate normal. In most applications of MI, the model used to generate the imputations will at best be only approximately true. Fortunately, experience has repeatedly shown that MI tends to be quite forgiving of departures from the imputation model. For example, when working with binary or ordered categorical variables, it is often acceptable to impute under a normality assumption and then round off the continuous imputed values to the nearest category. Variables whose distributions are heavily skewed may be transformed (e.g. by taking logarithms) to approximate normality and then transformed back to their original scale after imputation.
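The two devices mentioned above (rounding and transforming) amount to very little code. The following toy Python sketch is my own illustration, not code from NORM or the other packages, and the imputation step itself is deliberately simplistic; it is only meant to show the transform-and-round bookkeeping:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a right-skewed positive variable and a 0/1 variable.
y_obs = rng.lognormal(mean=1.0, sigma=0.8, size=50)   # observed skewed values
z_obs = rng.binomial(1, 0.3, size=50).astype(float)   # observed binary values

# Skewed variable: impute on the (approximately normal) log scale, then back-transform.
log_y = np.log(y_obs)
log_y_imp = rng.normal(log_y.mean(), log_y.std(ddof=1), size=10)
y_imp = np.exp(log_y_imp)                              # back on the original scale

# Binary variable: impute under normality, then round off to the nearest category.
z_imp_cont = rng.normal(z_obs.mean(), z_obs.std(ddof=1), size=10)
z_imp = np.clip(np.round(z_imp_cont), 0, 1).astype(int)
```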

Back to top


What is the relationship between the model used for imputation and the model used for analysis?

An imputation model should be chosen to be (at least approximately) compatible with the analyses to be performed on the imputed datasets. The imputation model should be rich enough to preserve the associations or relationships among variables that will be the focus of later investigation. For example, suppose that a variable Y is imputed under a normal model that includes the variable X1. After imputation, the analyst then uses linear regression to predict Y from X1 and another variable X2 which was not in the imputation model. The estimated coefficient for X2 from this regression would tend to be biased toward zero, because Y has been imputed without regard for its possible relationship with X2. In general, any association that may prove important in subsequent analyses should be present in the imputation model.
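The attenuation described above is easy to demonstrate in a small simulation. The Python sketch below is my own illustration, not from the original page; it uses a simple regression imputation of Y from X1 with added residual noise as a stand-in for a normal-model imputation, and shows the X2 coefficient shrinking toward zero when X2 is left out of the imputation model:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Y depends on both X1 and X2; half of the Y values are set missing at random.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = x1 + x2 + rng.normal(size=n)
miss = rng.random(n) < 0.5
obs = ~miss

# Impute Y from X1 only (regression imputation with added noise).
beta = np.polyfit(x1[obs], y[obs], 1)                       # slope and intercept of Y on X1
resid_sd = np.std(y[obs] - np.polyval(beta, x1[obs]), ddof=2)
y_imp = y.copy()
y_imp[miss] = np.polyval(beta, x1[miss]) + rng.normal(0, resid_sd, miss.sum())

# Regress the completed Y on X1 and X2: the X2 coefficient is attenuated
# because X2 was ignored when imputing.
X = np.column_stack([np.ones(n), x1, x2])
coef = np.linalg.lstsq(X, y_imp, rcond=None)[0]
print(coef)   # the X2 coefficient falls well below its true value of 1
```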

The converse of this rule, however, is not necessary. If Y has been imputed under a model that includes X2, there is no need to include X2 in future analyses involving Y unless its relationship to Y is of substantive interest. Results pertaining to Y are not biased by the inclusion of extra variables in the imputation phase. Therefore, a rich imputation model that preserves a large number of associations is desirable because it may be used for a variety of post-imputation analyses.

Detailed discussion on the interrelationships between the model used for imputation and the model used for analysis is given by Meng (1995) and Rubin (1996).

Above all, the processes of imputation and analysis should be guided by common sense. For example, suppose that variables with skewed, truncated, or heavy-tailed distributions are, for the sake of convenience, imputed under an assumption of joint normality. Analyses that depend primarily on means, variances, and covariances, such as regression or principal-component methods, should perform reasonably well even though the imputer's model is too simplistic. On the other hand, common sense would suggest that the same imputations ought not be used for estimation of 5th or 95th percentiles, or other analyses sensitive to non-normal shape.

Back to top


How do I combine the results across the multiply imputed sets of data?

Rubin (1987) presented the following method for combining results from a data analysis performed m times, once for each of m imputed data sets, to obtain a single set of results.

From each analysis, one must first calculate and save the estimates and standard errors. Suppose that \hat{Q}_j is an estimate of a scalar quantity of interest (e.g. a regression coefficient) obtained from data set j (j = 1, 2, ..., m) and \sqrt{U_j} is the standard error associated with \hat{Q}_j. The overall estimate is the average of the individual estimates,

    \bar{Q} = \frac{1}{m} \sum_{j=1}^{m} \hat{Q}_j .

For the overall standard error, one must first calculate the within-imputation variance,

    \bar{U} = \frac{1}{m} \sum_{j=1}^{m} U_j ,

and the between-imputation variance,

    B = \frac{1}{m-1} \sum_{j=1}^{m} (\hat{Q}_j - \bar{Q})^2 .

The total variance is

    T = \bar{U} + \left(1 + \frac{1}{m}\right) B .

The overall standard error is the square root of T. Confidence intervals are obtained by taking the overall estimate plus or minus a number of standard errors, where that number is a quantile of Student's t-distribution with degrees of freedom

    \nu = (m - 1) \left[ 1 + \frac{\bar{U}}{(1 + 1/m) B} \right]^2 .

A significance test of the null hypothesis Q = 0 is performed by comparing the ratio

    t = \frac{\bar{Q}}{\sqrt{T}}
to the same t-distribution. Additional methods for combining the results from multiply imputed data are reviewed by Schafer (1997, Ch. 4).
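For concreteness, here is a minimal Python implementation of these combining rules. It is a sketch, not code distributed with NORM/CAT/MIX/PAN, and the function name and interface are invented for illustration; it also returns the relative increase in variance and the estimated rate of missing information discussed in a later question.

```python
import numpy as np
from scipy import stats

def pool(estimates, std_errors, conf=0.95):
    """Combine m point estimates and standard errors by Rubin's (1987) rules.
    Assumes the between-imputation variance B is strictly positive."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(std_errors, dtype=float) ** 2        # within-imputation variances U_j
    m = len(q)
    qbar = q.mean()                                     # overall estimate
    ubar = u.mean()                                     # within-imputation variance
    b = q.var(ddof=1)                                   # between-imputation variance
    t_var = ubar + (1 + 1 / m) * b                      # total variance T
    se = np.sqrt(t_var)                                 # overall standard error
    df = (m - 1) * (1 + ubar / ((1 + 1 / m) * b)) ** 2  # degrees of freedom
    r = (1 + 1 / m) * b / ubar                          # relative increase in variance
    gamma = (r + 2 / (df + 3)) / (r + 1)                # estimated rate of missing information
    tcrit = stats.t.ppf(0.5 + conf / 2, df)
    ci = (qbar - tcrit * se, qbar + tcrit * se)
    pval = 2 * stats.t.sf(abs(qbar / se), df)           # two-sided test of Q = 0
    return {"estimate": qbar, "se": se, "df": df, "ci": ci,
            "p_value": pval, "missing_info": gamma}

# e.g. pool([1.02, 0.95, 1.10], [0.21, 0.20, 0.22])
```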

Back to top


What is the rate of missing information?

When performing a multiply-imputed analysis, the variation in results across the imputed data sets reflects statistical uncertainty due to missing data. Rubin's (1987) rules for MI inference provide some diagnostic measures that indicate how strongly the quantity being estimated is influenced by missing data. The estimated rate of missing information is

    \hat{\gamma} = \frac{r + 2/(\nu + 3)}{r + 1} ,

where

    r = \frac{(1 + 1/m) B}{\bar{U}}

is the relative increase in variance due to nonresponse, and B, \bar{U} and \nu are the quantities defined in the combining rules above.

The rate of missing information, together with the number of imputations m, determines the relative efficiency of the MI inference; see Why are only a few imputations needed?
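To make the arithmetic concrete, here is a tiny Python snippet evaluating r and \hat{\gamma}; the pooled quantities are hypothetical numbers invented for illustration:

```python
# Hypothetical pooled quantities from m = 5 imputations:
# within-imputation variance, between-imputation variance, degrees of freedom.
m, ubar, b, nu = 5, 0.040, 0.010, 58.2

r = (1 + 1 / m) * b / ubar               # relative increase in variance due to nonresponse
gamma = (r + 2 / (nu + 3)) / (r + 1)     # estimated rate of missing information
print(round(r, 3), round(gamma, 3))      # 0.3  0.256
```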

Back to top


Is multiple imputation a Bayesian procedure?

Partly yes and partly no. When imputations are created under Bayesian arguments (and they usually are), MI has a natural interpretation as an approximate Bayesian inference for the quantities of interest based on the observed data. The validity of MI, however, does not require one to fully subscribe to the Bayesian paradigm. Rubin (1987) provides technical conditions under which MI leads to frequency-valid answers. An imputation method which satisfies these conditions is said to be 'proper.'

Rubin's definition of 'proper', like many frequentist criteria, is useful for evaluating the properties of a given method but provides little guidance for one seeking to create such a method in practice. For this reason, Rubin recommends that imputations be created through a Bayesian process: specify a parametric model for the complete data (and, if necessary, a model for the mechanism by which data become missing), apply a prior distribution to the unknown model parameters, and simulate m independent draws from the conditional distribution of the missing data given the observed data by Bayes' Theorem. In simple problems, the computations necessary for creating MI's can be performed explicitly through formulas. In non-trivial applications, special computational techniques such as Markov chain Monte Carlo (MCMC) must be applied.
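One of the 'simple problems' where the draws can be made explicitly is a single normally distributed variable observed for only some of the cases. The Python sketch below is a toy illustration under an assumed Jeffreys prior, not code from any of the packages mentioned above: it draws the parameters from their posterior and then draws the missing values from the resulting predictive distribution.

```python
import numpy as np

def impute_normal(y_obs, n_mis, m=5, seed=0):
    """Create m proper imputations for n_mis missing values of a univariate
    normal variable, using the Jeffreys prior p(mu, sigma^2) ~ 1/sigma^2."""
    rng = np.random.default_rng(seed)
    a = len(y_obs)
    ybar, s2 = y_obs.mean(), y_obs.var(ddof=1)
    draws = []
    for _ in range(m):
        sigma2 = (a - 1) * s2 / rng.chisquare(a - 1)           # draw sigma^2 | y_obs
        mu = rng.normal(ybar, np.sqrt(sigma2 / a))             # draw mu | sigma^2, y_obs
        draws.append(rng.normal(mu, np.sqrt(sigma2), n_mis))   # draw the missing values
    return draws

# e.g. impute_normal(np.random.default_rng(1).normal(size=80), n_mis=20)
```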

Back to top


Removing incomplete cases is so much easier than multiple imputation; why can't I just do that?

The shortcomings of various case-deletion strategies have been well documented (e.g. Little & Rubin, 1987). If the discarded cases form a representative and relatively small portion of the entire dataset, then case deletion may indeed be a reasonable approach. However, case deletion leads to valid inferences in general only when missing data are missing completely at random in the sense that the probabilities of response do not depend on any data values observed or missing. In other words, case deletion implicitly assumes that the discarded cases are like a random subsample. When the discarded cases differ systematically from the rest, estimates may be seriously biased. Moreover, in multivariate problems, case deletion often results in a large portion of the data being discarded and an unacceptable loss of power.

Back to top


Why can't I just impute once?

If the proportion of missing values is small, then single imputation may be quite reasonable. Without special corrective measures, single-imputation inference tends to overstate precision because it omits the between-imputation component of variability. When the fraction of missing information is small (say, less than 5%) then single-imputation inferences for a scalar estimand may be fairly accurate. For joint inferences about multiple parameters, however, even small rates of missing information may seriously impair a single-imputation procedure. In modern computing environments, the effort required to produce and analyze a multiply-imputed dataset is often not substantially greater than what is required for good single imputation.

Back to top


Is multiple imputation like EM?

MI bears a close resemblance to the EM algorithm and other computational methods for calculating maximum-likelihood estimates based on the observed data alone. These methods summarize a likelihood function which has been averaged over a predictive distribution for the missing values. MI performs this same type of averaging by Monte Carlo rather than by numerical methods. In large samples, when relevant aspects of the imputer's and analyst's models agree, inferences obtained by MI with sufficiently many imputations will be nearly the same as those obtained by direct maximization of the likelihood.

Back to top


What is Markov chain Monte Carlo (MCMC)?

Markov chain Monte Carlo (MCMC) is a collection of methods for simulating random draws from nonstandard distributions via Markov chains. MCMC is one of the primary methods for generating MI's in nontrivial problems. In much of the existing literature on MCMC (e.g. Gilks, Richardson & Spiegelhalter, 1996, and their references), MCMC is used for parameter simulation, i.e. for creating a large number of (typically dependent) random draws of parameters from Bayesian posterior distributions under complicated parametric models. In MI-related applications, however, MCMC is used to create a small number of independent draws of the missing data from a predictive distribution, and these draws are then used for multiple-imputation inference. In many cases it is possible to conduct an analysis either by parameter simulation or by multiple imputation. Parameter simulation tends to work well when interest is confined to a small number of well-defined parameters, whereas multiple imputation is more attractive for exploratory or multi-purpose analyses involving a large number of estimands. Generating and storing 10 versions of the missing data is often more efficient than generating and storing the hundreds or thousands of dependent draws that would be required to achieve a comparable degree of precision through parameter simulation.

Back to top


Can multiple imputations be generated nonparametrically?

In some cases, it is possible to create proper multiple imputations with minimal distributional assumptions. Consider a univariate sample Y = (y1, y2, ..., yn) where the first a < n values are observed and the remaining n - a values are missing. Rubin (1987) describes a simple method called the approximate Bayesian bootstrap (ABB) in which one creates (a) a new pool of respondents by sampling a values from (y1, y2, ..., ya) with replacement, and then (b) a set of imputed data by sampling n - a values with replacement from the pool obtained in (a). The method, which is most appropriate for large samples, produces imputations under a multinomial model with categories corresponding to the distinct values seen in the sample. The resampling of respondents in part (a) approximates a draw of the multinomial probabilities from a Bayesian posterior distribution under a Dirichlet prior.
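A minimal Python sketch of the ABB (the function and variable names are mine, for illustration only):

```python
import numpy as np

def abb_impute(y_obs, n_mis, rng):
    """One approximate Bayesian bootstrap imputation of n_mis missing values."""
    pool = rng.choice(y_obs, size=len(y_obs), replace=True)  # (a) resample the respondents
    return rng.choice(pool, size=n_mis, replace=True)        # (b) draw imputations from the pool

rng = np.random.default_rng(0)
y_obs = rng.normal(size=80)                                  # a = 80 observed values
imputations = [abb_impute(y_obs, 20, rng) for _ in range(5)] # m = 5 sets of n - a = 20 imputed values
```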

What if the missingness of Y is related to covariates X = (X1, X2, ..., Xp)? The ABB can be extended in a variety of ways to incorporate the additional information provided by X. If the covariates are discrete and the number of respondents a is sufficiently large, it may be possible to partition the sample into cells corresponding to unique patterns of X1, X2, ..., Xp and carry out the ABB procedure within each cell. With continuous covariates or large p, this strategy tends to be ineffective because the observed data become too sparse. In these situations, Lavori, Dawson and Shera (1995) suggest defining response indicators R = (r1, r2, ..., rn), where ri = 1 if unit i responded and ri = 0 if not, and modeling the response propensities pi = P(ri = 1) by logistic regression on the covariates X. The sample may then be partitioned into cells defined by coarse grouping (e.g. quintiles) of the estimated pi, and the ABB procedure can be performed within each cell. This strategy produces valid inferences about quantities pertaining to the distribution of Y when probabilities of missingness depend on X; grouping by response propensity effectively eliminates distortions that arise when respondents and nonrespondents differ in their X-distributions. This approach to MI has been implemented in a new commercial software product called SOLAS.
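The following Python sketch of the propensity-grouping strategy is only an illustration of the idea described by Lavori, Dawson and Shera (1995); it is not SOLAS code, and the function name, the use of scikit-learn's LogisticRegression, and the quintile grouping are my own assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def propensity_abb(y, r, X, n_cells=5, seed=0):
    """One imputation of y via the response-propensity ABB.
    y: outcome (arbitrary values where r == 0); r: 1 = observed, 0 = missing;
    X: (n, p) covariate matrix with no missing values."""
    rng = np.random.default_rng(seed)
    # Estimate response propensities by logistic regression of r on X.
    p = LogisticRegression().fit(X, r).predict_proba(X)[:, 1]
    # Group cases into n_cells cells by quantiles of the estimated propensity.
    cuts = np.quantile(p, np.linspace(0, 1, n_cells + 1)[1:-1])
    cells = np.digitize(p, cuts)
    y_imp = y.astype(float).copy()
    for c in np.unique(cells):
        obs = (cells == c) & (r == 1)
        mis = (cells == c) & (r == 0)
        if mis.any() and obs.any():
            pool = rng.choice(y[obs], size=obs.sum(), replace=True)      # ABB step (a)
            y_imp[mis] = rng.choice(pool, size=mis.sum(), replace=True)  # ABB step (b)
    return y_imp
```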


What about SOLAS?

SOLAS (Statistical Solutions, 1998) implements the ABB procedure based on estimated response propensities as described above. It is important to note that the imputations produced by SOLAS are effective for analyses pertaining to the distribution of the incomplete variable Y, but they are not appropriate in general for analyses involving relationships between Y and the covariates X used to create the imputations. Consider a hypothetical covariate Xj that is highly correlated with the response Y but unrelated to the missingness indicators R. Imputed values for Y will bear no relationship to Xj because that variable has no influence in the logistic regression model; a multiple imputation-based estimate of the correlation between Xj and Y will be biased toward zero. The response-propensity ABB is unable to preserve many important features of the joint distribution of Y and the covariates X.

In short, the SOLAS method works well for inferences regarding aspects of the distribution of Y, but can be quite misleading for more general uses regarding relationships among variables (e.g. regression modeling).

In the situation described above, where we have a covariate X that is correlated with Y but unrelated to missingness for Y, the natural solution is to incorporate X into the imputation procedure for Y. In simple settings this can be done nonparametrically, e.g. by forming imputation classes based on groupings of X. As more covariates are added, distinct classifications by X become awkward and model-based methods become more attractive, particularly when we want to preserve the basic correlations among variables rather than high-order interaction effects.

Please note that I have described the SOLAS procedures that were available in early 1999. Later versions of SOLAS may incorporate additional model-based options which can help preserve correlation structure.


What if the missing data are not 'missing at random'?

Most of the techniques presently available for creating MI's assume that the missing values are 'missing at random' (MAR) in the sense defined by Rubin (1976) and Little and Rubin (1987). That is, they assume that missing data values carry no information about probabilities of missingness. This assumption is mathematically convenient because it allows one to eschew an explicit probability model for nonresponse. In some applications, however, ignorability may seem artificial or implausible. With attrition in a longitudinal study, for example, it is possible that subjects drop out for reasons related to current data values. It is important to note that the MI paradigm does not require or assume that nonresponse is ignorable. Imputations may in principle be created under any kind of assumptions or model for the missing-data mechanism, and the resulting inferences will be valid under that mechanism.

Back to top


Isn't multiple imputation just making up data?

When MI is presented to a new audience, some may view it as a kind of statistical alchemy in which information is somehow invented or created out of nothing. This objection is quite valid for single-imputation methods, which treat imputed values no differently from observed ones. MI, however, is nothing more than a device for representing missing-data uncertainty. Information is not being invented with MI any more than with EM or other well accepted likelihood-based methods, which average over a predictive distribution for the missing data by numerical techniques rather than by simulation.

Back to top


References

Bryk, A.S. and Raudenbush, S.W. (1992) Hierarchical Linear Models. Sage, Newbury Park.

Gilks, W.R., Richardson, S. & Spiegelhalter, D.J. (Eds.) (1996) Markov Chain Monte Carlo in Practice. Chapman & Hall, London.

Lavori, P.W., Dawson, R., and Shera, D. (1995) A multiple imputation strategy for clinical trials with truncation of patient data. Statistics in Medicine, 14, 1913-1925.

Little, R.J.A. and Rubin, D.B. (1987) Statistical Analysis with Missing Data. J. Wiley & Sons, New York.

Meng, X.L. (1995) Multiple-imputation inferences with uncongenial sources of input (with discussion). Statistical Science, 10, 538-573.

Rubin, D.B. (1976) Inference and missing data. Biometrika, 63, 581-592.

Rubin, D.B. (1987) Multiple Imputation for Nonresponse in Surveys. J. Wiley & Sons, New York.

Rubin, D.B. (1996) Multiple imputation after 18+ years (with discussion). Journal of the American Statistical Association, 91, 473-489.

Schafer, J.L. (1997) Analysis of Incomplete Multivariate Data. Chapman & Hall, London.

Schafer, J.L. (1999) Multiple imputation: a primer. Statistical Methods in Medical Research, in press.

Schafer, J.L. and Olsen, M.K. (1998) Multiple imputation for multivariate missing-data problems: a data analyst's perspective. Multivariate Behavioral Research, 33, 545-571.

Statistical Solutions, Inc. (1998) SOLAS for Missing Data Analysis, Version 1. Cork, Ireland: Statistical Solutions.

Back to top


Go to software for multiple imputation

Go to Joe Schafer's home page