Some Simplifications for the Expectation Maximization (EM) Algorithm: The Linear Regression Model Case

Handling Missing Data with Expectation Maximization Algorithm

Global Research and Development Journal for Engineering (GRDJE), Volume 6, Issue 11, pages 9-32, 2021

The expectation maximization (EM) algorithm is a powerful mathematical tool for estimating the parameters of statistical models in the presence of incomplete or hidden data. EM assumes a relationship between hidden data and observed data, which can be a joint distribution or a mapping function; this in turn implies an implicit relationship between parameter estimation and data imputation. If data containing missing values is treated as hidden data, it is natural to handle missing data with the EM algorithm. Handling missing data is not a new research topic, but this report focuses on the theoretical basis, with detailed mathematical proofs, for filling in missing values with EM. The multivariate normal distribution and the multinomial distribution are the two sample statistical models considered for holding missing values.
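As a concrete illustration of the approach this abstract describes, here is a minimal EM routine for imputing missing entries under a multivariate normal model. This is a sketch, not the paper's implementation; the function name, the diagonal initialization, and the fixed iteration count are my own choices.

```python
import numpy as np

def em_impute_mvn(X, n_iter=50):
    """Minimal EM for the mean and covariance of a multivariate normal with
    missing entries (NaN). Returns (mu, Sigma, X_imp), where X_imp replaces
    each NaN with its conditional expectation under the fitted model."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    miss = np.isnan(X)
    mu = np.nanmean(X, axis=0)                 # initialize from observed data
    Sigma = np.diag(np.nanvar(X, axis=0))
    X_imp = np.where(miss, mu, X)
    for _ in range(n_iter):
        C = np.zeros((p, p))                   # accumulated conditional covariances
        for i in range(n):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            if o.any():
                # E-step: E[x_m | x_o] and Cov[x_m | x_o] under current (mu, Sigma)
                B = Sigma[np.ix_(m, o)] @ np.linalg.inv(Sigma[np.ix_(o, o)])
                X_imp[i, m] = mu[m] + B @ (X[i, o] - mu[o])
                C[np.ix_(m, m)] += Sigma[np.ix_(m, m)] - B @ Sigma[np.ix_(m, o)].T
            else:                              # fully missing row
                X_imp[i] = mu
                C += Sigma
        # M-step: complete-data ML estimates, including the conditional
        # covariance correction that plain mean imputation would omit
        mu = X_imp.mean(axis=0)
        D = X_imp - mu
        Sigma = (D.T @ D + C) / n
    return mu, Sigma, X_imp
```

The C term in the M-step is what distinguishes EM from naive "impute then re-estimate" schemes: it restores the variance that conditional-mean imputation removes.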

Direct and Unbiased Multiple Imputation Methods for Missing Values of Categorical Variables

2021

Missing data is a common problem in statistical analyses. To make use of the information in data with incomplete observations, missing values can be imputed so that standard statistical methods can be used to analyze the data. Variables with missing values are often categorical, and the missing pattern may not be monotone. Currently, commonly used imputation methods for data with a non-monotone missing pattern do not allow direct inclusion of categorical variables: categorical variables are converted to numerical variables before imputation. For many applications, the imputed numerical values for those categorical variables must then be converted back to categorical values. However, this conversion introduces bias that can seriously affect subsequent analyses. In this paper, we propose two direct imputation methods for categorical variables with a non-monotone missing pattern: the direct imputation approach incorporated with the expectation-maximization algorithm and the direct imputation...

ML Estimation of Mean and Covariance Structures with Missing Data Using Complete Data Routines

Journal of Educational and Behavioral Statistics, 1999

We consider maximum likelihood (ML) estimation of mean and covariance structure models when data are missing. Expectation maximization (EM), generalized expectation maximization (GEM), Fletcher-Powell, and Fisher scoring algorithms are described for parameter estimation. It is shown how the machinery within software that handles the complete-data problem can be utilized to implement each algorithm. A numerical differentiation method for obtaining the observed information matrix and the standard errors is given; this method, too, uses the complete-data program machinery. The likelihood ratio test is discussed for testing hypotheses. Three examples are used to compare the cost of the four algorithms mentioned above, as well as to illustrate the standard error estimation and the hypothesis test considered.

From predictive methods to missing data imputation

Journal of Machine Learning Research, 2017

Missing data is a common problem in real-world settings and for this reason has attracted significant attention in the statistical literature. We propose a flexible framework based on formal optimization to impute missing data with mixed continuous and categorical variables. This framework can readily incorporate various predictive models including K-nearest neighbors, support vector machines, and decision-tree-based methods, and can be adapted for multiple imputation. We derive fast first-order methods that obtain high-quality solutions in seconds within a general imputation algorithm, opt.impute, presented in this paper. We demonstrate that our proposed method improves out-of-sample accuracy in large-scale computational experiments across a sample of 84 data sets taken from the UCI Machine Learning Repository. In all scenarios of missing at random mechanisms and various missing percentages, opt.impute produces the best overall imputation in most data sets benchmarked against five other methods: mean impute, K-nearest neighbors, iterative knn, Bayesian PCA, and predictive-mean matching, with an average reduction in mean absolute error of 8.3% against the best cross-validated benchmark method. Moreover, opt.impute leads to improved out-of-sample performance of learning algorithms trained using the imputed data, demonstrated by computational experiments on 10 downstream tasks. For models trained using opt.impute single imputations with 50% data missing, the average out-of-sample R² is 0.339 in the regression tasks and the average out-of-sample accuracy is 86.1% in the classification tasks, compared to 0.315 and 84.4% for the best cross-validated benchmark method. In the multiple imputation setting, downstream models trained using opt.impute obtain a statistically significant improvement over models trained using multivariate imputation by chained equations (mice) in 8/10 missing data scenarios considered.
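For context on one of the benchmark methods named above, here is a hedged sketch of plain K-nearest-neighbor imputation. This is not the opt.impute algorithm; the function name and the choice of RMS distance over jointly observed entries are illustrative assumptions.

```python
import numpy as np

def knn_impute(X, k=5):
    """Fill each NaN with the mean of that column over the k nearest rows,
    with distances computed on the entries both rows observe."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    miss = np.isnan(X)
    for i in np.where(miss.any(axis=1))[0]:
        d = np.full(len(X), np.inf)
        for j in range(len(X)):
            if j == i:
                continue
            both = ~miss[i] & ~miss[j]
            if both.any():                      # RMS distance on shared entries
                d[j] = np.sqrt(np.mean((X[i, both] - X[j, both]) ** 2))
        nbrs = np.argsort(d)[:k]
        for c in np.where(miss[i])[0]:
            vals = X[nbrs, c]
            vals = vals[~np.isnan(vals)]
            # fall back to the column mean if no neighbor observes column c
            out[i, c] = vals.mean() if vals.size else np.nanmean(X[:, c])
    return out
```

The O(n²) distance loop is fine for small data; the benchmarked implementations use indexing structures to scale.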

Performance Comparison of Imputation Algorithms on Missing at Random Data

2018

Missing data continues to be an issue in any field that deals with data, because almost all widely accepted, standard statistical methods assume complete data for all variables included in the analysis. Hence, in most studies statistical power is weakened and parameter estimates are biased, leading to weak conclusions and generalizations. Many studies have established that multiple imputation methods are effective ways of handling missing data. This paper examines three different imputation methods (predictive mean matching; Bayesian linear regression; linear regression, non-Bayesian) in the MICE package in the statistical software R, to ascertain which of the three imputes data that yields parameter estimates closest to those of the complete data, given different percentages of missingness. The paper extends the analysis by generating pseudo data from the original data to establish how the imputation methods perform under varying conditions.

Multiple Imputation for Missing Values with an Empirical Application

Journal of Risk & Control, 2021

Missing data are the most common problem in many research areas. For cross-section and time-series data, imputation can be a challenging problem. The most widely used method for filling missing observations is multiple imputation, which increases the amount of available data and thereby reduces the biases that may occur when observations with missing values are simply deleted. The main purpose of this paper is to employ a bootstrapping expectation-maximization (EM) algorithm to impute missing values, mainly in economic data. In the application we use a dataset consisting of annual panel data for the 27 countries of the European Union covering the period 2000-2017. The data were obtained from the databases of the World Bank and Eurostat, namely the Global Financial Development Database, the Standardized World Income Inequality Database by Solt (2019), and the World Development Indicators. Different indicators were chosen representing the development of the banking system and ...
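Bootstrap-EM imputation produces several completed datasets; the completed-data analyses are then typically combined with Rubin's rules. A minimal sketch for a scalar quantity of interest (the function name is mine, not from the paper):

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Rubin's rules: pool M completed-data point estimates and their
    variances into one estimate and a total variance that reflects
    between-imputation uncertainty."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    M = q.size
    q_bar = q.mean()                  # pooled point estimate
    u_bar = u.mean()                  # average within-imputation variance
    b = q.var(ddof=1)                 # between-imputation variance
    total = u_bar + (1 + 1 / M) * b   # Rubin's total variance
    return q_bar, total
```

The (1 + 1/M) factor accounts for using only a finite number of imputations; as M grows, the total variance approaches u_bar + b.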

Missing data imputation in multivariate t distribution with unknown degrees of freedom using expectation maximization algorithm and its stochastic variants

Model Assisted Statistics and Applications, 2020

Many researchers encounter the missing data problem. The phenomenon may be occasioned by data omission, non-response, death of respondents, or recording errors, among others. It is important to find an appropriate data imputation technique to fill in the missing positions. In this study, the expectation maximization (EM) algorithm and two of its stochastic variants, stochastic EM (SEM) and Monte Carlo EM (MCEM), are employed for missing data imputation and parameter estimation in the multivariate t distribution with unknown degrees of freedom. The imputation efficiencies of the three methods are then compared using the mean square error (MSE) criterion. SEM yields the lowest MSE, making it the most efficient method for data imputation when the data follow the multivariate t distribution. The algorithm's stochastic nature enables it to avoid local saddle points and achieve global maxima, ultimately increasing its efficiency. The EM and MCEM techniques yield almost similar results. Large sample d...
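To show how a stochastic variant differs from plain EM, the following sketch runs stochastic EM on a multivariate normal model for brevity (the paper's model is the multivariate t, which adds a degrees-of-freedom update). The S-step draws missing values from their conditional distribution instead of plugging in conditional means; all names and defaults here are illustrative.

```python
import numpy as np

def sem_impute(X, n_iter=60, burn=30, seed=0):
    """Stochastic EM sketch for a multivariate normal with NaN entries.
    S-step: draw missing values from their conditional distribution;
    M-step: complete-data ML estimates. The returned imputation averages
    the post-burn-in draws."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    miss = np.isnan(X)
    mu = np.nanmean(X, axis=0)
    Sigma = np.diag(np.nanvar(X, axis=0))
    Xc = np.where(miss, mu, X)
    acc = np.zeros_like(X)
    kept = 0
    for t in range(n_iter):
        for i in range(n):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            if o.any():
                B = Sigma[np.ix_(m, o)] @ np.linalg.inv(Sigma[np.ix_(o, o)])
                cm = mu[m] + B @ (X[i, o] - mu[o])
                cv = Sigma[np.ix_(m, m)] - B @ Sigma[np.ix_(m, o)].T
            else:                                  # fully missing row
                cm, cv = mu[m], Sigma[np.ix_(m, m)]
            # S-step: sample rather than plug in the conditional mean
            Xc[i, m] = rng.multivariate_normal(cm, cv)
        mu = Xc.mean(axis=0)
        D = Xc - mu
        Sigma = D.T @ D / n                        # M-step
        if t >= burn:
            acc += Xc
            kept += 1
    return acc / kept
```

Because each iteration perturbs the completed data, the parameter sequence can escape flat regions where deterministic EM stalls, which is the intuition behind the saddle-point remark in the abstract.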

A multivariate technique for multiply imputing missing values using a sequence of regression models

Survey Methodology, 2001

This article describes and evaluates a procedure for imputing missing values for a relatively complex data structure when the data are missing at random. The imputations are obtained by fitting a sequence of regression models and drawing values from the corresponding predictive distributions. The types of regression models used are linear, logistic, Poisson, generalized logit or a mixture of these depending on the type of variable being imputed. Two additional common features in the imputation process are incorporated: restriction to a relevant subpopulation for some variables and logical bounds or constraints for the imputed values. The restrictions involve subsetting the sample individuals that satisfy certain criteria while fitting the regression models. The bounds involve drawing values from a truncated predictive distribution. The development of this method was partly motivated by the analysis of two data sets which are used as illustrations. The sequential regression procedure is applied to perform multiple imputation analysis for the two applied problems. The sampling properties of inferences from multiply imputed data sets created using the sequential regression method are evaluated through simulated data sets.
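The sequential regression idea described above can be sketched for continuous variables only, assuming linear models throughout (a simplification of the paper's mixed linear/logistic/Poisson setup, without its subpopulation restrictions or bounds; all names are mine):

```python
import numpy as np

def chained_regression_impute(X, n_sweeps=10, seed=0):
    """Sequential (chained) regression imputation sketch for continuous
    variables: each variable with missing values is regressed on the others
    and its gaps are refilled with draws from the fitted predictive normal."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    miss = np.isnan(X)
    Xc = np.where(miss, np.nanmean(X, axis=0), X)   # start from mean fill
    for _ in range(n_sweeps):
        for j in range(p):
            mj = miss[:, j]
            if not mj.any():
                continue
            others = [c for c in range(p) if c != j]
            A = np.column_stack([np.ones(n), Xc[:, others]])
            beta, *_ = np.linalg.lstsq(A[~mj], Xc[~mj, j], rcond=None)
            resid = Xc[~mj, j] - A[~mj] @ beta
            sd = resid.std(ddof=len(beta))
            # draw from the predictive distribution, not just the fitted mean
            Xc[mj, j] = A[mj] @ beta + rng.normal(0, sd, mj.sum())
    return Xc
```

Drawing from the predictive distribution (rather than imputing fitted values) is what preserves variability across multiple imputations; the paper additionally truncates these draws when bounds apply.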

Selecting the model for multiple imputation of missing data: Just use an IC!

Statistics in Medicine, 2021

Multiple imputation and maximum likelihood estimation (via the expectation-maximization algorithm) are two well-known methods readily used for analyzing data with missing values. While these two methods are often considered as being distinct from one another, multiple imputation (when using improper imputation) is actually equivalent to a stochastic expectation-maximization approximation to the likelihood. In this article, we exploit this key result to show that familiar likelihood-based approaches to model selection, such as Akaike's information criterion (AIC) and the Bayesian information criterion (BIC), can be used to choose the imputation model that best fits the observed data. Poor choice of imputation model is known to bias inference, and while sensitivity analysis has often been used to explore the implications of different imputation models, we show that the data can be used to choose an appropriate imputation model via conventional model selection tools. We show that BIC can be consistent for selecting the correct imputation model in the presence of missing data. We verify these results empirically through simulation studies, and demonstrate their practicality on two classical missing data examples. An interesting result we saw in simulations was that parameter estimates can be biased not only by misspecifying the imputation model, but also by overfitting it. This emphasizes the importance of using model selection not just to choose the appropriate type of imputation model, but also to decide on the appropriate level of imputation model complexity.
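The selection step described here can be sketched as follows, assuming Gaussian candidate imputation models scored by their observed-data log-likelihood (a deliberate simplification of the paper's framework; function names and parameter counts are illustrative):

```python
import numpy as np

def obs_loglik(X, mu, Sigma):
    """Observed-data log-likelihood under a multivariate normal: each row
    contributes the marginal density of whatever entries it observes."""
    ll = 0.0
    for x in X:
        o = ~np.isnan(x)
        if not o.any():
            continue
        d = x[o] - mu[o]
        S = Sigma[np.ix_(o, o)]
        _, logdet = np.linalg.slogdet(S)
        ll -= 0.5 * (o.sum() * np.log(2 * np.pi) + logdet
                     + d @ np.linalg.solve(S, d))
    return ll

def bic(loglik, n_params, n_rows):
    """Bayesian information criterion: lower is better."""
    return n_params * np.log(n_rows) - 2 * loglik
```

A candidate imputation model that ignores real dependence (e.g., a diagonal covariance) fits the observed data worse, and BIC picks this up directly without imputing anything first.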