A passive and inclusive strategy to impute missing values of a composite categorical variable with an application to determine HIV transmission categories

Multiple imputation of unordered categorical missing data: A comparison of the multivariate normal imputation and multiple imputation by chained equations

Brazilian Journal of Probability and Statistics, 2016

Missing data are common in survey data sets. Enrolled subjects often do not have data recorded for all variables of interest. Handling missing data inappropriately may negatively affect the inferences drawn, so special attention is needed when analysing incomplete data. The multivariate normal imputation (MVNI) and the multiple imputation by chained equations (MICE) have emerged as the best techniques to deal with missing data. The former assumes a normal distribution of the variables in the imputation model, whereas the latter fills in missing values taking into account the distributional form of the variables to be imputed. This study examines the performance of these methods when data are missing at random on unordered categorical variables treated as predictors in regression models. First, a survey data set with no missing values is used to generate a data set with missing-at-random observations on unordered categorical variables. Then, the two methods are separately used to impute the missing values of the generated data set. Their performance is compared in terms of the bias and standard errors of the estimates from regression models that determine the association between a woman's contraceptive method use status and her marital status, controlling for the region of origin. The baseline data are from the 2007 Demographic and Health Survey (DHS) of the Democratic Republic of Congo. The findings indicate that although MVNI relies on parametric statistical theory, it produces more accurate estimates than MICE for unordered categorical variables.
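The first step of that design, deleting values from a complete data set so that they are missing at random, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the study: the variable names `region` and `marital`, the logistic form of the missingness model, and the coefficients `base_logit` and `shift`. The only property the sketch is meant to show is that the chance of a value being deleted depends on an always-observed variable and not on the deleted value itself.

```python
import math
import random


def make_mar(rows, target, driver, base_logit=-1.5, shift=1.5, seed=0):
    """Return a copy of `rows` in which `target` is set to None with a
    probability that depends only on the fully observed `driver` (MAR).

    The logistic missingness model and its coefficients are illustrative
    assumptions, not the mechanism used in any particular study.
    """
    rng = random.Random(seed)
    levels = sorted({r[driver] for r in rows})
    span = max(len(levels) - 1, 1)
    out = []
    for r in rows:
        # The missingness logit rises with the position of the driver level.
        logit = base_logit + shift * levels.index(r[driver]) / span
        p_miss = 1.0 / (1.0 + math.exp(-logit))
        new = dict(r)  # copy so the complete data stay untouched
        if rng.random() < p_miss:
            new[target] = None
        out.append(new)
    return out


# Hypothetical complete data: two unordered categorical variables.
_rng = random.Random(1)
complete = [
    {
        "region": _rng.choice(["east", "north", "west"]),
        "marital": _rng.choice(["single", "married", "widowed"]),
    }
    for _ in range(2000)
]

incomplete = make_mar(complete, target="marital", driver="region")

# Observed missingness rate of `marital` within each region.
rate = {}
for g in ["east", "north", "west"]:
    group = [r for r in incomplete if r["region"] == g]
    rate[g] = sum(r["marital"] is None for r in group) / len(group)
```

Because the complete data are kept, estimates obtained after imputing `incomplete` can be compared with the estimates from `complete` to measure bias.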

Direct and Unbiased Multiple Imputation Methods for Missing Values of Categorical Variables

2021

Missing data is a common problem in statistical analyses. To make use of information in data with incomplete observation, missing values can be imputed so that standard statistical methods can be used to analyze the data. Variables with missing values are often categorical and the missing pattern may not be monotone. Currently, commonly used imputation methods for data with a non-monotone missing pattern do not allow direct inclusion of categorical variables. Categorical variables are converted to numerical variables before imputation. For many applications, the imputed numerical values for those categorical variables must then be converted back to categorical values. However, this conversion introduces bias which can seriously affect subsequent analyses. In this paper, we propose two direct imputation methods for categorical variables with a non-monotone missing pattern: the direct imputation approach incorporated with the expectation-maximization algorithm and the direct imputation...

Evaluation of Four Multiple Imputation Methods for Handling Missing Binary Outcome Data in the Presence of an Interaction between a Dummy and a Continuous Variable

Journal of Probability and Statistics, 2021

Multiple imputation by chained equations (MICE) is the most common method for imputing missing data. In the MICE algorithm, imputation can be performed using a variety of parametric and nonparametric methods. The default setting in the implementation of MICE is for imputation models to include variables as linear terms only with no interactions, but omission of interaction terms may lead to biased results. It is investigated, using simulated and real datasets, whether recursive partitioning creates appropriate variability between imputations and unbiased parameter estimates with appropriate confidence intervals. We compared four multiple imputation (MI) methods on a real and a simulated dataset. MI methods included using predictive mean matching with an interaction term in the imputation model in MICE (MICE-interaction), classification and regression tree (CART) for specifying the imputation model in MICE (MICE-CART), the implementation of random forest (RF) in MICE (MICE-RF), and M...

A method for comparing multiple imputation techniques: a case study on the U.S. National COVID Cohort Collaborative

arXiv (Cornell University), 2022

Healthcare datasets obtained from Electronic Health Records have proven to be extremely useful to assess associations between patients' predictors and outcomes of interest. However, these datasets often suffer from missing values in a high proportion of cases, and the simple removal of these cases may introduce severe bias. For these reasons, several multiple imputation algorithms have been proposed to attempt to recover the missing information. Each algorithm presents strengths and weaknesses, and there is currently no consensus on which multiple imputation algorithm works best in a given scenario. Furthermore, the selection of each algorithm's parameters and of data-related modelling choices is both crucial and challenging. In this paper, we propose a novel framework to numerically evaluate strategies for handling missing data in the context of statistical analysis, with a particular focus on multiple imputation techniques. We demonstrate the feasibility of our approach on a large cohort of type-2 diabetes patients provided by the National COVID Cohort Collaborative (N3C) Enclave, where we explored the influence of various patient characteristics on outcomes related to COVID-19. Our analysis included classic multiple imputation techniques as well as simple complete-case Inverse Probability Weighted models. The experiments presented here show that our approach could effectively highlight the most valid and performant missing-data handling strategy for our case study. Moreover, our methodology allowed us to gain an understanding of the behavior of the different models and of how it changed as we modified their parameters. Our method is general and can be applied to different research fields and on datasets containing heterogeneous types.

Evaluations of imputation methods for missing data with random forest modeling: an application with unbalanced data with a three categories outcome, the linkage to HIV care Uganda study

Background: Incomplete observation units may contain important information about the population being studied. Analysis conducted with only the complete portion of the dataset, as is done in most standard statistical software, may produce biased results or results with low statistical power. In this paper we investigate the behavior of imputation methods for a linkage to HIV care study dataset with more than 60% incomplete cases. Methods: Missing data imputation algorithms amelia, missForest, mice and hmisc were considered. Two sets of simulations were conducted: first with the subset of data containing only complete observations and second with the whole dataset. Imputation accuracy and general behavior for each imputation algorithm were assessed in the first set. Random forest models were fit to evaluate overall prediction accuracy and sensitivity in both sets of simulations. Results: The imputed values by missForest, a single imputation method, were more accurate for all incomplete variables and scenario...

Using Multiple Imputation and Inverse Probability Weighting to Adjust for Missing Data in HIV Prevalence Estimates: A Cross-Sectional Study in Mwanza, North Western Tanzania

2021

Background: Population surveys and demographic studies are the gold standard for estimating HIV prevalence. However, non-response in these surveys is of major concern, especially if it is not random, in which case complete case analysis becomes an inappropriate method to analyse the data. Therefore, a comprehensive analysis that accounts for the missing data must be used to obtain unbiased HIV prevalence estimates. Methods: Serological samples were collected from participants who were resident in a Demographic Surveillance System (DSS) in Kisesa, Tanzania. HIV prevalence was estimated using three methods. The first was complete case analysis (CCA), which assumes that data are Missing Completely at Random (MCAR). The other two methods, multiple imputation (MI) and inverse probability weighting (IPW), assumed that data were missing at random (MAR). For MI, a logistic regression model adjusting for age, sex, residence, and marital status was used to impute 20 datasets to re-estimate the HIV p...
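The contrast between complete case analysis and inverse probability weighting can be shown with a small worked sketch. It is not the study's procedure: the response probability is estimated here as the observed response rate within each stratum rather than from a regression model, and the stratum variable `area`, the field name `hiv`, and all counts are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
from collections import defaultdict


def cca_prevalence(records):
    """Complete case estimate: ignore everyone with a missing test result."""
    observed = [r["hiv"] for r in records if r["hiv"] is not None]
    return sum(observed) / len(observed)


def ipw_prevalence(records, stratum_key):
    """Inverse-probability-weighted estimate.

    Each respondent is weighted by 1 / (response rate in their stratum),
    which is a simple cell-based estimate of the response probability.
    """
    total = defaultdict(int)
    responded = defaultdict(int)
    for r in records:
        total[r[stratum_key]] += 1
        if r["hiv"] is not None:
            responded[r[stratum_key]] += 1

    num = den = 0.0
    for r in records:
        if r["hiv"] is None:
            continue
        s = r[stratum_key]
        weight = total[s] / responded[s]
        num += weight * r["hiv"]
        den += weight
    return num / den


# Hypothetical data: area B has higher prevalence and lower response.
records = (
    [{"area": "A", "hiv": 1}] * 8
    + [{"area": "A", "hiv": 0}] * 72
    + [{"area": "A", "hiv": None}] * 20
    + [{"area": "B", "hiv": 1}] * 12
    + [{"area": "B", "hiv": 0}] * 28
    + [{"area": "B", "hiv": None}] * 60
)

cca = cca_prevalence(records)          # 20 positives / 120 respondents
ipw = ipw_prevalence(records, "area")  # respondents in B count 2.5 times each
```

In this constructed example the complete case estimate is 20/120, about 0.167, while the weighted estimate is 0.20, because respondents from the under-represented, higher-prevalence area are weighted up.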

Multiple Imputation by Fully Conditional Specification for Dealing with Missing Data in a Large Epidemiologic Study

International Journal of Statistics in Medical Research, 2015

Missing data commonly occur in large epidemiologic studies. Ignoring incompleteness or handling the data inappropriately may bias study results, reduce power and efficiency, and alter important risk/benefit relationships. Standard ways of dealing with missing values, such as complete case analysis (CCA), are generally inappropriate due to the loss of precision and risk of bias. Multiple imputation by fully conditional specification (FCS MI) is a powerful and statistically valid method for creating imputations in large data sets which include both categorical and continuous variables. It specifies the multivariate imputation model on a variable-by-variable basis and offers a principled yet flexible method of addressing missing data, which is particularly useful for large data sets with complex data structures. However, FCS MI is still rarely used in epidemiology, and few practical resources exist to guide researchers in the implementation of this technique. We demonstrate the application of FCS MI in support of a large epidemiologic study evaluating national blood utilization patterns in a sub-Saharan African country. A number of practical tips and guidelines for implementing FCS MI based on this experience are described.
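Whatever method creates the imputations, an analysis of m completed data sets yields m estimates that are usually combined with Rubin's rules: the pooled estimate is the mean of the per-data-set estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The sketch below illustrates that pooling step only; the function name `pool_rubin` and the input numbers are made up for illustration and do not come from the study described above.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors.

    Returns (pooled_estimate, total_variance) following Rubin's rules.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching inputs")
    q_bar = mean(estimates)   # pooled point estimate
    within = mean(variances)  # average within-imputation variance
    between = variance(estimates)  # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Hypothetical results from m = 3 imputed data sets.
pooled, total_var = pool_rubin([0.10, 0.12, 0.14], [0.0004, 0.0004, 0.0004])
```

With these inputs the between-imputation variance equals 0.0004, so the total variance is 0.0004 + (4/3) × 0.0004, which is larger than any single within-imputation variance and reflects the extra uncertainty due to the missing data.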

An Empirical Comparison of Multiple Imputation Methods for Categorical Data

The American Statistician, 2017

Multiple imputation is a common approach for dealing with missing values in statistical databases. The imputer fills in missing values with draws from predictive models estimated from the observed data, resulting in multiple, completed versions of the database. Researchers have developed a variety of default routines to implement multiple imputation; however, there has been limited research comparing the performance of these methods, particularly for categorical data. We use simulation studies to compare repeated sampling properties of three default multiple imputation methods for categorical data, including chained equations using generalized linear models, chained equations using classification and regression trees, and a fully Bayesian joint distribution based on Dirichlet Process mixture models. We base the simulations on categorical data from the American Community Survey. In the circumstances of this study, the results suggest that default chained equations approaches based on generalized linear models

Practical considerations for sensitivity analysis after multiple imputation applied to epidemiological studies with incomplete data

BMC Medical Research Methodology, 2012

Background: Multiple Imputation as usually implemented assumes that data are Missing At Random (MAR), meaning that the underlying missing data mechanism, given the observed data, is independent of the unobserved data. To explore the sensitivity of the inferences to departures from the MAR assumption, we applied the method proposed by . This approach aims to approximate inferences under a Missing Not At Random (MNAR) mechanism by reweighting estimates obtained after multiple imputation, where the weights depend on the assumed degree of departure from the MAR assumption. Methods: The method is illustrated with epidemiological data from a surveillance system of hepatitis C virus (HCV) infection in France during the 2001-2007 period. The subpopulation studied included 4343 HCV-infected patients who reported drug use. Risk factors for severe liver disease were assessed. After performing complete-case and multiple imputation analyses, we applied the sensitivity analysis to 3 risk factors of severe liver disease: past excessive alcohol consumption, HIV co-infection and infection with HCV genotype 3. Results: In these data, the association between severe liver disease and HIV was underestimated if, given the observed data, the chance of observing HIV status is high when the status is positive. Inferences for the two other risk factors were robust to plausible local departures from the MAR assumption. Conclusions: We have demonstrated the practical utility of, and advocate, a pragmatic, widely applicable approach to exploring plausible departures from the MAR assumption after multiple imputation. We have developed guidelines for applying this approach to epidemiological studies.

Comparison of Selected Multiple Imputation Methods for Continuous Variables – Preliminary Simulation Study Results

Acta Universitatis Lodziensis. Folia Oeconomica

The problem of incomplete data and its implications for drawing valid conclusions from statistical analyses is not related to any particular scientific domain; it arises in economics, sociology, education, behavioural sciences and medicine. Almost all standard statistical methods presume that every object has information on every variable to be included in the analysis, and the typical approach to missing data is simply to delete the incomplete objects. However, this leads to ineffective and biased analysis results and is not recommended in the literature. The state-of-the-art technique for handling missing data is multiple imputation. In the paper, some selected multiple imputation methods were taken into account. Special attention was paid to using principal components analysis (PCA) as an imputation method. The goal of the study was to assess the quality of PCA‑based imputations as compared to two other multiple imputation techniques: multivariate imputation by chained equations (MICE) and missFores...