The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning | Political Analysis | Cambridge Core

Abstract

Principled methods for analyzing missing values, based chiefly on multiple imputation, have become increasingly popular yet can struggle to handle the kinds of large and complex data that are also becoming common. We propose an accurate, fast, and scalable approach to multiple imputation, which we call MIDAS (Multiple Imputation with Denoising Autoencoders). MIDAS employs a class of unsupervised neural networks known as denoising autoencoders, which are designed to reduce dimensionality by corrupting and attempting to reconstruct a subset of data. We repurpose denoising autoencoders for multiple imputation by treating missing values as an additional portion of corrupted data and drawing imputations from a model trained to minimize the reconstruction error on the originally observed portion. Systematic tests on simulated as well as real social science data, together with an applied example involving a large-scale electoral survey, illustrate MIDAS’s accuracy and efficiency across a range of settings. We provide open-source software for implementing MIDAS.
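The core idea in the abstract (corrupt the inputs, treat missing values as one more piece of corruption, and score reconstruction only on originally observed entries) can be sketched in a few lines of NumPy. This is an illustrative toy, not the authors' open-source software: the function names `train_dae` and `impute`, the single hidden layer, the zero-filling of missing entries, the toy data, and the use of random input corruption at prediction time to generate several distinct imputations are all assumptions made for this example.

```python
import numpy as np

def _corrupt(X0, rate, rng):
    # Randomly zero out inputs (the "denoising" corruption), rescaling the rest.
    keep = rng.random(X0.shape) >= rate
    return X0 * keep / (1.0 - rate)

def _forward(params, X_in):
    W1, b1, W2, b2 = params
    H = np.tanh(X_in @ W1 + b1)
    return H, H @ W2 + b2

def train_dae(X, observed, hidden=16, rate=0.5, lr=0.05, epochs=300, seed=0):
    """Fit a one-hidden-layer denoising autoencoder by gradient descent.

    `observed` is a boolean mask; missing entries of X may be NaN.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    X0 = np.where(observed, X, 0.0)  # assumption: missing entries are zero-filled
    W1 = rng.normal(0.0, 0.1, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, d)); b2 = np.zeros(d)
    n_obs = observed.sum()
    for _ in range(epochs):
        X_in = _corrupt(X0, rate, rng)
        H, X_hat = _forward((W1, b1, W2, b2), X_in)
        # Squared reconstruction error counted on observed entries only.
        g_out = 2.0 * (X_hat - X0) * observed / n_obs
        g_pre = (g_out @ W2.T) * (1.0 - H ** 2)
        W2 -= lr * (H.T @ g_out); b2 -= lr * g_out.sum(axis=0)
        W1 -= lr * (X_in.T @ g_pre); b1 -= lr * g_pre.sum(axis=0)
    return W1, b1, W2, b2

def impute(params, X, observed, m=5, rate=0.5, seed=1):
    """Return m completed copies of X; observed values are left untouched."""
    rng = np.random.default_rng(seed)
    X0 = np.where(observed, X, 0.0)
    draws = []
    for _ in range(m):
        # Assumption: fresh input corruption per draw makes the imputations vary.
        _, X_hat = _forward(params, _corrupt(X0, rate, rng))
        draws.append(np.where(observed, X, X_hat))
    return np.stack(draws)

# Toy data: five correlated columns with roughly 20% of entries missing.
rng = np.random.default_rng(42)
z = rng.normal(size=(200, 1))
X_full = z @ np.array([[1.0, 0.8, -0.6, 0.5, 0.9]]) + 0.3 * rng.normal(size=(200, 5))
observed = rng.random(X_full.shape) > 0.2
X = np.where(observed, X_full, np.nan)

params = train_dae(X, observed)
draws = impute(params, X, observed, m=5)  # shape: (m, n_rows, n_columns)
```

Each of the `m` completed datasets would then be analyzed separately and the results combined, as in any multiple-imputation workflow. A realistic implementation would add deeper layers, loss functions suited to categorical variables, and mini-batch training.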
