The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning | Political Analysis | Cambridge Core

Abstract

Principled methods for analyzing missing values, based chiefly on multiple imputation, have become increasingly popular yet can struggle to handle the kinds of large and complex data that are also becoming common. We propose an accurate, fast, and scalable approach to multiple imputation, which we call MIDAS (Multiple Imputation with Denoising Autoencoders). MIDAS employs a class of unsupervised neural networks known as denoising autoencoders, which are designed to reduce dimensionality by corrupting and attempting to reconstruct a subset of data. We repurpose denoising autoencoders for multiple imputation by treating missing values as an additional portion of corrupted data and drawing imputations from a model trained to minimize the reconstruction error on the originally observed portion. Systematic tests on simulated as well as real social science data, together with an applied example involving a large-scale electoral survey, illustrate MIDAS’s accuracy and efficiency across a range of settings. We provide open-source software for implementing MIDAS.
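To make the mechanism described in the abstract concrete, here is a minimal NumPy sketch of the idea: missing cells are treated as already-corrupted inputs, further cells are corrupted at random, and the reconstruction loss is computed only on the originally observed cells. It is an illustration under stated assumptions, not the authors' released software. The function names (`fit_dae`, `draw_imputations`), the single tanh hidden layer, the plain gradient-descent loop, the toy data, and the choice to draw each imputation from a fresh randomly corrupted forward pass are all assumptions made for this example.

```python
import numpy as np

def fit_dae(X, hidden=8, epochs=1500, lr=0.02, corrupt_p=0.2, seed=0):
    """Train a one-hidden-layer denoising autoencoder on data containing NaNs."""
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(X)                  # True where a value was recorded
    X0 = np.where(observed, X, 0.0)          # missing cells enter as zeroed (corrupted) inputs
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, d)); b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        keep = rng.random((n, d)) > corrupt_p    # additional random corruption
        X_in = X0 * keep
        H = np.tanh(X_in @ W1 + b1)              # encoder
        X_hat = H @ W2 + b2                      # decoder (linear output)
        err = (X_hat - X0) * observed            # error counted on observed cells only
        losses.append(float((err ** 2).sum() / n))
        g_out = 2.0 * err / n                    # gradient of the masked loss w.r.t. X_hat
        g_H = (g_out @ W2.T) * (1.0 - H ** 2)    # backpropagate through tanh
        W2 -= lr * (H.T @ g_out); b2 -= lr * g_out.sum(axis=0)
        W1 -= lr * (X_in.T @ g_H); b1 -= lr * g_H.sum(axis=0)
    return (W1, b1, W2, b2), losses

def draw_imputations(X, params, m=5, corrupt_p=0.2, seed=1):
    """Return m completed copies of X; observed cells are left untouched."""
    rng = np.random.default_rng(seed)
    W1, b1, W2, b2 = params
    observed = ~np.isnan(X)
    X0 = np.where(observed, X, 0.0)
    completed = []
    for _ in range(m):
        keep = rng.random(X.shape) > corrupt_p   # assumption: stochastic pass per draw
        X_hat = np.tanh((X0 * keep) @ W1 + b1) @ W2 + b2
        completed.append(np.where(observed, X, X_hat))
    return completed

# Toy data (hypothetical): four correlated columns with about 15% of cells deleted.
rng = np.random.default_rng(42)
z = rng.normal(size=(200, 1))
X_full = z @ np.array([[1.0, -0.8, 0.5, 0.9]]) + 0.3 * rng.normal(size=(200, 4))
X_miss = X_full.copy()
X_miss[rng.random(X_full.shape) < 0.15] = np.nan

params, losses = fit_dae(X_miss)
imputed = draw_imputations(X_miss, params, m=5)
```

Each element of `imputed` is one completed dataset. In a multiple-imputation workflow, the analysis model would be fitted to each completed dataset separately and the estimates then combined across the `m` fits.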

Type

Article

Copyright

© The Author(s) 2021. Published by Cambridge University Press on behalf of the Society for Political Methodology
