Using a Probabilistic Model to Assist Merging of Large-Scale Administrative Records | American Political Science Review | Cambridge Core

Abstract

Since most social science research relies on multiple data sources, merging data sets is an essential part of researchers’ workflow. Unfortunately, a unique identifier that unambiguously links records is often unavailable, and data may contain missing and inaccurate information. These problems are especially severe when merging large-scale administrative records. We develop a fast and scalable algorithm to implement a canonical model of probabilistic record linkage that has many advantages over the deterministic methods frequently used by social scientists. The proposed methodology efficiently handles millions of observations while accounting for missing data and measurement error, incorporating auxiliary information, and adjusting for uncertainty about merging in post-merge analyses. We conduct comprehensive simulation studies to evaluate the performance of our algorithm in realistic scenarios. We also apply our methodology to merging campaign contribution records, survey data, and nationwide voter files. An open-source software package is available for implementing the proposed methodology.
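To make the idea of probabilistic record linkage concrete, the following is a minimal sketch, not the authors' implementation or their package's API. It assumes the canonical model is a two-class mixture in the Fellegi-Sunter style: each record pair is summarized by a vector of field agreements, fields are treated as conditionally independent given match status, and missing fields are simply skipped in the likelihood. The function name `fit_match_model`, its parameters, and the starting values are all hypothetical choices for illustration.

```python
def _clip(x, eps=1e-6):
    """Keep probabilities away from 0 and 1 so products never vanish."""
    return min(max(x, eps), 1.0 - eps)


def fit_match_model(patterns, n_iter=100, lam=0.1, m_init=0.9, u_init=0.1):
    """EM for a two-class mixture over binary agreement patterns.

    patterns: list of tuples, one per record pair. Each entry is
        1 (field agrees), 0 (field disagrees), or None (field missing).
    lam: starting value for the share of pairs that are true matches.
    Returns (lam, m, u, post), where m[j] = P(agree on j | match),
    u[j] = P(agree on j | non-match), and post[i] is the match
    probability of pair i from the final E-step.
    """
    k = len(patterns[0])
    m = [m_init] * k
    u = [u_init] * k
    post = []
    for _ in range(n_iter):
        # E-step: posterior match probability for every pair.
        post = []
        for g in patterns:
            pm, pu = lam, 1.0 - lam
            for j, a in enumerate(g):
                if a is None:  # assumption: missing fields are ignorable
                    continue
                pm *= m[j] if a else 1.0 - m[j]
                pu *= u[j] if a else 1.0 - u[j]
            post.append(pm / (pm + pu))
        # M-step: weighted agreement rates, using observed fields only.
        lam = _clip(sum(post) / len(post))
        for j in range(k):
            obs = [(p, g[j]) for p, g in zip(post, patterns)
                   if g[j] is not None]
            w_match = max(sum(p for p, _ in obs), 1e-12)
            w_non = max(sum(1.0 - p for p, _ in obs), 1e-12)
            m[j] = _clip(sum(p * a for p, a in obs) / w_match)
            u[j] = _clip(sum((1.0 - p) * a for p, a in obs) / w_non)
    return lam, m, u, post
```

Under these assumptions, a pair would be declared a match when its posterior probability exceeds a chosen threshold, and the posterior probabilities themselves could serve as weights in a post-merge analysis. A scalable implementation would additionally avoid looping over every pair, for example by working with counts of distinct agreement patterns rather than individual pairs.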
