Incremental Entity Resolution from Linked Documents

An incremental graph-partitioning algorithm for entity resolution

Information Fusion, 2019

Entity resolution is an important data association task when fusing information from multiple sources. Often the information arrives continuously, and the entity resolution algorithm needs to update its solution efficiently upon receiving new information. In this work, we introduce an incremental entity resolution algorithm based on a graph partitioning formulation. The developed algorithm is able to handle both incrementally arriving entity references and incrementally arriving information that changes the pairwise similarity scores between the references. New information is handled in a way that allows the algorithm to reconsider past decisions when contradictory information arrives. Because the graph partitioning formulation used is NP-hard, a heuristic algorithm is developed to produce good solutions; it is also compatible with a blocking technique that limits the number of required comparisons. The algorithm is tested on a variety of datasets (randomly generated and real), and it is shown that allowing the algorithm to consider revised scores and revisit prior decisions offers a substantial improvement in accuracy (approximately 30-40% better F-score on a natural language dataset) compared to other greedy heuristics on the same set of coefficients. It is also shown that, on a test set with 100 references, the incremental algorithm is up to an order of magnitude faster than a batch algorithm that resolves the entire problem.
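
The idea of revisiting earlier decisions when a pairwise score is revised can be illustrated with a small greedy sketch. This is not the heuristic from the paper: the class `IncrementalResolver`, its method names, and the signed-score convention (positive means "likely the same entity", negative means "likely different") are assumptions made for illustration only.

```python
from collections import defaultdict


class IncrementalResolver:
    """Greedy, correlation-clustering-style sketch (hypothetical, not the paper's heuristic)."""

    def __init__(self):
        self.score = defaultdict(dict)    # score[a][b] = signed pairwise similarity
        self.cluster_of = {}              # reference -> cluster id
        self.members = defaultdict(set)   # cluster id -> set of references
        self._next_id = 0

    def _place(self, ref):
        # Sum the signed scores from ref to every cluster that holds a scored neighbour.
        gain = defaultdict(float)
        for other, s in self.score[ref].items():
            if other in self.cluster_of:
                gain[self.cluster_of[other]] += s
        best = max(gain, key=gain.get, default=None)
        if best is None or gain[best] <= 0:
            # No cluster is a net-positive fit: open a new singleton cluster.
            best = self._next_id
            self._next_id += 1
        self.cluster_of[ref] = best
        self.members[best].add(ref)

    def _remove(self, ref):
        cid = self.cluster_of.pop(ref, None)
        if cid is not None:
            self.members[cid].discard(ref)
            if not self.members[cid]:
                del self.members[cid]

    def add_reference(self, ref, scores=None):
        """Handle a newly arriving reference with scores to earlier references."""
        for other, s in (scores or {}).items():
            self.score[ref][other] = s
            self.score[other][ref] = s
        self._place(ref)

    def update_score(self, a, b, s):
        """Handle revised evidence: detach both endpoints and place them again."""
        self.score[a][b] = s
        self.score[b][a] = s
        for ref in (a, b):
            self._remove(ref)
        for ref in (a, b):
            self._place(ref)

    def clusters(self):
        return sorted(sorted(m) for m in self.members.values())


resolver = IncrementalResolver()
resolver.add_reference("a")
resolver.add_reference("b", {"a": 0.8})
resolver.add_reference("c", {"a": -0.5, "b": 0.2})
print(resolver.clusters())         # [['a', 'b'], ['c']]
resolver.update_score("a", "b", -0.9)
print(resolver.clusters())         # [['a'], ['b', 'c']]
```

In this sketch only the two endpoints of a revised score are reconsidered, which keeps each update local; a fuller heuristic would likely also re-examine their neighbours.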

Similarity-aware indexing for real-time entity resolution

… of the 18th ACM conference on …, 2009

Entity resolution, also known as data matching or record linkage, is the task of identifying and matching records from several databases that refer to the same entities. Traditionally, entity resolution has been applied in batch-mode and on static databases. However, many organisations are increasingly faced with the challenge of having large databases containing entities that need to be matched in real-time with a stream of query records also containing entities, such that the best matching records are retrieved. Example applications include online law enforcement and national security databases, public health surveillance and emergency response systems, financial verification systems, online retail stores, eGovernment services, and digital libraries. A novel inverted index based approach for real-time entity resolution is presented in this paper. At build time, similarities between attribute values are computed and stored to support the fast matching of records at query time. The presented approach differs from other approaches to approximate query matching in that it allows any similarity comparison function, and any 'blocking' (encoding) function, both possibly domain specific, to be incorporated. Experimental results on a real-world database indicate that the total size of all data structures of this novel index approach grows sub-linearly with the size of the database, and that it allows matching of query records in sub-second time, more than two orders of magnitude faster than a traditional entity resolution index approach. The interested reader is referred to the longer version of this paper [5].
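
A minimal sketch of the build-time/query-time split described above, assuming a single string attribute per record. The blocking function (first character) and the similarity function (`difflib.SequenceMatcher`) are placeholders chosen for illustration; the class and method names are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def block_key(value):
    # Placeholder blocking (encoding) function: first character, lower-cased.
    return value[:1].lower()


def similarity(a, b):
    # Placeholder similarity function; any domain-specific comparison could be swapped in.
    return SequenceMatcher(None, a, b).ratio()


class SimilarityAwareIndex:
    def __init__(self):
        self.records = defaultdict(set)   # attribute value -> record ids holding it
        self.blocks = defaultdict(set)    # blocking key -> attribute values in the block
        self.sims = defaultdict(dict)     # value -> {other value in same block: similarity}

    def add(self, record_id, value):
        """Build time: compute similarities once per distinct value, then store the record."""
        if value not in self.records:
            key = block_key(value)
            for other in self.blocks[key]:
                s = similarity(value, other)
                self.sims[value][other] = s
                self.sims[other][value] = s
            self.blocks[key].add(value)
        self.records[value].add(record_id)

    def query(self, value, top=3):
        """Query time: reuse stored similarities when the value is already indexed."""
        if value in self.records:
            candidates = dict(self.sims.get(value, {}))
            candidates[value] = 1.0
        else:
            # Unseen value: compare only against values in the same block.
            block = self.blocks.get(block_key(value), ())
            candidates = {other: similarity(value, other) for other in block}
        scored = {}
        for other, s in candidates.items():
            for rid in self.records[other]:
                scored[rid] = max(s, scored.get(rid, 0.0))
        # Rank by descending similarity, breaking ties by record id.
        return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


index = SimilarityAwareIndex()
index.add(1, "smith")
index.add(2, "smyth")
index.add(3, "jones")
print(index.query("smith"))    # record 1 (exact match) ranked ahead of record 2
```

The sketch trades memory for query speed: a query on an indexed value is a dictionary lookup, while only unseen values trigger fresh comparisons.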

Towards scalable real-time entity resolution using a similarity-aware inverted index approach

Proceedings of AusDM, 2008

Most research into entity resolution (also known as record linkage or data matching) has concentrated on the quality of the matching results. In this paper, we focus on matching time and scalability, with the aim to achieve large-scale real-time entity resolution. Traditional entity resolution techniques have assumed the matching of two static databases. In our networked and online world, however, it is becoming increasingly important for many organisations to be able to conduct entity resolution between a collection of often very large databases and a stream of query or update records. The matching should be done in (near) real-time, and be as automatic and accurate as possible, returning a ranked list of matched records for each given query record. This task therefore becomes similar to querying large document collections, as done for example by Web search engines, however based on a different type of documents: structured database records that, for example, contain personal information, such as names and addresses. In this paper, we investigate inverted indexing techniques, as commonly used in Web search engines, and employ them for real-time entity resolution. We present two variations of the traditional inverted index approach, aimed at facilitating fast approximate matching. We show encouraging initial results on large real-world data sets, with the inverted index approaches being up to one hundred times faster than the traditionally used standard blocking approach. However, this improved matching speed currently comes at a cost, in that matching quality for larger data sets can be lower compared to when standard blocking is used, and thus more work is required.

Query-time entity resolution

Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '06, 2006

Entity resolution is the problem of reconciling database references corresponding to the same real-world entities. Given the abundance of publicly available databases that have unresolved entities, we motivate the problem of query-time entity resolution: quick and accurate resolution for answering queries over such 'unclean' databases at query-time. Since collective entity resolution approaches (where related references are resolved jointly) have been shown to be more accurate than independent attribute-based resolution for off-line entity resolution, we focus on developing new algorithms for collective resolution for answering entity resolution queries at query-time. For this purpose, we first formally show that, for collective resolution, precision and recall for individual entities follow a geometric progression as neighbors at increasing distances are considered. Unfolding this progression leads naturally to a two-stage 'expand and resolve' query processing strategy. In this strategy, we first extract the related records for a query using two novel expansion operators, and then resolve the extracted records collectively. We then show how the same strategy can be adapted for query-time entity resolution by identifying and resolving only those database references that are the most helpful for processing the query. We validate our approach on two large real-world publication databases where we show the usefulness of collective resolution and at the same time demonstrate the need for adaptive strategies for query processing. We then show how the same queries can be answered in real-time using our adaptive approach while preserving the gains of collective resolution. In addition to experiments on real datasets, we use synthetically generated data to empirically demonstrate the validity of the performance trends predicted by our analysis of collective entity resolution over a wide range of structural characteristics in the data.

Efficient entity resolution for large heterogeneous information spaces

Proceedings of the fourth ACM international conference on Web search and data mining - WSDM '11, 2011

We have recently witnessed an enormous growth in the volume of structured and semi-structured data sets available on the Web. An important prerequisite for using and combining such data sets is the detection and merging of information that describes the same real-world entities, a task known as Entity Resolution. To make this quadratic task efficient, blocking techniques are typically employed. However, the high dynamics, loose schema binding, and heterogeneity of (semi-)structured data impose new challenges to entity resolution. Existing blocking approaches become inapplicable because they rely on the homogeneity of the considered data and a priori known schemata. In this paper, we introduce a novel approach for entity resolution, scaling it up for large, noisy, and heterogeneous information spaces. It combines an attribute-agnostic mechanism for building blocks with intelligent block processing techniques that boost blocks with high expected utility, propagate knowledge about identified matches, and preempt the resolution process when it gets too expensive. Our extensive evaluation on real-world, large, heterogeneous data sets verifies that the suggested approach is both effective and efficient.
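
One simple way to build blocks without relying on a schema is to key them on the tokens found in any attribute value. The sketch below assumes that token-based scheme, uses block size as a stand-in for expected utility (smaller blocks first), skips pairs that were already emitted, and stops at an optional size cap. The function names and the `max_block_size` parameter are hypothetical, not taken from the paper.

```python
import re
from collections import defaultdict
from itertools import combinations


def token_blocks(entities):
    """Attribute-agnostic blocking: one block per token, whatever attribute it came from."""
    blocks = defaultdict(set)
    for eid, attrs in entities.items():
        for value in attrs.values():
            for tok in re.findall(r"\w+", str(value).lower()):
                blocks[tok].add(eid)
    # A block with a single entity yields no comparisons, so drop it.
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks, max_block_size=None):
    """Yield each candidate pair once, processing small (cheap) blocks first."""
    seen = set()
    for tok in sorted(blocks, key=lambda t: (len(blocks[t]), t)):
        ids = blocks[tok]
        if max_block_size is not None and len(ids) > max_block_size:
            break  # remaining blocks are at least this large: stop before it gets expensive
        for pair in combinations(sorted(ids), 2):
            if pair not in seen:
                seen.add(pair)
                yield pair


entities = {
    1: {"name": "John Smith", "city": "Paris"},
    2: {"fullName": "Smith, John"},
    3: {"label": "Paris Hilton"},
}
print(list(candidate_pairs(token_blocks(entities))))   # [(1, 2), (1, 3)]
```

Entities 1 and 2 share no attribute names, yet they land in the same blocks because only the values are tokenised.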

Unsupervised Graph-Based Entity Resolution for Complex Entities

ACM Transactions on Knowledge Discovery From Data, 2023

Entity resolution (ER) is the process of linking records that refer to the same entity. Traditionally, this process compares attribute values of records to calculate similarities and then classifies pairs of records as referring to the same entity or not based on these similarities. Recently developed graph-based ER approaches combine relationships between records with attribute similarities to improve linkage quality. Most of these approaches only consider databases containing basic entities that have static attribute values and static relationships, such as publications in bibliographic databases. In contrast, temporal record linkage addresses the problem where attribute values of entities can change over time. However, neither existing graph-based ER nor temporal record linkage can achieve high linkage quality on databases with complex entities, where an entity (such as a person) can change its attribute values over time while having different relationships with other entities at different points in time. In this paper we propose an unsupervised graph-based ER framework that is aimed at linking records of complex entities. Our framework provides five key contributions. First, we propagate positive evidence encountered when linking records to use in subsequent links by propagating attribute values that have changed. Second, we employ negative evidence by applying temporal and link constraints to restrict which candidate record pairs to consider for linking. Third, we leverage the ambiguity of attribute values to disambiguate similar records that however belong to different entities. Fourth, we adaptively exploit the structure of relationships to link records that have different relationships. Fifth, using graph measures we refine matched clusters of records by removing likely wrong links between records. We conduct extensive experiments on seven real-world data sets from different domains showing that on average our unsupervised graph-based ER framework can improve precision by up to 25% and recall by up to 29% compared to several state-of-the-art ER techniques.

Entity resolution for probabilistic data

Information Sciences, 2014

Entity resolution is the problem of identifying the tuples that represent the same real world entity. In this paper, we propose a complete solution to the problem of entity resolution over probabilistic data (ERPD), which arises in many applications that have to deal with probabilistic data. To deal with the ERPD problem, we distinguish between two classes of similarity functions, i.e. context-free and context-sensitive. We propose a PTIME algorithm for context-free similarity functions, and an approximation algorithm for context-sensitive similarity functions. We validated our algorithms through experiments over both synthetic and real datasets. Our extensive performance evaluation shows the effectiveness of our algorithms.

Graph-Parallel Entity Resolution using LSH & IMM

In this paper we describe graph-based parallel algorithms for entity resolution that improve over the map-reduce approach. We compare two approaches to parallelize a Locality Sensitive Hashing (LSH) accelerated, Iterative Match-Merge (IMM) entity resolution technique: BCP, where records hashed together are compared at a single node/reducer, vs an alternative mechanism (RCP) where comparison load is better distributed across processors especially in the presence of severely skewed bucket sizes. We analyze the BCP and RCP approaches analytically as well as empirically using a large synthetically generated dataset. We generalize the lessons learned from our experience and submit that the RCP approach is also applicable in many similar applications that rely on LSH or related grouping strategies to minimize pair-wise comparisons.
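
The combination of LSH bucketing with iterative match-merge can be sketched in a single process. The sketch assumes records are non-empty token sets, uses MinHash with banding as the LSH scheme and Jaccard similarity as the match test, and merges by taking the union of tokens. All of these are illustrative assumptions, and the sketch does not model how BCP or RCP distribute the comparisons across processors.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def minhash(tokens, num_hashes):
    """MinHash signature using seeded MD5 so results are stable across runs."""
    return tuple(
        min(
            int.from_bytes(hashlib.md5(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in tokens
        )
        for seed in range(num_hashes)
    )


def lsh_buckets(records, bands=4, rows=2):
    """Records whose signatures agree on every row of some band share a bucket."""
    buckets = defaultdict(set)
    for rid, tokens in records.items():
        sig = minhash(tokens, bands * rows)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(rid)
    return buckets


def jaccard(a, b):
    return len(a & b) / len(a | b)


def iterative_match_merge(records, threshold=0.5):
    """Repeat: bucket, compare within buckets, merge matches, until nothing merges."""
    clusters = {frozenset([rid]): set(tokens) for rid, tokens in records.items()}
    merged = True
    while merged:
        merged = False
        for ids in lsh_buckets(clusters).values():
            for a, b in combinations(sorted(ids, key=sorted), 2):
                # Skip clusters already consumed by an earlier merge in this round.
                if a in clusters and b in clusters:
                    if jaccard(clusters[a], clusters[b]) >= threshold:
                        clusters[a | b] = clusters.pop(a) | clusters.pop(b)
                        merged = True
    return sorted(sorted(c) for c in clusters)


records = {
    1: {"john", "smith", "paris"},
    2: {"john", "smith", "paris"},
    3: {"alice", "jones", "rome"},
}
print(iterative_match_merge(records))   # [[1, 2], [3]]
```

Merged clusters are re-hashed in the next round, so a merge can expose new matches that neither original record would have reached on its own.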

Entity Resolution for Uncertain Data

2012

Entity resolution (ER), also known as duplicate detection or record matching, is the problem of identifying the tuples that represent the same real world entity. In this paper, we address the problem of ER for uncertain data, which we call ERUD. We propose two different approaches for the ERUD problem based on two classes of similarity functions, i.e. context-free and context-sensitive. We propose a PTIME algorithm for context-free similarity functions, and a Monte Carlo algorithm for context-sensitive similarity functions. Existing context-sensitive similarity functions need at least one pass over the database to compute some statistical features of data, which makes it very inefficient for our Monte Carlo algorithm. Thus, we propose a novel context-sensitive similarity function that makes our Monte Carlo algorithm more efficient. To further improve the efficiency of our proposed Monte Carlo algorithm, we propose a parallel version of it using the MapReduce framework. We validated our algorithms through experiments over both synthetic and real datasets. Our performance evaluation shows the effectiveness of our algorithms in terms of success rate and response time.
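
A Monte Carlo treatment of uncertain tuples can be sketched by sampling possible worlds. The abstract does not spell out the problem definition, so the sketch assumes the goal is to estimate, for each uncertain tuple, how often it is the best match for a given entity. It further assumes that each tuple has alternatives whose probabilities sum to 1, and it uses a string-ratio similarity that receives the sampled world but ignores it; a context-sensitive function would use that argument. All names are hypothetical.

```python
import random
from collections import Counter
from difflib import SequenceMatcher


def sample_world(xtuples, rng):
    """Draw one possible world: one alternative per uncertain tuple."""
    world = {}
    for tid, alternatives in xtuples.items():
        values = [value for value, _ in alternatives]
        weights = [prob for _, prob in alternatives]
        world[tid] = rng.choices(values, weights=weights, k=1)[0]
    return world


def similarity(entity, value, world):
    # Placeholder: `world` is unused here. A context-sensitive similarity
    # (for example, one weighting rare tokens) would compute statistics from it.
    return SequenceMatcher(None, entity, value).ratio()


def match_probabilities(entity, xtuples, samples=1000, seed=0):
    """Estimate, per tuple, the fraction of sampled worlds in which it is the best match."""
    rng = random.Random(seed)
    wins = Counter()
    for _ in range(samples):
        world = sample_world(xtuples, rng)
        # sorted() makes tie-breaking deterministic (lowest tuple id wins).
        best = max(sorted(world), key=lambda tid: similarity(entity, world[tid], world))
        wins[best] += 1
    return {tid: count / samples for tid, count in wins.items()}


xtuples = {
    "t1": [("john smith", 1.0)],
    "t2": [("jane doe", 0.5), ("j. doe", 0.5)],
}
print(match_probabilities("john smith", xtuples))   # {'t1': 1.0}
```

Because each sampled world is independent, the loop over samples is the natural unit to distribute if a parallel version is needed.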