Entity linking based on the co-occurrence graph and entity probability

Dexter: an open source framework for entity linking

We introduce Dexter, an open source framework for entity linking. The entity linking task aims to identify all the small text fragments in a document that refer to an entity contained in a given knowledge base, e.g., Wikipedia. The annotation is usually organized in three tasks. Given an input document, the first task consists of discovering the fragments that could refer to an entity. Since a mention could refer to multiple entities, it is necessary to perform a disambiguation step, where the correct entity is selected from among the candidates. Finally, discovered entities are ranked by some measure of relevance. Many entity linking algorithms have been proposed, but unfortunately only a few authors have released their source code or APIs. As a result, it is difficult today to evaluate the performance of a method on a single subtask or to compare different techniques. In this work we present a new open framework, called Dexter, which implements several popular algorithms and provides all the tools needed to develop any entity linking technique. We believe that a shared framework is fundamental for performing fair comparisons and improving the state of the art.
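The three-step pipeline this abstract describes (spotting, disambiguation, ranking) can be illustrated with a minimal sketch; the spot dictionary, prior scores, and entity names below are invented for illustration and are not Dexter's actual data structures or API.

```python
# Minimal sketch of a spotting -> disambiguation -> ranking pipeline.
# All data here is a toy stand-in for a real spot dictionary built from
# Wikipedia anchor text.

# Surface form -> candidate entities with (assumed) prior scores.
SPOTS = {
    "jaguar": {"Jaguar_Cars": 0.7, "Jaguar_(animal)": 0.3},
    "apple": {"Apple_Inc.": 0.8, "Apple_(fruit)": 0.2},
}

def spot(text):
    """Step 1: find text fragments that could refer to an entity."""
    tokens = text.lower().split()
    return [(i, t) for i, t in enumerate(tokens) if t in SPOTS]

def disambiguate(mentions):
    """Step 2: select the correct entity among the candidates (here: by prior)."""
    return [
        (pos, surface) + max(SPOTS[surface].items(), key=lambda kv: kv[1])
        for pos, surface in mentions
    ]

def rank(annotations):
    """Step 3: rank discovered entities by a relevance measure (here: the score)."""
    return sorted(annotations, key=lambda a: a[3], reverse=True)

for pos, surface, entity, score in rank(disambiguate(spot("The jaguar ate an apple"))):
    print(f"token {pos}: {surface!r} -> {entity} ({score:.2f})")
```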

High-Throughput and Language-Agnostic Entity Disambiguation and Linking on User Generated Data

ArXiv, 2017

The Entity Disambiguation and Linking (EDL) task matches entity mentions in text to a unique Knowledge Base (KB) identifier, such as a Wikipedia or Freebase id. It plays a critical role in the construction of a high-quality information network, and can be further leveraged for a variety of information retrieval and NLP tasks such as text categorization and document tagging. EDL is a complex and challenging problem due to the ambiguity of mentions and the multilingual nature of real-world text. Moreover, EDL systems need to have high throughput and should be lightweight in order to scale to large datasets and run on off-the-shelf machines. More importantly, these systems need to be able to extract and disambiguate dense annotations from the data in order to make an Information Retrieval or Extraction task running on the data more efficient and accurate. To address all these challenges, we present the Lithium EDL system and algorithm - a high-throughput, lightweight, language...

BUAP_1: A Naïve Approach to the Entity Linking Task

In these notes we report the results obtained by applying the Naïve Bayes classifier to the Entity Linking task of the Knowledge Base Population track at the Text Analysis Conference. Three different runs were submitted to the challenge, each applying the classifier in a different way. The results obtained were very low, and recent analyses showed that this issue derived from errors at the pre-processing stage.
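As a hedged sketch of how a Naïve Bayes classifier can be applied to entity linking, the snippet below trains on mention contexts labeled with KB entities; the training examples and entity ids are invented, and scikit-learn stands in for whatever implementation the authors actually used.

```python
# Toy Naive Bayes entity-linking classifier: bag-of-words context features,
# candidate KB entities as classes. Requires scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented (context, entity) training pairs for the ambiguous mention "Washington".
contexts = [
    "the quarterback threw a touchdown for washington",
    "washington passed a new federal budget bill",
    "hiking trails across washington state are popular",
]
labels = ["Washington_Huskies", "Washington,_D.C.", "Washington_(state)"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(contexts, labels)

print(model.predict(["the federal bill was debated in washington"]))
```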

Entity Disambiguation and Linking over Queries using Encyclopedic Knowledge

The literature contains a large amount of work on entity recognition and semantic disambiguation in text, but very little on their behavior over noisy text data. In this paper, we present an approach for recognizing and disambiguating entities in text based on the high coverage and rich structure of an online encyclopedia. This work was carried out on a collection of query logs from the Bridgeman Art Library. Since queries are noisy, unstructured text, pure natural language processing as well as computational techniques can run into problems; we need to contend with the impact of noise and the demands it places on query analysis. To cope with the noisy input, we use machine learning methods with statistical measures derived from Wikipedia, which provides a huge body of electronic text from the Internet that is itself noisy. Our approach is unsupervised and does not need any manual annotation by human experts. We show that statistics collected from Wikipedia can be used to achieve good performance for entity recognition and semantic disambiguation over noisy, unstructured text. Moreover, as no language-specific tool is needed, the method can be applied to other languages in a similar manner with little adaptation.
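One concrete example of a statistical measure derivable from Wikipedia is the "commonness" of an entity for a surface form, i.e. P(entity | anchor text) estimated from anchor-link counts. The counts below are invented, and this is a sketch of the general idea rather than the paper's exact measure.

```python
# Commonness prior: how often an anchor text links to each target article.
# Real counts come from parsing a Wikipedia dump; these are invented.

anchor_counts = {
    "tree": {"Tree": 920, "Tree_(data_structure)": 160, "Tree_(graph_theory)": 40},
}

def commonness(anchor, entity):
    """Estimate P(entity | anchor) from anchor-link counts."""
    counts = anchor_counts.get(anchor.lower(), {})
    total = sum(counts.values())
    return counts.get(entity, 0) / total if total else 0.0

for entity in anchor_counts["tree"]:
    print(f"{entity}: {commonness('tree', entity):.3f}")
```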

Entity Disambiguation for Knowledge Base Population

2010

The integration of facts derived from information extraction systems into existing knowledge bases requires a system to disambiguate entity mentions in the text. This is challenging due to issues such as non-uniform variations in entity names, mention ambiguity, and entities absent from the knowledge base. We present a state-of-the-art system for entity disambiguation that not only addresses these challenges but also scales to knowledge bases with several million entries using very few resources.

JVN-TDT Entity Linking Systems at TAC-KBP2012

We present two methods for entity linking used in two of our systems submitted to TAC-KBP 2012. The first one, Method 1, learns coherence among co-occurring entities referred to within a text by exploiting Wikipedia's link structure; the second one, Method 2, combines some heuristics with a statistical model for entity linking. Method 1 exploits two features to train a classifier and uses coreference relations among co-occurring mentions. Method 2 is a hybrid method with two phases. The first phase is rule-based: it filters candidates and, where possible, disambiguates mentions with high reliability. The second phase employs a statistical model to rank the candidates of each remaining mention and chooses the one with the highest ranking as the right referent of that mention. Experiments are conducted to evaluate the two methods on the TAC-KBP2011 and TAC-KBP2012 datasets.
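The two-phase hybrid scheme of Method 2 can be sketched as follows; the candidate lists, rules, and scoring function are illustrative assumptions, not the submitted system's actual components.

```python
# Sketch of a two-phase hybrid linker: a rule-based phase that filters
# candidates and resolves unambiguous mentions, then a statistical ranking
# phase for whatever remains. All data and the scorer are toy stand-ins.

CANDIDATES = {
    "Georgia": ["Georgia_(country)", "Georgia_(U.S._state)"],
    "Tbilisi": ["Tbilisi"],
}

def phase1_rules(mention):
    """If the rules leave exactly one candidate, link it with high reliability."""
    cands = CANDIDATES.get(mention, [])
    return cands[0] if len(cands) == 1 else None

def phase2_rank(mention, context, score):
    """Rank the remaining candidates with a statistical model (stubbed as `score`)."""
    cands = CANDIDATES.get(mention, [])
    return max(cands, key=lambda e: score(e, context)) if cands else None

def link(mention, context, score):
    return phase1_rules(mention) or phase2_rank(mention, context, score)

def toy_score(entity, context):
    """Stand-in statistical model: word overlap between entity label and context."""
    label = entity.lower().replace("_", " ").replace("(", "").replace(")", "")
    return len(set(label.split()) & set(context.lower().split()))

print(link("Tbilisi", "the capital city", toy_score))           # phase 1 resolves it
print(link("Georgia", "a country in the caucasus", toy_score))  # phase 2 ranks
```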

An Approach to Collective Entity Linking

2015

Entity linking is the task of disambiguating entities in unstructured text by linking them to an entity in a catalog. Several collective entity linking approaches exist that attempt to collectively disambiguate all mentions in the text by leveraging both local mention-entity context and global entity-entity relatedness. However, the complexity of these models makes it infeasible to employ exact inference techniques and to jointly train the local and global feature weights. In this work we present a collective disambiguation model that, under suitable assumptions, makes an efficient implementation of exact MAP inference possible. We also present an efficient approach to training the local and global features of this model and implement it in an interactive entity linking system. The system receives human feedback on a document collection and progressively trains the underlying disambiguation model.
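A brute-force version of collective disambiguation makes the local-plus-global objective concrete: score every joint assignment of candidates and take the maximum. Exhaustive search is only feasible for tiny candidate sets; the paper's contribution is making exact MAP inference efficient under suitable assumptions, which this sketch with invented scores does not reproduce.

```python
# Collective disambiguation by exhaustive MAP search: the joint score sums
# local mention-entity compatibility and pairwise entity-entity relatedness.
from itertools import combinations, product

local = {  # mention -> {candidate: local score} (invented values)
    "Paris": {"Paris": 0.6, "Paris_Hilton": 0.4},
    "Seine": {"Seine": 0.9, "Seine-et-Marne": 0.1},
}
related = {  # symmetric entity-entity relatedness (invented values)
    frozenset(["Paris", "Seine"]): 0.8,
    frozenset(["Paris_Hilton", "Seine"]): 0.05,
}

def joint_score(assignment):
    score = sum(local[m][e] for m, e in assignment.items())
    for (_, e1), (_, e2) in combinations(assignment.items(), 2):
        score += related.get(frozenset([e1, e2]), 0.0)
    return score

mentions = list(local)
best = max(
    (dict(zip(mentions, choice)) for choice in product(*(local[m] for m in mentions))),
    key=joint_score,
)
print(best)  # the coherent city/river pair wins the joint score
```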

Robustness Evaluation of Entity Disambiguation Using Prior Probes: the Case of Entity Overshadowing

ArXiv, 2021

Entity disambiguation (ED) is the last step of entity linking (EL), when candidate entities are reranked according to the context they appear in. All datasets for training and evaluating models for EL consist of convenience samples, such as news articles and tweets, that propagate the prior probability bias of the entity distribution towards more frequently occurring entities. It was previously shown that the performance of the EL systems on such datasets is overestimated since it is possible to obtain higher accuracy scores by merely learning the prior. To provide a more adequate evaluation benchmark, we introduce the ShadowLink dataset, which includes 16K short text snippets annotated with entity mentions. We evaluate and report the performance of popular EL systems on the ShadowLink benchmark. The results show a considerable difference in accuracy between more and less common entities for all of the EL systems under evaluation, demonstrating the effects of prior probability bias ...
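The prior-only baseline the paper alludes to is easy to state: link every mention to its most frequent candidate and ignore context entirely. The sketch below uses invented counts; on head-heavy convenience datasets this strategy scores deceptively well, and overshadowed entities are exactly where it fails.

```python
# Prior-only "linking": always choose the most frequent entity for a mention.
prior = {  # mention -> {entity: corpus frequency} (invented numbers)
    "Michael Jordan": {"Michael_Jordan": 9500, "Michael_I._Jordan": 500},
}

def prior_only_link(mention):
    candidates = prior.get(mention)
    return max(candidates, key=candidates.get) if candidates else None

# Returns the basketball player even inside a machine learning article.
print(prior_only_link("Michael Jordan"))
```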

Graph Based Disambiguation of Named Entities using Linked Data

Identifying entities such as people, organizations, songs, or places in natural language texts is necessary for semantic search, machine translation, and information extraction. A key challenge is the ambiguity of entity names, which requires robust methods to disambiguate names to the entities registered in a knowledge base. While several approaches aim to tackle this problem, they still achieve poor accuracy. We address this drawback by presenting a novel knowledge-base-agnostic approach for named entity disambiguation. Our approach combines the HITS algorithm with label expansion strategies and string similarity measures such as n-gram similarity. Based on this combination, we can efficiently detect the correct URIs for a given set of named entities within an input text.
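The n-gram similarity mentioned here can be illustrated with a character-trigram Dice coefficient; the choice of Dice scoring is an assumption, and this shows the string matching component only, omitting the HITS and label expansion steps.

```python
# Character trigram similarity (Dice coefficient) between two surface forms.

def trigrams(s, n=3):
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def trigram_similarity(a, b):
    ga, gb = trigrams(a), trigrams(b)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

print(round(trigram_similarity("Barack Obama", "Barak Obama"), 3))  # near-duplicate
print(round(trigram_similarity("Barack Obama", "Mitt Romney"), 3))  # unrelated
```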

Entity Extraction within Plain-Text Collections WISE 2013 Challenge - T1: Entity Linking Track

Lecture Notes in Computer Science, 2013

The increasing availability of electronic texts, such as free-content encyclopedias on the internet, has unveiled a wealth of interesting and important knowledge in Web 2.0. Nevertheless, identifying relations within a myriad of information is still a challenge. For large corpora, it is impractical to manually label each text in order to define relations for information extraction. The WISE 2013 conference proposed a challenge (T1 Track) in which teams must label entities within plain texts based on a given set of entities. The Wikilinks dataset comprises 40 million mentions over 3 million entities. This paper describes a straightforward, twofold unsupervised strategy to extract and tag entities, aiming to achieve accurate results in the identification of proper nouns and concrete concepts, regardless of the domain. The proposed solution is based on a pipeline of text processing modules that includes a lexical parser. To validate the proposed solution, we statistically evaluate the results using various measurements on the case study supplied by the Challenge.
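As a hedged illustration of the kind of unsupervised extract-and-tag step such a pipeline might contain, the sketch below pulls capitalized spans out of plain text and keeps those present in a given entity set; the regex heuristic and the entity set are invented and far simpler than the authors' lexical parser.

```python
# Toy unsupervised extract-and-tag step: capitalized word runs as candidate
# proper nouns, filtered against a given entity set.
import re

ENTITY_SET = {"New York", "Barack Obama"}  # stand-in for the challenge's entity list

def candidate_spans(text):
    """Runs of capitalized words, a crude proxy for proper nouns."""
    return re.findall(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*", text)

def tag_entities(text):
    return [span for span in candidate_spans(text) if span in ENTITY_SET]

print(tag_entities("Barack Obama visited New York last week."))
```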