Entity Disambiguation and Linking over Queries using Encyclopedic Knowledge

Entity Disambiguation for Knowledge Base Population

2010

The integration of facts derived from information extraction systems into existing knowledge bases requires a system to disambiguate entity mentions in the text. This is challenging due to issues such as non-uniform variations in entity names, mention ambiguity, and entities absent from the knowledge base. We present a state-of-the-art system for entity disambiguation that not only addresses these challenges but also scales to knowledge bases with several million entries while using very limited resources.

Named entity disambiguation: A hybrid statistical and rule-based incremental approach

The Semantic Web, 2008

The rapidly increasing amount of large-scale data on the Web has made named entity disambiguation one of the main challenges in Information Extraction research and in the development of the Semantic Web. This paper presents a novel method for detecting proper names in a text and linking them to the right entities in Wikipedia. The method is hybrid, consisting of two phases: the first uses heuristics and patterns to narrow down the candidates, and the second employs the vector space model to rank the ambiguous cases and choose the right candidate. The novelty is that the disambiguation process is incremental and includes several rounds that filter the candidates, exploiting previously identified entities and extending the text with their attributes each time an entity is successfully resolved in a round. We evaluate the proposed method on the disambiguation of names of people, locations, and organizations in news texts. The experimental results show that our approach achieves high accuracy and can be used to construct a robust named entity disambiguation system.
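
To make the second phase concrete, the following is a minimal sketch (not the authors' code) of vector-space ranking of candidate entities: the mention's surrounding text and each candidate's description are compared with cosine similarity. The candidate descriptions and context string are hypothetical inputs.

```python
# Illustrative sketch: rank ambiguous candidates with a simple vector space model.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_candidates(context_text: str, candidates: dict[str, str]) -> list[tuple[str, float]]:
    """Rank candidate entities by similarity of their description to the mention context."""
    ctx = Counter(context_text.lower().split())
    scored = [(name, cosine(ctx, Counter(desc.lower().split()))) for name, desc in candidates.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Example: disambiguating "Paris" against two Wikipedia-style candidates.
print(rank_candidates(
    "The mayor of Paris announced new metro lines in France",
    {"Paris (France)": "capital city of France metro Seine",
     "Paris (Texas)": "city in Lamar County Texas United States"},
))
```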

Graph Based Disambiguation of Named Entities using Linked Data

Identifying entities such as people, organizations, songs, or places in natural language texts is essential for semantic search, machine translation, and information extraction. A key challenge is the ambiguity of entity names, which requires robust methods to disambiguate names to the entities registered in a knowledge base. While several approaches aim to tackle this problem, they still achieve poor accuracy. We address this drawback by presenting a novel knowledge-base-agnostic approach for named entity disambiguation. Our approach combines the HITS algorithm with label expansion strategies and string similarity measures such as n-gram similarity. Based on this combination, we can efficiently detect the correct URIs for a given set of named entities within an input text.
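
As an illustration of the kind of string similarity such an approach relies on, here is a small sketch of character n-gram (trigram) overlap between a mention's surface form and candidate labels. This shows n-gram similarity in general, not necessarily the exact measure or combination used in the paper; the labels are made up.

```python
# Character n-gram (trigram) similarity between a surface form and candidate labels.
def ngrams(s: str, n: int = 3) -> set[str]:
    s = f"  {s.lower()}  "          # pad so word boundaries contribute n-grams
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap of character n-gram sets (one possible way to combine them)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

# Candidate labels for the mention "NY Times" (toy example).
for label in ["The New York Times", "New York", "Time (magazine)"]:
    print(label, round(ngram_similarity("NY Times", label), 3))
```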

An approach to web-scale named-entity disambiguation

2009

We present a multi-pass clustering approach to large scale, wide-scope named-entity disambiguation (NED) on collections of web pages. Our approach uses name co-occurrence information to cluster and hence disambiguate entities, and is designed to handle NED on the entire web. We show that on web collections, NED becomes increasingly difficult as the corpus size increases, not only because of the challenge of scaling the NED algorithm, but also because new and surprising facets of entities become visible in the data.
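
The core idea of clustering by name co-occurrence can be sketched as follows. This is a rough single-pass approximation, not the paper's multi-pass algorithm; the documents, co-occurring names, and threshold are invented for illustration.

```python
# Toy sketch: group documents mentioning the same surface name when the other
# names they mention overlap sufficiently.
def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_by_cooccurrence(docs: dict[str, set[str]], threshold: float = 0.2) -> list[list[str]]:
    """Greedy single-link clustering of documents by overlap of co-occurring names."""
    clusters: list[tuple[set[str], list[str]]] = []   # (union of names, doc ids)
    for doc_id, names in docs.items():
        for cluster_names, members in clusters:
            if jaccard(names, cluster_names) >= threshold:
                cluster_names |= names
                members.append(doc_id)
                break
        else:
            clusters.append((set(names), [doc_id]))
    return [members for _, members in clusters]

# Two pages about the basketball player, one about the machine-learning researcher,
# all mentioning the ambiguous name "Michael Jordan".
docs = {
    "page1": {"Chicago Bulls", "NBA", "Scottie Pippen"},
    "page2": {"NBA", "Chicago Bulls"},
    "page3": {"UC Berkeley", "machine learning", "graphical models"},
}
print(cluster_by_cooccurrence(docs))
```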

Semantic Relatedness Approach for Named Entity Disambiguation

Communications in Computer and Information Science, 2010

Natural language is a means to express and discuss concepts, objects, and events; that is, it carries semantic content. One of the ultimate aims of Natural Language Processing techniques is to identify the meaning of a text, providing effective ways to properly link textual references to their referents, that is, real-world objects. This work addresses the problem of giving a sense to proper names in a text, that is, automatically associating words representing Named Entities with their referents. The proposed methodology for Named Entity Disambiguation is based on Semantic Relatedness Scores obtained with a graph-based model over Wikipedia. We show that, without building a Bag of Words representation of the text, but considering only the named entities within the text, the proposed paradigm achieves results competitive with the state of the art on two different datasets.
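
For illustration, a commonly used Wikipedia-graph relatedness measure based on shared inlinks (in the spirit of Milne and Witten) is sketched below. The paper's own graph-based score may differ; the inlink sets and article count here are made up.

```python
# Relatedness of two Wikipedia articles from the overlap of their inlink sets
# (Normalized Google Distance style score).
import math

def relatedness(inlinks_a: set[str], inlinks_b: set[str], total_articles: int) -> float:
    common = inlinks_a & inlinks_b
    if not common:
        return 0.0
    a, b, c = len(inlinks_a), len(inlinks_b), len(common)
    distance = (math.log(max(a, b)) - math.log(c)) / (math.log(total_articles) - math.log(min(a, b)))
    return max(0.0, 1.0 - distance)

# Hypothetical inlink sets for two candidate senses relative to a context entity.
print(relatedness({"Spain", "La Liga", "Camp Nou"},
                  {"Spain", "La Liga", "UEFA"},
                  total_articles=6_000_000))
```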

Linking Entities to Wikipedia Documents

2013

This paper addresses the challenging information extraction problem of linking named entities in text to entries in a large knowledge base such as Wikipedia. The approach, which is essentially an evolution of a system originally developed in the context of the English Entity Linking Task of the Text Analysis Conference, uses supervised learning to rank candidate knowledge base entries for each named entity, and then to classify the top-ranked entry as the correct disambiguation or not. In this paper, I analyze the fundamental design challenges involved in the development of a learning-based entity-linking system, and provide extensive experimental results with both Portuguese and Spanish texts, for a wide range of methods and feature sets. The experiments demonstrate the effectiveness of supervised learning methods, showing that out-of-the-box algorithms and relatively simple-to-compute features can achieve high accuracy on this task.
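
A structural sketch of such a rank-then-validate linker is given below: a ranker scores candidates for a mention, then a validation step decides whether the top-ranked candidate is correct or the mention should be left unlinked (NIL). The features, weights, and threshold are hypothetical stand-ins; a real system would learn them from annotated data.

```python
# Rank-then-validate entity linking skeleton with hypothetical hand-set weights.
def features(mention: str, context: set[str], candidate: dict) -> dict[str, float]:
    return {
        "name_match": 1.0 if mention.lower() == candidate["title"].lower() else 0.0,
        "context_overlap": len(context & candidate["keywords"]) / max(1, len(context)),
        "popularity": candidate["prior"],          # e.g. link probability in Wikipedia
    }

def score(feats: dict[str, float], weights: dict[str, float]) -> float:
    return sum(weights[k] * v for k, v in feats.items())

def link(mention: str, context: set[str], candidates: list[dict],
         weights: dict[str, float], nil_threshold: float = 0.5) -> str | None:
    ranked = sorted(candidates, key=lambda c: score(features(mention, context, c), weights), reverse=True)
    top = ranked[0]
    # Stage 2: accept the top-ranked entry only if its score clears a validation threshold.
    return top["title"] if score(features(mention, context, top), weights) >= nil_threshold else None

weights = {"name_match": 1.0, "context_overlap": 2.0, "popularity": 0.5}   # stand-in for learned weights
candidates = [
    {"title": "Paris", "keywords": {"france", "capital"}, "prior": 0.9},
    {"title": "Paris, Texas", "keywords": {"texas", "county"}, "prior": 0.1},
]
print(link("Paris", {"france", "eiffel"}, candidates, weights))
```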

Local and global algorithms for disambiguation to wikipedia

2011

Disambiguating concepts and entities in a context-sensitive way is a fundamental problem in natural language processing. The comprehensiveness of Wikipedia has made the online encyclopedia an increasingly popular target for disambiguation. Disambiguation to Wikipedia is similar to a traditional Word Sense Disambiguation task, but distinct in that the Wikipedia link structure provides additional information about which disambiguations are compatible.
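
The local-plus-global formulation this line of work describes can be sketched as choosing, for all mentions jointly, the assignment that maximizes per-mention compatibility (local) plus pairwise coherence between the chosen Wikipedia titles (global). The brute-force search and the toy scores below are illustrative simplifications, not the paper's inference procedure.

```python
# Pick the candidate assignment maximizing local scores plus pairwise global coherence.
from itertools import product, combinations

def best_assignment(candidates: dict[str, list[str]],
                    local: dict[tuple[str, str], float],
                    coherence: dict[frozenset, float]) -> dict[str, str]:
    mentions = list(candidates)
    best, best_score = None, float("-inf")
    for choice in product(*(candidates[m] for m in mentions)):
        assignment = dict(zip(mentions, choice))
        s = sum(local[(m, t)] for m, t in assignment.items())
        s += sum(coherence.get(frozenset({a, b}), 0.0)
                 for a, b in combinations(assignment.values(), 2))
        if s > best_score:
            best, best_score = assignment, s
    return best

candidates = {"Jordan": ["Michael Jordan", "Jordan (country)"], "Bulls": ["Chicago Bulls", "Bull"]}
local = {("Jordan", "Michael Jordan"): 0.4, ("Jordan", "Jordan (country)"): 0.6,
         ("Bulls", "Chicago Bulls"): 0.7, ("Bulls", "Bull"): 0.3}
coherence = {frozenset({"Michael Jordan", "Chicago Bulls"}): 1.0}
print(best_assignment(candidates, local, coherence))
```

Note how the global coherence term overrides the locally preferred but incoherent reading "Jordan (country)".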

High-Throughput and Language-Agnostic Entity Disambiguation and Linking on User Generated Data

ArXiv, 2017

The Entity Disambiguation and Linking (EDL) task matches entity mentions in text to a unique Knowledge Base (KB) identifier such as a Wikipedia or Freebase id. It plays a critical role in the construction of a high-quality information network, and can be further leveraged for a variety of information retrieval and NLP tasks such as text categorization and document tagging. EDL is a complex and challenging problem due to the ambiguity of mentions and the multilingual nature of real-world text. Moreover, EDL systems need to have high throughput and should be lightweight in order to scale to large datasets and run on off-the-shelf machines. More importantly, these systems need to be able to extract and disambiguate dense annotations from the data in order to enable an Information Retrieval or Extraction task running on the data to be more efficient and accurate. In order to address all these challenges, we present the Lithium EDL system and algorithm - a high-throughput, lightweight, language...

Ontology-driven automatic entity disambiguation in unstructured text

The Semantic Web-ISWC 2006, 2006

Precisely identifying entities in web documents is essential for document indexing, web search, and data integration. Entity disambiguation is the challenge of determining the correct entity out of various candidate entities. Our novel method utilizes background knowledge in the form of a populated ontology. Additionally, it does not rely on the existence of any structure in a document or on the appearance of data items that can provide strong evidence for disambiguation, such as e-mail addresses for disambiguating authors. The originality of our method lies in the way it uses different relationships, both in the document and in the ontology, to provide clues for determining the correct entity. We demonstrate the applicability of our method by disambiguating authors in a collection of DBWorld posts using a large-scale, real-world ontology extracted from DBLP. The precision and recall measurements provide encouraging results.
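
A hypothetical sketch of the core idea: prefer the candidate whose ontology relationships (here, co-authorship edges from a DBLP-like ontology) connect it to other names appearing in the same document. The entity identifiers and relation sets below are invented, and a real system would weight relationship types rather than simply counting overlaps.

```python
# Score candidate authors by how many of their ontology-related names appear in the document.
def disambiguate_author(ambiguous_name: str, doc_names: set[str],
                        candidates: dict[str, set[str]]) -> str:
    """candidates maps each candidate entity id to the set of names it is related to."""
    def evidence(entity_id: str) -> int:
        return len(candidates[entity_id] & doc_names)
    return max(candidates, key=evidence)

candidates = {
    "dblp:Wei_Wang_A": {"Jiong Yang", "Philip S. Yu"},      # hypothetical co-author sets
    "dblp:Wei_Wang_B": {"Xuemin Lin", "Jeffrey Xu Yu"},
}
print(disambiguate_author("Wei Wang", {"Xuemin Lin", "databases"}, candidates))
```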

Entity Disambiguation based on a Probabilistic Taxonomy

2011

This paper presents a method for entity disambiguation, one of the most fundamental tasks for machines to understand natural language text. In natural language, terms are ambiguous; for example, "Barcelona" usually refers to a Spanish city but can also refer to a professional football club. In our work, we utilize a probabilistic taxonomy that is as rich as our mental world in the concepts of worldly facts it contains. We then employ a naive Bayes probabilistic model to disambiguate a term by identifying its related terms in the same document. Specifically, our method consists of two steps: clustering related terms and conceptualizing the cluster using the probabilistic taxonomy. We cluster related terms probabilistically instead of using a threshold-based deterministic clustering approach. Our method automatically adjusts the relevance weight between two terms by taking the topic of the document into consideration, which enables us to perform clustering without a sensitive, predefined threshold. We then conceptualize all possible clusters using the probabilistic taxonomy and aggregate the probabilities of each concept to find the most likely one. Experimental results show that our method outperforms threshold-based methods with optimally set thresholds, as well as several gold-standard approaches for entity disambiguation.
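
The conceptualization step can be illustrated with a simplified naive Bayes sketch: given a cluster of related terms and taxonomy probabilities P(term | concept), pick the concept most likely to have generated the cluster. The probability table below is a tiny made-up stand-in for a full probabilistic taxonomy, and the smoothing constant is arbitrary.

```python
# Naive Bayes conceptualization of a term cluster over a toy probabilistic taxonomy.
import math

def conceptualize(terms: list[str], p_term_given_concept: dict[str, dict[str, float]],
                  p_concept: dict[str, float], smoothing: float = 1e-6) -> str:
    def log_posterior(concept: str) -> float:
        lp = math.log(p_concept[concept])
        for t in terms:
            lp += math.log(p_term_given_concept[concept].get(t, smoothing))
        return lp
    return max(p_concept, key=log_posterior)

p_term_given_concept = {
    "city":          {"barcelona": 0.02, "madrid": 0.03, "messi": 1e-5},
    "football club": {"barcelona": 0.05, "real madrid": 0.04, "messi": 0.03},
}
p_concept = {"city": 0.6, "football club": 0.4}
# The co-occurring term "messi" pulls the cluster toward the football-club sense.
print(conceptualize(["barcelona", "messi"], p_term_given_concept, p_concept))
```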