Using Wikipedia's Big Data for creation of Knowledge Bases
Related papers
Shaping Wikipedia into a Computable Knowledge Base
2015
Wikipedia is arguably the most important information source yet invented for natural language processing (NLP) and artificial intelligence, in addition to its role as humanity’s largest encyclopedia. Wikipedia is the principal information source for such prominent services as IBM’s Watson [1], Freebase [2], the Google Knowledge Graph [3], Apple’s Siri [4], YAGO [5], and DBpedia [6], the core reference structure for linked open data [7]. Wikipedia information has assumed a prominent role in NLP applications in word sense disambiguation, named entity recognition, co-reference resolution, and multi-lingual alignments; in information retrieval in query expansion, multi-lingual retrieval, question answering, entity ranking, text categorization, and topic indexing; and in semantic applications in topic extraction, relation extraction, entity extraction, entity typing, semantic relatedness, and ontology building [8].
BABAR: Wikipedia Knowledge Extraction
This paper describes BABAR, a knowledge extraction and representation system, completely implemented in CLOS, that is primarily geared towards organizing and reasoning about knowledge extracted from the Wikipedia Website. The system combines natural language processing techniques, knowledge representation paradigms and machine learning algorithms. BABAR is currently an ongoing independent research project that, when sufficiently mature, may provide various commercial opportunities. BABAR uses natural language processing to parse both page names and page contents. It automatically generates Wikipedia topic taxonomies, thus providing a model for organizing the approximately 4,000,000 existing Wikipedia pages. It uses similarity metrics to establish concept relevancy and clustering algorithms to group topics based on semantic relevancy. Novel algorithms are presented that combine approaches from the areas of machine learning and recommender systems. The system also generates a knowledge hypergraph which will ultimately be used in conjunction with an automated reasoner to answer questions about particular topics. This paper describes the CLOS implementation of the various subcomponents of BABAR. These include a recursive descent parser, a hypergraph component, a number of new clustering and classification approaches, and a Horn clause theorem prover. Finally, this paper suggests how such a system can be used to implement a new generation of browsers called knowledge browsers.
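Although BABAR itself is implemented in CLOS, the knowledge hypergraph it describes can be pictured with a minimal sketch in Python. Everything below (the class name, edge labels, and topics) is invented for illustration and is not taken from BABAR's implementation.

```python
from collections import defaultdict

class KnowledgeHypergraph:
    """Minimal hypergraph: each hyperedge is a labelled relation that can
    connect any number of topic nodes (illustrative only; names are invented)."""

    def __init__(self):
        self.nodes = set()
        self.edges = {}                      # edge_id -> (label, frozenset of nodes)
        self.incidence = defaultdict(set)    # node -> set of edge_ids touching it

    def add_edge(self, edge_id, label, topics):
        topics = frozenset(topics)
        self.nodes |= topics
        self.edges[edge_id] = (label, topics)
        for t in topics:
            self.incidence[t].add(edge_id)

    def relations_of(self, topic):
        """Yield (label, other members) for every hyperedge touching `topic`."""
        for edge_id in self.incidence.get(topic, ()):
            label, members = self.edges[edge_id]
            yield label, members - {topic}

# Example: a tiny fragment of a Wikipedia-derived topic graph
kg = KnowledgeHypergraph()
kg.add_edge("e1", "subfield_of", ["Machine learning", "Artificial intelligence"])
kg.add_edge("e2", "used_for", ["Clustering", "Topic taxonomy", "Machine learning"])
print(list(kg.relations_of("Machine learning")))
```

An automated reasoner of the kind the abstract mentions would then operate over such labelled hyperedges rather than plain binary relations.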
Utilising Wikipedia for Text Mining Applications
ACM SIGIR Forum, 2016
The process whereby inferences are made from textual data is broadly referred to as text mining. In order to ensure the quality and effectiveness of the derived inferences, several approaches have been proposed for different text mining applications. Among these applications, classifying a piece of text into pre-defined classes through the utilisation of training data falls into supervised approaches, while arranging related documents or terms into clusters falls into unsupervised approaches. In both these approaches, processing is undertaken at the level of documents to make sense of the text within them. Recent research efforts have begun exploring the role of knowledge bases in solving the various problems that arise in the domain of text mining. Of all the knowledge bases, Wikipedia, on account of being one of the largest human-curated online encyclopaedias, has proven to be one of the most valuable resources in dealing with various problems in the domain of text mining. How...
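As a concrete illustration of the two families of approaches mentioned above, the sketch below contrasts supervised classification into pre-defined classes with unsupervised clustering of related documents. It assumes scikit-learn and uses toy documents; it is not drawn from the paper.

```python
# Supervised vs. unsupervised text mining on toy data (scikit-learn assumed).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.cluster import KMeans

docs = [
    "stock markets fell sharply on inflation fears",
    "the striker scored twice in the cup final",
    "central bank raises interest rates again",
    "the goalkeeper saved a late penalty",
]
labels = ["finance", "sport", "finance", "sport"]

X = TfidfVectorizer().fit_transform(docs)

# Supervised: classify into pre-defined classes using labelled training data.
clf = MultinomialNB().fit(X, labels)
print(clf.predict(X[:1]))            # -> ['finance']

# Unsupervised: group related documents into clusters without labels.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                    # cluster ids, e.g. [0 1 0 1]
```

Knowledge-base-assisted variants of both approaches typically replace or enrich the bag-of-words features with Wikipedia-derived concepts.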
KELVIN: a tool for automated knowledge base construction
We present KELVIN, an automated system for processing a large text corpus and distilling a knowledge base about persons, organizations, and locations. We have tested the KELVIN system on several corpora, including (a) the TAC KBP 2012 Cold Start corpus, which consists of public Web pages from the University of Pennsylvania, and (b) a subset of 26k news articles taken from English Gigaword 5th edition. Our NAACL HLT 2013 demonstration permits a user to interact with a set of searchable HTML pages, which are automatically generated from the knowledge base. Each page contains information analogous to the semi-structured details about an entity that are present in Wikipedia Infoboxes, along with hyperlink citations to supporting text.
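The kind of output described, infobox-like slots per entity with citations back to supporting text, can be pictured with a small sketch. The dataclass fields, slot names, and HTML layout below are hypothetical and are not KELVIN's actual schema.

```python
# Hypothetical per-entity record with provenance, rendered as a simple HTML page.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Fact:
    slot: str          # e.g. "per:employee_of" (illustrative slot name)
    value: str
    source_doc: str    # citation to the supporting document
    snippet: str       # text that justifies the fact

@dataclass
class Entity:
    name: str
    etype: str                          # PERSON, ORGANIZATION, or LOCATION
    facts: List[Fact] = field(default_factory=list)

    def to_html(self) -> str:
        rows = "".join(
            f"<tr><td>{f.slot}</td><td>{f.value}</td>"
            f"<td><a href='{f.source_doc}'>{f.snippet}</a></td></tr>"
            for f in self.facts
        )
        return f"<h1>{self.name} ({self.etype})</h1><table>{rows}</table>"

e = Entity("Jane Doe", "PERSON",
           [Fact("per:employee_of", "Acme Corp", "doc_0042.txt",
                 "Jane Doe, a senior engineer at Acme Corp, ...")])
print(e.to_html())
```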
A knowledge-based search engine powered by Wikipedia
International Conference on Information and Knowledge Management, Proceedings, 2007
This paper describes Koru, a new search interface that offers effective domain-independent knowledge-based information retrieval. Koru exhibits an understanding of the topics of both queries and documents. This allows it to (a) expand queries automatically and (b) help guide the user as they evolve their queries interactively. Its understanding is mined from the vast investment of manual effort and judgment that is Wikipedia. We show how this open, constantly evolving encyclopedia can yield inexpensive knowledge structures that are specifically tailored to expose the topics, terminology and semantics of individual document collections. We conducted a detailed user study with 12 participants and 10 topics from the 2005 TREC HARD track, and found that Koru and its underlying knowledge base offer significant advantages over traditional keyword search. It was capable of lending assistance to almost every query issued to it: making query entry more efficient, improving the relevance of the documents returned, and narrowing the gap between expert and novice seekers.
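Knowledge-based query expansion of the kind Koru performs can be pictured with a toy sketch. The RELATED_TERMS table below stands in for structures mined from Wikipedia, and the function is invented for illustration; it does not reflect Koru's actual algorithm.

```python
# Toy knowledge-based query expansion: OR related Wikipedia topic names onto
# the original keywords. The table would be mined from Wikipedia in practice.
RELATED_TERMS = {
    "jaguar": ["Jaguar Cars", "Panthera onca", "Jacksonville Jaguars"],
    "python": ["Python (programming language)", "Pythonidae", "Monty Python"],
}

def expand_query(query: str, max_terms: int = 3) -> str:
    expansions = []
    for token in query.lower().split():
        expansions.extend(RELATED_TERMS.get(token, [])[:max_terms])
    if not expansions:
        return query
    return query + " OR " + " OR ".join(f'"{t}"' for t in expansions)

print(expand_query("jaguar habitat"))
# jaguar habitat OR "Jaguar Cars" OR "Panthera onca" OR "Jacksonville Jaguars"
```

An interactive interface like Koru's would additionally let the user pick which of the suggested topics actually match their intent before the expanded query is run.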
Learning to integrate relational databases with wikipedia
2009
Wikipedia is a general encyclopedia of unprecedented breadth and popularity. However, much of the Web's factual information still lies within relational databases, each focused on a specific topic. While many database entities are described by corresponding Wikipedia pages, in general this correspondence is unknown unless it has been manually specified. As a result, Web databases cannot leverage the relevant rich descriptions and interrelationships captured in Wikipedia, and Wikipedia readers miss the extensive coverage that a database typically provides on its specific topic. In this paper, we present ETOW, a system that automatically integrates relational databases with Wikipedia. ETOW uses machine learning techniques to identify the correspondences between database entities and Wikipedia pages. In experiments with two distinct Web databases, we demonstrate that ETOW outperforms baseline techniques, reducing overall error by an average of 19% and reducing the false positive rate by 50%. In one experiment, ETOW is able to identify approximately 13,000 correct matches at a precision of 0.97. We also present evidence suggesting that ETOW can substantially improve the coverage and utility of both the relational databases and Wikipedia.
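The core matching decision, whether a database record and a Wikipedia page describe the same entity, can be sketched with a couple of simple features. ETOW learns this decision with machine learning; the hand-tuned weights, feature choices, and threshold below are illustrative assumptions only.

```python
# Simplified record-to-page matching: score a (database record, Wikipedia page)
# pair with two features and accept matches above a threshold.
from difflib import SequenceMatcher

def title_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(db_record: dict, wiki_page: dict) -> float:
    name_sim = title_similarity(db_record["name"], wiki_page["title"])
    # Fraction of the record's attribute values that also appear in the page text
    attrs = [v.lower() for k, v in db_record.items() if k != "name"]
    text = wiki_page["text"].lower()
    attr_overlap = sum(a in text for a in attrs) / max(len(attrs), 1)
    return 0.6 * name_sim + 0.4 * attr_overlap   # hand-tuned weights, for illustration

record = {"name": "Mount Rainier", "elevation": "4392 m", "state": "Washington"}
page = {"title": "Mount Rainier",
        "text": "Mount Rainier is a stratovolcano in Washington at 4,392 m ..."}
print(match_score(record, page) > 0.5)   # True -> treat as a correspondence
```

A learned matcher replaces the fixed weights with a classifier trained on labelled record-page pairs, which is what allows error rates to drop as in the reported experiments.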
A network of semantically structured wikipedia to bind information
2006
In this article we show how a network of cooperatively updated semi-formal knowledge bases with adequate knowledge valuation, organization and filtering mechanisms can solve the numerous problems of Wikipedia (lack of structure and evaluation of the information, limitation to overviews, edit wars, etc.) and be a good support for learning, research and, more generally, information sharing and retrieval.
Semantic Relationship Extraction and Ontology Building using Wikipedia: A Comprehensive Survey
International Journal of Computer Applications, 2010
The Semantic Web, as envisioned by Tim Berners-Lee, is highly dependent upon the availability of machine-readable information. Ontologies are one of the machine-readable formats that have been widely investigated. Several studies focus on how to extract concepts and semantic relations in order to build ontologies. Wikipedia is considered one of the most important knowledge sources for extracting semantic relations, since its semi-structured nature facilitates this challenge. In this paper we focus on the current state of this challenging field by discussing some of the recent studies on Wikipedia-based semantic extraction and highlighting their main contributions and results.
Exploring semantically-related concepts from Wikipedia: the case of SeRE
In this paper we present our web application SeRE, designed to explore semantically related concepts. Wikipedia and DBpedia are rich data sources for extracting related entities for a given topic, such as in- and out-links, broader and narrower terms, categorisation information, etc. We use the Wikipedia full text body to compute the semantic relatedness for extracted terms, which results in a list of entities that are most relevant for a topic. For any given query, the user interface of SeRE visualizes these related concepts, ordered by semantic relatedness, with snippets from Wikipedia articles that explain the connection between the query topic and each related entity. In a user study we examine how SeRE can be used to find important entities and their relationships for a given topic, and how the classification system can be used for filtering.
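A common way to compute text-based semantic relatedness of the kind SeRE relies on is cosine similarity over TF-IDF vectors of article text. The sketch below assumes scikit-learn and uses placeholder article snippets; SeRE's actual relatedness measure may differ.

```python
# Rank candidate entities by text-based semantic relatedness to a query topic,
# using TF-IDF cosine similarity over (placeholder) Wikipedia article texts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = {
    "Semantic Web":  "machine readable data linked across the web using RDF ...",
    "DBpedia":       "structured data extracted from Wikipedia published as RDF ...",
    "Impressionism": "a 19th-century art movement characterised by small brush strokes ...",
}

names = list(articles)
tfidf = TfidfVectorizer().fit_transform(articles.values())
sims = cosine_similarity(tfidf)

query = "Semantic Web"
i = names.index(query)
ranked = sorted(((sims[i, j], names[j]) for j in range(len(names)) if j != i),
                reverse=True)
print(ranked)   # DBpedia should rank above Impressionism for this toy corpus
```

In an interface like SeRE's, the snippets shown to the user would be the sentences that contribute most to these similarity scores, making the connection between the two entities explicit.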