The Wikipedia Corpus (original) (raw)

Semantically Annotated Snapshot of the English Wikipedia

2008

This paper describes SW1, the first version of a semantically annotated snapshot of the English Wikipedia. In recent years Wikipedia has become a valuable resource for both the Natural Language Processing (NLP) community and the Information Retrieval (IR) community. Although NLP technology for processing Wikipedia already exists, not all researchers and developers have the computational resources to process such a volume of information. Moreover, the use of different versions of Wikipedia, processed in different ways, can make it difficult to compare results. The aim of this work is to provide easy access to syntactic and semantic annotations for researchers in both the NLP and IR communities by building a reference corpus that homogenizes experiments and makes results comparable. These resources, a semantically annotated corpus and an "entity containment" derived graph, are licensed under the GNU Free Documentation License and available from http://www.yr-bcn.es/semanticWikipedia.

An open-source toolkit for mining Wikipedia

Artificial Intelligence, 2013

The online encyclopedia Wikipedia is a vast repository of information. For developers and researchers it represents a giant multilingual database of concepts and semantic relations; a promising resource for natural language processing and many other research areas. In this paper we introduce the Wikipedia Miner toolkit: an open-source collection of code that allows researchers and developers to easily integrate Wikipedia's rich semantics into their own applications.

Wikipedia Mining: Wikipedia as a Corpus for Knowledge Extraction

Wikipedia, a collaborative Wiki-based encyclopedia, has become a huge phenomenon among Internet users. It covers a huge number of concepts in fields as varied as Arts, Geography, History, Science, Sports and Games. As a corpus for knowledge extraction, Wikipedia's impressive characteristics are not limited to its scale; they also include its dense link structure, URL-based word sense disambiguation, and brief anchor texts. Because of these characteristics, Wikipedia has become a promising corpus and a big frontier for researchers. A considerable number of studies on Wikipedia mining, such as semantic relatedness measurement, bilingual dictionary construction, and ontology construction, have been conducted. In this paper, we take a comprehensive, panoramic view of Wikipedia as a Web corpus, since almost all previous studies exploit only parts of Wikipedia's characteristics. The contribution of this paper is threefold. First, we describe in detail the characteristics of Wikipedia as a corpus for knowledge extraction. In particular, we place special emphasis on anchor texts, since they provide helpful information for both disambiguation and synonym extraction. Second, we introduce some of our own Wikipedia mining research, as well as research conducted by others, in order to demonstrate the worth of Wikipedia. Finally, we discuss possible directions for Wikipedia research.
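The anchor texts the abstract emphasizes are the visible labels of wiki links, and collecting (anchor, target) pairs is the basic step behind both synonym extraction and disambiguation. A minimal sketch, assuming raw wikitext as input and handling only the plain `[[Target]]` and `[[Target|anchor]]` link forms (real wikitext has many more constructs):

```python
import re
from collections import defaultdict

# Matches [[Target|anchor]] and [[Target]] links; an optional #section
# fragment on the target is stripped. Other wikitext constructs are ignored.
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]*))?\]\]")

def extract_anchors(wikitext):
    """Return a mapping from link target to the set of anchor texts used for it."""
    anchors = defaultdict(set)
    for m in LINK_RE.finditer(wikitext):
        target = m.group(1).strip()
        # A bare [[Target]] link uses the target itself as its anchor text.
        anchor = (m.group(2) or target).strip()
        if anchor:
            anchors[target].add(anchor)
    return anchors

text = ("The [[NLP|natural language processing]] community uses "
        "[[Wikipedia]] and [[Wikipedia|the free encyclopedia]].")
print(extract_anchors(text))
# anchors grouped by target, e.g. 'Wikipedia' -> {'Wikipedia', 'the free encyclopedia'}
```

Aggregated over the whole dump, the set of anchors pointing at one target yields synonym candidates, while the set of targets reachable from one anchor yields its candidate senses.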

Omnipedia: bridging the wikipedia language gap

2012

Abstract We present Omnipedia, a system that allows Wikipedia readers to gain insight from up to 25 language editions of Wikipedia simultaneously. Omnipedia highlights the similarities and differences that exist among Wikipedia language editions, and makes salient information that is unique to each language as well as that which is shared more widely. We detail solutions to numerous front-end and algorithmic challenges inherent to providing users with a multilingual Wikipedia experience.

A quantitative approach to the use of the Wikipedia

2009 IEEE Symposium on Computers and Communications, 2009

This paper presents a quantitative study of the use of the Wikipedia system by its users (both readers and editors), with special focus on the identification of time and kind-of-use patterns, characterization of traffic and workload, and comparative analysis of different language editions. The basis of the study is the filtering and analysis of a large sample of the requests directed to the Wikimedia systems over six weeks between November 2007 and April 2008. In particular, we have considered the twenty most frequently visited language editions of the Wikipedia, identifying for each access the corresponding namespace (sets of resources with uniform semantics), resource name (article names, for example) and action (edits, submissions, history reviews, save operations, etc.). The results include the identification of weekly and daily patterns, and several correlations between different actions on the articles. In summary, the study shows an overall picture of how the most visited language editions of the Wikipedia are being accessed by their users.
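The per-request classification described above (language edition, namespace, resource, action) can be sketched for standard MediaWiki URL layouts. This is a simplified illustration, not the study's actual filtering pipeline: it handles only `/wiki/Title` reads and `/w/index.php?title=...&action=...` requests, and the namespace list is abbreviated.

```python
from urllib.parse import urlsplit, parse_qs, unquote

# Abbreviated set of MediaWiki namespaces for illustration.
NAMESPACES = {"Talk", "User", "Wikipedia", "File", "Template", "Category", "Special"}

def classify_request(url):
    """Classify a Wikipedia request URL into (language, namespace, title, action)."""
    parts = urlsplit(url)
    lang = parts.netloc.split(".")[0]          # e.g. "en" from en.wikipedia.org
    if parts.path.startswith("/wiki/"):
        # Pretty URLs are plain article views.
        title, action = unquote(parts.path[len("/wiki/"):]), "view"
    else:
        # index.php-style URLs carry the title and action as query parameters.
        qs = parse_qs(parts.query)
        title = qs.get("title", [""])[0]
        action = qs.get("action", ["view"])[0]
    ns = title.split(":", 1)[0] if ":" in title else "Main"
    ns = ns if ns in NAMESPACES else "Main"    # unknown prefixes are article titles
    return lang, ns, title, action

print(classify_request("http://en.wikipedia.org/w/index.php?title=Talk:NLP&action=history"))
# ('en', 'Talk', 'Talk:NLP', 'history')
```

Counting these tuples over a request sample is what exposes the weekly/daily patterns and per-action correlations the study reports.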

Shaping Wikipedia into a Computable Knowledge Base

2015

Wikipedia is arguably the most important information source yet invented for natural language processing (NLP) and artificial intelligence, in addition to its role as humanity’s largest encyclopedia. Wikipedia is the principal information source for such prominent services as IBM’s Watson [1], Freebase [2], the Google Knowledge Graph [3], Apple’s Siri [4], YAGO [5], and DBpedia [6], the core reference structure for linked open data [7]. Wikipedia information has assumed a prominent role in NLP applications in word sense disambiguation, named entity recognition, co-reference resolution, and multi-lingual alignments; in information retrieval in query expansion, multi-lingual retrieval, question answering, entity ranking, text categorization, and topic indexing; and in semantic applications in topic extraction, relation extraction, entity extraction, entity typing, semantic relatedness, and ontology building [8].

Academic research into Wikipedia

Digithum, 2012

This issue looks in depth at the multiplicity of the social and cultural impacts of Wikipedia. The articles analyse issues including its development and the consequences for the commercial sector and the public image of large corporations (in the article by Marcia W. DiStaso and Marcus Messner) and its role in the diffusion of culture and architectural heritage (in the article by Emilio José Rodriguez et al.). The article by Antoni Oliver and Salvador Climent details the use of Wikipedia as a structured knowledge corpus, in the framework of the state of the art in natural language processing research. In turn, the article by David Gómez proposes the concept of the wikimediasphere and shows how Wikipedia actually forms part of a very dense ecosystem of projects that, though they share common elements, act with a high level of autonomy as nodes on a wider network. Lastly, the article by Nathaniel Tkacz analyses the practical and epistemological implications of one of the basic pillars of Wikipedia's core content policy, the Neutral Point of View, and its relation to a specific concept of truth.

GALATEAS D2W: A Multi-lingual Disambiguation to Wikipedia Web Service

The motivation for entity extraction within a digital cultural collection is the enrichment potential of such a tool, useful in this context for tasks such as metadata generation and query log analysis. The use of Disambiguation to Wikipedia as our particular entity extraction tool is motivated by its generalisable nature and its suitability to noisy text. The particular methodology we use does not rely on language-specific natural language tools and can therefore be applied to other languages with minimal adaptation. This has allowed us to develop a multi-lingual Disambiguation to Wikipedia tool, which we have deployed as a web service for the use of the community.

1. We use the title John McCarthy (computer scientist) to refer to the full address http://en.wikipedia.org/wiki/John_McCarthy_(computer_scientist).
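A common language-independent baseline for Disambiguation to Wikipedia is "commonness": pick the target that the mention most often links to in Wikipedia itself. A minimal sketch under that assumption, with toy counts that are purely illustrative (real systems derive them from the full link graph and combine them with context):

```python
# Toy anchor-text statistics: counts of (anchor -> target) links, as would be
# gathered from Wikipedia's own internal links. Figures are made up for
# illustration; the real service's statistics and model differ.
ANCHOR_COUNTS = {
    "john mccarthy": {
        "John McCarthy (computer scientist)": 820,
        "John McCarthy (journalist)": 140,
    },
}

def disambiguate(mention):
    """Return the most frequently linked Wikipedia target for a mention
    (the 'commonness' baseline), or None if the anchor is unknown."""
    senses = ANCHOR_COUNTS.get(mention.lower())
    if not senses:
        return None
    return max(senses, key=senses.get)

print(disambiguate("John McCarthy"))
# John McCarthy (computer scientist)
```

Because the statistics come from link counts rather than from parsers or taggers, nothing in this baseline is tied to one language, which is what makes the multi-lingual deployment described above feasible.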

Wikicorpus: A word-sense disambiguated multilingual Wikipedia corpus

2010

This article presents a new freely available trilingual corpus (Catalan, Spanish, English) that contains large portions of the Wikipedia and has been automatically enriched with linguistic information. To our knowledge, this is the largest such corpus that is freely ...