Wikipedia Revision Toolkit: Efficiently Accessing Wikipedia's Edit History
Related papers
Finding Structure in Wikipedia Edit Activity
Proceedings of the 25th International Conference Companion on World Wide Web - WWW '16 Companion
This paper documents a study of the real-time Wikipedia edit stream, covering over 6 million edits to 1.5 million English Wikipedia articles during 2015. We focus on questions related to the identification and use of information cascades between Wikipedia articles, based on author editing activity. Our findings show that by constructing information cascades between Wikipedia articles from editing activity, we can build an alternative linking structure to the embedded links within a Wikipedia page. This alternative article hyperlink structure was found to be topically relevant, and timely in relation to external global events (e.g., political activity). Based on our analysis, we contextualise the findings against areas of interest such as event detection, vandalism, edit wars, and editing behaviour.
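The cascade construction the abstract describes can be sketched in a few lines. The code below is a hypothetical illustration only: the edit-tuple format, the `window` threshold, and the edge rule (same author edits two articles in quick succession) are assumptions, not the paper's actual method.

```python
from collections import defaultdict

def cascade_edges(edits, window=3600):
    """Infer article-to-article edges from editing activity.

    `edits` is a list of (timestamp, author, article) tuples, assumed
    sorted by timestamp.  An edge (a, b) is counted whenever the same
    author edits article b within `window` seconds of editing article a.
    """
    last_edit = {}                      # author -> (timestamp, article)
    edges = defaultdict(int)
    for ts, author, article in edits:
        if author in last_edit:
            prev_ts, prev_article = last_edit[author]
            if prev_article != article and ts - prev_ts <= window:
                edges[(prev_article, article)] += 1
        last_edit[author] = (ts, article)
    return dict(edges)

edits = [(0, "u1", "A"), (100, "u1", "B"), (200, "u2", "A"), (5000, "u2", "B")]
# u1's A -> B transition falls inside the window; u2's does not
```

The resulting weighted edge set is the "alternative linking structure" that can then be compared against the embedded hyperlinks of the same pages.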
Wikipedia editing history in DBpedia
IEEE/WIC/ACM International Joint Conference on Web Intelligence (WI '16)
DBpedia is a large dataset extracted mainly from the content and structure of Wikipedia. We present a new extraction producing a linked-data representation of the editing history of Wikipedia pages. This supports custom querying and combination with other data, providing new indicators and insights. We explain the architecture, the representation, and an immediate application to monitoring events.
Revisiting reverts: accurate revert detection in Wikipedia
2012
Wikipedia is commonly used as a proving ground for research in collaborative systems. This is likely due to its popularity and scale, but also to the fact that large amounts of data about its formation and evolution are freely available to inform and validate theories and models of online collaboration.
The illiterate editor: metadata-driven revert detection in Wikipedia
As the community depends more heavily on Wikipedia as a source of reliable information, the ability to quickly detect and remove detrimental information becomes increasingly important. The longer incorrect or malicious information lingers in a source perceived as reputable, the more likely that information will be accepted as correct and the greater the loss to source reputation. We present The Illiterate Editor (IllEdit), a content-agnostic, metadata-driven classification approach to Wikipedia revert detection. Our primary contribution is in building a metadata-based feature set for detecting edit quality, which is then fed into a Support Vector Machine for edit classification. By analyzing edit histories, the IllEdit system builds a profile of user behavior, estimates expertise and spheres of knowledge, and determines whether or not a given edit is likely to be eventually reverted. The success of the system in revert detection (0.844 F-measure) as well as its disjoint feature set as compared to existing, content-analyzing vandalism detection systems, shows promise in the synergistic usage of IllEdit for increasing the reliability of community information.
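A metadata-only feature extractor in the spirit of IllEdit might look like the following sketch. The field names and feature choices here are illustrative assumptions, not the paper's actual feature set; the key point is that no article text is inspected, only edit and author metadata.

```python
def edit_features(edit, author_history):
    """Build a metadata-only feature vector for one edit.

    `edit` is a dict of revision metadata; `author_history` is a list of
    the author's prior edits, each with a `was_reverted` flag.  In a full
    system the returned dict would be vectorised and fed to a classifier
    such as an SVM.
    """
    return {
        "size_delta": edit["new_size"] - edit["old_size"],
        "comment_len": len(edit["comment"]),
        "is_anonymous": int(edit["author"] is None),
        "author_edit_count": len(author_history),
        "author_revert_rate": (
            sum(e["was_reverted"] for e in author_history) / len(author_history)
            if author_history else 0.0
        ),
    }

edit = {"new_size": 120, "old_size": 100, "comment": "fix typo", "author": None}
history = [{"was_reverted": True}, {"was_reverted": False}]
features = edit_features(edit, history)
```

Because the features are content-agnostic, such a classifier is disjoint from text-analysing vandalism detectors and can be combined with them.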
TokTrack: A Complete Token Provenance and Change Tracking Dataset for the English Wikipedia
2017
We present a dataset that contains every instance of all tokens (≈ words) ever written in undeleted, non-redirect English Wikipedia articles until October 2016, in total 13,545,349,787 instances. Each token is annotated with (i) the article revision it was originally created in, and (ii) lists of all the revisions in which the token was ever deleted and (potentially) re-added and re-deleted from its article, enabling a complete and straightforward tracking of its history. This data would be exceedingly hard for an average user to create, as it is (i) very expensive to compute and (ii) accurately tracking the history of each token in revisioned documents is a non-trivial task. Adapting a state-of-the-art algorithm, we have produced a dataset that allows a range of analyses and metrics, already popular in research and going beyond, to be generated at complete-Wikipedia scale, ensuring quality and allowing researchers to forego expensive text-comparison computation, which has so far hindered scalable usage. We show how this data enables, at token level, computation of provenance, measuring survival of content over time, very detailed conflict metrics, and fine-grained interactions of editors such as partial reverts and re-additions, in the process gaining several novel insights.
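The per-token bookkeeping such a dataset encodes can be illustrated with a toy sketch. This is hypothetical code, not the authors' adapted algorithm: it uses a plain sequence diff and skips re-add detection (matching a newly inserted token back to a previously deleted instance), which is the genuinely hard part of accurate token tracking.

```python
from difflib import SequenceMatcher

def track_tokens(revisions):
    """Simplified token provenance over a list of revision texts.

    Each token instance gets a record with the revision index that
    created it (`origin`) and the revision indices that deleted it
    (`deleted_in`).  Replaced tokens are treated as delete + insert.
    """
    prev_tokens, prev_recs = [], []
    all_records = []
    for rev_idx, text in enumerate(revisions):
        tokens = text.split()
        recs = [None] * len(tokens)
        sm = SequenceMatcher(a=prev_tokens, b=tokens, autojunk=False)
        for op, i1, i2, j1, j2 in sm.get_opcodes():
            if op == "equal":
                for i, j in zip(range(i1, i2), range(j1, j2)):
                    recs[j] = prev_recs[i]          # token survives unchanged
            else:
                for i in range(i1, i2):             # deleted (or replaced) tokens
                    prev_recs[i]["deleted_in"].append(rev_idx)
                for j in range(j1, j2):             # newly inserted tokens
                    recs[j] = {"token": tokens[j], "origin": rev_idx, "deleted_in": []}
                    all_records.append(recs[j])
        prev_tokens, prev_recs = tokens, recs
    return all_records

revs = ["the cat sat", "the cat sat down", "the dog sat down"]
records = track_tokens(revs)
cat = next(r for r in records if r["token"] == "cat")
```

Even this naive version supports survival analysis (how long a token lives before its first deletion); the dataset's value is that the expensive, accurate version of this computation is precomputed for all of English Wikipedia.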
2008
Wikipedia, the popular online encyclopedia, has in just six years grown from an adjunct to the now-defunct Nupedia to over 31 million pages and 429 million revisions in 256 languages, and has spawned sister projects such as Wiktionary and Wikisource. Available under the GNU Free Documentation License, it is an extraordinarily large corpus with broad scope and constant updates. Its articles are largely consistent in structure and organized into category hierarchies.
Learning To Split and Rephrase From Wikipedia Edit History
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018
Split and rephrase is the task of breaking down a sentence into shorter ones that together convey the same meaning. We extract a rich new dataset for this task by mining Wikipedia's edit history: WikiSplit contains one million naturally occurring sentence rewrites, providing sixty times more distinct split examples and a ninety times larger vocabulary than the WebSplit corpus introduced by Narayan et al. (2017) as a benchmark for this task. Incorporating WikiSplit as training data produces a model with qualitatively better predictions that score 32 BLEU points above the prior best result on the WebSplit benchmark.
iChase: Supporting exploration and awareness of editing activities on Wikipedia
2010
To increase its credibility and preserve the trust of its readers, Wikipedia needs to ensure the quality of its articles. To that end, it is critical for Wikipedia administrators to be aware of contributors' editing activity in order to monitor vandalism, encourage reliable contributors to work on specific articles, or find mentors for new contributors. In this paper, we present iChase, a novel interactive visualization tool that provides administrators with better awareness of editing activities on Wikipedia.