Research:Content persistence - Meta-Wiki

From Meta, a Wikimedia project coordination wiki

Figure 1. Word persistence example: words persisting between revisions of a Wikipedia article about apples.

Content persistence measures how content survives through the revision history of a wiki page. It rests on the assumption that content that survives a certain amount of time, or a certain number of subsequent revisions, does so because of its inherent quality and its relevance to the article. This assumption follows from viewing wikis' publish-first, edit-later model as a form of informal peer review[1], in which low-quality contributions should be quickly removed or overwritten by subsequent edits. In this way, content persistence can be viewed as a generalization of revert rate.

The persistence of content through the revisions of an article is generally determined by performing textual diffs between revisions and tracking the content that does not change. Figure 1 depicts words persisting between revisions of a toy example of an article about apples. The information obtained by performing a diff between the revisions might look as follows:
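A minimal sketch of how such diff information could be computed for the first two toy revisions, assuming a simple regex tokenizer and Python's standard `difflib` module. The names `tokenize` and `diff_ops` are illustrative assumptions, not part of any library mentioned here.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    # Split text into words, runs of whitespace, and single punctuation marks.
    return re.findall(r"\w+|\s+|[^\w\s]", text)


def diff_ops(old_text, new_text):
    """Return a list of (operation, removed_tokens, added_tokens) tuples."""
    old, new = tokenize(old_text), tokenize(new_text)
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    return [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


# Toy revisions 1 and 2 of the article about apples.
OPS = diff_ops("Apples are red.", "Apples are blue.")
for tag, removed, added in OPS:
    print(tag, removed, added)
```

For these two revisions the only non-equal operation replaces "red" with "blue"; every other token is reported as unchanged.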

By tracing this diff information, a data structure can be built that keeps track of discrete content items and attributes them to their original author. In order to turn text into discrete content items, a tokenizer is used to discover word boundaries. Once content is broken into tokens, identifiers can be associated with them so that they can be tracked through the history of a page. For example:

  1. (1, "Apples"), (1, " "), (1, "are"), (1, " "), (1, "red"), (1, ".")
  2. (1, "Apples"), (1, " "), (1, "are"), (1, " "), (2, "blue"), (1, ".")
  3. (1, "Apples"), (1, " "), (1, "are"), (1, " "), (1, "red"), (1, ".")
  4. (1, "Apples"), (1, " "), (1, "are"), (1, " "), (4, "tasty"), (4, " "), (4, "and"), (4, " "), (1, "red"), (1, ".")
  5. (1, "Apples"), (1, " "), (1, "are"), (1, " "), (4, "tasty"), (4, " "), (4, "and"), (4, " "), (5, "blue"), (1, ".")

In the last revision's list of tokens, the identifier "1" makes it clear that "Apples are " was added by the first revision. It may be surprising, however, that in the same revision "blue" is given a new identifier (5) rather than reusing the (2, "blue") seen in revision #2. The reason is that a diff alone cannot establish that a newly inserted word is the same contribution as one that was removed earlier, so the token is treated as new. In revision #3, by contrast, (1, "red") persisted even though it was absent from revision #2. This is due to an identity revert: a revision that exactly duplicates a previous revision. Since the content is exactly duplicated, an algorithm can be sure that the tokens are the same ones.
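The attribution procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation of any particular library: the regex tokenizer, the use of `difflib`, the exact-text lookup for identity reverts, and the names `tokenize`, `attribute` and `REVISIONS` are all hypothetical choices.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    # Words, runs of whitespace, and single punctuation marks.
    return re.findall(r"\w+|\s+|[^\w\s]", text)


def attribute(revisions):
    """Map each revision id to a list of (origin_rev_id, token) pairs.

    `revisions` is a chronological list of (rev_id, text) pairs.
    """
    seen = {}      # full text -> attributed tokens (identity-revert lookup)
    result = {}
    previous = []  # attributed tokens of the preceding revision
    for rev_id, text in revisions:
        if text in seen:
            # Identity revert: reuse the tokens of the duplicated revision.
            current = list(seen[text])
        else:
            words = tokenize(text)
            matcher = SequenceMatcher(
                None, [tok for _, tok in previous], words, autojunk=False
            )
            current = []
            for tag, i1, i2, j1, j2 in matcher.get_opcodes():
                if tag == "equal":
                    # Unchanged tokens keep their original identifier.
                    current.extend(previous[i1:i2])
                elif tag in ("insert", "replace"):
                    # New tokens are attributed to this revision.
                    current.extend((rev_id, w) for w in words[j1:j2])
                # "delete": removed tokens are simply not carried over.
            seen[text] = current
        result[rev_id] = current
        previous = current
    return result


REVISIONS = [
    (1, "Apples are red."),
    (2, "Apples are blue."),
    (3, "Apples are red."),
    (4, "Apples are tasty and red."),
    (5, "Apples are tasty and blue."),
]
ATTRIBUTED = attribute(REVISIONS)
```

On the toy history, revision 3 reuses the tokens of revision 1 because its text is identical, while "blue" in revision 5 is attributed to revision 5 because it arrives through an ordinary diff. Keeping every past text in memory is only practical for a toy example; a real system would more plausibly compare checksums over a bounded window of recent revisions.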

Another way to view this set of words is to transform it into a token-major list that expresses which revisions contained the word. For example:
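One way to derive such a token-major view from the revision-major lists is sketched below. Whitespace tokens are omitted for brevity, and the sketch assumes that each (identifier, text) pair is unique, which holds for this toy example but would need a proper per-token identifier in practice. The names `REVISION_MAJOR` and `to_token_major` are illustrative.

```python
from collections import defaultdict

# Revision-major lists from the toy example, without whitespace tokens.
REVISION_MAJOR = {
    1: [(1, "Apples"), (1, "are"), (1, "red"), (1, ".")],
    2: [(1, "Apples"), (1, "are"), (2, "blue"), (1, ".")],
    3: [(1, "Apples"), (1, "are"), (1, "red"), (1, ".")],
    4: [(1, "Apples"), (1, "are"), (4, "tasty"), (4, "and"), (1, "red"), (1, ".")],
    5: [(1, "Apples"), (1, "are"), (4, "tasty"), (4, "and"), (5, "blue"), (1, ".")],
}


def to_token_major(revision_major):
    """Map each token to the ordered list of revisions that contain it."""
    token_major = defaultdict(list)
    for rev_id in sorted(revision_major):
        for token in revision_major[rev_id]:
            token_major[token].append(rev_id)
    return dict(token_major)


TOKEN_MAJOR = to_token_major(REVISION_MAJOR)
for (origin, text), revs in TOKEN_MAJOR.items():
    print(f"({origin}, {text!r}): {revs}")
# (1, 'Apples'): [1, 2, 3, 4, 5]
# (1, 'are'): [1, 2, 3, 4, 5]
# (1, 'red'): [1, 3, 4]
# (1, '.'): [1, 2, 3, 4, 5]
# (2, 'blue'): [2]
# (4, 'tasty'): [4, 5]
# (4, 'and'): [4, 5]
# (5, 'blue'): [5]
```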

From this list, it is easy to see how many revisions a given token persisted through. For example, "Apples" was added in the first revision and appeared in 4 subsequent revisions. Under the assumption that subsequent revisions of the page represent informal review of the contents, one might assume that the token "Apples" was a high-quality contribution to the article. However, this assumption breaks down for content that was only recently inserted into the article. This problem is commonly referred to as right censoring since, when time is plotted from left to right, the samples on the right side carry less information. To state it simply, we need more revisions of the article before we can know whether "tasty" was a good addition or not. However, we can conclude quite confidently that the token "blue" that was added in revision #2 was not of high quality, since it did not persist for a single revision (it was immediately reverted).
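The counting and the right-censoring caveat can be sketched as below. The `window` parameter (how many later revisions must exist before a surviving token is judged) and the function name `persistence` are assumptions made for illustration; the text does not prescribe a threshold.

```python
# Token-major view of the toy example (whitespace tokens omitted).
TOKEN_MAJOR = {
    (1, "Apples"): [1, 2, 3, 4, 5],
    (1, "are"): [1, 2, 3, 4, 5],
    (1, "red"): [1, 3, 4],
    (1, "."): [1, 2, 3, 4, 5],
    (2, "blue"): [2],
    (4, "tasty"): [4, 5],
    (4, "and"): [4, 5],
    (5, "blue"): [5],
}
ALL_REVS = [1, 2, 3, 4, 5]


def persistence(token_major, all_revs, window=3):
    """Return {token: (revisions_survived, censored)}.

    A token is flagged as censored when it is still present in the latest
    revision and fewer than `window` revisions have followed its insertion,
    so its eventual fate cannot be judged yet.
    """
    latest = all_revs[-1]
    stats = {}
    for (origin, text), revs in token_major.items():
        survived = len(revs) - 1  # later revisions that contain the token
        observable = len(all_revs) - 1 - all_revs.index(origin)
        still_present = revs[-1] == latest
        censored = still_present and observable < window
        stats[(origin, text)] = (survived, censored)
    return stats


STATS = persistence(TOKEN_MAJOR, ALL_REVS)
print(STATS[(1, "Apples")])  # (4, False)
print(STATS[(4, "tasty")])   # (1, True)
print(STATS[(2, "blue")])    # (0, False)
```

With a window of three revisions, "Apples" has been observed long enough to count as persisting, "tasty" is flagged as censored, and the "blue" from revision #2 is confidently scored as surviving zero revisions because it was removed rather than cut off by the end of the history.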

There's one additional issue to be concerned with: How much does whitespace matter? And for that matter, what about stop words (grayed out in Figure 1)? Recent research[2][3] has excluded whitespace, stop words, and wiki markup when computing the value, quality and productivity of editors' work.
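A minimal sketch of such filtering is shown below. The stop-word set is a tiny illustrative sample rather than the list used in the cited research, and treating any token without a letter or digit as punctuation or markup is a simplifying assumption; `is_content_token` is a hypothetical helper name.

```python
# Illustrative stop words only; real studies use much longer lists.
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the"}


def is_content_token(text):
    """Return True for tokens that should count toward persistence."""
    stripped = text.strip()
    if not stripped:
        return False  # whitespace-only token
    if not any(ch.isalnum() for ch in stripped):
        return False  # punctuation or markup such as "." or "[["
    return stripped.lower() not in STOP_WORDS


# Attributed tokens of the last toy revision.
TOKENS = [
    (1, "Apples"), (1, " "), (1, "are"), (1, " "),
    (4, "tasty"), (4, " "), (4, "and"), (4, " "),
    (5, "blue"), (1, "."),
]
CONTENT = [tok for tok in TOKENS if is_content_token(tok[1])]
print(CONTENT)  # [(1, 'Apples'), (4, 'tasty'), (5, 'blue')]
```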

Openly licensed code for tracing content persistence through the revision history of an article is publicly available in the python-mwpersistence library (see the notes from the February 2016 architectural discussion).

The WikiWho API provides, for each element of the tokenized Wikitext of an article at any given revision, the revision in which the token was originally added and all revisions in which the token was deleted or reinserted. This enables content persistence measurements of several kinds, e.g., aggregated per token or per editor in the article. For per-user aggregations of persisting content over all articles, see a secondary endpoint for edit persistence. Available language editions to date: EN, DE, ES, TR, EU.

  1. Stvilia, B., Twidale, M. B., Smith, L. C., & Gasser, L. (2005). Information quality work organization in Wikipedia. Journal of the American Society for Information Science and Technology, 59(6), 983-1001.
  2. Priedhorsky, R., Chen, J., Lam, S. K., Panciera, K., Terveen, L., & Riedl, J. (2007). Creating, destroying, and restoring value in Wikipedia, GROUP (pp. 259-268).
  3. Aaron Halfaker, Aniket Kittur, & John Riedl (2011). Don't bite the Newbies: How reverts affect the quantity and quality of Wikipedia work, The 7th International Symposium on Wikis and Open Collaboration (pp. 163-172). doi:10.1145/2038558.2038585
  4. Aaron Halfaker, Aniket Kittur, Robert E. Kraut, & John Riedl (2009). A Jury of Your Peers: Quality, Experience and Ownership in Wikipedia, The 5th International Symposium on Wikis and Open Collaboration, Article 15, 10 pages. doi:10.1145/1641309.1641332
  5. B. Thomas Adler and Luca de Alfaro. A Content-Driven Reputation System for the Wikipedia. Technical Report ucsc-crl-06-18, School of Engineering, University of California, Santa Cruz, 2006.
  6. B. Thomas Adler, Krishnendu Chatterjee, Luca de Alfaro, Marco Faella, Ian Pye, Vishwanath Raman, Assigning Trust to Wikipedia Content, in WikiSym '08: Proceedings of the 2008 international symposium on Wikis, May 2008
  7. Tom Cross, "Puppy smoothies: Improving the reliability of open, collaborative wikis," First Monday, volume 11, number 9 (September 2006)
  8. Rosta Farzan, Robert E. Kraut, "Wikipedia Classroom Experiment: bidirectional benefits of students' engagement in online production communities," CHI '13, April 27–May 2, 2013, Paris, France.
  9. Flöck, Fabian; Erdogan, Kenan; Acosta, Maribel (2017-05-03). TokTrack: A Complete Token Provenance and Change Tracking Dataset for the English Wikipedia. Eleventh International AAAI Conference on Web and Social Media.