How Do Metrics of Link Analysis Correlate to Quality, Relevance and Popularity in Wikipedia

Evaluating Link-based Recommendations for Wikipedia

Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), 2016

Literature recommender systems support users in filtering the vast and increasing number of documents in digital libraries and on the Web. For academic literature, research has proven the ability of citation-based document similarity measures, such as Co-Citation (CoCit) or Co-Citation Proximity Analysis (CPA), to improve recommendation quality. In this paper, we report on the first large-scale investigation of the performance of the CPA approach in generating literature recommendations for Wikipedia, which is fundamentally different from the academic literature domain. We analyze links instead of citations to generate article recommendations. We evaluate CPA, CoCit, and the Apache Lucene MoreLikeThis (MLT) function, which represents a traditional text-based similarity measure. We use two datasets of 779,716 and 2.57 million Wikipedia articles, the Big Data processing framework Apache Flink, and a ten-node computing cluster. To enable our large-scale evaluation, we derive two quasi-gold standards from the links in Wikipedia's "See also" sections and a comprehensive Wikipedia clickstream dataset. Our results show that the citation-based measures CPA and CoCit have complementary strengths compared to the text-based MLT measure. While MLT performs well in identifying narrowly similar articles that share similar words and structure, the citation-based measures are better able to identify topically related information, such as information on the city of a certain university or other technical universities in the region. The CPA approach, which consistently outperformed CoCit, is better suited for identifying a broader spectrum of related articles, as well as popular articles that typically exhibit a higher quality. Additional benefits of the CPA approach are its lower runtime requirements and its language independence, which allows for cross-language retrieval of articles. We present a manual analysis of exemplary articles to demonstrate and discuss our findings. The raw data and source code of our study, together with a manual on how to use them, are openly available at: https://github.com/wikimedia/citolytics
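To make the link-based measures concrete, the following is a minimal Python sketch of CoCit counting and a CPA-style proximity weighting. It assumes each article's outgoing links are available in document order; the `outlinks` structure and the `1 / distance**alpha` weighting are illustrative assumptions, not the paper's exact implementation (which runs on Apache Flink at scale).

```python
from collections import defaultdict
from itertools import combinations

def cocitation_scores(outlinks):
    """CoCit: count how often two articles are linked from the same page."""
    scores = defaultdict(int)
    for links in outlinks.values():
        for pair in combinations(sorted(set(links)), 2):
            scores[pair] += 1
    return scores

def cpa_scores(outlinks, alpha=0.5):
    """CPA-style score: each co-occurrence is weighted by the proximity of
    the two links in the citing article (1 / distance**alpha, an assumed
    weighting function)."""
    scores = defaultdict(float)
    for links in outlinks.values():
        for (i, a), (j, b) in combinations(enumerate(links), 2):
            if a != b:
                scores[tuple(sorted((a, b)))] += 1.0 / abs(i - j) ** alpha
    return scores

# Toy link lists in document order (illustrative only).
outlinks = {
    "TU Berlin": ["Berlin", "University", "Engineering"],
    "HU Berlin": ["Berlin", "University", "Humanities"],
}
print(cocitation_scores(outlinks)[("Berlin", "University")])  # 2
print(cpa_scores(outlinks)[("Berlin", "University")])         # 2.0 (adjacent links)
```

Both measures only require the link structure, not the article text, which is where the language independence noted in the abstract comes from.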

An Empirical Study to Predict the Quality of Wikipedia Articles

Wikipedia is widely considered a more effective way to deliver encyclopedic content than other types of encyclopedia. However, quality remains a concern for Wikipedia articles. The basic aim of the proposed research is to perform an empirical study to predict the quality of Wikipedia articles. In the proposed methodology, we consider a few metrics, such as article length (the total number of words in an article), number of edits, article age (in days), and article ranking, and perform several statistical tests to analyze the quality of Wikipedia articles. Moreover, we observe a significant correlation between the proposed metrics and the rating of articles, which allows their quality to be identified.
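As an illustration of the kind of analysis described, here is a small sketch correlating such metrics with ratings via Spearman's rank correlation. The sample data is invented, and the choice of Spearman's test (suited to ordinal ratings) is an assumption; the paper's exact statistical tests may differ.

```python
from scipy.stats import spearmanr

# Invented sample: (length in words, number of edits, age in days, rating)
articles = [
    (12000, 350, 2400, 5),
    (4500, 120, 1500, 4),
    (800, 20, 400, 2),
    (300, 5, 100, 1),
]

ratings = [a[3] for a in articles]
for i, name in enumerate(["article length", "number of edits", "article age"]):
    values = [a[i] for a in articles]
    rho, p = spearmanr(values, ratings)
    print(f"{name}: rho={rho:.2f}, p={p:.3f}")
```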

Ranking of Wikipedia articles revisited: Fair ranking for reasonable quality?

This paper aims to review the fiercely discussed question of whether the ranking of Wikipedia articles in search engines is justified by the quality of the articles. After an overview of current research on information quality in Wikipedia, a summary of the extended discussion on the quality of encyclopedic entries in general is given. On this basis, a heuristic method for evaluating Wikipedia entries is developed and applied to Wikipedia articles that scored highly in a search engine retrieval effectiveness test, and the results are compared with the relevance judgments of jurors. In all search engines tested, jurors consistently judged Wikipedia results to be better than other results at the corresponding result positions. Relevance judgments often roughly correspond with the results of the heuristic evaluation. Cases in which high relevance judgments are not in accordance with a comparatively low score from the heuristic evaluation are interpreted as an indicator of a high degree of trust in Wikipedia. One of the systemic shortcomings of Wikipedia lies in its necessarily incoherent user model. A further tuning of the suggested criteria catalogue, for instance a different weighting of the supplied criteria, could serve as a starting point for a user-model-differentiated evaluation of Wikipedia articles. Proven methods for the quality evaluation of reference works are applied to Wikipedia articles and integrated with the question of search engine evaluation.
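The criteria catalogue itself is not reproduced in the abstract, but a weighted scoring of per-criterion judgments, as suggested in the tuning proposal above, could look like the following sketch. The criteria names and weights are purely illustrative, not the paper's actual catalogue.

```python
def heuristic_score(scores, weights):
    """Weighted mean of per-criterion scores in [0, 1]."""
    total = sum(weights.values())
    return sum(scores[c] * w for c, w in weights.items()) / total

# Illustrative criteria and weights, not the paper's actual catalogue.
weights = {"accuracy": 3, "completeness": 2, "currency": 1, "readability": 1}
scores = {"accuracy": 0.9, "completeness": 0.6, "currency": 0.8, "readability": 0.7}
print(round(heuristic_score(scores, weights), 2))  # 0.77
```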

Structural Analysis of Wikigraph to Investigate Quality Grades of Wikipedia Articles

Companion Proceedings of the Web Conference 2021

The quality of Wikipedia articles is evaluated manually, which is time-inefficient as well as susceptible to human bias. An automated assessment of these articles may help minimize the overall time and manual errors. In this paper, we present a novel approach based on the structural analysis of the Wikigraph to automate the estimation of the quality of Wikipedia articles. We examine the network built from the complete set of English Wikipedia articles and identify how the network signatures of the articles vary with their quality. Our study shows that these signatures are useful for estimating the quality grades of unassessed articles with an accuracy surpassing existing approaches in this direction. The results of the study may help reduce the need for human involvement in quality assessment tasks.
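A sketch of what extracting such network signatures might look like with networkx follows. The specific features chosen here (degree, PageRank, clustering coefficient) are assumptions standing in for the paper's actual signature set; the resulting feature vectors would then feed a quality classifier.

```python
import networkx as nx

# Toy directed link graph (illustrative).
G = nx.DiGraph()
G.add_edges_from([
    ("Physics", "Energy"), ("Energy", "Physics"),
    ("Physics", "Newton"), ("Newton", "Physics"),
    ("Newton", "Energy"), ("Stub", "Physics"),
])

pagerank = nx.pagerank(G)
clustering = nx.clustering(G.to_undirected())

def signature(article):
    """Per-article feature vector for a downstream quality classifier."""
    return {
        "in_degree": G.in_degree(article),
        "out_degree": G.out_degree(article),
        "pagerank": round(pagerank[article], 3),
        "clustering": round(clustering[article], 3),
    }

print(signature("Physics"))  # well-connected article
print(signature("Stub"))     # sparsely connected, stub-like article
```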

Quality Assessment of Wikipedia Articles Using h-index

Journal of Information Processing, 2015

In this paper, we propose a method for assessing the quality of Wikipedia articles from their edit history using the h-index. One of the major methods for assessing Wikipedia article quality is peer-review-based: if an editor's texts are left intact by other editors, the texts are considered approved by those editors, and the editor is judged to be a good editor. However, if an editor edits multiple articles and is approved in only a small number of them, the editor's quality value depends heavily on the quality of those few texts. In this paper, we apply the h-index, which is simple yet resistant to extreme values, to the peer-review-based Wikipedia article assessment method. Although the h-index can identify whether an editor is a good editor, it cannot distinguish a vandal from a merely inactive editor. To solve this problem, we propose the p-ratio for identifying which editors are vandals or inactive editors. Our experiments confirm that, by integrating the h-index with the p-ratio, our method outperforms the existing peer-review-based method in the accuracy of article quality assessment.
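The editor h-index idea can be sketched directly: an editor has h-index h if h of their edited articles contain at least h of their surviving (peer-approved) text units. The `approvals` input format and the `p_ratio` definition below are assumptions for illustration; the abstract does not define the p-ratio, so a simple survival ratio stands in for it here.

```python
def h_index(approvals):
    """approvals: per-article counts of the editor's surviving text units.
    Returns the largest h such that h articles have at least h approvals."""
    counts = sorted(approvals, reverse=True)
    h = 0
    for i, c in enumerate(counts, start=1):
        if c >= i:
            h = i
        else:
            break
    return h

def p_ratio(surviving, contributed):
    """Hypothetical stand-in for the paper's p-ratio: the fraction of an
    editor's contributions that survive, to separate vandals (low ratio)
    from inactive but constructive editors."""
    return surviving / contributed if contributed else 0.0

print(h_index([10, 8, 5, 4, 3]))  # 4
print(p_ratio(25, 30))            # ~0.83
```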

Evaluating authoritative sources using social networks: an insight from Wikipedia

Online Information Review, 2006

Purpose - The purpose of this paper is to present an approach to evaluating contributions in collaborative authoring environments, and in particular Wikis, using social network measures.
Design/methodology/approach - A social network model for Wikipedia has been constructed, and metrics of importance, such as centrality, have been defined. Data has been gathered from articles belonging to the same topic using a web crawler, in order to evaluate the outcome of the social network measures on the articles.
Findings - The question of the reliability of Wikipedia content is a challenging one, and as Wikipedia grows the problem becomes more demanding, especially for topics with controversial views such as politics or history.
Practical implications - The approach presented here could be used to improve the authoritativeness of content found in Wikipedia and similar sources.
Originality/value - This work develops a network approach to the evaluation of Wiki contributions and approaches the problem of Wikipedia content quality from a social network point of view.
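One plausible construction of such a social network, sketched below, connects editors who worked on the same article and ranks them by centrality as a proxy for authoritativeness. The co-editing edge construction and choice of betweenness centrality are assumptions; the paper defines its own model and importance metrics.

```python
import networkx as nx
from itertools import combinations

# Editors per article for one topic (illustrative data).
edits = {
    "Article1": ["EditorA", "EditorB", "EditorC"],
    "Article2": ["EditorB", "EditorC"],
    "Article3": ["EditorC", "EditorD"],
}

# Connect editors who worked on the same article.
G = nx.Graph()
for editors in edits.values():
    G.add_edges_from(combinations(editors, 2))

centrality = nx.betweenness_centrality(G)
print(max(centrality, key=centrality.get))  # EditorC bridges the two groups
```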

An Investigation of the Relationship between the Amount of Extra-textual Data and the Quality of Wikipedia Articles

Wikipedia, a web-based, collaboratively maintained free encyclopedia, is emerging as one of the most important websites on the internet. However, its openness raises many concerns about the quality of its articles and how to assess it automatically. In the Portuguese-speaking Wikipedia, articles can be rated by bots and by the community. In this paper, we investigate the correlation between these ratings and the count of media items (namely images and sounds) through a series of experiments. Our results show that article ratings and media item counts are correlated.
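Counting media items from an article's wikitext could be sketched as below; the link prefixes and file-extension list are assumptions (the Portuguese "Ficheiro"/"Imagem" prefixes are included since the study targets the Portuguese-speaking Wikipedia). The resulting counts would then be correlated with the article ratings.

```python
import re

# Assumed pattern for embedded media files in wikitext.
MEDIA = re.compile(
    r"\[\[(?:File|Image|Ficheiro|Imagem):[^|\]]+\.(?:jpe?g|png|svg|gif|ogg|wav)",
    re.IGNORECASE,
)

def count_media(wikitext):
    """Number of image/sound items embedded in the given wikitext."""
    return len(MEDIA.findall(wikitext))

texts = [
    "[[Ficheiro:Mapa.png|thumb]] Texto ... [[Ficheiro:Hino.ogg]]",
    "Texto sem ficheiros multimedia.",
    "[[Imagem:Foto.jpg|thumb]] Mais texto.",
]
print([count_media(t) for t in texts])  # [2, 0, 1] -> correlate with ratings
```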

Multilingual Ranking of Wikipedia Articles with Quality and Popularity Assessment in Different Topics

In Wikipedia, articles about various topics can be created and edited independently in each language version. Therefore, the quality of information about the same topic depends on the language. Any interested user can improve an article, and that improvement may depend on the popularity of the article. The goal of this study is to show which topics are best represented in different language versions of Wikipedia, using the results of quality assessment for over 39 million articles in 55 languages. In this paper we also analyze how popular the selected topics are among readers and authors in various languages. We used two approaches to assign articles to topics. First, we divided articles into 27 main topics based on information extracted from over 10 million categories in 55 language versions, analyzing about 400 million links from articles to over 10 million categories and over 26 million links between categories. In the second approach we used data from DBpedia and Wikidata. We also showed how...
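The first, category-based assignment could work roughly as in the following sketch: topic labels are propagated from seed categories down the category graph, and articles inherit the labels of their categories. The seed choice and breadth-first propagation are assumptions; the paper's actual procedure over 400 million links is not detailed in the abstract.

```python
from collections import deque

# Category graph fragment: parent category -> subcategories (illustrative).
subcats = {
    "Science": ["Physics", "Biology"],
    "Physics": ["Quantum mechanics"],
}
# Article -> categories it is assigned to.
article_cats = {
    "Photon": ["Quantum mechanics"],
    "Cell (biology)": ["Biology"],
}

def label_categories(seeds):
    """Breadth-first propagation of topic labels down the category graph."""
    topic = {}
    for seed in seeds:
        queue = deque([seed])
        while queue:
            cat = queue.popleft()
            if cat in topic:
                continue
            topic[cat] = seed
            queue.extend(subcats.get(cat, []))
    return topic

topics = label_categories(["Science"])
for article, cats in article_cats.items():
    print(article, {topics[c] for c in cats if c in topics})
```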

Measuring article quality in Wikipedia

Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (CIKM '07), 2007

Wikipedia has grown to be the world's largest and busiest free encyclopedia, in which articles are collaboratively written and maintained by volunteers online. Despite its success as a means of knowledge sharing and collaboration, the public has never stopped criticizing the quality of Wikipedia articles edited by non-experts and inexperienced contributors. In this paper, we investigate the problem of assessing the quality of articles in the collaborative authoring environment of Wikipedia. We propose three article quality measurement models that make use of the interaction data between articles and their contributors derived from the article edit history. Our Basic model is designed based on the mutual dependency between article quality and author authority. The PeerReview model introduces review behavior into the measurement of article quality. Finally, our ProbReview models extend PeerReview with the partial reviewership of contributors as they edit various portions of the articles. We conduct experiments on a set of well-labeled Wikipedia articles to evaluate the effectiveness of our quality measurement models in approximating human judgement.
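The mutual dependency behind the Basic model can be sketched as a HITS-style mutual reinforcement between article quality and author authority. The contribution weights, normalization, and iteration count below are assumptions for illustration, not the paper's exact formulation.

```python
def basic_model(contrib, iterations=20):
    """contrib[(author, article)] = fraction of text the author contributed.
    Iteratively reinforce article quality and author authority."""
    authors = {a for a, _ in contrib}
    articles = {d for _, d in contrib}
    authority = {a: 1.0 for a in authors}
    quality = {d: 1.0 for d in articles}
    for _ in range(iterations):
        # Article quality <- authority of its contributors, weighted.
        for d in articles:
            quality[d] = sum(authority[a] * w
                             for (a, dd), w in contrib.items() if dd == d)
        norm = max(quality.values())
        quality = {d: q / norm for d, q in quality.items()}
        # Author authority <- quality of the articles they contributed to.
        for a in authors:
            authority[a] = sum(quality[d] * w
                               for (aa, d), w in contrib.items() if aa == a)
        norm = max(authority.values())
        authority = {a: v / norm for a, v in authority.items()}
    return quality, authority

contrib = {("Alice", "Doc1"): 0.8, ("Bob", "Doc1"): 0.2, ("Bob", "Doc2"): 1.0}
quality, authority = basic_model(contrib)
print(quality, authority)
```

The PeerReview and ProbReview models would additionally weight each contribution by how much of the article other editors reviewed, which the abstract describes but does not specify in enough detail to sketch here.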