Annotating Credibility: Identifying and Mitigating Bias in Credibility Datasets

MultiFC: A Real-World Multi-Domain Dataset for Evidence-Based Fact Checking of Claims

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019

We contribute the largest publicly available dataset of naturally occurring factual claims for the purpose of automatic claim verification. It is collected from 26 fact checking websites in English, paired with textual sources and rich metadata, and labelled for veracity by human expert journalists. We present an in-depth analysis of the dataset, highlighting characteristics and challenges. Further, we present results for automatic veracity prediction, both with established baselines and with a novel method for joint ranking of evidence pages and predicting veracity that outperforms all baselines. Significant performance increases are achieved by encoding evidence, and by modelling metadata. Our best-performing model achieves a Macro F1 of 49.2%, showing that this is a challenging testbed for claim veracity prediction.
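A minimal baseline sketch of the kind of evidence-based veracity prediction this dataset supports, not the paper's joint ranking model: claim text, concatenated evidence snippets, and simple metadata are encoded separately and combined into one feature vector for a linear classifier scored with macro F1. The example records and field names are illustrative, not the MultiFC schema.

```python
# Illustrative baseline sketch (not the paper's joint evidence-ranking model).
# Encodes claim text, evidence text, and crude metadata, then predicts veracity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from scipy.sparse import hstack

# Toy records; not taken from MultiFC.
train = [
    {"claim": "The city banned plastic bags in 2015.",
     "evidence": "Council records show the ordinance passed in 2015.",
     "source": "localnews", "label": "true"},
    {"claim": "Vaccines contain microchips.",
     "evidence": "No manufacturer documentation supports this claim.",
     "source": "socialmedia", "label": "false"},
]

claim_vec = TfidfVectorizer().fit([d["claim"] for d in train])
evid_vec = TfidfVectorizer().fit([d["evidence"] for d in train])
meta_vec = TfidfVectorizer().fit([d["source"] for d in train])  # crude metadata encoding

def featurize(docs):
    # Concatenate the three feature spaces into one sparse matrix.
    return hstack([claim_vec.transform([d["claim"] for d in docs]),
                   evid_vec.transform([d["evidence"] for d in docs]),
                   meta_vec.transform([d["source"] for d in docs])])

clf = LogisticRegression(max_iter=1000).fit(featurize(train), [d["label"] for d in train])
preds = clf.predict(featurize(train))
print(f1_score([d["label"] for d in train], preds, average="macro"))
```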

No Shortcuts to Credibility Evaluation

Establishing and Evaluating Digital Ethos and Online Credibility

This chapter argues that as the online informational landscape continues to expand, shortcuts to source credibility evaluation, in particular the revered checklist approach, fall short of their intended goal and cannot replace a more formally acquired and comprehensive information literacy skill set. By examining the current standard checklist criteria, the authors identify problems with this approach. Such shortcuts are not necessarily effective for online source credibility assessment, and the authors contend that in cases of high-stakes informational needs, they cannot adequately replace the expertise of information professionals, nor displace the need for proper and continuous information literacy education.

Supporting factual statements with evidence from the web

Proceedings of the 21st ACM international conference on Information and knowledge management - CIKM '12, 2012

Fact verification has become an important task due to the increased popularity of blogs, discussion groups, and social sites, as well as of encyclopedic collections that aggregate content from many contributors. We investigate the task of automatically retrieving supporting evidence from the Web for factual statements. Using Wikipedia as a starting point, we derive a large corpus of statements paired with supporting Web documents, which we employ further as training and test data under the assumption that the contributed references to Wikipedia represent some of the most relevant Web documents for supporting the corresponding statements. Given a factual statement, the proposed system first transforms it into a set of semantic terms by using machine learning techniques. It then employs a quasi-random strategy for selecting subsets of the semantic terms according to topical likelihood. These semantic terms are used to construct queries for retrieving Web documents via a Web search API. Finally, the retrieved documents are aggregated and re-ranked by employing additional measures of their suitability to support the factual statement. To gauge the quality of the retrieved evidence, we conduct a user study through Amazon Mechanical Turk, which shows that our system is capable of retrieving supporting Web documents comparable to those chosen by Wikipedia contributors.
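The pipeline described above (extract semantic terms, sample term subsets, issue queries, re-rank retrieved documents) can be sketched roughly as follows. This is not the paper's implementation: the term extractor is a crude heuristic rather than a learned model, and `web_search` is a hypothetical stub standing in for a real search API.

```python
# Rough sketch of the described retrieval pipeline; all components are
# simplified stand-ins, and `web_search` is a hypothetical placeholder.
import random
import re

STOPWORDS = {"the", "a", "an", "of", "in", "on", "is", "was", "were", "by", "to", "and", "for"}

def semantic_terms(statement: str) -> list[str]:
    # Crude stand-in for the paper's machine-learned term extraction.
    tokens = re.findall(r"[A-Za-z0-9']+", statement.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]

def build_queries(terms: list[str], n_queries: int = 3, k: int = 4) -> list[str]:
    # Quasi-random subsets of terms, echoing the described selection strategy.
    rng = random.Random(0)
    return [" ".join(rng.sample(terms, min(k, len(terms)))) for _ in range(n_queries)]

def web_search(query: str) -> list[str]:
    # Hypothetical stub: replace with a real web search API client.
    return [f"Document retrieved for query: {query}"]

def rerank(statement: str, docs: list[str]) -> list[str]:
    # Re-rank by overlap with the statement's terms (a simple suitability proxy).
    terms = set(semantic_terms(statement))
    return sorted(set(docs), key=lambda d: len(terms & set(semantic_terms(d))), reverse=True)

statement = "The Eiffel Tower was completed in 1889 for the World's Fair."
docs = [d for q in build_queries(semantic_terms(statement)) for d in web_search(q)]
print(rerank(statement, docs)[:3])
```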

Truth, lies, and data: Credibility representation in data analysis

The web has evolved in a scale-free manner, with available information about different entities developing in different forms, in different locations, and at massive scales. This paper addresses the cognitive limitations that information analysts typically experience as they approach the boundaries where automated analysis algorithms are sorely needed. An experiment (N=285) explores information analysts' interactions with recommendations from an automated fact-finder algorithm during the task of answering questions in a fictional humanitarian aid delivery scenario, using three increasingly complex user interfaces, with and without the automated recommendations. Results show that in the best-performing group, interaction with the fact-finder recommendations was 47 percent greater than in the worst-performing group, raising the question of how to discover rules that help analysts better adapt to specific contexts and missions.

Where the Truth Lies: Explaining the Credibility of Emerging Claims on the Web and Social Media

The web is a huge source of valuable information. However, in recent times, there is an increasing trend towards false claims in social media, other web-sources, and even in news. Thus, fact-checking websites have become increasingly popular to identify such misinformation based on manual analysis. Recent research proposed methods to assess the credibility of claims automatically. However, there are major limitations: most works assume claims to be in a structured form, and a few deal with textual claims but require that sources of evidence or counter-evidence are easily retrieved from the web. None of these works can cope with newly emerging claims, and no prior method can give user-interpretable explanations for its verdict on the claim's credibility. This paper overcomes these limitations by automatically assessing the credibility of emerging claims, with sparse presence in web-sources, and generating suitable explanations from judiciously selected sources. To this end, we retrieve diverse articles about the claim, and model the mutual interaction between: the stance (i.e., support or refute) of the sources, the language style of the articles, the reliability of the sources, and the claim's temporal footprint on the web. Extensive experiments demonstrate the viability of our method and its superiority over prior works. We show that our methods work well for early detection of emerging claims, as well as for claims with limited presence on the web and social media.
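To make the idea of combining these signals concrete, here is a toy aggregation sketch, not the paper's joint model: per-article stance, a language-style score, and source reliability are folded into a single credibility estimate for a claim. The scores and weighting scheme are illustrative assumptions.

```python
# Toy credibility aggregation; weights and score ranges are made up for illustration.
from dataclasses import dataclass

@dataclass
class Article:
    stance: float       # +1 supports the claim, -1 refutes it
    style_score: float  # 0..1, higher = more objective, report-like language
    reliability: float  # 0..1, prior trust in the source

def claim_credibility(articles: list[Article]) -> float:
    """Weighted vote: reliable, objectively written articles count more."""
    if not articles:
        return 0.5  # no evidence: stay agnostic
    weighted = sum(a.stance * a.style_score * a.reliability for a in articles)
    total = sum(a.style_score * a.reliability for a in articles)
    return 0.5 + 0.5 * (weighted / total if total else 0.0)  # map to 0..1

articles = [
    Article(stance=+1, style_score=0.9, reliability=0.8),  # reputable source, supports
    Article(stance=-1, style_score=0.4, reliability=0.3),  # dubious source, refutes
]
print(round(claim_credibility(articles), 3))  # > 0.5 leans "credible"
```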

A Data Set of Internet Claims and Comparison of their Sentiments with Credibility

ArXiv, 2019

In this modern era, communication has become faster and easier, which means fallacious information can spread as fast as reality. Considering the damage that fake news inflicts on the psychology of people, and the fact that such news proliferates faster than truth, we need to study the phenomenon that helps spread fake news. An unbiased data set that depends on reality for rating news is necessary to construct predictive models for its classification. This paper describes the methodology to create such a data set. We collect our data from this http URL, which is a fact-checking organization. Furthermore, we intend to create this data set not only for classification of the news but also to find patterns that reveal the intent behind misinformation. We also formally define an Internet Claim, its credibility, and the sentiment behind such a claim, and we try to characterise the relationship between the sentiment of a claim and its credibility. This relationship sheds light on the bigger picture…
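The kind of sentiment-versus-credibility comparison described can be sketched as below. This is only an assumption-laden illustration: the sentiment lexicon, claims, and labels are invented, and the paper's own sentiment analysis and dataset are not reproduced here.

```python
# Illustrative comparison of average claim sentiment per credibility label.
# Lexicon, claims, and labels are toy examples, not the paper's data.
from collections import defaultdict

POSITIVE = {"good", "great", "safe", "cure", "win"}
NEGATIVE = {"bad", "dangerous", "hoax", "death", "scam"}

def sentiment(text: str) -> int:
    # Crude lexicon count: positive hits minus negative hits.
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

claims = [
    ("Miracle cure guarantees a safe and great recovery", "false"),
    ("Officials report a dangerous scam targeting voters", "true"),
    ("New policy announced for city parks", "true"),
]

totals, counts = defaultdict(float), defaultdict(int)
for text, label in claims:
    totals[label] += sentiment(text)
    counts[label] += 1

for label in totals:
    print(label, totals[label] / counts[label])  # mean sentiment per credibility class
```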

Augmenting web pages and search results to support credibility assessment

2011

The presence (and, sometimes, prominence) of incorrect and misleading content on the Web can have serious consequences for people who increasingly rely on the internet as their information source for topics such as health, politics, and financial advice. In this paper, we identify and collect several page features (such as popularity among specialized user groups) that are currently difficult or impossible for end users to assess, yet provide valuable signals regarding credibility.

The 2nd workshop on information credibility on the web (WICOW 2008)

2009

Research on the credibility of web content is becoming increasingly important due to low publishing barriers and the resulting abundance of untrustworthy or conflicting information on the web. On 30 October 2008, the 2nd Workshop on Information Credibility on the Web (WICOW 2008) was held as part of the CIKM 2008 conference in Napa Valley, USA. Nine full and six short papers were accepted and grouped into four sessions, and two keynote speeches were delivered. This report outlines the main results of the workshop.

A Structured Response to Misinformation

Companion Proceedings of The Web Conference 2018 (WWW '18), 2018

The proliferation of misinformation in online news and its amplification by platforms are a growing concern, leading to numerous efforts to improve the detection of and response to misinformation. Given the variety of approaches, collective agreement on the indicators that signify credible content could allow for greater collaboration and data-sharing across initiatives. In this paper, we present an initial set of indicators for article credibility defined by a diverse coalition of experts. These indicators originate from both within an article's text as well as from external sources or article metadata. As a proof-of-concept, we present a dataset of 40 articles of varying credibility annotated with our indicators by 6 trained annotators using specialized platforms. We discuss future steps including expanding annotation, broadening the set of indicators, and considering their use by platforms and the public, towards the development of interoperable standards for content credibility.
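One way such indicator annotations could be represented and aggregated is sketched below. The indicator names, the internal/external split labels, and the agreement helper are illustrative assumptions, not the coalition's official indicator list or annotation platform.

```python
# Sketch of a data structure for per-article credibility-indicator annotations.
# Indicator names ("cites_sources", "clickbait_title") are hypothetical examples.
from dataclasses import dataclass, field

@dataclass
class IndicatorAnnotation:
    name: str          # e.g. "cites_sources", "clickbait_title"
    kind: str          # "internal" (from article text) or "external" (metadata)
    value: bool        # annotator's judgement that the indicator is present
    annotator: str

@dataclass
class ArticleRecord:
    url: str
    annotations: list[IndicatorAnnotation] = field(default_factory=list)

    def agreement(self, indicator: str) -> float:
        """Fraction of annotators who marked the indicator as present."""
        votes = [a.value for a in self.annotations if a.name == indicator]
        return sum(votes) / len(votes) if votes else 0.0

record = ArticleRecord(url="https://example.org/article")
record.annotations += [
    IndicatorAnnotation("cites_sources", "internal", True, "annotator_1"),
    IndicatorAnnotation("cites_sources", "internal", False, "annotator_2"),
]
print(record.agreement("cites_sources"))  # 0.5
```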

When classification accuracy is not enough: Explaining news credibility assessment

Information Processing & Management, 2021

Dubious credibility of online news has become a major problem with negative consequences for both readers and the whole society. Despite several efforts in the development of automatic methods for measuring credibility in news stories, there has been little previous work focusing on providing explanations that go beyond a black-box decision or score. In this work, we use two machine learning approaches for computing a credibility score for any given news story: one is a linear method trained on stylometric features and the other is a recurrent neural network. Our goal is to study whether we can explain the rationale behind these automatic methods and improve a reader's confidence in their credibility assessment. Therefore, we first adapted the classifiers to the constraints of a browser extension so that the text can be analysed while browsing online news. We also propose a set of interactive visualisations to explain to the user the rationale behind the automatic credibility assessment. We evaluated our adapted methods by means of standard machine learning performance metrics and through two user studies. The adapted neural classifier showed better performance on the test data than the stylometric classifier, despite the latter appearing to be easier to interpret by the participants. Users were also significantly more accurate in their assessments after they interacted with the tool, as well as more confident in their decisions.
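As a rough illustration of the stylometric, linear side of such an approach (not the paper's models or feature set), the sketch below scores a news story with a hand-weighted linear model over simple stylometric features and reports each feature's contribution, the kind of rationale a reader-facing explanation could surface. The features and weights are invented for the example.

```python
# Illustrative linear stylometric scorer with per-feature contributions.
# Feature set and weights are assumptions, not the paper's trained model.
import re

def stylometric_features(text: str) -> dict[str, float]:
    words = re.findall(r"[A-Za-z']+", text)
    n = max(len(words), 1)
    return {
        "avg_word_len": sum(map(len, words)) / n,
        "exclamation_rate": text.count("!") / n,
        "all_caps_rate": sum(w.isupper() and len(w) > 1 for w in words) / n,
    }

WEIGHTS = {"avg_word_len": 0.15, "exclamation_rate": -4.0, "all_caps_rate": -3.0}
BIAS = -0.2

def credibility_score(text: str) -> tuple[float, dict[str, float]]:
    feats = stylometric_features(text)
    contributions = {k: WEIGHTS[k] * v for k, v in feats.items()}
    return BIAS + sum(contributions.values()), contributions

score, why = credibility_score("SHOCKING!! You won't BELIEVE what happened!")
print(round(score, 3))
for name, contrib in sorted(why.items(), key=lambda kv: kv[1]):
    print(f"{name}: {contrib:+.3f}")  # per-feature rationale for the score
```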