A Study of Automatic Metrics for the Evaluation of Natural Language Explanations
Related papers
I don't understand! Evaluation Methods for Natural Language Explanations
2021
Explainability of intelligent systems is key for future adoption. While much work is ongoing on developing methods of explaining complex opaque systems, there is little current work on evaluating how effective these explanations are, in particular with respect to the user’s understanding. Natural language (NL) explanations can be seen as an intuitive channel between humans and artificial intelligence systems, in particular for enhancing transparency. This paper presents existing work on how evaluation methods from the field of Natural Language Generation (NLG) can be mapped onto NL explanations. We also present a preliminary investigation into the relationship between linguistic features and human evaluation, using a dataset of NL explanations derived from Bayesian Networks.
Automatic Generation of Natural Language Explanations
Proceedings of the 23rd International Conference on Intelligent User Interfaces Companion, 2018
An important task for recommender systems is to generate explanations according to a user's preferences. Most of the current methods for explainable recommendations use structured sentences to provide descriptions along with the recommendations they produce. However, those methods have neglected the review-oriented way of writing a text, even though it is known that these reviews have a strong influence over users' decisions. In this paper, we propose a method for the automatic generation of natural language explanations, for predicting how a user would write about an item, based on user ratings of different items' features. We design a character-level recurrent neural network (RNN) model, which generates an item's review explanations using long short-term memory (LSTM). The model generates text reviews given a combination of the review and rating scores that express opinions about different factors or aspects of an item. Our network is trained on a sub-sample of the large real-world dataset BeerAdvocate. Our empirical evaluation using natural language processing metrics shows that the quality of the generated text is close to that of a real user-written review, identifying negation, misspellings, and domain-specific vocabulary.
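To make the described setup concrete, here is a minimal sketch of a character-level LSTM conditioned on aspect ratings, written in PyTorch. It illustrates the general architecture outlined in the abstract, not the authors' exact model; the class name, layer sizes, and the choice to concatenate the rating vector to every character embedding are assumptions for illustration.

```python
# Minimal sketch (assumed architecture, not the paper's exact model):
# predict the next character of a review, conditioned on a vector of
# aspect ratings concatenated to each character embedding.
import torch
import torch.nn as nn

class CharReviewLSTM(nn.Module):
    def __init__(self, vocab_size, n_aspects, embed_dim=64, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # LSTM input = character embedding + aspect-rating vector
        self.lstm = nn.LSTM(embed_dim + n_aspects, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, chars, ratings, state=None):
        # chars:   (batch, seq_len)   integer-encoded characters
        # ratings: (batch, n_aspects) e.g. hypothetical appearance/aroma/taste scores
        emb = self.embed(chars)                                    # (B, T, E)
        cond = ratings.unsqueeze(1).expand(-1, chars.size(1), -1)  # (B, T, A)
        hidden, state = self.lstm(torch.cat([emb, cond], dim=-1), state)
        return self.out(hidden), state                             # next-char logits

# Training would minimise cross-entropy against the next character of real
# reviews; generation then samples one character at a time from the softmax.
model = CharReviewLSTM(vocab_size=100, n_aspects=4)
logits, _ = model(torch.randint(0, 100, (2, 50)), torch.rand(2, 4))
```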
Challenges and Opportunities in Text Generation Explainability
arXiv (Cornell University), 2024
The necessity for interpretability in natural language processing (NLP) has risen alongside the growing prominence of large language models. Among the myriad tasks within NLP, text generation stands out as a primary objective of autoregressive models. The NLP community has begun to take a keen interest in gaining a deeper understanding of text generation, leading to the development of model-agnostic explainable artificial intelligence (xAI) methods tailored to this task. The design and evaluation of explainability methods are non-trivial since they depend on many factors involved in the text generation process, e.g., the autoregressive model and its stochastic nature. This paper outlines 17 challenges categorized into three groups that arise during the development and assessment of attribution-based explainability methods. These challenges encompass issues concerning tokenization, defining explanation similarity, determining token importance and prediction change metrics, the level of human intervention required, and the creation of suitable test datasets. The paper illustrates how these challenges can be intertwined, showcasing new opportunities for the community. These include developing probabilistic word-level explainability methods and engaging humans in the explainability pipeline, from the data design to the final evaluation, to draw robust conclusions on xAI methods.
Interactive Natural Language Technology for Explainable Artificial Intelligence
2020
We have defined an interdisciplinary program for training a new generation of researchers who will be ready to leverage the use of Artificial Intelligence (AI)-based models and techniques even by non-expert users. The final goal is to make AI self-explaining and thus contribute to translating knowledge into products and services for economic and social benefit, with the support of Explainable AI systems. Moreover, our focus is on the automatic generation of interactive explanations in natural language, the preferred modality among humans, with visualization as a complementary modality.
Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems
Recent advances in Human-AI interaction have highlighted the possibility of employing AI in collaborative decision-making contexts, particularly in cases where the decision is subjective, without one ground truth. In these contexts, researchers argue that AI could be used not just to provide a final decision recommendation, but to surface new perspectives, rationales, and insights. In this late-breaking work, we describe the initial findings from an empirical study investigating how complementary AI input influences humans' rationale in ambiguous decision-making. We use subtle sexism as an example of this context, and GPT-3 to create explanation-like text. We find that participants change the language, level of detail, and even the argumentative stance of their explanations after seeing the AI explanation text. They often borrow language directly from this complementary text. We discuss the implications for collaborative decision-making and the next steps in this research agenda.
Better Metrics for Evaluating Explainable Artificial Intelligence (Blue Sky Ideas Track)
2021
This paper presents objective metrics for how explainable artificial intelligence (XAI) can be quantified. Through an overview of current trends, we show that many explanations are generated post-hoc and independent of the agent's logical process, which in turn creates explanations with limited meaning as they lack transparency and fidelity. While user studies are a known basis for evaluating XAI, studies that do not consider objective metrics for evaluating XAI may have limited meaning and may suffer from confirmation bias, particularly if they use low-fidelity explanations unnecessarily. To avoid this issue, this paper suggests a paradigm shift in evaluating XAI that focuses on metrics that quantify the explanation itself and its appropriateness given the XAI goal. We suggest four such metrics based on the performance difference, D, between the explanation's logic and the agent's actual performance, the number of rules, R, output by the explanation, the number of features, F, used to generate that explanation, and the stability, S, of the explanation. We believe that user studies that focus on these metrics in their evaluations are inherently more valid and should be integrated into future XAI research.
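As one possible operationalisation, the following hypothetical sketch computes the four quantities D, R, F, and S for a rule-based surrogate explanation of a black-box agent. The function name, the agreement-based definition of D, and the perturbation-based definition of S are assumptions made here for illustration; the paper's exact formulations may differ.

```python
# Hypothetical sketch of the four proposed quantities for a rule-list
# explanation of a black-box agent; exact definitions in the paper may differ.
from typing import Callable, List, Sequence

def xai_metrics(agent_predict: Callable, explanation_predict: Callable,
                rules: List[str], features_used: Sequence[str],
                X, X_perturbed) -> dict:
    n = len(X)
    # D: performance difference between the explanation's logic and the agent
    agreement = sum(agent_predict(x) == explanation_predict(x) for x in X) / n
    D = 1.0 - agreement
    # R: number of rules output by the explanation
    R = len(rules)
    # F: number of distinct features used to generate the explanation
    F = len(set(features_used))
    # S: stability -- fraction of perturbed inputs whose explanation output
    # is unchanged (one simple possible operationalisation, assumed here)
    S = sum(explanation_predict(x) == explanation_predict(xp)
            for x, xp in zip(X, X_perturbed)) / n
    return {"D": D, "R": R, "F": F, "S": S}
```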
SyntaxShap: Syntax-aware Explainability Method for Text Generation
arXiv (Cornell University), 2024
To harness the power of large language models in safety-critical domains, we need to ensure the explainability of their predictions. However, despite the significant attention to model interpretability, there remains an unexplored domain in explaining sequence-to-sequence tasks using methods tailored for textual data. This paper introduces SyntaxShap, a local, model-agnostic explainability method for text generation that takes the syntax of the text data into consideration. The presented work extends Shapley values to account for parsing-based syntactic dependencies. Taking a game-theoretic approach, SyntaxShap only considers coalitions constrained by the dependency tree. We adopt a model-based evaluation to compare SyntaxShap and its weighted form to state-of-the-art explainability methods adapted to text generation tasks, using diverse metrics including faithfulness, coherency, and semantic alignment of the explanations to the model. We show that our syntax-aware method produces more faithful and coherent explanations for predictions by autoregressive models. Confronted with the misalignment of human and AI model reasoning, this paper also highlights the need for cautious evaluation strategies in explainable AI.
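The core idea of restricting Shapley coalitions with a dependency parse can be illustrated with a small sketch. The snippet below is a simplified illustration only, with a hand-coded parse and a subtree-connectivity test standing in for the paper's actual coalition construction; it is not the SyntaxShap algorithm itself, and the example sentence and head indices are made up.

```python
# Simplified illustration (assumed, not the paper's exact construction):
# given a dependency parse expressed as parent indices, keep only token
# subsets that form one connected piece of the tree; those subsets would
# then play the role of Shapley coalitions.
from itertools import combinations

tokens = ["the", "cat", "sat", "on", "the", "mat"]
# parent index of each token in a hypothetical dependency parse; -1 = root
heads = [1, 2, -1, 2, 5, 3]

def is_connected(subset, heads):
    """A coalition is tree-consistent if exactly one of its tokens has a
    head outside the coalition (the subtree's single entry point)."""
    subset = set(subset)
    entry_points = sum(1 for i in subset if heads[i] not in subset)
    return entry_points == 1

valid_coalitions = [
    c for r in range(1, len(tokens) + 1)
    for c in combinations(range(len(tokens)), r)
    if is_connected(c, heads)
]
print(len(valid_coalitions), "tree-consistent coalitions out of", 2 ** len(tokens) - 1)
```

Restricting the enumeration this way shrinks the coalition space from exponential in the sentence length to only syntactically coherent groups, which is the intuition behind making the resulting attributions syntax-aware.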
The KNIGHT Experiments: Empirically Evaluating an Explanation Generation System
1995
Empirically evaluating explanation generators poses a notoriously difficult problem. To address this problem, we constructed KNIGHT, a robust explanation generator that dynamically constructs natural language explanations about scientific phenomena. We then undertook the most extensive and rigorous empirical evaluation ever conducted on an explanation generator. First, KNIGHT constructed explanations on randomly chosen topics from the Biology Knowledge Base.
Metrics for Explainable AI: Challenges and Prospects
arXiv (Cornell University), 2018
The question addressed in this paper is: If we present to a user an AI system that explains how it works, how do we know whether the explanation works and the user has achieved a pragmatic understanding of the AI? In other words, how do we know that an explainable AI system (XAI) is any good? Our focus is on the key concepts of measurement. We discuss specific methods for evaluating: (1) the goodness of explanations, (2) whether users are satisfied by explanations, (3) how well users understand the AI systems, (4) how curiosity motivates the search for explanations, (5) whether the user's trust and reliance on the AI are appropriate, and finally, (6) how the human-XAI work system performs. The recommendations we present derive from our integration of extensive research literatures and our own psychometric evaluations.
Measuring Attribution in Natural Language Generation Models
arXiv (Cornell University), 2021
With recent improvements in natural language generation (NLG) models for various applications, it has become imperative to have the means to identify and evaluate whether NLG output is only sharing verifiable information about the external world. In this work, we present a new evaluation framework entitled Attributable to Identified Sources (AIS) for assessing the output of natural language generation models, when such output pertains to the external world. We first define AIS and introduce a two-stage annotation pipeline for allowing annotators to appropriately evaluate model output according to AIS guidelines. We empirically validate this approach on generation datasets spanning three tasks (two conversational QA datasets, a summarization dataset, and a table-to-text dataset) via human evaluation studies that suggest that AIS could serve as a common framework for measuring whether model-generated statements are supported by underlying sources. We release guidelines for the human evaluation studies.