Evaluating Human-Language Model Interaction (original) (raw)

Authors:Mina Lee, Megha Srivastava, Amelia Hardy, John Thickstun, Esin Durmus, Ashwin Paranjape, Ines Gerard-Ursin, Xiang Lisa Li, Faisal Ladhak, Frieda Rong, Rose E. Wang, Minae Kwon, Joon Sung Park, Hancheng Cao, Tony Lee, Rishi Bommasani, Michael Bernstein, Percy Liang

View PDF HTML (experimental)

Abstract:Many real-world applications of language models (LMs), such as writing assistance and code autocomplete, involve human-LM interaction. However, most benchmarks are non-interactive in that a model produces output without human involvement. To evaluate human-LM interaction, we develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems and dimensions to consider when designing evaluation metrics. Compared to standard, non-interactive evaluation, HALIE captures (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality (e.g., enjoyment and ownership). We then design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and metaphor generation. With four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21 Labs' Jurassic-1), we find that better non-interactive performance does not always translate to better human-LM interaction. In particular, we highlight three cases where the results from non-interactive and interactive metrics diverge and underscore the importance of human-LM interaction for LM evaluation.

Submission history

From: Mina Lee [view email]
[v1] Mon, 19 Dec 2022 18:59:45 UTC (10,573 KB)
[v2] Tue, 20 Dec 2022 18:53:53 UTC (10,573 KB)
[v3] Wed, 12 Jul 2023 16:29:28 UTC (8,547 KB)
[v4] Sun, 10 Sep 2023 13:31:08 UTC (8,552 KB)
[v5] Fri, 5 Jan 2024 22:09:26 UTC (8,552 KB)