Evaluation of performance measures in predictive artificial intelligence models to support medical decisions: overview and guidance
Review
doi: 10.1016/j.landig.2025.100916. Epub 2025 Dec 13.
Ben Van Calster 1, Gary S Collins 2, Andrew J Vickers 3, Laure Wynants 4, Kathleen F Kerr 5, Lasai Barreñada 6, Gael Varoquaux 7, Karandeep Singh 8, Karel GM Moons 9, Tina Hernandez-Boussard 10, Dirk Timmerman 11, David J McLernon 12, Maarten van Smeden 9, Ewout W Steyerberg 13; Topic Group 6 of the STRATOS initiative
- PMID: 41391983
- DOI: 10.1016/j.landig.2025.100916
Free article
Ben Van Calster et al. Lancet Digit Health. 2025 Dec.
Abstract
Numerous measures have been proposed to illustrate the performance of predictive artificial intelligence (AI) models. Selecting appropriate performance measures is essential for predictive AI models intended for use in medical practice. Poorly performing models are misleading and may lead to wrong clinical decisions that can be detrimental to patients and increase financial costs. In this Viewpoint, we assess the merits of classic and contemporary performance measures when validating predictive AI models for medical practice, focusing on models that estimate probabilities for a binary outcome. We discuss 32 performance measures covering five performance domains (discrimination, calibration, overall performance, classification, and clinical utility) along with corresponding graphical assessments. The first four domains address statistical performance, whereas the fifth domain covers decision-analytical performance. We discuss two key characteristics when selecting a performance measure and explain why these characteristics are important: (1) whether the measure's expected value is optimised when calculated using the correct probabilities (ie, whether it is a proper measure) and (2) whether the measure solely reflects statistical performance or decision-analytical performance by properly accounting for misclassification costs. Seventeen measures showed both characteristics, 14 showed one, and one (F1 score) showed neither. All classification measures were improper for clinically relevant decision thresholds other than when the threshold was 0·5 or equal to the true prevalence. We illustrate these measures and characteristics using the ADNEX model, which predicts the probability of malignancy in women with an ovarian tumour.
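The notion of a proper measure described above can be illustrated with a minimal sketch (not from the paper; the numbers are hypothetical): for the Brier score, the expected value is optimised exactly when the reported probability equals the true event probability.

```python
# Illustrative sketch of "properness": a measure is proper if its
# expected value is optimised when the reported probability q equals
# the true event probability p.

def expected_brier(p, q):
    """Expected Brier score when the true event probability is p
    and the model reports probability q."""
    return p * (1 - q) ** 2 + (1 - p) * q ** 2

p = 0.2  # hypothetical true event probability
candidates = [0.05, 0.1, 0.2, 0.35, 0.5, 0.8]
# The candidate that minimises the expected Brier score is q == p,
# so the Brier score is a proper measure.
best = min(candidates, key=lambda q: expected_brier(p, q))
```

By contrast, a classification measure such as accuracy at a fixed threshold can be optimised by a model that reports distorted probabilities, which is why the paper flags such measures as improper at most clinically relevant thresholds.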
We recommend the following measures and plots as essential to report: area under the receiver operating characteristic curve, calibration plot, a clinical utility measure such as net benefit with decision curve analysis, and a plot showing probability distributions by outcome category.
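The recommended clinical utility measure, net benefit, can be sketched as follows (an illustrative implementation with hypothetical toy data, not code from the paper): true positives are credited and false positives are penalised by the odds of the decision threshold, which encodes the harm-to-benefit ratio of intervening.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of classifying patients as positive when their
    predicted probability meets the decision threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the threshold odds, reflecting
    # the relative cost of unnecessary treatment.
    return tp / n - fp / n * threshold / (1 - threshold)

def net_benefit_treat_all(y_true, threshold):
    """Net benefit of the reference strategy of treating everyone."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

# Hypothetical toy data; a decision curve plots net_benefit over a
# range of clinically plausible thresholds against the treat-all
# strategy and treat-none (net benefit 0).
y_true = [1, 1, 0, 0, 0]
y_prob = [0.9, 0.6, 0.4, 0.2, 0.1]
nb_model = net_benefit(y_true, y_prob, 0.5)
nb_all = net_benefit_treat_all(y_true, 0.5)
```

A decision curve analysis, as recommended in the Viewpoint, repeats this computation across the range of thresholds deemed clinically reasonable and shows whether using the model outperforms both default strategies.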
Copyright © 2025 The Author(s). Published by Elsevier Ltd. All rights reserved.
Conflict of interest statement
Declaration of interests BVC has stock options in Gynaia Inc, a company providing teaching, training, and decision support tools such as the ADNEX model for gynaecologists and sonographers. DT is a co-founder of, and has stock options in, Gynaia Inc. KFK received an honorarium for teaching a short course in the UW Summer Institutes for Statistics in Clinical and Epidemiological Research in July 2023 and July 2024. GV has stock options in Therapixel, a company providing imaging analysis software based on artificial intelligence (AI). TH-B received royalties from Coursera for courses on AI in health care, received consulting fees from Paul Hartmann AG and Grai-Matter, and has stock options in Verantos Inc. DJM received honoraria from Merck for a fertility prediction modelling workshop, and financial support from Merck and IVIRMA to present at scientific conferences. All other authors declare no competing interests.