
Shortcut learning in medical AI hinders generalization: method for estimating AI model generalization without external data

Cathy Ong Ly et al. NPJ Digit Med. 2024.

Abstract

Healthcare datasets are becoming larger and more complex, necessitating the development of accurate and generalizable AI models for medical applications. Unstructured datasets, including medical imaging, electrocardiograms, and natural language data, are gaining attention with advancements in deep convolutional neural networks and large language models. However, estimating the generalizability of these models to new healthcare settings without extensive validation on external data remains challenging. Shortcut learning refers to a phenomenon in which an AI model learns to solve a task based on spurious correlations present in the data rather than on features directly related to the task itself. In experiments across 13 datasets including X-rays, CTs, ECGs, clinical discharge summaries, and lung auscultation data, our results demonstrate that model performance is frequently overestimated by up to 20% on average due to shortcut learning of hidden data acquisition biases (DAB). We propose PEst, an open-source, bias-corrected estimate that predicts external accuracy to within 4% on average by measuring and calibrating for DAB-induced shortcut learning.
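The signal-shuffling probe described in the abstract can be illustrated with a toy experiment. The sketch below is a minimal, hypothetical stand-in for the paper's pipeline (synthetic data, a scikit-learn logistic regression, and a simple per-sample permutation, all assumptions): a classifier trained and evaluated on within-sample-shuffled inputs can only exploit acquisition-style cues such as global intensity statistics, so accuracy well above chance signals hidden DAB.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def shuffle_within_sample(X, rng):
    """Permute feature order independently per sample, destroying
    structural/semantic content while keeping per-sample statistics
    (a stand-in for the paper's signal-shuffling probe)."""
    X_shuf = X.copy()
    for row in X_shuf:
        rng.shuffle(row)  # in-place shuffle of this sample's values
    return X_shuf

# Synthetic data with an acquisition-style bias: class 1 samples carry
# a small global intensity offset (a hidden DAB surrogate), not a
# structural feature.
n, d = 2000, 64
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d)) + 0.5 * y[:, None]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A model trained on shuffled inputs can only use acquisition-style cues.
clf = LogisticRegression(max_iter=1000).fit(shuffle_within_sample(X_tr, rng), y_tr)
shuffle_acc = clf.score(shuffle_within_sample(X_te, rng), y_te)
print(f"accuracy on shuffled signals: {shuffle_acc:.2f}")
```

On an unbiased dataset this accuracy would sit near chance (0.5 for a balanced binary task); here it is far higher, revealing the planted bias.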

© 2024. The Author(s).


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1

Fig. 1. Data acquisition induced bias in AI systems in healthcare.

Data collection and existing/proposed AI development and validation workflows in healthcare. A Datasets for deep learning are commonly collected across different healthcare pathways (emergency, respirology, obstetrics, cardiology, etc.) or care networks, and each pathway or hospital uses slightly different data acquisition hardware and protocols. B Models are reported to have clinical-level accuracy, but they achieve it by learning hidden non-semantic and non-structural cues from the acquisition pathway. C The performance does not generalize to other hospitals with different data acquisition pathways. We hypothesize that this is because the models have learned to use subtle data acquisition features as surrogates for diagnoses. D We randomly shuffle the data within each patient sample to suppress structural and semantic information. If the datasets were unbiased, the resulting models would have near-zero accuracy, but their performance is higher than anticipated. E In our proposed workflow, data acquisition bias induced shortcut learning (DABIS), estimated using signal shuffling, enables the reporting of calibrated accuracy measures that are more reflective of external validation accuracy.
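The calibration step in panel E can be sketched abstractly. The caption does not spell out the exact PEst formula, so the function below uses one illustrative choice, discounting the above-chance accuracy that remains achievable on shuffled signals; it is a hypothetical example, not the paper's actual estimator.

```python
def bias_corrected_estimate(source_acc, shuffle_acc, chance=0.5):
    """Illustrative calibration (NOT the paper's exact PEst): treat
    accuracy achieved on within-sample-shuffled signals, above chance,
    as shortcut-attributable and discount it from source-test accuracy."""
    dabis = max(shuffle_acc - chance, 0.0)  # shortcut-attributable margin
    return source_acc - dabis

# A model scoring 0.92 on the source test set but 0.70 on shuffled
# signals would be credited with a calibrated estimate of 0.72.
print(f"{bias_corrected_estimate(0.92, 0.70):.2f}")  # → 0.72
```

The qualitative point matches the workflow: the larger the shuffled-signal accuracy, the more of the source-test performance is attributed to DABIS rather than to genuinely generalizable features.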

Fig. 2

Fig. 2. Comparison of source-data model receiver operating characteristic (ROC) curves, estimated external validation ROC, and observed external validation ROC on 13 datasets and 5 modalities.

Model receiver operating characteristic curves on source (a, c, e, g, i, k, m, o, q) and external validation datasets (b, d, f, h, j, l, n, p, r, s, t). Source dataset figures include the corresponding DABIS estimate (gray), and the external dataset figures include our estimated curves (yellow). Shaded regions depict the 95% confidence interval. Note that the ROC curves on the external test datasets (green) are much better approximated by our predicted curves (yellow) than by the traditional source test dataset curves (red). MIMIC-CXR is shortened to CXR in this figure.
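The 95% confidence bands shown in Fig. 2 are the kind of interval typically obtained by bootstrapping over cases. The sketch below shows one common recipe, a percentile bootstrap of AUC, on synthetic scores; the data and the choice of bootstrap are assumptions for illustration, not taken from the paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

# Toy labels/scores standing in for a model's external-validation output.
y_true = rng.integers(0, 2, size=500)
scores = y_true * 0.8 + rng.normal(scale=0.6, size=500)

auc = roc_auc_score(y_true, scores)

# Percentile bootstrap over samples for a 95% interval.
boot = []
for _ in range(1000):
    idx = rng.integers(0, len(y_true), size=len(y_true))
    if len(np.unique(y_true[idx])) < 2:
        continue  # both classes are needed to define an ROC curve
    boot.append(roc_auc_score(y_true[idx], scores[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"AUC = {auc:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```

Resampling whole cases (label and score together) preserves the within-case coupling, which is why the point estimate falls inside the resulting band.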

