Learning Disease vs Participant Signatures: a permutation test approach to detect identity confounding in machine learning diagnostic applications

Using permutations to assess confounding in machine learning applications for digital health

arXiv (Cornell University), 2018

Clinical machine learning applications are often plagued by confounders that can impact the generalizability and predictive performance of the learners. Confounding is especially problematic in remote digital health studies, where participants self-select into the study, making it challenging to balance their demographic characteristics. One effective approach to combat confounding is to match samples with respect to the confounding variables in order to balance the data. This procedure, however, leads to smaller datasets and hence impacts the inferences drawn from the learners. Alternatively, confounding adjustment methods that make more efficient use of the data (e.g., inverse probability weighting) usually rely on modeling assumptions, and it is unclear how robust these methods are to violations of those assumptions. Here, rather than proposing a new approach to control for confounding, we develop novel permutation-based statistical methods to detect and quantify the influence of observed confounders and estimate the unconfounded performance of the learner. Our tools can be used to evaluate the effectiveness of existing confounding adjustment methods. We illustrate their application using real-life data from a Parkinson's disease mobile health study collected in an uncontrolled environment.
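As an illustration of the general idea (not the authors' exact procedure), the sketch below compares a cross-validated accuracy against a null distribution obtained by restricted permutations, i.e., disease labels shuffled only within strata of an observed confounder such as age group. Because the restricted shuffling preserves the confounder-label association, accuracy that stays above this null suggests signal beyond the confounder, while a null distribution centred well above chance suggests the confounder itself drives the classifier. The classifier, fold count, and permutation count are arbitrary choices of this sketch.

```python
# Restricted (within-stratum) permutation test for confounding -- a sketch.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def cv_accuracy(X, y, cv=5):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(clf, X, y, cv=cv).mean()

def restricted_permutation(y, confounder, rng):
    """Shuffle labels only within each level of the confounder."""
    y_perm = y.copy()
    for level in np.unique(confounder):
        idx = np.where(confounder == level)[0]
        y_perm[idx] = rng.permutation(y[idx])
    return y_perm

def confounding_check(X, y, confounder, n_perm=200):
    # n_perm can be reduced for a quick check; each permutation reruns the CV.
    observed = cv_accuracy(X, y)
    null = np.array([
        cv_accuracy(X, restricted_permutation(y, confounder, rng))
        for _ in range(n_perm)
    ])
    # p-value for signal beyond the confounder; a null mean far above 0.5
    # indicates the confounder alone supports above-chance prediction.
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, null.mean(), p_value
```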

Risk of Training Diagnostic Algorithms on Data with Demographic Bias

Interpretable and Annotation-Efficient Learning for Medical Image Computing

One of the critical challenges in machine learning applications is to produce fair predictions. Numerous recent examples in various domains convincingly show that algorithms trained on biased datasets can easily lead to erroneous or discriminatory conclusions. This is even more crucial in clinical applications, where predictive algorithms are typically designed from a limited or given set of medical images, and demographic variables such as age, sex, and race are not taken into account. In this work, we conduct a survey of the MICCAI 2018 proceedings to investigate common practice in medical image analysis applications. Surprisingly, we found that papers focusing on diagnosis rarely describe the demographics of the datasets used, and that diagnosis is based purely on images. To highlight the importance of considering demographics in diagnosis tasks, we used a publicly available dataset of skin lesions. We then demonstrate that a classifier with an overall area under the curve (AUC) of 0.83 has performance varying between 0.76 and 0.91 on subgroups based on age and sex, even though the training set was relatively balanced. Moreover, we show that it is possible to learn unbiased features by explicitly using demographic variables in an adversarial training setup, which leads to balanced scores per subgroup. Finally, we discuss the implications of these results and provide recommendations for further research.
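A subgroup audit of this kind is straightforward to reproduce on any held-out set. The sketch below computes AUC per demographic subgroup; the column names (`sex`, `age_group`) and the already fitted `model` in the usage comment are hypothetical.

```python
# Per-subgroup AUC audit on a held-out set -- a minimal sketch.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def subgroup_auc(y_true, y_score, groups):
    """AUC per subgroup (NumPy arrays); NaN when a subgroup lacks both classes."""
    rows = []
    for g in np.unique(groups):
        mask = groups == g
        if len(np.unique(y_true[mask])) < 2:
            auc = np.nan
        else:
            auc = roc_auc_score(y_true[mask], y_score[mask])
        rows.append({"subgroup": g, "n": int(mask.sum()), "auc": auc})
    return pd.DataFrame(rows)

# Hypothetical usage:
# demo = (test_df["sex"] + "/" + test_df["age_group"]).values   # e.g. "female/60-70"
# print(subgroup_auc(y_test, model.predict_proba(X_test)[:, 1], demo))
```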

Machine Learning to Evaluate the Quality of Patient Reported Epidemiological Data

Proceedings of the 2018 Joint Statistical Meetings (JSM), 2018

Patient-reported epidemiological data are becoming more widely available. One such new dataset, the Fox Insight (FI) project, was launched in 2017 to encourage the study of Parkinson's disease and will be released for public access in 2019. Early analyses of responses from the earliest participants suggest that there may be significant fatigue effects on elements that occur later in the surveys. These trends point to potential violations of the assumptions of missingness at random (MAR) and missingness completely at random (MCAR), which can limit the inferences that might otherwise be drawn from analyses of these data. Here we discuss a machine learning approach that can be used to evaluate the likelihood that an individual respondent is "doing their best" vs. not. Bayesian network structure learning is used to identify the network structure, and data quality scores (DQS) were estimated and analyzed within and across each section of a set of seven patient-reported instruments. The proportion of respondents whose DQS fell below a cutoff (threshold) for data that are unacceptably or unexpectedly similar to random responses ranged from a low of 13% to a high of 66%. Our results suggest that the method is not unduly influenced by the length of the instruments or their internal consistency scores. The method can be used to detect, quantify, and then plan or choose the method of addressing nonresponse bias, if it exists, in any dataset an investigator may choose, including the FI dataset, once it is made available. The method can also be used to diagnose challenges that may arise in one's own dataset, possibly arising from a misalignment of patient and investigator perspectives on the relevance or resonance of the data being collected.
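The paper's data quality scores come from a learned Bayesian network; as a deliberately simplified stand-in, the sketch below scores each respondent by how much more likely their answers are under empirical item marginals than under uniform random responding, treating scores near zero as random-like. The integer answer coding, Laplace smoothing, and cutoff rule are assumptions of this sketch, not the paper's procedure.

```python
# Simplified data-quality score: independent item marginals instead of the
# paper's Bayesian network. Higher scores = less like random responding.
import numpy as np

def marginal_dqs(responses, n_levels):
    """responses: (n_respondents, n_items) integer-coded answers in [0, n_levels)."""
    n, m = responses.shape
    dqs = np.zeros(n)
    for j in range(m):
        counts = np.bincount(responses[:, j], minlength=n_levels) + 1.0  # Laplace smoothing
        probs = counts / counts.sum()
        # log P(answer | fitted marginals) - log P(answer | uniform responder)
        dqs += np.log(probs[responses[:, j]]) - np.log(1.0 / n_levels)
    return dqs / m  # average log-likelihood ratio per item

# One possible cutoff: flag respondents whose score falls below, say, the 95th
# percentile of scores obtained from simulated uniform-random responders.
```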

Impact of the Choice of Cross-Validation Techniques on the Results of Machine Learning-Based Diagnostic Applications

Healthcare Informatics Research

Objectives: With advances in data availability and computing capabilities, artificial intelligence and machine learning technologies have evolved rapidly in recent years. Researchers have taken advantage of these developments in healthcare informatics and created reliable tools to predict or classify diseases using machine learning-based algorithms. To correctly quantify the performance of those algorithms, the standard approach is to use cross-validation, where the algorithm is trained on a training set and its performance is measured on a validation set. Both datasets should be subject-independent to simulate the expected behavior of a clinical study. This study compares two cross-validation strategies, the subject-wise and the record-wise techniques; the subject-wise strategy correctly mimics the process of a clinical study, while the record-wise strategy does not. Methods: We started by creating a dataset of smartphone audio recordings of subjects diagnosed with and without Park...
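A minimal sketch of the comparison, assuming a feature matrix X, labels y, and a `subject_id` array marking which recordings belong to the same participant: record-wise splitting uses a plain KFold, while subject-wise splitting uses GroupKFold so that no participant appears on both sides of a split. The classifier is an arbitrary choice.

```python
# Record-wise vs subject-wise cross-validation -- a sketch.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, GroupKFold, cross_val_score

def compare_cv(X, y, subject_id):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    # Record-wise: recordings of one subject may land in both train and test folds.
    record_wise = cross_val_score(clf, X, y,
                                  cv=KFold(n_splits=5, shuffle=True, random_state=0))
    # Subject-wise: all recordings of a subject stay on the same side of the split.
    subject_wise = cross_val_score(clf, X, y, groups=subject_id,
                                   cv=GroupKFold(n_splits=5))
    # An optimistic gap (record-wise >> subject-wise) suggests the model is
    # partly recognising individual subjects rather than the disease.
    return record_wise.mean(), subject_wise.mean()
```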

Biased binomial assessment of cross-validated estimation of classification accuracies illustrated in diagnosis predictions

NeuroImage. Clinical, 2014

Multivariate classification is used in neuroimaging studies to infer brain activation and in medical applications to infer diagnosis. Its results are often assessed through either a binomial or a permutation test. Here, we simulated classification results on generated random data to assess the influence of the cross-validation scheme on the significance of results. Distributions built from classification of random data with cross-validation did not follow the binomial distribution, so the binomial test is not appropriate. In contrast, the permutation test was unaffected by the cross-validation scheme. The influence of the cross-validation scheme was further illustrated on real data from a brain-computer interface experiment in patients with disorders of consciousness and from an fMRI study of patients with Parkinson disease. Three out of 16 patients with disorders of consciousness had significant accuracy on binomial testing, but only one showed significant accuracy using permutati...
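The contrast can be reproduced with standard tooling. The sketch below computes a cross-validated accuracy and then both a binomial p-value (treating the pooled predictions as independent trials with chance level 0.5, the assumption shown to be problematic) and a permutation p-value obtained by re-running the full cross-validation on label-shuffled data. The classifier and fold scheme are arbitrary choices for illustration.

```python
# Binomial vs permutation assessment of a cross-validated accuracy -- a sketch.
import numpy as np
from scipy.stats import binomtest
from sklearn.svm import SVC
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     permutation_test_score)

def assess_significance(X, y, n_perm=1000):
    clf = SVC(kernel="linear")
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    acc = cross_val_score(clf, X, y, cv=cv).mean()

    # Binomial test: treats the pooled CV predictions as independent Bernoulli
    # trials with chance level 0.5 (assumes balanced classes).
    n = len(y)
    p_binom = binomtest(int(round(acc * n)), n, p=0.5, alternative="greater").pvalue

    # Permutation test: reruns the whole CV on label-shuffled data, so the
    # dependence between folds is built into the null distribution.
    _, _, p_perm = permutation_test_score(clf, X, y, cv=cv,
                                          n_permutations=n_perm, random_state=0)
    return acc, p_binom, p_perm
```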

Everything is Varied: The Surprising Impact of Individual Variation on ML Robustness in Medicine

arXiv (Cornell University), 2022

In medical settings, Individual Variation (IV) refers to variation that is due not to population differences or errors, but rather to within-subject variation, that is, the intrinsic and characteristic patterns of variation pertaining to a given instance or the measurement process. While taking IV into account has been deemed critical for proper analysis of medical data, this source of uncertainty and its impact on robustness have so far been neglected in Machine Learning (ML). To fill this gap, we look at how IV affects ML performance and generalization and how its impact can be mitigated. Specifically, we provide a methodological contribution to formalize the problem of IV in the statistical learning framework and, through an experiment based on one of the largest real-world laboratory medicine datasets for the problem of COVID-19 diagnosis, we show that: 1) common state-of-the-art ML models are severely impacted by the presence of IV in data; and 2) advanced learning strategies, based on data augmentation and data imprecisiation, and proper study designs can be effective at improving robustness to IV. Our findings demonstrate the critical relevance of correctly accounting for IV to enable safe deployment of ML in clinical settings.
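As a toy illustration of the augmentation idea (not the authors' pipeline), the sketch below jitters each training example with noise scaled by an externally supplied estimate of per-feature within-subject variability, so the learner sees each measurement at several plausible values. The `within_subject_sd` vector is an assumed input, e.g. derived from analytical and biological variation of the laboratory tests.

```python
# Augmentation against individual variation -- an illustrative sketch.
import numpy as np

def augment_with_iv(X, y, within_subject_sd, n_copies=5, rng=None):
    """Return training data extended with noisy copies of each example.

    within_subject_sd: per-feature standard deviation of within-subject variation.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    X_aug, y_aug = [X], [y]
    for _ in range(n_copies):
        noise = rng.normal(0.0, within_subject_sd, size=X.shape)
        X_aug.append(X + noise)
        y_aug.append(y)
    return np.vstack(X_aug), np.concatenate(y_aug)

# A classifier trained on the augmented data sees each patient's measurements
# at several plausible values, encouraging predictions that are stable under IV.
```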

Estimating misclassification error: a closer look at cross-validation based methods

BMC Research Notes, 2012

Background: To estimate a classifier's error in predicting future observations, bootstrap methods have been proposed as reduced-variation alternatives to traditional cross-validation (CV) methods based on sampling without replacement. Monte Carlo (MC) simulation studies aimed at estimating the true misclassification error conditional on the training set are commonly used to compare CV methods. We conducted an MC simulation study to compare a new method of bootstrap CV (BCV) to k-fold CV for estimating classification error. Findings: For the low-dimensional conditions simulated, the modest positive bias of k-fold CV contrasted sharply with the substantial negative bias of the new BCV method. This behavior was corroborated using a real-world dataset of prognostic gene-expression profiles in breast cancer patients. Our simulation results demonstrate some extreme characteristics of variance and bias that can occur due to a fault in the design of CV exercises aimed at estimating the true conditional error of a classifier, and that appear not to have been fully appreciated in previous studies. Although CV is a sound practice for estimating a classifier's generalization error, using CV to estimate the fixed misclassification error of a trained classifier conditional on the training set is problematic. While MC simulation of this estimation exercise can correctly represent the average bias of a classifier, it will overstate the between-run variance of the bias. Conclusions: We recommend k-fold CV over the new BCV method for estimating a classifier's generalization error. The extreme negative bias of BCV is too high a price to pay for its reduced variance.
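The simulation design can be mimicked with synthetic data: fit once on a small training set, measure the "true" conditional error on a large independent test set, and compare it with the k-fold CV estimate and with a generic bootstrap-then-CV estimate (CV run on a with-replacement resample of the training set). The latter is a stand-in for, not a reproduction of, the BCV method studied; sample sizes and the classifier are arbitrary.

```python
# Monte Carlo sketch of CV bias relative to the true conditional error.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

def one_run(n_train=100, n_test=10000, k=10):
    X, y = make_classification(n_samples=n_train + n_test, n_features=10,
                               n_informative=3,
                               random_state=int(rng.integers(1_000_000)))
    Xtr, ytr = X[:n_train], y[:n_train]
    Xte, yte = X[n_train:], y[n_train:]

    # "True" misclassification error conditional on this training set,
    # estimated on a large independent test set.
    true_err = 1 - LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte)

    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    kfold_err = 1 - cross_val_score(LogisticRegression(max_iter=1000),
                                    Xtr, ytr, cv=cv).mean()

    # Generic bootstrap-then-CV: duplicates from the resample can straddle
    # folds, which tends to make the estimate optimistic.
    boot = rng.integers(0, n_train, size=n_train)
    bcv_err = 1 - cross_val_score(LogisticRegression(max_iter=1000),
                                  Xtr[boot], ytr[boot], cv=cv).mean()
    return kfold_err - true_err, bcv_err - true_err

biases = np.array([one_run() for _ in range(50)])
print("mean bias  k-fold CV: %+.3f   bootstrap-then-CV: %+.3f" % tuple(biases.mean(0)))
```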

Disease Labeling via Machine Learning is NOT quite the same as Medical Diagnosis

2019

A key step in medical diagnosis is giving the patient a universally recognized label (e.g., appendicitis), which essentially assigns the patient to a class (or classes) of patients with similar body failures. However, two patients who share the same high-probability disease label(s) may still differ in their feature manifestation patterns, implying differences in the required treatments. Additionally, in many cases the labels of the primary diagnoses leave some findings unexplained. Medical diagnosis is only partially about probability calculations for label X or Y. Diagnosis is not complete until the patient's overall situation is clinically understood to the level that enables the best therapeutic decisions. Most machine learning models are data-centric models, and evidence so far suggests they can reach expert-level performance in the disease labeling phase. Nonetheless, like any other mathematical technique, they have their limitations and applicability scope. Primarily, data cent...

How to remove or control confounds in predictive models, with applications to brain biomarkers

GigaScience, 2022

Background: With increasing data sizes and more easily available computational methods, the neurosciences rely more and more on predictive modeling with machine learning, e.g., to extract disease biomarkers. Yet a successful prediction may capture a confounding effect correlated with the outcome instead of brain features specific to the outcome of interest. For instance, because patients tend to move more in the scanner than controls, imaging biomarkers of a disease condition may mostly reflect head motion, leading to inefficient use of resources and misinterpretation of the biomarkers. Results: Here we study how to adapt statistical methods that control for confounds to predictive modeling settings. We review how to train predictors that are not driven by such spurious effects. We also show how to measure the unbiased predictive accuracy of these biomarkers based on a confounded dataset. For this purpose, cross-validation must be modified to account for the nuisance effect. To guide...
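One standard adjustment in this family is out-of-sample deconfounding: regress the confound out of each feature, fitting the regression on training folds only so the adjustment cannot leak test information. The sketch below implements this as a scikit-learn transformer, with the (assumed) convention that the confound travels as the last column of the feature matrix; the confound name and classifier in the usage comment are hypothetical.

```python
# Out-of-sample deconfounding by linear regression -- a sketch.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

class ConfoundRegressor(BaseEstimator, TransformerMixin):
    """Remove the linear effect of a confound (last column of X) from the features."""

    def fit(self, X, y=None):
        features, confound = X[:, :-1], X[:, -1:]
        self.reg_ = LinearRegression().fit(confound, features)
        return self

    def transform(self, X):
        features, confound = X[:, :-1], X[:, -1:]
        # Return residualized features; the confound column is dropped.
        return features - self.reg_.predict(confound)

# Hypothetical usage: append the confound as the last column so it travels
# with the cross-validation folds and is only ever fitted on training data.
# Xc = np.column_stack([X, head_motion])
# pipe = Pipeline([("deconfound", ConfoundRegressor()),
#                  ("clf", LogisticRegression(max_iter=1000))])
# scores = cross_val_score(pipe, Xc, y,
#                          cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
```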