Outcome class imbalance and rare events: An underappreciated complication for overdose risk prediction modeling

Abigail R Cartus et al. Addiction. 2023 Jun.

Abstract

Background and aims: Low outcome prevalence, often observed with opioid-related outcomes, poses an underappreciated challenge to accurate predictive modeling. Outcome class imbalance, where non-events (i.e. negative class observations) outnumber events (i.e. positive class observations) by a moderate to extreme degree, can distort measures of predictive accuracy in misleading ways, and make the overall predictive accuracy and the discriminatory ability of a predictive model appear spuriously high. We conducted a simulation study to measure the impact of outcome class imbalance on predictive performance of a simple SuperLearner ensemble model and suggest strategies for reducing that impact.
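To make the "spuriously high" accuracy concrete, consider a hypothetical classifier that never predicts an event. The sketch below is illustrative only and is not the SuperLearner model from the study; the function name is an assumption, and the ratios simply mirror the four imbalance levels described in this abstract.

```python
def always_negative_accuracy(ratio_nonevents_to_events: float) -> float:
    """Overall accuracy of a classifier that labels every observation a non-event.

    With r non-events per event, prevalence is 1 / (1 + r), and the
    classifier is right exactly on the non-events.
    """
    prevalence = 1.0 / (1.0 + ratio_nonevents_to_events)
    return 1.0 - prevalence


# Imbalance ratios taken from the study design (1:1, 10:1, 100:1, 1000:1).
for ratio in (1, 10, 100, 1000):
    acc = always_negative_accuracy(ratio)
    # Sensitivity is 0 in every case: no event is ever detected.
    print(f"{ratio}:1 imbalance -> accuracy {acc:.4f}, sensitivity 0.0")
```

Accuracy rises with imbalance even though this classifier identifies no events at all, which is why accuracy alone says little when events are rare.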

Design, setting, participants: Using a Monte Carlo design with 250 repetitions, we trained and evaluated these models on four simulated data sets with 100 000 observations each: one with perfect balance between events and non-events, and three where non-events outnumbered events by an approximate factor of 10:1, 100:1, and 1000:1, respectively.
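The abstract does not describe the data-generating process, so the following is only a sketch of how data sets with these imbalance ratios could be simulated. It assumes a logistic model with a single standard normal covariate and tunes the intercept by bisection to hit a target prevalence; the function names, the slope, and the seeds are all hypothetical choices, not the authors' method.

```python
import math
import random


def intercept_for_prevalence(target, slope=1.0, n_draws=5000, seed=1):
    """Find the logistic intercept whose mean event probability is close to `target`."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n_draws)]
    lo, hi = -20.0, 20.0
    for _ in range(40):  # bisection: mean probability increases with the intercept
        mid = (lo + hi) / 2.0
        mean_p = sum(1.0 / (1.0 + math.exp(-(mid + slope * x))) for x in xs) / n_draws
        if mean_p < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def simulate(n, intercept, slope=1.0, seed=0):
    """Draw n (covariate, outcome) pairs from the assumed logistic model."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-(intercept + slope * x)))
        data.append((x, 1 if rng.random() < p else 0))
    return data


N = 100_000  # observations per data set, as stated in the abstract
for ratio in (1, 10, 100, 1000):
    target = 1.0 / (1.0 + ratio)  # prevalence implied by a ratio:1 imbalance
    data = simulate(N, intercept_for_prevalence(target))
    events = sum(y for _, y in data)
    realized = (N - events) / events if events else float("inf")
    print(f"target {ratio}:1 -> {events} events, realized ratio about {realized:.1f}:1")
```

A full Monte Carlo study along the lines described would repeat the draw with a different seed for each of the 250 repetitions.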

Measurements: We evaluated the performance of these models using a comprehensive suite of measures, including measures that are more appropriate for imbalanced data.

Findings: Increasing imbalance tended to spuriously improve overall accuracy (using a high threshold to classify events vs non-events, overall accuracy improved from 0.45 with perfect balance to 0.99 with the most severe outcome class imbalance), but diminished predictive performance was evident using other metrics (corresponding positive predictive value decreased from 0.99 to 0.14).
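The opposite movement of accuracy and positive predictive value (PPV) follows from how each depends on prevalence. The sketch below holds sensitivity and specificity fixed at hypothetical values (0.30 and 0.99, chosen only for illustration and not taken from the study) and varies prevalence; in the study itself these quantities also changed across simulations, so the numbers printed here will not match the reported results.

```python
def accuracy(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Overall accuracy as a prevalence-weighted mix of sensitivity and specificity."""
    return sensitivity * prevalence + specificity * (1.0 - prevalence)


def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value via Bayes' rule: TP / (TP + FP)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    total = true_pos + false_pos
    return true_pos / total if total > 0 else float("nan")


SENS, SPEC = 0.30, 0.99  # hypothetical operating point, not from the paper
for ratio in (1, 10, 100, 1000):
    prev = 1.0 / (1.0 + ratio)
    print(
        f"{ratio}:1  accuracy={accuracy(SENS, SPEC, prev):.3f}  "
        f"PPV={ppv(SENS, SPEC, prev):.3f}"
    )
```

As prevalence falls, accuracy is dominated by specificity on the many non-events, while PPV is dragged down because even a small false positive rate produces more false positives than there are events to find.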

Conclusion: Increasing reliance on algorithmic risk scores in consequential decision-making processes raises critical fairness and ethical concerns. This paper provides broad guidance for analytic strategies that clinical investigators can use to remedy the impacts of outcome class imbalance on risk prediction tools.

Keywords: Class imbalance; machine learning; overdose; rare events; risk prediction; substance use.

© 2023 Society for the Study of Addiction.

Conflict of interest statement

Declaration of interests: none

Figures

Figure 1. Histograms of risk scores, receiver operating characteristic (ROC) curves, and precision-recall curves for all models. Top row: histograms of predicted risk scores. Middle row: receiver operating characteristic (ROC) curves. Bottom row: precision-recall curves. Within each row, the degree of imbalance increases from left to right: 1:1 for the leftmost column, then 10:1, 100:1, and 1000:1 for the rightmost column. In the bottom row, the outcome prevalence of each simulation run is shown as a horizontal blue line. Abbreviations: AUC: area under the curve; PRC: area under the precision-recall curve.
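The prevalence line in the precision-recall panels marks the precision an uninformative model would achieve, whereas the corresponding ROC reference is an AUC of 0.5 regardless of prevalence. The sketch below illustrates this with a random scorer; the helper functions are hand-written for illustration (a rank-based ROC AUC and a simple average precision that breaks ties arbitrarily) and are not the software used in the study.

```python
import random


def roc_auc(labels, scores):
    """ROC AUC via the Mann-Whitney rank-sum formula (tied scores get average ranks)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each event (ties broken arbitrarily)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


rng = random.Random(0)
prevalence = 1.0 / 101.0  # a 100:1 imbalance
labels = [1 if rng.random() < prevalence else 0 for _ in range(50_000)]
scores = [rng.random() for _ in labels]  # scores unrelated to the outcome
print(f"observed prevalence: {sum(labels) / len(labels):.4f}")
print(f"ROC AUC:             {roc_auc(labels, scores):.4f}")
print(f"average precision:   {average_precision(labels, scores):.4f}")
```

Because the precision-recall baseline shrinks with prevalence, a precision-recall curve should be read against that line rather than against a fixed reference value.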

Figure 2. Confusion matrices showing accuracy of predicted classifications. Each confusion matrix shows the frequency of observations in each quadrant. Concordant quadrants (where the predictions are correct) are shaded in blue, while discordant quadrants (where predictions are incorrect) are shaded in coral. Quadrants are shaded by frequency such that those with more observations are shaded darker. Clockwise from the top left quadrant: true positives, false negatives, true negatives, false positives.
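A confusion matrix in the quadrant order described in this caption can be built from observed outcomes, risk scores, and a classification threshold. The sketch below is a minimal illustration; the function name, the toy data, and the rule that scores at or above the threshold count as predicted events are assumptions.

```python
def confusion_matrix(labels, scores, threshold):
    """Return [[TP, FN], [FP, TN]]: rows are observed event / non-event,
    columns are predicted event / non-event, so reading clockwise from the
    top left gives TP, FN, TN, FP as in the caption."""
    tp = fn = fp = tn = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0  # assumed classification rule
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return [[tp, fn], [fp, tn]]


# Toy data for illustration only.
toy_labels = [1, 1, 0, 0, 0]
toy_scores = [0.9, 0.2, 0.8, 0.1, 0.3]
(tp, fn), (fp, tn) = confusion_matrix(toy_labels, toy_scores, threshold=0.5)
print(f"TP={tp} FN={fn}")
print(f"FP={fp} TN={tn}")
```

Raising the threshold moves observations from the predicted-event column to the predicted-non-event column, trading false positives for false negatives.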
