Multimodal Deep Learning for ARDS Detection

Abstract

Objective:

Poor outcomes in acute respiratory distress syndrome (ARDS) can be alleviated with tools that support early diagnosis. Current machine learning methods for detecting ARDS do not take full advantage of the multimodality of ARDS pathophysiology. We developed a multimodal deep learning model that uses imaging data, continuously collected ventilation data, and tabular data derived from a patient’s electronic health record (EHR) to make ARDS predictions.

Materials and Methods:

A chest radiograph (x-ray), at least two hours of ventilator waveform (VWD) data within the first 24 hours of intubation, and EHR-derived tabular data were used from 220 patients admitted to the ICU to train a deep learning model. The model uses pretrained encoders for the x-rays and ventilation data and trains a feature extractor on tabular data. Encoded features for a patient are combined to make a single ARDS prediction. Ablation studies for each modality assessed their effect on the model’s predictive capability.

Results:

The trimodal model achieved an area under the receiver operating characteristic curve (AUROC) of 0.86 with a 95% confidence interval of 0.01. This was a statistically significant improvement (p < 0.05) over single-modality models and over bimodal models trained on VWD+tabular and VWD+x-ray data.

Discussion and Conclusion:

Our results demonstrate the potential utility of using deep learning to address complex conditions with heterogeneous data. More work is needed to determine the additive effect of modalities on ARDS detection. Our framework can serve as a blueprint for building performant multimodal deep learning models for conditions with small, heterogeneous datasets.

Introduction

Acute respiratory distress syndrome (ARDS) is a severe form of hypoxemic respiratory failure that has proven challenging to diagnose and manage in a timely fashion, despite affecting up to 10% of ICU admissions and 25% of mechanically ventilated (MV) patients [1]. ARDS is associated with substantial morbidity and mortality, prolonged MV, high hospital-associated costs, and long-term physical and psychologic dysfunctions [1, 2]. Moreover, poor outcomes in ARDS are associated with delayed and missed diagnosis and suboptimal use of evidence-based therapies, even by subspecialty-trained clinicians [1, 2, 3, 4], highlighting the need for tools to support early diagnosis.

The diagnostic criteria for ARDS require laboratory-based, electronic health record (EHR)-derived arterial blood gas (ABG) measurement of the degree of hypoxemia, ascertainment of mechanical ventilation settings, and the presence of bilateral opacities on chest imaging [5]. This inherent multimodality of ARDS pathophysiology suggests that integrating multiple data sources could improve early diagnostic accuracy. Recent advances in deep learning have shown promise for early ARDS detection using single modalities, including convolutional neural networks (CNNs) on single chest radiographs (CXRs) [6] and on ventilator waveform data (VWD) [7]. While these unimodal approaches have shown promise, they may miss important clinical patterns that only become apparent when analyzing multiple data sources simultaneously.

Previous work has explored bimodal approaches to ARDS detection through methods that combine the predictions of models trained on CXR and EHR data [8, 9, 10]. However, recent advances in multimodal machine learning outside of ARDS detection suggest that more integrated approaches to modality fusion of more complex data could yield better results. For instance, studies have demonstrated the utility of deep fusion techniques by using medical images and diagnostic notes to improve classification tasks including pneumonia detection, COVID detection, and bone abnormality detection [11]. These more sophisticated methods can capture both the unique contributions of each data type and the potentially important interactions between different physiologic measurements.

Recent work is also often limited to only two datatypes or modalities [10]. While bimodal approaches have shown initial promise, they have been largely limited to static types of data, such as a single metric of hypoxemia from arterial blood gases. This narrow use of data fails to account for the dynamic nature of patients requiring mechanical ventilation, leading to missed opportunities to identify the earliest patterns of disease. The complex pathophysiology of ARDS, reflected in the diverse modalities of the Berlin criteria, suggests that integrating VWD, CXRs, and clinical data from the EHR could capture subtle interactions between the physiologic, radiographic, and clinical manifestations of the condition.

Finally, past work in ARDS detection uses thousands to tens of thousands of patients’ data during training [10]. As collecting data from this volume of patients can be both expensive and time consuming, robust methods that can be trained with less data are needed.

Building on advances in both ARDS detection and multimodal learning, we hypothesized that combining VWD, CXRs, and EHR data into a single, trimodal deep learning model would improve both the accuracy and robustness of early ARDS detection, even when training on small, heterogeneous datasets. Specifically, we investigated whether an integrated trimodal model that uses pretrained encoders could better discriminate between ARDS and non-ARDS patients compared to unimodal and bimodal methods.

Methods

Cohort Description

All patient data were obtained as part of a prospective, Institutional Review Board approved study collecting data from mechanically ventilated adults admitted to the intensive care unit (ICU) at the University of California (UC) Davis Medical Center. Three critical care physicians performed retrospective chart review to identify the cause of respiratory failure in subjects from the study cohort enrolled between 2015 and 2019. The presence or absence of ARDS was determined using the Berlin consensus criteria [5] by independent clinician review, with disagreements between any two clinicians resolved by consensus chart review. Subjects with ARDS were divided into two subgroups according to the relative ambiguity of ARDS classification: Group 1, patients with confirmed moderate or severe ARDS and without chronic obstructive pulmonary disease and/or asthma (to minimize the risk of misclassifying hypoxemia from concurrent non-ARDS acute or chronic lung disease as ARDS), chronic congestive heart failure, or severe obesity; and Group 2, patients with ARDS of any severity and without exclusion of these conditions.

Causes of ARDS for patients meeting the Berlin criteria are shown in Table 1. Table 1 also includes basic demographic information and additional clinical information such as Sequential Organ Failure Assessment score, hospital length of stay, and hospital mortality.

Table 1:

Comparison of characteristics between ARDS and non-ARDS patients. Note: 3 non-ARDS patients were not admitted to the ICU.

Characteristic ARDS (n = 120) Non-ARDS (n = 120)
Age (median [IQR]) 57 (47–68) 62 (53–71)
Female (n [%]) 53 (44) 48 (40)
Body mass index (median [IQR]) 27.5 (22.4–34.7) 26.6 (22.2–33)
Sequential Organ Failure Assessment score¹ (median [IQR]) 11 (9–14.3) 8 (6–11)
Days from intubation to Berlin criteria (median [IQR]) 0.11 (0.06–0.24)
Median PaO2/FiO2 first 24 hours (median [IQR]) 166 (132–219.5) 311 (218–390)
Worst PaO2/FiO2 first 24 hours (median [IQR]) 111 (73–153) 218 (135–313)
Hospital length of stay (median [IQR]) 12.2 (7.3–22.7) 9 (5.3–16.9)
Hospital mortality (n [%]) 57 (48) 32 (27)
Ventilator-free days in 28 days (median [IQR]) 6.6 (0–21.9) 23.5 (4–26.3)
ARDS insult type (n [%])
Pneumonia 64 (53.3)
Aspiration 31 (25.8)
Non-pulmonary sepsis 18 (15)
Trauma 3 (2.5)
Other 4 (3.3)

No sample size calculation was conducted for this study; however, sample size was guided by the range of cohort sizes in previous studies of VWD analysis [7] and to achieve a balanced dataset for machine learning (ML) model development. Standard ML algorithms are often biased towards the majority class, resulting in a higher misclassification rate for the minority class [12].

Data Preparation

VWD representing air flow in liters per minute was collected from Puritan Bennett 840 ventilators as a 1-dimensional time series at a rate of 50 Hz. VWD for each patient was split into windows of 30,000 samples (10 minutes of data collection), and the number of windows for each patient was limited to the first 36 available windows (the first six hours of available data). Incomplete windows were padded with 0s to create uniform window lengths. All patients had at least one hour of VWD in the first 24 hours after Berlin criteria were first met or after the start of MV.
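As an illustrative sketch (not the study's actual code), the windowing step described above might look like the following; the function name and exact padding details are our own:

```python
import numpy as np

def window_vwd(flow, samples_per_window=30_000, max_windows=36):
    """Split a 1-D flow signal sampled at 50 Hz into 10-minute windows.

    Incomplete trailing windows are zero-padded to a uniform length, and
    at most the first `max_windows` windows (six hours) are kept.
    """
    n_windows = min(int(np.ceil(len(flow) / samples_per_window)), max_windows)
    windows = np.zeros((n_windows, samples_per_window), dtype=np.float32)
    for i in range(n_windows):
        chunk = flow[i * samples_per_window:(i + 1) * samples_per_window]
        windows[i, :len(chunk)] = chunk
    return windows
```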

CXRs were obtained from a Picture Archiving and Communication System (PACS) in the form of DICOM files. For ARDS patients, the first CXR that met the Berlin criteria for ARDS was used. For non-ARDS patients, the first CXR obtained within 24 hours of intubation was used. CXRs were resized and cropped to a uniform size of 512×512 pixels.

Tabular data derived from the EHR were included from the first 24 hours after meeting Berlin criteria or the first 24 hours after the start of MV for non-ARDS patients. A full list of features used is available in Supplementary Table 1. Categorical features were one-hot encoded during data processing. Numerical features that were measured multiple times for a single patient were aggregated. For each numerical feature, we calculated the minimum, maximum, mean, and median of the feature. All numerical features were then normalized to have a mean of zero and unit variance. Missing numerical features were filled with the mean of the feature across the dataset.
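The aggregation and scaling steps above can be sketched as follows; the helper names are hypothetical, and mean imputation is expressed as filling with 0 after z-scoring, which is equivalent to filling with the training-set mean before scaling:

```python
import numpy as np

def summarize(values):
    """Aggregate repeated measurements of one numerical feature into the
    four summary statistics used as model inputs."""
    v = np.asarray(values, dtype=float)
    return {"min": v.min(), "max": v.max(),
            "mean": v.mean(), "median": np.median(v)}

def zscore_impute(X_train, X_other):
    """Z-score columns using training-set statistics; missing entries (NaN)
    become 0 after scaling, i.e. the training mean."""
    mu = np.nanmean(X_train, axis=0)
    sd = np.nanstd(X_train, axis=0)
    sd[sd == 0] = 1.0  # guard against constant columns
    scale = lambda X: np.nan_to_num((X - mu) / sd, nan=0.0)
    return scale(X_train), scale(X_other)
```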

During model training, each VWD window was treated as a separate training instance, with the corresponding CXR and EHR data repeated across windows for the same patient. During inference, predictions from all windows for a single patient were averaged to produce a final classification probability.
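A minimal sketch of how windows become training instances and how window-level predictions are pooled at inference; both helper names are our own:

```python
import numpy as np

def make_instances(windows, cxr_embedding, ehr_vector):
    """Pair each VWD window with the patient's (repeated) CXR and EHR data,
    yielding one training instance per window."""
    return [(w, cxr_embedding, ehr_vector) for w in windows]

def patient_prediction(window_probs):
    """Average window-level ARDS probabilities into a single
    patient-level classification probability."""
    return float(np.mean(window_probs))
```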

Train/Test Splitting

All data splits were performed at the patient level. A 5-fold cross-validation procedure (outer folds) was used, with a further nested 5-fold split (inner folds) within each outer fold. Hyperparameter tuning was performed using a grid search on the inner folds. Within each training split, data were further split 80/20 into training and validation sets. The splitting procedure is shown in Supplementary Figure 3.

The entire k-fold procedure was repeated 10 times for each modality combination, and performance was assessed using unseen data from the outer k-fold splits. This resulted in 50 total holdout sets for each model. One-hot encoding and scaling, as described in the previous section, were calculated on the training set for each split and applied to the validation and holdout sets.
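A sketch of the repeated patient-level splitting described above (the inner 5-fold grid-search loop is omitted for brevity; the seed and function name are our own):

```python
import numpy as np

def repeated_splits(patient_ids, n_outer=5, n_repeats=10, seed=0):
    """Yield (train, validation, holdout) patient-ID splits: a 5-fold outer
    procedure repeated 10 times (50 holdout sets in total), with each outer
    training portion further split 80/20 into train/validation."""
    rng = np.random.default_rng(seed)
    ids = np.asarray(patient_ids)
    for _ in range(n_repeats):
        order = rng.permutation(len(ids))
        folds = np.array_split(order, n_outer)
        for k in range(n_outer):
            rest = np.concatenate([folds[j] for j in range(n_outer) if j != k])
            rest = rng.permutation(rest)
            cut = int(0.8 * len(rest))
            yield ids[rest[:cut]], ids[rest[cut:]], ids[folds[k]]
```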

Model Development

Each multimodal model consists of sub-classes of models: base encoders for each modality, modality-specific projection heads, and a classification head. Given a batch of patients, the multimodal model first encodes data for each patient from each modality using the base encoders. Then it passes the encoded data through modality-specific projection heads. Finally, it combines the projected data via concatenation before passing it into the single classification head. Figure 1 shows this graphically.

Figure 1:

A schematic depiction of our trimodal architecture showing encoders, projection heads, and classification layers. X-ray and ventilator waveform encoders use a convolutional neural network architecture, while tabular encoders, projection heads, and classification networks use fully connected layers.

Modality-specific base encoders, Enc(⋅), map an input sample, x, to an embedding vector, e = Enc(x) ∈ ℝ^De, where De is the embedding dimension. De is 128 for both CXR and VWD data and 64 for EHR data. Base encoders are described in detail in the next section.

Projection heads, Proj(⋅), map e to a vector z = Proj(e) ∈ ℝ^Dp. Dp is tuned as a hyperparameter for each model, with possible values of 8, 16, or 32. Proj(⋅) is a one- or two-layer perceptron with hidden and output layer dimensions of Dp; the number of layers is also tuned as a hyperparameter. z is normalized to the unit hypersphere in ℝ^Dp.

Classification heads, Class(⋅), map z to two classes representing ARDS or non-ARDS. Class(⋅) is a two-layer perceptron with input and hidden layer dimensions of Dp·m, where m is the number of modalities, and an output layer of dimension two with a softmax activation.

Enc(⋅) layers are pretrained models (CXR and VWD) or trained on each individual split (EHR). Enc(⋅) layers are frozen during downstream model training. Proj(⋅) and Class(⋅) layers are jointly trained using a cross-entropy loss.
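A shape-level NumPy sketch of the fusion head, using random untrained weights purely to make the tensor dimensions concrete; single-layer ReLU projections without biases are our simplifying assumption, not the tuned architecture:

```python
import numpy as np

def l2_normalize(z):
    """Project each row onto the unit hypersphere."""
    return z / (np.linalg.norm(z, axis=-1, keepdims=True) + 1e-12)

def fuse_and_classify(embeddings, proj_weights, clf_weights):
    """Project each modality's embedding, L2-normalize, concatenate the
    projections, and classify with a softmax over two classes."""
    zs = [l2_normalize(np.maximum(e @ W, 0.0))       # (batch, Dp) per modality
          for e, W in zip(embeddings, proj_weights)]
    z = np.concatenate(zs, axis=-1)                  # (batch, Dp * m)
    logits = z @ clf_weights                         # (batch, 2)
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)
```

With the paper's dimensions (De = 128 for CXR and VWD, 64 for EHR; Dp = 16 as one of the tuned options), the concatenated vector for the trimodal model has dimension 16 × 3 = 48.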

Models are trained using the Adam optimizer with an initial learning rate of 1e-4 and weight decay of 1e-6. Training is performed for a maximum of 150 epochs with early stopping based on validation loss with a patience of 3 epochs after a warm-up of 10 epochs.
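The early-stopping rule can be made concrete with a small helper; this is our own formulation of "patience of 3 epochs after a warm-up of 10 epochs", not the study's training code:

```python
def early_stop(val_losses, patience=3, warmup=10):
    """Return True once training should stop: after the warm-up period,
    stop when validation loss has not improved for `patience` epochs."""
    if len(val_losses) <= warmup:
        return False
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    return len(val_losses) - 1 - best_epoch >= patience
```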

Base Encoders

The VWD-specific encoder is a CNN pretrained for audio tagging, the task of assigning one or multiple semantic labels to an audio clip [13]. We use the pretrained model to extract intermediate features to use as our embeddings [14].

The CXR-specific encoder uses a pretrained model, which is a ResNet-50 backbone trained on CXRs for eight cardiopulmonary radiological tasks [15, 16]. We remove the final projection head of the model and use the remaining network to generate CXR embeddings.

The tabular-specific encoder is a model pretrained using the VIME (Value Imputation and Mask Estimation) framework [17]. This approach trains two networks on tabular data corrupted by data masks: one for data reconstruction and another for mask estimation. We use an intermediate network that serves as input to the two networks as our encoder for generating tabular embeddings.

Performance Assessment

We investigated three individual unimodal models and four combined multimodal models. The four combined multimodal models were: VWD+CXR, VWD+tabular, CXR+tabular, and trimodal.

Final model selection for each modality combination was based on the area under the receiver operating characteristic curve (AUROC) measured on the validation set. Model performance on unseen data from the holdout datasets was assessed with the AUROC, the area under the precision-recall curve (AUPRC), and standard confusion matrix statistics. Performance metrics are reported with 95% confidence intervals calculated with Python's SciPy library [18]. Hypothesis testing was performed between each modality combination with two-sample t-tests. Full p-values are in Supplementary Table 5. Model calibration was assessed descriptively using calibration curves plotting the predicted probability of ARDS against the observed prevalence of ARDS across predicted probability strata.
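As a sketch of the assessment, confidence intervals across the 50 holdout evaluations and pairwise tests could be computed with SciPy roughly as follows; the function names are ours, and the study's exact CI procedure may differ:

```python
import numpy as np
from scipy import stats

def mean_ci95(values):
    """Mean and 95% CI half-width across repeated holdout evaluations,
    using the t distribution."""
    v = np.asarray(values, dtype=float)
    half = stats.sem(v) * stats.t.ppf(0.975, len(v) - 1)
    return v.mean(), half

def compare_models(aurocs_a, aurocs_b):
    """Two-sample t-test between two models' holdout AUROC distributions."""
    return stats.ttest_ind(aurocs_a, aurocs_b).pvalue
```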

Results

Overall Model Performance

Across all evaluated modality combinations, the trimodal model achieved the highest overall performance, with an AUROC of 0.86 (CI: 0.01) and an overall accuracy of 78% (CI: 0.02) averaged across all holdout datasets. The trimodal model achieved a statistically significant improvement over all unimodal models and over the VWD+Tabular and VWD+CXR bimodal models (p < 0.05), but not over the CXR+Tabular model. Table 2 shows the full complement of discrimination metrics averaged across all holdout datasets for models trained on each modality combination.

Table 2:

Average metrics for models trained on each modality combination. Best values for each metric and cohort are shown in bold.

Modality Cohort Accuracy AUROC Precision Sensitivity Specificity AUPRC
Tabular All 0.77 ± 0.01 0.83 ± 0.01 0.78 ± 0.01 0.74 ± 0.03 0.79 ± 0.02 0.82 ± 0.02
X-ray All 0.69 ± 0.02 0.76 ± 0.02 0.68 ± 0.04 0.69 ± 0.04 0.68 ± 0.04 0.78 ± 0.02
VWD All 0.71 ± 0.02 0.78 ± 0.02 0.72 ± 0.02 0.71 ± 0.03 0.72 ± 0.03 0.79 ± 0.02
VWD+X-ray All 0.73 ± 0.01 0.82 ± 0.02 0.74 ± 0.02 0.72 ± 0.03 0.74 ± 0.02 0.84 ± 0.01
VWD+Tabular All 0.76 ± 0.02 0.84 ± 0.02 0.76 ± 0.02 0.77 ± 0.03 0.75 ± 0.03 0.83 ± 0.02
X-ray+Tabular All 0.76 ± 0.02 0.85 ± 0.02 0.78 ± 0.02 0.75 ± 0.03 0.78 ± 0.03 0.86 ± 0.02
Trimodal All 0.78 ± 0.02 0.86 ± 0.01 0.78 ± 0.02 0.78 ± 0.03 0.78 ± 0.03 0.87 ± 0.02

Modality Contribution Analysis

In two of the three unimodal models, we observed consistent performance gains across all measures of discrimination when adding a second or third modality, with the trimodal model showing the highest overall performance (Table 2). The model combining VWD+CXR data performed substantially better than either modality alone (VWD+CXR AUROC of 0.82 (CI: 0.02), compared with AUROCs of 0.78 (CI: 0.02) and 0.76 (CI: 0.02) for the unimodal VWD and CXR models, respectively). In contrast, bimodal models that used tabular data from the EHR did not consistently outperform the unimodal tabular model, improving AUROC, AUPRC, and sensitivity but not the other metrics.

Performance Stratification by Patient Group

Model performance varied notably between Group 1 (more clear-cut ARDS cases) and Group 2 (less clear-cut cases). The trimodal model achieved the highest AUROC of all models in both Group 1 (0.96, CI: 0.01) and Group 2 (0.78, CI: 0.02). We observed a consistent performance gap between Group 1 and Group 2 across all modalities, but the gap was not uniform (Table 3). Among unimodal models, VWD showed the largest performance drop (ΔAUROC = 0.23), followed by x-ray (ΔAUROC = 0.18) and tabular data (ΔAUROC = 0.17). Among bimodal models, VWD+Tabular and CXR+Tabular had the same performance decrease (ΔAUROC = 0.18), while VWD+CXR had a larger drop (ΔAUROC = 0.20).

Table 3:

Average metrics for models trained on each modality combination, stratified by ARDS subgroup (Group 1: more clear-cut cases; Group 2: less clear-cut cases). Best values for each metric and cohort are shown in bold.

Modality Cohort Accuracy AUROC Precision Sensitivity Specificity AUPRC
Tabular Group 1 0.85 ± 0.02 0.93 ± 0.02 0.86 ± 0.04 0.82 ± 0.04 0.88 ± 0.03 0.92 ± 0.02
X-ray 0.79 ± 0.03 0.86 ± 0.03 0.79 ± 0.05 0.72 ± 0.05 0.84 ± 0.03 0.88 ± 0.03
VWD 0.82 ± 0.02 0.90 ± 0.02 0.82 ± 0.03 0.83 ± 0.03 0.81 ± 0.04 0.91 ± 0.02
VWD+X-ray 0.85 ± 0.02 0.92 ± 0.02 0.87 ± 0.03 0.81 ± 0.03 0.88 ± 0.03 0.93 ± 0.02
VWD+Tabular 0.85 ± 0.03 0.93 ± 0.02 0.85 ± 0.03 0.85 ± 0.04 0.85 ± 0.03 0.93 ± 0.02
X-ray+Tabular 0.86 ± 0.02 0.94 ± 0.01 0.88 ± 0.03 0.83 ± 0.03 0.89 ± 0.03 0.94 ± 0.02
Trimodal 0.88 ± 0.02 0.96 ± 0.01 0.91 ± 0.03 0.84 ± 0.04 0.92 ± 0.03 0.96 ± 0.01
Tabular Group 2 0.70 ± 0.02 0.76 ± 0.02 0.73 ± 0.03 0.67 ± 0.04 0.73 ± 0.03 0.76 ± 0.03
X-ray 0.62 ± 0.02 0.68 ± 0.03 0.61 ± 0.04 0.67 ± 0.04 0.56 ± 0.05 0.74 ± 0.02
VWD 0.63 ± 0.03 0.67 ± 0.03 0.65 ± 0.03 0.62 ± 0.04 0.64 ± 0.04 0.71 ± 0.03
VWD+X-ray 0.65 ± 0.02 0.72 ± 0.03 0.66 ± 0.03 0.66 ± 0.04 0.64 ± 0.03 0.76 ± 0.03
VWD+Tabular 0.69 ± 0.02 0.75 ± 0.03 0.71 ± 0.03 0.72 ± 0.04 0.68 ± 0.04 0.76 ± 0.03
X-ray+Tabular 0.69 ± 0.02 0.76 ± 0.02 0.72 ± 0.03 0.70 ± 0.04 0.69 ± 0.04 0.80 ± 0.03
Trimodal 0.70 ± 0.02 0.78 ± 0.02 0.71 ± 0.03 0.73 ± 0.03 0.68 ± 0.04 0.80 ± 0.02

Model Calibration

The trimodal model was generally well-calibrated across its range of predicted probabilities in the holdout datasets, with a small overprediction of ARDS risk toward the lower range of predictions and similarly small underprediction of risk in the upper range (Figure 2). Supplementary Table 6 shows the performance of the trimodal model as assessed by confusion matrix statistics across deciles of model classification thresholds. Note that the number of patients in each decile of predicted probability was not uniform and models made no predictions with probability < 0.2 or > 0.8 (grey histogram bars, Figure 2).
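Decile-based calibration points like those plotted in Figure 2 can be computed as in the following sketch (our own implementation, not the study code; empty bins, such as probabilities below 0.2 or above 0.8 here, are simply dropped):

```python
import numpy as np

def calibration_points(y_true, y_prob, n_bins=10):
    """Observed ARDS prevalence vs. mean predicted probability per decile bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to a decile bin [0, n_bins).
    idx = np.clip(np.digitize(y_prob, edges[1:-1]), 0, n_bins - 1)
    pred, obs = [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():  # skip bins with no predictions
            pred.append(y_prob[mask].mean())
            obs.append(y_true[mask].mean())
    return np.array(pred), np.array(obs)
```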

Figure 2:

A calibration curve (in orange) for our trimodal model showing predicted probability on the x-axis and true probability on the left y-axis. Overlaid on the calibration curve is the prediction distribution (in gray), which shows the prediction frequency (right y-axis) at each probability decile (x-axis).

Discussion

In this study, we examined the feasibility of multimodal deep learning for ARDS detection on heterogeneous datasets by training small models on top of larger pretrained models with different combinations of modalities. We found that small models built on top of the generalized embeddings of larger pretrained models can reliably classify ARDS in ICU patients. We also found that models trained on multiple modalities tend to outperform models trained on subsets of those modalities. Specifically, we showed that trimodal models tended to outperform bimodal models and that bimodal models tended to outperform unimodal models. We found all models to be generally well calibrated across the range of predicted probabilities.

Our findings build upon recent research [10, 9] that collectively supports the potential utility of machine learning to screen for ARDS. Our work extends previous work on machine learning-based ARDS detection in several ways. First, we tested the combinatorial addition of three clinical data sources, which to our knowledge, is the first trimodal approach to ARDS classification. Like other studies that examined one versus two data types, we observed incremental gains in performance with the use of two data sources [19, 20]. Of note, we found that the addition of either VWD or CXRs to EHR data achieved comparable performance. Images are ordered infrequently in clinical settings, whereas VWD is generated continuously, suggesting that the combination of EHR+VWD data could prove more suitable for frequent ML-based screening, especially for patients without ARDS when mechanical ventilation is initiated [21, 9], and points to the potential utility of extending continuous physiologic monitoring signals to multimodal deep learning [22, 23, 24, 25].

Our finding of additional, albeit modest, gains in performance with the use of three data sources suggests that the use of distinct but complementary data types may enable more effective classifiers to be developed. While our work demonstrates the utility of adding a third, more complex modality to an ARDS detection model and provides the flexibility to add more, our results highlight the need for additional research into other difficult-to-diagnose conditions and those with multimodal diagnostic criteria. Additional research could also help determine whether the added complexity and computational costs of multimodal model development are consistently beneficial, and whether certain conditions benefit more or less from the inclusion of certain data types.

We were surprised by our analysis of model performance in subgroups of “pure” ARDS (Group 1) versus patients with ARDS and concurrent chronic lung disease, congestive heart failure, or severe obesity (Group 2). We hypothesized that bimodal and trimodal models would show relatively larger improvements in discrimination in Group 2 because of that subgroup’s greater medical complexity, where additional information content might help models distinguish ARDS from non-ARDS cases. While bimodal and trimodal models saw consistent, and in many cases substantial, improvements in Group 1 patients, Group 2 patients did not see consistent gains with additional data types. Our findings, if replicated, suggest that multimodal classification of conditions like ARDS has gaps. This is likely due to the known imperfect validity of ARDS diagnostic criteria, which have been the subject of ongoing debate over recent decades [26, 27]. Thus, the substantial physiologic overlap between clear and ambiguous cases [28] may not be addressed simply with a “more is better” strategy and may instead require subgroup-specific approaches to model development. Validation in multiple clinical centers would shed light on the utility of the “more is better” approach and would help determine whether our results are an artifact of the three modalities chosen and/or local practices.

Our work also extends prior research in how it used pretrained models. Our approach used two pretrained models, one for CXR data and one for VWD, and then trained smaller models on top of the pretrained models, which are frozen during training. In the past, ARDS models have employed training strategies like pretraining a model on a general CXR dataset and fine-tuning it on a smaller ARDS dataset to make unimodal predictions [6], or using pretrained models to extract features that are then combined with tabular features to train traditional ML models [29, 30, 31, 9]. The most similar work to ours trains only the final layers of a pretrained model on ARDS-specific data [8]. However, that work built its multimodal model by ensembling the predictions of unimodal models, whereas our modalities are combined into a single model. Moreover, we could not find any pre-existing ARDS detection work that uses pretrained models for waveform data.

Our use of pretrained models also allowed us to extend prior research in the number of patients used during training. Our work used only 220 patients, while the smallest training set in any of the work we found exceeded 1.2k patients. The performance of our trimodal model, which is similar to other published studies despite our small sample size [26, 20], suggests that using distinct but complementary data types may compensate for a relatively small learning space, which is common in biomedical research settings. Further, the level of performance reached on such a small sample suggests potential for improved performance on larger datasets when they are available.

Technical Innovation and Broader Impact

Our approach serves as a proof of concept for adapting ideas from the broader world of ML and multimodal ML research. Using a small classification network on top of a larger, frozen pretrained network has become common practice for image classification [32]. The concept of using a projection network to reduce the dimensionality of a feature space is common for networks like autoencoders [33], and bottlenecks have also been used for multimodal networks outside of the biomedical space [34]. By combining many of these ideas, we showed the viability of multimodal deep learning on small heterogeneous datasets. Deep learning is notoriously data-hungry, and our application of pretrained models, combined with carefully selected architectures, demonstrated the feasibility of developing accurate, well-calibrated models, even with limited training data.

Our approach to handling multimodal data could serve as a template for other clinical applications where multiple data sources provide complementary information. The architecture we developed, with its modality-specific encoders and joint classification head, could be adapted for other conditions where diagnosis relies on the integration of multiple data types, and could easily be extended to incorporate additional modalities, such as other imaging modalities, medication data, or even clinical notes. This is particularly relevant in critical care, where clinicians routinely integrate data from multiple sources to make complex decisions. Moreover, the framework proposed in this paper is particularly well suited for small sample size multimodal datasets. The latter fact makes our framework especially valuable in clinical environments where data scarcity is often a barrier to applying advanced ML techniques.

Limitations

Our study has several important limitations. First, the single-center nature of our study limits our ability to assess generalizability across different clinical settings and patient populations. ARDS, by definition, occurs across a wide spectrum of inciting conditions, and thus, our ARDS patient population may not be generalizable to all ARDS cases. Different institutions may have different ventilator management strategies, although our institution follows ARDSnet protocols. ARDS patients may also have been imaged at different intervals. These factors could affect the relative contributions of each data type to multimodal model classification performance. Second, while our sample size was sufficient to demonstrate the potential of our approach, a larger dataset would be needed to fully validate the model’s performance and ensure robust learning of complex cross-modal patterns. This is particularly important for the more complex ARDS cases in Group 2, where more examples would help better understand the relative contributions of each data type to model performance in less clear-cut cases and in key ARDS subgroups [35, 36].

Third, our retrospective study design was chosen to allow investigation of the effects of multimodal deep learning on ARDS classification, but it does not address several challenges that would arise in real-world implementation. These include the need for real-time data processing, handling of missing or delayed data, and integration with clinical workflows. Because ARDS remains challenging to diagnose even after 24 hours of meeting diagnostic criteria [1], models focused on classification after this 24-hour window may still have utility. However, early recognition and frequent reassessment would facilitate the use of evidence-based therapies in conditions like ARDS, where earlier treatment is thought to improve survival [3].

Finally, while our models showed good calibration within their ranges of predicted values, the absence of predictions in certain probability ranges (0.0–0.2 and 0.8–1.0) suggests some limitations in the models’ ability to express very high or very low confidence. This could be addressed in future work through improved model architecture or training approaches, including the potential inclusion of clinician uncertainty regarding the target label into the training process [37].

Conclusion

Our results demonstrate the potential of multimodal deep learning to progressively improve ARDS classification by increasing the number of data types. Moreover, our methodological approach was able to produce performant models using small training datasets. However, we found heterogeneous performance with different combinations of modalities and in subgroups of ARDS patients, highlighting that more may not always be better. By showing that the integration of multiple data modalities can improve diagnostic performance, even in ambiguous cases, our work contributes to the broader goal of developing AI systems that can meaningfully support clinical decision-making in complex conditions and high-stakes clinical environments.

Supplementary Material

Supplement 1

Acknowledgments

The authors acknowledge support from NIH grant R01HL16351. The authors also acknowledge Anna Liu and Irene Cortes-Puch, MD for their work on data preparation, and the UC Davis Health Information Technology Data Provisioning Core and Health Computing Core.

Footnotes

¹ For the first 24 hours after first ICU admission.

Contributor Information

Stefan Broecker, Department of Computer Science University of California Davis Davis, CA, USA.

Jason Y. Adams, Division of Pulmonary, Critical Care, and Sleep Medicine University of California Davis Sacramento, CA, USA.

Girish Kumar, Department of Mathematics University of California Davis Davis, CA, USA.

Rachael A. Callcut, Division of Trauma, Acute Care Surgery, and Surgical Critical Care University of California Davis Sacramento, CA, USA.

Yuan Ni, Department of Mathematics University of California Davis Davis, CA, USA.

Thomas Strohmer, Department of Mathematics University of California Davis Davis, CA, USA.

References
