Revealing the grammar of small RNA secretion using interpretable machine learning
Bahar Zirak et al. Cell Genom. 2024.
Abstract
Small non-coding RNAs can be secreted through a variety of mechanisms, including exosomal sorting, packaging into small extracellular vesicles, and association with lipoprotein complexes. However, the mechanisms that govern their sorting and secretion are not well understood. Here, we present ExoGRU, a machine learning model that predicts small RNA secretion probabilities from primary RNA sequences. We experimentally validated the performance of this model through ExoGRU-guided mutagenesis and synthetic RNA sequence analysis. Additionally, we used ExoGRU to reveal cis and trans factors that underlie small RNA secretion, including known and novel RNA-binding proteins (RBPs), e.g., YBX1, HNRNPA2B1, and RBM24. We also developed a technique called exoCLIP, which reveals the RNA interactome of RBPs within the cell-free space. Together, our results demonstrate the power of machine learning in revealing novel biological mechanisms. In addition to providing deeper insight into small RNA secretion, this knowledge can be leveraged in therapeutic and synthetic biology applications.
Keywords: ExoCLIP; ExoGRU; extracellular RNA; machine learning; small RNA; small RNA secretion.
Copyright © 2024 The Authors. Published by Elsevier Inc. All rights reserved.
Conflict of interest statement
Declaration of interests: The authors declare no competing interests.
Figures
Graphical abstract
Figure 1
Predicting small RNA (smRNA) secretion from RNA sequence and structural features
(A and B) An overview of our strategy in this study: we used in-house and publicly available data to curate a dataset of intracellular (IC) and cell-free smRNA species. Following extensive feature engineering and evaluation of various modeling strategies, we selected the best machine learning models for prediction of smRNA secretion. We observed that ExoGRU, a recurrent neural network model, outperforms other models in this task. We then performed feature attribution scoring and model dissection to uncover the cis-regulatory grammar captured by ExoGRU.
(C) The architecture of ExoGRU following hyperparameter optimization.
(D) Receiver operating characteristic (ROC) and precision-recall (PR) curves for the ExoGRU model on the held-out test set. Positive samples are the extracellular (EC) sequences, and negative samples are the IC ones. The performance metrics of this model are also listed.
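As an illustration of how the ROC and PR summaries in (D) can be computed once EC sequences are treated as positives and IC sequences as negatives, the following is a minimal sketch in plain Python. The function names and the toy labels and scores are assumptions made for this example; they are not ExoGRU outputs or the authors' evaluation code.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a random positive (EC, label 1) receives a
    higher score than a random negative (IC, label 0); ties count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Summarize the PR curve as the mean precision at each positive.

    Samples are ranked by descending score; tied scores keep input order.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive sample")
    return precision_sum / hits


# Hypothetical toy data: 1 = EC (secreted), 0 = IC (intracellular).
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
print(roc_auc(toy_labels, toy_scores), average_precision(toy_labels, toy_scores))
```

The pairwise form is quadratic in the number of samples, which is acceptable for a sketch; a rank-based implementation would be preferable for a full test set.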
Figure 2
Experimental validations of ExoGRU predictions
(A) Enrichment scores of ECX vs. mutated (MUT) ECX smRNAs in conditioned medium (CM) fractions and EV fractions are shown as log2 fold change of smRNA abundances in the EV or CM fraction relative to the IC fraction. A total of 55 ECX and 55 matched MUT sequences were successfully expressed and used for this analysis. p values are 0.0006 and 0.001 for CM and EV enrichments, respectively, calculated using the Wilcoxon signed-rank test.
(B) ROC curve generated using ECX and MUT experimental CM enrichment scores and ExoGRU’s localization predictions to measure the association between the experimental and ExoGRU labels at every classification threshold. The smoothed ROC curve was generated by performing 1,000 bootstraps.
(C) EC and IC labels were assigned to sequences from CMs (CM enrichment) using a specificity threshold of 0.75. These experimental labels were then used to construct a confusion matrix for the classification of ECX and MUT sequences. Performance metrics are provided for this classification.
(D) The contingency table shows the experimental distribution of ExoGRU-generated REX and RIX sequences in CMs. The ExoGRU class predictions for these synthetic sequences achieved an accuracy of 73%, with 82% sensitivity and 59% specificity. A χ² test was applied to calculate a p value for the observed counts (p = 3.6e−8).
(E) Ct values and normalized EV enrichment of REX and RIX sequences. All sequences were cloned under an RNA polymerase III promoter, and their expression in EVs was first normalized against mir-16. The values were then corrected for their abundance in the IC fraction. The thresholds on the Ct and EV enrichment axes (shown as dotted lines) are set at one standard deviation from the average of these values for RIX RNAs. REX-1 and REX-5, highlighted in red, satisfy both constraints (based on their Z scores relative to RIX sequences), with combined Fisher’s p values of 1e−11 and 1e−2, respectively.
(F) Independent validation of EV enrichment for REX1, REX5, and RIX1 sequences expressed under an RNA polymerase II promoter. The qPCR analysis was conducted in a manner similar to that depicted in (E).
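Several quantities in this caption follow from short formulas: the log2 fold-change enrichment in (A), the accuracy, sensitivity, and specificity in (C) and (D), and Fisher's method for combining p values in (E). The sketch below shows one way to compute each in plain Python. All function names, the pseudocount parameter, and the example counts are assumptions for illustration; the caption does not give the underlying counts or state the exact procedure the authors used.

```python
import math
from typing import Dict, Sequence


def log2_enrichment(fraction_abundance: float, ic_abundance: float,
                    pseudocount: float = 0.0) -> float:
    """log2 fold change of abundance in the EV or CM fraction relative to IC.

    `pseudocount` is a hypothetical guard against zero counts.
    """
    return math.log2((fraction_abundance + pseudocount) / (ic_abundance + pseudocount))


def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),   # true positives among actual positives
        "specificity": tn / (tn + fp),   # true negatives among actual negatives
    }


def fisher_combined_p(p_values: Sequence[float]) -> float:
    """Combine independent p values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom under the null. For even degrees of freedom the
    upper tail has the closed form exp(-x/2) * sum_{j<k} (x/2)^j / j!.
    """
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)
    return math.exp(-half_x) * sum(half_x ** j / math.factorial(j) for j in range(k))


# Hypothetical example values, not taken from the paper.
print(log2_enrichment(8.0, 2.0))
print(classification_metrics(tp=8, fn=2, tn=6, fp=4))
print(fisher_combined_p([0.05, 0.05]))
```

Fisher's method assumes the combined p values are independent; whether that holds for the Ct and EV enrichment constraints in (E) is not stated in the caption.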
Figure 3
Use of ExoGRU in dissecting RNA secretory mechanisms
(A) As predicted by ExoGRU, YBX1, HNRNPA2B1, and RBM24 motifs are enriched in EC. Each RNA structural motif is shown (far right) along with its pattern of enrichment/depletion across the range of RBPs’ expression (far left). In the heatmap representation, a gold entry marks the enrichment of the given motif in its corresponding expression bin (measured by log-transformed hypergeometric p values), while a light blue entry indicates motif depletion in the bin. Statistically significant enrichments and depletions are marked with red and dark blue borders, respectively. Also shown are the mutual information (MI) values and their associated Z scores. Each MI value is used to calculate a Z score, which is the number of standard deviations of the actual MI relative to MIs calculated for randomly shuffled expression profiles. Also shown are the MI values and associated Z scores that measure the association between motif presence or absence and EC enrichment.
(B) Heatmap showing enrichment score of smRNAs containing HNRNPA2B1 motifs in IC, EV, and CM upon decreasing HNRNPA2B1 expression. The log-fold enrichment values were divided into nine equally populated bins, and the enrichment and depletion patterns across the bins were depicted as described in (A). Red and blue borders mark highly significant motif enrichments and depletions, respectively. From left to right, we show the motif names and their sequence information (“motif,” in the form of an alphanumeric plot), their associated MI values, and their Z scores.
(C) Similar heatmaps showing enrichment score of smRNAs containing RBM24 motifs in IC, EV, and CM upon decreasing RBM24 expression.
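The binning, MI, and Z-score steps described in this caption can be sketched as follows: values are split into equally populated bins, MI is computed between motif presence and bin membership, and the Z score compares the observed MI with MIs from randomly shuffled profiles. This is a simplified illustration in plain Python; the function names, the default of 1,000 shuffles, and the fixed seed are assumptions, not details taken from the paper.

```python
import math
import random
from collections import Counter
from typing import List, Sequence, Tuple


def equal_population_bins(values: Sequence[float], n_bins: int) -> List[int]:
    """Assign each value a bin index so that bins hold (nearly) equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins


def mutual_information(x: Sequence[int], y: Sequence[int]) -> float:
    """MI in bits between two discrete variables, from empirical frequencies."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    return sum((c / n) * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())


def mi_z_score(motif_present: Sequence[int], bins: Sequence[int],
               n_shuffles: int = 1000, seed: int = 0) -> Tuple[float, float]:
    """Observed MI and its Z score against a shuffled-profile null."""
    rng = random.Random(seed)
    observed = mutual_information(motif_present, bins)
    shuffled = list(bins)
    null = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any motif/bin association
        null.append(mutual_information(motif_present, shuffled))
    mean = sum(null) / len(null)
    sd = math.sqrt(sum((v - mean) ** 2 for v in null) / (len(null) - 1))
    z = (observed - mean) / sd if sd > 0 else float("inf")
    return observed, z


# Hypothetical toy data: the motif is present only in the low-value half.
values = [float(i) for i in range(40)]
present = [1] * 20 + [0] * 20
mi, z = mi_z_score(present, equal_population_bins(values, 2))
print(mi, z)
```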
Figure 4
Applying exoCLIP to examine the enrichment of HNRNPA2B1- and RBM24-bound smRNA sequences in cell-free media
(A) Overview of the exoCLIP workflow: UV treatment of CMs to crosslink RBP-RNA complexes, then co-immunoprecipitation (coIP) to pull down the RBP-RNA complexes of interest, followed by RNA library preparation and sequencing.
(B) Examples of tRNA fragments that are associated with HNRNPA2, HNRNPB1, and RBM24 proteins, as extracted from exoCLIP data. The positions of crosslinking-induced deletions (CIDs) are highlighted in each case by the yellow arrows. In total, the HNRNPA2 exoCLIP yielded 34 unique reads, with 23 of them exhibiting CIDs at a statistically significant level (p = 0). The HNRNPB1 and RBM24 exoCLIPs each resulted in 88 unique reads, of which 87 reads from HNRNPB1 and 2 reads from RBM24 showed CIDs (p = 0). p values are calculated by the CTK package.
(C) Heatmaps illustrate enrichment levels of ExoGRU-predicted EC and IC smRNAs in smRNA targets extracted from HNRNPA2, HNRNPB1, and RBM24 exoCLIPs. Red and bolded borders show statistically significant enrichments, as determined by a hypergeometric test (corrected p < 0.05). MI value and associated Z score are shown.
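The hypergeometric test mentioned in (C) asks how likely it is to see at least the observed overlap between two sets by chance. A minimal sketch of the one-sided enrichment p value is given below in plain Python. The function and parameter names and the example numbers are assumptions for illustration, and the multiple-testing correction referred to in the caption is not reproduced because its method is not stated.

```python
from math import comb


def hypergeom_enrichment_p(population: int, successes: int,
                           draws: int, observed: int) -> float:
    """One-sided hypergeometric p value, P(X >= observed).

    population: total number of smRNAs considered
    successes:  smRNAs in the category of interest (e.g., predicted EC)
    draws:      smRNAs in the tested set (e.g., exoCLIP targets)
    observed:   overlap between the category and the tested set
    """
    upper = min(successes, draws)
    # math.comb returns 0 when k > n, so impossible terms drop out of the sum.
    tail = sum(comb(successes, k) * comb(population - successes, draws - k)
               for k in range(max(observed, 0), upper + 1))
    return tail / comb(population, draws)


# Hypothetical example: 10 smRNAs, 5 predicted EC, 5 exoCLIP targets,
# all 5 targets predicted EC.
print(hypergeom_enrichment_p(population=10, successes=5, draws=5, observed=5))
```

A depletion p value would use the lower tail, P(X <= observed), computed the same way over `range(0, observed + 1)`.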