A deep learning approach to identify mRNA localization patterns

Deep Learning in Spatially Resolved Transcriptomics: A Comprehensive Technical View

arXiv (Cornell University), 2022

Spatially resolved transcriptomics (SRT) has evolved rapidly through various technologies, enabling scientists to investigate both morphological contexts and gene expression profiling at single-cell resolution in parallel. SRT data are complex and multi-modal, comprising gene expression matrices, spatial information, and often high-resolution histology images. Because of this complexity and multi-modality, sophisticated computational algorithms are required to accurately analyze SRT data. Most efforts in this domain have been made to utilize conventional machine learning and statistical approaches, exhibiting sub-optimal results due to the complicated nature of SRT datasets. To address these shortcomings, researchers have recently employed deep learning algorithms including various state-of-the-art methods mainly in spatial clustering, spatially variable gene identification, and alignment. While great progress has been made in developing deep learning-based models for SRT data analysis, further improvement is still needed to create more biologically aware models that consider aspects such as phylogeny-aware clustering or the analysis of small histology image patches. Additionally, strategies for batch effect removal, normalization, and handling overdispersion and zero inflation patterns of gene expression are still needed in the analysis of SRT data using deep learning methods. In this paper, we provide a comprehensive overview of these deep learning methods, including their strengths and limitations. We also highlight new frontiers, current challenges, limitations, and open questions in this field. Also, we provide a comprehensive list of all available SRT databases that can be used as an extensive resource for future studies.

Application of deep convolutional neural networks in classification of protein subcellular localization with microscopy images

Genetic Epidemiology, 2019

Single-cell microscopy image analysis has proved invaluable in protein subcellular localization for inferring gene/protein function. Fluorescent-tagged proteins across cellular compartments are tracked and imaged in response to genetic or environmental perturbations. Because high-content microscopy generates a large number of images and manual labeling is both labor-intensive and error-prone, machine learning offers a viable alternative for automatic labeling of subcellular localizations. In recent years, applications of deep learning methods to large datasets of natural images and other domains have become quite successful. An appeal of deep learning methods is that they can learn salient features from complicated data with little data preprocessing. We therefore applied several representative types of deep convolutional neural networks (CNNs) and two popular ensemble methods, random forests and gradient boosting, to predict protein subcellular localization with a moderately large cell image dataset. We show consistently better predictive performance of CNNs over the two ensemble methods. We also demonstrate the use of CNNs for feature extraction. Finally, we share our computer code and pre-trained models to facilitate applications of CNNs in genetics and computational biology.

Unified mRNA Subcellular Localization Predictor based on machine learning techniques

BMC genomics, 2024

Background: mRNA subcellular localization has a substantial impact on the regulation of gene expression, cellular migration, and adaptation. However, the methods employed for experimental determination of this localization are arduous, time-intensive, and costly. Methods: In this research article, we tackle the essential challenge of predicting the subcellular location of messenger RNAs (mRNAs) through the Unified mRNA Subcellular Localization Predictor (UMSLP), a machine learning (ML) based approach. We adopt an in silico strategy that incorporates four distinct feature sets: k-mer, pseudo k-tuple nucleotide composition, nucleotide physicochemical attributes, and the 3D sequence depiction achieved via Z-curve transformation, for predicting subcellular localization in a benchmark dataset across five distinct subcellular locales: nucleus, cytoplasm, extracellular region (ExR), mitochondria, and endoplasmic reticulum (ER). Results: The proposed ML model UMSLP attains cutting-edge outcomes in predicting mRNA subcellular localization. On an independent testing dataset, UMSLP achieved over 87% precision, 94% specificity, and 94% accuracy. Compared to other existing tools, UMSLP outperformed mRNALocator, mRNALoc, and SubLocEP by 11%, 21%, and 32%, respectively, in average prediction accuracy across all five locales. SHapley Additive exPlanations analysis highlights the dominance of k-mer features in predicting cytoplasm, nucleus, ER, and ExR localizations, while Z-curve based features play pivotal roles in mitochondria subcellular localization detection.
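Two of the feature families named in this abstract, k-mer composition and the Z-curve transformation, can be illustrated with a minimal sketch. The function names `kmer_frequencies` and `z_curve` are hypothetical, and the exact feature definitions, normalization, and value of k used by UMSLP are not given here, so this follows only the standard textbook definitions.

```python
from itertools import product


def kmer_frequencies(seq, k=3):
    """Return normalized k-mer frequencies over ACGU, in lexicographic k-mer order."""
    seq = seq.upper().replace("T", "U")  # accept cDNA or mRNA input
    kmers = ["".join(p) for p in product("ACGU", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def z_curve(seq):
    """Return the cumulative Z-curve coordinates (x, y, z) after each base.

    x: purine (A, G) minus pyrimidine (C, U) counts
    y: amino (A, C) minus keto (G, U) counts
    z: weak hydrogen bond (A, U) minus strong hydrogen bond (G, C) counts
    """
    seq = seq.upper().replace("T", "U")
    a = c = g = u = 0
    points = []
    for base in seq:
        a += base == "A"
        c += base == "C"
        g += base == "G"
        u += base == "U"
        points.append(((a + g) - (c + u), (a + c) - (g + u), (a + u) - (g + c)))
    return points
```

A feature vector for a classifier could then be built by concatenating the k-mer frequencies with summary statistics of the Z-curve coordinates; how UMSLP combines them is not stated in the abstract.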

Supervised and Unsupervised End-to-End Deep Learning for Gene Ontology Classification of Neural In Situ Hybridization Images

Entropy, 2019

In recent years, large datasets of high-resolution mammalian neural images have become available, which has prompted active research on the analysis of gene expression data. Traditional image processing methods are typically applied for learning functional representations of genes, based on their expression in these brain images. In this paper, we describe a novel end-to-end deep learning-based method for generating compact, translation-invariant representations of in situ hybridization (ISH) images. In contrast to traditional image processing methods, our method relies on deep convolutional denoising autoencoders (CDAE) for processing raw pixel inputs and generating the desired compact image representations. We provide an in-depth description of our deep learning-based approach, and present extensive experimental results, demonstrating that representations extracted by CDAE can help learn features of functional gene ontology categories for their classificat...

mRNALoc: a novel machine-learning based in-silico tool to predict mRNA subcellular localization

Nucleic Acids Research

Recent evidence suggests that the localization of mRNAs near the subcellular compartment of the translated proteins is a more robust cellular tool that optimizes protein expression post-transcriptionally. Retention of mRNA in the nucleus can regulate the amount of protein translated from each mRNA, thus allowing a tight temporal regulation of translation or buffering of protein levels from bursty transcription. In addition, mRNA localization performs a variety of other roles, such as long-distance signaling, facilitating assembly of protein complexes, and coordination of developmental processes. Here, we describe a novel machine-learning based tool, mRNALoc, to predict five subcellular locations of eukaryotic mRNAs using cDNA/mRNA sequences. During five-fold cross-validation, the maximum overall accuracy was 65.19, 75.36, 67.10, 99.70 and 73.59% for the extracellular region, endoplasmic reticulum, cytoplasm, mitochondria, and nucleus, respectively. Assessment on independent dataset...
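The five-fold cross-validation mentioned in this abstract can be sketched as follows. This is a generic illustration, not the mRNALoc protocol: the helper name `k_fold_indices`, the shuffling seed, and the absence of class stratification are all assumptions.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Indices are shuffled once, dealt into k folds, and each fold serves
    as the held-out test set exactly once.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Per-class accuracy would then be computed on each held-out fold and aggregated across the five folds.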

Deep learning tackles single-cell analysis: A survey of deep learning for scRNA-seq analysis

2021

Since their selection as the method of the year in 2013, single-cell technologies have become mature enough to provide answers to complex research questions. With the growth of single-cell profiling technologies, there has also been a significant increase in data collected from single-cell profiling, resulting in computational challenges to process these massive and complicated datasets. To address these challenges, deep learning (DL) is positioning itself as a competitive alternative to traditional machine learning approaches for single-cell analyses. Here we present a processing pipeline of single-cell RNA-seq data, survey a total of 25 DL algorithms and their applicability for a specific step in the processing pipeline. Specifically, we establish a unified mathematical representation of all variational autoencoder, autoencoder, and generative adversarial network models, compare the training strategies and loss functions for these models, and relate the loss functions of these mo...

Single-cell Subcellular Protein Localisation Using Novel Ensembles of Diverse Deep Architectures

arXiv (Cornell University), 2022

Unravelling protein distributions within individual cells is key to understanding their function and state and indispensable to developing new treatments. Here we present the Hybrid subCellular Protein Localiser (HCPL), which learns from weakly labelled data to robustly localise single-cell subcellular protein patterns. It comprises innovative DNN architectures exploiting wavelet filters and learnt parametric activations that successfully tackle drastic cell variability. HCPL features correlation-based ensembling of novel architectures that boosts performance and aids generalisation. Large-scale data annotation is made feasible by our "AI-trains-AI" approach, which determines the visual integrity of cells and emphasises reliable labels for efficient training. In the Human Protein Atlas context, we demonstrate that HCPL defines state-of-the-art in the single-cell classification of protein localisation patterns. To better understand the inner workings of HCPL and assess its biological relevance, we analyse the contributions of each system component and dissect the emergent features from which the localisation predictions are derived.

Learning & Visualizing Genomic Signatures of Cancer Tumors using Deep Neural Networks

2020 International Joint Conference on Neural Networks (IJCNN)

Deep learning for medical diagnosis using genomics is extremely challenging given the high dimensionality of the data and lack of sufficient patient samples. Another challenge is that deep models are conceived as black boxes without much interpretation on how these complex models make predictions. We propose a deep transfer learning framework for cancer diagnosis with the capability of learning the sequence of DNA and RNA in cancer cells and identifying genetic changes that alter cell behavior and cause uncontrollable growth and malignancy. We design a new Convolutional Neural Network architecture with capabilities of learning the genomic signatures of whole-transcriptome gene expressions collected from multiple tumor types covering multiple organ sites. We demonstrate how our trained model can function as a comprehensive multi-tissue cancer classifier by using transfer learning to build classifiers for tumors lacking sufficient human samples to be trained independently. We introduce visualization procedures to provide more biological insight on how our model is learning genomic signatures and accurately making predictions across multiple cancer tissue types.

Deep convolutional neural networks for annotating gene expression patterns in the mouse brain

Background: Profiling gene expression in brain structures at various spatial and temporal scales is essential to understanding how genes regulate the development of brain structures. The Allen Developing Mouse Brain Atlas provides high-resolution 3-D in situ hybridization (ISH) gene expression patterns in multiple developing stages of the mouse brain. Currently, the ISH images are annotated with anatomical terms manually. In this paper, we propose a computational approach to annotate gene expression pattern images in the mouse brain at various structural levels over the course of development. Results: We applied a deep convolutional neural network that was trained on a large set of natural images to extract features from the ISH images of the developing mouse brain. As a baseline representation, we applied invariant image feature descriptors to capture local statistics from ISH images and used the bag-of-words approach to build image-level representations. Both types of features from multiple ISH image sections of the entire brain were then combined to build 3-D, brain-wide gene expression representations. We employed regularized learning methods for discriminating gene expression patterns in different brain structures. Results show that our approach of using a convolutional model as a feature extractor achieved superior performance in annotating gene expression patterns at multiple levels of brain structures throughout four developing ages. Overall, we achieved an average AUC of 0.894 ± 0.014, as compared with 0.820 ± 0.046 yielded by the bag-of-words approach. Conclusions: A deep convolutional neural network model trained on natural image sets and applied to gene expression pattern annotation tasks yielded superior performance, demonstrating that its transfer learning property is applicable to such biological image sets.
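The AUC values reported in this abstract can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A minimal sketch of that pairwise definition follows; the function name `roc_auc` is hypothetical, and the authors' actual evaluation code is not described here.

```python
def roc_auc(labels, scores):
    """Compute ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 ground-truth annotations
    scores: iterable of classifier scores, higher meaning more likely positive
    Tied pairs count as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

This quadratic-time form is meant for clarity; a rank-based computation gives the same value more efficiently on large annotation sets.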