CoSREM: a graph mining algorithm for the discovery of combinatorial splicing regulatory elements (original) (raw)
Related papers
Alternative splicing is a vital process for regulating gene expression and promoting proteo-mic diversity. It plays a key role in tissue-specific expressed genes. This specificity is mainly regulated by splicing factors that bind to specific sequences called splicing regulatory elements (SREs). Here, we report a genome-wide analysis to study alternative splicing on multiple tissues, including brain, heart, liver, and muscle. We propose a pipeline to identify differential exons across tissues and hence tissue-specific SREs. In our pipeline, we utilize the DEXSeq package along with our previously reported algorithms. Utilizing the publicly available RNA-Seq data set from the Human BodyMap project, we identified 28,100 differentially used exons across the four tissues. We identified tissue-specific exonic splicing enhancers that overlap with various previously published experimental and computational databases. A complicated exonic enhancer regulatory network was revealed, where multiple exonic enhancers were found across multiple tissues while some were found only in specific tissues. Putative combinatorial exonic enhancers and silencers were discovered as well, which may be responsible for exon inclusion or exclusion across tissues. Some of the exonic enhancers are found to be co-occurring with multiple exonic silencers and vice versa, which demonstrates a complicated relationship between tissue-specific exonic enhancers and silencers.
Computational identification of tissue specific splicing regulatory elements from RNA seq data.pdf
Alternative splicing is a vital process for regulating gene expression and promoting proteomic diversity. It plays a key role in tissue-specific expressed genes. This specificity is mainly regulated by splicing factors that bind to specific sequences called splicing regulatory elements (SREs). Here, we report a genome-wide analysis to study alternative splicing on multiple tissues, including brain, heart, liver, and muscle. We propose a pipeline to identify differential exons across tissues and hence tissue-specific SREs. In our pipeline, we utilize the DEXSeq package along with our previously reported algorithms. Utilizing the publicly available RNA-Seq data set from the Human BodyMap project, we identified 28,100 differentially used exons across the four tissues. We identified tissue-specific exonic splicing enhancers that overlap with various previously published experimental and computational databases. A complicated exonic enhancer regulatory network was revealed, where multiple exonic enhancers were found across multiple tissues while some were found only in specific tissues. Putative combinatorial exonic enhancers and silencers were discovered as well, which may be responsible for exon inclusion or exclusion across tissues. Some of the exonic enhancers are found to be co-occurring with multiple exonic silencers and vice versa, which demonstrates a complicated relationship between tissue-specific exonic enhancers and silencers.
Identifying splicing regulatory elements with de Bruijn graphs
Journal of computational biology : a journal of computational molecular cell biology, 2014
Splicing regulatory elements (SREs) are short, degenerate sequences on pre-mRNA molecules that enhance or inhibit the splicing process via the binding of splicing factors, proteins that regulate the functioning of the spliceosome. Existing methods for identifying SREs in a genome are either experimental or computational. Here, we propose a formalism based on de Bruijn graphs that combines genomic structure, word count enrichment analysis, and experimental evidence to identify SREs found in exons. In our approach, SREs are not restricted to a fixed length (i.e., k-mers, for a fixed k). As a result, we identify 2001 putative exonic enhancers and 3080 putative exonic silencers for human genes, with lengths varying from 6 to 15 nucleotides. Many of the predicted SREs overlap with experimentally verified binding sites. Our model provides a novel method to predict variable length putative regulatory elements computationally for further experimental investigation.
Nucleic Acids Research, 2010
Mining massive amounts of transcript data for alternative splicing information is paramount to help understand how the maturation of RNA regulates gene expression. We developed an algorithm to cluster transcript data to annotated genes to detect unannotated splice variants. A higher number of alternatively spliced genes and isoforms were found compared to other alternative splicing databases. Comparison of human and mouse data revealed a marked increase, in human, of splice variants incorporating novel exons and retained introns. Previously unannotated exons were validated by tiling array expression data and shown to correspond preferentially to novel first exons. Retained introns were validated by tiling array and deep sequencing data. The majority of retained introns were shorter than 500 nt and had weak polypyrimidine tracts. A subset of retained introns matching small RNAs and displaying a high GC content suggests a possible coordination between splicing regulation and production of noncoding RNAs. Conservation of unannotated exons and retained introns was higher in horse, dog and cow than in rodents, and 64% of exon sequences were only found in primates. This analysis highlights previously bypassed alternative splice variants, which may be crucial to deciphering more complex pathways of gene regulation in human.
Thousands of exon skipping events differentiate among splicing patterns in sixteen human tissues
F1000Research, 2013
Alternative splicing is widely recognized for its roles in regulating genes and creating gene diversity. However, despite many efforts, the repertoire of gene splicing variation is still incompletely characterized, even in humans. Here we describe a new computational system, ASprofile, and its application to RNA-seq data from Illumina’s Human Body Map project (>2.5 billion reads). Using the system, we identified putative alternative splicing events in 16 different human tissues, which provide a dynamic picture of splicing variation across the tissues. We detected 26,989 potential exon skipping events representing differences in splicing patterns among the tissues. A large proportion of the events (>60%) were novel, involving new exons (~3000), new introns (~16000), or both. When tracing these events across the sixteen tissues, only a small number (4-7%) appeared to be differentially expressed (‘switched’) between two tissues, while 30-45% showed little variation, and the rema...
Inference of Splicing Regulatory Activities by Sequence Neighborhood Analysis
PLoS Genetics, 2006
Sequence-specific recognition of nucleic-acid motifs is critical to many cellular processes. We have developed a new and general method called Neighborhood Inference (NI) that predicts sequences with activity in regulating a biochemical process based on the local density of known sites in sequence space. Applied to the problem of RNA splicing regulation, NI was used to predict hundreds of new exonic splicing enhancer (ESE) and silencer (ESS) hexanucleotides from known human ESEs and ESSs. These predictions were supported by cross-validation analysis, by analysis of published splicing regulatory activity data, by sequence-conservation analysis, and by measurement of the splicing regulatory activity of 24 novel predicted ESEs, ESSs, and neutral sequences using an in vivo splicing reporter assay. These results demonstrate the ability of NI to accurately predict splicing regulatory activity and show that the scope of exonic splicing regulatory elements is substantially larger than previously anticipated. Analysis of orthologous exons in four mammals showed that the NI score of ESEs, a measure of function, is much more highly conserved above background than ESE primary sequence. This observation indicates a high degree of selection for ESE activity in mammalian exons, with surprisingly frequent interchangeability between ESE sequences.
Co-expression networks uncover regulation of splicing and transcription markers of disease
EPiC Series in Computing
Gene co-expression networks based on gene expression data are usually used to capture biologically significant patterns, enabling the discovery of biomarkers and interpretation of regulatory relationships. However, the coordination of numerous splicing changes within and across genes can exert a substantial impact on the function of these genes. This is particularly impactful in studies of the properties of the nervous system, which can be masked in the networks that only assess the correlation between gene expression levels. A bioinformatics approach was developed to uncover the role of alternative splicing and associated transcriptional networks using RNA-seq profiles. Data from 40 samples, including control and two treatments associated with sensitivity to stimuli across two central nervous system regions that can present differential splicing, were explored. The gene expression and relative isoform levels were integrated into a transcriptome-wide matrix, and then Graphical Lasso...
Briefings in bioinformatics, 2018
Alternative splicing (AS) has shown to play a pivotal role in the development of diseases, including cancer. Specifically, all the hallmarks of cancer (angiogenesis, cell immortality, avoiding immune system response, etc.) are found to have a counterpart in aberrant splicing of key genes. Identifying the context-specific regulators of splicing provides valuable information to find new biomarkers, as well as to define alternative therapeutic strategies. The computational models to identify these regulators are not trivial and require three conceptual steps: the detection of AS events, the identification of splicing factors that potentially regulate these events and the contextualization of these pieces of information for a specific experiment. In this work, we review the different algorithmic methodologies developed for each of these tasks. Main weaknesses and strengths of the different steps of the pipeline are discussed. Finally, a case study is detailed to help the reader be aware...
BMC Genomics, 2019
Background: Splicing is a genetic process that has important implications in several diseases including cancer. Deciphering the complex rules of splicing regulation is crucial to understand and treat splicing-related diseases. Splicing factors and other RNA-binding proteins (RBPs) play a key role in the regulation of splicing. The specific binding sites of an RBP can be measured using CLIP experiments. However, to unveil which RBPs regulate a condition, it is necessary to have a priori hypotheses, as a single CLIP experiment targets a single protein. Results: In this work, we present a novel methodology to predict context-specific splicing factors from transcriptomic data. For this, we systematically collect, integrate and analyze more than 900 CLIP experiments stored in four CLIP databases: POSTAR2, CLIPdb, DoRiNA and StarBase. The analysis of these experiments shows the strong coherence between the binding sites of RBPs of similar families. Augmenting this information with expression changes, we are able to correctly predict the splicing factors that regulate splicing in two gold-standard experiments in which specific splicing factors are knocked-down. Conclusions: The methodology presented in this study allows the prediction of active splicing factors in either cancer or any other condition by only using the information of transcript expression. This approach opens a wide range of possible studies to understand the splicing regulation of different conditions. A tutorial with the source code and databases is available at https://gitlab.com/fcarazo.m/sfprediction.
REMAS: a new regression model to identify alternative splicing events from exon array data
BMC Bioinformatics, 2009
Background: Alternative splicing (AS) is an important regulatory mechanism for gene expression and protein diversity in eukaryotes. Previous studies have demonstrated that it can be causative for, or specific to splicing-related diseases. Understanding the regulation of AS will be helpful for diagnostic efforts and drug discoveries on those splicing-related diseases. As a novel exon-centric microarray platform, exon array enables a comprehensive analysis of AS by investigating the expression of known and predicted exons. Identifying of AS events from exon array has raised much attention, however, new and powerful algorithms for exon array data analysis are still absent till now. Results: Here, we considered identifying of AS events in the framework of variable selection and developed a regression method for AS detection (REMAS). Firstly, features of alternatively spliced exons were scaled by reasonably defined variables. Secondly, we designed a hierarchical model which can represent gene structure and transcriptional influence to exons, and the lasso type penalties were introduced in calculation because of huge variable size. Thirdly, an iterative two-step algorithm was developed to select alternatively spliced genes and exons. To avoid negative effects introduced by small sample size, we ranked genes as parameters indicating their AS capabilities in an iterative manner. After that, both simulation and real data evaluation showed that REMAS could efficiently identify potential AS events, some of which had been validated by RT-PCR or supported by literature evidence. Conclusion: As a new lasso regression algorithm based on hierarchical model, REMAS has been demonstrated as a reliable and effective method to identify AS events from exon array data.