Rileen Sinha | Mount Sinai School of Medicine

Papers by Rileen Sinha

Comprehensive genomic characterization of squamous cell lung cancers.

Lung squamous cell carcinoma is a common type of lung cancer, causing approximately 400,000 deaths per year worldwide. Genomic alterations in squamous cell lung cancers have not been comprehensively characterized, and no molecularly targeted agents have been specifically developed for its treatment. As part of The Cancer Genome Atlas, here we profile 178 lung squamous cell carcinomas to provide a comprehensive landscape of genomic and epigenomic alterations. We show that the tumour type is characterized by complex genomic alterations, with a mean of 360 exonic mutations, 165 genomic rearrangements, and 323 segments of copy number alteration per tumour. We find statistically recurrent mutations in 11 genes, including mutation of TP53 in nearly all specimens. Previously unreported loss-of-function mutations are seen in the HLA-A class I major histocompatibility gene. Significantly altered pathways included NFE2L2 and KEAP1 in 34%, squamous differentiation genes in 44%, phosphatidylinositol-3-OH kinase pathway genes in 47%, and CDKN2A and RB1 in 72% of tumours. We identified a potential therapeutic target in most tumours, offering new avenues of investigation for the treatment of squamous cell lung cancers.

Integrative Analysis of Complex Cancer Genomics and Clinical Profiles Using the cBioPortal

The cBioPortal for Cancer Genomics (http://cbioportal.org) provides a Web resource for exploring, visualizing, and analyzing multidimensional cancer genomics data. The portal reduces molecular profiling data from cancer tissues and cell lines into readily understandable genetic, epigenetic, gene expression, and proteomic events. The query interface combined with customized data storage enables researchers to interactively explore genetic alterations across samples, genes, and pathways and, when available in the underlying data, to link these to clinical outcomes. The portal provides graphical summaries of gene-level data from multiple platforms, network visualization and analysis, survival analysis, patient-centric queries, and software programmatic access. The intuitive Web interface of the portal makes complex cancer genomics profiles accessible to researchers and clinicians without requiring bioinformatics expertise, thus facilitating biological discoveries. Here, we provide a practical guide to the analysis and visualization features of the cBioPortal for Cancer Genomics.

The mutational landscape of adenoid cystic carcinoma

Adenoid cystic carcinomas (ACCs) are among the most enigmatic of human malignancies. These aggressive salivary gland cancers frequently recur and metastasize despite definitive treatment, with no known effective chemotherapy regimen. Here we determined the ACC mutational landscape and report the exome or whole-genome sequences of 60 ACC tumor-normal pairs. These analyses identified a low exonic somatic mutation rate (0.31 non-silent events per megabase) and wide mutational diversity. Notably, we found mutations in genes encoding chromatin-state regulators, such as SMARCA2, CREBBP and KDM6A, suggesting that there is aberrant epigenetic regulation in ACC oncogenesis. Mutations in genes central to the DNA damage response and protein kinase A signaling also implicate these processes. We observed MYB-NFIB translocations and somatic mutations in MYB-associated genes, solidifying the role of these aberrations as critical events in ACC. Lastly, we identified recurrent mutations in the FGF-IGF-PI3K pathway (30% of tumors) that might represent new avenues for therapy. Collectively, our observations establish a molecular foundation for understanding and exploring new treatments for ACC.

The molecular diversity of Luminal A breast tumors

Breast cancer is a collection of diseases with distinct molecular traits, prognosis, and therapeutic options. Luminal A breast cancer is the most heterogeneous, both molecularly and clinically. Using genomic data from over 1,000 Luminal A tumors from multiple studies, we analyzed the copy number and mutational landscape of this tumor subtype. This integrated analysis revealed four major subtypes defined by distinct copy-number and mutation profiles. We identified an atypical Luminal A subtype characterized by high genomic instability, TP53 mutations, and increased Aurora kinase signaling; these genomic alterations lead to a worse clinical prognosis. Aberrations of chromosomes 1, 8, and 16, together with PIK3CA, GATA3, AKT1, and MAP3K1 mutations drive the other subtypes. Finally, an unbiased pathway analysis revealed multiple rare, but mutually exclusive, alterations linked to loss of activity of co-repressor complexes N-Cor and SMRT. These rare alterations were the most prevalent in Luminal A tumors and may predict resistance to endocrine therapy. Our work provides for a further molecular stratification of Luminal A breast tumors, with potential direct clinical implications.

Comparative Genomic Analysis of Primary Versus Metastatic Colorectal Carcinomas

Purpose: To compare the mutational and copy number profiles of primary and metastatic colorectal carcinomas (CRCs) using both unpaired and paired samples derived from primary and metastatic disease sites.

Patients and Methods: We performed a multiplatform genomic analysis of 736 fresh frozen CRC tumors from 613 patients. The cohort included 84 patients in whom tumor tissue from both primary and metastatic sites was available and 31 patients with pairs of metastases. Tumors were analyzed for mutations in the KRAS, NRAS, BRAF, PIK3CA, and TP53 genes, with discordant results between paired samples further investigated by analyzing formalin-fixed, paraffin-embedded tissue and/or by 454 sequencing. Copy number aberrations in primary tumors and matched metastases were analyzed by comparative genomic hybridization (CGH).

Results: TP53 mutations were more frequent in metastatic versus primary tumors (53.1% v 30.3%, respectively; P < .001), whereas BRAF mutations were significantly less frequent (1.9% v 7.7%, respectively; P = .01). The mutational status of the matched pairs was highly concordant (> 90% concordance for all five genes). Clonality analysis of array CGH data suggested that multiple CRC primary tumors or treatment-associated effects were likely etiologies for mutational and/or copy number profile differences between primary tumors and metastases.

Conclusion: For determining RAS, BRAF, and PIK3CA mutational status, genotyping of the primary CRC is sufficient for most patients. Biopsy of a metastatic site should be considered in patients with a history of multiple primary carcinomas and in the case of TP53 for patients who have undergone interval treatment with radiation or cytotoxic chemotherapies.

Evaluating cell lines as tumour models by comparison of genomic profiles

Cancer cell lines are frequently used as in vitro tumour models. Recent molecular profiles of hundreds of cell lines from The Cancer Cell Line Encyclopedia and thousands of tumour samples from the Cancer Genome Atlas now allow a systematic genomic comparison of cell lines and tumours. Here we analyse a panel of 47 ovarian cancer cell lines and identify those that have the highest genetic similarity to ovarian tumours. Our comparison of copy-number changes, mutations and mRNA expression profiles reveals pronounced differences in molecular profiles between commonly used ovarian cancer cell lines and high-grade serous ovarian cancer tumour samples. We identify several rarely used cell lines that more closely resemble cognate tumour profiles than commonly used cell lines, and we propose these lines as the most suitable models of ovarian cancer. Our results indicate that the gap between cell lines and tumours can be bridged by genomically informed choices of cell line models for all tumour types.

Identification and characterization of NAGNAG alternative splicing in the moss Physcomitrella patens

BMC Plant Biology, 2010

Background: Alternative splicing (AS) involving tandem acceptors that are separated by three nucleotides (NAGNAG) is an evolutionarily widespread class of AS, which is well studied in Homo sapiens (human) and Mus musculus (mouse). It has also been shown to be common in the model seed plants Arabidopsis thaliana and Oryza sativa (rice). In one of the first studies involving sequence-based prediction of AS in plants, we performed a genome-wide identification and characterization of NAGNAG AS in the model plant Physcomitrella patens, a moss.

Results: Using Sanger data, we found 295 alternatively used NAGNAG acceptors in P. patens. Using 31 features and training and test datasets of constitutive and alternative NAGNAGs, we trained a classifier to predict the splicing outcome at NAGNAG tandem splice sites (alternative splicing, constitutive at the first acceptor, or constitutive at the second acceptor). Our classifier achieved a balanced specificity and sensitivity of ≥ 89%. Subsequently, a classifier trained exclusively on data well supported by transcript evidence was used to make genome-wide predictions of NAGNAG splicing outcomes. By generation of more transcript evidence from a next-generation sequencing platform (Roche 454), we found additional evidence for NAGNAG AS, with altogether 664 alternative NAGNAGs being detected in P. patens using all currently available transcript evidence. The 454 data also enabled us to validate the predictions of the classifier, with 64% (80/125) of the well-supported cases of AS being predicted correctly.

Conclusion: NAGNAG AS is just as common in the moss P. patens as it is in the seed plants A. thaliana and O. sativa (but not conserved on the level of orthologous introns), and can be predicted with high accuracy. The most informative features are the nucleotides in the NAGNAG and in its immediate vicinity, along with the splice site scores, as found earlier for NAGNAG AS in animals. Our results suggest that the mechanism behind NAGNAG AS in plants is similar to that in animals and is largely dependent on the splice site and its immediate neighborhood.

TassDB2 - A comprehensive database of subtle alternative splicing events

BMC Bioinformatics, 2010

Background: Subtle alternative splicing events involving tandem splice sites separated by a short (2-12 nucleotides) distance are frequent and evolutionarily widespread in eukaryotes, and a major contributor to the complexity of transcriptomes and proteomes. However, these events have been either omitted altogether in databases on alternative splicing, or only the cases of experimentally confirmed alternative splicing have been reported. Thus, a database that covers all confirmed cases of subtle alternative splicing as well as the numerous putative tandem splice sites (which might be confirmed once more transcript data become available), and that allows users to search for tandem splice sites with specific features and download the results, is a valuable resource for targeted experimental studies and large-scale bioinformatics analyses of tandem splice sites. Towards this goal we recently set up TassDB (Tandem Splice Site DataBase, version 1), which stores data about alternative splicing events at tandem splice sites separated by 3 nt in eight species.

Description: We have substantially revised and extended TassDB. The currently available version 2 contains extensive information about tandem splice sites separated by 2-12 nt for the human and mouse transcriptomes, including data on the conservation of the tandem motifs in five vertebrates. TassDB2 offers a user-friendly interface to search for specific genes or for genes containing tandem splice sites with specific features, as well as the possibility to download result datasets. For example, users can search for cases of alternative splicing where the proportion of EST/mRNA evidence supporting the minor isoform exceeds a specific threshold, or where the difference in splice site scores is specified by the user. The predicted impact of each event on the protein is also reported, along with information about being a putative target for the nonsense-mediated decay (NMD) pathway. Links are provided to the UCSC genome browser and other external resources.

Conclusion: TassDB2, available via http://www.tassdb.info, provides comprehensive resources for researchers interested in both targeted experimental studies and large-scale bioinformatics analyses of short distance tandem splice sites.
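The example search described above (minor-isoform evidence exceeding a threshold) can be sketched in a few lines. This is only an illustration under stated assumptions: the `TandemSite` record, its field names, and the sample values are hypothetical and do not reflect the actual TassDB2 schema or interface.

```python
from dataclasses import dataclass


@dataclass
class TandemSite:
    """Hypothetical record for one tandem splice site (not the TassDB2 schema)."""
    gene: str
    distance_nt: int    # distance between the two tandem splice sites
    major_support: int  # EST/mRNA count supporting the major isoform
    minor_support: int  # EST/mRNA count supporting the minor isoform


def minor_fraction(site: TandemSite) -> float:
    """Proportion of transcript evidence that supports the minor isoform."""
    total = site.major_support + site.minor_support
    return site.minor_support / total if total else 0.0


def filter_sites(sites, min_minor_fraction, min_distance=2, max_distance=12):
    """Keep sites in the 2-12 nt range whose minor-isoform fraction exceeds the threshold."""
    return [
        s for s in sites
        if min_distance <= s.distance_nt <= max_distance
        and minor_fraction(s) > min_minor_fraction
    ]


# Made-up example records, for illustration only.
sites = [
    TandemSite("GENE_A", 3, 90, 10),
    TandemSite("GENE_B", 3, 60, 40),
    TandemSite("GENE_C", 15, 50, 50),
    TandemSite("GENE_D", 6, 0, 0),
]
selected = filter_sites(sites, min_minor_fraction=0.2)
print([s.gene for s in selected])
```

With a threshold of 0.2, only the record whose minor isoform carries more than 20% of the evidence and whose distance lies within 2-12 nt is retained.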

Assessing the fraction of short-distance tandem splice sites under purifying selection

RNA: A Publication of the RNA Society, 2008

Improved identification of conserved cassette exons using Bayesian networks

BMC Bioinformatics, 2008

Background: Alternative splicing is a major contributor to the diversity of eukaryotic transcriptomes and proteomes. Currently, large scale detection of alternative splicing using expressed sequence tags (ESTs) or microarrays does not capture all alternative splicing events. Moreover, for many species genomic data is being produced at a far greater rate than corresponding transcript data, hence in silico methods of predicting alternative splicing have to be improved.

Accurate prediction of NAGNAG alternative splicing

Nucleic Acids Research, 2009

Alternative splicing (AS) involving NAGNAG tandem acceptors is an evolutionarily widespread class of AS. Recent predictions of alternative acceptor usage reported better results for acceptors separated by larger distances than for NAGNAGs. To improve the latter, we aimed at the use of Bayesian networks (BN), and extensive experimental validation of the predictions. Using carefully constructed training and test datasets, a balanced sensitivity and specificity of ≥92% was achieved. A BN trained on the combined dataset was then used to make predictions, and 81% (38/47) of the experimentally tested predictions were verified. Using a BN learned on human data on six other genomes, we show that while the performance for the vertebrate genomes matches that achieved on human data, there is a slight drop for Drosophila and worm. Lastly, using the prediction accuracy according to experimental validation, we estimate the number of yet undiscovered alternative NAGNAGs. State-of-the-art classifiers can produce highly accurate prediction of AS at NAGNAGs, indicating that we have identified the major features of the 'NAGNAG-splicing code' within the splice site and its immediate neighborhood. Our results suggest that the mechanism behind NAGNAG AS is simple, stochastic, and conserved among vertebrates and beyond.
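As a rough illustration of what a "balanced sensitivity and specificity" can mean, the sketch below picks the score threshold at which the two measures are closest, and recomputes the reported validation rate of 38/47. It is one common reading of the term, not necessarily the procedure used in the paper; the scores and labels are made-up toy data, and the function names are hypothetical.

```python
def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def balanced_threshold(scores, labels):
    """Return (threshold, sensitivity, specificity) minimising |sensitivity - specificity|."""
    candidates = []
    for t in sorted(set(scores)):
        sens, spec = sens_spec(scores, labels, t)
        candidates.append((abs(sens - spec), t, sens, spec))
    _, t, sens, spec = min(candidates)
    return t, sens, spec


def validated_fraction(confirmed, tested):
    """Fraction of experimentally tested predictions that were confirmed."""
    return confirmed / tested


# Toy classifier scores (1 = alternative NAGNAG, 0 = constitutive); invented data.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

best = balanced_threshold(scores, labels)
print("threshold, sensitivity, specificity:", best)
print("validated:", round(100 * validated_fraction(38, 47)), "%")
```

The validation figure uses the counts stated in the abstract (38 of 47 tested predictions), which rounds to the reported 81%.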
