Leo Lee - Academia.edu (original) (raw)
Papers by Leo Lee
Biomolecules
Although database search tools originally developed for shotgun proteome have been widely used in... more Although database search tools originally developed for shotgun proteome have been widely used in immunopeptidomic mass spectrometry identifications, they have been reported to achieve undesirably low sensitivities or high false positive rates as a result of the hugely inflated search space caused by the lack of specific enzymic digestions in immunopeptidome. To overcome such a problem, we developed a motif-guided immunopeptidome database building tool named IntroSpect, which is designed to first learn the peptide motifs from high confidence hits in the initial search, and then build a targeted database for refined search. Evaluated on 18 representative HLA class I datasets, IntroSpect can improve the sensitivity by an average of 76%, compared to conventional searches with unspecific digestions, while maintaining a very high level of accuracy (~96%), as confirmed by synthetic validation experiments. A distinct advantage of IntroSpect is that it does not depend on any external HLA da...
Functional coordination of alternative splicing in the mammalian
ArXiv, 2017
We propose generative neural network methods to generate DNA sequences and tune them to have desi... more We propose generative neural network methods to generate DNA sequences and tune them to have desired properties. We present three approaches: creating synthetic DNA sequences using a generative adversarial network; a DNA-based variant of the activation maximization ("deep dream") design method; and a joint procedure which combines these two approaches together. We show that these tools capture important structures of the data and, when applied to designing probes for protein binding microarrays, allow us to generate new sequences whose properties are estimated to be superior to those found in the training data. We believe that these results open the door for applying deep generative models to advance genomics research.
MotivationDetermining RNA binding protein(RBP) binding specificity is crucial for understanding m... more MotivationDetermining RNA binding protein(RBP) binding specificity is crucial for understanding many cellular processes and genetic disorders. RBP binding is known to be affected by both the sequence and structure of RNAs. Deep learning can be used to learn generalizable representations of raw data and has improved state of the art in several fields such as image classification, speech recognition and even genomics. Previous work on RBP binding has either used shallow models that combine sequence and structure or deep models that use only the sequence. Here we combine both abilities by augmenting and refining the original Deepbind architecture to capture structural information and obtain significantly better performance.ResultsWe propose two deep architectures, one a lightweight convolutional network for transcriptome wide inference and another a Long Short-Term Memory(LSTM) network that is suitable for small batches of data. We incorporate computationally predicted secondary struct...
BackgroundAccurate prediction of epitopes presented by human leukocyte antigen (HLA) is crucial f... more BackgroundAccurate prediction of epitopes presented by human leukocyte antigen (HLA) is crucial for personalized cancer immunotherapies targeting T cell epitopes. Mass spectrometry (MS) profiling of eluted HLA ligands, which provides high-throughput measurements of HLA associated peptides in vivo, can be used to faithfully model the presentation of epitopes on the cell surface. In addition, gene expression profiles measured by RNA-seq data in a specific cell/tissue type can significantly improve the performance of epitope presentation prediction. However, although large amount of high-quality MS data of HLA-bound peptides is being generated in recent years, few provide matching RNA-seq data, which makes incorporating gene expression into epitope prediction difficult.MethodsWe collected publicly available HLA peptidome and matching RNA-seq data of 34 cell lines derived from various sources. We built position score specific matrixes (PSSMs) for 21 HLA-I alleles based on these MS data,...
GigaScience, May 2, 2017
With the advancement of second generation sequencing techniques, our ability to detect and quanti... more With the advancement of second generation sequencing techniques, our ability to detect and quantify RNA editing on a global scale has been vastly improved. As a result, RNA editing is now being studied under a growing number of biological conditions so that its biochemical mechanisms and functional roles can be further understood. However, a major barrier that prevents RNA editing from being a routine RNA-seq analysis, similar to gene expression and splicing analysis for example, is the lack of user-friendly and effective computational tools. Based on years of experience of analyzing RNA editing using diverse RNA-seq datasets, we have developed a software tool RED-ML: RNA Editing Detection based on Machine learning (pronounced as "red ML"). The input to RED-ML can be as simple as a single BAM file, while it can also take advantage of matched genomic variant information when available. The output not only contains detected RNA editing sites, but also a confidence score to f...
Oncotarget, Jan 21, 2016
RNA editing results in post-transcriptional modification and could potentially contribute to carc... more RNA editing results in post-transcriptional modification and could potentially contribute to carcinogenesis. However, RNA editing in advanced lung adenocarcinomas has not yet been studied. Based on whole genome and transcriptome sequencing data, we identified 1,071,296 RNA editing events from matched normal, primary and metastatic samples contributed by 24 lung adenocarcinoma patients, with 91.3% A-to-G editing on average, and found significantly more RNA editing sites in tumors than in normal samples. To investigate cancer relevant editing events, we detected 67,851 hyper-editing sites in primary and 50,480 hyper-editing sites in metastatic samples. 46 genes with hyper-editing in coding regions were found to result in amino acid alterations, while hundreds of hyper-editing events in non-coding regions could modulate splicing or gene expression, including genes related to tumor stage or clinic prognosis. Comparing RNA editome of primary and metastatic samples, we also discovered hyp...
Genome Biology, 2007
Alternative splicing in the central nervous system
A microarray analysis provides new evidence... more Alternative splicing in the central nervous system
A microarray analysis provides new evidence suggesting that specific cellular processes in the mammalian CNS are coordinated at the level of alternative splicing, and that a complex splicing code underlies CNS-specific alternative splicing regulation.
Genome Biology, 2013
Transcriptome complexity and its relation to numerous diseases underpins the need to predict in s... more Transcriptome complexity and its relation to numerous diseases underpins the need to predict in silico splice variants and the regulatory elements that affect them. Building upon our recently described splicing code, we developed AVISPA, a Galaxy-based web tool for splicing prediction and analysis. Given an exon and its proximal sequence, the tool predicts whether the exon is alternatively spliced, displays tissue-dependent splicing patterns, and whether it has associated regulatory elements. We assess AVISPA's accuracy on an independent dataset of tissue-dependent exons, and illustrate how the tool can be applied to analyze a gene of interest. AVISPA is available at http://avispa.biociphers.org.
Science, 2014
Predicting defects in RNA splicing Most eukaryotic messenger RNAs (mRNAs) are spliced to remove i... more Predicting defects in RNA splicing Most eukaryotic messenger RNAs (mRNAs) are spliced to remove introns. Splicing generates uninterrupted open reading frames that can be translated into proteins. Splicing is often highly regulated, generating alternative spliced forms that code for variant proteins in different tissues. RNA-binding proteins that bind specific sequences in the mRNA regulate splicing. Xiong et al. develop a computational model that predicts splicing regulation for any mRNA sequence (see the Perspective by Guigó and Valcárcel). They use this to analyze more than half a million mRNA splicing sequence variants in the human genome. They are able to identify thousands of known disease-causing mutations, as well as many new disease candidates, including 17 new autism-linked genes. Science , this issue 10.1126/science.1254806 ; see also p. 124
PLoS ONE, 2011
Background: Transcriptome profiling of patterns of RNA expression is a powerful approach to ident... more Background: Transcriptome profiling of patterns of RNA expression is a powerful approach to identify networks of genes that play a role in disease. To date, most mRNA profiling of tissues has been accomplished using microarrays, but nextgeneration sequencing can offer a richer and more comprehensive picture. Methodology/Principal Findings: ECO is a rare multi-system developmental disorder caused by a homozygous mutation in ICK encoding intestinal cell kinase. We performed gene expression profiling using both cDNA microarrays and nextgeneration mRNA sequencing (mRNA-seq) of skin fibroblasts from ECO-affected subjects. We then validated a subset of differentially expressed transcripts identified by each method using quantitative reverse transcription-polymerase chain reaction (qRT-PCR). Finally, we used gene ontology (GO) to identify critical pathways and processes that were abnormal according to each technical platform. Methodologically, mRNA-seq identifies a much larger number of differentially expressed genes with much better correlation to qRT-PCR results than the microarray (r 2 = 0.794 and 0.137, respectively). Biologically, cDNA microarray identified functional pathways focused on anatomical structure and development, while the mRNA-seq platform identified a higher proportion of genes involved in cell division and DNA replication pathways. Conclusions/Significance: Transcriptome profiling with mRNA-seq had greater sensitivity, range and accuracy than the microarray. The two platforms generated different but complementary hypotheses for further evaluation.
Nature Genetics, 2013
Users may view, print, copy, and download text and data-mine the content in such documents, for t... more Users may view, print, copy, and download text and data-mine the content in such documents, for the purposes of academic research, subject always to the full Conditions of use:
BMC Bioinformatics, 2012
Transcript quantification is a long-standing problem in genomics and estimating the relative abun... more Transcript quantification is a long-standing problem in genomics and estimating the relative abundance of alternatively-spliced isoforms from the same transcript is an important special case. Both problems have recently been illuminated by high-throughput RNA sequencing experiments which are quickly generating large amounts of data. However, much of the signal present in this data is corrupted or obscured by biases resulting in non-uniform and non-proportional representation of sequences from different transcripts. Many existing analyses attempt to deal with these and other biases with various task-specific approaches, which makes direct comparison between them difficult. However, two popular tools for isoform quantification, MISO and Cufflinks, have adopted a general probabilistic framework to model and mitigate these biases in a more general fashion. These advances motivate the need to investigate the effects of RNA-seq biases on the accuracy of different approaches for isoform qu...
Cancers
Tumor-specific antigens can activate T cell-based antitumor immune responses and are ideal target... more Tumor-specific antigens can activate T cell-based antitumor immune responses and are ideal targets for cancer immunotherapy. However, their identification is still challenging. Although mass spectrometry can directly identify human leukocyte antigen (HLA) binding peptides in tumor cells, it focuses on tumor-specific antigens derived from annotated protein-coding regions constituting only 1.5% of the genome. We developed a novel proteogenomic integration strategy to expand the breadth of tumor-specific epitopes derived from all genomic regions. Using the colorectal cancer cell line HCT116 as a model, we accurately identified 10,737 HLA-presented peptides, 1293 of which were non-canonical peptides that traditional database searches could not identify. Moreover, we found eight tumor neo-epitopes derived from somatic mutations, four of which were not previously reported. Our findings suggest that this new proteogenomic approach holds great promise for increasing the number of tumor-spec...
Bioinformatics (Oxford, England), Jan 15, 2014
Alternative splicing (AS) is a regulated process that directs the generation of different transcr... more Alternative splicing (AS) is a regulated process that directs the generation of different transcripts from single genes. A computational model that can accurately predict splicing patterns based on genomic features and cellular context is highly desirable, both in understanding this widespread phenomenon, and in exploring the effects of genetic variations on AS. Using a deep neural network, we developed a model inferred from mouse RNA-Seq data that can predict splicing patterns in individual tissues and differences in splicing patterns across tissues. Our architecture uses hidden variables that jointly represent features in genomic sequences and tissue types when making predictions. A graphics processing unit was used to greatly reduce the training time of our models with millions of parameters. We show that the deep architecture surpasses the performance of the previous Bayesian method for predicting AS patterns. With the proper optimization procedure and selection of hyperparamete...
BMC Genomics, 2010
Background: Genome-wide computational analysis of alternative splicing (AS) in several flowering ... more Background: Genome-wide computational analysis of alternative splicing (AS) in several flowering plants has revealed that pre-mRNAs from about 30% of genes undergo AS. Chlamydomonas, a simple unicellular green alga, is part of the lineage that includes land plants. However, it diverged from land plants about one billion years ago. Hence, it serves as a good model system to study alternative splicing in early photosynthetic eukaryotes, to obtain insights into the evolution of this process in plants, and to compare splicing in simple unicellular photosynthetic and non-photosynthetic eukaryotes. We performed a global analysis of alternative splicing in Chlamydomonas reinhardtii using its recently completed genome sequence and all available ESTs and cDNAs. Results: Our analysis of AS using BLAT and a modified version of the Sircah tool revealed AS of 498 transcriptional units with 611 events, representing about 3% of the total number of genes. As in land plants, intron retention is the most prevalent form of AS. Retained introns and skipped exons tend to be shorter than their counterparts in constitutively spliced genes. The splice site signals in all types of AS events are weaker than those in constitutively spliced genes. Furthermore, in alternatively spliced genes, the prevalent splice form has a stronger splice site signal than the non-prevalent form. Analysis of constitutively spliced introns revealed an over-abundance of motifs with simple repetitive elements in comparison to introns involved in intron retention. In almost all cases, AS results in a truncated ORF, leading to a coding sequence that is around 50% shorter than the prevalent splice form. Using RT-PCR we verified AS of two genes and show that they produce more isoforms than indicated by EST data. All cDNA/EST alignments and splice graphs are provided in a website at http://combi.cs.colostate.edu/as/chlamy.
Biomolecules
Although database search tools originally developed for shotgun proteome have been widely used in... more Although database search tools originally developed for shotgun proteome have been widely used in immunopeptidomic mass spectrometry identifications, they have been reported to achieve undesirably low sensitivities or high false positive rates as a result of the hugely inflated search space caused by the lack of specific enzymic digestions in immunopeptidome. To overcome such a problem, we developed a motif-guided immunopeptidome database building tool named IntroSpect, which is designed to first learn the peptide motifs from high confidence hits in the initial search, and then build a targeted database for refined search. Evaluated on 18 representative HLA class I datasets, IntroSpect can improve the sensitivity by an average of 76%, compared to conventional searches with unspecific digestions, while maintaining a very high level of accuracy (~96%), as confirmed by synthetic validation experiments. A distinct advantage of IntroSpect is that it does not depend on any external HLA da...
Functional coordination of alternative splicing in the mammalian
ArXiv, 2017
We propose generative neural network methods to generate DNA sequences and tune them to have desi... more We propose generative neural network methods to generate DNA sequences and tune them to have desired properties. We present three approaches: creating synthetic DNA sequences using a generative adversarial network; a DNA-based variant of the activation maximization ("deep dream") design method; and a joint procedure which combines these two approaches together. We show that these tools capture important structures of the data and, when applied to designing probes for protein binding microarrays, allow us to generate new sequences whose properties are estimated to be superior to those found in the training data. We believe that these results open the door for applying deep generative models to advance genomics research.
MotivationDetermining RNA binding protein(RBP) binding specificity is crucial for understanding m... more MotivationDetermining RNA binding protein(RBP) binding specificity is crucial for understanding many cellular processes and genetic disorders. RBP binding is known to be affected by both the sequence and structure of RNAs. Deep learning can be used to learn generalizable representations of raw data and has improved state of the art in several fields such as image classification, speech recognition and even genomics. Previous work on RBP binding has either used shallow models that combine sequence and structure or deep models that use only the sequence. Here we combine both abilities by augmenting and refining the original Deepbind architecture to capture structural information and obtain significantly better performance.ResultsWe propose two deep architectures, one a lightweight convolutional network for transcriptome wide inference and another a Long Short-Term Memory(LSTM) network that is suitable for small batches of data. We incorporate computationally predicted secondary struct...
BackgroundAccurate prediction of epitopes presented by human leukocyte antigen (HLA) is crucial f... more BackgroundAccurate prediction of epitopes presented by human leukocyte antigen (HLA) is crucial for personalized cancer immunotherapies targeting T cell epitopes. Mass spectrometry (MS) profiling of eluted HLA ligands, which provides high-throughput measurements of HLA associated peptides in vivo, can be used to faithfully model the presentation of epitopes on the cell surface. In addition, gene expression profiles measured by RNA-seq data in a specific cell/tissue type can significantly improve the performance of epitope presentation prediction. However, although large amount of high-quality MS data of HLA-bound peptides is being generated in recent years, few provide matching RNA-seq data, which makes incorporating gene expression into epitope prediction difficult.MethodsWe collected publicly available HLA peptidome and matching RNA-seq data of 34 cell lines derived from various sources. We built position score specific matrixes (PSSMs) for 21 HLA-I alleles based on these MS data,...
GigaScience, May 2, 2017
With the advancement of second generation sequencing techniques, our ability to detect and quanti... more With the advancement of second generation sequencing techniques, our ability to detect and quantify RNA editing on a global scale has been vastly improved. As a result, RNA editing is now being studied under a growing number of biological conditions so that its biochemical mechanisms and functional roles can be further understood. However, a major barrier that prevents RNA editing from being a routine RNA-seq analysis, similar to gene expression and splicing analysis for example, is the lack of user-friendly and effective computational tools. Based on years of experience of analyzing RNA editing using diverse RNA-seq datasets, we have developed a software tool RED-ML: RNA Editing Detection based on Machine learning (pronounced as "red ML"). The input to RED-ML can be as simple as a single BAM file, while it can also take advantage of matched genomic variant information when available. The output not only contains detected RNA editing sites, but also a confidence score to f...
Oncotarget, Jan 21, 2016
RNA editing results in post-transcriptional modification and could potentially contribute to carc... more RNA editing results in post-transcriptional modification and could potentially contribute to carcinogenesis. However, RNA editing in advanced lung adenocarcinomas has not yet been studied. Based on whole genome and transcriptome sequencing data, we identified 1,071,296 RNA editing events from matched normal, primary and metastatic samples contributed by 24 lung adenocarcinoma patients, with 91.3% A-to-G editing on average, and found significantly more RNA editing sites in tumors than in normal samples. To investigate cancer relevant editing events, we detected 67,851 hyper-editing sites in primary and 50,480 hyper-editing sites in metastatic samples. 46 genes with hyper-editing in coding regions were found to result in amino acid alterations, while hundreds of hyper-editing events in non-coding regions could modulate splicing or gene expression, including genes related to tumor stage or clinic prognosis. Comparing RNA editome of primary and metastatic samples, we also discovered hyp...
Genome Biology, 2007
Alternative splicing in the central nervous system
A microarray analysis provides new evidence... more Alternative splicing in the central nervous system
A microarray analysis provides new evidence suggesting that specific cellular processes in the mammalian CNS are coordinated at the level of alternative splicing, and that a complex splicing code underlies CNS-specific alternative splicing regulation.
Genome Biology, 2013
Transcriptome complexity and its relation to numerous diseases underpins the need to predict in s... more Transcriptome complexity and its relation to numerous diseases underpins the need to predict in silico splice variants and the regulatory elements that affect them. Building upon our recently described splicing code, we developed AVISPA, a Galaxy-based web tool for splicing prediction and analysis. Given an exon and its proximal sequence, the tool predicts whether the exon is alternatively spliced, displays tissue-dependent splicing patterns, and whether it has associated regulatory elements. We assess AVISPA's accuracy on an independent dataset of tissue-dependent exons, and illustrate how the tool can be applied to analyze a gene of interest. AVISPA is available at http://avispa.biociphers.org.
Science, 2014
Predicting defects in RNA splicing Most eukaryotic messenger RNAs (mRNAs) are spliced to remove i... more Predicting defects in RNA splicing Most eukaryotic messenger RNAs (mRNAs) are spliced to remove introns. Splicing generates uninterrupted open reading frames that can be translated into proteins. Splicing is often highly regulated, generating alternative spliced forms that code for variant proteins in different tissues. RNA-binding proteins that bind specific sequences in the mRNA regulate splicing. Xiong et al. develop a computational model that predicts splicing regulation for any mRNA sequence (see the Perspective by Guigó and Valcárcel). They use this to analyze more than half a million mRNA splicing sequence variants in the human genome. They are able to identify thousands of known disease-causing mutations, as well as many new disease candidates, including 17 new autism-linked genes. Science , this issue 10.1126/science.1254806 ; see also p. 124
PLoS ONE, 2011
Background: Transcriptome profiling of patterns of RNA expression is a powerful approach to ident... more Background: Transcriptome profiling of patterns of RNA expression is a powerful approach to identify networks of genes that play a role in disease. To date, most mRNA profiling of tissues has been accomplished using microarrays, but nextgeneration sequencing can offer a richer and more comprehensive picture. Methodology/Principal Findings: ECO is a rare multi-system developmental disorder caused by a homozygous mutation in ICK encoding intestinal cell kinase. We performed gene expression profiling using both cDNA microarrays and nextgeneration mRNA sequencing (mRNA-seq) of skin fibroblasts from ECO-affected subjects. We then validated a subset of differentially expressed transcripts identified by each method using quantitative reverse transcription-polymerase chain reaction (qRT-PCR). Finally, we used gene ontology (GO) to identify critical pathways and processes that were abnormal according to each technical platform. Methodologically, mRNA-seq identifies a much larger number of differentially expressed genes with much better correlation to qRT-PCR results than the microarray (r 2 = 0.794 and 0.137, respectively). Biologically, cDNA microarray identified functional pathways focused on anatomical structure and development, while the mRNA-seq platform identified a higher proportion of genes involved in cell division and DNA replication pathways. Conclusions/Significance: Transcriptome profiling with mRNA-seq had greater sensitivity, range and accuracy than the microarray. The two platforms generated different but complementary hypotheses for further evaluation.
Nature Genetics, 2013
Users may view, print, copy, and download text and data-mine the content in such documents, for t... more Users may view, print, copy, and download text and data-mine the content in such documents, for the purposes of academic research, subject always to the full Conditions of use:
BMC Bioinformatics, 2012
Transcript quantification is a long-standing problem in genomics and estimating the relative abun... more Transcript quantification is a long-standing problem in genomics and estimating the relative abundance of alternatively-spliced isoforms from the same transcript is an important special case. Both problems have recently been illuminated by high-throughput RNA sequencing experiments which are quickly generating large amounts of data. However, much of the signal present in this data is corrupted or obscured by biases resulting in non-uniform and non-proportional representation of sequences from different transcripts. Many existing analyses attempt to deal with these and other biases with various task-specific approaches, which makes direct comparison between them difficult. However, two popular tools for isoform quantification, MISO and Cufflinks, have adopted a general probabilistic framework to model and mitigate these biases in a more general fashion. These advances motivate the need to investigate the effects of RNA-seq biases on the accuracy of different approaches for isoform qu...
Cancers
Tumor-specific antigens can activate T cell-based antitumor immune responses and are ideal target... more Tumor-specific antigens can activate T cell-based antitumor immune responses and are ideal targets for cancer immunotherapy. However, their identification is still challenging. Although mass spectrometry can directly identify human leukocyte antigen (HLA) binding peptides in tumor cells, it focuses on tumor-specific antigens derived from annotated protein-coding regions constituting only 1.5% of the genome. We developed a novel proteogenomic integration strategy to expand the breadth of tumor-specific epitopes derived from all genomic regions. Using the colorectal cancer cell line HCT116 as a model, we accurately identified 10,737 HLA-presented peptides, 1293 of which were non-canonical peptides that traditional database searches could not identify. Moreover, we found eight tumor neo-epitopes derived from somatic mutations, four of which were not previously reported. Our findings suggest that this new proteogenomic approach holds great promise for increasing the number of tumor-spec...
Bioinformatics (Oxford, England), Jan 15, 2014
Alternative splicing (AS) is a regulated process that directs the generation of different transcr... more Alternative splicing (AS) is a regulated process that directs the generation of different transcripts from single genes. A computational model that can accurately predict splicing patterns based on genomic features and cellular context is highly desirable, both in understanding this widespread phenomenon, and in exploring the effects of genetic variations on AS. Using a deep neural network, we developed a model inferred from mouse RNA-Seq data that can predict splicing patterns in individual tissues and differences in splicing patterns across tissues. Our architecture uses hidden variables that jointly represent features in genomic sequences and tissue types when making predictions. A graphics processing unit was used to greatly reduce the training time of our models with millions of parameters. We show that the deep architecture surpasses the performance of the previous Bayesian method for predicting AS patterns. With the proper optimization procedure and selection of hyperparamete...
BMC Genomics, 2010
Background: Genome-wide computational analysis of alternative splicing (AS) in several flowering ... more Background: Genome-wide computational analysis of alternative splicing (AS) in several flowering plants has revealed that pre-mRNAs from about 30% of genes undergo AS. Chlamydomonas, a simple unicellular green alga, is part of the lineage that includes land plants. However, it diverged from land plants about one billion years ago. Hence, it serves as a good model system to study alternative splicing in early photosynthetic eukaryotes, to obtain insights into the evolution of this process in plants, and to compare splicing in simple unicellular photosynthetic and non-photosynthetic eukaryotes. We performed a global analysis of alternative splicing in Chlamydomonas reinhardtii using its recently completed genome sequence and all available ESTs and cDNAs. Results: Our analysis of AS using BLAT and a modified version of the Sircah tool revealed AS of 498 transcriptional units with 611 events, representing about 3% of the total number of genes. As in land plants, intron retention is the most prevalent form of AS. Retained introns and skipped exons tend to be shorter than their counterparts in constitutively spliced genes. The splice site signals in all types of AS events are weaker than those in constitutively spliced genes. Furthermore, in alternatively spliced genes, the prevalent splice form has a stronger splice site signal than the non-prevalent form. Analysis of constitutively spliced introns revealed an over-abundance of motifs with simple repetitive elements in comparison to introns involved in intron retention. In almost all cases, AS results in a truncated ORF, leading to a coding sequence that is around 50% shorter than the prevalent splice form. Using RT-PCR we verified AS of two genes and show that they produce more isoforms than indicated by EST data. All cDNA/EST alignments and splice graphs are provided in a website at http://combi.cs.colostate.edu/as/chlamy.