Coding exon-structure aware realigner (CESAR) utilizes genome alignments for accurate comparative gene annotation
Virag Sharma et al. Nucleic Acids Res. 2016.
Abstract
Identifying coding genes is an essential step in genome annotation. Here, we utilize existing whole genome alignments to detect conserved coding exons and then map gene annotations from one genome to many aligned genomes. We show that genome alignments contain thousands of spurious frameshifts and splice site mutations in exons that are truly conserved. To overcome these limitations, we have developed CESAR (Coding Exon-Structure Aware Realigner), which realigns coding exons while considering the reading frame and splice sites of each exon. CESAR effectively avoids spurious frameshifts in conserved genes and detects 91% of shifted splice sites. This results in the identification of thousands of additional conserved exons, and 99% of the exons that lack inactivating mutations match real exons. Finally, to demonstrate the potential of using CESAR for comparative gene annotation, we applied it to 188 788 exons of 19 865 human genes to annotate human genes in 99 other vertebrates. These comparative gene annotations are available as a resource (http://bds.mpi-cbg.de/hillerlab/CESAR/). CESAR (https://github.com/hillerlab/CESAR/) can readily be applied to other alignments to accurately annotate coding genes in many other vertebrate and invertebrate genomes.
© The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Figures
Figure 1.
Limitations of genome alignments for assessing exon conservation. (A) The genome alignment shows a 4 bp frameshifting insertion (red font). This is an alignment ambiguity, as an equivalent alignment exists where this insertion is in the intron (‘ideal alignment’). Upper case letters are exonic bases, lower case letters are intronic bases. (B) The genome alignment shows two close frameshifts that compensate each other. These two frameshifts are likely spurious and did not happen in evolution, as an alternative alignment exists that lacks these indels, with 12 versus 13 identical bases (grey background). (C) Two examples where the acceptor (left) or donor (right) splice site is mutated. In both cases, the exon is conserved but its splice site has shifted by 9 bp into the intron. The ideal alignment would align the original and the shifted splice site. By aligning non-orthologous but ‘functionally equivalent’ splice site bases, the ideal alignment correctly identifies the exon boundaries in the other species. (D) In contrast to (B), two real compensating frameshifts change the reading frame for 15 codons. The alignment with both frameshifts has a much higher number of identical bases (grey background) than the alignment without both frameshifts, which strongly suggests that these compensating frameshifts did occur in evolution.
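The comparisons in panels (B) and (D) rest on counting identical bases between alternative alignments and on whether an indel length is a multiple of 3. The following is a minimal Python sketch of those two checks; the function names and the toy sequences are hypothetical and are not taken from CESAR itself.

```python
def count_identical_bases(ref_row: str, query_row: str) -> int:
    """Count alignment columns where both rows carry the same base.

    Gap columns ('-') never count as identical. Case is ignored so that
    exonic (upper case) and intronic (lower case) bases compare equally.
    """
    if len(ref_row) != len(query_row):
        raise ValueError("aligned rows must have equal length")
    return sum(
        1
        for r, q in zip(ref_row.upper(), query_row.upper())
        if r == q and r != "-"
    )


def is_frameshifting(indel_length: int) -> bool:
    """An indel shifts the reading frame unless its length is a multiple of 3."""
    return indel_length % 3 != 0


# Toy alignment (made up for illustration): a 3 bp insertion in the query.
identical = count_identical_bases("ATGGCC---AAG", "ATGGCCTTTAAG")
```

Scoring two candidate alignments of the same region with `count_identical_bases` gives the kind of comparison the caption describes: a small difference (12 versus 13) argues that the indels are spurious, a large one that they are real.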
Figure 2.
Schematic representation of CESAR. The Hidden-Markov-Model consists of states that emit the up- and downstream intronic bases, the splice sites and the exon body in between. The exon body consists of states that match entire codons with emission probabilities reflecting the similarity to the codon in the reference exon (47), states that emit partial 1 or 2 bp codons that represent frameshifting deletions, states that insert any of the 61 non-stop codons, and nucleotide insertion states that insert in-frame stop codons or insert frameshifts. Codon deletions are modeled by transitions that skip between 1 and 10 codon units (blue transitions; only 1 to 3 codon deletions are illustrated here for clarity). The non-emitting (silent) black-circle states allow deleting more than 10 successive codons, similar to delete states in a profile HMM (46). All transitions representing exon-inactivating mutations (splice site mutations or frameshifting indels) are shown in red, transitions to codon insertion states are green and transitions that loop in insert states are black. The grey transitions are not free parameters but are fixed by the constraint that the sum of all out-going transition probabilities of a state must be 1.
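The last sentence of the caption states a normalization constraint: once the free transition probabilities leaving a state are chosen, the remaining (grey) transition is determined. Below is a minimal Python sketch of that constraint only. The state names and probability values are invented for illustration and are not CESAR's actual parameters.

```python
import math
from typing import Dict


def complete_transitions(free: Dict[str, float], fixed_name: str) -> Dict[str, float]:
    """Return all outgoing transitions of one state.

    `free` holds the freely chosen probabilities; the transition called
    `fixed_name` receives whatever probability mass is left so that the
    outgoing probabilities sum to 1.
    """
    if fixed_name in free:
        raise ValueError("the fixed transition must not be among the free ones")
    total_free = sum(free.values())
    if any(p < 0 for p in free.values()) or total_free > 1.0:
        raise ValueError("free probabilities must be non-negative and sum to at most 1")
    out = dict(free)
    out[fixed_name] = 1.0 - total_free
    return out


# Hypothetical values for a codon match state (not taken from the paper).
match_out = complete_transitions(
    {"codon_insert": 0.01, "frameshift_insert": 0.001, "delete_1_codon": 0.005},
    fixed_name="next_codon_match",
)
```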
Figure 3.
Close pairs of compensatory frameshifts are abundant in genome alignments. The distance between two compensating frameshifts is plotted as a histogram for the human-mouse alignment. Compensating frameshifts are pairs of frameshifts where the second frameshift returns to the original reading frame and the sequence between the two frameshifts is translatable (no in-frame stop codon in the new reading frame).
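The definition of a compensating pair above can be sketched in Python as follows. This is an illustrative simplification, not the authors' implementation: it assumes the aligned species' coding sequence starts in frame at index 0, that each frameshift is given as a hypothetical `(position, net_length)` tuple in that sequence's coordinates (positive for insertions, negative for deletions), and that only adjacent frameshifts are paired.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_in_frame_stop(seq: str, start: int, end: int) -> bool:
    """Check codons lying fully inside [start, end), read in the frame that begins at index 0."""
    first = -(-start // 3) * 3  # round start up to the next codon boundary
    for i in range(first, end - 2, 3):
        if seq[i:i + 3].upper() in STOP_CODONS:
            return True
    return False


def compensating_pairs(
    query_cds: str, frameshifts: List[Tuple[int, int]]
) -> List[Tuple[int, int, int]]:
    """Return (pos1, pos2, distance) for adjacent frameshifts that restore the frame."""
    pairs = []
    events = sorted(frameshifts)
    for (p1, l1), (p2, l2) in zip(events, events[1:]):
        if l1 % 3 == 0 or l2 % 3 == 0:
            continue  # at least one indel is not a frameshift
        if (l1 + l2) % 3 != 0:
            continue  # the second indel does not return to the original frame
        if has_in_frame_stop(query_cds, p1, p2):
            continue  # the sequence in between is not translatable
        pairs.append((p1, p2, p2 - p1))
    return pairs


# Toy sequence: a +1 insertion at index 3 and a -1 deletion at index 12.
pairs = compensating_pairs("ATGTGCCAAAGGTTTCCC", [(3, 1), (12, -1)])
```

The third element of each tuple is the distance between the two frameshifts, the quantity shown in the histogram.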
Figure 4.
Comparative evaluation of CESAR's alignment accuracy. The accuracy of aligning five data sets of simulated exons with or without frameshifts and with shifted splice sites is shown. Nearly identical alignments are defined as being identical or differing from the true alignment only in a ≤6 bp shift in the position of indels. The five different data sets are: (A) intact exons without frameshifts and splice site shifts, (B) two spurious compensating frameshifts, (C) two real compensating frameshifts, (D) one real frameshift and (E) splice site shifts (see Materials and Methods).
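One possible reading of the "nearly identical" criterion is sketched below in Python. It assumes each alignment is summarized by a list of hypothetical `(position, signed_length)` indel tuples and treats two alignments as nearly identical when they contain the same indels, each displaced by at most 6 bp. The paper's exact procedure is described in its Materials and Methods and may differ.

```python
from typing import List, Tuple

Indel = Tuple[int, int]  # (position, signed length: + insertion, - deletion)


def nearly_identical(
    true_indels: List[Indel], test_indels: List[Indel], max_shift: int = 6
) -> bool:
    """True if both alignments have the same indels, each moved by at most max_shift bp."""
    if len(true_indels) != len(test_indels):
        return False
    for (true_pos, true_len), (test_pos, test_len) in zip(
        sorted(true_indels), sorted(test_indels)
    ):
        if true_len != test_len:
            return False  # a different indel, not just a shifted one
        if abs(true_pos - test_pos) > max_shift:
            return False  # shifted too far from the true position
    return True


# Toy example: the first deletion is placed 4 bp further downstream.
ok = nearly_identical([(10, -1), (40, 2)], [(14, -1), (40, 2)])
```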
Figure 5.
CESAR drastically reduces the number of exon inactivating mutations. (A) Number of frameshift mutations and (B) number of splice site mutations in genes that have a 1:1 orthology relationship and are annotated in mouse, rat, cow and dog.
Figure 6.
UCSC Genome Browser screenshot showing human exons realigned with CESAR. Top: Several exons of SLC24A3. Bottom: Realignment of one exon, together with unaligned flanking sequence on either side of the exon. Only a subset of the 99 non-human vertebrates is shown. Dots refer to bases that are identical to the aligned human base.
Figure 7.
Summary of comparative gene annotation in 99 non-human vertebrates. The proportion of the 19 865 human genes (blue triangles) and 188 788 exons (red circles) that we annotate in 99 vertebrate genomes after realignment by CESAR. (A) 61 mammalian species. (B) 38 non-mammalian species. Placental mammals, birds and teleost fish are highlighted by a light yellow background.
Figure 8.
Genome browser screenshot of human genes annotated in other vertebrates. The UCSC genome browser screenshot of a 605 kb locus in the human genome (hg19, chr19:34 287 751–34 893 318) with four genes is shown at the top. Genome browser annotation tracks of human exons mapped by CESAR are shown below for 8 of the 99 genomes covering different clades. The phylogenetic tree of all 100 species is shown on the right.
Similar articles
- Coding Exon-Structure Aware Realigner (CESAR): Utilizing Genome Alignments for Comparative Gene Annotation.
Sharma V, Hiller M. Methods Mol Biol. 2019;1962:179-191. doi: 10.1007/978-1-4939-9173-0_10. PMID: 31020560
- CESAR 2.0 substantially improves speed and accuracy of comparative gene annotation.
Sharma V, Schwede P, Hiller M. Bioinformatics. 2017 Dec 15;33(24):3985-3987. doi: 10.1093/bioinformatics/btx527. PMID: 28961744
- Increased alignment sensitivity improves the usage of genome alignments for comparative gene annotation.
Sharma V, Hiller M. Nucleic Acids Res. 2017 Aug 21;45(14):8369-8377. doi: 10.1093/nar/gkx554. PMID: 28645144 Free PMC article.
- A brief review of software tools for pangenomics.
Xiao J, Zhang Z, Wu J, Yu J. Genomics Proteomics Bioinformatics. 2015 Feb;13(1):73-6. doi: 10.1016/j.gpb.2015.01.007. Epub 2015 Feb 23. PMID: 25721608 Free PMC article. Review.
- Comparative genomics as a tool for gene discovery.
Windsor AJ, Mitchell-Olds T. Curr Opin Biotechnol. 2006 Apr;17(2):161-7. doi: 10.1016/j.copbio.2006.01.007. Epub 2006 Feb 3. PMID: 16459073 Review.
Cited by
- CONSERVATION ASSESSMENT OF HUMAN SPLICE SITE ANNOTATION BASED ON A 470-GENOME ALIGNMENT.
Minkin I, Salzberg SL. bioRxiv [Preprint]. 2024 May 14:2023.12.01.569581. doi: 10.1101/2023.12.01.569581. PMID: 38076842 Free PMC article. Preprint.
- High-quality haploid genomes corroborate 29 chromosomes and highly conserved synteny of genes in Hyles hawkmoths (Lepidoptera: Sphingidae).
Hundsdoerfer AK, Schell T, Patzold F, Wright CJ, Yoshido A, Marec F, Daneck H, Winkler S, Greve C, Podsiadlowski L, Hiller M, Pippel M. BMC Genomics. 2023 Aug 7;24(1):443. doi: 10.1186/s12864-023-09506-y. PMID: 37550607 Free PMC article.
- ncOrtho: efficient and reliable identification of miRNA orthologs.
Langschied F, Leisegang MS, Brandes RP, Ebersberger I. Nucleic Acids Res. 2023 Jul 21;51(13):e71. doi: 10.1093/nar/gkad467. PMID: 37260093 Free PMC article.
- Integrating gene annotation with orthology inference at scale.
Kirilenko BM, Munegowda C, Osipova E, Jebb D, Sharma V, Blumer M, Morales AE, Ahmed AW, Kontopoulos DG, Hilgers L, Lindblad-Toh K, Karlsson EK; Zoonomia Consortium; Hiller M. Science. 2023 Apr 28;380(6643):eabn3107. doi: 10.1126/science.abn3107. Epub 2023 Apr 28. PMID: 37104600 Free PMC article.
- Building the Chordata Olfactory Receptor Database using more than 400,000 receptors annotated by Genome2OR.
Han W, Wu Y, Zeng L, Zhao S. Sci China Life Sci. 2022 Dec;65(12):2539-2551. doi: 10.1007/s11427-021-2081-6. Epub 2022 Jun 10. PMID: 35696018
References
- Robinson G.E., Hackett K.J., Purcell-Miramontes M., Brown S.J., Evans J.D., Goldsmith M.R., Lawson D., Okamuro J., Robertson H.M., Schneider D.J. Creating a buzz about insect genomes. Science. 2011;331:1386.