Increased alignment sensitivity improves the usage of genome alignments for comparative gene annotation

Comparative Study

Nucleic Acids Res. 2017 Aug 21;45(14):8369-8377.

doi: 10.1093/nar/gkx554.

Virag Sharma et al. Nucleic Acids Res. 2017.

Abstract

Genome alignments provide a powerful basis to transfer gene annotations from a well-annotated reference genome to many other aligned genomes. The completeness of these annotations crucially depends on the sensitivity of the underlying genome alignment. Here, we investigated the impact of the genome alignment parameters and found that parameters with a higher sensitivity allow the detection of thousands of novel alignments between orthologous exons that have been missed before. In particular, comparisons between species separated by an evolutionary distance of >0.75 substitutions per neutral site, like human and other non-placental vertebrates, benefit from increased sensitivity. To systematically test if increased sensitivity improves comparative gene annotations, we built a multiple alignment of 144 vertebrate genomes and used this alignment to map human genes to the other 143 vertebrates with CESAR. We found that higher alignment sensitivity substantially improves the completeness of comparative gene annotations by adding on average 2382 and 7440 novel exons and 117 and 317 novel genes for mammalian and non-mammalian species, respectively. Our results suggest a more sensitive alignment strategy that should generally be used for genome alignments between distantly related species. Our 144-vertebrate genome alignment and the comparative gene annotations (https://bds.mpi-cbg.de/hillerlab/144VertebrateAlignment_CESAR/) are a valuable resource for comparative genomics.

© The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

Figures

Figure 1.

Alignment parameter sensitivity is crucial to align exons to their orthologous genomic locus. (A) UCSC genome browser screenshot showing the CATSPERD (cation channel sperm associated auxiliary subunit delta) gene locus in the human genome and genome alignments (chains of co-linear local alignments) to opossum computed with three different parameter sets (see text). Several exons of this gene only align to the opossum ortholog with more sensitive parameters (blue boxes) or with an additional, subsequent round of highly sensitive alignments (red boxes). (B–D) Three examples of local alignments covering exons of CATSPERD. Exonic bases are in upper case, intronic bases are in lower case.

Figure 2.

Sensitive alignment parameters can uncover thousands of new alignments between exons of orthologous genes. The figure compares the number of exons that align between orthologous genes for nine species at various evolutionary distances to human (axis at the bottom). Three alignment parameter sets were tested that differ in their sensitivity. The Y-axis shows the percent increase relative to the number of aligning exons with parameter set 1. The absolute number of aligning exons with parameter set 1 is given below the black dots, the absolute increase obtained with parameter set 2 and 3 is given alongside or above the black dots.
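
The relative measure on the Y-axis can be sketched as below. This is only an illustration of the arithmetic described in the caption, not the authors' code: the function name `percent_increase` and the exon counts are hypothetical placeholders and are not values from the paper.

```python
def percent_increase(baseline: int, improved: int) -> float:
    """Percent increase of `improved` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be a positive count")
    return 100.0 * (improved - baseline) / baseline


# Placeholder counts of aligning exons for one species (NOT data from the paper).
aligning_exons = {"set1": 1000, "set2": 1100, "set3": 1250}

# Parameter set 1 is the baseline, as in the figure.
baseline = aligning_exons["set1"]
increases = {
    name: percent_increase(baseline, count)
    for name, count in aligning_exons.items()
}

for name, value in increases.items():
    absolute_gain = aligning_exons[name] - baseline
    print(f"{name}: +{absolute_gain} exons ({value:.1f}% over set1)")
```

Reporting the absolute gain next to the percentage mirrors the figure, which gives both the relative increase and the absolute number of additional aligning exons.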

Figure 3.

Highly sensitive alignment parameters detect additional alignments between human and non-mammalian vertebrates. UCSC genome browser screenshots compare the UCSC 100-way alignment (27) with our 144-vertebrate alignment for two genomic loci (A and B). Aligning sequence is visualized by black and grey boxes. The darker the box, the higher the sequence similarity in the alignment. Double horizontal lines indicate sequence that does not align between the reference (human) and the query species. Yellow background indicates regions where exon alignments can only be detected with sensitive parameters in our 144-way alignment. Orange background indicates additional non-exonic conserved regions. For visualization, only a subset of all 70 non-mammalian vertebrates is shown. (C) Representative additional exon alignment between human and frog that was only detected with highly sensitive parameters (marked with a star in B).

Figure 4.

Comparative gene annotation in 143 vertebrate genomes. The X-axis shows the proportion of human exons (red circles) and genes for which CESAR annotated at least one exon (blue triangles) in 73 mammals (A) and 70 non-mammalian vertebrates (B). Species in blue font are not contained in the UCSC 100-way or primate alignment.

Figure 5.

Increased alignment sensitivity detects thousands of additional conserved exons and hundreds of conserved genes between evolutionarily distant species. The figure shows the absolute number of exons (A) and genes (B) that are additionally annotated using our 144-vertebrate alignment, compared to the UCSC 100-way alignment. Only species for which the same assembly is included in both genome alignments are shown. Major clades are highlighted. Wallaby, parrot, scarlet macaw and spiny softshell turtle, which have rather incomplete and fragmented genome assemblies, are the only species where fewer exons or genes are annotated in our alignment. The reason is that fragmented assemblies result in short and low-scoring co-linear alignments that can be discarded by our more stringent filtering thresholds (see Methods). Manual inspection shows that such short co-linear alignments include paralogous gene alignments that would lead to incorrect gene annotations (Supplementary Figure S1). Given that our approach provides a consistent improvement in comparative gene annotation, better genome assemblies should substantially improve the gene annotation of these four species.

References

    1. Picardi E., Pesole G. Computational methods for ab initio and comparative gene finding. Methods Mol. Biol. 2010; 609:269–284.
    2. Burge C., Karlin S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 1997; 268:78–94.
    3. Parra G., Blanco E., Guigo R. GeneID in Drosophila. Genome Res. 2000; 10:511–515.
    4. Stanke M., Waack S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics. 2003; 19(Suppl. 2):ii215–ii225.
    5. Birney E., Clamp M., Durbin R. GeneWise and genomewise. Genome Res. 2004; 14:988–995.
