A hidden reservoir of integrative elements is the major source of recently acquired foreign genes and ORFans in archaeal and bacterial genomes

Diego Cortez et al. Genome Biol. 2009.

Abstract

Background: Archaeal and bacterial genomes contain a number of genes of foreign origin that arose from recent horizontal gene transfer, but the role of integrative elements (IEs), such as viruses, plasmids, and transposable elements, in this process has not been extensively quantified. Moreover, it is not known whether IEs play an important role in the origin of ORFans (open reading frames without matches in current sequence databases), whose proportion remains stable despite the growing number of completely sequenced genomes.

Results: We performed a large-scale survey of potential recently acquired IEs in 119 archaeal and bacterial genomes. We developed an accurate in silico Markov model-based strategy to identify clusters of genes that show atypical sequence composition (clusters of atypical genes, or CAGs) and are thus likely to be recently integrated foreign elements, including IEs. Our method identified a high number of new CAGs. Probabilistic analysis of gene content indicates that 56% of these new CAGs are likely IEs, whereas only 7% likely originated via horizontal gene transfer from distant cellular sources. Thirty-four percent of CAGs remain unassigned, which may reflect a still poor sampling of IEs associated with bacterial and archaeal diversity. Moreover, our study contributes to the issue of the origin of ORFans, because 39% of these are found inside CAGs, many of which likely represent recently acquired IEs.

Conclusions: Our results strongly indicate that archaeal and bacterial genomes contain an impressive proportion of recently acquired foreign genes (including ORFans) coming from a still largely unexplored reservoir of IEs.

Figures

Figure 1

Markov model-based strategy. (a) An optimal core genes dataset is determined, and (b) a Markov probability matrix is built. (c) For a given genome, each ORF is analyzed using a Markov model that takes into account the Markov probability matrix of the core genes dataset and the composition of the ORF under study. (d) For each ORF the model calculates an index that represents the likelihood of that ORF having a composition similar to the core genes dataset. (e) One million random sequences are generated based on the Markov probability matrix of the core genes dataset, and their Markov indexes are calculated. (f) ORFs having a Markov index below a defined threshold of the distribution of random sequence indexes are considered atypical.
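
The steps in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the first-order model, the pseudocount, the length-normalised log-likelihood used as the "index", and all function names (`train_markov`, `markov_index`, `sample_sequence`, `atypical_threshold`) are assumptions made for illustration. The caption does not give the exact index formula, the model order, or the threshold, and the sketch uses far fewer than one million random sequences.

```python
import itertools
import math
import random
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: sequences contain only these four bases


def train_markov(core_seqs, order=1, pseudocount=1.0):
    """Steps (a)-(b): build a transition probability matrix from core genes.

    `order` must be >= 1; the pseudocount keeps unseen transitions non-zero.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in core_seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0
    model = {}
    for ctx_tuple in itertools.product(ALPHABET, repeat=order):
        ctx = "".join(ctx_tuple)
        total = sum(counts[ctx].values()) + pseudocount * len(ALPHABET)
        model[ctx] = {b: (counts[ctx][b] + pseudocount) / total for b in ALPHABET}
    return model


def markov_index(seq, model, order=1):
    """Steps (c)-(d): stand-in index, the mean log-probability per transition."""
    n = len(seq) - order
    if n <= 0:
        raise ValueError("sequence is shorter than the model order")
    log_lik = sum(math.log(model[seq[i - order:i]][seq[i]])
                  for i in range(order, len(seq)))
    return log_lik / n


def sample_sequence(model, length, rng, order=1):
    """Step (e): draw one random sequence from the core-gene model."""
    seq = [rng.choice(ALPHABET) for _ in range(order)]
    while len(seq) < length:
        probs = model["".join(seq[-order:])]
        seq.append(rng.choices(ALPHABET, weights=[probs[b] for b in ALPHABET])[0])
    return "".join(seq)


def atypical_threshold(model, n_random, length, quantile, rng, order=1):
    """Step (f): lower quantile of the random-sequence index distribution."""
    indexes = sorted(
        markov_index(sample_sequence(model, length, rng, order), model, order)
        for _ in range(n_random)
    )
    return indexes[int(quantile * (n_random - 1))]


# Toy usage: an AT-rich "core" and one GC-rich candidate ORF.
rng = random.Random(0)
model = train_markov(["AT" * 30])
threshold = atypical_threshold(model, n_random=200, length=50, quantile=0.05, rng=rng)
is_atypical = markov_index("GGGGCCCC", model) < threshold
```

In this toy setting the GC-rich candidate scores well below the random-sequence distribution and is flagged, while a sequence resembling the core scores above the threshold.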

Figure 2

HGT simulations. (a) Eleven core gene datasets for each analyzed genome were determined and, for each genome, 11 Markov models were built based on these different gene datasets. (b) The efficiency of our MM approach, the BM approach, and a GC% approach to detect foreign ORFs was tested by performing in silico HGT simulations using a variety of core gene datasets. (c) For the HGT simulations, 100 genes were chosen from the other 118 genomes and 100 random core ORFs were in silico introduced in the genome under analysis. (d) The average number of these ORFs that were detected as atypical (false positives, expected to be low) was determined. (e) After 100 simulations we searched for the core genes dataset and the cut-off where the average detection of simulated HGT was the highest but the average detection of native core genes was the lowest. (f-h) Average result after 100 HGT simulations for the 119 analyzed genomes using the MM, BM and GC% methods with species-specific core gene datasets and cut-offs. Blue dots represent the average number of true positives detected. Green dots represent the average number of false positives detected. The MM method had a significantly higher rate of detection of true positives than the BM method (Wilcoxon test W = 11,849, P-value < 2.2e-16, means = 86.8 and 74.8 for the MM and BM methods, respectively) and than the GC% method (Wilcoxon test W = 13,824, P-value < 2.2e-16, means = 86.8 and 52.6 for the MM and GC% methods, respectively). No significant differences were found between the MM and BM methods in the detection of false positives (Wilcoxon test W = 8,359, P-value = 0.0311, means = 12.4 and 11.0 for the MM and BM methods, respectively).
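
The simulation loop in panels (c) to (e) can be sketched as follows, using the simplest of the three detectors (GC%) as the example. This is an illustrative sketch only: the function names, the absolute GC-deviation rule, and the "true positives minus false positives" score used to pick a cut-off are assumptions, since the caption does not state how the two criteria were combined.

```python
import random


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def make_gc_detector(genome_gc, cutoff):
    """Assumed GC% rule: flag an ORF whose GC content deviates by more than `cutoff`."""
    return lambda seq: abs(gc_fraction(seq) - genome_gc) > cutoff


def simulate_hgt(is_atypical, foreign_pool, core_pool, n_genes=100, n_runs=100, seed=0):
    """Average true and false positives per run over repeated simulations.

    Each run draws `n_genes` foreign genes (simulated HGT) and `n_genes`
    native core ORFs, then counts how many of each are flagged as atypical.
    """
    rng = random.Random(seed)
    tp_total = 0
    fp_total = 0
    for _ in range(n_runs):
        foreign = rng.sample(foreign_pool, n_genes)
        native = rng.sample(core_pool, n_genes)
        tp_total += sum(1 for g in foreign if is_atypical(g))
        fp_total += sum(1 for g in native if is_atypical(g))
    return tp_total / n_runs, fp_total / n_runs


def best_cutoff(cutoffs, genome_gc, foreign_pool, core_pool, **kwargs):
    """Pick the cut-off with the highest (TP - FP); this objective is an assumption."""
    scored = []
    for c in cutoffs:
        tp, fp = simulate_hgt(make_gc_detector(genome_gc, c), foreign_pool, core_pool, **kwargs)
        scored.append((tp - fp, c, tp, fp))
    return max(scored)


# Toy usage with tiny pools (real runs would use 100 genes and 100 simulations).
foreign_pool = ["GCGCGCGCGC"] * 5   # GC fraction 1.0
core_pool = ["ATGCATATAT"] * 5      # GC fraction 0.2
score, cutoff, tp, fp = best_cutoff(
    [0.1, 0.9], genome_gc=0.2,
    foreign_pool=foreign_pool, core_pool=core_pool,
    n_genes=3, n_runs=10,
)
```

The same `simulate_hgt` loop would accept any other detector, such as one based on a Markov index, as long as it maps a sequence to a boolean.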

Figure 3

Number of identified CAGs, CAG size distribution and proportion of already annotated IEs. (a) Average number and standard deviations of CAGs in the different analyzed groups of Archaea and Bacteria. The average number of annotated IEs for each group is shown in red. (b) CAG size distribution.
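
As a rough illustration of how atypical ORFs could be grouped into CAGs and how a size distribution like the one in panel (b) could be tabulated, here is a minimal Python sketch. The clustering rule is an assumption: it groups strictly consecutive atypical genes and keeps runs of at least `min_size` genes. Neither the minimum size nor the handling of gaps is specified in this text, and the function names are hypothetical.

```python
from collections import Counter


def find_cags(atypical_flags, min_size=5):
    """Return (start, end) index pairs, end exclusive, for runs of atypical genes.

    `atypical_flags` holds one truthy/falsy value per gene, in genome order.
    Runs shorter than `min_size` (an assumed parameter) are discarded.
    """
    cags = []
    start = None
    for i, flag in enumerate(atypical_flags):
        if flag and start is None:
            start = i                      # a new run begins
        elif not flag and start is not None:
            if i - start >= min_size:      # the run ended at gene i - 1
                cags.append((start, i))
            start = None
    if start is not None and len(atypical_flags) - start >= min_size:
        cags.append((start, len(atypical_flags)))  # run reaching the last gene
    return cags


def size_distribution(cags):
    """Count how many CAGs have each size (number of genes)."""
    return Counter(end - start for start, end in cags)


# Toy usage: 11 genes, with atypical genes marked as 1.
flags = [0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1]
cags = find_cags(flags, min_size=3)
sizes = size_distribution(cags)
```

Here the isolated atypical gene at position 5 is dropped, leaving two clusters of three and four genes.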

Figure 4

Proportion of homologues from annotated IEs, newly identified CAGs, and core genes in various databases. Proportion of homologues of ORFs from annotated IEs in the core genes database, the viral database, the plasmid database and the annotated IE database, as well as the proportion of homologues of core genes in the viral database and the plasmid database.

Figure 5

CAGs of likely IE origin based on probabilistic analysis. (a) Proportion of newly identified CAGs of plasmid origin (green) for each analyzed group; proportion of identified CAGs of viral origin (red); proportion of newly identified CAGs of viral/plasmid origin (yellow); proportion of newly identified CAGs of cellular origin (blue); proportion of newly identified CAGs that are unassigned (violet). (b) Same as in (a) but after database correction. Each group's average number of CAGs is indicated in parentheses. Grey arrows indicate the groups with the highest proportions of newly identified CAGs classified as IEs.

Figure 6

ORFan distribution. (a) Distribution of ORFans in CAGs: ORFans in CAGs of viral origin (red); ORFans in CAGs of plasmid origin (green); ORFans in CAGs of viral/plasmid origin (yellow); ORFans in CAGs of cellular origin (blue); and ORFans in unassigned CAGs (violet). (b) Proportion of ORFans inside CAGs of different sizes. Data were normalized according to the number of CAGs in each category.
