An ORFome assembly approach to metagenomics sequences analysis

Yuzhen Ye et al. J Bioinform Comput Biol. 2009 Jun.

Abstract

Metagenomics is an emerging methodology for the direct genomic analysis of a mixed community of uncultured microorganisms. Current analyses of metagenomic data rely largely on computational tools originally designed for microbial genomics projects. The challenge of assembling metagenomic sequences arises mainly from the short reads and the high species complexity of the community. Alternatively, individual (short) reads are searched directly against databases of known genes (or proteins) to identify homologous sequences. The latter approach may have low sensitivity and specificity in identifying homologous sequences, which may further bias the subsequent diversity analysis. In this paper, we present a novel approach to metagenomic data analysis, called Metagenomic ORFome Assembly (MetaORFA). The computational framework consists of three steps. Each read from a metagenomics project is first annotated with putative open reading frames (ORFs) that likely encode proteins. Next, the predicted ORFs are assembled into a collection of peptides using an EULER assembly method. Finally, the assembled peptides (i.e., the ORFome) are used for database searching of homologs and subsequent diversity analysis. We applied the MetaORFA approach to several metagenomics datasets with low-coverage short reads. The results show that MetaORFA can produce long peptides even when the sequence coverage of reads is extremely low. Hence, ORFome assembly significantly increases the sensitivity of homology searching and may improve the diversity analysis of metagenomic data. This improvement is especially useful for metagenomic projects in which genome assembly fails because of low sequence coverage.
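The first step of the framework, annotating each read with putative ORFs, can be illustrated with a minimal Python sketch. The abstract does not say how ORFs are called, so the approach here (six-frame translation, keeping stop-free stretches above a length cutoff) is an assumption for illustration only, and the names `putative_orfs` and `min_len` are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def putative_orfs(read, min_len=10):
    """Collect stop-free peptide fragments from all six reading frames.

    Assumption: a short read rarely holds a complete gene, so any stretch
    between stop codons (or read ends) counts as a partial ORF.
    """
    read = read.upper()
    orfs = []
    for strand in (read, reverse_complement(read)):
        for frame in range(3):
            for fragment in translate(strand[frame:]).split("*"):
                if len(fragment) >= min_len:
                    orfs.append(fragment)
    return orfs


# Toy read, with an artificially small cutoff so the output is visible.
print(putative_orfs("ATGGCTAAATAAGGG", min_len=3))
# ['MAK', 'WLNK', 'PLFSH', 'PYLA']
```

In this sketch the length cutoff is the main tuning knob: a higher `min_len` discards more spurious fragments but also more genuine partial ORFs.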

Figures

Fig. 1

A schematic comparison of the ORFome assembly approach with the Whole Genome Assembly (WGA) pipeline for metagenomic sequence analysis. Both approaches attempt to characterize the protein-coding genes in the shotgun sequencing reads from the metagenomic analysis of an environmental sample containing a number of different microorganisms (a). The reads are shown as double-barreled because several current NGS techniques can generate such data; however, some early metagenomics projects, including the datasets used in this paper, did not produce double-barreled sequencing reads, so the scaffolding step is not feasible for them. The WGA pipeline (b-d) first assembles the reads into contigs and scaffolds, and then annotates the genes in the assembled sequences. In comparison, the ORFome assembly approach (e-g) first applies gene finding to the unassembled reads, and then assembles only the annotated (partial) ORFs into peptides. These peptides may be further connected to form scaffolds if mate-pairs are available from double-barreled sequencing (g).

Fig. 2

A synthetic example of ORFome assembly resulting in a protein family graph. Two homologous proteins are encoded in the metagenome. Because of the short read length, it is difficult to reconstruct the complete sequences of these two proteins. The EULER-ORFA approach assembles them into a protein family graph, in which the common and distinct parts of the two proteins are represented by separate edges, and each protein corresponds to a path in the graph.
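The idea of a protein family graph can be sketched with a small de Bruijn graph over peptide k-mers. This is an illustrative simplification and not the EULER-ORFA implementation: the function names, the choice of `k=4`, and the toy peptide fragments are all assumptions. Nodes are (k-1)-mers, each k-mer contributes one edge, and a node with more than one successor marks a point where the homologous proteins diverge.

```python
from collections import defaultdict


def peptide_de_bruijn(fragments, k=4):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges come from k-mers."""
    graph = defaultdict(set)
    for frag in fragments:
        for i in range(len(frag) - k + 1):
            kmer = frag[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branch_nodes(graph):
    """Return nodes with more than one successor (where paths split)."""
    return sorted(node for node, succ in graph.items() if len(succ) > 1)


# Hypothetical fragments from two homologous proteins,
# MKTAYIAKQRQ and MKTAYLSKQRQ, each covered by two overlapping pieces.
fragments = ["MKTAYIA", "AYIAKQRQ", "MKTAYLS", "AYLSKQRQ"]
graph = peptide_de_bruijn(fragments)

print(branch_nodes(graph))   # ['TAY']
print(sorted(graph["TAY"]))  # ['AYI', 'AYL']
```

The shared prefix and suffix collapse into single paths, while the differing middle region appears as two parallel branches, so each protein can be read off as one path through the graph.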

Fig. 3

Comparison of MetaORFA performance using different length cutoffs for the input ORFs, as shown by the total number of long assembled peptides (at least 60 aa) (a) and the length of the longest peptide (b).

Fig. 4

A long peptide of 155 aa (contig196081, highlighted as a bold line), assembled from 18 putative ORFs (represented as thin lines below the contig) in the Gulf of Mexico dataset, shows strong similarity to proteins of known function in the IMG database (a). (b) shows the BLAST alignment between the peptide and the PhoH-like protein from Roseophage SIO1 in the IMG database.

Fig. 5

Detailed comparison of the total number of read hits in the IMG database using unassembled reads and the total number of read hits including those belonging to the assembled peptides, at different BLAST E-value cutoffs. The gap between the two lines indicates the gain in read hits from using the assembled peptides from the ORFome assembly.
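The two quantities compared in this figure can be made concrete with a short Python sketch. The data layout is an assumption: each read has its best E-value from a direct search, each assembled peptide has its own best E-value, and a read counts as a hit if either it or the peptide it was assembled into passes the cutoff. All identifiers and E-values below are invented toy inputs, not results from the paper.

```python
def read_hit_counts(read_evalues, peptide_evalues, read_to_peptide, cutoff):
    """Return (hits from unassembled reads, hits including assembled peptides)."""
    # Reads whose own best hit passes the cutoff.
    direct = {r for r, e in read_evalues.items() if e <= cutoff}
    # Reads that belong to a peptide whose best hit passes the cutoff.
    via_peptide = {
        r
        for r, p in read_to_peptide.items()
        if peptide_evalues.get(p, float("inf")) <= cutoff
    }
    return len(direct), len(direct | via_peptide)


# Toy inputs (hypothetical).
read_evalues = {"r1": 1e-8, "r2": 0.5, "r3": 2.0}
peptide_evalues = {"p1": 1e-12}
read_to_peptide = {"r2": "p1", "r3": "p1"}

for cutoff in (1e-5, 1e-10):
    print(cutoff, read_hit_counts(read_evalues, peptide_evalues, read_to_peptide, cutoff))
# 1e-05 (1, 3)
# 1e-10 (0, 2)
```

The difference between the two numbers at a given cutoff corresponds to the gap between the two lines in the figure.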

Fig. 6

A peptide involving 11 synonymous polymorphic sites (starting at position 30 and ending at position 60) assembled from the Sargasso Sea dataset. In the figure, the aligned protein sequences are shown at the top and the corresponding DNA sequences at the bottom; the mutations are highlighted in bold and italic (if only two sequences cover a site, one arbitrary codon is highlighted).
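A synonymous polymorphic site is a codon position where the DNA differs between reads but the encoded amino acid does not, which is why such reads can still merge at the peptide level. The following Python sketch shows one way to detect these sites in two in-frame, gap-free coding sequences. The function name and the toy sequences are hypothetical, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def synonymous_sites(dna_a, dna_b):
    """Return 1-based codon positions where codons differ but amino acids match.

    Assumes both sequences are uppercase, in frame, and aligned without gaps.
    """
    sites = []
    for i in range(0, min(len(dna_a), len(dna_b)) - 2, 3):
        codon_a, codon_b = dna_a[i:i + 3], dna_b[i:i + 3]
        aa = CODON_TABLE.get(codon_a)
        if codon_a != codon_b and aa is not None and aa == CODON_TABLE.get(codon_b):
            sites.append(i // 3 + 1)
    return sites


# GCT/GCC both encode Ala and AAA/AAG both encode Lys.
print(synonymous_sites("ATGGCTAAA", "ATGGCCAAG"))  # [2, 3]
# GCT (Ala) versus GAT (Asp) changes the protein, so it is not reported.
print(synonymous_sites("ATGGCTAAA", "ATGGATAAA"))  # []
```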
