WebMGA: a customizable web server for fast metagenomic sequence analysis

BMC Genomics volume 12, Article number: 444 (2011)

Abstract

Background

The new field of metagenomics studies microorganism communities by culture-independent sequencing. With the advances in next-generation sequencing techniques, researchers face tremendous challenges in metagenomic data analysis because of the huge quantity and high complexity of sequence data. Analyzing large datasets is extremely time-consuming, and metagenomic annotation involves a wide range of computational tools that are difficult for typical users to install and maintain. The tools provided by the few available web servers are also limited and have various constraints, such as login requirements, long waiting times and the inability to configure pipelines.

Results

We developed WebMGA, a customizable web server for fast metagenomic analysis. WebMGA includes over 20 commonly used tools for tasks such as ORF calling, sequence clustering, quality control of raw reads, removal of sequencing artifacts and contaminations, taxonomic analysis and functional annotation. WebMGA provides users with rapid metagenomic data analysis using fast and effective tools, which have been implemented to run in parallel on our local computer cluster. Users can access WebMGA through web browsers or programming scripts to perform individual analyses or to configure and run customized pipelines. WebMGA is freely available at http://weizhongli-lab.org/metagenomic-analysis.

Conclusions

WebMGA offers researchers many fast and unique tools and great flexibility for complex metagenomic data analysis.

Background

Metagenomics is a new field that studies environmental microorganism populations using culture-independent sequencing technologies. It provides a revolutionary and unprecedented view of the identities, dynamics and functions of microbial communities in various environments such as marine [1], human gut [2] and many others [3-5].

The recent advances in next-generation sequencing technologies [6] such as 454, Illumina, SOLiD and HeliScope have significantly promoted the development of metagenomics by offering low-cost and ultra-high-throughput sequencing. The huge amounts of available metagenomic sequence data create tremendous challenges in data analysis. Some challenges are computational and result from the huge quantity of sequence data. Querying a regular 454 sample of 10^6 reads against NCBI's non-redundant (NR) database using BLAST [7] can easily consume 10^4 to 10^5 CPU hours. Other challenges are due to the high complexity of metagenomic sequence data: (a) a sample may contain hundreds or thousands of species at dramatically different abundance levels; (b) many species are unknown; (c) next-generation sequencers produce shorter reads with higher error rates than Sanger sequencers; and (d) sequence data contain other experimental bias, artifacts and contaminations [8]. To address these problems, many methods have been developed, such as taxonomy binning [9-11], use of simulated datasets [12], diversity analysis [13], ORF calling [14, 15], rRNA prediction [16], sequence clustering [17-20], assembly [21], statistical comparison [22] and fragment recruitment [1, 8, 23]. For example, Megan [11] assigns taxonomic groups to query sequences based on a BLAST search against a reference database, usually the NCBI NR. CD-HIT has been used in clustering raw reads and ORFs to identify non-redundant sequences or gene families [24]. Mothur [25] is a software package with several functions such as identification of Operational Taxonomic Units (OTUs). QIIME [26] is another useful package for the investigation of microbial diversity using rRNAs. The RAMMCAP software package [27] provides a very fast sequence clustering and annotation pipeline.

It is very difficult for most researchers to install and maintain the many software tools needed for metagenome annotation. Many users simply do not have the computational resources required to run some of the tools. The available online resources that provide metagenomic data analysis are also limited. Currently, MG-RAST [28] and CAMERA [29] are the major sites where users can submit datasets for analysis. MG-RAST only provides a fixed pipeline, and the waiting time for its jobs is often very long (sometimes weeks). CAMERA offers a list of workflows, but many useful tools are still missing from CAMERA's site. In addition, both MG-RAST and CAMERA require user registration and login, so it is difficult to access their web servers using scripts.

In order to provide a fast, easy and flexible solution for metagenomic data analysis, we developed WebMGA, a web server that allows users to submit metagenomic datasets and to run many kinds of analysis, or to perform a user-customized annotation pipeline. WebMGA is freely available at http://weizhongli-lab.org/metagenomic-analysis to all users without any login requirement.

Implementation

WebMGA consists of a web user interface, a web service interface, server scripts, a MySQL relational database, an email server, daemon processes, application software packages, wrapping and parsing scripts and a computer cluster (Figure 1). The WebMGA web front-end is an Apache HTTP server, which accepts jobs submitted through web browsers. WebMGA's web services, which are implemented with the Mojolicious software, accept requests from client-side scripts that follow the Representational State Transfer (REST) style. Job requests are processed by server scripts, which submit jobs to a queue and return a unique job identifier with a web link for each request. If an email address is provided (optional), the user will be notified by email when the job status changes. All the job-related data, such as job identifiers, status, date and time, are stored in the MySQL database and managed by server scripts and daemon processes. The daemon processes handle the job queue, submit jobs to the computer cluster and check job status. A user can query the status or retrieve the results of a job, using a web browser or scripts, by submitting a job identifier. The latest versions of the software packages are locally installed on our computer cluster, which runs the Linux operating system and the Sun Grid Engine job queuing system. We implemented scripts to run these applications in parallel and parse the outputs.
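
The job life cycle described above (a request receives a unique identifier, waits in a queue, is started by a daemon when cores are free, and can be queried by identifier) can be illustrated with a minimal in-memory sketch. This is not the WebMGA implementation: the class and method names (JobServer, daemon_tick, and so on) and the three status strings are hypothetical, and a Python dictionary stands in for the MySQL database and the Sun Grid Engine queue.

```python
import uuid
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    tool: str
    cores: int
    # Unique identifier returned to the user at submission time.
    job_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    status: str = "queued"  # hypothetical states: queued -> running -> finished


class JobServer:
    """In-memory stand-in for the server scripts, database and daemon."""

    def __init__(self, total_cores):
        self.free_cores = total_cores
        self.jobs = {}        # job_id -> Job (plays the role of the database)
        self.queue = deque()  # identifiers of jobs waiting for cores

    def submit(self, tool, cores=1):
        """Register a request and return its unique job identifier."""
        job = Job(tool, cores)
        self.jobs[job.job_id] = job
        self.queue.append(job.job_id)
        return job.job_id

    def status(self, job_id):
        """Look up the current status of a job by its identifier."""
        return self.jobs[job_id].status

    def daemon_tick(self):
        """Start queued jobs in submission order while enough cores are free."""
        while self.queue and self.jobs[self.queue[0]].cores <= self.free_cores:
            job = self.jobs[self.queue.popleft()]
            self.free_cores -= job.cores
            job.status = "running"

    def finish(self, job_id):
        """Mark a running job as finished and release its cores."""
        job = self.jobs[job_id]
        if job.status != "running":
            raise ValueError(f"job {job_id} is not running")
        job.status = "finished"
        self.free_cores += job.cores
```

In this sketch a job that needs more cores than are currently free blocks the jobs behind it; a production scheduler would typically apply a more elaborate policy.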

Figure 1

Illustration of WebMGA and its metagenomic analysis functions. The major input of WebMGA is either a DNA sequence file or a protein sequence file. A user can run a single analysis at a time, such as predicting ORFs from the uploaded DNA sequences. A user can also use a script to call WebMGA to run multiple analyses or to run a pipeline in which one job uses the output of another job.

Results and Discussion

Computational tools

As outlined in Figure 1, WebMGA includes a wide range of tools for analyzing large and complex metagenomic sequence datasets. WebMGA is implemented with many tested tools that can process millions of sequences in minutes to hours. The key features of WebMGA are: (a) rapid analysis enabled by very fast algorithms and methods, (b) a large collection of computational tools, (c) flexibility to run individual tools or configure a pipeline consisting of individual tools, and (d) compatibility of application and pipelines with both web browsers and client-side scripts.

WebMGA currently has 26 individual tools that cover the following categories:

Figure 2

A screenshot and examples of output results of WebMGA. (a) A screenshot of the WebMGA server. (b) A plot of the distribution of clusters by CD-HIT. (c) COG annotation results in several TAB-delimited text files, which can be easily viewed locally. (d) A plot of length distribution produced by the sequence statistics tool.

Individual web servers

Each of the 26 tools introduced above was implemented as a standalone web server. As illustrated in the screenshot of the WebMGA web server (Figure 2a), each tool has its own web page where users can upload DNA or protein sequences for analysis, e.g. to call ORFs from raw reads using FragGeneScan. Different applications generate different types of files, including sequence files in FASTA or FASTQ format (e.g. ORF or RNA prediction), TAB-delimited text files (e.g. COG annotation, Figure 2c), graphic files (Figure 2b, d) and raw output files. Because the output types are so diverse, dedicated visualization pages are not available for all tools. The results produced by WebMGA, together with documentation, are packed into a zip file for the user to download and analyze on the client side.

Interactively perform analysis pipelines

Most metagenomic data analysis pipelines include many processes using different tools. Figure 3 gives a simplified pipeline as an example. With WebMGA, users can run complex pipelines by interactively using the individual web servers. For example, to run the pipeline in Figure 3, a user can upload the raw reads to the quality control tool and then input the high-quality reads into "sequence statistics", "rRNA prediction" and "clustering" servers and run them in parallel. Once the rRNA prediction is completed, the user can download the result and use the masked sequences (one of the output files from rRNA prediction) as input to run tRNA prediction followed by ORF-finder. When ORF-finder is finished, function and pathway annotation jobs can be submitted in parallel using the predicted ORFs as input.

Figure 3

A simple example pipeline configured with tools in WebMGA.
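
The dependencies in this example pipeline can be written down as a small directed graph, from which the groups of steps that may run in parallel follow mechanically. The sketch below is only an illustration: the step names are taken from the text, the dependency table is our reading of the description of Figure 3, and the helper name parallel_stages is hypothetical. It uses graphlib.TopologicalSorter from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose output it needs.
PIPELINE = {
    "quality control": set(),
    "sequence statistics": {"quality control"},
    "rRNA prediction": {"quality control"},
    "clustering": {"quality control"},
    "tRNA prediction": {"rRNA prediction"},
    "ORF-finder": {"tRNA prediction"},
    "function annotation": {"ORF-finder"},
    "pathway annotation": {"ORF-finder"},
}


def parallel_stages(graph):
    """Return lists of steps; all steps in one list can run at the same time."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the dependencies are circular
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose inputs are all available
        stages.append(ready)
        sorter.done(*ready)
    return stages


for number, steps in enumerate(parallel_stages(PIPELINE), start=1):
    print(number, ", ".join(steps))
```

Grouping by stage is a simplification. As described above, tRNA prediction can start as soon as rRNA prediction is completed, without waiting for the clustering job in the same stage.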

Client-side scripting

One advantage of running a pipeline interactively is that a user can monitor and control the annotation process, for example by checking the results and choosing suitable programs and parameters for the next step. However, this approach may be too tedious for routine analyses. WebMGA offers RESTful web services for all the tools, through which a complex pipeline can be executed automatically using one client-side script. Two template Perl scripts, client_submit_job.pl and Rammcap_client_submit.pl, are available at the WebMGA web site. A user can adapt these template scripts to configure an annotation pipeline and run it locally.

The first template script runs a single tool: it submits DNA sequences in a FASTA file to the CD-HIT-EST web service and downloads the clustering results. The second script performs a more extensive annotation using the RAMMCAP pipeline [27], which is also used by the CAMERA project. This script starts with a FASTA file of reads, runs a list of web services such as sequence statistics, clustering, rRNA and tRNA finding, ORF calling and function annotation, and finally downloads all the annotation results.
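
The general shape of such a client script (submit a job, poll its status by job identifier, download the results, and feed them to the next tool) can be sketched as follows. This is not a translation of the template Perl scripts and does not reproduce the WebMGA web service interface: the method names submit, status and results, the status strings, and the tool names in the usage example are all hypothetical, and the transport is abstracted behind a service object so that the sketch runs offline against a fake service.

```python
import time


def run_job(service, tool, data, poll_seconds=30.0, max_polls=1000, sleep=time.sleep):
    """Submit one job, poll until it finishes, and return its results.

    `service` is any object with submit(tool, data) -> job_id,
    status(job_id) -> str and results(job_id) -> bytes. These method
    names are hypothetical and are not the WebMGA API.
    """
    job_id = service.submit(tool, data)
    for _ in range(max_polls):
        state = service.status(job_id)
        if state == "finished":
            return service.results(job_id)
        if state == "failed":
            raise RuntimeError(f"{tool} job {job_id} failed")
        sleep(poll_seconds)  # wait before asking again
    raise TimeoutError(f"{tool} job {job_id} did not finish in time")


def run_chain(service, tools, data, **kwargs):
    """Run tools in order, feeding each job's output to the next job."""
    for tool in tools:
        data = run_job(service, tool, data, **kwargs)
    return data


class FakeService:
    """Offline stand-in that finishes every job on the second status check."""

    def __init__(self):
        self.jobs = {}

    def submit(self, tool, data):
        job_id = f"job{len(self.jobs) + 1}"
        self.jobs[job_id] = {"tool": tool, "data": data, "checks": 0}
        return job_id

    def status(self, job_id):
        self.jobs[job_id]["checks"] += 1
        return "finished" if self.jobs[job_id]["checks"] >= 2 else "running"

    def results(self, job_id):
        job = self.jobs[job_id]
        # Pretend the tool appended its name to the input.
        return job["data"] + b"|" + job["tool"].encode()
```

For example, `run_chain(FakeService(), ["rrna", "trna", "orf"], fasta_bytes, sleep=lambda s: None)` runs three chained jobs without waiting. A real client would replace FakeService with a class that issues HTTP requests to the web services documented on the WebMGA web site.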

Computational time and throughput

Three datasets were used to test the performance of the tools in WebMGA. The first is a metagenomic sample (ID F3T1Le1) selected from a core gut microbiome study [39], which contains 555,853 reads with an average length of 252 bp. The second dataset contains 571,261 ORFs with an average length of 66 letters, predicted from the first dataset using Metagene [14] with default parameters. The third dataset, which contains 33 16S rRNA samples from the same study [39], has 817,942 16S rRNA reads spanning the V6 variable region (average length 78 bp).

The wall time and total CPU time for each tool to process the above datasets are listed in Table 1. Fast tools like sequence statistics, file conversion, quality control, rRNA-scan and ORF calling use only one CPU core; clustering tools use 4 cores in parallel; other relatively time-consuming jobs use up to 40 cores. When our cluster has enough free cores for WebMGA, about 50% and 75% of jobs can complete within 10 minutes and 1 hour respectively. All jobs need less than 3 hours except the slowest pathway annotation against KEGG, which needs about 20 hours.

Table 1 Computational time and throughput for each tool of WebMGA

We allocated 80 CPU cores from our cluster for the WebMGA server to use exclusively. With this computational capacity, WebMGA can process hundreds of jobs per day with most tools (Table 1). For example, the daily throughput for ORF-finder is about 23,000 jobs based on the second dataset. Function and pathway annotations are the bottlenecks, but WebMGA can still process from 3 datasets (KEGG) to more than one hundred datasets (e.g. COG) in a day. WebMGA allows only one KEGG job, using up to 40 cores, to run at a time so that other fast jobs can be completed quickly.
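
As a rough cross-check of these figures, daily throughput can be related to the CPU time of a single job. The sketch below assumes an idealized model that is ours, not the authors': all 80 cores are kept fully busy for 24 hours, each job uses one core, and scheduling overhead is ignored. The function names are hypothetical. Under these assumptions, a throughput of about 23,000 ORF-finder jobs per day corresponds to roughly 300 CPU seconds per job; the measured times in Table 1 remain the authoritative values.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_throughput(cores, cpu_seconds_per_job):
    """Jobs per day if every core is busy all day (idealized upper bound)."""
    return cores * SECONDS_PER_DAY / cpu_seconds_per_job


def implied_cpu_seconds(cores, jobs_per_day):
    """CPU seconds per job implied by a given daily throughput."""
    return cores * SECONDS_PER_DAY / jobs_per_day


# 80 dedicated cores and about 23,000 ORF-finder jobs per day (from the text).
orf_cpu_seconds = implied_cpu_seconds(80, 23_000)
print(f"{orf_cpu_seconds:.1f} CPU seconds per ORF-finder job")
```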

Example

To illustrate the application of WebMGA, we annotated the first test dataset (i.e. the core gut microbiome sample F3T1Le1) using the template script Rammcap_client_submit.pl. Since this dataset was already filtered by the original authors, we skipped the quality control, duplicate clustering and FILTER-HUMAN steps. The annotation summaries are outlined in Table 2. The results are comparable to those published in reference [39]. For example, the relative abundance of COG categories annotated in this example shows no visible difference from that in the original publication (Supplementary Figure 17b) [39].

Table 2 Annotation summary for example dataset

Comparison to other web servers

In metagenomics, MG-RAST and CAMERA are the dominant web servers that provide online data analysis. Both resources have been constantly busy, and many jobs submitted to them wait a long time for completion. For example, we submitted gut sample F3T1Le1 to both MG-RAST and CAMERA for annotation, which took 5 days and 12 hours, respectively. WebMGA took 4.5 hours to annotate the same dataset using the RAMMCAP pipeline. WebMGA thus adds computational resources to meet the increasing need for metagenomic data analysis.

Compared with both MG-RAST and CAMERA, the most important advantage of WebMGA is the flexibility to run user-customized analysis pipelines with client scripts in addition to the web server interface. MG-RAST has a fixed annotation pipeline that users cannot modify; a fixed pipeline is essential for comparing annotations of different samples. However, it is not suitable for all the diverse requirements of metagenomic studies, where researchers need to use different tools and different parameters. CAMERA has many analysis workflows that can process user-uploaded data, but these tools can only be used interactively by users who are logged in.

MG-RAST and WebMGA share many common procedures such as quality control, filtering and clustering, but they also apply different methods or resources for the same type of annotation. Here are some examples: (1) MG-RAST treats reads whose first 50 bases are identical as duplicates, whereas WebMGA uses CD-HIT-454 for this purpose. MG-RAST's method is faster but may miss duplicates with sequence errors (indels and wrong base calls) within the first 50 bases. CD-HIT-454 is slightly slower, but it is more sensitive and can pick up the duplicates missed by MG-RAST. (2) For host-associated samples, MG-RAST uses bowtie [40] to identify near-identical matches to host reference sequences and removes these reads as host contamination. WebMGA uses a slower but more sensitive method, FR-HIT, for human-contamination removal. (3) For ORF calling, MG-RAST uses FragGeneScan, while WebMGA allows users to choose from ORF-finder, Metagene and FragGeneScan.
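
The trade-off in example (1) can be made concrete with a minimal sketch of the prefix rule as it is described above: reads are grouped by their first 50 bases, and any group with more than one read is reported as duplicates. This is neither MG-RAST's code nor CD-HIT-454; the function name and the toy reads are hypothetical. The third read differs from the first by a single base call inside the prefix, so the prefix rule does not group it with the others.

```python
from collections import defaultdict


def prefix_duplicates(reads, prefix_len=50):
    """Group read identifiers whose first `prefix_len` bases are identical."""
    groups = defaultdict(list)
    for read_id, seq in reads.items():
        groups[seq[:prefix_len].upper()].append(read_id)
    # Only groups with at least two reads count as duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]


original = "ACGT" * 15                            # 60 bases
longer_copy = original + "TT"                     # same first 50 bases
miscalled = original[:10] + "A" + original[11:]   # one wrong base at position 10

reads = {"r1": original, "r2": longer_copy, "r3": miscalled}
print(prefix_duplicates(reads))
```

Here r1 and r2 are grouped together while r3 is missed, which is the kind of case where an alignment-based clustering method can be more sensitive.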

CAMERA and WebMGA also have many methods in common, mostly because CAMERA adopted the RAMMCAP pipeline we developed. However, WebMGA has many unique tools, such as Filter-HUMAN, RDP-binning, FR-HIT-binning and CD-HIT-OTU, that CAMERA lacks.

Conclusions

In order to assist researchers in the metagenomics field to deal with data analysis challenges, we implemented WebMGA with very fast algorithms and effective methods. With WebMGA, users can use many individual tools and assemble the tools into a pipeline for more complicated analysis through web browsers or client-side scripts. We are in the process of developing new tools and validating more public tools so that, in the future, more rapid tools and pipelines will be added to WebMGA server.

Availability and requirements

References

  1. Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, Remington K, Eisen JA, Heidelberg KB, Manning G, Li W, Jaroszewski L, Cieplak P, Miller CS, Li H, Mashiyama ST, Joachimiak MP, van Belle C, Chandonia JM, Soergel DA, Zhai Y, Natarajan K, Lee S, Raphael BJ, Bafna V, Friedman R, Brenner SE, Godzik A, Eisenberg D, Dixon JE, Taylor SS, et al: The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families. PLoS Biol. 2007, 5 (3): e16-10.1371/journal.pbio.0050016.
  2. Gill SR, Pop M, Deboy RT, Eckburg PB, Turnbaugh PJ, Samuel BS, Gordon JI, Relman DA, Fraser-Liggett CM, Nelson KE: Metagenomic analysis of the human distal gut microbiome. Science. 2006, 312 (5778): 1355-1359. 10.1126/science.1124234.
  3. Dinsdale EA, Edwards RA, Hall D, Angly F, Breitbart M, Brulc JM, Furlan M, Desnues C, Haynes M, Li L, McDaniel L, Moran MA, Nelson KE, Nilsson C, Olson R, Paul J, Brito BR, Ruan Y, Swan BK, Stevens R, Valentine DL, Thurber RV, Wegley L, White BA, Rohwer F: Functional metagenomic profiling of nine biomes. Nature. 2008, 452 (7187): 629-632. 10.1038/nature06810.
  4. Pennisi E: Metagenomics. Massive microbial sequence project proposed. Science. 2007, 315 (5820): 1781-
  5. Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, Podar M, Short JM, Mathur EJ, Detter JC, Bork P, Hugenholtz P, Rubin EM: Comparative metagenomics of microbial communities. Science. 2005, 308 (5721): 554-557. 10.1126/science.1107851.
  6. Shendure J, Ji H: Next-generation DNA sequencing. Nat Biotechnol. 2008, 26 (10): 1135-1145. 10.1038/nbt1486.
  7. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-410.
  8. Niu B, Zhu Z, Fu L, Wu S, Li W: FR-HIT, a Very Fast Program to Recruit Metagenomic Reads to Homologous Reference Genomes. Bioinformatics. 2011, 27 (12):
  9. McHardy AC, Martin HG, Tsirigos A, Hugenholtz P, Rigoutsos I: Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods. 2007, 4 (1): 63-72. 10.1038/nmeth976.
  10. Krause L, Diaz NN, Goesmann A, Kelley S, Nattkemper TW, Rohwer F, Edwards RA, Stoye J: Phylogenetic classification of short environmental DNA fragments. Nucleic Acids Res. 2008, 36 (7): 2230-2239. 10.1093/nar/gkn038.
  11. Huson DH, Auch AF, Qi J, Schuster SC: MEGAN analysis of metagenomic data. Genome Res. 2007, 17 (3): 377-386. 10.1101/gr.5969107.
  12. Mavromatis K, Ivanova N, Barry K, Shapiro H, Goltsman E, McHardy AC, Rigoutsos I, Salamov A, Korzeniewski F, Land M, Lapidus A, Grigoriev I, Richardson P, Hugenholtz P, Kyrpides NC: Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat Methods. 2007, 4 (6): 495-500. 10.1038/nmeth1043.
  13. Schloss PD, Handelsman J: A statistical toolbox for metagenomics: assessing functional diversity in microbial communities. BMC Bioinformatics. 2008, 9: 34-10.1186/1471-2105-9-34.
  14. Noguchi H, Park J, Takagi T: MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res. 2006, 34 (19): 5623-5630. 10.1093/nar/gkl723.
  15. Rho M, Tang H, Ye Y: FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Res. 2010
  16. Huang Y, Gilna P, Li WZ: Identification of ribosomal RNA genes in metagenomic fragments. Bioinformatics. 2009, 25 (10): 1338-1340. 10.1093/bioinformatics/btp161.
  17. Huang Y, Niu B, Gao Y, Fu L, Li W: CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010, 26 (5): 680-682. 10.1093/bioinformatics/btq003.
  18. Li WZ, Godzik A: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006, 22 (13): 1658-1659. 10.1093/bioinformatics/btl158.
  19. Li WZ, Jaroszewski L, Godzik A: Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics. 2001, 17 (3): 282-283. 10.1093/bioinformatics/17.3.282.
  20. Li WZ, Jaroszewski L, Godzik A: Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics. 2002, 18 (1): 77-82. 10.1093/bioinformatics/18.1.77.
  21. Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K, Yang H, Wang J: De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 2010, 20 (2): 265-272. 10.1101/gr.097261.109.
  22. Rodriguez-Brito B, Rohwer F, Edwards RA: An application of statistics to comparative metagenomics. BMC Bioinformatics. 2006, 7: 162-10.1186/1471-2105-7-162.
  23. Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, Yooseph S, Wu D, Eisen JA, Hoffman JM, Remington K, Beeson K, Tran B, Smith H, Baden-Tillson H, Stewart C, Thorpe J, Freeman J, Andrews-Pfannkoch C, Venter JE, Li K, Kravitz S, Heidelberg JF, Utterback T, Rogers YH, Falcon LI, Souza V, Bonilla-Rosso G, Eguiarte LE, Karl DM, Sathyendranath S, et al: The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific. PLoS Biol. 2007, 5 (3): e77-10.1371/journal.pbio.0050077.
  24. Li W, Wooley JC, Godzik A: Probing metagenomics by rapid cluster analysis of very large datasets. PLoS ONE. 2008, 3 (10): e3375-10.1371/journal.pone.0003375.
  25. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, Sahl JW, Stres B, Thallinger GG, Van Horn DJ, Weber CF: Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009, 75 (23): 7537-7541. 10.1128/AEM.01541-09.
  26. Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, Fierer N, Pena AG, Goodrich JK, Gordon JI, Huttley GA, Kelley ST, Knights D, Koenig JE, Ley RE, Lozupone CA, McDonald D, Muegge BD, Pirrung M, Reeder J, Sevinsky JR, Turnbaugh PJ, Walters WA, Widmann J, Yatsunenko T, Zaneveld J, Knight R: QIIME allows analysis of high-throughput community sequencing data. Nat Methods. 2010, 7 (5): 335-336. 10.1038/nmeth.f.303.
  27. Li W: Analysis and comparison of very large metagenomes with fast clustering and functional annotation. BMC Bioinformatics. 2009, 10: 359-10.1186/1471-2105-10-359.
  28. Aziz RK, Bartels D, Best AA, DeJongh M, Disz T, Edwards RA, Formsma K, Gerdes S, Glass EM, Kubal M, Meyer F, Olsen GJ, Olson R, Osterman AL, Overbeek RA, McNeil LK, Paarmann D, Paczian T, Parrello B, Pusch GD, Reich C, Stevens R, Vassieva O, Vonstein V, Wilke A, Zagnitko O: The RAST Server: rapid annotations using subsystems technology. BMC Genomics. 2008, 9: 75-10.1186/1471-2164-9-75.
  29. Seshadri R, Kravitz SA, Smarr L, Gilna P, Frazier M: CAMERA: A Community Resource for Metagenomics. PLoS Biol. 2007, 5 (3): e75-10.1371/journal.pbio.0050075.
  30. Cox MP, Peterson DA, Biggs PJ: SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics. 2010, 11: 485-10.1186/1471-2105-11-485.
  31. Niu B, Fu L, Sun S, Li W: Artificial and natural duplicates in pyrosequencing reads of metagenomic data. BMC Bioinformatics. 2010, 11: 187-10.1186/1471-2105-11-187.
  32. Szymanski M, Barciszewska MZ, Erdmann VA, Barciszewski J: 5S Ribosomal RNA Database. Nucleic Acids Res. 2002, 30 (1): 176-178. 10.1093/nar/30.1.176.
  33. Wuyts J, Perriere G, Van De Peer Y: The European ribosomal RNA database. Nucleic Acids Res. 2004, 32 (Database): D101-103.
  34. Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, Peplies J, Glockner FO: SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 2007, 35 (21): 7188-7196. 10.1093/nar/gkm864.
  35. Lowe TM, Eddy SR: tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997, 25 (5): 955-964. 10.1093/nar/25.5.955.
  36. Eddy SR: A new generation of homology search tools based on probabilistic inference. Genome Inform. 2009, 23 (1): 205-211.
  37. Wang Q, Garrity GM, Tiedje JM, Cole JR: Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol. 2007, 73 (16): 5261-5267. 10.1128/AEM.00062-07.
  38. Sun Y, Cai Y, Liu L, Yu F, Farrell ML, McKendree W, Farmerie W: ESPRIT: estimating species richness using large collections of 16S rRNA pyrosequences. Nucleic Acids Res. 2009, 37 (10): e76-10.1093/nar/gkp285.
  39. Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, Ley RE, Sogin ML, Jones WJ, Roe BA, Affourtit JP, Egholm M, Henrissat B, Heath AC, Knight R, Gordon JI: A core gut microbiome in obese and lean twins. Nature. 2009, 457 (7228): 480-U487. 10.1038/nature07540.
  40. Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10 (3): R25-10.1186/gb-2009-10-3-r25.

Acknowledgements

This study was supported by Award R01HG005978 from the National Human Genome Research Institute (NHGRI) and Award R01RR025030 from National Center for Research Resources (NCRR). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NHGRI, NCRR or the National Institutes of Health.

Author information

Authors and Affiliations

  1. Center for Research in Biological Systems, University of California San Diego, La Jolla, California, 92093, USA
    Sitao Wu, Zhengwei Zhu, Liming Fu, Beifang Niu & Weizhong Li

Corresponding author

Correspondence to Weizhong Li.

Additional information

Authors' contributions

SW, ZZ and WL contributed to system concept. SW and ZZ implemented the system and performed major programming work. LM and BL contributed to the development of CD-HIT software and FR-HIT software, respectively. SW, ZZ and WL coordinated this work, contributed the data analysis and wrote the manuscript. All authors read and approved the final manuscript.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Wu, S., Zhu, Z., Fu, L. et al. WebMGA: a customizable web server for fast metagenomic sequence analysis. BMC Genomics 12, 444 (2011). https://doi.org/10.1186/1471-2164-12-444
