The SILVA ribosomal RNA gene database project: improved data processing and web-based tools
Abstract
SILVA (from Latin silva, forest, http://www.arb-silva.de) is a comprehensive web resource for up-to-date, quality-controlled databases of aligned ribosomal RNA (rRNA) gene sequences from the Bacteria, Archaea and Eukaryota domains and supplementary online services. The database release described here, release 111 (July 2012), contains 3 194 778 small subunit and 288 717 large subunit rRNA gene sequences. Since the initial description of the project, substantial new features have been introduced, including advanced quality control procedures, an improved rRNA gene aligner, online tools for probe and primer evaluation and optimized browsing, searching and downloading on the website. Furthermore, the extensively curated SILVA taxonomy and the new non-redundant SILVA datasets provide an ideal reference for high-throughput classification of data from next-generation sequencing approaches.
INTRODUCTION
Sequencing the ribosomal RNA (rRNA) gene is the method of choice for nucleic acid-based detection and identification of microbes, their taxonomic assignment, phylogenetic analysis and the investigation of microbial diversity. Consequently, vast amounts of rRNA gene sequence data (more than 3.5 million sequences as of July 2012) have accumulated and are publicly available via the International Nucleotide Sequence Database Collaboration (INSDC) databases (1). While the quantity of data further increases the relevance of rRNA genes for marker gene studies in all domains of life, it also creates significant challenges for data management and curation. For optimal utility, the sequences must be extracted and checked for quality, the annotations must be updated and extended to reflect current understanding and, finally, all the data must be provided in a coherent, easily accessible manner. These tasks are beyond the scope of the INSDC databases and are therefore performed by domain-specific databases. The Ribosomal Database Project (RDP-II) (2,3) and greengenes (4) both cover the domains Archaea and Bacteria for small subunit rRNA gene (SSU) sequences. The SILVA project also includes the Eukaryota, thus covering all three domains of life. Furthermore, SILVA offers databases for both the SSU and the large subunit rRNA gene (LSU).
The SILVA databases are made available as releases, rather than being updated continuously, to enhance the comparability of the studies employing these databases. Each release is numbered according to the EMBL-Bank release from which the sequence data were extracted and is permanently available for download via the SILVA website. Best efforts are made to provide two full releases per year. The database releases are structured into two datasets for each gene: SILVA Parc and SILVA Ref. The Parc datasets comprise the entire SILVA databases for the respective gene, whereas the Ref datasets represent a subset of the Parc comprising only high-quality nearly full-length sequences.
All SILVA datasets contain a rich set of contextual and sequence-associated information. This includes taxonomic classifications from several taxonomy providers, a multiple sequence alignment, type strain information and the latest valid nomenclature. All sequences are quality checked. The corresponding data are made available as ARB files (5) as well as in FASTA and comma-separated value (CSV) formats. Finally, they can be browsed directly via the SILVA website. The combination of SILVA datasets with the ARB software suite provides an advanced workbench for researchers to perform in-depth sequence analysis and phylogenetic reconstructions, as well as manual curation of rRNA gene datasets. The flat-file exports of the SILVA datasets make it easy to integrate SILVA as a source for reference data in next-generation sequencing (NGS) analysis pipelines such as MOTHUR (6), QIIME (7) or MG-RAST (8).
Since its first release in 2007, 16 full releases have been published by the SILVA project. Many improvements to both the release preparation process and the features offered by the project website (http://www.arb-silva.de) have been made. The group of SILVA users has grown to include thousands of researchers worldwide who visit the website regularly to obtain the recent database releases and to employ the SILVA online services in their work. In the first part of this update paper, we describe the most significant changes to data processing within SILVA. The second part outlines the new or improved functions available on the SILVA website.
DATABASES
rRNA gene prediction
The selection of rRNA gene candidate sequences based on annotations in the EMBL-Bank source database is now complemented by hidden Markov model-based rRNA gene prediction. All sequences in EMBL-Bank are scanned for rRNA gene sequences using the models and parameters from the RNAmmer software package (9), HMMER2 (http://hmmer.janelia.org/) and a custom pipeline component. The gene boundaries of the predicted rRNA gene are determined during sequence alignment. Conflicts between EMBL-Bank annotations and predictions are resolved by giving priority to the EMBL-Bank annotation. The source for the SILVA annotation is documented in the field ann_src_slv. As of release 111, the SILVA SSU database contains 53 950 sequences detected solely by the RNAmmer models and 1 537 342 sequences that were both annotated as rRNA and detected by the RNAmmer models. The LSU database contains 17 828 sequences detected solely by RNAmmer models and 17 563 sequences both annotated as rRNA and detected by the RNAmmer models.
Quality control/quality assurance
The quality criteria employed by SILVA to ensure that only reliable sequence information is included in the SILVA databases have been improved and fine-tuned. The sequence alignment is now used to determine which parts of EMBL-Bank annotated rRNA gene sequences extend beyond the boundaries of the SSU or LSU gene. The fraction of ambiguous bases and the fraction of bases comprising long (>4 bp) homopolymers are now confined to the region within the respective rRNA gene. Vector contaminations are now only searched for outside the rRNA gene boundaries of a sequence, and the vector value gives the length of the vector contaminant relative to the number of in-gene bases. All three values are reported in percent. The overall ‘Sequence Quality’ value gives, in percent, the averaged fraction to which the thresholds for the three criteria were expended.
Sequences that exceed the threshold for any criterion are rejected.
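As a rough illustration of the three sequence quality criteria, the following Python sketch computes the percentage of ambiguous bases and of bases in long homopolymers for an in-gene sequence, and then averages how much of each threshold has been used up. It is a literal reading of the description above, not the SILVA pipeline's code: the function names, the treatment of every character other than A, C, G, T and U as ambiguous, and the rule that a sequence passes when no value exceeds its threshold are assumptions, and whether SILVA reports this average directly or a value derived from it is not specified here. The default of 2% is the common threshold reported for the sequence quality metrics.

```python
import re

# Assumed common threshold (in percent) for ambiguities, homopolymers and vector.
DEFAULT_THRESHOLDS = (2.0, 2.0, 2.0)


def ambiguity_percent(seq: str) -> float:
    """Percent of in-gene bases that are not A, C, G, T or U (assumed definition)."""
    seq = seq.upper()
    ambiguous = sum(1 for base in seq if base not in "ACGTU")
    return 100.0 * ambiguous / len(seq)


def homopolymer_percent(seq: str) -> float:
    """Percent of in-gene bases that lie in homopolymer runs longer than 4 bp."""
    seq = seq.upper()
    # One base followed by at least four copies of itself: a run of 5 or more.
    runs = re.finditer(r"([ACGTU])\1{4,}", seq)
    return 100.0 * sum(len(m.group(0)) for m in runs) / len(seq)


def threshold_usage_percent(values, thresholds=DEFAULT_THRESHOLDS) -> float:
    """Average of value/threshold over the criteria, expressed in percent."""
    ratios = [value / limit for value, limit in zip(values, thresholds)]
    return 100.0 * sum(ratios) / len(ratios)


def within_thresholds(values, thresholds=DEFAULT_THRESHOLDS) -> bool:
    """True if no criterion exceeds its threshold (assumed rejection rule)."""
    return all(value <= limit for value, limit in zip(values, thresholds))
```

For example, a sequence with 1.0% ambiguous bases, 0.5% homopolymer bases and no vector contamination uses up 50%, 25% and 0% of the respective thresholds, so `threshold_usage_percent([1.0, 0.5, 0.0])` averages these to 25.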
The ‘Alignment Quality’ of a sequence is determined by three values reported by the SINA alignment software (10): the alignment score, the base pair score and, as of release 111, the alignment identity (for details, see ‘Aligner’ section). Sequences are rejected based on these values to achieve specificity of the SILVA databases, correcting over-prediction by the RNAmmer models as well as removing sequences wrongly annotated as rRNA genes. The alignment quality thresholds for the different datasets can be found in Table 1.
Table 1.
List of alignment quality thresholds used to exclude sequences from the different SILVA datasets
Criterion | LSU Parc | LSU Ref | SSU Parc | SSU Ref |
---|---|---|---|---|
Alignment length (bp) | 300 | 1900 | 300 | Bacteria/Eukaryota: 1200; Archaea: 900 |
Alignment identity (%) | 40 | 60 | 50 | 70 |
Alignment score (quality) | 30 | 30 | 40 | 50 |
Base pair score | 30 | 30 | 30 | 30 |
The thresholds upon which sequences are rejected are based on the statistical analyses performed in Schweer (11). They were selected as a conservative balance between rejecting too many valid sequences and keeping too many questionable sequences. A common threshold of 2% was found to be best for the sequence quality metrics.
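The thresholds in Table 1 can be read as a simple per-dataset filter. The sketch below is a minimal illustration in Python, assuming that a sequence is retained when each of its values is greater than or equal to the corresponding threshold; the dictionary layout, key names and function name are hypothetical and do not describe the actual SILVA pipeline.

```python
# Minimum values per dataset, transcribed from Table 1.
SSU_REF_LENGTH = {"Archaea": 900, "Bacteria": 1200, "Eukaryota": 1200}

THRESHOLDS = {
    "LSU Parc": {"length": 300, "identity": 40, "score": 30, "bp_score": 30},
    "LSU Ref": {"length": 1900, "identity": 60, "score": 30, "bp_score": 30},
    "SSU Parc": {"length": 300, "identity": 50, "score": 40, "bp_score": 30},
    "SSU Ref": {"length": SSU_REF_LENGTH, "identity": 70, "score": 50, "bp_score": 30},
}


def meets_thresholds(dataset, domain, length, identity, score, bp_score):
    """Return True if a sequence reaches every minimum for the given dataset."""
    limits = THRESHOLDS[dataset]
    min_length = limits["length"]
    if isinstance(min_length, dict):
        # SSU Ref uses a domain-specific minimum alignment length.
        min_length = min_length[domain]
    return (
        length >= min_length
        and identity >= limits["identity"]
        and score >= limits["score"]
        and bp_score >= limits["bp_score"]
    )
```

Under these assumptions, a 950 bp sequence with identity 85, alignment score 90 and base pair score 95 would qualify for SSU Ref if it is archaeal, but only for SSU Parc if it is bacterial.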
Prediction of potentially anomalous sequences, such as chimeras, remains unchanged with respect to the original SILVA publication. No filtering is performed using this metric because it is difficult to differentiate clearly between artefacts of the sequence acquisition process and unusual yet natural evolutionary events. Whether, and at which threshold, potentially anomalous sequences need to be excluded should be decided in light of the specific research question and the experimental setup. We therefore leave such filtering to the individual researchers.
SILVA taxonomy
A substantial revision of the classification of all bacterial and archaeal sequences in the Ref datasets was first published with SILVA release 100. Based on the ‘guide trees’, all taxonomic assignments are manually curated and follow Bergey’s Manual of Systematic Bacteriology (12). Specifically, Archaea, Cyanobacteria, Chloroflexi and Chlorobi are based on volume 1; Proteobacteria on volume 2; Firmicutes on volume 3; Bacteroidetes, Spirochaetes, Tenericutes (Mollicutes), Acidobacteria, Fibrobacteres, Fusobacteria, Dictyoglomi, Gemmatimonadetes, Lentisphaerae, Verrucomicrobia, Chlamydiae and Planctomycetes on volume 4 and, finally, Actinobacteria on volume 5. Since taxonomy and species are dynamic entities with rapid turnover, name changes and taxonomic outlines are also adapted from the List of Prokaryotic Names with Standing in Nomenclature (LPSN) (13).
Although the classification is mainly based on these authoritative resources, deviations from their recommendations do exist: the classification is a phylogenetic tree-based process and differences from the original description and classifications are to be expected. For example, the genus Ahrensia (type species accession: D88524) is classified under family Rhodobacteraceae of Alphaproteobacteria; however, in the SSU Ref guide tree, this genus is grouped together with members of family Phyllobacteriaceae. Normally, such discrepancies are accommodated by introducing polyphyletic groups; however, in this case genus Ahrensia is classified within Phyllobacteriaceae due to high sequence identities (>94%) observed with other members of this family.
The LPSN resource is further used to track down names without standing in nomenclature (not validly published taxa) and Candidatus taxa. The inclusion of the two latter categories is a unique feature of the SILVA taxonomy. Furthermore, collaborations with domain experts have been established to annotate uncultured clades. Examples include the OCS116 clade (14), the SAGMC and SAGME groups (15) and the termite clusters (16).
For an improved and unified taxonomy for Eukaryota based on 18S rRNA gene sequences, the Eukaryotic Taxonomy Working Group (ETWG) was founded in October 2011. The first version of the new eukaryotic taxonomy was deployed with SILVA release 111. Specifically, the taxonomy of protist lineages has been reconciled with the International Society of Protistologists (ISOP) publication (17). An early draft from the ISOP committee (Adl et al. 2012, manuscript in preparation) was used to further improve protist classification where possible. Higher-level ranks have been revised for higher plants, fungi and animals. By implementing the classification of ISOP, their concept of ‘rankless’ taxa was introduced to the SILVA taxonomy. That is, the position of a taxon in the taxonomic hierarchy no longer necessarily implies a rank. Although this concept is biologically sound, we recognize the difficulties that this may bring in computational analyses. Therefore, a file containing classification rank mappings is provided with the new eukaryotic taxonomy. These mappings assign reasonable ranks to taxa in order to make the different levels comparable.
Third-party contextual data
Several fields containing additional contextual data have been added to the SILVA databases over the last years. Basic fields include organism name, author, title, publication ID, collection, submission and modification dates as well as latitude/longitude, depth, habitat and taxonomic classifications by various other databases. Tables detailing the fields available in the current release can be found in the Frequently Asked Questions (FAQ) section of the webpage (http://www.arb-silva.de/documentation/background/faqs/).
The ‘strain’ field, carrying the strain data imported from EMBL-Bank, is now extended with information from third-party sources. In addition to the tag ‘(T)’ used by EMBL-Bank to mark a sequence as type strain, the following tags are used by SILVA:
- the label ‘e[G]’ is added if an entry is part of the list of genomes offered by the EMBL-Bank,
- the label ‘l[T]’ is added if the entry is part of the type strain datasets of ‘The All-Species Living Tree’ project (18,19),
- the label ‘s[T]’ is added if an entry is listed as a type strain by the StrainInfo project (20),
- the label ‘s[C]’ is added if an entry is a cultured strain according to the StrainInfo project and
- the label ‘r[T]’ is added if an entry is listed as a type strain by the RDP-II project.
Furthermore, manually curated habitat descriptors and other contextual information are incorporated where available based on information provided by the megx.net project (21).
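The strain tags listed above lend themselves to simple programmatic extraction. The following Python sketch assumes that the tags occur as literal substrings of the exported ‘strain’ field; the exact layout of that field is not specified here, and the example value, the regular expression and the function name are hypothetical.

```python
import re

# Meaning of each tag, as listed in the text.
TAG_MEANINGS = {
    "(T)": "type strain according to EMBL-Bank",
    "e[G]": "part of the EMBL-Bank list of genomes",
    "l[T]": "type strain in the All-Species Living Tree datasets",
    "s[T]": "type strain according to StrainInfo",
    "s[C]": "cultured strain according to StrainInfo",
    "r[T]": "type strain according to RDP-II",
}

# Matches "(T)" or a one-letter source code followed by a bracketed flag.
TAG_PATTERN = re.compile(r"\(T\)|[elsr]\[[GTC]\]")


def strain_tags(strain_field: str) -> set:
    """Return the set of known tags found in a strain field."""
    return {tag for tag in TAG_PATTERN.findall(strain_field) if tag in TAG_MEANINGS}


def is_type_strain(strain_field: str) -> bool:
    """True if any provider marks the entry as a type strain."""
    return any(tag.endswith("T)") or tag.endswith("[T]") for tag in strain_tags(strain_field))


if __name__ == "__main__":
    example = "DSM 30083 (T) l[T] s[T] r[T]"  # hypothetical field content
    for tag in sorted(strain_tags(example)):
        print(tag, "=", TAG_MEANINGS[tag])
```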
Datasets
Ref
The basic criteria for inclusion of sequences in the high-quality full-length Ref datasets have remained unchanged since 2007. Briefly, for SSU Ref, archaeal sequences must be at least 900 bases long, bacterial and eukaryotic sequences must be at least 1200 bases long and all sequences must have an alignment score of at least 50. Since release 111, sequences must also have an alignment identity of at least 70%. Furthermore, sequences from large-scale submissions, such as those made by the mouse wound microbiome project, the human skin microbiome project or the Guerrero Negro hypersaline microbial mat project, are removed from the SSU Ref and provided in a separate dataset. Please refer to the SILVA website for information on which projects were removed from each respective database release. Criteria for LSU Ref datasets can be found in Table 1.
SSU Ref NR
For users interested in a representative collection of SSU rRNA gene sequences, the SILVA project offers a non-redundant (NR) version of the SSU Ref dataset. The Ref NR dataset is created by clustering at 99% (up to SILVA 108) and 98% (SILVA 111) sequence identity using UCLUST (22). Of each cluster, only the longest sequence is kept. Sequences from cultivated species including type strains and multiple operons are preserved in all cases to serve as an anchor for taxonomy. The resulting SSU Ref NR dataset is significantly smaller (25% of the Ref dataset as of release 111) than the full Ref dataset and has a more even phylogenetic distribution of sequences. We recommend this dataset to be used as the standard SILVA reference dataset for rRNA gene-based classification, phylogenetic analysis and probe design.
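The selection step that follows clustering can be sketched as below. This is an illustration only: it assumes that cluster assignments (for example from UCLUST) are already available, and the record fields `acc`, `cluster`, `length` and `cultured` as well as the function name are hypothetical. Within each cluster the longest sequence is kept, and sequences flagged as cultivated are retained regardless.

```python
from collections import defaultdict


def select_non_redundant(records):
    """Pick the accessions to keep from pre-clustered sequence records.

    records: iterable of dicts with the (hypothetical) keys
    'acc', 'cluster', 'length' and 'cultured'.
    """
    by_cluster = defaultdict(list)
    for rec in records:
        by_cluster[rec["cluster"]].append(rec)

    kept = []
    for members in by_cluster.values():
        # Representative of the cluster: its longest member.
        longest = max(members, key=lambda rec: rec["length"])
        for rec in members:
            # Cultivated sequences are preserved in all cases.
            if rec is longest or rec["cultured"]:
                kept.append(rec["acc"])
    return sorted(kept)


if __name__ == "__main__":
    demo = [
        {"acc": "SEQ1", "cluster": 0, "length": 1450, "cultured": False},
        {"acc": "SEQ2", "cluster": 0, "length": 1502, "cultured": False},
        {"acc": "SEQ3", "cluster": 0, "length": 1390, "cultured": True},
        {"acc": "SEQ4", "cluster": 1, "length": 1301, "cultured": False},
    ]
    print(select_non_redundant(demo))
```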
Living tree project
The ‘All-Species’ Living Tree Project (LTP) is a multi-partner initiative coordinated by the journal Systematic and Applied Microbiology (Elsevier publishers) in cooperation with the ARB, LPSN and SILVA projects and promoted by the SILVA web resource. Its main objective is to provide highly curated ribosomal 16S and 23S RNA sequence datasets of all type strains representing the up-to-date described bacterial and archaeal diversity. The LTP database is kept updated according to the changes in nomenclature and new descriptions of taxa that are effectively published in the International Journal of Systematic and Evolutionary Microbiology. New type-strain sequences are carefully examined with respect to their sequence quality and associated (meta)data, and by manual inspection of the alignment. This process results in: (i) finding the best available SSU/LSU entry that may represent a species; (ii) providing corrected organism-name information plus other LTP-specific (meta)data and (iii) propagating the alignment improvements to the SILVA seed alignment. These very small but taxonomically ‘comprehensive’ datasets are frequently used for taxonomic and classification purposes and are useful as test datasets for developers.
Further developments
Data retrieval
The semantic interpretations of ‘gene’, ‘product’ and ‘note’ feature qualifiers were modified to avoid overlapping/duplicate entries in the SILVA database due to insufficiently annotated EMBL-Bank entries.
Aligner
The SILVA database preparation pipeline now employs an updated version (1.2.10) of SINA to compute the multiple sequence alignment provided with the databases. Please refer to Pruesse et al. (10) for a detailed description of SINA. Briefly, SINA is a reference-based alignment tool, designed to maintain high alignment accuracy while allowing for high-volume sequence processing. For each sequence, SINA selects a set of similar sequences from the given reference alignment and constructs a directed acyclic graph (DAG) representation of the alignment of these reference sequences to be used as alignment template. The computed alignment of the query sequence and the DAG template is optimal under the constraint that no columns may be added to the alignment.
The base pair score reported by SINA is an indicator for the degree to which the expected secondary structure is met by the aligned sequence. The SILVA alignments each include a global secondary structure, defining which pairs of alignment columns are expected to bond. The SINA ‘bp score’ is calculated as the average binding score of the column pairs covered by the aligned sequence. The alignment identity (SINA version 1.2.10 and above) reports the highest fractional identity between the query sequence and any sequence in the reference alignment.
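To make the notion of alignment identity concrete, the following Python sketch takes the maximum fractional identity between an aligned query and each row of a reference alignment. The way fractional identity is normalized here (matching columns divided by the number of query bases) is an assumption chosen for simplicity; SINA's own definition may differ, and the function names are hypothetical.

```python
GAP_CHARS = "-."


def fractional_identity(query: str, reference: str) -> float:
    """Fraction of query bases that match the reference in the same column."""
    if len(query) != len(reference):
        raise ValueError("aligned sequences must have the same number of columns")
    query_bases = 0
    matches = 0
    for q, r in zip(query.upper(), reference.upper()):
        if q in GAP_CHARS:
            continue  # columns without a query base are ignored
        query_bases += 1
        if q == r:
            matches += 1
    return matches / query_bases if query_bases else 0.0


def alignment_identity(query: str, references) -> float:
    """Highest fractional identity between the query and any reference row."""
    return max(fractional_identity(query, ref) for ref in references)


if __name__ == "__main__":
    aligned_query = "ACG-U"
    reference_rows = ["AGG-U", "UCGAA"]
    print(alignment_identity(aligned_query, reference_rows))
```

In this toy alignment the query shares three of its four bases with the first reference row and two with the second, so the reported identity is 0.75.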
Internal reference datasets
Several reference datasets are used during database preparation. These are the alignment SEED, a vector sequence database and a collection of non-chimeric sequences. Each of these datasets has been continuously improved and extended with trusted data. The vector sequence database is available for download via the website archive. The alignment SEED and the collection of non-chimeric sequences derived from the SEED, however, must remain undisclosed. While we would prefer to make these datasets available, we are not free to do so as they contain sequences obtained under the promise of confidentiality. The SEED is extended with additional sequences when a specific phylogenetic branch is found to require a more detailed alignment. The sequences leading to such findings are typically novel and required much effort by the respective author to ensure their validity, making it impossible for us to also ask for publication rights. However, once those sequences are published, they will automatically appear in the next SILVA release. The intersection between the SEED and the Ref can be obtained via the field ‘align_log_slv’. SINA guarantees that sequences identical to (or sub-sequences of) reference sequences retain the exact alignment of the respective reference sequence and marks the sequences accordingly. The dataset used to evaluate SINA was prepared by selecting those sequences from the Ref 108 and is available for download on the SILVA website. However, for optimal alignment and classification results we recommend using the Ref NR dataset as the reference dataset. While this dataset is about five times larger than the SEED and is not composed purely of manually curated sequences, it has, by way of its construction, a relatively even sequence distribution and good phylogenetic completeness.
SILVA WEBSITE
The SILVA website comprises core database access features, several online tools and an extensive, regularly updated set of documentation pages. The documentation pages provide tutorials for all SILVA tools and functionalities, FAQs and a detailed documentation page for each released database. The website also hosts information for partner projects and collaborations such as LTP and ETWG.
Core database access features
The SILVA databases can be accessed online via the Taxonomy Browser and Search pages. The Browser implements a hierarchical view on the database contents, similar to a file browser, visualizing any of the taxonomies included with SILVA (SILVA, LTP, RDP-II, greengenes and EMBL-Bank). The Search page supports keyword matches on a variety of fields (publication details, organism name, DOI/PubMed identifiers, etc.) as well as range filters on numeric descriptors (sequence length, quality values and dates). Multiple accession numbers can be matched by pasting comma separated lists or ranges of accession numbers in the respective field. Furthermore, searches can be constrained to the Ref, Ref NR or LTP datasets and restricted to the contents of the Cart.
The Cart system connects the different tools on the SILVA website by storing a set of sequences of interest. The Cart’s contents can be modified and displayed in both the Browser and the Search. When sequences are added to the Cart, these sequences will be highlighted by the Browser and sequence counts will be shown for each displayed taxonomic group. In line with the Cart metaphor, it is also possible to have the Cart’s contents prepared for download. Export files can be generated in ARB and FASTA formats, with and without alignment and optionally compressed.
Within the Search, the Cart allows complex questions to be expressed as a series of simple queries. For example, searching for all sequences marked as type strain by StrainInfo but not by RDP-II can be achieved by first adding all sequences that contain the strain tag ‘s[T]’ to the Cart and then removing those sequences whose strain field is marked with ‘r[T]’.
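The two-step Cart query described above amounts to a set difference. A minimal Python sketch of the same logic on locally exported data follows; it assumes a mapping from accession to strain field, and the accessions, field contents and function name are hypothetical.

```python
def type_strains_by_straininfo_only(strain_by_acc):
    """Accessions tagged 's[T]' (StrainInfo) but not 'r[T]' (RDP-II)."""
    # Step 1: add everything carrying the StrainInfo type strain tag.
    cart = {acc for acc, strain in strain_by_acc.items() if "s[T]" in strain}
    # Step 2: remove everything that RDP-II also marks as type strain.
    cart -= {acc for acc, strain in strain_by_acc.items() if "r[T]" in strain}
    return cart


if __name__ == "__main__":
    entries = {
        "ACC0001": "DSM 1 (T) s[T] r[T]",
        "ACC0002": "LMG 2 s[T]",
        "ACC0003": "isolate 3 s[C]",
    }
    print(sorted(type_strains_by_straininfo_only(entries)))
```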
Alignment, sequence-based search and classification
The Aligner page allows submitting sequence data for processing with SINA. FASTA files containing up to 1000 sequences (≤6000 nt each) can be uploaded, and the sequences will be aligned using the same reference alignments that are employed to prepare the SILVA databases. While all SINA parameters are configurable in the web form, the parameters will default to the same values that were used to prepare the most recent SILVA release.
Optionally, the aligned sequences can be passed through the search stage of SINA to find closely related sequences in the Parc, Ref or Ref NR datasets. The query sequences are compared with the sequences in the selected dataset based on the SINA alignment. The search results can then be added to the Cart, and thereby accessed from the other components of the website, for example to prepare a download or to inspect the taxonomic groups containing sequences similar to the submitted query sequences.
It is also possible to classify the submitted sequences based on the search results. For each query sequence, the classifications assigned to the matched sequences are consolidated using a lowest common ancestor approach. The resulting classification is included with the aligned sequence data.
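The lowest common ancestor consolidation can be illustrated with a few lines of Python. The sketch assumes that each matched sequence carries its classification as a semicolon-separated path from the root, and it returns the longest prefix shared by all paths; the path format, the example lineages and the function name are illustrative assumptions rather than the website's implementation.

```python
def lowest_common_ancestor(paths):
    """Return the deepest taxonomy path shared by all given paths."""
    if not paths:
        return ""
    split_paths = [path.strip(";").split(";") for path in paths]
    common = []
    # Walk down the hierarchy level by level until the paths diverge.
    for names in zip(*split_paths):
        if all(name == names[0] for name in names):
            common.append(names[0])
        else:
            break
    return ";".join(common)


if __name__ == "__main__":
    hits = [
        "Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Phyllobacteriaceae",
        "Bacteria;Proteobacteria;Alphaproteobacteria;Rhodobacterales;Rhodobacteraceae",
    ]
    print(lowest_common_ancestor(hits))
```

With the two example hits, the consolidated classification stops at Alphaproteobacteria, the last level on which both paths agree.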
Probe and primer evaluation
Signature sequences are essential to many methods employed in the investigation of microbial communities. Since these signatures are derived from previously characterized sequences, they can only be accurate to the degree to which diversity was covered by available data at the time of their design. As more data become available over time, signature sequences must be regularly re-evaluated and their suitability re-affirmed. In order to make this process as simple as possible, the TestProbe and TestPrime tools are now offered on the SILVA website. Users can base their calculations on the entire SSU or LSU Parc datasets, or on the Ref or Ref NR subsets. The results can be downloaded as CSV files and matched sequences can be added to the Cart for subsequent download. Both TestProbe and TestPrime are cross-linked with the oligonucleotide signature database probeBase (23).
TestProbe
The probe match and evaluation tool tests and visualizes in silico target group coverage of rRNA gene-targeting probes and single primers, optionally containing ambiguity codes, against the SILVA datasets. The tool can be configured to allow up to five mismatches between the probe and target sequences. Mismatches can also be weighted. The results are shown in three tables: an overview of the number and position of mismatches, a per-taxon summary table and a third table listing individual matching sequences with sequence names, accession numbers and a graphical representation of the probe’s binding site within all matches.
TestPrime
Similar to the TestProbe tool, TestPrime allows searching for all sequences within the SILVA databases that are targeted by a pair of primers (optionally containing ambiguity codes). The maximal number of allowed base mismatches can be configured, as well as a ‘zero tolerance zone’ at the 3′-end of the primers. Coverage as well as match, mismatch and no data information are shown in overview pie charts (Figure 1). The graphical results are complemented by two tables similar to those of TestProbe.
Figure 1.
Screenshot of the Taxonomy Browser showing TestPrime results for two universal primers for evaluation.
The in silico stage of TestProbe and TestPrime is built upon the ARB ‘PT server’. First, the signature sequences are resolved into sets of ambiguity-free oligonucleotides. Each oligonucleotide is then analyzed by the PT server according to the configured stringency parameters. The results are sorted into three groups: match, mismatch and no data. The last group contains all sequences for which no clear decision could be made. Most commonly, this is the case when the sequence in question does not cover the oligonucleotide match position (short or partial sequences).
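The first and last steps of this procedure can be sketched in Python as follows. The code expands IUPAC ambiguity codes into all ambiguity-free oligonucleotides and sorts a target site into match, mismatch or no data. It is a plain stand-in, not the ARB PT server: the function names are hypothetical, the oligonucleotide is assumed to be given in the same sense as the target, and the target bases at the match position are assumed to have been extracted beforehand.

```python
from itertools import product

# Standard IUPAC nucleotide codes; U is treated as T.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_ambiguities(oligo: str):
    """Resolve a signature sequence into all ambiguity-free oligonucleotides."""
    choices = [IUPAC[base] for base in oligo.upper()]
    return ["".join(combo) for combo in product(*choices)]


def classify(target_site, oligo: str, max_mismatches: int = 0) -> str:
    """Sort a target into 'match', 'mismatch' or 'no data'.

    target_site: bases of the target at the match position, or None if the
    sequence does not cover that position (short or partial sequences).
    """
    if target_site is None or len(target_site) < len(oligo):
        return "no data"
    site = target_site.upper().replace("U", "T")
    # Fewest mismatches over all ambiguity-free variants of the oligonucleotide.
    best = min(
        sum(a != b for a, b in zip(variant, site))
        for variant in expand_ambiguities(oligo)
    )
    return "match" if best <= max_mismatches else "mismatch"


if __name__ == "__main__":
    print(expand_ambiguities("ACRY"))
    print(classify("ACGU", "ACRY"))
    print(classify("ACTT", "ACRY", max_mismatches=1))
    print(classify(None, "ACRY"))
```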
SILVA direct link API
Direct linking into the browser is supported by URLs of the form: http://www.arb-silva.de/browser/{lsu,ssu}/.
Direct linking into the search is supported by URLs of the form: http://www.arb-silva.de/search/show/{lsu,ssu}/FIELD/TERM[/FIELD/TERM[…]], where FIELD is a search field and TERM is the corresponding search term.
Up to four pairs of search fields and search terms are allowed. A description of the available search fields is documented on the website. They include ‘acc’ for INSDC accession numbers, ‘name’ for organism name and ‘pubid’ for PubMedID or DOI.
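A direct search link could be assembled as in the following Python sketch. It assumes that the pairs of search fields and search terms are appended to the search URL as successive path segments, in line with the description above; the helper name is hypothetical, and the resulting links should be checked against the website documentation before use.

```python
from urllib.parse import quote

SEARCH_BASE = "http://www.arb-silva.de/search/show"


def search_link(gene: str, pairs) -> str:
    """Build a direct search URL from (field, term) pairs.

    gene: 'ssu' or 'lsu'; pairs: between one and four (field, term) tuples.
    """
    if gene not in ("ssu", "lsu"):
        raise ValueError("gene must be 'ssu' or 'lsu'")
    if not 1 <= len(pairs) <= 4:
        raise ValueError("between one and four field/term pairs are allowed")
    segments = []
    for field, term in pairs:
        # Percent-encode each segment so that spaces and slashes stay intact.
        segments.append(quote(field, safe=""))
        segments.append(quote(term, safe=""))
    return "/".join([SEARCH_BASE, gene] + segments)


if __name__ == "__main__":
    # Accession of the Ahrensia type species mentioned in the taxonomy section.
    print(search_link("ssu", [("acc", "D88524")]))
```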
SILVA entries are already directly linked from various sources, including the EMBL-Bank, the GenBank, probeBase and StrainInfo.net.
OUTLOOK
The full impact of the ‘data deluge’ originating from the advent of NGS technologies has not yet influenced the amount of long, assembled sequence data such as full-length rRNA gene sequences. However, the growth of rRNA gene data has already made it impossible to apply many comparative analysis methods to comprehensive datasets.
We expect that tree reconstruction for the complete SSU Ref datasets will become infeasible in the near future. Taxonomy curation will then be based on the smaller SSU Ref NR dataset. Classifications for all other sequences in the SILVA databases will be created with a high-throughput approach using this curated Ref NR taxonomy as a reference.
Our databases are already popular in high-throughput analysis pipelines and we expect the importance of these applications to further increase in the future. We are, therefore, committed to enhancing the usability of our reference datasets for these applications. The work on our taxonomy, in particular on its eukaryotic branch, will be of direct and transparent benefit to analysis procedures relying on the SILVA databases.
FUNDING
Max Planck Society; Deutsche Forschungsgemeinschaft [GL 553/4-1]. Funding for open access charge: Deutsche Forschungsgemeinschaft.
Conflict of interest statement. None declared.
ACKNOWLEDGEMENTS
The authors would like to thank Wolfgang Ludwig and Ralf Westram for expert assistance with the ARB software suite, the alignments and phylogeny. They greatly appreciate the help the SILVA users have rendered with critical evaluation and feedback on the SILVA databases and tools. They would also like to thank the RDP-II, StrainInfo and probeBase teams, as well as our taxonomy collaboration partners for their support and many fruitful discussions.
REFERENCES
- 1. Cochrane G, Karsch-Mizrachi I, Nakamura Y. The international nucleotide sequence database collaboration. Nucleic Acids Res. 2011;39:D15–D18. doi: 10.1093/nar/gkq1150.
- 2. Cole JR, Chai B, Farris RJ, Wang Q, Kulam-Syed-Mohideen AS, McGarrell DM, Bandela AM, Cardenas E, Garrity GM, Tiedje JM. The ribosomal database project (RDP-II): introducing myRDP space and quality controlled public data. Nucleic Acids Res. 2007;35:D169–D172. doi: 10.1093/nar/gkl889.
- 3. Cole JR, Wang Q, Cardenas E, Fish J, Chai B, Farris RJ, Kulam-Syed-Mohideen AS, McGarrell DM, Marsh T, Garrity GM, et al. The ribosomal database project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res. 2009;37:D141–D145. doi: 10.1093/nar/gkn879.
- 4. DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K, Huber T, Dalevi D, Hu P, Andersen GL. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl. Environ. Microbiol. 2006;72:5069–5072. doi: 10.1128/AEM.03006-05.
- 5. Ludwig W, Strunk O, Westram R, Richter L, Meier H, Yadhukumar, Buchner A, Lai T, Steppi S, Jobb G, et al. ARB: a software environment for sequence data. Nucleic Acids Res. 2004;32:1363–1371. doi: 10.1093/nar/gkh293.
- 6. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl. Environ. Microbiol. 2009;75:7537–7541. doi: 10.1128/AEM.01541-09.
- 7. Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, Fierer N, Pena AG, Goodrich JK, Gordon JI, et al. QIIME allows analysis of high-throughput community sequencing data. Nat. Methods. 2010;7:335–336. doi: 10.1038/nmeth.f.303.
- 8. Meyer F, Paarmann D, D'Souza M, Olson R, Glass EM, Kubal M, Paczian T, Rodriguez A, Stevens R, Wilke A, et al. The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics. 2008;9:386. doi: 10.1186/1471-2105-9-386.
- 9. Lagesen K, Hallin P, Andreas Rodland E, Staerfeldt H-H, Rognes T, Ussery DW. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 2007;35:3100–3108. doi: 10.1093/nar/gkm160.
- 10. Pruesse E, Peplies J, Glöckner FO. SINA: accurate high throughput multiple sequence alignment of ribosomal RNA genes. Bioinformatics. 2012;28:1823–1829. doi: 10.1093/bioinformatics/bts252.
- 11. Schweer T. Thesis. Germany: University of Applied Sciences Bingen; 2011. Qualitätsmanagement ribosomaler RNA sequencen in der SILVA datenbank.
- 12. Garrity GM, Jonson KL, Bell J, Searles DB. Taxonomic Outline of the Prokaryotes. New York: Springer-Verlag; 2002.
- 13. Euzéby JP. List of bacterial names with standing in nomenclature: a folder available on the internet. Int. J. Syst. Bacteriol. 1997;47:590–592. doi: 10.1099/00207713-47-2-590.
- 14. Morris RM, Vergin KL, Cho JC, Rappe MS, Carlson CA, Giovannoni SJ. Temporal and spatial response of bacterioplankton lineages to annual convective overturn at the Bermuda atlantic time-series study site. Limnol. Oceanogr. 2005;50:1687–1696.
- 15. Takai K, Moser DP, DeFlaun M, Onstott TC, Fredrickson JK. Archaeal diversity in waters from deep South African gold mines. Appl. Environ. Microbiol. 2001;67:5750–5760. doi: 10.1128/AEM.67.21.5750-5760.2001.
- 16. Köhler T, Stingl U, Meuser K, Brune A. Novel lineages of planctomycetes densely colonize the alkaline gut of soil-feeding termites (Cubitermes spp.). Environ. Microbiol. 2008;10:1260–1270. doi: 10.1111/j.1462-2920.2007.01540.x.
- 17. Adl SM, Simpson AGB, Farmer MA, Andersen RA, Anderson OR, Barta JR, Bowser SS, Brugerolle GUY, Fensome RA, Fredericq S, et al. The new higher level classification of eukaryotes with emphasis on the taxonomy of protists. J. Eukaryot. Microbiol. 2005;52:399–451. doi: 10.1111/j.1550-7408.2005.00053.x.
- 18. Yarza P, Richter M, Peplies J, Euzeby J, Amann R, Schleifer KH, Ludwig W, Glöckner FO, Rossello-Mora R. The all-species living tree project: a 16S rRNA-based phylogenetic tree of all sequenced type strains. Syst. Appl. Microbiol. 2008;31:241–250. doi: 10.1016/j.syapm.2008.07.001.
- 19. Munoz R, Yarza P, Ludwig W, Euzeby J, Amann R, Schleifer KH, Glockner FO, Rossello-Mora R. Release LTPs104 of the all-species living tree. Syst. Appl. Microbiol. 2011;34:169–170. doi: 10.1016/j.syapm.2011.03.001.
- 20. Dawyndt P, Vancanneyt M, De Meyer H, Swings J. Knowledge accumulation and resolution of data inconsistencies during the integration of microbial information sources. IEEE T Knowl. Data En. 2005;17:1111–1126.
- 21. Kottmann R, Kostadinov I, Duhaime MB, Buttigieg PL, Yilmaz P, Hankeln W, Waldmann J, Glöckner FO. Megx.net: integrated database resource for marine ecological genomics. Nucleic Acids Res. 2010;38:D391–D395. doi: 10.1093/nar/gkp918.
- 22. Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010;26:2460–2461. doi: 10.1093/bioinformatics/btq461.
- 23. Loy A, Maixner F, Wagner M, Horn M. probeBase–an online resource for rRNA-targeted oligonucleotide probes: new features 2007. Nucleic Acids Res. 2007;35:D800–D804. doi: 10.1093/nar/gkl856.