The Protist Ribosomal Reference database (PR2): a catalog of unicellular eukaryote Small Sub-Unit rRNA sequences with curated taxonomy

Abstract

The interrogation of genetic markers in environmental meta-barcoding studies is currently seriously hindered by the lack of taxonomically curated reference data sets for the targeted genes. The Protist Ribosomal Reference database (PR2, http://ssu-rrna.org/) provides unique access to eukaryotic small sub-unit (SSU) ribosomal RNA and DNA sequences with curated taxonomy. The database mainly consists of nuclear-encoded protistan sequences. However, metazoans, land plants, macrosporic fungi and eukaryotic organelles (mitochondrion, plastid and others) are also included because they are useful for the analysis of high-throughput sequencing data sets. Introns and putative chimeric sequences have also been carefully checked. The taxonomic assignation of each sequence consists of eight unique taxonomic fields. In total, 136 866 sequences are nuclear encoded and 45 708 (36 051 mitochondrial and 9657 chloroplastic) are from organelles, the remaining being putative chimeric sequences. The website allows users to download sequences from the entire database or from parts of it (including representative sequences after clustering at a given level of similarity). Different web tools also allow searches by sequence similarity. The presence of both rRNA and rDNA sequences, taking into account introns (crucial for eukaryotic sequences), a normalized eight-term ranked taxonomy and updates following new GenBank releases were made possible by a long-term collaboration between experts in taxonomy and computer scientists.

INTRODUCTION

The modern definition of the term ‘protist’ refers to unicellular eukaryotes that are either free-living or parasitic, sometimes forming colonies, but without clear differentiation into tissues. This includes all eukaryotes other than land plants (and macro-algae), animals and fungi with differentiated tissues. Protists are notoriously paraphyletic and include a wide range of microorganisms using a huge variety of reproductive, nutritional and life-history strategies. Nevertheless, the term protist has pragmatic uses and has recently gained in popularity. Large-scale analysis of protistan diversity is complicated by their heterogeneity, which reflects their extremely broad distribution and implication in multiple ecological and functional processes. This difficulty is exacerbated by the following facts: (i) species delineation is often obscure owing to lack of clear morphological criteria and paucity of knowledge concerning processes of sexual recombination; (ii) the taxonomy of protists has been radically modified in recent decades in light of new phylogenetic data; and (iii) a large proportion of protists are probably still not cultivable or yet unknown. Molecular barcoding using SSU rRNA (Small Sub-Unit Ribosomal) gene sequences consequently has become extremely popular among protistologists. Environmental barcoding has unveiled an extensive genetic diversity of protists in a wide range of ecosystems (1,2), including lineages only known by their genetic signatures (orphan environmental sequences). Recently, the use of next generation sequencing (NGS) technologies targeting selected domains of the SSU rRNA gene has permitted ecological studies of complex assemblages at ever increasing scales (3–7). However, interpretation of such data is currently seriously hindered by the lack of taxonomically curated reference data sets. 
Unassigned and incorrectly assigned sequences are accumulating at an increasing and alarming rate in public databases, to the extent that in early 2012, almost 20% of submitted SSU rRNA eukaryotic gene sequences had no or a very poor taxonomic assignation (see the website for more details). Undetected chimeric sequences (8), as well as the presence of introns in gene sequences (9), are also problematic.

To facilitate and increase the efficiency and accuracy of analyses of NGS data sets, we here present the first comprehensive curated database that places eukaryotic SSU rRNA gene sequences within a coherent ranked taxonomic framework covering eukaryotic diversity. Every sequence was quality checked and annotated using a multi-level taxonomic assignation. As many protists are still known only from their environmental sequences, cluster names were retained when the formal taxonomy was missing [such as Syndiniales (10) and Marine STramenopiles, MAST (11)]. Although curated in less detail, sequences from metazoa, land plants and macrosporic fungi, as well as eukaryotic organelles (mitochondria, plastids, etc.), are also included in the database for their ecological interest. Protists may live in close association with metazoans (commensalism, symbiosis, etc.), and very small metazoans exist that inhabit similar ecological niches. For example, copepods and polychaetes, as well as benthic animal larvae, coexist with planktonic protists in aquatic systems. These animals may also be of great interest in ecological studies (as predators, for example), even for protistologists. Although this database is dedicated to protists, such outgroup sequences are highly relevant for separating these groups out in further analyses of NGS data sets when ‘universal’ eukaryotic primers are used for polymerase chain reaction (PCR) amplification. Including metazoan sequences in PR2 prevents them from being wrongly identified as new deep lineages of protists.

MATERIALS AND METHODS

The construction of this database started >10 years ago, and our procedure has been optimized over time (the recent history is detailed at http://ssu-rrna.org/method.html). Here, we briefly describe the present general architecture of the database.

Entries containing at least one partial SSU rRNA gene sequence of eukaryotic origin are retrieved from three public databases using keywords. Our last update retrieved 484 657, 496 462 and 123 such entries from GenBank, EMBL and WGS-EMBL, respectively. An INSDC (http://www.insdc.org/) entry, as defined by its accession number in public databases, may contain several rRNA gene sequences, e.g. in long genomic fragments containing several partial or complete ribosomal operons. To accommodate several sequences within a single entry, each sequence was given a unique identifier, acc.p1.p2, where acc is the accession number of the entry containing the sequence, and p1 and p2 are the first and last positions of the sub-sequence within the complete sequence.

A majority of extracted sequences were shorter than 100 nucleotides or around 500 nucleotides (63% of retrieved sequences), likely resulting from the recent integration of short environmental sequences derived from clone libraries. Only sequences longer than 799 nt were considered.

The first step was the identification of sequences originating from organelles. A reference database of SSU-rRNA gene sequences from chloroplasts and mitochondria was constructed using entire genomes or genomic fragments that contained a SSU-rRNA gene sequence and a protein-coding gene specific to either mitochondria or chloroplasts. For derived organelles such as apicoplasts, hydrogenosomes and nucleomorphs, databases were built manually, using information found in scientific publications. These databases were used to determine, by sequence similarity, the origin of every sequence in the database. Organelle sequences were assigned to a reduced taxonomic framework that includes their location (such as |Organelle|chloro-SSU| or |Organelle|mito-SSU|). Their taxonomy is not detailed further in the database.

Introns were found to be a major problem in eukaryotic rRNA sequences compared with prokaryotic sequences (1536 sequences with intron(s) described, 10 644 sequences with introns found by computation). A dedicated C++ algorithm was developed to identify the presence of introns in the remaining sequences (9). When detected, sequences with and without the intron(s) were generated (rRNA and rDNA sequences).
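The pairing of rDNA and rRNA versions described above can be illustrated with a minimal Python sketch. It does not reproduce the dedicated C++ detection algorithm (9); it only shows how an intron-free sequence could be derived once intron positions are known. The function name and the 1-based, inclusive coordinate convention are assumptions made for this example.

```python
def splice_introns(rdna, introns):
    """Return the putative rRNA obtained by removing introns from an rDNA sequence.

    introns: list of (start, end) pairs, 1-based and inclusive (assumed convention),
    which must not overlap.
    """
    kept = []
    cursor = 0  # 0-based index of the next base that has not been processed yet
    for start, end in sorted(introns):
        if start < 1 or end > len(rdna) or start > end or start - 1 < cursor:
            raise ValueError(f"invalid or overlapping intron: {(start, end)}")
        kept.append(rdna[cursor:start - 1])  # exonic stretch before this intron
        cursor = end                         # skip the intron itself
    kept.append(rdna[cursor:])               # exonic stretch after the last intron
    return "".join(kept)
```

Keeping both the input (rDNA) and the output (rRNA) of such a step is what allows a query amplified from genomic DNA to be compared with either form.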

Sequences in the PR2 database are assigned an identifier in the form accession.p1.p2_X, where accession is the accession number of an entry, p1 and p2 are the first and last positions of this sequence within the larger genomic entry and X describes how introns were treated [X = G: genomic sequence containing a described intron (rDNA); X = R: the same genomic sequence with the described intron(s) removed (rRNA); X = U: no intron described, but intron(s) may be present; X = UC: introns were detected in silico and removed from the sequence (putative rRNA)].
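For readers who process these identifiers in scripts, the following Python sketch splits an identifier into its four parts. The names `SeqId`, `parse_seq_id` and `INTRON_STATUS` are hypothetical, and the regular expression assumes that the accession itself contains no dot, which holds for the examples given in this article.

```python
import re
from typing import NamedTuple

# Meaning of the suffix X, as described in the text
INTRON_STATUS = {
    "G": "genomic sequence containing a described intron (rDNA)",
    "R": "described intron(s) removed (rRNA)",
    "U": "no intron described, but intron(s) may be present",
    "UC": "intron(s) detected in silico and removed (putative rRNA)",
}

class SeqId(NamedTuple):
    accession: str
    start: int   # p1, first position within the entry
    end: int     # p2, last position within the entry
    status: str  # one of the keys of INTRON_STATUS

_ID_RE = re.compile(r"^(?P<acc>[^.|\s]+)\.(?P<p1>\d+)\.(?P<p2>\d+)_(?P<x>UC|G|R|U)$")

def parse_seq_id(identifier):
    """Split 'accession.p1.p2_X' into its components (a leading '>' is tolerated)."""
    match = _ID_RE.match(identifier.lstrip(">"))
    if match is None:
        raise ValueError(f"not a PR2-style identifier: {identifier!r}")
    return SeqId(match["acc"], int(match["p1"]), int(match["p2"]), match["x"])
```

For instance, `parse_seq_id("EF023694.1.1975_U")` yields the accession EF023694, positions 1 and 1975, and the status U.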

Taxonomy of nuclear-encoded sequences

As all SSU-rRNA genes are orthologs, a global phylogeny can be built, and essential past speciation events can be evidenced. This property is essential to build a ranked taxonomy. For example, at rank 1, there is world-wide agreement to recognize three clades, Bacteria, Archaea and Eukaryota. We chose to additionally use ‘Organelle’ as rank 1. Organelles have a eukaryotic origin when they are nucleomorphs and a bacterial origin when they are mitochondria or plastids. Because the evolution of organelles and that of their hosts differ over time, their taxonomies differ too. In addition, scientists working on diversity are more interested in the identification of the cells that bear such organelles. Our choice was thus to allow their easy identification (and filtering out) during the first step of an analysis, by labelling them as ‘Organelle’ at rank 1.

Nomenclature and terms of the following ranks mainly follow the classification of eukaryotes proposed by Adl et al. (12). Thus, the second rank describes each eukaryotic ‘Super-Group’ or Phylum (both terms are in use in different communities): Alveolata, Amoebozoa, Apusozoa, Archaeplastida, Excavata, Opisthokonta, Rhizaria or stramenopiles. The taxonomic descriptions are structured into eight ranks, and the remaining ranks mainly correspond to the division, class, order, family, genus and species.

The terms used for each rank are non-ambiguous (a term cannot be found in two different clades), contain no space (which may pose problems for computer programs) and were retained whenever possible if monophyletic. When monophyly could not be ensured, the term of the rank above was used, appended with the suffix _X (or with an additional X if the term of the rank above already ended in _X). As the same species name frequently occurs in different genera, the species name is composed of the genus and species, using ‘+’ as a separator (e.g. genus = Diderma, species = Diderma+niveum). Genus and species names from public databases are stored in separate fields for comparison.

For protists and unicellular fungi, a taxonomy was proposed by the group of experts, authoring this article. For multicellular fungi, plants and metazoans, the taxonomy was built mostly using the taxonomy assigned in National Center for Biotechnology Information (NCBI)’s GenBank database entries. We first built a core reference database containing 23 116 manually analysed sequences representative of eukaryotic diversity. These analyses included reading published articles and phylogenetic analyses done by the authors of this article when necessary. This core reference database was subsequently used to automatically annotate the remaining sequences using different methods.

We are aware that for some clades such as metazoa, plants and fungi, our eight terms taxonomy is probably not as precise as it should be. Barcoding of metazoa and plants using SSU-rRNA sequences is not often used (normally only to complement Internal Transcribed Spacer (ITS) sequences). We will therefore try in a next release to propose an extended, still ranked and unified, taxonomy for fungi.

An offshoot of PR2 is the web-based tool KeyDNAtools (http://keydnatools.com/). It uses 159 982 specific short (15 nt) oligonucleotide sequences (named keys) generated from the core reference database. Each key is a signature present in sequences of a given clade, but not in those of other clades. Besides providing a very fast taxonomic identification, it also allows putative chimeric sequences to be detected, for instance when different identifications are obtained from the 5′ and 3′ ends of a sequence.
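The principle of clade-specific keys can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the KeyDNAtools implementation: here a key is any k-mer seen in at least one sequence of a single clade, a query is assigned to the clade with the most key hits, and a sequence is flagged when its two halves are assigned to different clades. All function names are hypothetical.

```python
from collections import Counter, defaultdict

def kmers(seq, k):
    """Set of all k-mers of a sequence (empty if the sequence is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_keys(reference, k=15):
    """Map each k-mer found in exactly one clade to that clade.

    reference: dict mapping a clade name to a list of sequences.
    """
    owners = defaultdict(set)
    for clade, seqs in reference.items():
        for seq in seqs:
            for kmer in kmers(seq, k):
                owners[kmer].add(clade)
    return {kmer: next(iter(clades)) for kmer, clades in owners.items() if len(clades) == 1}

def assign(seq, keys, k=15):
    """Return the clade with the most key hits, or None when no key is found."""
    hits = Counter(keys[s] for s in kmers(seq, k) if s in keys)
    return hits.most_common(1)[0][0] if hits else None

def looks_chimeric(seq, keys, k=15):
    """Flag a sequence whose 5' and 3' halves are assigned to different clades."""
    half = len(seq) // 2
    left, right = assign(seq[:half], keys, k), assign(seq[half:], keys, k)
    return left is not None and right is not None and left != right
```

Because only exact dictionary lookups are needed, no alignment is required, which is consistent with the speed reported for this kind of approach.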

Specific new computer programs, mostly in C, C++ and Python, have been developed. First, a new parallel, distributed Needleman–Wunsch-based C program computes pair-wise distances without taking into account terminal gaps (partially overlapping sequences) or long internal gaps (introns). This was coupled to a newly rewritten C average linkage clustering program. Second, a new parallel, distributed Needleman–Wunsch-based C++/Python program (Crunch_Assign) assigns a consensus taxonomy to new sequences by comparison to a reference database.
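The idea of a distance that ignores terminal gaps and long internal gaps can be sketched as follows in Python. This is not the authors' C program: the scoring values, the `max_gap` threshold and the function names are assumptions, and a simple linear gap cost is used for brevity.

```python
def align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch alignment in which leading and trailing gaps are free."""
    n, m = len(a), len(b)
    # score[i][j]: best score for a[:i] against b[:j]; first row and column stay 0
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # trailing gaps are free: start the traceback from the best cell of the last row/column
    cells = [(score[n][j], n, j) for j in range(m + 1)]
    cells += [(score[i][m], i, m) for i in range(n + 1)]
    _, i, j = max(cells)
    row_a, row_b = [], []  # built backwards, reversed at the end
    for ch in reversed(a[i:]):
        row_a.append(ch); row_b.append("-")
    for ch in reversed(b[j:]):
        row_a.append("-"); row_b.append(ch)
    while i > 0 and j > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    while i > 0:
        row_a.append(a[i - 1]); row_b.append("-"); i -= 1
    while j > 0:
        row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b))

def distance(aln_a, aln_b, max_gap=10):
    """Fraction of differing columns, skipping terminal gaps and gap runs longer than max_gap."""
    cols = list(zip(aln_a, aln_b))
    while cols and "-" in cols[0]:
        cols.pop(0)            # non-overlapping 5' end
    while cols and "-" in cols[-1]:
        cols.pop()             # non-overlapping 3' end
    compared = differences = 0
    i = 0
    while i < len(cols):
        if "-" in cols[i]:
            j = i
            while j < len(cols) and "-" in cols[j]:
                j += 1
            if j - i <= max_gap:       # short indel: every column counts as a difference
                compared += j - i
                differences += j - i
            i = j                      # long run (putative intron): not counted at all
        else:
            compared += 1
            differences += cols[i][0] != cols[i][1]
            i += 1
    return differences / compared if compared else 0.0
```

With a linear gap cost, a long insertion is recovered as a single internal gap only when the flanking regions are long enough to outweigh its cost; a production implementation would probably need a dedicated treatment of long gaps.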

When a conflict between the taxonomies assigned by the different methods was found, it was resolved manually. In the end, each nuclear-encoded sequence is assigned an identifier in the form of this example:

>AY827845.1.1765_U|Eukaryota|Apusozoa|Hilomonadea|Planomonadida|Planomonadidae|Planomonadidae_Group-1|Ancyromonas|Ancyromonas+sigmoides
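A header of this form can be split into the sequence identifier and the eight ranks with a short Python sketch. The rank labels in `RANKS` are assumptions chosen for readability (the text names the Super-Group, division, class, order, family, genus and species; "domain" is used here for rank 1), and `parse_header` is a hypothetical helper.

```python
# Labels for the eight ranks; "domain" for rank 1 is an assumed label
RANKS = ("domain", "supergroup", "division", "class",
         "order", "family", "genus", "species")

def parse_header(header):
    """Split a nuclear-encoded PR2-style FASTA header into (identifier, {rank: term})."""
    fields = [f.strip() for f in header.lstrip(">").strip().split("|")]
    seq_id, terms = fields[0], fields[1:]
    if len(terms) != len(RANKS):
        raise ValueError(f"expected {len(RANKS)} ranks, found {len(terms)}")
    return seq_id, dict(zip(RANKS, terms))
```

Organelle entries, which carry a reduced taxonomy, would not pass the eight-rank check and would need a separate branch.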

RESULTS

In total, we found 136 866 nuclear-encoded sequences, five pseudo-genes (FJ854546, FJ854545, D14632, AF310844, AJ404858, not included in PR2) and 34 sequences we could only assign as putative rRNA sequences (HM538255, GU385678, AB275106, AJ628837, AY180011, CP000499, CP000499, AY256215, EU402432, AB017015, GQ330639, GU820811, JF488788, AF239231, DQ423737, DQ104596, AY835700, DQ423728, EU545797, GU072272, GU072526, GQ247249, HM174255, DQ104594, EU174762, FN598473, EU726200, EF695080, GQ483783, GQ462590, EU173354, EF567390, EF695215, HQ871039, not included in PR2). Manual analysis of some of the latter revealed artefactual sequence, either internal or at the 5′ or 3′ end. Among nuclear-encoded sequences, we detected 1756 putative chimeric sequences, using KeyDNAtools and/or manual inspection (listed on the website). For example, sequence EF023694.1.1975_U is a chimera between parent sequences of Opisthokonta, Amoebozoa and Rhizaria at positions 179-471, 623-1264 and 1536-1925, respectively. Other ‘18S’ sequences are nucleomorphs (262 sequences). In all, 9657 sequences have a chloroplastic origin, 36 051 are from mitochondria, six from hydrogenosomes (AJ237907, AJ237908, AJ871215, AJ871217, AJ871267, Y16670) and 26 from apicoplasts (U87145, AB471801, AB471802, AB471803, AB471804, AB471805, AB471806, AB471807, AB471808, AB471809, AB471810, AB471811, AB471812, AB649417, AB649418, AB649419, AB649420, AB649421, AB649422, AB649423, AB649424, HQ110105, JQ437257, JQ437258, JQ437259, U28056).

Within nuclear-encoded sequences, 54 entries remained unassigned at the Super-Group level (Table 1), meaning that they could not be assigned to any specific taxonomic group within the domain Eukaryota (Eukaryota_X). The Super-Group ‘Eukaryota_Mikro’ was created for sequences HM563060, AF477623 and HM563061, for which no consensus has been reached on their affiliation, although Haplosporidiidae has been suggested (13). BLAST analyses conducted at NCBI against the non-redundant database, or at the DNA Data Bank of Japan (DDBJ) against all sequences, showed extremely weak sequence similarity with sequences of fungi. Our global similarity tool (Crunch_Assign) found no other sequence similar at ≥80% along the entire sequence. These results led to the creation of this new Super-Group (rank 2). For unassigned nuclear-encoded sequences (Eukaryota_X), either no other similar sequence was found or similar sequences were detected but were also annotated by us as Eukaryota_X. A BLAST search against the NCBI non-redundant database (excluding environmental sequences) and at DDBJ (all) revealed that a large number of them probably contain undescribed introns. These sequences therefore probably require manual curation, and again highlight the importance of intron identification in eukaryotic sequences.

Table 1.

Number of nuclear-encoded sequences in PR2 as annotated at the Super-Group taxonomic level

Super-group                        n1         n2
Alveolata                          20 760     20 255
Amoebozoa                          1902       1880
Apusozoa                           254        242
Archaeplastida                     16 309     16 092
Eukaryota_Mikro                    3          3
Eukaryota_X                        54         54
Excavata                           2871       2869
Hacrobia                           2192       2132
Opisthokonta                       75 056     74 484
Rhizaria                           7581       7459
Stramenopiles                      9884       9640
Total nuclear-encoded Eukaryota    136 866    135 110
Apicoplast                         26         26
Chloroplast SSU                    9657       9657
Hydrogenosome SSU                  6          6
Mitochondrion SSU                  36 051     36 051
Nucleomorph SSU (18S)              264        262

For lower taxonomic ranks, there were primarily two types of cases resulting in a failure to assign a taxonomic identity:

  1. No agreement between experts to resolve at a given rank. For example, the genus (rank 7) is assigned, the order (rank 5) is assigned, but a family (rank 6) has not yet been described, or this rank is in fact polyphyletic, with no proper descriptions of the different families.
  2. A given sequence is similar at the family level with several sequences from different families; however, they agree at the order level.

In such cases, the sequence was assigned as …|Order|Order_X|Genus|Genus+species. If a genus was not described (i.e. uncultured), the taxonomy becomes: …|Order|Order_X|Order_XX|Order_XX+sp.
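The placeholder rule used in these two cases can be written as a small Python sketch. The function name is hypothetical, missing ranks are represented by `None` (an assumption of this example), and the ‘+sp.’ ending for an undescribed species follows the pattern shown in the text.

```python
import re

def fill_missing_ranks(ranks):
    """Replace None entries by placeholders derived from the rank above.

    Order -> Order_X, Order_X -> Order_XX, and a missing species (last rank)
    becomes '<rank above>+sp.'.
    """
    filled = []
    last = len(ranks) - 1
    for index, term in enumerate(ranks):
        if term is None:
            if not filled:
                raise ValueError("rank 1 must always be assigned")
            above = filled[-1]
            if index == last:
                term = above + "+sp."          # undescribed species
            elif re.search(r"_X+$", above):
                term = above + "X"             # rank above is already a placeholder
            else:
                term = above + "_X"            # first placeholder below a named rank
        filled.append(term)
    return filled
```

Every sequence then carries exactly eight terms, so that two sequences can be compared rank by rank without guessing which terms correspond.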

More than 74 000 sequences (54% of the total number of sequences in the PR2 database) belong to Opisthokonta (Figure 1). Alveolata and Archaeplastida are second and third in abundance (15 and 12%, respectively). Stramenopiles and Rhizaria represent 7.2 and 5.6%, respectively. Each of the other Super-Groups represents less than 2.2%. Only 29.4% of sequences are complete or nearly complete. In total, 63.7% of sequences include the V4 region and only 12.1% and 11.7% include the V9 region as recognized by the Biomarks and WAMPS primers (see the legend of Figure 1), respectively. Apusozoa, Hacrobia, Excavata and Opisthokonta have <10% of their sequences that include the V9 region. The V9 regions of Amoebozoa and Archaeplastida are better represented (34% and 25%, respectively, using the Biomarks primers).

Figure 1.

Total number of SSU rDNA gene sequences in the PR2 database for each main eukaryotic lineage (all sequences = grey + black, complete or nearly complete sequences in light-grey). Note that nucleomorphs were extracted from Archaeplastida. Numbers indicated after bars indicate percentages of sequences that include the following: (i) the V4 region as defined by primers forward CCAGCASCYGCGGTAATTCC and reverse ACTTTCGTTCTTGATYRA used during the European Biomarks project; (ii) the V9 region as defined by primers forward GTACACACCGCCCGTC and reverse TGATCCTTCTGCAGGTTCACCTAC used during the European Biomarks project; and (iii) the V9 region defined by primers forward TTGTACACACCGCCC and reverse CCTTCYGCAGGTTCACCTAC used by the WAMPS project. For Opisthokonta, number in white = total number of sequences.

DOWNLOADS

We provide several different ways of downloading the database or part of it (see more explanations at http://ssu-rrna.org/downloads_eukaryotic_main_page.html).

  1. The entire database or sequences of a specific clade can be downloaded using a taxonomy browser under fasta format, with sequence identifiers as described above. Putative chimera have been removed.
  2. The entire database or sequences of major groups can be downloaded under fasta format, with only the short unique identifier. The corresponding taxonomy is then downloaded as a tabulated file. This fasta format is appropriate to use in tools that do not allow for long sequence identifiers. They are also easier to use in large computations, as they spare the memory required. Finally, they are easier to use in pipelines or web sites (see below).
  3. The entire database, taxonomies and sequences under tabulated format, for easy import in relational databases.
  4. The entire database or sequences of a specific clade under fasta format, with sequence identifiers as described above, but after a clustering by sequence similarity (98, 96, 92%) and choosing only the longest sequence as representative of the cluster.
  5. Phylogenetic trees are available for the main groups. They were built using pair-wise distance computations (not taking introns as differences as explained above) and FastMe (14).
  6. Finally, we provide an ‘arb’ filter that makes it easy to import a fasta file (with taxonomy in the identifier) into an arb database, separating sequences and taxonomy as required.
  7. In silico extracted domains corresponding to regions widely used in published articles and corresponding to several couples of primers.
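The choice of a representative in option 4 above (the longest sequence of each cluster) can be illustrated with a minimal Python sketch. The clustering itself is not shown; the input structures and the function name are assumptions, and ties are broken by identifier only to make the result deterministic.

```python
def pick_representatives(clusters, sequences):
    """Return {cluster_id: identifier of the longest member sequence}.

    clusters:  dict mapping a cluster id to a list of sequence identifiers
    sequences: dict mapping a sequence identifier to its sequence
    """
    return {
        cluster_id: max(members, key=lambda sid: (len(sequences[sid]), sid))
        for cluster_id, members in clusters.items()
    }
```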

SEARCHING THE DATABASE

We provide the following additional kinds of tools:

  1. A search by keywords, allowing searches by taxonomy, accession number and PMID (PubMed ID: retrieval of sequences described in a given publication). Retrieved sequences can be filtered according to length, quality and whether they contain the variable V4 or V9 domains (often used in conjunction with deep sequencing).
  2. A search by ‘sequence signature’, with a link to the KeyDNAtools website (http://keydnatools.com/). This tool provides very fast results even for files containing many sequences. It also allows for detection of putative chimera as explained above.
  3. A BLAST search against the database, as usually found on most sites.
  4. A search (Crunch_Assign) using our modified global (Needleman–Wunsch based) algorithm that returns the most similar hits based on the entire alignment of the sequences, and not based on a good local alignment (high scoring pair, in BLAST). As a result, the percentage of similarity computed is more in agreement with what would be found using a Multiple Sequence Alignment [Clustal (15), Muscle (16), MAFFT (17),…] before computing distances. It allows or does not allow accounting for introns as described above.
  5. A search of one or two primer motifs in sequences, returning every sequence that contains the primer(s) with International Union of Pure and Applied Chemistry (IUPAC) encoding allowed and also the possibility of mismatches between primer and sequence (a C program).
  6. In silico extracted domains corresponding to regions widely used in published articles and corresponding to several couples of primers.
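The primer search and the in silico extraction of domains listed above can be sketched in Python as follows. This is not the authors' C program: the function names are hypothetical, the scan is a simple sliding window, and the reverse primer is assumed to be given 5′ to 3′ on the opposite strand, as in the legend of Figure 1.

```python
# IUPAC nucleotide codes and the bases each one stands for
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

def reverse_complement(seq):
    """Reverse complement of a (possibly degenerate) DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def find_primer(seq, primer, max_mismatches=0):
    """Return the 0-based start positions where the primer matches the sequence."""
    seq, primer = seq.upper().replace("U", "T"), primer.upper()
    hits = []
    for start in range(len(seq) - len(primer) + 1):
        mismatches = 0
        for base, code in zip(seq[start:start + len(primer)], primer):
            if base not in IUPAC[code]:
                mismatches += 1
                if mismatches > max_mismatches:
                    break
        else:  # window scanned without exceeding the allowed mismatches
            hits.append(start)
    return hits

def extract_amplicon(seq, forward, reverse, max_mismatches=0):
    """Return the region delimited by both primers (primers included), or None."""
    fwd_hits = find_primer(seq, forward, max_mismatches)
    rev_hits = find_primer(seq, reverse_complement(reverse), max_mismatches)
    for f in fwd_hits:
        for r in rev_hits:
            if r >= f + len(forward):      # reverse site lies downstream of the forward site
                return seq[f:r + len(reverse)]
    return None
```

Applied to a sequence kept with its introns, the length of the returned region gives an estimate of the expected amplicon length.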

Both BLAST and Crunch_Assign similarity searches are coupled to BLAST2Tree or Crunch_Assign2Tree, which use our Scriptree software (18). Similarity search results can simply be copied and pasted into the ‘2Tree section’; a phylogenetic tree is built and displayed on the fly, with taxonomic assignations (as chosen by the user) displayed next to each leaf. This section also allows downloading the pasted sequences and their taxonomy as a tabulated file (19).

CONCLUSION AND PERSPECTIVES

There are presently three databases, SILVA (20), RDP (21) and Green genes (22), offering a curated taxonomy for prokaryotic SSU rRNA sequences. Only SILVA additionally provides reference sequences for SSU-rRNA sequences of eukaryotic origin, curated for sequence quality but using the NCBI taxonomy (although recently a ‘SILVA’ taxonomy is now proposed). Because our sequence identifier, i.e. accession.p1.p2, is similar to that used by SILVA, both databases can be easily compared.

Based on the latest release (111), 1518 of the 71 787 eukaryotic SILVA reference sequences are not present in the PR2 database. Manual checks showed that these sequences were extracted from entries in which no annotation made it possible to identify the presence of a SSU-rRNA sequence, or which were annotated as mRNA or as prokaryotes. In all, 670 sequences identified as mitochondria were not in PR2; none of the SILVA chloroplast sequences was absent from PR2. Missing sequences will soon be analysed and incorporated in PR2. On the other hand, 53 735/7774 nuclear, 31 492/29 763 mitochondrial, 462/18 chloroplastic and 133/80 other organelle sequences present in PR2 were not in the SILVA reference sequences and the entire SILVA database, respectively. This can be largely explained by the drastic filtering steps used by SILVA for both minimal length and sequence quality. However, because we also use such databases to analyse NGS data sets, we identified two major reasons not to filter too drastically on quality. First, representatives of novel environmental clades are often found within clone libraries with lengths of <1000 nt. Second, extreme quality filters may remove important sequences that are representative of environmental groups but are too short and/or of poor quality at one end (one-step Sanger sequencing without enough noise treatment, for example). In PR2, sequence quality was indirectly inferred from the quality of the taxonomic assignation, because bad-quality sequences became poorly assigned. Again, as sequence identifiers are similar in both databases, sequences can easily be compared between them.

The PR2 database possesses several valuable complementary tools or databases lacking in other databases.

A ranked taxonomy

Like the PR2 database, SILVA now offers a taxonomy for eukaryotes based on the structure proposed by Adl et al. (12). However, contrary to SILVA, we propose a normalized eight-term ranked taxonomy for every sequence in the database. This ‘normalization’ stems from our experience in dealing with very large data sets using automated pipelines, and with a depth of sequencing that revealed organisms spanning the entire spectrum of known living organisms. In the NCBI taxonomy, for example, two sequences of Perciformes were described using 22 ranks (AY263842 and EF470892), whereas another (AF112595) was described using only 15 ranks, and 10 360 sequences of Perciformes had between 16 and 21 ranks. Numerous examples exist for protists. A very good example is the genus Carpediemonas. NCBI classifies this genus within Eukaryota (rank 1), Fornicata (rank 2), Carpediemonas (rank 3). However, sequence AY117416 (Carpediemonas membranifera, 23) has no rank 2 taxonomy in its entry. As a result, given only the lists of terms provided by a non-ranked taxonomy for two different sequences, it becomes extremely difficult for a computer to identify which members of the two lists correspond to the same rank. This is the problem solved by our ranked taxonomy, thanks to a worldwide group of taxonomic experts. As an example, the taxonomy of sequence AY117416 becomes Eukaryota|Excavata|Metamonada|Fornicata|Fornicata_Group-2|Carpediomonas-like|Carpediemonas|Carpediemonas+membranifera in PR2. In SILVA, this sequence is linked to a seven-term taxonomy, but this taxonomy is seemingly not ranked and unified.

When occurring, missing ranks are automatically replaced in PR2 (labeled as clade-i_X, where clade-i is the term for the next higher rank). This strategy allows rapidly inferring the taxonomy at the most probable higher rank and provides a rapid method for screening putative novel lineages at each taxonomic level.

Introns

Most SSU rRNA databases and biodiversity analyses of prokaryotes understandably neglect introns. Although found even in Escherichia coli (24,25), introns are rare in Bacteria and not very abundant in Archaea. Even when present, they have not yet been, to our knowledge, described in rRNA gene sequences. However, in Eukaryota, introns can be relatively abundant in rRNA gene sequences at least in some groups (9). This led us to incorporate in our database both the rRNA and the rDNA sequences. As most NGS (or clone library) analyses of the biodiversity are dealing with PCR amplification of extracted gDNA, introns may represent a large part of the variability observed. Having genomic sequences, in addition to the rRNA transcript, in the database is important, not only for searching by similarity but also for the in silico estimation of expected amplicon lengths.

Organelles

Organelles are often poorly treated in reference databases. For hydrogenosomes (AJ237907, AJ871215, AJ871217, AJ871267, Y16670), only sequence AJ871217 can be found in SILVA, labeled as ‘Unclassified’. In GreenGenes, these sequences were not found when searching by accession number. At RDP, the classifier returned ‘unclassified_Bacteria’ in every case. None of the 26 apicoplast sequences was found in the SILVA reference sequences or in ‘ssu-accession-parc.acs’, release 111 (3 186 762 accession numbers). Even for better-known organelles, taxonomic assignation is not much better. For example, sequence AB000109 (mitochondrion of Dictyostelium discoideum) is labeled as ‘Unclassified’ in SILVA. Chloroplasts are generally well identified in SILVA. However, among the chloroplastic sequences detected in this study, 263 were found in SILVA reference sequences as chloroplasts. Building independent databases for these organelles probably allowed us to reach a more precise taxonomic affiliation of organelles. Having such organelles of prokaryotic origin in our database is essential for NGS data sets of both prokaryotes and eukaryotes, because the use of ‘Bacteria’- or ‘Eukaryota’-specific primers has in some cases resulted in a significant proportion of amplicons that are in fact of Organelle origin (3–7). Even if Organelle sequences are simply discarded from the final analysis, this database avoids identifying them as new deep lineages.

Chimeric sequences

Chimeric sequences are PCR-generated hybrid products between multiple parent sequences that can be falsely interpreted as novel organisms, thus inflating apparent diversity (8,26). The two algorithms most widely used for 16S chimera detection are Pintail (27), included in RDP and SILVA databases, and Bellerophon (28) included in GreenGenes. In all cases, chimera are detected by comparing independent regions of a sequence alignment. The KeyDNAtools does not require the prior alignment of sequences, and it is particularly efficient to detect complex chimera having more than two parent sequences, or between two closely related parents. This tool can be used in concert with other detection methods. Our database, which has been screened for putative chimera, offers two possibilities of download: either including or excluding putative chimeric sequences.

Similarity searches

BLAST is a widely used tool that finds regions of local similarity between sequences. However, a search based on a single good local high-scoring pair can give a misleading assignment when the rest of the sequence does not match. We thus developed two independent methods of assignation. The first one, the Crunch_Assign software, uses a Needleman–Wunsch algorithm. It is also faster than BLAST and returns a score computed on the entire alignment. Because we are working on Eukaryotes, we also included the possibility of ignoring putative introns (to our knowledge, this possibility is not included in any other software). The second one, KeyDNAtools, is also very fast and additionally offers chimera detection, as discussed above. In >95% of cases, both assignations provide similar results. Sequences not annotated by KeyDNAtools likely result from the absence of the corresponding clade in the core reference database, from low-quality sequences or from novel variants of the gene present in newly available sequences not yet included in the core data set. Conversely, sequences not assigned by the Crunch_Assign software are often chimera or low-quality sequences. After a search by similarity, we offer the possibility to build a phylogenetic tree on the fly, using the most similar sequences found by BLAST or Crunch_Assign.

Updates

We have developed a pipeline that allows a new GenBank release to be analysed within a week. Most of that time is spent manually checking conflicts after average linkage clustering, as explained previously. Updates of the PR2 database will therefore be made shortly after each new GenBank release. Consequently, the numbers provided in this article will probably differ from those available from PR2 at the time of publication.

FUNDING

The European Union’s Seventh Framework Programmes (FP7) BIOMARKS (2008-6530, ERA-net Biodiversa) and MicroB3 [287589] and the following ANR (France) projects: AQUAPARADOX, PARALEX and GIME. Funding for open access charge: ANR Paralex (French) and BIOMARKS (FP7).

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

The authors warmly thank Marion Viprey for her pioneering help with the construction of the PR2 database. Computations were done on the ‘Mésocentre SIGAMM’ machine, hosted by Observatoire de la Côte d'Azur, Nice, France.

REFERENCES