UCHIME improves sensitivity and speed of chimera detection

1Tiburon, CA, USA, 2Genome Sequencing and Analysis Program, The Broad Institute, Cambridge, MA 02142, 3Department of Chemistry and Biochemistry, University of Colorado, Boulder, CO 80309, USA and 4School of Engineering, University of Glasgow, Glasgow G12 8LT, UK

* To whom correspondence should be addressed.

Revision received: 30 May 2011

Robert C. Edgar, Brian J. Haas, Jose C. Clemente, Christopher Quince, Rob Knight, UCHIME improves sensitivity and speed of chimera detection, Bioinformatics, Volume 27, Issue 16, August 2011, Pages 2194–2200, https://doi.org/10.1093/bioinformatics/btr381

Abstract

Motivation: Chimeric DNA sequences often form during polymerase chain reaction amplification, especially when sequencing single regions (e.g. 16S rRNA or fungal Internal Transcribed Spacer) to assess diversity or compare populations. Undetected chimeras may be misinterpreted as novel species, causing inflated estimates of diversity and spurious inferences of differences between populations. Detection and removal of chimeras is therefore of critical importance in such experiments.

Results: We describe UCHIME, a new program that detects chimeric sequences with two or more segments. UCHIME either uses a database of chimera-free sequences or detects chimeras de novo by exploiting abundance data. UCHIME has better sensitivity than ChimeraSlayer (previously the most sensitive database method), especially with short, noisy sequences. In testing on artificial bacterial communities with known composition, UCHIME de novo sensitivity is shown to be comparable to Perseus. UCHIME is >100× faster than Perseus and >1000× faster than ChimeraSlayer.

Contact: robert@drive5.com

Availability: Source, binaries and data: http://drive5.com/uchime.

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

1.1 Background

Current sequencing technologies often require DNA samples to be amplified using the polymerase chain reaction (PCR). Amplification produces chimeric sequences that stem from two or more original sequences (the parents of the chimera). The most common mechanism is incomplete template extension, in which a partially extended strand from one parent anneals to a different parent in the next cycle of PCR. The resulting chimeras are often difficult to identify during downstream analysis (Ashelford et al., 2005). This problem is particularly acute in population studies that sequence a single region, such as the bacterial 16S ribosomal RNA gene (16S) or the fungal Internal Transcribed Spacer (ITS) region, to estimate diversity or find differences between populations, e.g. between diseased and control samples. In the case of 16S, published studies report that curated databases may contain up to 46% chimeric sequences (Ashelford et al., 2005, 2006; Huber et al., 2004). Factors including sequence similarity, number of PCR cycles and relative abundance of gene-specific PCR templates influence chimera formation (Acinas et al., 2005; Haas et al., 2011; Lahr and Katz, 2009; Thompson et al., 2002; Wang and Wang, 1996, 1997). While chimeras with two segments (bimeras) are most common, chimeras with >2 segments (multimeras) may form at comparable rates and account for a significant fraction of the unique sequences in an amplified sample (Lahr and Katz, 2009).

1.2 Previous work

Previous chimera detection methods include CHIMERA_CHECK (Maidak et al., 1999), Pintail (Ashelford et al., 2005), Mallard (Ashelford et al., 2006), Bellerophon (Huber et al., 2004), ChimeraChecker (Nilsson et al., 2010), ChimeraSlayer (Haas et al., 2011) and Perseus (Quince et al., 2011). Pintail and Mallard are 16S-specific programs that use a reference database of trusted chimera-free reference sequences. The query sequence is aligned to all (Pintail) or all pairs (Mallard) of reference sequences. Evolutionary distance is computed in a sliding window across the query sequence and variations in distance are compared with the known rate variability in the 16S gene, with larger variations indicating a chimera. ChimeraChecker is an ITS-specific method using BLAST (Altschul et al., 1997) to search a reference database for taxonomic anomalies. If, for example, the closest match to the ITS1 region is different from the closest match to the ITS2 region, the query is flagged as potentially chimeric. ChimeraSlayer searches a multiple alignment of chimera-free reference sequences and constructs three-way alignments with candidate parents. ChimeraSlayer was shown to be more sensitive than earlier methods (Haas et al., 2011). Although ChimeraSlayer is presented as a 16S-specific method, it would likely perform well with another sequence type if a reference multiple alignment is available. Perseus is designed to detect chimeras in 454 pyrosequencing reads that have been filtered by the AmpliconNoise algorithm (Quince et al., 2011). Assuming that a chimera has undergone fewer rounds of amplification than its parents, the query is compared with all pairs of sequences having higher abundance. The closest pair is selected, and its three-way alignment with the query sequence is made. Supervised learning is employed to determine the parameters of the model.

1.3 UCHIME

To improve speed and accuracy of chimera detection, we created a new algorithm, UCHIME. In our tests, UCHIME achieved higher sensitivity than the best previous method based on a reference database (ChimeraSlayer), while maintaining lower or comparable error rates. In particular, UCHIME has much better performance on short, noisy sequences and on multimeras. The algorithm has no explicit dependencies on any one region and should perform well on different sequence types. UCHIME can use a trusted reference database of non-chimeric sequences (like ChimeraSlayer) and also offers a de novo mode (like Perseus). UCHIME does not require a multiple alignment of the reference database. UCHIME reports a score for each sequence, allowing the user to trade sensitivity for specificity by adjusting the minimum score threshold used to discriminate chimeras from biological sequences. No training is required as we have found the UCHIME score parameters to be robust when presented with different types of input data. The default score threshold gave good sensitivity with low error rates (0–3%) on our tests.

2 METHODS

2.1 UCHIME algorithm

The UCHIME algorithm is illustrated in Figure 1. The query sequence is divided into four non-overlapping segments (chunks), each of which is used to search a reference database, which is assumed to be chimera free. The best matches to each chunk are noted, and the two best candidate parents are identified from matches to all chunks. A three-way multiple alignment of the query to these two candidates is constructed. If a pair of segments extracted from these two candidates has identity ≥0.8% closer to the query sequence than either candidate alone, a score is computed from the alignment and a chimera is reported if the score exceeds a predetermined threshold. In reference mode, the user provides a database of trusted sequences. In de novo mode, the database is constructed on the fly using a strategy similar to Perseus: sequences are considered in the order of decreasing abundance, and candidate parents must have abundance at least 2× that of the query sequence, assuming chimeras are less abundant than their parents because they undergo fewer rounds of amplification. Sequences not classified as chimeric are added to the reference database.
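The first step described above, dividing the query into four non-overlapping chunks, can be sketched as follows. This is a minimal illustration only: the function name `split_into_chunks` is hypothetical, and the way leftover bases are distributed when the query length is not a multiple of four is an assumption, since the text does not specify the exact chunk boundaries.

```python
from typing import List


def split_into_chunks(query: str, n_chunks: int = 4) -> List[str]:
    """Split a query into n_chunks contiguous, non-overlapping pieces.

    Assumption: when the length is not divisible by n_chunks, the first
    few chunks each receive one extra base.
    """
    size, extra = divmod(len(query), n_chunks)
    chunks = []
    start = 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(query[start:end])
        start = end
    return chunks


if __name__ == "__main__":
    # Each chunk would then be used as a separate database query.
    print(split_into_chunks("ACGTACGTAC"))
```

Searching each chunk separately lets a parent that matches only part of the query still appear among the best hits.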

Fig. 1.

UCHIME schematic. The query sequence is divided into four chunks, each of which is used to search the reference database. The best few hits to each chunk are saved, and the closest two sequences are found by calculating smoothed identity with the query. A three-way chimeric alignment is constructed, and a chimera is reported if its score [Equation (2)] exceeds a preset threshold.

2.2 Chimeric alignments and models

UCHIME searches for a chimeric alignment between a query sequence (Q) and two candidate parents (A and B). We identify three types of alignment as shown in Figure 2, which we call local, local-X and global-X, respectively. We aggressively reduce the number of chimeric alignments that are forwarded to the classification stage because with a given error rate, false positives increase with the number of classifications, while the number of true positives is at most one. We limit the number of classifications by (i) searching for global-X alignments, as fewer global-X alignments usually exist compared with local or local-X; (ii) examining only two candidate parents; and (iii) discarding models having distance to the closest parent (divergence) <0.8%, as classification is harder when differences are small and a failure to detect a chimera with very small divergence only rarely degrades experimental results. If parents or close proxies (step-parents) are present in the reference database, then it is usually possible to construct a chimeric alignment. However, the existence of a chimeric alignment is not sufficient to reliably classify a sequence as an amplification artifact. Chimeric alignments may alternatively be explained by (i) chance biological similarity, e.g. in fast-evolving regions; (ii) convergent evolution due to similar selection pressure in different lineages; (iii) naturally occurring chimeras due to biological processes such as lateral gene transfer; (iv) sequencer error; or (v) poor-quality alignments. One might naively expect that a global-X search would fail to find most multimeras, but in practice global-X proved to have surprisingly good sensitivity and was more effective for finding multimeras than other approaches we have tried, including local-X, which is available as an option in UCHIME. The effectiveness of global-X may be explained by the fact that many multimeras resemble noisy bimeras, and UCHIME is tolerant of noise. 
All results reported here were obtained using global-X search unless otherwise stated.

Fig. 2.

Chimeric alignments. We identify three types of chimeric alignment between a query sequence Q and two candidate parents A and B: local, local-X and global-X. A chimeric alignment has two non-overlapping segments of Q, one of which is closer to A than to B by some measure of evolutionary distance while the other is closer to B than to A. In a local chimeric alignment, these two segments can be non-contiguous and may only cover a part of Q. In a local-X alignment, the segments are contiguous with an intervening crossover segment (X) which is identical in Q, A and B. A global-X alignment is a special case of a local-X alignment that covers all of Q, but not necessarily all of A or B.

2.3 Scoring function

In a typical chimeric alignment, most columns are identities q = a = b, where q, a and b are letters from Q, A and B, respectively. A column in which at least one sequence differs from the other two is called a diff. Diffs can be considered as votes for or against the model (Fig. 3). For example, a diff with q = a, q ≠ b increases the distance d(Q,B) while leaving d(Q,A) unchanged. If such a diff is found in the segment that is closer to A, it can be regarded as a ‘yes’ vote supporting the model; if it is found in the segment that is closer to B then it contradicts the model and is regarded as a ‘no’ vote. A diff in which all three sequences differ or in which a = b, q ≠ a, q ≠ b increases the distance of Q to both A and B and is regarded as an ‘abstain’ vote that neither supports nor contradicts the model. Let Y_g, N_g and A_g be the total number of yes, no and abstain votes in segment g of the model, where g is L (left) or R (right). If Y_L > N_L and Y_R > N_R, the alignment is chimeric and the model is closer to Q than A or B alone. The number of diffs may be very small in more challenging cases. For example, in a 16S experiment using 200 nt reads, clusters of radius ~3% might be used in an attempt to identify species (Stackebrandt and Goebel, 1994). It would then be important to identify chimeras with divergences as low as ~2%, which could have as few as four diffs with their closest parents. In such cases, the small amount of evidence available should increase the uncertainty of the classification. UCHIME uses a numerical score for discrimination, as follows. Each segment is assigned a score:

$$H = \frac{Y}{\beta\,(N + n) + A} \qquad (1)$$

Intuitively, this can be understood as a generalization of the ratio Y/N, which must be >1 for the alignment to be chimeric. The β parameter (which should be ≥1 and is set to 8 by default) gives a no vote a higher weight than a yes vote, and the n parameter (which should be >0 and is set to 1.4 by default) acts as a pseudocount prior (Durbin et al., 1998) on the number of no votes. A positive value of n reduces H, especially when Y is small; this models increased uncertainty with reduced evidence. Abstain votes also lower the score as they indicate noise or the use of a step-parent, either of which should increase uncertainty. The query is classified as a chimera if:

$$H_L \cdot H_R \geq h \qquad (2)$$

Here, h is the minimum score threshold (0.28 by default). This score is ad hoc; i.e. was not derived from a theoretical model. It was chosen because it is conceptually simple, fast to compute, has only two tunable parameters (β and n) plus an adjustable threshold (h) and was found to perform well empirically.
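The voting and scoring scheme can be sketched in a few lines of Python. This is an illustrative sketch, not the UCHIME source: the function names are hypothetical, and the segment score is assumed to take the form H = Y / (β(N + n) + A), which matches the description in the text (a generalization of Y/N in which no votes are weighted by β, n is a pseudocount on no votes, and abstain votes lower the score), with the chimera decision taken as H_L · H_R ≥ h.

```python
from typing import Optional, Tuple


def vote(q: str, a: str, b: str, closer_to: str) -> Optional[str]:
    """Classify one alignment column as 'yes', 'no', 'abstain' or None.

    closer_to is 'A' or 'B': the parent that the enclosing segment of the
    model is assumed to be derived from. Identity columns return None.
    """
    if q == a == b:
        return None  # not a diff
    if q == a and q != b:
        return "yes" if closer_to == "A" else "no"
    if q == b and q != a:
        return "yes" if closer_to == "B" else "no"
    return "abstain"  # Q differs from both parents


def segment_score(yes: int, no: int, abstain: int,
                  beta: float = 8.0, pseudo: float = 1.4) -> float:
    """Assumed form of the segment score: H = Y / (beta * (N + n) + A)."""
    return yes / (beta * (no + pseudo) + abstain)


def is_chimeric(left: Tuple[int, int, int], right: Tuple[int, int, int],
                threshold: float = 0.28) -> bool:
    """Report a chimera when the product of segment scores reaches h."""
    return segment_score(*left) * segment_score(*right) >= threshold


if __name__ == "__main__":
    # Four yes votes per segment is weak evidence; ten per segment is strong.
    print(is_chimeric((4, 0, 0), (4, 0, 0)))
    print(is_chimeric((10, 0, 0), (10, 0, 0)))
```

With the default parameters, a model supported by only four yes votes per segment falls below the default threshold even with no contradicting votes, which illustrates how the pseudocount penalizes scant evidence.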

Fig. 3.

Chimeric alignment showing diffs and votes. This figure shows a region from an alignment generated by UCHIME. Diffs and votes are annotated. The ‘Model’ row indicates the three segments of the alignment which are closer to A, the crossover (X) and closer to B, respectively. Diffs are ‘A’ = diff with Q closer to A in the A segment, ‘a’ = diff with Q closer to A in the B segment, and similarly for ‘B’ and ‘b’. A ‘p’ diff indicates that the parents agree but are different from Q. Votes are ‘+’ (yes), ‘!’ (no) and ‘0’ (abstain), indicating whether the corresponding diff supports or contradicts the model.

2.4 Parent selection and alignment construction

Candidate parents are found by (i) splitting the query into subsequences (chunks); (ii) using each chunk to search the database; and (iii) saving the best few hits to each chunk. We have found any reasonable procedure to be effective for this stage. More difficult is to reduce the number of candidates in order to suppress the false positives caused by attempting to classify too many models. UCHIME selects the best two candidates according to the following procedure. A pair-wise alignment is computed between the query Q and each candidate parent P. The identity between P and Q is smoothed over a window (default size 32). For each position in Q, the highest value for the smoothed identity among the parents is recorded. The best candidate is then identified as the one with most positions having highest smoothed identity. Note that this does not require the positions to be contiguous. This can be effective in the case of a multimera where multiple disjoint segments are derived from a single parent sequence, which may occur when a sequence is highly abundant in the sample. The positions in which the best candidate has highest smoothed identity are removed from Q, and the second candidate is identified in the same way from the remaining positions. UCHIME then constructs a star multiple alignment (Altschul, 1989), i.e. one that preserves the pair-wise alignments of Q to the two candidate parents. Following ChimeraSlayer, columns in the three-way alignment containing a gap or adjacent to a column containing a gap are discarded as these tend to occur in regions that are less reliably aligned. Diffs are identified in the remaining columns. Finally, dynamic programming on the vector of diffs is used to find segments of a global-X or local-X labeling of the alignment that maximizes H.
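The two-candidate selection procedure can be sketched as below. This is a simplified illustration under stated assumptions: each candidate is represented by a hypothetical per-position match vector (1 where the candidate agrees with the query in the pair-wise alignment, 0 otherwise), the smoothing window is roughly centred on each position and truncated at the sequence ends, and a position where several candidates tie for the highest smoothed identity is credited to all of them. None of these details are specified in the text.

```python
from typing import Dict, List, Optional, Tuple


def smoothed_identity(matches: List[int], window: int = 32) -> List[float]:
    """Mean identity in a window around each query position."""
    n = len(matches)
    prefix = [0]
    for m in matches:
        prefix.append(prefix[-1] + m)
    half = window // 2
    out = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i - half + window)  # window truncated at the ends
        out.append((prefix[hi] - prefix[lo]) / (hi - lo))
    return out


def pick_best(smoothed: Dict[str, List[float]],
              positions: List[int]) -> Tuple[str, List[int]]:
    """Return the candidate holding the top smoothed identity at most positions."""
    wins: Dict[str, List[int]] = {name: [] for name in smoothed}
    for pos in positions:
        top = max(vals[pos] for vals in smoothed.values())
        for name, vals in smoothed.items():
            if vals[pos] == top:
                wins[name].append(pos)
    best = max(wins, key=lambda name: len(wins[name]))
    return best, wins[best]


def select_two_parents(match_vectors: Dict[str, List[int]],
                       window: int = 32) -> Tuple[str, Optional[str]]:
    """Pick the best candidate, remove its positions, then pick the second."""
    smoothed = {k: smoothed_identity(v, window) for k, v in match_vectors.items()}
    n = len(next(iter(smoothed.values())))
    first, won = pick_best(smoothed, list(range(n)))
    won_set = set(won)
    remaining = [p for p in range(n) if p not in won_set]
    rest = {k: v for k, v in smoothed.items() if k != first}
    if not remaining or not rest:
        return first, None
    second, _ = pick_best(rest, remaining)
    return first, second
```

Because positions are counted rather than required to be contiguous, a parent contributing several disjoint segments of a multimera can still be chosen as the best candidate.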

2.5 De novo mode and abundance skew

In de novo mode, UCHIME starts with an empty reference database. Sequences are considered in the order of decreasing abundance. If a sequence is classified as chimeric, it is discarded; otherwise it is added to the reference database. Candidate parents are required to have abundance at least λ times that of the query sequence, on the assumption that a chimera has undergone fewer rounds of amplification and will therefore be less abundant than its parents. The parameter λ is called the abundance skew, and by default λ=2, assuming at least one more round of amplification for the parents.
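The de novo loop can be sketched as follows. This is a minimal sketch, not the UCHIME implementation: `denovo_filter` is a hypothetical name, the `is_chimera` callback stands in for the full search, alignment and scoring pipeline, and ties in abundance are broken by sequence label purely to make the order deterministic.

```python
from typing import Callable, Dict, List, Tuple


def denovo_filter(abundances: Dict[str, int],
                  is_chimera: Callable[[str, List[str]], bool],
                  skew: float = 2.0) -> Tuple[List[str], List[str]]:
    """Process sequences by decreasing abundance, growing the reference set.

    abundances maps a sequence label to its abundance. is_chimera(query,
    candidates) is a user-supplied classifier (hypothetical here).
    """
    reference: List[str] = []
    chimeras: List[str] = []
    for query in sorted(abundances, key=lambda s: (-abundances[s], s)):
        # Only sequences at least `skew` times as abundant may act as parents.
        candidates = [r for r in reference
                      if abundances[r] >= skew * abundances[query]]
        if is_chimera(query, candidates):
            chimeras.append(query)      # discarded, never used as a parent
        else:
            reference.append(query)     # trusted for later queries
    return reference, chimeras


if __name__ == "__main__":
    counts = {"p1": 100, "p2": 80, "x": 60, "c2": 50, "c": 10}
    # Toy classifier: flags labels starting with "c" that have >= 2 candidates.
    toy = lambda q, cands: q.startswith("c") and len(cands) >= 2
    print(denovo_filter(counts, toy))
```

In the toy run, "c2" has only one sufficiently abundant candidate parent and is therefore kept, which shows how the abundance skew restricts the parents considered for each query.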

2.6 Training and validation datasets

Three test datasets were used in this work. (i) SIM2 is a selected subset of the simulated bimeras and control sequences used to train and evaluate ChimeraSlayer. (ii) MOCK comprises the Uneven datasets used to evaluate Perseus (Quince et al., 2011). They are derived from pyrosequencing reads of ‘mock’ communities, i.e. experimentally mixed DNAs of known composition. These reads were processed by AmpliconNoise (Quince et al., 2011), which attempts to remove sequencing error and generates a set of predicted sequences for the amplicons. Sequences in this set were classified as biological or chimeric by comparing them to reference sequences for the species in each community, and chimera detection algorithms were assessed by their success in reproducing this classification. (iii) SIMM is a new set of simulated _m_-meras created for this work. SIM2 and SIMM were used to compare the performance of the reference database mode of UCHIME with ChimeraSlayer; MOCK was used to compare the de novo mode of UCHIME with Perseus. The parameters of UCHIME were trained on SIM2; the score threshold h was set to a value giving an average error rate over the whole SIM2 dataset lower than the error rate of ChimeraSlayer on the same data. UCHIME was trained by an exhaustive search over manually selected pairs (β,n). The optimal pair (β+,n+) was identified by maximizing the area under a receiver operating characteristic curve (Mason and Graham, 2002). Given β+ and n+, an optimal score threshold _h_+ is determined by (i) specifying a maximum desired error rate or minimum desired sensitivity and (ii) maximizing sensitivity or minimizing error rate, respectively. After training, the sensitivity of UCHIME averaged over all SIM2 sets was 70.6% with an error rate of 0.49%, compared with 54.6% sensitivity and 0.62% errors for ChimeraSlayer.
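The parameter search described above can be illustrated with a small grid search that maximizes the area under the ROC curve. This is a sketch under several assumptions: each labelled sequence is reduced to hypothetical vote counts (Y, N, A) for its left and right segments, the score is taken to be the product of segment scores of the form Y / (β(N + n) + A), and the area is computed with the rank-based (Mann-Whitney) estimate rather than whatever procedure the authors used.

```python
from itertools import product
from typing import List, Sequence, Tuple

Votes = Tuple[Tuple[int, int, int], Tuple[int, int, int]]  # (left, right)


def model_score(votes: Votes, beta: float, pseudo: float) -> float:
    """Product of left and right segment scores (assumed score form)."""
    score = 1.0
    for yes, no, abstain in votes:
        score *= yes / (beta * (no + pseudo) + abstain)
    return score


def auc(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Probability that a chimera outscores a control; ties count one half."""
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def grid_search(chimeras: List[Votes], controls: List[Votes],
                betas: Sequence[float],
                pseudos: Sequence[float]) -> Tuple[float, float, float]:
    """Return (best AUC, beta, pseudocount) over all candidate pairs."""
    best = None
    for beta, pseudo in product(betas, pseudos):
        pos = [model_score(v, beta, pseudo) for v in chimeras]
        neg = [model_score(v, beta, pseudo) for v in controls]
        area = auc(pos, neg)
        if best is None or area > best[0]:
            best = (area, beta, pseudo)
    return best
```

Once a pair (β, n) is fixed, the threshold h can be chosen separately by sweeping the sorted scores and keeping the value that meets the desired error rate or sensitivity.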

2.7 Creation of the SIMM dataset

In order to test UCHIME on multimeras, we implemented CHSIM, a simulator capable of creating _m_-meras with any number of segments. Input to CHSIM is a set of chimera-free parent sequences. In each iteration of the simulation, a preset number of chimeras (default 100) are created at crossovers where parents have an identical _k_-mer; in this experiment, we used _k_=10. Crossover points are selected at random, weighted by the frequency of the _k_-mer in the set of parent sequences. This biases crossovers to occur between similar sequences in regions of higher sequence similarity, as presumably happens in real experiments. Non-homologous crossovers are permitted, and exactly one occurred in the simulations used to create SIMM (ch646_m4_90_95). At the end of each iteration, chimeras are added to the pool of parent sequences, allowing multimeras to form when one or two existing chimeras cross over. To create the SIMM dataset, parents were the set of 86 reference sequences for species in the Uneven sets of MOCK. These have length ~250 nt and cover the V2 hypervariable region of the 16S gene. These relatively short parents were chosen to model the short sequences obtained by current sequencing technologies, which can be more challenging for chimera detection algorithms owing to the smaller number of diffs needed to cause divergences that are experimentally relevant (Haas et al., 2011). Several simulations were performed using the same set of parent sequences with different random number seeds. Segments in a chimera were required to be unique to one parent, otherwise an _m_-mera may be identical to an (_m_−1)-mera. Chimeras with _m_>4 were found to be very rare due to the short sequence length. Chimeras with _m_=2, 3 and 4 in three divergence ranges (90–95%, 95–97% and 97–99%) were identified, for a total of nine bins, each containing 100 simulated chimeras.
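Crossover at a shared _k_-mer can be sketched as follows. This is not the CHSIM code: the function names are hypothetical, only a single pair of parents is considered, and the crossover site is drawn uniformly from all matching _k_-mer positions, which simplifies the frequency-weighted selection over the whole parent pool described above.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def kmer_positions(seq: str, k: int) -> Dict[str, List[int]]:
    """Map each k-mer to the list of positions where it starts."""
    positions: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions


def crossover_sites(a: str, b: str, k: int) -> List[Tuple[int, int]]:
    """All (i, j) with a[i:i+k] == b[j:j+k], i.e. possible crossover points."""
    in_b = kmer_positions(b, k)
    return [(i, j)
            for i in range(len(a) - k + 1)
            for j in in_b.get(a[i:i + k], [])]


def make_bimera(a: str, b: str, k: int, rng: random.Random) -> Optional[str]:
    """Join the start of a to the end of b at a randomly chosen shared k-mer."""
    sites = crossover_sites(a, b, k)
    if not sites:
        return None  # no identical k-mer, so no crossover is simulated
    i, j = rng.choice(sites)
    return a[:i] + b[j:]


if __name__ == "__main__":
    rng = random.Random(0)
    print(make_bimera("AAAACCCCGGGG", "TTTTCCCCGTTT", k=5, rng=rng))
```

Feeding simulated bimeras back into the parent pool and repeating the procedure would produce multimeras, as in the iteration scheme described in the text.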

2.8 Program versions

Unless otherwise stated, UCHIME results were obtained using USEARCH v4.2.52. Perseus results were obtained using v1.24 of the AmpliconNoise package. MAFFT v6.853 (Katoh and Toh, 2008) was used by Perseus to create alignments. The reference database used for both ChimeraSlayer and UCHIME was the ‘gold’ set in http://sourceforge.net/projects/microbiomeutil/files/, version 2011-11-02. Unless otherwise stated, Perseus results were obtained using PerseusD v1.24, a variant of the original Perseus algorithm that follows UCHIME by only testing parents that have been classified as non-chimeric and are at least twice as abundant as the query. For a comparison of Perseus with PerseusD, see the Supplementary Material.

3 RESULTS

3.1 Assessment on SIM2

The SIM2 dataset contains simulated bimeras and control sequences with lengths 200, 300 and full-length (FL). Bimeras are created by selecting two random segments of the control sequences. Ten additional sets are provided for each length in which from 1% to 5% of sites were mutated by introducing simulated substitutions or indels, respectively. These mutations model cases where reference sequences are diverged from the true parents due to biological variation, sequencing error or other factors. Results are presented in Table 1, Supplementary Table S1 and in Figure 4, which shows sensitivity and specificity on the length 200 sets, which are the shortest and therefore most difficult. As seen in Figure 4, UCHIME has higher sensitivity on all length 200 sets, with increasing improvement at higher mutation rates. The sensitivity of ChimeraSlayer falls rapidly as substitutions are introduced, even at the relatively low rate of 1%, while the sensitivity of UCHIME degrades only slightly. At a substitution rate of 2%, which is well within the range observed for 16S genes within strains of a single bacterial species, the sensitivity of ChimeraSlayer drops by more than half (from 71% to 25%), compared with a reduction of only 8% for UCHIME (from 72% to 66%).

Fig. 4.

Performance of UCHIME and ChimeraSlayer on length 200 tests in SIM2. These results show that UCHIME has higher sensitivity than ChimeraSlayer on all length 200 sets, with increasing improvement at higher mutation rates, especially when substitutions are present. The UCHIME error rate is <1% on all sets.

Table 1.

Performance of UCHIME and ChimeraSlayer (CS) on the SIM2 benchmark

| Length | Mutations | CS Sens. (Err.) | UCHIME Sens. (Err.) |
|---|---|---|---|
| FL | None | 90.3 (1.0) | 90.8 (0.5) |
| FL | 1% indels | 83.6 (0.9) | 94.3 (0.3) |
| FL | 1% subs. | 87.4 (0.4) | 90.4 (0.2) |
| 300 nt | None | 77.5 (1.9) | 81.3 (1.9) |
| 300 nt | 1% indels | 66.6 (1.9) | 76.4 (1.3) |
| 300 nt | 1% subs. | 55.5 (0.4) | 78.5 (1.0) |
| 200 nt | None | 70.7 (1.6) | 72.7 (0.9) |
| 200 nt | 1% indels | 60.4 (1.4) | 66.6 (0.6) |
| 200 nt | 1% subs. | 38.6 (0.3) | 69.6 (0.6) |

Sensitivity (Sens.) and error rate (Err.) are shown for selected subsets of the SIM2 benchmark: lengths 200, 300 and FL (full-length genes) with no added mutations and with 1% substitutions and indels respectively, as indicated in the Mutations column. For full results, see Supplementary Table S1. UCHIME has higher sensitivity on all these subsets; both programs have similar error rates in the range ~0.5% to ~2%. UCHIME is more tolerant of noise, especially with substitutions in short sequences where the sensitivity is improved from 38.6% to 69.6% (200 nt) and from 55.5% to 78.5% (300 nt). Values are given in percentages.

3.2 Assessment on SIMM

The SIMM dataset contains 900 simulated chimeras of length ~250 nt divided into nine bins by divergence and the number of segments. As in SIM2, 10 additional sets were created by adding from 1% to 5% substitutions or indels. Results are presented in Supplementary Table S2 and Figure 5, which shows sensitivity on the set with 1% substitutions as we consider this level of noise to be reasonably realistic. (While indel errors are relatively common in pyrosequencing, this is mainly due to homopolymers which can be handled in a preprocessing step, e.g. by truncating runs of identical letters). Error rates are not shown since both programs find no false positives in the parent sequences. Again we observe that UCHIME has greatly improved sensitivity compared with ChimeraSlayer, especially to chimeras with small divergence and/or larger numbers of segments. Similar trends are seen in the other sets (Supplementary Table S2).

Fig. 5.

Sensitivity on the SIMM set with 1% substitutions. UCHIME has higher sensitivity than ChimeraSlayer on all subsets, especially to chimeras with small divergence and larger numbers of segments. In the 3×3 grid shown in the figure, columns indicate the number of segments (m) in an _m_-mera and rows correspond to divergence ranges.

3.3 UCHIME and PerseusD compared on MOCK

Results on the MOCK sets are shown in Table 2. UCHIME is shown to have similar sensitivities and error rates to PerseusD. Given that UCHIME was trained entirely on a very different dataset (SIM2) in reference database mode with no separate training of the de novo mode, we interpret these results as demonstrating that the UCHIME algorithm is highly robust when presented with new types of data.

Table 2.

Performance of UCHIME and PerseusD on the MOCK datasets

| Set | GoodSeqs | Chimeras | Sensitivity: PerseusD (%) | Sensitivity: UCdn (%) | Sensitivity: UCref (%) | Errors: PerseusD | Errors: UCdn | Errors: UCref |
|---|---|---|---|---|---|---|---|---|
| Uneven1 | 94 | 898 | 93 | 95 | 89 | 1 | 0 | 2 |
| Uneven2 | 77 | 742 | 93 | 93 | 86 | 0 | 1 | 2 |
| Uneven3 | 75 | 925 | 93 | 94 | 91 | 0 | 2 | 1 |

| Set | _m_=2: N | _m_=2: PerseusD | _m_=2: UCdn | _m_=2: UCref | _m_=3: N | _m_=3: PerseusD | _m_=3: UCdn | _m_=3: UCref | _m_=4: N | _m_=4: PerseusD | _m_=4: UCdn | _m_=4: UCref |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Uneven1 | 816 | 761 | 780 | 737 | 81 | 73 | 68 | 61 | 1 | 0 | 1 | 1 |
| Uneven2 | 669 | 619 | 624 | 578 | 71 | 66 | 60 | 57 | 2 | 1 | 1 | 1 |
| Uneven3 | 843 | 797 | 806 | 782 | 82 | 64 | 64 | 59 | 0 | 0 | 0 | 0 |

Input sequences are denoised amplicons and abundances predicted by AmpliconNoise (Quince et al., 2011). These results show that de novo UCHIME (UCdn) with default parameters (obtained by training on SIM2) has similar sensitivities and error rates to PerseusD. In reference database mode (UCref), using the microbiome utils gold reference database, UCHIME performance is similar, reflecting the fact that many of the 16S sequences in the communities are found in the gold database. N is the number of chimeric sequences and m is the number of segments in the chimera. GoodSeqs and Chimeras are the total numbers of biological sequences and chimeras, respectively, found using a separate reference database of 16S sequences for species in the communities (see Quince et al., 2011 for details).

3.4 Computational resources

All the tested programs have modest memory requirements, needing at most 50 MB to complete the tests reported here. Times required to run UCHIME, ChimeraSlayer and Perseus on a representative dataset are shown in Table 3. Two versions of UCHIME are tested: a stand-alone program and a second implementation of the UCHIME algorithm in the USEARCH package (Edgar, 2010). The stand-alone version of UCHIME is more than 100× faster than Perseus in de novo mode and more than 1000× faster than ChimeraSlayer in reference database mode, with a further order-of-magnitude speedup achieved by the USEARCH version.

Table 3.

| Program | Mode | Time (h:min:s) |
|---|---|---|
| usearch –uchime | de novo | 0:02 |
| UCHIME | de novo | 0:13 |
| PerseusD | de novo | 32:06 |
| usearch –uchime | ref. db. | 0:34 |
| UCHIME | ref. db. | 13:19 |
| ChimeraSlayer | ref. db. | 4:28:28 |

Elapsed time required to execute two implementations of the UCHIME algorithm compared with ChimeraSlayer and Perseus on the Uneven1 subset of the MOCK data, which has 1124 sequences. The ChimeraSlayer reference database (5181 sequences) was used for both UCHIME and ChimeraSlayer. The stand-alone UCHIME program is tested and also an implementation of the same algorithm in the USEARCH package (Edgar, 2010). A single-core, 1 GHz 32-bit i86 Linux computer with 1 GB RAM was used.

4 DISCUSSION

Chimeric sequence identification poses a challenging problem in algorithm design, especially with short reads, where the available evidence is often limited to very small numbers of observed differences. UCHIME achieves a significant improvement in detection accuracy over previous methods that use a reference database, and performs comparably to a state-of-the-art de novo method designed specifically for pyrosequencing, despite the fact that UCHIME was not designed or trained for this particular type of data. UCHIME also runs much faster than previous programs. Our results show that UCHIME has robust performance when presented with different types of 16S data and is tolerant of simulated noise, suggesting that UCHIME is likely to perform well with other types of data, e.g. the fungal ITS region or reads from novel sequencing technologies.

UCHIME requires either a database with adequate coverage of the phylogenetic diversity in the input sequences (reference mode), or an estimate of unique amplicon sequences and their abundances (de novo mode). Construction of a suitable reference database and robust estimation of amplicon sequences and their abundances (denoising) are both challenging problems that are discussed in more detail in the Supplementary Material.
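To make the de novo input requirement concrete, the following minimal sketch collapses reads into unique sequences with abundances and orders them from most to least abundant. It uses naive exact-match dereplication, which is not a substitute for denoising. The abundance filter reflects the assumption that a chimera's parents are more abundant than the chimera itself; the threshold `min_skew=2.0` and the function names `dereplicate` and `candidate_parents` are assumptions made for illustration, not the UCHIME implementation.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical reads into (sequence, abundance) pairs, most abundant first."""
    counts = Counter(reads)
    # Sort by decreasing abundance; ties are broken by sequence for determinism.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def candidate_parents(query_index, uniques, min_skew=2.0):
    """Return more abundant uniques whose abundance is at least min_skew times the query's.

    min_skew is an assumed illustrative threshold.
    """
    _, query_abundance = uniques[query_index]
    return [
        seq
        for seq, abundance in uniques[:query_index]
        if abundance >= min_skew * query_abundance
    ]


# Toy reads, invented for the example.
reads = ["ACGT"] * 6 + ["ACGA"] * 3 + ["TCGA"] * 1
uniques = dereplicate(reads)
for i, (seq, abundance) in enumerate(uniques):
    print(seq, abundance, candidate_parents(i, uniques))
```

Because the most abundant sequence has no eligible parents, it can never be called chimeric under this scheme, which is why reliable abundance estimates matter for de novo detection.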

Although we regard the experiments reported here as informative for comparing algorithms, realism is hard to achieve both in simulations and in mock communities, so results may not be predictive of sensitivity and error rates that would be achieved on experimental data from environmental samples.

The emerging interest in characterizing the effects of members of the rare biosphere in a range of clinical and environmental contexts, combined with the rapid decrease in sequencing cost, challenges us to improve the efficiency of sequence analysis so that computational cost does not become a limiting factor. UCHIME meets this challenge for an essential step in many experiments, offering a unique combination of accuracy and speed that will be of great value to biologists.

ACKNOWLEDGEMENTS

R.C.E. thanks Henrik Flyvbjerg for helpful discussions.

Funding: National Institutes of Health (grant U54-HG004969 to B.J.H.); Engineering and Physical Sciences Research Council Career Acceleration Fellowship (EP/H003851/1 to C.Q.); National Institutes of Health (grant HG004872 and HHMI to R.K., in part).

Conflict of Interest: none declared.

REFERENCES

et al. PCR-induced sequence artifacts and bias: insights from comparison of two 16S rRNA clone libraries constructed from the same sample, Appl. Environ. Microbiol., 2005, vol. 71 (pg. 8966-8969)

Trees, stars and multiple biological sequence alignment, SIAM J. Appl. Math., 1989, vol. 49 (pg. 197-209)

et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res., 1997, vol. 25 (pg. 3389-3402)

et al. At least 1 in 20 16S rRNA sequence records currently held in public repositories is estimated to contain substantial anomalies, Appl. Environ. Microbiol., 2005, vol. 71 (pg. 7724-7736)

et al. New screening software shows that most recent large 16S rRNA gene clone libraries contain chimeras, Appl. Environ. Microbiol., 2006, vol. 72 (pg. 5734-5741)

et al., Biological Sequence Analysis, 1998, Cambridge, UK, Cambridge University Press

Search and clustering orders of magnitude faster than BLAST, Bioinformatics, 2010, vol. 26 (pg. 2460-2461)

et al. Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons, Genome Res., 2011, vol. 21 (pg. 494-504)

et al. Bellerophon: a program to detect chimeric sequences in multiple sequence alignments, Bioinformatics, 2004, vol. 20 (pg. 2317-2319)

Recent developments in the MAFFT multiple sequence alignment program, Brief. Bioinformatics, 2008, vol. 9 (pg. 286-298)

Reducing the impact of PCR-mediated recombination in molecular evolution and environmental studies using a new-generation high-fidelity DNA polymerase, Biotechniques, 2009, vol. 47 (pg. 857-866)

et al. A new version of the RDP (Ribosomal Database Project), Nucleic Acids Res., 1999, vol. 27 (pg. 171-173)

Areas beneath the relative operating characteristics (ROC) and relative operating levels (ROL) curves: statistical significance and interpretation, Q. J. Meteorol. Soc., 2002, vol. 128 (pg. 2145-2166)

et al. An open source chimera checker for the fungal ITS region, Mol. Ecol. Res., 2010, vol. 10 (pg. 1076-1081)

et al. Removing noise from pyrosequenced amplicons, BMC Bioinformatics, 2011, vol. 12 pg. 38

A place for DNA-DNA reassociation and 16S rRNA sequence analysis in the present species definition in Bacteriology, Int. J. Syst. Bacteriol., 1994, vol. 44 (pg. 846-849)

et al. Heteroduplexes in mixed-template amplifications: formation, consequence and elimination by ‘reconditioning PCR’, Nucleic Acids Res., 2002, vol. 30 (pg. 2083-2088)

The frequency of chimeric molecules as a consequence of PCR co-amplification of 16S rRNA genes from different bacterial species, Microbiology, 1996, vol. 142 Pt 5 (pg. 1107-1114)

Frequency of formation of chimeric molecules as a consequence of PCR coamplification of 16S rRNA genes from mixed bacterial genomes, Appl. Environ. Microbiol., 1997, vol. 63 (pg. 4645-4650)

Author notes

Associate Editor: Martin Bishop

© The Author(s) 2011. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
