Probabilistic base calling of Solexa sequencing data

Jacques Rougemont et al. BMC Bioinformatics. 2008.

Abstract

Background: Solexa/Illumina short-read ultra-high throughput DNA sequencing technology produces millions of short tags (up to 36 bases) by parallel sequencing-by-synthesis of DNA colonies. The processing and statistical analysis of such high-throughput data pose new challenges; currently a fair proportion of the tags are routinely discarded because they cannot be matched to a reference sequence, which reduces the effective throughput of the technology.

Results: We propose a novel base calling algorithm using model-based clustering and probability theory to identify ambiguous bases and code them with IUPAC symbols. We also select optimal sub-tags using a score based on information content to remove uncertain bases towards the ends of the reads.

Conclusion: We show that the method improves genome coverage and number of usable tags as compared with Solexa's data processing pipeline by an average of 15%. An R package is provided which allows fast and accurate base calling of Solexa's fluorescence intensity files and the production of informative diagnostic plots.

Figures

Figure 1

Signal and noise in fluorescence intensities. Representation of the first cycle of synthesis on five concatenated tiles of the phiX174 sequencing data. A. Projection of the intensity quadruples on the axes corresponding to the A and C channels and the G and T channels at cycles 1 and 15. The ellipses represent the Gaussian mixtures (centers and the line for one standard deviation are shown). B. Same data after de-correlation transformations (see Methods). Coloring reflects the mixture component with largest probability.

Figure 2

Base calling determined by entropy. A. Probability simplex for a 3-letter alphabet (A = blue, C = red, G = green). Each point in the triangle is a probability triplet (P_A, P_C, P_G) represented by the corresponding color mixture. Blue lines are iso-entropic levels, black lines are the cutoffs between the various IUPAC codes. These correspond to midpoints in the state variable (S = 2^h). B. Distribution of entropy per base across 10 tiles on 36 bases. Red lines at the bottom indicate the IUPAC cutoffs. Mass within each segment is indicated in red.
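The entropy-based coding described in this caption can be illustrated with a minimal Python sketch for the full 4-letter alphabet. It is not the Rolexa implementation: the function names `entropy_bits` and `call_base` are hypothetical, and reading the "midpoints in the state variable" as the cutoffs S = 1.5, 2.5 and 3.5 is an assumption. Under that reading, S = 2^h is rounded to the nearest integer k, and the k most probable bases are reported as a single IUPAC symbol.

```python
import math

# IUPAC symbols keyed by the set of bases each one stands for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
    frozenset("ACG"): "V", frozenset("ACT"): "H",
    frozenset("AGT"): "D", frozenset("CGT"): "B",
    frozenset("ACGT"): "N",
}


def entropy_bits(probs):
    """Shannon entropy h (in bits) of a base-probability dict."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


def call_base(probs):
    """Code one position as an IUPAC symbol from its base probabilities.

    Assumption: the cutoffs are the midpoints 1.5, 2.5, 3.5 of S = 2**h,
    so S is rounded to the nearest number of states k (1 to 4).
    """
    s = 2 ** entropy_bits(probs)          # effective number of states
    k = min(4, max(1, math.floor(s + 0.5)))
    # Keep the k most probable bases and look up their joint symbol.
    top = sorted(probs, key=probs.get, reverse=True)[:k]
    return IUPAC[frozenset(top)]


if __name__ == "__main__":
    confident = {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}
    two_way = {"A": 0.48, "C": 0.02, "G": 0.48, "T": 0.02}
    print(call_base(confident), call_base(two_way))
```

A position dominated by one base keeps its plain letter, while a position split between two bases is coded with the matching two-base symbol instead of being forced to a single call.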

Figure 3

Quality and entropy depend on position in the sequence. A. Quantile-quantile plot of fast-q quality score against the information content per base. The two measures are loosely correlated, but clearly not equivalent. B. Boxplot of the fast-q score along the first 35 bases of the sequencing. The overall base quality decreases sharply after base 14, but the distribution still extends up to the top 40 score at bases 30–35. C. Frequency of the four categories of ambiguous IUPAC codes as a function of the position in the sequence.

Figure 4

Rolexa base-calling increases the coverage. Black: Solexa base calling, blue: Rolexa base calling using only the ACGT alphabet (most probable base calling), green: Rolexa base calling using IUPAC codes, red: Rolexa base calling with IUPAC codes and tag length optimization. Numbers in the right margin are the number of matching tags in millions. Sequence tags were sorted by decreasing quality (fast-q) and unique exact matches on the reference phiX174 genome were searched. Vertical axis shows the proportion of tags finding an exact match.

Figure 5

Disequilibrium between complementary bases ratio. A. Error rate at each cycle of sequencing. Each tag was aligned on the genome using align0, and the error rate was defined by counting the number of differences between the bases called and the reference at the corresponding position. Black is the error rate for Solexa-called tags, blue for Rolexa tags called using only the ACGT alphabet and green for Rolexa-called tags with IUPAC codes. B. Proportion of bases A, C, G and T at each position in the tags for Solexa base calling (dashed lines) and Rolexa base calling (continuous lines). The complementary A and T proportions are different (ratio is not 1) and degrade along the sequences (lines drift apart). The proportions are less dependent on position with Rolexa base calling, although the ratios remain different from 1. Label on y-axis is wrong. Panels C-D focus on tags "rescued" by Rolexa base calling, namely those tags that could not be mapped on the genome after Solexa base calling but had a matching position via Rolexa base calling. C. The distribution of substitutions between the Solexa tags and the corresponding Rolexa tags shows a predominance of C to A and T to G substitutions, which is consistent with a re-equilibration of the base complementarity. D. Introducing one to six mutations at random positions in the Solexa tags, with the same frequencies as the Rolexa algorithm, only rescues about 2% of the tags that were rescued by Rolexa with the same number of ambiguous bases (green bars).
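The per-cycle error rate of panel A can be sketched as follows. This is an illustration under stated assumptions, not the procedure used for the figure: tags are taken to be already placed on the reference without gaps (the caption's align0 step is not reproduced), they are assumed to lie entirely within a linear reference, and an ambiguous IUPAC symbol is counted as correct when the reference base belongs to its set. The function name `per_cycle_error_rate` and the input layout are hypothetical.

```python
# Bases covered by each IUPAC symbol.
IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def per_cycle_error_rate(tags, reference):
    """Fraction of mismatching calls at each sequencing cycle.

    tags: list of (called_sequence, start) pairs, where start is the
          0-based reference offset of an ungapped alignment (assumption).
    reference: reference sequence as a string over A, C, G, T.
    """
    if not tags:
        return []
    n_cycles = max(len(seq) for seq, _ in tags)
    errors = [0] * n_cycles
    totals = [0] * n_cycles
    for seq, start in tags:
        for cycle, symbol in enumerate(seq):
            ref_base = reference[start + cycle]
            totals[cycle] += 1
            # Assumption: an ambiguous code matches if it contains the reference base.
            if ref_base not in IUPAC_SETS[symbol]:
                errors[cycle] += 1
    return [e / t if t else 0.0 for e, t in zip(errors, totals)]


if __name__ == "__main__":
    reference = "ACGTACGT"
    tags = [("ACGT", 0), ("ACTT", 0), ("MGTA", 1)]
    print(per_cycle_error_rate(tags, reference))
```

In the small example, only the second tag disagrees with the reference, at its third cycle, so that cycle has an error rate of one in three and all others are zero.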

Figure 6

Tag-dependent quality filtering improves the mapping efficiency. Several entropy cutoffs were used to filter low-quality Rolexa-called tags and to reduce tags to higher scoring sub-tags. Solexa-called tags were filtered to the same length as the average length of the previous sets and to various average fast-q scores. A. The actual coverage of the target genome as a function of the expected coverage (if all tags could have been mapped). B. The efficiency of the filtering in coverage ratio (actual number of nucleotides covered divided by expected number, X axis) and in tag mapping ratio (number of tags mapped to the genome divided by number of tags passing the quality filter, Y axis). Rolexa (red points) has superior efficiency to Solexa (green points) in all data sets. Points are labeled with the cutoffs used (see text): Rolexa cutoffs are either constant (2, 4, 6, 8), growing logarithmically (Log) or exponentially (Exp); Solexa cutoffs are indicated by two numbers, the length cutoff followed by the fast-q cutoff.
