Characteristics of 454 pyrosequencing data--enabling realistic simulation with flowsim

Susanne Balzer et al. Bioinformatics. 2010.

Abstract

Motivation: The commercial launch of 454 pyrosequencing in 2005 was a milestone in genome sequencing in terms of performance and cost. Throughout the three available releases, average read lengths have increased to approximately 500 base pairs and are thus approaching read lengths obtained from traditional Sanger sequencing. Study design of sequencing projects would benefit from being able to simulate experiments.

Results: We explore 454 raw data to investigate its characteristics and derive empirical distributions for the flow values generated by pyrosequencing. Based on our findings, we implement Flowsim, a simulator that generates realistic pyrosequencing data files of arbitrary size from a given set of input DNA sequences. We finally use our simulator to examine the impact of sequence lengths on the results of concrete whole-genome assemblies, and we suggest its use in planning of sequencing projects, benchmarking of assembly methods and other fields.

Availability: Flowsim is freely available under the General Public License from http://blog.malde.org/index.php/flowsim/.

Figures

Fig. 1.

(a) A 454 flowgram: the nucleotides are flowed cyclically during one read. The light signal strengths (flow values) are directly translated into homopolymer runs. (b) Absolute frequencies of flow values (E. coli). Left: original data without quality trimming; right: quality-trimmed data. The trimming algorithm sharpens the separation of the homopolymer length distributions and levels out discrepancies between the nucleotides, so that the curves for the four nucleotides are nearly identical.
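The translation of flow values into homopolymer runs described in panel (a) can be illustrated with a minimal sketch. It assumes a `TACG` flow order and simple rounding to the nearest integer; neither detail is stated in the caption, and the function name and example values are made up for illustration.

```python
# Assumed flow order (not given in the caption): nucleotides are flowed T, A, C, G in turn.
FLOW_ORDER = "TACG"

def flowgram_to_sequence(flow_values, flow_order=FLOW_ORDER):
    """Translate flow values into bases by rounding each to a homopolymer length."""
    bases = []
    for i, value in enumerate(flow_values):
        nucleotide = flow_order[i % len(flow_order)]
        # Round half up; a flow value near 0 means no base was incorporated.
        run_length = max(0, int(value + 0.5))
        bases.append(nucleotide * run_length)
    return "".join(bases)

# Hypothetical flow values: the first eight flows spell the TCAG tag.
example_flows = [1.02, 0.05, 0.98, 0.04, 0.03, 1.01, 0.12, 0.96,
                 2.07, 0.10, 0.00, 3.10]
called = flowgram_to_sequence(example_flows)
print(called)
```

Values such as 2.07 and 3.10 are read as runs of two and three identical bases, which is why noise in the flow values tends to show up as homopolymer length errors.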

Fig. 2.

(a) Absolute frequencies of flow values by flow cycle. A total of 200 flow cycles of a Titanium run correspond to 200×4 = 800 flows. The first two flow cycles contain the TCAG tag and are omitted here. Towards the end of a run, flow values tend to lie further from their ideal (integer) values, but there are fewer of them because many values from later flow cycles have been trimmed away. (b) Standard deviation of flow values (measured as the difference from the closest integer), by flow cycle. The standard deviation increases almost linearly. Only flow values < 5.5 were included.
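The statistic in panel (b) can be sketched as follows. This is only one plausible reading of the caption: it groups flows into cycles of four, takes each flow value's difference from its closest integer, and computes the population standard deviation of those differences per cycle. The function name and toy reads are hypothetical, and cycles are numbered from 0.

```python
from collections import defaultdict
from statistics import pstdev

FLOWS_PER_CYCLE = 4    # one flow per nucleotide, so 200 cycles give 800 flows
MAX_FLOW_VALUE = 5.5   # the caption includes only flow values below 5.5

def residual_sd_by_cycle(reads):
    """Standard deviation of (flow value - closest integer), per flow cycle."""
    residuals = defaultdict(list)
    for flows in reads:
        for i, value in enumerate(flows):
            if value >= MAX_FLOW_VALUE:
                continue
            cycle = i // FLOWS_PER_CYCLE
            residuals[cycle].append(value - round(value))
    return {cycle: pstdev(vals) for cycle, vals in sorted(residuals.items())}

# Two hypothetical reads of eight flows (two cycles) each.
toy_reads = [
    [1.0, 0.0, 1.0, 0.0, 1.2, 0.1, 0.8, 2.1],
    [1.0, 0.0, 1.0, 0.0, 0.9, 0.2, 1.1, 1.9],
]
sd = residual_sd_by_cycle(toy_reads)
```

In the toy data the second cycle is noisier than the first, mimicking in miniature the upward trend the caption reports.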

Fig. 3.

Empirical distributions (smoothed average of E. coli and D. labrax) on a logarithmic scale. In gray: fitted (log-)normal distributions.
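To show how fitted distributions of this kind could drive a simulator, here is a toy sketch and not Flowsim's actual implementation. It assumes a `TACG` flow order, a log-normal distribution for flows with no incorporated base, and a normal distribution centred on the homopolymer length otherwise. All parameter values are placeholders, not the fitted values from the figure, and the sketch ignores the degradation over flow cycles.

```python
import random

FLOW_ORDER = "TACG"  # assumed flow order, not stated in the caption

def sequence_to_run_lengths(sequence, flow_order=FLOW_ORDER, max_flows=800):
    """Ideal (noise-free) homopolymer length for each flow."""
    lengths = []
    pos = 0
    flow = 0
    while pos < len(sequence) and flow < max_flows:
        nucleotide = flow_order[flow % len(flow_order)]
        run = 0
        while pos < len(sequence) and sequence[pos] == nucleotide:
            run += 1
            pos += 1
        lengths.append(run)
        flow += 1
    return lengths

def simulate_flow_values(run_lengths, rng, sd_per_base=0.05,
                         noise_mu=-2.5, noise_sigma=0.5):
    """Draw one noisy flow value per ideal length (placeholder parameters)."""
    values = []
    for n in run_lengths:
        if n == 0:
            # Background signal when no base is incorporated (assumed log-normal).
            value = rng.lognormvariate(noise_mu, noise_sigma)
        else:
            # Signal for a run of n bases (assumed normal, spread growing with n).
            value = max(0.0, rng.gauss(n, sd_per_base * n))
        values.append(round(value, 2))
    return values

ideal = sequence_to_run_lengths("TCAGTTGGG")
flows = simulate_flow_values(ideal, random.Random(0))
```

Passing a seeded `random.Random` instance keeps the simulated flowgram reproducible between runs.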

Fig. 4.

De novo and reference-based N50 for E. coli. Both real and simulated 454 data were assembled using Newbler v2.3.
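For readers unfamiliar with the metric, N50 is the contig length at which the contigs of that length or longer cover at least half of the total. The sketch below uses the standard definition; the optional `total` argument is an assumption about how a reference-based variant might be computed (against a reference genome size instead of the assembly size), which the caption does not spell out.

```python
def n50(contig_lengths, total=None):
    """Length L such that contigs of length >= L cover at least half of `total`.

    `total` defaults to the summed contig lengths; passing a reference genome
    size instead is a hypothetical reading of "reference-based" N50.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if total is None:
        total = sum(lengths)
    half = total / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # the contigs never reach half of the requested total

# Hypothetical contig lengths.
contigs = [100, 80, 60, 40, 20]
print(n50(contigs))             # relative to the assembly size (300)
print(n50(contigs, total=400))  # relative to an assumed genome size of 400
```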
