SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data

Murray P Cox et al. BMC Bioinformatics. 2010.

Abstract

Background: Illumina's second-generation sequencing platform is playing an increasingly prominent role in modern DNA and RNA sequencing efforts. However, rapid, simple, standardized and independent measures of run quality are currently lacking, as are tools to process sequences for use in downstream applications based on read-level quality data.

Results: We present SolexaQA, a user-friendly software package designed to generate detailed statistics and at-a-glance graphics of sequence data quality both quickly and in an automated fashion. This package contains associated software to trim sequences dynamically using the quality scores of bases within individual reads.

Conclusion: The SolexaQA package produces standardized outputs within minutes, thus facilitating ready comparison between flow cell lanes and machine runs, as well as providing immediate diagnostic information to guide the manipulation of sequence data for downstream analyses.

Figures

Figure 1

Example heat map showing several commonly observed quality defects. Nucleotide positions 1-75 are plotted from left to right along the _x_-axis; tiles 1-100 are ranked from top to bottom along the _y_-axis. (These numbers may vary for other datasets.) The scale depicts the mean probability of observing a base call error for each tile at each nucleotide position. The defects evident in this dataset (see text for details) are atypical of Illumina sequencing; this dataset was chosen specifically to illustrate the capabilities of SolexaQA.
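The quantity plotted in such a heat map, the mean error probability for each tile at each nucleotide position, can be sketched as follows. This is a minimal illustration and not SolexaQA's own code: the function name `mean_error_by_tile` and the input layout (pairs of tile number and a list of Phred scores) are assumptions made for the example.

```python
from collections import defaultdict


def mean_error_by_tile(reads):
    """Return {(tile, position): mean error probability}.

    `reads` is a hypothetical input: an iterable of (tile, quals) pairs,
    where `quals` holds one Phred score per base. Positions are 1-based.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for tile, quals in reads:
        for pos, q in enumerate(quals, start=1):
            # Convert each Phred score to an error probability before averaging.
            sums[(tile, pos)] += 10 ** (-q / 10)
            counts[(tile, pos)] += 1
    return {key: sums[key] / counts[key] for key in sums}


example_reads = [(1, [10, 20]), (1, [20, 20]), (2, [30])]
heat = mean_error_by_tile(example_reads)
```

Note that the sketch averages probabilities rather than Phred scores, which matches the caption's "mean probability" wording; averaging Phred scores directly would give a different number.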

Figure 2

Distribution of mean quality (probability of error, _y_-axis) at each nucleotide position (_x_-axis) for each tile individually (dotted black lines) and the entire dataset combined (red circles). Note the considerable variance in data quality between tiles. The defects evident in this dataset (see text for details) are atypical of Illumina sequencing; this dataset was chosen specifically to illustrate the capabilities of SolexaQA.

Figure 3

Distribution of longest read segments passing a user-defined quality threshold (here, P = 0.05, or equivalently, Phred quality score Q ≈ 13, or a base call error rate of 1-in-20). Note that reads in this dataset would be trimmed on average to ~25 nucleotides (i.e., only approximately one-third of the initial 75 nucleotide read length). The defects evident in this dataset (see text for details) are atypical of Illumina sequencing; this dataset was chosen specifically to illustrate the capabilities of SolexaQA.
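The correspondence between P = 0.05 and Q ≈ 13 follows from the standard Phred relation Q = -10 log10(P). The sketch below shows that conversion together with one plausible reading of "longest segment passing a threshold": the longest contiguous run of bases whose individual error probability is below the cutoff. The function names are hypothetical, and the per-base criterion is an assumption; the abstract does not spell out the exact rule the package applies.

```python
import math


def phred_to_prob(q):
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def prob_to_phred(p):
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def longest_passing_segment(quals, p_threshold=0.05):
    """Return (start, end) of the longest contiguous run of bases whose
    error probability is below `p_threshold`; `end` is exclusive.
    Ties keep the earliest run. An all-failing read returns (0, 0)."""
    best_start, best_len = 0, 0
    start = None  # start index of the current passing run, if any
    for i, q in enumerate(quals):
        if phred_to_prob(q) < p_threshold:
            if start is None:
                start = i
            if i - start + 1 > best_len:
                best_start, best_len = start, i - start + 1
        else:
            start = None
    return best_start, best_start + best_len


q_cutoff = prob_to_phred(0.05)
segment = longest_passing_segment([30, 30, 5, 30, 30, 30, 2])
```

A read would then be trimmed with an ordinary slice, for example `seq[start:end]`, using the returned indices.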

Figure 4

Effect of dynamically trimmed versus untrimmed reads on _de novo_ assembly with the Velvet assembler. Dynamically trimmed reads (solid symbols) relative to untrimmed reads (open symbols) yield improved N50 values (red squares) and maximum contig sizes (blue triangles). Summary statistics were averaged across _de novo_ assemblies for 20 isolates of _Campylobacter coli_ and _C. jejuni_, and normalized by the total number of reads employed in each assembly.
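For readers unfamiliar with the N50 statistic, the following sketch uses the common definition: the largest contig length L such that contigs of length at least L together cover half or more of the total assembly. The function name `n50` is hypothetical, and this is not taken from Velvet or SolexaQA; assemblers may differ in details such as a minimum contig length.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 requires at least one contig")


example_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```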
