SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data - PubMed

SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data

Murray P Cox et al. BMC Bioinformatics. 2010.

Abstract

Background: Illumina's second-generation sequencing platform is playing an increasingly prominent role in modern DNA and RNA sequencing efforts. However, rapid, simple, standardized and independent measures of run quality are currently lacking, as are tools to process sequences for use in downstream applications based on read-level quality data.

Results: We present SolexaQA, a user-friendly software package designed to generate detailed statistics and at-a-glance graphics of sequence data quality both quickly and in an automated fashion. This package contains associated software to trim sequences dynamically using the quality scores of bases within individual reads.

Conclusion: The SolexaQA package produces standardized outputs within minutes, thus facilitating ready comparison between flow cell lanes and machine runs, as well as providing immediate diagnostic information to guide the manipulation of sequence data for downstream analyses.

Figures

Figure 1

Example heat map showing several commonly observed quality defects. Nucleotide positions 1-75 are plotted from left to right along the _x_-axis; tiles 1-100 are ranked from top to bottom along the _y_-axis. (These numbers may vary for other datasets.) The scale depicts the mean probability of observing a base call error for each tile at each nucleotide position. The defects evident in this dataset (see text for details) are atypical of Illumina sequencing; this dataset was chosen specifically to illustrate the capabilities of SolexaQA.

Figure 2

Distribution of mean quality (probability of error, _y_-axis) at each nucleotide position (_x_-axis) for each tile individually (dotted black lines) and the entire dataset combined (red circles). Note the considerable variance in data quality between tiles. The defects evident in this dataset (see text for details) are atypical of Illumina sequencing; this dataset was chosen specifically to illustrate the capabilities of SolexaQA.
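The per-tile, per-position mean error probability plotted in Figures 1 and 2 can be illustrated with a minimal Python sketch. This is not SolexaQA's own code: the function names, the `(tile, qualities)` input layout, and the toy reads are assumptions made for illustration, and the sketch assumes quality values have already been decoded into Phred scores.

```python
from collections import defaultdict


def phred_to_error(q):
    """Convert a Phred quality score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def mean_error_by_tile(reads):
    """Average error probability for each (tile, position) pair.

    `reads` is an iterable of (tile, qualities) pairs (a hypothetical layout),
    where `qualities` is a list of Phred scores, one per nucleotide position.
    Returns {tile: [mean error at position 0, position 1, ...]}.
    """
    sums = defaultdict(list)    # tile -> running sum of P at each position
    counts = defaultdict(list)  # tile -> number of reads covering each position
    for tile, qualities in reads:
        for pos, q in enumerate(qualities):
            if pos == len(sums[tile]):  # first read to reach this position
                sums[tile].append(0.0)
                counts[tile].append(0)
            sums[tile][pos] += phred_to_error(q)
            counts[tile][pos] += 1
    return {t: [s / n for s, n in zip(sums[t], counts[t])] for t in sums}


# Toy input: two reads from tile 1, one read from tile 2.
toy_reads = [(1, [30, 30, 10]), (1, [30, 20, 10]), (2, [20, 20, 20])]
matrix = mean_error_by_tile(toy_reads)
```

Each row of `matrix` corresponds to one tile (one dotted line in Figure 2, or one row of the heat map in Figure 1). Averaging is done on error probabilities rather than on Phred scores, matching the probability scale the captions describe.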

Figure 3

Distribution of longest read segments passing a user-defined quality threshold (here, P = 0.05, or equivalently, Phred quality score Q ≈ 13, or a base call error rate of 1-in-20). Note that reads in this dataset would be trimmed on average to ~25 nucleotides (i.e., only approximately one-third of the initial 75 nucleotide read length). The defects evident in this dataset (see text for details) are atypical of Illumina sequencing; this dataset was chosen specifically to illustrate the capabilities of SolexaQA.
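The relationship between the threshold P = 0.05 and Q ≈ 13 follows from the standard Phred definition, Q = -10 log10(P). The sketch below shows that conversion together with one way of finding the longest contiguous read segment that passes a threshold. It is an illustration under stated assumptions, not the trimming code shipped with SolexaQA: the function names, the strict "less than" comparison, and the tie-breaking rule (first longest segment wins) are all choices made here.

```python
import math


def error_to_phred(p):
    """Phred quality score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def phred_to_error(q):
    """Error probability for a Phred quality score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def longest_passing_segment(qualities, p_threshold=0.05):
    """Locate the longest run of bases whose error probability is below p_threshold.

    Returns (start, end) with `end` exclusive, so the trimmed read is
    read[start:end]. Returns (0, 0) when no base passes. On ties, the
    earliest of the longest runs is kept (an assumption of this sketch).
    """
    best_start, best_end = 0, 0
    run_start = None  # index where the current passing run began
    for i, q in enumerate(qualities):
        if phred_to_error(q) < p_threshold:
            if run_start is None:
                run_start = i
            if i + 1 - run_start > best_end - best_start:
                best_start, best_end = run_start, i + 1
        else:
            run_start = None  # a failing base ends the current run
    return best_start, best_end


threshold_q = error_to_phred(0.05)
segment = longest_passing_segment([30, 30, 5, 30, 30, 30, 2, 30])
```

Here `threshold_q` evaluates to roughly 13.01, and `segment` is `(3, 6)`: the three consecutive Q30 bases between the two low-quality positions.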

Figure 4

Effect of dynamically trimmed versus untrimmed reads on de novo assembly with the Velvet assembler. Dynamically trimmed reads (solid symbols) relative to untrimmed reads (open symbols) yield improved N50 values (red squares) and maximum contig sizes (blue triangles). Summary statistics were averaged across de novo assemblies for 20 isolates of Campylobacter coli and C. jejuni, and normalized by the total number of reads employed in each assembly.
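For readers unfamiliar with the N50 statistic reported in Figure 4, the following is a minimal sketch of the usual definition: the contig length at which the cumulative sum of lengths, taken from longest to shortest, first reaches half of the total assembled bases. The function name and the example lengths are illustrative only, and the per-read normalization mentioned in the caption is not reproduced here.

```python
def n50(contig_lengths):
    """N50 of an assembly, given an iterable of contig lengths.

    Contigs are visited from longest to shortest; the length of the contig
    that brings the running total to at least half of all assembled bases
    is returned. An empty assembly yields 0.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0


example_n50 = n50([2, 2, 2, 3, 3, 4, 8, 8])
```

In this example the contigs total 32 bases, and the two 8-base contigs alone account for 16 of them, so `example_n50` is 8.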

Cited by

Inter-Individual Differences in the Oral Bacteriome Are Greater than Intra-Day Fluctuations in Individuals.
Sato Y, Yamagishi J, Yamashita R, Shinozaki N, Ye B, Yamada T, Yamamoto M, Nagasaki M, Tsuboi A. PLoS One. 2015 Jun 29;10(6):e0131607. doi: 10.1371/journal.pone.0131607. eCollection 2015. PMID: 26121551
Metabolic potential of lithifying cyanobacteria-dominated thrombolitic mats.
Mobberley JM, Khodadad CL, Foster JS. Photosynth Res. 2013 Nov;118(1-2):125-40. doi: 10.1007/s11120-013-9890-6. Epub 2013 Jul 19. PMID: 23868401
Genome evolution in an ancient bacteria-ant symbiosis: parallel gene loss among Blochmannia spanning the origin of the ant tribe Camponotini.
Williams LE, Wernegreen JJ. PeerJ. 2015 Apr 2;3:e881. doi: 10.7717/peerj.881. eCollection 2015. PMID: 25861561
Australian black field crickets show changes in neural gene expression associated with socially-induced morphological, life-history, and behavioral plasticity.
Kasumovic MM, Chen Z, Wilkins MR. BMC Genomics. 2016 Oct 24;17(1):827. doi: 10.1186/s12864-016-3119-y. PMID: 27776492
BIGpre: a quality assessment package for next-generation sequencing data.
Zhang T, Luo Y, Liu K, Pan L, Zhang B, Yu J, Hu S. Genomics Proteomics Bioinformatics. 2011 Dec;9(6):238-44. doi: 10.1016/S1672-0229(11)60027-2. PMID: 22289480
