ART: a next-generation sequencing read simulator

Journal Article

¹Biostatistics Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC 27709 and ²Department of Biology, Boston College, 140 Commonwealth Avenue, Chestnut Hill, MA 02467, USA

* To whom correspondence should be addressed.

Received: 03 October 2011

Revision received: 06 December 2011

Accepted: 19 December 2011

Published: 23 December 2011

Weichun Huang, Leping Li, Jason R. Myers, Gabor T. Marth, ART: a next-generation sequencing read simulator, Bioinformatics, Volume 28, Issue 4, February 2012, Pages 593–594, https://doi.org/10.1093/bioinformatics/btr708
Abstract

Summary: ART is a set of simulation tools that generate synthetic next-generation sequencing reads. This functionality is essential for testing and benchmarking tools for next-generation sequencing data analysis including read alignment, de novo assembly and genetic variation discovery. ART generates simulated sequencing reads by emulating the sequencing process with built-in, technology-specific read error models and base quality value profiles parameterized empirically in large sequencing datasets. We currently support all three major commercial next-generation sequencing platforms: Roche's 454, Illumina's Solexa and Applied Biosystems' SOLiD. ART also allows the flexibility to use customized read error model parameters and quality profiles.

Availability: Both source and binary software packages are available at http://www.niehs.nih.gov/research/resources/software/art

Contact: weichun.huang@nih.gov; gabor.marth@bc.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

In the past few years, high-throughput next-generation sequencing technologies have effectively replaced earlier data types for genome-wide studies measuring gene expression changes and discovering genomic/epigenetic variations, and many tools were developed for analyzing such datasets. Simulated data are indispensable for guiding tool development and evaluating tool performance. It is therefore essential to develop simulation software that can produce next-generation sequencing reads that capture the most essential characteristics of real data. Currently available read simulation programs include Wgsim from the Samtools package (Li et al., 2009) for generating Illumina sequencing reads, MetaSim (Richter et al., 2008) for simulating metagenomic data, Mason (http://seqan.de/projects/mason.html) for both Illumina and 454 reads, SimSeq (https://github.com/jstjohn/SimSeq) for Illumina reads and FlowSim (Balzer et al., 2010) for 454 reads. Although these programs work well in their domain, there is a need for a read simulation program that can deal with all major sequencing platforms, and generate sequence reads with both substitution and insertion–deletion (INDEL) errors, as appropriate for the error modes of each specific platform.

As a general simulator, our ART software was initially developed for simulation studies helping to design data collection modalities of the 1000 Genomes Project (Durbin et al., 2010). ART has been subsequently used by many users worldwide to facilitate sequencing software development. ART takes a set of DNA sequences (representing e.g. a reference genome), and generates ‘synthetic’ sequencing reads in a way that mimics the technology-specific sequencing process. ART comes with a set of technology-specific read error profiles, but it can also take user-supplied profiles to generate sequencing data with customized read length and error characteristics. ART can report simulated reads in the standard SAM alignment format and UCSC BED files.

2 FEATURES AND METHODS

ART simulates both single-end and paired-end sequencing reads of the three main commercial next-generation sequencing platforms: 454, Illumina and SOLiD. The built-in read length and read error profiles were derived from large sets of real sequencing data (see Supplementary Material). ART supports all three types of common sequencing errors: base substitutions, insertions and deletions.

2.1 Illumina read simulation

Illumina sequencing by synthesis is a base-by-base sequencing technology using a reversible terminator-based method, enabling detection of single bases as they are incorporated into growing DNA strands complementary to the template (Bentley, 2006). Since this technology reads out one base at a time, the main error mode is substitution rather than insertion or deletion. The probability of a substitution error is determined by the base quality score associated with the called base. The distribution of base quality scores is position-dependent: the mean quality score decreases as a function of increasing base position. ART simulates substitution errors according to the empirical, position-dependent distribution of base quality scores, measured in large training datasets. The base quality score does not directly provide information for INDEL errors, and ART simulates insertion and deletion errors directly from empirical distributions from our training data. The current version of ART comes with four empirical read quality score distributions, one for each of four different read lengths: 36, 44, 50 and 75 bp. The built-in insertion and deletion error rates were derived from 35 bp reads aligned with our modified ACANA alignment tool (Huang et al., 2006). For paired-end simulation, ART uses two different quality score distributions and error rates for the first and second reads, each determined empirically.
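As a rough illustration of the quality-driven substitution model described above, the following sketch draws a quality score for each read position from a position-specific distribution and then substitutes the base with probability 10^(-Q/10), the standard Phred relationship. The function names and the toy quality profile are hypothetical and are not ART's code or built-in profiles; INDEL errors and paired-end profiles are left out.

```python
import random

def phred_to_error_prob(q):
    """Standard Phred relationship: P(error) = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def simulate_read(template, quality_profile, rng):
    """Return (read, qualities) for one read copied from `template`.

    quality_profile[i] is a list of (quality, weight) pairs for position i,
    a stand-in for an empirical position-dependent distribution.
    """
    read, quals = [], []
    for pos, base in enumerate(template):
        scores, weights = zip(*quality_profile[pos])
        q = rng.choices(scores, weights=weights, k=1)[0]
        if rng.random() < phred_to_error_prob(q):
            # Substitution error: pick one of the three other bases.
            base = rng.choice([b for b in "ACGT" if b != base])
        read.append(base)
        quals.append(q)
    return "".join(read), quals

# Toy profile (assumed): quality tends to drop toward the 3' end.
profile = [[(40 - pos, 3), (30 - pos, 1)] for pos in range(10)]
read, quals = simulate_read("ACGTACGTAC", profile, random.Random(0))
```

Passing in a seeded `random.Random` keeps a simulation reproducible, which is convenient when the simulated reads are used to benchmark downstream tools.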

2.2 454 read simulation

Roche/454 sequencing is a pyrosequencing technology that tests for the presence of each of the four DNA nucleotides (T, A, C, G) in a cyclical fashion. All consecutive bases within a homopolymer run are incorporated within a single cycle, and the read-out is an intensity signal that is proportional to the number of incorporated bases (Margulies et al., 2005). The dominant error mode is base over- or under-call, resulting in INDEL type errors. While the sequencing error rate increases only slightly with the number of flow cycles, it increases dramatically with the frequency of long homopolymer runs. Accordingly, ART models the 454 sequencing error profile with homopolymer length-dependent over-call (insertion) and under-call (deletion) error distributions, and models base quality profiles as homopolymer length-dependent first-order Markov chains. ART uses an empirical distribution of 454 read lengths. By default, ART generates 454 reads with built-in distributions derived for the 454 GS FLX sequencer model.
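The homopolymer length-dependent error idea can be sketched as follows. This is a minimal toy model under stated assumptions, not ART's implementation: the miscall probability is assumed to grow linearly with run length (the `base_rate` parameter is invented for illustration), over- and under-calls are assumed equally likely and limited to one base, and the Markov-chain quality model is not included.

```python
import random
from itertools import groupby

def call_length(true_len, rng, base_rate=0.01):
    """Observed homopolymer length for a run of true length `true_len`.

    Assumed toy model: miscall probability grows linearly with run length,
    split evenly between over-call (+1) and under-call (-1).
    """
    p_err = min(0.5, base_rate * true_len)
    u = rng.random()
    if u < p_err / 2:
        return true_len + 1  # over-call, seen as an insertion
    if u < p_err:
        return true_len - 1  # under-call, seen as a deletion
    return true_len

def simulate_454_read(template, rng):
    """Re-emit each homopolymer run with its (possibly miscalled) length."""
    out = []
    for base, run in groupby(template):
        n = len(list(run))
        out.append(base * call_length(n, rng))
    return "".join(out)

read = simulate_454_read("AAAACGGGTTTTTTC", random.Random(1))
```

Working run by run rather than base by base mirrors the flow-based read-out: an error changes the called length of a whole homopolymer, so it appears as an insertion or deletion rather than a substitution.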

2.3 SOLiD read simulation

Applied Biosystems' SOLiD sequencing technology is based on ligation of oligonucleotides. It uses four fluorescent color dyes to encode the 16 different dinucleotides, each dye encoding four dinucleotides. SOLiD performs double interrogation of each base by combining the four-dye encoding scheme with a sequencing assay that samples every base (http://www.appliedbiosystems.com/absite/us/en/home/applications-technologies/solid-next-generation-sequencing.html). Unlike 454 and Illumina technology, the SOLiD base caller reports nucleotide transition color codes rather than nucleotide sequences. Accordingly, ART also generates nucleotide transition codes, or 'color-space' reads. For paired-end read simulations, a Gaussian distribution is used to model the distribution of DNA fragment sizes. The built-in empirical error profiles of SOLiD reads were derived from the read data generated at Applied Biosystems. ART provides an option to tune sequencing error rates with a linear scaling factor.
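To make the four-color, 16-dinucleotide encoding concrete, the sketch below converts a nucleotide sequence to color space and back, using the commonly described scheme in which each color is the XOR of the 2-bit codes of two adjacent bases. The function names are illustrative and not part of ART. As a simplification, the read is anchored on its own first base, whereas real SOLiD reads are anchored on a known primer base; no sequencing errors are simulated.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def to_color_space(seq):
    """Encode a nucleotide sequence as its first base plus transition colors.

    Each color is the XOR of the 2-bit codes of adjacent bases, so each of
    the four colors stands for four of the 16 dinucleotides.
    """
    colors = [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(map(str, colors))

def from_color_space(cs):
    """Decode by walking the colors forward from the known first base."""
    bases = "ACGT"
    seq = [cs[0]]
    for c in cs[1:]:
        seq.append(bases[CODE[seq[-1]] ^ int(c)])
    return "".join(seq)

encoded = to_color_space("ATGGCA")   # "A31031"
decoded = from_color_space(encoded)  # "ATGGCA"
```

Because every base after the first is decoded from the previous one, a single wrong color alters all downstream bases when a read is decoded naively, which is one reason color-space reads are usually kept in color space during analysis.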

2.4 Performance

To test ART's speed, we used human chromosome 17 as reference, and generated reads representing 10× coverage for each of the three sequencing platforms. The test was performed on a desktop computer with an Intel Xeon 2.93 GHz CPU, running a Linux operating system. Each run took about 12 min or less (Table 1), with Illumina reads being the fastest and SOLiD reads the slowest.

Table 1. ART simulation speed. Speed measured for generating 10× read coverage of human chromosome 17, for 454, Illumina, and SOLiD technology-specific parameters

Platform	Read length	Running time, single (s)	Running time, paired (s)	Speed, single (reads/s)	Speed, paired (reads/s)
454	Varied	491	676	7,049	10,490
Illumina	50 bp	290	300	55,997	54,130
SOLiD	33 bp	728	696	33,798	33,870
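Table 1 does not list total read counts, but they can be approximated as running time multiplied by speed. The short sketch below does that arithmetic on the values copied from the table. The helper name is hypothetical, the reported speeds are presumably rounded so the products are only approximate, and the table does not say whether paired-end speeds count individual reads or read pairs.

```python
# (platform, mode) -> (running time in s, speed in reads/s), from Table 1
TABLE1 = {
    ("454", "single"): (491, 7049),
    ("454", "paired"): (676, 10490),
    ("Illumina", "single"): (290, 55997),
    ("Illumina", "paired"): (300, 54130),
    ("SOLiD", "single"): (728, 33798),
    ("SOLiD", "paired"): (696, 33870),
}

def implied_reads(seconds, reads_per_second):
    """Approximate number of reads generated in one run."""
    return seconds * reads_per_second

for (platform, mode), (seconds, speed) in TABLE1.items():
    total = implied_reads(seconds, speed)
    print(f"{platform:8s} {mode:6s} {seconds / 60:5.1f} min  ~{total:,} reads")

# Longest run in the table, in seconds.
longest = max(seconds for seconds, _ in TABLE1.values())
```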

†Present address: Department of Biological Sciences, Rochester Institute of Technology, Rochester, NY 14623, USA.

ACKNOWLEDGEMENTS

We would like to thank Dr Heather E. Peckham at Applied Biosystems for kindly providing SOLiD read error profiles.

Funding: Intramural Research Program of the National Institutes of Health; National Institute of Environmental Health Sciences (ES101765); National Human Genome Research Institute, National Institutes of Health (HG003698 and HG004719 to G.T.M.) in part.

Conflict of interest: none declared.

REFERENCES

Balzer et al. (2010) Characteristics of 454 pyrosequencing data–enabling realistic simulation with flowsim. Bioinformatics, pp. i420–i425.

Bentley (2006) Whole-genome re-sequencing. Curr. Opin. Genet. Dev., pp. 545–552.

Durbin et al. (2010) A map of human genome variation from population-scale sequencing. Nature, 467, pp. 1061–1073.

Huang et al. (2006) Accurate anchoring alignment of divergent sequences. Bioinformatics.

Li et al. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics, pp. 2078–2079.

Margulies et al. (2005) Genome sequencing in microfabricated high-density picolitre reactors. Nature, 437, pp. 376–380.

Richter et al. (2008) MetaSim: a sequencing simulator for genomics and metagenomics. PLoS One, e3373.

Author notes

Associate Editor: Martin Bishop

Published by Oxford University Press 2012.
