ART: a next-generation sequencing read simulator

Journal Article

1Biostatistics Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC 27709 and 2Department of Biology, Boston College, 140 Commonwealth Avenue, Chestnut Hill, MA 02467, USA

* To whom correspondence should be addressed.

Received: 03 October 2011; Revision received: 06 December 2011; Accepted: 19 December 2011; Published: 23 December 2011

Weichun Huang, Leping Li, Jason R. Myers, Gabor T. Marth, ART: a next-generation sequencing read simulator, Bioinformatics, Volume 28, Issue 4, February 2012, Pages 593–594, https://doi.org/10.1093/bioinformatics/btr708

Abstract

Summary: ART is a set of simulation tools that generate synthetic next-generation sequencing reads. This functionality is essential for testing and benchmarking tools for next-generation sequencing data analysis including read alignment, de novo assembly and genetic variation discovery. ART generates simulated sequencing reads by emulating the sequencing process with built-in, technology-specific read error models and base quality value profiles parameterized empirically in large sequencing datasets. We currently support all three major commercial next-generation sequencing platforms: Roche's 454, Illumina's Solexa and Applied Biosystems' SOLiD. ART also allows the flexibility to use customized read error model parameters and quality profiles.

Availability: Both source and binary software packages are available at http://www.niehs.nih.gov/research/resources/software/art

Contact: weichun.huang@nih.gov; gabor.marth@bc.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

In the past few years, high-throughput next-generation sequencing technologies have effectively replaced earlier data types for genome-wide studies measuring gene expression changes and discovering genomic/epigenetic variations, and many tools have been developed for analyzing such datasets. Simulated data are indispensable for guiding tool development and evaluating tool performance, and it is therefore essential to develop simulation software that can produce next-generation sequencing reads that capture the most essential characteristics of real data. Currently available read simulation programs include Wgsim from the Samtools package (Li et al., 2009) for generating Illumina sequencing reads, MetaSim (Richter et al., 2008) for simulating metagenomic data, Mason (http://seqan.de/projects/mason.html) for both Illumina and 454 reads, SimSeq (https://github.com/jstjohn/SimSeq) for Illumina reads and FlowSim (Balzer et al., 2010) for 454 reads. Although these programs work well in their domains, there is a need for a read simulation program that can deal with all major sequencing platforms, and generate sequence reads with both substitution and insertion–deletion (INDEL) errors, as appropriate for the error modes of each specific platform.

As a general simulator, our ART software was initially developed for simulation studies helping to design data collection modalities of the 1000 Genomes Project (Durbin et al., 2010). ART has been subsequently used by many users worldwide to facilitate sequencing software development. ART takes a set of DNA sequences (representing e.g. a reference genome), and generates ‘synthetic’ sequencing reads in a way that mimics the technology-specific sequencing process. ART comes with a set of technology-specific read error profiles, but it can also take user-supplied profiles to generate sequencing data with customized read length and error characteristics. ART can report simulated reads in the standard SAM alignment format and UCSC BED files.

2 FEATURES AND METHODS

ART simulates both single-end and paired-end sequencing reads of the three main commercial next-generation sequencing platforms: 454, Illumina and SOLiD. The built-in read length and read error profiles were derived from large sets of real sequencing data (see Supplementary Material). ART supports all three common types of sequencing errors: base substitutions, insertions and deletions.

2.1 Illumina read simulation

Illumina sequencing by synthesis is a base-by-base sequencing technology using a reversible terminator-based method, enabling detection of single bases as they are incorporated into growing DNA strands complementary to the template (Bentley, 2006). Since this technology reads out one base at a time, the main error mode is substitution rather than insertion or deletion. The probability of a substitution error is determined by the base quality score associated with the called base. The distribution of base quality scores is position-dependent: the mean quality score decreases as a function of increasing base position. ART simulates substitution errors according to the empirical, position-dependent distribution of base quality scores, measured in large training datasets. The base quality score does not directly provide information for INDEL errors, and ART simulates insertion and deletion errors directly from empirical distributions from our training data. The current version of ART comes with four empirical read quality score distributions, one for each of four different read lengths: 36, 44, 50 and 75 bp. The built-in insertion and deletion error rates were derived from 35 bp reads aligned with our modified ACANA alignment tool (Huang et al., 2006). For paired-end simulation, ART uses two different quality score distributions and error rates for the first and second reads, each determined empirically.
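
The link between base quality and substitution errors described above can be sketched as follows. This is a minimal illustration rather than ART's implementation: it assumes the standard Phred relation between a quality score Q and an error probability, 10^(-Q/10), and the profile layout (`profile[pos]` as a dictionary of quality weights), the function names and the uniform choice among the three alternative bases are all hypothetical.

```python
import random

BASES = "ACGT"


def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def sample_quality(profile, pos, rng):
    """Draw a quality score for one read position.

    `profile[pos]` is a hypothetical dict {quality: weight} standing in for
    an empirical, position-dependent quality distribution.
    """
    quals = list(profile[pos].keys())
    weights = list(profile[pos].values())
    return rng.choices(quals, weights=weights, k=1)[0]


def simulate_substitutions(template, profile, rng):
    """Return (read, qualities) with substitutions driven by sampled qualities."""
    read, quals = [], []
    for pos, base in enumerate(template):
        q = sample_quality(profile, pos, rng)
        if rng.random() < phred_to_error_prob(q):
            # Assumption: the wrong base is chosen uniformly from the other three.
            base = rng.choice([b for b in BASES if b != base])
        read.append(base)
        quals.append(q)
    return "".join(read), quals


if __name__ == "__main__":
    rng = random.Random(42)
    # Toy profile: quality tends to drop towards the 3' end of the read.
    toy_profile = [{40: 0.8, 30: 0.2}] * 4 + [{30: 0.5, 10: 0.5}] * 4
    print(simulate_substitutions("ACGTACGT", toy_profile, rng))
```

Insertions and deletions are not covered by this sketch, since the text states that ART draws them from separate empirical distributions rather than from the quality scores.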

2.2 454 read simulation

Roche/454 sequencing is a pyrosequencing technology that tests for the presence of each of the four DNA nucleotides (T, A, C, G) in a cyclical fashion. All consecutive bases within a homopolymer run are incorporated within a single cycle, and the read-out is an intensity signal that is proportional to the number of incorporated bases (Margulies et al., 2005). The dominant error mode is base over- or under-call, resulting in INDEL-type errors. While the sequencing error rate increases only slightly with the number of flow cycles, it increases dramatically with the frequency of long homopolymer runs. Accordingly, ART models the 454 sequencing error profile with homopolymer length-dependent over-call (insertion) and under-call (deletion) error distributions, and models base quality profiles as homopolymer length-dependent first-order Markov chains. ART uses an empirical distribution of 454 read lengths. By default, ART generates 454 reads with built-in distributions derived for the 454 GS FLX sequencer model.
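
The homopolymer length-dependent over-call and under-call model can be illustrated with a short sketch. The code below is not ART's model: the miscall probability function and its constants are invented for illustration (ART uses empirical distributions), over- and under-calls are assumed equally likely and limited to one base, and base quality simulation is omitted.

```python
import random
from itertools import groupby


def homopolymer_runs(seq):
    """Split a sequence into (base, run_length) pairs, e.g. AAACGG -> A3 C1 G2."""
    return [(base, len(list(group))) for base, group in groupby(seq)]


def miscall_prob(run_length, base_rate=0.01, per_base=0.02):
    """Hypothetical miscall probability that grows with homopolymer length."""
    return min(0.5, base_rate + per_base * (run_length - 1))


def simulate_454_read(template, rng):
    """Apply length-dependent over-calls (insertions) and under-calls (deletions)."""
    out = []
    for base, n in homopolymer_runs(template):
        if rng.random() < miscall_prob(n):
            # Assumption: over-call (+1) and under-call (-1) are equally likely.
            n += rng.choice([-1, 1])
        out.append(base * n)  # an under-called single base disappears entirely
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(7)
    print(simulate_454_read("ACGTTTTTTTTGGGGGGAACCCCCCCCCT", rng))
```

Working on runs rather than on single bases is what makes the errors appear as INDELs inside homopolymers, which matches the dominant 454 error mode described above.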

2.3 SOLiD read simulation

Applied Biosystems' SOLiD sequencing technology is based on ligation of oligonucleotides. It uses four fluorescent color dyes to encode the 16 different dinucleotides, each dye encoding four dinucleotides. SOLiD performs double interrogation of each base by combining the four-dye encoding scheme with a sequencing assay that samples every base (http://www.appliedbiosystems.com/absite/us/en/home/applications-technologies/solid-next-generation-sequencing.html). Unlike 454 or Illumina technology, the SOLiD base caller reports nucleotide transition color codes rather than nucleotide sequences. Accordingly, ART also generates nucleotide transition codes, or ‘color-space’ reads. For paired-end read simulations, a Gaussian distribution is used to model the distribution of DNA fragment sizes. The built-in empirical error profiles of SOLiD reads were derived from read data generated at Applied Biosystems. ART provides an option to tune sequencing error rates with a linear scaling factor.
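
The four-color dinucleotide encoding can be made concrete with a small sketch. It uses the standard SOLiD color code, in which each base is assigned a 2-bit value (A=0, C=1, G=2, T=3) and the color of a dinucleotide is the XOR of the two values, so that each of the four colors covers four of the 16 dinucleotides. The function names are illustrative and not taken from ART, and for simplicity the read is anchored on its own first base rather than on a primer base.

```python
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def encode_colorspace(seq):
    """Encode a non-empty nucleotide sequence as first base + transition colors."""
    colors = [
        str(BASE_TO_CODE[a] ^ BASE_TO_CODE[b]) for a, b in zip(seq, seq[1:])
    ]
    return seq[0] + "".join(colors)


def decode_colorspace(cs_read):
    """Decode a color-space read back to nucleotides, base by base."""
    bases = [cs_read[0]]
    for color in cs_read[1:]:
        prev = BASE_TO_CODE[bases[-1]]
        bases.append(CODE_TO_BASE[prev ^ int(color)])
    return "".join(bases)


if __name__ == "__main__":
    cs = encode_colorspace("ATGGCA")
    print(cs)                     # A31031
    print(decode_colorspace(cs))  # ATGGCA
```

Because every decoded base depends on the one before it, a single wrong color changes all downstream bases when a read is decoded naively, which is one reason simulated SOLiD reads are reported in color space.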

2.4 Performance

To test ART's speed, we used human chromosome 17 as reference, and generated reads representing 10× coverage for each of the three sequencing platforms. The test was performed on a desktop computer with Intel Xeon 2.93 GHz CPU, running a Linux operating system. This procedure took <12 min (Table 1), with Illumina reads being the fastest and SOLiD reads the slowest.

Table 1. ART simulation speed. Speed measured for generating 10× read coverage of human chromosome 17, for 454, Illumina, and SOLiD technology-specific parameters

Platform   Read length   Running time (s)      Speed (no. of reads/s)
                         Single    Paired      Single    Paired
454        Varied        491       676         7,049     10,490
Illumina   50 bp         290       300         55,997    54,130
SOLiD      33 bp         728       696         33,798    33,870
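
As a rough consistency check on Table 1, the number of reads needed for a given coverage is approximately coverage × reference length / read length, and the number of reads produced is running time × speed. The sketch below applies both to the single-end Illumina row. The chromosome 17 length of 81,195,210 bp is an assumption (the article does not state the assembly or length used), and the function names are illustrative.

```python
def reads_for_coverage(genome_length, read_length, coverage):
    """Approximate number of fixed-length reads needed for a target coverage."""
    return round(coverage * genome_length / read_length)


def reads_from_timing(seconds, reads_per_second):
    """Number of reads implied by a running time and a throughput."""
    return seconds * reads_per_second


if __name__ == "__main__":
    CHR17_LENGTH = 81_195_210  # assumed length in bp; not given in the article

    expected = reads_for_coverage(CHR17_LENGTH, read_length=50, coverage=10)
    reported = reads_from_timing(290, 55_997)  # Illumina, single-end row
    print(expected)  # 16239042
    print(reported)  # 16239130
```

Under that assumed length, the two figures agree to within about 100 reads, which suggests the reported Illumina speed and running time are mutually consistent with 10× coverage.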

†Present address: Department of Biological Sciences, Rochester Institute of Technology, Rochester, NY 14623, USA.

ACKNOWLEDGEMENTS

We would like to thank Dr Heather E. Peckham at Applied Biosystems for kindly providing SOLiD read error profiles.

Funding: Intramural Research Program of the National Institutes of Health; National Institute of Environmental Health Sciences (ES101765); National Human Genome Research Institute, National Institutes of Health (HG003698 and HG004719 to G.T.M.) in part.

Conflict of interest: none declared.

REFERENCES

Balzer et al. Characteristics of 454 pyrosequencing data–enabling realistic simulation with flowsim, Bioinformatics, 2010, vol. 26 (pg. i420-i425)

Bentley. Whole-genome re-sequencing, Curr. Opin. Genet. Dev., 2006, vol. 16 (pg. 545-552)

Durbin et al. A map of human genome variation from population-scale sequencing, Nature, 2010, vol. 467 (pg. 1061-1073)

Huang et al. Accurate anchoring alignment of divergent sequences, Bioinformatics, 2006, vol. 22 (pg. 29-34)

Li et al. The Sequence Alignment/Map format and SAMtools, Bioinformatics, 2009, vol. 25 (pg. 2078-2079)

Margulies et al. Genome sequencing in microfabricated high-density picolitre reactors, Nature, 2005, vol. 437 (pg. 376-380)

Richter et al. MetaSim: a sequencing simulator for genomics and metagenomics, PLoS One, 2008, vol. 3 pg. e3373

Author notes

Associate Editor: Martin Bishop

Published by Oxford University Press 2012.
