Comparison of solution-based exome capture methods for next generation sequencing

Comparative Study

doi: 10.1186/gb-2011-12-9-r94.

Anna-Maija Sulonen, Pekka Ellonen, Henrikki Almusa, Maija Lepistö, Samuli Eldfors, Sari Hannula, Timo Miettinen, Henna Tyynismaa, Perttu Salo, Caroline Heckman, Heikki Joensuu, Taneli Raivio, Anu Suomalainen, Janna Saarela

Anna-Maija Sulonen et al. Genome Biol. 2011.

Abstract

Background: Techniques enabling targeted re-sequencing of the protein coding sequences of the human genome on next generation sequencing instruments are of great interest. We conducted a systematic comparison of the solution-based exome capture kits provided by Agilent and Roche NimbleGen. A control DNA sample was captured with all four capture methods and prepared for Illumina GAII sequencing. Sequence data from additional samples prepared with the same protocols were also used in the comparison.

Results: We developed a bioinformatics pipeline for quality control, short read alignment, variant identification and annotation of the sequence data. In our analysis, a larger percentage of the high quality reads from the NimbleGen captures than from the Agilent captures aligned to the capture target regions. High GC content of the target sequence was associated with poor capture success in all exome enrichment methods. Comparison of mean allele balances for heterozygous variants indicated a tendency to have more reference bases than variant bases in the heterozygous variant positions within the target regions in all methods. There was virtually no difference in the genotype concordance compared to genotypes derived from SNP arrays. A minimum of 11× coverage was required to make a heterozygote genotype call with 99% accuracy when compared to common SNPs on genome-wide association arrays.

Conclusions: Libraries captured with NimbleGen kits aligned more accurately to the target regions. The updated NimbleGen kit most efficiently covered the exome with a minimum coverage of 20×, yet none of the kits captured all the Consensus Coding Sequence annotated exons.

Figures

Figure 1

Comparison of the probe designs of the exome capture kits against CCDS exon annotations. (a, b) Given are the numbers of CCDS exon regions, common target regions outside CCDS annotations and the regions covered individually by the Agilent SureSelect and NimbleGen SeqCap sequence capture kits (a) and the Agilent SureSelect 50 Mb and NimbleGen SeqCap v2.0 sequence capture kits (b). Regions of interest are defined as merged genomic positions regardless of their strandedness, which overlap with the kit in question. Sizes of the spheres are proportional to the number of targeted regions in the kit. Total numbers of targeted regions are given under the name of each sphere.
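The region definition in this caption (merged genomic positions, regardless of strand) can be illustrated with a minimal sketch. It is not the authors' code: the function name `merge_regions`, the tuple layout and the half-open coordinate convention are assumptions made for illustration, and adjacent (touching) intervals are merged here as a design choice.

```python
from collections import defaultdict


def merge_regions(regions):
    """Merge overlapping or touching intervals per chromosome, ignoring strand.

    regions: iterable of (chrom, start, end, strand) tuples with half-open
    coordinates [start, end). This layout is an assumption for illustration.
    Returns a list of (chrom, start, end) tuples sorted by chromosome and start.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, _strand in regions:
        # The strand is deliberately discarded before merging.
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                current_start, current_end = start, end
            elif start <= current_end:
                # Overlapping or touching: extend the current region.
                current_end = max(current_end, end)
            else:
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        merged.append((chrom, current_start, current_end))
    return merged


example = [
    ("chr1", 100, 200, "+"),
    ("chr1", 150, 250, "-"),  # overlaps the first region on the other strand
    ("chr1", 300, 400, "+"),
    ("chr2", 10, 20, "-"),
]
print(merge_regions(example))
```

Counting regions after such a merge avoids counting the same genomic positions twice when probes or annotations on opposite strands overlap.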

Figure 2

Overview of the variant calling pipeline (VCP). The VCP consists of sequence analysis software and in-house built algorithms, and its output gives a wide variety of sequencing results. Sequence reads are first filtered for quality. Sequence alignment is then performed with BWA, followed by duplicate removal, variant calling with SAMtools' pileup and in-house developed algorithms for SNV calling with qualities and REA calling. File transformation programs are used to convert different file formats between the software. White boxes, files and intermediate data; purple boxes, filtering steps; grey ellipses, software and algorithms; green boxes, final VCP output; yellow boxes, files for data visualization; area circled with blue dashed line, VCP analysis options not used in this study. PE, paired end.

Figure 3

Number of fully covered CCDS transcripts with different minimum coverage thresholds. For each exon, median coverage was calculated as the sum of sequencing coverage on every nucleotide in the exon divided by the length of the exon. If all the annotated exons of a transcript had a median coverage above a given threshold, the transcript was considered to be completely covered. The number of all CCDS transcripts is 23,634.
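The rule in this caption can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function names and the input layout (a dictionary of per-exon depth lists) are hypothetical. Note that the calculation described in the caption (summed per-base coverage divided by exon length) is arithmetically a mean, even though the caption calls it a median; the sketch follows the described calculation and treats "above" as strictly greater than.

```python
def exon_coverage(depths):
    """Per-exon coverage as described in the caption: the sum of per-base
    sequencing depth divided by the exon length (arithmetically, a mean)."""
    return sum(depths) / len(depths)


def fully_covered_transcripts(transcripts, threshold):
    """Return names of transcripts whose every exon exceeds the threshold.

    transcripts: dict mapping a transcript name to a list of exons, each
    exon being a non-empty list of per-base depths (hypothetical layout).
    """
    return [
        name
        for name, exons in transcripts.items()
        if exons and all(exon_coverage(d) > threshold for d in exons)
    ]


toy = {
    "T1": [[25, 30, 35], [22, 22]],  # exon coverages 30.0 and 22.0
    "T2": [[40, 40], [5, 10]],       # exon coverages 40.0 and 7.5
}
print(fully_covered_transcripts(toy, 20))
print(fully_covered_transcripts(toy, 4))
```

A single poorly covered exon is enough to exclude a transcript, which is why the count of fully covered transcripts falls quickly as the threshold rises.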

Figure 4

Number of identified novel and known single nucleotide variants. SNVs were called with SAMtools pileup, and the called variants were filtered based on the allele quality ratio in VCP. Numbers are given for variants with a minimum sequencing depth of 20× in the capture target region (CTR) and CCDS annotated exon regions (CCDS) for the control I sample. Mean numbers for the variants found in the CTRs of the additional samples are also given (CTR Mean). Dark grey bars represent Agilent SureSelect (left panel) and SureSelect 50 Mb (right panel); black bars represent NimbleGen SeqCap (left panel) and SeqCap v2.0 (right panel); light grey bars represent novel SNPs (according to dbSNP b130).

Figure 5

Sharing of single nucleotide variants between the exome capture kits. The number of all sequenced variants in the common target region was specified as the combination of all variants found with a minimum coverage of 20× in any of the exome capture kits (altogether, 15,044 variants). Variable positions were then examined for sharing between all kits, both Agilent kits, both NimbleGen kits, Agilent SureSelect kit and NimbleGen SeqCap kit, and Agilent SureSelect 50 Mb kit and NimbleGen SeqCap v2.0 kit. Numbers for the shared variants between the kits in question are given, followed by the number of shared variants with the same genotype calls. The diagram is schematic, as the sharing between Agilent SureSelect and NimbleGen SeqCap v2.0, Agilent SureSelect 50 Mb and NimbleGen SeqCap or any of the combinations of three exome capture kits is not illustrated.
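The sharing counts described in this caption amount to set operations on variant positions. The sketch below is an assumption-laden illustration rather than the authors' code: the names `union_positions` and `shared_variants`, the `(chrom, pos)` keys and the genotype strings are all hypothetical, and each input is assumed to be already filtered to calls with at least 20× coverage in the common target region.

```python
def union_positions(*kits):
    """All variant positions seen in any kit (the combined variant set)."""
    return set().union(*(calls.keys() for calls in kits))


def shared_variants(calls_a, calls_b):
    """Count positions called in both kits, and those with identical genotypes.

    calls_a, calls_b: dicts mapping (chrom, pos) to a genotype string
    (hypothetical layout). Returns (n_shared, n_shared_same_genotype).
    """
    shared = calls_a.keys() & calls_b.keys()
    same_genotype = {pos for pos in shared if calls_a[pos] == calls_b[pos]}
    return len(shared), len(same_genotype)


kit_a = {("chr1", 1000): "A/G", ("chr1", 2000): "C/C", ("chr2", 500): "T/C"}
kit_b = {("chr1", 1000): "A/G", ("chr1", 2000): "C/T", ("chr3", 42): "G/G"}

print(len(union_positions(kit_a, kit_b)))
print(shared_variants(kit_a, kit_b))
```

Reporting both numbers separates two questions: whether a position was called as variant by both kits, and whether the two kits agreed on the genotype at that position.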

Figure 6

Correlation of sequenced genotypes to the SNP chip genotypes. SAMtools' pileup genotype calls recalled with quality ratios in the VCP were compared with the Illumina Human660W-Quad v1 SNP chip genotypes. (a) The correlations for Agilent SureSelect- and NimbleGen SeqCap-captured sequenced genotypes. (b) The correlations for SureSelect 50 Mb- and SeqCap v2.0-captured sequenced genotypes. Correlations for heterozygous, reference homozygous and variant homozygous SNPs (according to the chip genotype call) are presented on separate lines, though the lines for homozygous variants, lying near 100% correlation, cannot be visualized. The x-axis represents the cumulative minimum coverage of the sequenced SNPs.
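A concordance curve over cumulative minimum coverage, as plotted in this figure, can be sketched as below. This is a simplified illustration under stated assumptions: the function name, the `(depth, sequenced genotype, chip genotype)` tuple layout and the use of simple genotype equality as the agreement measure are hypothetical, and the sketch does not split SNPs by chip genotype class as the figure does.

```python
def concordance_by_min_coverage(snps, thresholds):
    """Fraction of sequenced genotypes matching the chip genotype, computed
    over all SNPs whose sequencing depth is at least each threshold.

    snps: list of (depth, sequenced_genotype, chip_genotype) tuples
    (hypothetical layout). Returns a dict threshold -> fraction, or None
    when no SNP reaches the threshold.
    """
    result = {}
    for threshold in thresholds:
        kept = [(seq, chip) for depth, seq, chip in snps if depth >= threshold]
        if kept:
            matches = sum(seq == chip for seq, chip in kept)
            result[threshold] = matches / len(kept)
        else:
            result[threshold] = None
    return result


toy_snps = [
    (5, "AG", "AA"),   # low depth, discordant call
    (8, "GG", "GG"),
    (12, "AG", "AG"),
    (30, "AA", "AA"),
]
print(concordance_by_min_coverage(toy_snps, [1, 10, 50]))
```

Because each threshold keeps every SNP at or above that depth, the curve is cumulative: raising the threshold discards the low-depth calls, where disagreement with the chip is most likely.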
