Improved data analysis for the MinION nanopore sequencer - PubMed (original) (raw)

Improved data analysis for the MinION nanopore sequencer

Miten Jain et al. Nat Methods. 2015 Apr.

Abstract

Speed, single-base sensitivity and long read lengths make nanopores a promising technology for high-throughput sequencing. We evaluated and optimized the performance of the MinION nanopore sequencer using M13 genomic DNA and used expectation maximization to obtain robust maximum-likelihood estimates for insertion, deletion and substitution error rates (4.9%, 7.8% and 5.1%, respectively). Over 99% of high-quality 2D MinION reads mapped to the reference at a mean identity of 85%. We present a single-nucleotide-variant detection tool that uses maximum-likelihood parameter estimates and marginalization over many possible read alignments to achieve precision and recall of up to 99%. By pairing our high-confidence alignment strategy with long MinION reads, we resolved the copy number for a cancer-testis gene family (CT47) within an unresolved region of human chromosome Xq24.

PubMed Disclaimer

Conflict of interest statement

COMPETING FINANCIAL INTERESTS

MA is a consultant to Oxford Nanopore Technologies.

Figures

Fig. 1

Fig. 1

Molecular events and ionic current trace for a 2D read of a 7.25 kb M13 phage dsDNA molecule. (a) Schematic for the steps in DNA translocation through the nanopore. (i) Open channel; (ii) dsDNA with a ligated lead adaptor (blue), with a molecular motor bound to it (orange), and a hairpin adaptor (red), is captured by the nanopore. DNA translocation through the nanopore begins through the effect of an applied voltage across the membrane and the action of a molecular motor; (iii) Translocation of the lead adaptor (blue); (iv) Translocation of the template strand (gold); (v) Translocation of the hairpin adaptor (red); (vi) Translocation of the complement strand (dark blue); (vii) Translocation of the trailing adaptor (brown); (viii) Return to open channel. (b) Raw current trace for the passage of the M13 dsDNA construct through the nanopore. Regions of the ionic current trace corresponding to steps i-viii are labeled. (c) Expanded time and current scale for raw current traces corresponding to steps i–viii. Each adaptor generates a unique current signal used to aid base calling.

Fig. 2

Fig. 2

Read length distributions and identity plots for M13. Read length histograms for mapped vs. unmapped reads across three replicate M13 experiments for (a) template; (b) complement; and (c) 2D reads. Most of the reads mapped to a known reference, with two distinct peaks at about 7.2 kb, corresponding to full-length M13, and 3.8 kb, corresponding to the phage lambda DNA (control fragment). Insets show the proportion of mappable vs. unmappable reads and the proportion of unmappable reads that found hits when compared against the NCBI NT database using BLAST (to check for contamination or missed homology). Read alignment identities for mappable reads using tuned LAST, realigned LAST, and EM trained LAST for (d) template; (e) complement; and (f) 2D reads.

Fig. 3

Fig. 3

Maximum-likelihood (ML) alignment parameters derived using expectation-maximization (EM). The process starts from four guide alignments each generated with a different mapper using tuned parameters. (a) Insertion vs. deletion rates, expressed as events per aligned base. (b) Insertion or deletion (indel) events per aligned base vs. rate of mismatches per aligned base (see Supplement). Rates vary strongly between different guide alignments, however, EM training and realignment results in very similar rates (grey circles), regardless of the initial guide alignment. (c) Matrix for substitution emissions determined using EM reveals very low rates of A-to-T and T-to-A substitutions.

Fig. 4

Fig. 4

M13 sequencing depth. (a) The magenta line denotes coverage by position in the genome, and the dotted blue line depicts local G/C% for that position. Ignoring sites close to the ends of the reference, which appear to be affected by adaptor trimming, the coverage drops at polymeric nucleotide runs. Coverage was calculated by binning over a sliding 5 bp window. G/C content was calculated by binning over a 50 bp sliding window. (b) Coverage depth distribution fitted with a generalized extreme value distribution.

Fig. 5

Fig. 5

Exploring single nucleotide variant (SNV) calling with MinION reads. (a) Variant calling with substitution frequency of 1%. (b) Variant calling with substitution frequency of 5%. Dotted lines in both (a) and (b) represent variant calling using a simple transducer model, using a tuned LAST alignment and giving all substitutions equal probability. Different sampled read coverages are shown. Each curve is produced by varying the posterior base calling threshold to trade off precision for recall. Solid lines in both (a) and (b) represent variant calling using the same simple transducer model as in the dotted lines, but trained by EM and incorporating marginalization over the read to reference alignments. Results shown are averaged over three replicate M13 experiments, and for each coverage level, three samplings of the reads. ALL curve reflects all the available data for each experiment. (c) The distribution of posterior match probabilities show that there is substantial uncertainty in most matches and explain why marginalizing over the read alignments is a powerful approach.

Fig. 6

Fig. 6

Resolution of CT47 repeat copy number estimate on human chromosome Xq24. (a) BAC end sequence alignments (RP11-482A22: AQ630638 and AZ517599) span a 247 kb region, including thirteen annotated CT47 genes (each defined within a 4.8 kb tandem repeat) and a 50 kb scaffold gap in the GRCh38/hg38 reference assembly. (b) Utilizing MinION long-reads obtained from RP11-482A22 high-molecular weight BAC DNA, nine reads span the length of the CT47-repeat region providing evidence for eight tandem copies of the CT47-repeat. Insert size estimate (170–175 kb, as determined by pulse-field gel electrophoresis) is noted as a dotted line, with flanking regions (upstream: 57 kb and downstream region: 73 kb, black line) and repeat region (37-to-42 kb, or 7.5-to-8.75 copies of the repeat, blue line). Single copy regions directly before the CT47 repeats are shown in orange (6.6 kb) and green (2.6 kb), repeat copies are labeled in blue, and grey lines describe read alignment into flanking region. The size of the repeat region are provided on the left (range 36 kb ’ 42 kb). (c) Shearing the BAC DNA to increase sequence coverage provided copy number estimates by read depth. All bases not included in the CT47 repeat unit are labeled as flanking region (grey distribution, mean: 46.2 base coverage). Base coverage across the CT47 repeats are summarized over one copy of the repeat to provide an estimate of the combined number (dark blue distribution, mean: 329.3 base coverage), and are similar to single copy estimates when normalized for eight copies (light blue distribution, mean: 41.15 base coverage).

Comment in

Similar articles

Cited by

References

    1. Chaisson M, Tesler G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC bioinformatics. 2012;13:238. - PMC - PubMed
    1. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. 2013;00:3.
    1. Frith MC, Wan R, Horton P. Incorporating sequence quality data into alignment improves DNA read mapping. Nucleic acids research. 2010;38:e100. - PMC - PubMed
    1. Harris RS. Ph.D. thesis. The Pennsylvania State University; 2007. Improved pairwise alignment of genomic DNA.
    1. Benson DA, et al. GenBank. Nucleic acids research. 2013;41:D36–42. - PMC - PubMed

Publication types

MeSH terms

Grants and funding

LinkOut - more resources