Improved variant discovery through local re-alignment of short-read next-generation sequencing data using SRMA

Nils Homer et al. Genome Biol. 2010.

Abstract

A primary component of next-generation sequencing analysis is to align short reads to a reference genome, with each read aligned independently. However, reads that observe the same non-reference DNA sequence are highly correlated and can be used to better model the true variation in the target genome. A novel short-read micro realigner, SRMA, that leverages this correlation to better resolve a consensus of the underlying DNA sequence of the targeted genome is described here.

Figures

Figure 1

Local re-alignment receiver operating characteristic curves for simulated human genome re-sequencing data. A synthetic diploid human genome with SNPs, deletions, and insertions was created from a reference human genome (hg18) as described in the main text. One billion paired 50-mer reads for both base space and color space were simulated from this synthetic genome to assess the true positive and false positive rates of variant calling after re-sequencing. The simulated dataset was aligned with BWA (v.0.5.7-5) with the default parameters [9]. The BWA alignments were locally re-aligned with SRMA with variant-inclusive settings (c = 2 and p = 0.1). The alignments from BWA and SRMA were variant called using the MAQ consensus model implemented in SAMtools (v.0.1.17) using the default settings [10,20]. The resulting variant calls were assessed for accuracy by comparing the called variants against the known introduced sites of variation. An increasing SNP quality filter was used to generate each curve.
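
The thresholding behind such a curve can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the `roc_points` function, the `(chromosome, position)` site representation, and the use of a fixed count of non-variant sites as the false positive denominator are all assumptions made for the example.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical site representation: (chromosome, 1-based position).
Site = Tuple[str, int]


def roc_points(
    calls: Iterable[Tuple[Site, float]],
    truth: Set[Site],
    n_negative_sites: int,
) -> List[Tuple[float, float, float]]:
    """Return (threshold, false positive rate, true positive rate) per quality cutoff.

    Assumption: the false positive rate is the number of false calls divided by
    a caller-supplied count of sites that carry no introduced variant.
    """
    calls = list(calls)
    points = []
    # Each distinct quality value is used as a minimum-quality filter.
    for threshold in sorted({quality for _, quality in calls}):
        kept = {site for site, quality in calls if quality >= threshold}
        true_positives = len(kept & truth)
        false_positives = len(kept - truth)
        tpr = true_positives / len(truth) if truth else 0.0
        fpr = false_positives / n_negative_sites if n_negative_sites else 0.0
        points.append((threshold, fpr, tpr))
    return points


# Toy data: three introduced variants and four calls, one of them false.
truth = {("chr1", 100), ("chr1", 250), ("chr2", 40)}
calls = [
    (("chr1", 100), 60.0),
    (("chr1", 250), 20.0),
    (("chr1", 300), 10.0),  # not in the truth set, so a false positive
    (("chr2", 40), 45.0),
]
points = roc_points(calls, truth, n_negative_sites=1000)
```

Raising the threshold removes the low-quality false call first, then starts discarding true variants, which is the trade-off each curve traces.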

Figure 2

Allele frequency distribution with local re-alignment of U87MG. SRMA was applied to the alignments produced with BFAST of a human cancer cell line (U87MG; SRA009912.1). Variants were called with SAMtools before and after application of SRMA (see Materials and methods). Homozygous and heterozygous calls were examined independently using zygosity calls produced by SAMtools. The observed non-reference allele frequencies for SNPs, deletions, and insertions are plotted for homozygous (left panels) and heterozygous (right panels) variants. Ideally, non-reference allele frequencies for homozygous and heterozygous variants approach 1.0 and 0.5, respectively. The absolute counts of observed variants are plotted (y-axis) against non-reference allele frequency ranges (x-axis).
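
The quantity on the x-axis can be illustrated with a short sketch. It assumes each variant site is summarized as a pair of read counts (reference-supporting, non-reference-supporting); the function names and the choice of ten equal-width bins are hypothetical and not taken from the paper.

```python
from collections import Counter
from typing import Iterable, Tuple


def non_ref_allele_frequency(ref_reads: int, alt_reads: int) -> float:
    """Fraction of reads at a site that support the non-reference allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return alt_reads / total


def frequency_histogram(sites: Iterable[Tuple[int, int]], n_bins: int = 10) -> Counter:
    """Count sites per frequency bin; bin i covers [i/n_bins, (i+1)/n_bins)."""
    counts: Counter = Counter()
    for ref_reads, alt_reads in sites:
        total = ref_reads + alt_reads
        if total == 0:
            raise ValueError("no reads cover this site")
        # Integer arithmetic avoids floating-point edge cases at bin borders;
        # a frequency of exactly 1.0 is placed in the last bin.
        bin_index = min(alt_reads * n_bins // total, n_bins - 1)
        counts[bin_index] += 1
    return counts


# Toy read counts as (reference reads, non-reference reads).
het_sites = [(10, 10), (12, 8), (9, 11)]
hom_sites = [(0, 20), (1, 19)]
het_hist = frequency_histogram(het_sites)
hom_hist = frequency_histogram(hom_sites)
```

With these toy counts the heterozygous sites fall in the bins around 0.5 and the homozygous sites in the last bin, which is the ideal pattern the caption describes.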

Figure 3

dbSNP concordance before and after local re-alignment of U87MG. SRMA was applied to the alignments produced with BFAST of a human cancer cell line (U87MG; SRA009912.1). Variants were called with SAMtools before and after application of SRMA (see Materials and methods). Deletions and insertions (indels) called within U87MG were compared with those indels reported in dbSNP (v129). An increasing minimum SNP quality filter was used to improve concordance (y-axis) while reducing the number of indels observed at dbSNP positions (x-axis). Using SRMA significantly reduced the discordance (one minus concordance) between the indels called at dbSNP positions and the indels reported in dbSNP.

Figure 4

SNP microarray concordance with known genotypes before and after local re-alignment of U87MG. SRMA was applied to the alignments produced with BFAST of a human cancer cell line (U87MG; SRA009912.1). Heterozygous genotypes from an Illumina SNP microarray were compared with genotypes called from sequence data before and after application of SRMA (see Materials and methods). A minimum threshold was applied to each of three different variant-calling metrics in turn to improve the concordance (y-axis) while reducing the total number of SNP positions on the microarray that were called. Regardless of the metric, SRMA reduced the discordance (one minus concordance) between the heterozygous SNPs reported by the SNP microarray and those called from the sequencing data.

Figure 5

A deletion and SNP in ALPK2 in U87MG. SRMA was applied to the alignments produced with BFAST of a human cancer cell line (U87MG; SRA009912.1). (a,b) The resulting alignments from within the coding region of ALPK2 (chr18:54,355,303-54,355,477) are shown before applying SRMA (a) and after applying SRMA (b). In this haploid region, Sanger sequencing confirmed a 15-bp deletion and a C-to-T SNP eight bases downstream of the deletion. Panel (a) shows the difficulty of aligning sequence reads from a region with a large deletion and a SNP, as false variation is observed (SNPs and indels). Nevertheless, some reads in (a) (BFAST) do correctly observe the deletion and SNP, which are therefore included in the variant graph created by SRMA. After local re-alignment using SRMA (b), the majority of the reads support the presence of the deletion and SNP, while false variation has been eliminated. The Integrative Genomics Viewer was used to view the alignments [30].

Figure 6

The creation of a variant graph. Four alignments (left) are successively used to create a variant graph (right). (a) An alignment of a read that matches the reference. The associated variant graph consists of nodes that represent each base of the read. (b) An alignment of a read with a base difference at the second position. The base difference adds a new node that is connected to the existing first and third nodes. (c) An alignment of a read that has a base difference and a deletion relative to the reference. A new edge connecting the sixth and ninth nodes is added to the graph. (d) An alignment of a read that has a base difference, a deletion, and an insertion relative to the reference. Two new nodes are added, creating a path from the previously existing SNP at the second position to the reference base at the second position. (e) The resulting variant graph, in which each edge is labeled with the number of alignment paths that contain it.
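
The construction described in this caption can be sketched in a few lines. This is a toy model under stated assumptions, not SRMA's implementation: the `VariantGraph` class, the `(reference position, insertion offset, base)` node key, the simplified CIGAR input restricted to M, I, and D operations, and the ten-base reference are all hypothetical, and the four example reads only loosely mirror panels (a) to (d).

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical node key: (reference position, insertion offset, base).
# Offset 0 means the base is aligned to that reference position;
# offset k > 0 means it is the k-th base inserted after that position.
Node = Tuple[int, int, str]


class VariantGraph:
    """Nodes are observed bases; edge counts record how many reads use each edge."""

    def __init__(self) -> None:
        self.nodes: Set[Node] = set()
        self.edge_counts: Dict[Tuple[Node, Node], int] = defaultdict(int)

    def add_alignment(self, path: List[Node]) -> None:
        self.nodes.update(path)
        for src, dst in zip(path, path[1:]):
            self.edge_counts[(src, dst)] += 1


def path_from_alignment(start: int, read: str, cigar: List[Tuple[str, int]]) -> List[Node]:
    """Convert a read and simplified CIGAR operations (M, I, D) into graph nodes."""
    path: List[Node] = []
    ref_pos, read_pos = start, 0
    for op, length in cigar:
        if op == "M":  # match or mismatch: consumes read and reference
            for _ in range(length):
                path.append((ref_pos, 0, read[read_pos]))
                ref_pos += 1
                read_pos += 1
        elif op == "I":  # insertion: consumes read only
            for k in range(1, length + 1):
                path.append((ref_pos - 1, k, read[read_pos]))
                read_pos += 1
        elif op == "D":  # deletion: skips reference bases, adding only an edge
            ref_pos += length
        else:
            raise ValueError(f"unsupported CIGAR operation: {op}")
    return path


# Toy reference ACGTACGTAC at positions 1..10 (an assumption for illustration).
alignments = [
    ("ACGTACGTAC", [("M", 10)]),                                     # matches reference
    ("ATGTACGTAC", [("M", 10)]),                                     # SNP at position 2
    ("ATGTACAC", [("M", 6), ("D", 2), ("M", 2)]),                    # SNP + deletion
    ("ATGGTACAC", [("M", 2), ("I", 1), ("M", 4), ("D", 2), ("M", 2)]),  # + insertion
]
graph = VariantGraph()
for read, cigar in alignments:
    graph.add_alignment(path_from_alignment(1, read, cigar))
```

Because a deletion contributes only an edge that skips reference positions, reads sharing the same deletion reinforce the same edge count, which is the information a re-aligner can use to favor well-supported paths.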

References

    1. Kent WJ, Haussler D. Assembly of the working draft of the human genome with GigAssembler. Genome Res. 2001;11:1541–1548. doi: 10.1101/gr.183201.
    2. Myers EW. The fragment assembly string graph. Bioinformatics. 2005;21(Suppl 2):ii79–85. doi: 10.1093/bioinformatics/bti1114.
    3. Pevzner PA, Tang H, Waterman MS. An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci USA. 2001;98:9748–9753. doi: 10.1073/pnas.171285098.
    4. Simpson JT, Durbin R. Efficient construction of an assembly string graph using the FM-index. Bioinformatics. 2010;26:i367–373. doi: 10.1093/bioinformatics/btq217.
    5. Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009;19:1117–1123. doi: 10.1101/gr.089532.108.
