Consensus rules in variant detection from next-generation sequencing data

Peilin Jia et al. PLoS One. 2012.

Abstract

A critical step in detecting variants from next-generation sequencing data is post hoc filtering of putative variants called or predicted by computational tools. Here, we highlight four critical parameters that could enhance the accuracy of called single nucleotide variants and insertions/deletions: quality and depth, refinement and improvement of initial mapping, allele/strand balance, and examination of spurious genes. Appropriate use of these sequence features in variant filtering could greatly improve validation rates, thereby saving time and costs in next-generation sequencing projects.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Pipelines for calling SNVs and indels.

SNVs and indels are called by three options based on SAMtools (pileup or mpileup) and GATK recalibration. Accordingly, three tiers of SNVs and indels are used for comparison. SNVs: single nucleotide variants. Indels: insertions and deletions.

Figure 2. Distribution of accuracy versus recall by different combinations of quality score (QUAL) and read depth (DP) values in two sets (tiers 1 and 2) of SNVs and indels.

(a) Tier One SNVs. (b) Tier Two SNVs. (c) Tier One Indels. (d) Tier Two Indels. For each variant set (panel), each node represents a combination of cutoff values for QUAL and DP. Specifically, the QUAL cutoff took each integer value from 15 to 35 in increments of 1, and the DP cutoff took each integer value from 3 to 15 in increments of 1. We then evaluated the accuracy, recall, and F score (see text) for each cutoff combination. Note that many nodes overlap on the panel and are shown with jitter (i.e., points at the same location are slightly shifted for visibility). The combination of values that generated the highest F score was selected (shown as red points).
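
The cutoff search described in this caption can be sketched as a small grid search. The Python below is a minimal sketch, not the authors' code. It assumes each variant is a (QUAL, DP, validated) tuple, that a variant is retained when both values are at or above the cutoffs, that accuracy is the fraction of retained variants that validated, that recall is the fraction of validated variants that are retained, and that the F score is their harmonic mean; the paper's exact definitions are given in its main text. The function name `best_cutoffs` and the toy data are hypothetical.

```python
from itertools import product

def best_cutoffs(variants, qual_range=range(15, 36), dp_range=range(3, 16)):
    """Return (f_score, qual_cutoff, dp_cutoff) for the best cutoff pair.

    variants: iterable of (qual, dp, validated) tuples (assumed layout).
    Ties keep the first combination encountered.
    """
    variants = list(variants)
    total_true = sum(1 for _, _, ok in variants if ok)
    best = (0.0, None, None)
    for q_cut, d_cut in product(qual_range, dp_range):
        # Validation status of every variant that survives both cutoffs.
        kept = [ok for qual, dp, ok in variants if qual >= q_cut and dp >= d_cut]
        true_kept = sum(kept)
        if true_kept == 0:
            continue  # F score is undefined or zero for this combination
        accuracy = true_kept / len(kept)   # assumed: validated / retained
        recall = true_kept / total_true    # assumed: retained validated / all validated
        f_score = 2 * accuracy * recall / (accuracy + recall)
        if f_score > best[0]:
            best = (f_score, q_cut, d_cut)
    return best

# Hypothetical toy data: two validated and two failed variants.
toy = [(40, 20, True), (30, 10, True), (16, 4, False), (20, 5, False)]
print(best_cutoffs(toy))
```

On real data many cutoff pairs typically reach the same F score, which is consistent with the overlapping nodes noted in the caption, so the tie-breaking rule matters and is a design choice of this sketch.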

Figure 3. Distribution of read depth (DP) versus SNV quality score (QUAL) for the SNVs or indels selected for validation.

(a) Tier One SNVs (159 SNVs), (b) Tier Two SNVs (145 SNVs), (c) Tier One Indels (22 indels), and (d) Tier Two Indels (19 indels). Variants in blue passed validation, and variants in red failed validation. In each panel, the vertical dashed line indicates the cutoff value for QUAL, and the horizontal dashed line indicates the cutoff value for DP (see Point 2 in the main text and Table 1).

Figure 4. Allele and strand bias for SNVs.

This figure shows the distribution of reads for called variants across reference or alternative (i.e., non-reference) alleles on the forward or reverse strand. (a) Tier One SNVs that passed validation. (b) Tier One SNVs that failed validation. (c) Tier Two SNVs that passed validation. (d) Tier Two SNVs that failed validation. Red: reference base forward; pink: reference base reverse; blue: alternative base forward; and cyan: alternative base reverse. The arrows under the x-axis mark variants that lacked supporting reads for one or more of the four allele/strand cases.

Figure 5. An illustration of Fisher’s exact test for allele and strand balance.

On the top panel (a), the table shows how we summarized the counts for each mutation site (shown in each column and denoted by M) in each of the four cases: reference forward, reference reverse, alternative forward, and alternative reverse. A variant is indicated by 1 if it does not have a supporting read in one or more cases; otherwise, it is indicated by 0. The contingency tables for the Tier One dataset and Tier Two dataset were constructed as shown in (b) and (c), respectively.
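
The indicator and test described in this caption can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the 2x2 table is assumed to cross the missing-support indicator (1 or 0) with validation outcome (passed or failed), and the two-sided p-value uses the common convention of summing all table probabilities no larger than that of the observed table.

```python
from math import comb

def lacks_support(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Return 1 if any of the four allele/strand read counts is zero, else 0."""
    return int(min(ref_fwd, ref_rev, alt_fwd, alt_rev) == 0)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over every table that is no more probable than the observed one;
    # the small tolerance guards against floating-point ties.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)

# Hypothetical counts, not taken from the paper:
# rows = indicator 1 / indicator 0, columns = failed / passed validation.
print(fisher_exact_two_sided(3, 1, 1, 3))
```

A small p-value under this layout would suggest that missing support in one of the four allele/strand cases is associated with validation failure.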

Figure 6. A visual examination of a spurious gene (CDC27).

The top panels show read alignments in good (a) and bad (b) conditions, visualized using the software IGV. The top part of each figure shows the coverage. Each bar represents one read, with grey indicating that the read matches the reference well and other colors indicating mismatches. Panel (c) shows the distribution of mapping quality (MAPQ) of all the reads in a representative sample. MAPQ is defined as -10×log10 Pr(mapping position is wrong), rounded to the nearest integer. As shown on the x-axis in (c), MAPQ ranges between 0 and 60 in this sample, with 60 indicating the best mapping. The y-axis in (c) is the number of reads in this sample. Panel (d) shows the distribution of MAPQ of all the reads in a sample and of the reads mapped to CDC27 exon regions. The y-axis in (d) is the proportion of reads in each MAPQ range (x-axis).
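
The MAPQ definition in this caption converts directly to code. The short Python sketch below only restates that formula and its inverse; the function names are hypothetical, and real aligners may cap or otherwise adjust the reported value.

```python
import math

def mapq_from_error_prob(p_wrong):
    """MAPQ = -10 * log10(Pr(mapping position is wrong)), rounded to an integer."""
    if not 0 < p_wrong <= 1:
        raise ValueError("p_wrong must be in (0, 1]")
    return round(-10 * math.log10(p_wrong))

def error_prob_from_mapq(mapq):
    """Invert the definition: probability that the mapping position is wrong."""
    return 10 ** (-mapq / 10)

print(mapq_from_error_prob(0.001))   # 1-in-1000 chance of a wrong position
print(error_prob_from_mapq(60))      # the best MAPQ seen in this sample
```

Under this definition a MAPQ of 60 corresponds to an error probability of about one in a million, while a MAPQ of 0 means the reported position carries no confidence.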

Figure 7. Detection of spurious genes.

RPE: the number of Reads Per Exon after adjusting for the length of the exon and the overall sequencing depth per sample. PHQR: the Proportion of High-Quality Reads for each exon. Each point represents an exon. The grey points represent all the exons in one sample. The red points indicate the distribution of the 13th exon of the gene CDC27 in all 36 samples, and the purple points indicate the distribution of the 42nd exon of the gene MLL3 in all 36 samples; both are representative spurious genes whose variants failed experimental validation. The vertical dashed line is set at RPE = 1.5 and the horizontal dashed line at PHQR = 0.4.
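
The two thresholds in this caption suggest a simple per-exon flag, sketched below in Python. Several details are assumptions rather than statements from the caption: the direction of the test (an exon is flagged when RPE is above 1.5 and PHQR is below 0.4), the MAPQ cutoff of 30 used to call a read "high quality", and the function names. RPE is taken as already computed, because the caption does not give its normalization formula.

```python
def phqr(mapqs, high_quality_mapq=30):
    """Proportion of reads with MAPQ at or above an assumed cutoff."""
    if not mapqs:
        return 0.0
    return sum(q >= high_quality_mapq for q in mapqs) / len(mapqs)

def is_spurious(rpe, phqr_value, rpe_cutoff=1.5, phqr_cutoff=0.4):
    """Flag an exon with excess reads and few confidently mapped reads.

    The direction of both comparisons is an assumption of this sketch.
    """
    return rpe > rpe_cutoff and phqr_value < phqr_cutoff

# Hypothetical exon: deep coverage, mostly low-MAPQ reads.
exon_mapqs = [0, 0, 3, 10, 60]
print(is_spurious(rpe=2.3, phqr_value=phqr(exon_mapqs)))
```

An exon flagged this way is a candidate for the kind of visual inspection shown in Figure 6 before any variant in it is sent for validation.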

Cited by

A bioinformatics pipeline for Mycobacterium tuberculosis sequencing that cleans contaminant reads from sputum samples.
Cuevas-Córdoba B, Fresno C, Haase-Hernández JI, Barbosa-Amezcua M, Mata-Rocha M, Muñoz-Torrico M, Salazar-Lezama MA, Martínez-Orozco JA, Narváez-Díaz LA, Salas-Hernández J, González-Covarrubias V, Soberón X. PLoS One. 2021 Oct 26;16(10):e0258774. doi: 10.1371/journal.pone.0258774. PMID: 34699523.
Generalizable characteristics of false-positive bacterial variant calls.
Bush SJ. Microb Genom. 2021 Aug;7(8):000615. doi: 10.1099/mgen.0.000615. PMID: 34346861.
Read trimming has minimal effect on bacterial SNP-calling accuracy.
Bush SJ. Microb Genom. 2020 Dec;6(12):mgen000434. doi: 10.1099/mgen.0.000434. PMID: 33332257.
Genomic diversity affects the accuracy of bacterial single-nucleotide polymorphism-calling pipelines.
Bush SJ, Foster D, Eyre DW, Clark EL, De Maio N, Shaw LP, Stoesser N, Peto TEA, Crook DW, Walker AS. Gigascience. 2020 Feb 1;9(2):giaa007. doi: 10.1093/gigascience/giaa007. PMID: 32025702.
Pinpointing the Genomic Localizations of Chromatin-Associated Proteins: The Yesterday, Today, and Tomorrow of ChIP-seq.
Lloyd SM, Bao X. Curr Protoc Cell Biol. 2019 Sep;84(1):e89. doi: 10.1002/cpcb.89. PMID: 31483109.
