cn.MOPS: mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate - PubMed (original) (raw)

cn.MOPS: mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate

Günter Klambauer et al. Nucleic Acids Res. 2012 May.

Abstract

Quantitative analyses of next-generation sequencing (NGS) data, such as the detection of copy number variations (CNVs), remain challenging. Current methods detect CNVs as changes in the depth of coverage along chromosomes. Technological or genomic variations in the depth of coverage thus lead to a high false discovery rate (FDR), even upon correction for GC content. In the context of association studies between CNVs and disease, a high FDR means many false CNVs, thereby decreasing the discovery power of the study after correction for multiple testing. We propose 'Copy Number estimation by a Mixture Of PoissonS' (cn.MOPS), a data processing pipeline for CNV detection in NGS data. In contrast to previous approaches, cn.MOPS incorporates modeling of depths of coverage across samples at each genomic position. Therefore, cn.MOPS is not affected by read count variations along chromosomes. Using a Bayesian approach, cn.MOPS decomposes variations in the depth of coverage across samples into integer copy numbers and noise by means of its mixture components and Poisson distributions, respectively. The noise estimate allows for reducing the FDR by filtering out detections having high noise that are likely to be false detections. We compared cn.MOPS with the five most popular methods for CNV detection in NGS data using four benchmark datasets: (i) simulated data, (ii) NGS data from a male HapMap individual with implanted CNVs from the X chromosome, (iii) data from HapMap individuals with known CNVs, (iv) high coverage data from the 1000 Genomes Project. cn.MOPS outperformed its five competitors in terms of precision (1-FDR) and recall for both gains and losses in all benchmark data sets. The software cn.MOPS is publicly available as an R package at http://www.bioinf.jku.at/software/cnmops/ and at Bioconductor.

PubMed Disclaimer

Figures

Figure 1.

Figure 1.

The processing pipelines for CNV detection in NGS data. Left column: modeling across samples and integer copy number estimation are unique to cn.MOPS. Right column: either GC correction [class (a) methods] or read count ratios [class (b) methods] are required for previous pipelines.

Figure 2.

Figure 2.

Illustration of the basic concept of cn.MOPS: a CNV call incorporates the detection of variation across samples (I/NI call) and the detection of variation along a chromosome (segmentation). Curves show read counts along one chromosome for five samples. I/NI calls (green) detect variation across samples (green vertical boxes). A CNV (red box) is called if consecutive segments have high I/NI calls. Blue boxes mark segments that segmentation algorithm of class (a) methods (see the ‘Introduction’ section) would combine into a CNV. First vertical bar (from the left) and first sample: the I/NI call indicates variation across samples (‘I/NI call +’). However, too few adjacent segments show high I/NI calls. Second bar and third sample: the I/NI call indicates variation across samples (‘I/NI call +’) and sufficiently many adjacent segments show high I/NI calls, which leads to a CNV call (red box). Third bar: the read counts drop consistently and would thus be detected by a segmentation algorithm of class (a) methods (blue boxes). However, the read counts of the samples do not vary, which does not lead to an I/NI call (‘I/NI call −’). A CNV is not detected, which is correct as the copy number does not vary across samples. Fourth bar and samples numbers 2 and 4: I/NI call indicates variation across samples (‘I/NI call +’). As in the first bar, too few adjacent segments show high I/NI calls. Fifth bar and second sample: a segmentation algorithm of class (a) methods would combine adjacent read counts that are consistently small (blue box) into a CNV. However, the read counts are within the variation of the constant copy number at this location. Therefore, the I/NI call does not indicate variation across samples (‘I/NI call −’).

Figure 3.

Figure 3.

Definitions for the evaluation of copy number detection methods. A genome is split into equally sized evaluation segments of a length shorter than the shortest CNV. Top panel: Knowing the true CNV regions (green), the evaluation segments are labeled as class 1 (CNV segment) or class −1 (non-CNV segment). Middle panel: A CNV detection method classifies each evaluation segment into CNV segments (blue, class 1) and non-CNV segments (class −1). Bottom panel: In the first line, positives (known CNV regions) are divided into true positives (TP, green) and false negatives (FN, red). In the second line, negatives (no overlap with known CNV regions) are divided into true negatives (TN, green) and false positives (FP, red). Segments partly overlapping with known or predicted CNV regions are not considered (‘na’).

Figure 4.

Figure 4.

Whole-genome CNV calling plots that visualize the performance of cn.MOPS, MOFDOC, EWT, JointSLM, CNV-Seq, and FREEC at rediscovering known CNVs of HapMap individuals. The plots visualize CNV calling values (vertical axis) along chromosomes 1–22 of the human genome without segmentation. The first panel shows the I/NI call used for cn.MOPS. The second panel provides mean _z_-scores used by EWT, JointSLM, while the last panel depicts mean log-ratios used by CNV-Seq and FREEC. We called the largest 0.5% of the CNV calling values (blue dots) and scaled them to maximum one. Darker shades of blue indicate a high density of calling values. True CNV regions are displayed as light red bars, and the corresponding CNV calls are indicated by red dots. Segments without calling values (white segments) correspond to assembly gaps in the reference genome. A perfect calling method would call all segments in true CNV regions (red dots) at maximum 1 and would call others (blue dots) at minimum 0. Arrows indicate segments in true CNV regions that are called by one method group but not by the other method groups. A threshold of 0.6 for log-ratios-based methods, namely CNV-Seq and FREEC, and a threshold of 0.8 for cn.MOPS would lead to the same true positive rate, while cn.MOPS yields fewer false discoveries (lower FDR). cn.MOPS is better at separating segments of true CNV regions from non-CNV segments than the other methods, as indicated by the lower variance of I/NI values (see blue area at the bottom of the first panel). The better separation by cn.MOPS results in FDRs lower than those of other methods, regardless of the calling thresholds.

Figure 5.

Figure 5.

CNV detection performance for different levels of coverage. Each curve in the two panels corresponds to the recall of one method at detecting short CNVs of lengths 1–5 kb (left panel: gains; right panel: losses). The FDR was fixed at 0.05.

Similar articles

Cited by

References

    1. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 2011;43:491–498. - PMC - PubMed
    1. Lander ES. Initial impact of the sequencing of the human genome. Nature. 2011;470:187–197. - PubMed
    1. Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, Seifert M, Borodina T, Soldatov A, Parkhomchuk D, et al. A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science. 2008;321:956–960. - PubMed
    1. Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010;11:94. - PMC - PubMed
    1. Łabaj PP, Leparc GG, Linggi BE, Markillie LM, Wiley SH, Kreil DP. Characterization and improvement of RNA-Seq precision in quantitative transcript expression profiling. Bioinformatics. 2011;27:i383–i391. - PMC - PubMed

Publication types

MeSH terms

LinkOut - more resources