Detecting statistically significant common insertion sites in retroviral insertional mutagenesis screens

Jeroen de Ridder et al. PLoS Comput Biol. 2006.

Abstract

Retroviral insertional mutagenesis screens, which identify genes involved in tumor development in mice, have yielded a large number of retroviral integration sites, and this number is expected to grow substantially with the introduction of high-throughput screening techniques. The data of various retroviral insertional mutagenesis screens are compiled in the publicly available Retroviral Tagged Cancer Gene Database (RTCGD). Integrally analyzing these screens for the presence of common insertion sites (CISs, i.e., regions in the genome that have been hit by viral insertions in multiple independent tumors significantly more often than expected by chance) requires an approach that corrects for the increased probability of finding false CISs as the amount of available data increases. Moreover, significance estimates of CISs should take into account both the noise, arising from the random nature of the insertion process, and the bias, stemming from preferential insertion sites present in the genome and from the data retrieval methodology. We introduce the kernel convolution (KC) framework to find CISs in a noisy and biased environment at a predefined significance level while controlling the family-wise error (FWE), i.e., the probability of detecting false CISs. Where previous methods use one, two, or three predetermined fixed scales, our method can operate at any biologically relevant scale. This makes it possible to analyze the CISs in a scale space by varying the width of the CISs, providing new insights into the behavior of CISs across multiple scales. Our method can also include models for background bias. Using simulated data, we evaluate the KC framework with three kernel functions: the Gaussian, triangular, and rectangular kernel functions.
We applied the Gaussian KC to the data from the combined set of screens in the RTCGD and found that 53% of the CISs do not reach the significance threshold in this combined setting. Still, with the FWE under control, application of our method resulted in the discovery of eight novel CISs, which each have a probability less than 5% of being false detections.

Conflict of interest statement

Competing interests. The authors have declared that no competing interests exist.

Figures

Figure 1. Schematic View of Insertion Data

(A) Schematic view of the mapped data of four tumors. Significance is determined by the number of tumors which contain insertions in a particular region. The geometric symbols represent the insertions and are given a different shape for each tumor. The blue regions indicate possible CISs. (B) When considering a broad region, the number of insertions one would expect to have occurred by chance is higher, and hence the regions need to be hit in more independent tumors than for narrow regions before significance is reached. (C) Genes (indicated by the green bars) may be affected from various loci around or within the gene, and there does not exist one distance over which viral inserts act on their targets.
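
The argument of panel (B) can be made concrete with a small sketch. It assumes insertions fall uniformly over the genome and models the count in a window as Poisson; both are simplifying assumptions, and counting insertions stands in for counting independent tumors. The function names, the genome size of 2.5e9 bp, and the 1,000 insertions are illustrative values, not taken from the paper, and no multiple-testing correction is applied.

```python
import math

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

def min_hits_for_significance(n_insertions, genome_bp, window_bp, alpha=0.05):
    """Smallest hit count in one window whose chance probability is <= alpha.

    Assumes uniform insertion (hypothetical null model for illustration).
    """
    lam = n_insertions * window_bp / genome_bp  # expected hits by chance
    k = 1
    while poisson_tail(k, lam) > alpha:
        k += 1
    return k

# Illustrative values only: a broad window needs more hits than a narrow one.
narrow = min_hits_for_significance(1000, 2.5e9, 30_000)
broad = min_hits_for_significance(1000, 2.5e9, 300_000)
print(narrow, broad)
```

Under these assumed numbers the expected chance count rises tenfold with the window width, so the broad window requires more hits before the same per-window significance level is reached.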

Figure 2. Schematic Depiction of the Kernel Convolution Framework

The insertions are convolved with a kernel function with a width determined by the scale parameter. In principle any kernel function can be used, but the Gaussian kernel function is depicted. The significance of the peaks is evaluated using a null-distribution computed by means of a random permutation of the data. This is done for a range of scale parameters to obtain the CISs in the scale space.
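
A minimal sketch of the convolution step, assuming a Gaussian kernel scaled to height 1 so that the smoothed curve can be read as an estimated number of insertions near each position. The kernel normalization, the evaluation grid, and the function names are assumptions made for illustration; the caption does not specify them.

```python
import math

def gaussian_kernel(distance, h):
    """Gaussian kernel with scale parameter h, scaled so its maximum is 1."""
    return math.exp(-0.5 * (distance / h) ** 2)

def insertion_density(insertions, grid, h):
    """Convolve insertion positions with the kernel, evaluated on a grid."""
    return [sum(gaussian_kernel(x - p, h) for p in insertions) for x in grid]

def find_peaks(grid, density):
    """Return (position, height) for every strict local maximum on the grid."""
    return [
        (grid[i], density[i])
        for i in range(1, len(grid) - 1)
        if density[i - 1] < density[i] > density[i + 1]
    ]

# Toy data: three nearby insertions and one isolated insertion (positions in bp).
grid = list(range(0, 6001, 10))
density = insertion_density([1000, 1010, 1020, 5000], grid, h=30)
for position, height in find_peaks(grid, density):
    print(position, round(height, 3))
```

Repeating the call with different values of `h` gives one smoothed profile per scale, which is the raw material for a scale space analysis.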

Figure 3. The Myc Locus on Chromosome 15

(A) The blue line represents the estimated number of insertions as a function of position for a certain region. The red line depicts the threshold associated with an _α_-level of 0.05. (B) CISs are depicted by means of vertical lines. From top to bottom these represent: the CISs for the current scale (30k), the csCISs, the CISs from the RTCGD, the insertions, and the genes (top and bottom strand separated). (C) Scale space diagram. The vertical axis of the scale space diagram is logarithmic and indicates the scale at which the CIS was detected (only a subset of scales was actually evaluated: [50 100 250 500 1 k 2.5 k 5 k 10 k 30 k 50 k 100 k 150 k] bp). (D) Evaluation of the insertion distribution over four small-scale CISs identified by scale space analysis. Per screen we list the number of insertions that fall within the small-scale CIS. The screens are labeled consistent with RTCGD nomenclature.

Figure 4. Results from Simulation Experiments—True and False Positives

(A,B) Results for the GKC applied to artificial data. (C,D) Results for the TKC. (E,F) Results for the RKC. The horizontal solid lines in (A), (C), and (E) show the 5% significance level; the dotted lines show the average number of csFPs. The legend shows the different simulated CISs, stating the number of insertions N_CIS that fall within the CIS of width W_CIS.

Figure 5. Results from Simulation Experiments—Cross Scale True Positives and Deviations from CIS Center

(A) Average number of csTPs per artificially generated CIS. Significant errors are made for the borderline cases: the narrow CISs (500 bp), or broad CISs (80k bp) with relatively few insertions. The GKC outperforms the RKC and TKC for all simulated CISs. (B) Average deviation of the detected CIS center from the actual simulated CIS center normalized on the simulated CIS width plus the scale parameter under consideration.

Figure 6. Comparison with Previous CIS Definition

(A) Plot of the increase of the error as a function of the screen size, when using the definition from [8], computed using the Poisson distribution, or a permutation approach. Also the results from the two individual windows used are given. Since the errors made by the two windows individually are not mutually exclusive, the Poisson estimate is an overestimate of the true error. (B) Venn diagram comparing three different CIS definitions: a) the definition from [8] applied to the complete dataset, b) the csCISs resulting from the GKC, and c) the published CISs from the RTCGD. The intersection between sets shows three counts (and corresponding percentages), indicating the count for set a, b, and c, respectively. This is because the three sets of CISs used different definitions (at different scales) for a CIS, so that some CISs are split up, and hence are counted twice.

Figure 7. Number of CISs per Scale Parameter

Number of CISs for Various Scale Parameters (Corrected and Uncorrected), the csCISs, the Background-Corrected csCISs, and the CISs from the RTCGD. Background correction only has effect at larger scales.

Figure 8. Example of Novel CIS and Background Corrected CIS

(A) Venn diagram comparing the csCISs and the CISs in the RTCGD. For reasons explained in Figure 6, the intersection shows two counts. (B) An example of a CIS that consists of three insertions from three independent screens and is therefore only detected when integrally analyzing the data. (C) Venn diagram comparing the csCISs with and without background correction. (D) An example of a csCIS that was also included in the RTCGD and is rejected based on the background-corrected threshold. The small vertical bars (red) in the genes denote the 5′ ends of genes, and a star denotes a corrected CIS. Since we are only interested in correcting regions that are putative CISs, a background-corrected threshold is only computed for peaks in the estimated number of insertions. The corrected threshold is given by the horizontal dotted line above the peak.

Figure 9. Schematic Depiction of the Significance Analysis of the Density Estimate of the Insertion Data

(A) The positions of the N insertions are permuted. (B) The convolution method is applied to the resulting permuted insertion profile. The heights of all peaks are recorded. (C) Steps A and B are repeated, resulting in a distribution of the peak heights in random data. (D) The threshold is computed by determining the _α_-level in the empirical CDF of the null-distribution. This threshold is applied to the insertion estimate of the real insertion data, resulting in a series of significant peaks.
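
Steps (A) to (D) can be sketched as follows. This is a simplified illustration, not the authors' implementation: "permuting" is assumed to mean redrawing the N positions uniformly over the region, the kernel is a Gaussian of height 1, and the threshold is taken as the (1 − α) quantile of the pooled null peak heights. The caption does not say how the family-wise error correction enters, so none is applied here; all names and parameter values are hypothetical.

```python
import math
import random

def density(insertions, grid, h):
    """Gaussian kernel convolution of insertion positions on a grid."""
    return [sum(math.exp(-0.5 * ((x - p) / h) ** 2) for p in insertions)
            for x in grid]

def peak_heights(values):
    """Heights of all strict local maxima."""
    return [values[i] for i in range(1, len(values) - 1)
            if values[i - 1] < values[i] > values[i + 1]]

def null_threshold(n_insertions, region_bp, grid, h,
                   alpha=0.05, n_perm=200, seed=0):
    """Peak-height threshold from randomly repositioned insertions."""
    rng = random.Random(seed)
    null_peaks = []
    for _ in range(n_perm):
        # (A) assumed null: uniform redraw of the N insertion positions
        permuted = [rng.uniform(0, region_bp) for _ in range(n_insertions)]
        # (B) smooth and record all peak heights; (C) repeat
        null_peaks.extend(peak_heights(density(permuted, grid, h)))
    if not null_peaks:
        raise ValueError("no peaks found in permuted data")
    # (D) empirical (1 - alpha) quantile of the null peak heights
    null_peaks.sort()
    idx = max(0, math.ceil((1 - alpha) * len(null_peaks)) - 1)
    return null_peaks[idx]

grid = list(range(0, 100001, 100))
real = [49950, 49970, 49990, 50010, 50030, 50050, 12000, 80000]
threshold = null_threshold(len(real), 100000, grid, h=300, n_perm=50, seed=1)
smoothed = density(real, grid, h=300)
significant = [grid[i] for i in range(1, len(grid) - 1)
               if smoothed[i - 1] < smoothed[i] > smoothed[i + 1]
               and smoothed[i] > threshold]
print(threshold, significant)
```

In this toy region the cluster of six insertions around 50,000 bp produces a peak far above what isolated random insertions generate, which is the behavior the thresholding step is meant to capture.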

Figure 10. Schematic Depiction of the Computation of a Background-Corrected Threshold

(A) The density of TSSs (the 5′ ends of the genes) is computed using a fixed kernel width h_bg. (B) A new realization of insertions is generated using the density from step A. (C) The GKC method is applied to the resulting insertion profile, yielding one realization of the background density estimate. Steps (A) and (B) and applying the GKC are repeated N times to yield a distribution of background realizations. For every position on the genome, a CDF of these realizations is computed and the threshold is determined based on the _α_-level. (D) The location-dependent threshold is combined with the threshold based on uniform background. Finally, the smoothed insertion estimate of the real data is thresholded with the resulting threshold.
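
The procedure in this caption can be sketched under several stated assumptions. Sampling from the TSS density is done by picking a TSS at random and adding Gaussian noise of width h_bg, which is equivalent to sampling from a Gaussian kernel density estimate. The caption does not say how the two thresholds are combined, so the sketch takes their pointwise maximum as one plausible choice. Function names and all numeric values are hypothetical.

```python
import math
import random

def smooth(insertions, grid, h):
    """Gaussian kernel convolution (kernel height 1) on a grid."""
    return [sum(math.exp(-0.5 * ((x - p) / h) ** 2) for p in insertions)
            for x in grid]

def background_threshold(tss, n_insertions, grid, h, h_bg,
                         alpha=0.05, n_real=100, seed=0):
    """Per-position threshold from insertions simulated around TSSs."""
    rng = random.Random(seed)
    per_position = [[] for _ in grid]
    for _ in range(n_real):
        # (A)+(B) draw insertions from a Gaussian KDE of the TSS positions
        sample = [rng.gauss(rng.choice(tss), h_bg) for _ in range(n_insertions)]
        # (C) smooth one background realization and store it per position
        for i, value in enumerate(smooth(sample, grid, h)):
            per_position[i].append(value)
    # empirical (1 - alpha) quantile at every grid position
    idx = max(0, math.ceil((1 - alpha) * n_real) - 1)
    return [sorted(values)[idx] for values in per_position]

def combine(background, uniform_threshold):
    """(D) Assumed combination rule: pointwise maximum of both thresholds."""
    return [max(t, uniform_threshold) for t in background]

grid = list(range(0, 10001, 100))
bg = background_threshold([1900, 2000, 2100], 10, grid,
                          h=200, h_bg=300, n_real=50, seed=2)
combined = combine(bg, 1.5)
print(round(bg[20], 2), round(bg[80], 2))
```

With TSSs concentrated near 2,000 bp, the background threshold is high there and close to zero far away, so under the assumed maximum rule the uniform threshold dominates in TSS-poor regions and the background threshold dominates near gene starts.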

References

    1. Uren AG, Kool J, Berns A, van Lohuizen M. Retroviral insertional mutagenesis: Past, present and future. Oncogene. 2005;24:7656–7672. doi:10.1038/sj.onc.1209043.
    2. Mikkers H, Berns A. Retroviral insertional mutagenesis: Tagging cancer pathways. Adv Cancer Res. 2003;88:53–99.
    3. Li J, Shen H, Himmel KL, Dupuy AJ, Largaespada DA, et al. Leukaemia disease genes: Large-scale cloning and pathway predictions. Nat Genet. 1999;23:348–353. doi:10.1038/15531.
    4. Hansen GM, Skapura D, Justice MJ. Genetic profile of insertion mutations in mouse leukemias and lymphomas. Genome Res. 2000;10:237–243.
    5. Hwang HC, Martins CP, Bronkhorst Y, Randel E, Berns A, et al. Identification of oncogenes collaborating with p27Kip1 loss by insertional mutagenesis and high-throughput insertion site analysis. Proc Natl Acad Sci U S A. 2002;99:11293–11298. doi:10.1073/pnas.162356099.
