Selection of target sites for mobile DNA integration in the human genome

Charles Berry et al. PLoS Comput Biol. 2006.

Abstract

DNA sequences from retroviruses, retrotransposons, DNA transposons, and parvoviruses can all become integrated into the human genome. Accumulation of such sequences accounts for at least 40% of our genome today. These integrating elements are also of interest as gene-delivery vectors for human gene therapy. Here we present a comprehensive bioinformatic analysis of integration targeting by HIV, MLV, ASLV, SFV, L1, SB, and AAV. We used a mathematical method which allowed annotation of each base pair in the human genome for its likelihood of hosting an integration event by each type of element, taking advantage of more than 200 types of genomic annotation. This bioinformatic resource documents a wealth of new associations between genomic features and integration targeting. The study also revealed that the length of genomic intervals analyzed strongly affected the conclusions drawn; thus, answering the question "What genomic features affect integration?" requires carefully specifying the length scale of interest.

Conflict of interest statement

Competing interests. The authors have declared that no competing interests exist.

Figures

Figure 1. ROC Curves as a Measure of the Association of DNA Integration with Genomic Features

(A) Diagram of the ROC analysis. The graph plots the true positive rate against the false positive rate for every possible cutpoint; vertical steps result when only the true positive rate increases as the cutpoint (i.e., cutoff value for the genomic feature) moves down; horizontal steps result when only the false positive rate increases, and when both rates increase as the cutpoint moves down the graph “steps” diagonally. The example shows the effects of score.20 on SB integration (though the method of construction is general). The area between the curve and the “no discrimination” line indicates discrimination between integration sites and random controls by the predictor tested. The curve will lie beneath the line of “no discrimination”—leading to an area of less than 0—if integration sites tend to have lower values of the variable under study than random controls. For details see the text and Text S1. (B) Box plots summarizing ROC results. Each box in Figure 1B indicates the first and third quartiles of the values, while the heavy line in the middle gives the median value. The “whiskers” extend to the most extreme observation within 1.5× the interquartile range of the median. Points that lie beyond the whiskers are plotted individually. For each box plot, the number of points is 17 (the number of datasets) times the number of rows in the relevant heat map for that feature (in Text S2; selected examples of heat maps are shown in Figures 2–4). Specifically, the numbers of points were 170 for gene.exon, 1173 for gene.density, 153 for dnase, 306 for cpg, 340 for juxtapos, 1870 for transfac, 17 for score.20.all, and 340 for score.20.1.bp.
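The construction described in (A) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `roc_points` and `roc_area` and the example scores are hypothetical, and the area computed here is the conventional area under the curve, for which 0.5 corresponds to the "no discrimination" diagonal.

```python
def roc_points(site_scores, control_scores):
    """Return (false positive rate, true positive rate) pairs as the
    cutpoint moves down through every observed feature value."""
    cutpoints = sorted(set(site_scores) | set(control_scores), reverse=True)
    points = [(0.0, 0.0)]
    for cut in cutpoints:
        # A site or control is "called positive" when its value is >= the cutpoint.
        tpr = sum(s >= cut for s in site_scores) / len(site_scores)
        fpr = sum(c >= cut for c in control_scores) / len(control_scores)
        points.append((fpr, tpr))
    return points


def roc_area(points):
    """Area under the ROC curve by the trapezoid rule.
    Vertical steps add nothing; tied values give diagonal steps."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical feature values at integration sites and at random controls.
sites = [0.9, 0.8, 0.7, 0.4]
controls = [0.6, 0.5, 0.3, 0.2]

favored = roc_area(roc_points(sites, controls))
disfavored = roc_area(roc_points(controls, sites))
print(favored, disfavored)
```

When the two groups are swapped, so that the "integration sites" carry the lower values, the curve falls beneath the diagonal and the area drops below 0.5, which is the situation the caption describes for disfavored features.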

Figure 2. ROC Areas Describing the Effects of DNA Sequences at the 20 bp Surrounding the Point of Integration (Named “score.20”)

(A) Heat map of ROC areas describing the influence of sequences at the point of integration. The key at the bottom indicates the color code for ROC values in this and subsequent figures. The top row indicates the summed effect over all the bases in the score.20 motif, and the individual bars below show the area for each individual base in the motif. The site of integration in each case was between base −1 and 1. (B) PWMs for sequences at the point of integration. Bases shown above the line were favorable for integration; those below were unfavorable. The integrating elements differ in the symmetry of the score.20 PWM at the site of integration. The points of joining on the top and bottom DNA strands have been determined experimentally for some cases, and where available are shown by the arrows in Figure 2B. For most of the elements, the sequences are approximately 2-fold rotationally symmetric through an axis between the points of joining on the two strands. This is because, in these cases, the DNA breaking and joining steps mediating integration are identical at the two ends of the element DNA. The exceptions are L1 and AAV, for which the points in the target DNA for joining of the two ends of the element do not have a consistent relationship. Note that the values given for Mij are the natural logarithms of the frequency of nucleotide i at position j among the integration sites relative to its frequency among the random controls. Thus, if nucleotide i almost always appears in position j in integration sites, Mij will approximate log(1) − log(1/4) ≈ 1.4, while if it appears only once in 128 integration sites Mij will approximate log(1/128) − log(1/4) ≈ −3.5. (C) ROC values for score.20 considered over longer genomic intervals. The number after score.20 on the vertical axis indicates the length in bp, then in kb (the latter indicated by k).
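The two limiting values quoted for Mij can be checked with a short calculation. The sketch below assumes natural logarithms and a frequency of 1/4 for each nucleotide among the random controls; the function name `log_ratio` is hypothetical.

```python
import math


def log_ratio(site_freq, control_freq=0.25):
    """Log of the nucleotide frequency at integration sites relative to
    its frequency among random controls (assumed here to be 1/4)."""
    return math.log(site_freq) - math.log(control_freq)


# Nucleotide present at nearly every integration site: frequency close to 1.
always = log_ratio(1.0)
# Nucleotide present at only one of 128 integration sites.
rare = log_ratio(1 / 128)

print(round(always, 2), round(rare, 2))  # 1.39 -3.47
```

A frequency equal to the control frequency gives a value of 0, so positive entries mark favored bases and negative entries mark disfavored ones.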

Figure 3. ROC Areas Describing the Effects of Gene-Associated Features on Integration Frequency

(A) ROC areas describing the effects of integration within a gene or exon. The databases studied were as indicated. The geneScan database is solely computational, possibly explaining the divergence of ROC areas from the other gene calls. (B) ROC areas describing the effects of gene density or expression density on integration frequency. To calculate the expression density, each gene in an interval was assigned three scores of zero or one according to whether it was 1) in the upper half, 2) in the upper quarter, or 3) in the upper 12.5% of all genes scored in a transcriptional profiling analysis. Transcriptional profiling was carried out using Affymetrix arrays for each of the cell types studied (accession numbers for array data are in Table S1). For each of the three cutoffs, the expression scores for all the genes in each genomic interval were then added together and divided by the interval width to generate the expression-density measure counting the number of expressed genes for that interval.
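The expression-density calculation in (B) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: each gene is represented by a hypothetical rank fraction (the share of all profiled genes that are expressed more highly), the function name `expression_density` is invented for the example, and treating a gene exactly at a cutoff as outside it is an arbitrary choice.

```python
def expression_density(rank_fractions, interval_width_bp,
                       cutoffs=(0.5, 0.25, 0.125)):
    """Return one expression-density value per cutoff for a genomic interval.

    rank_fractions: for each gene in the interval, the fraction of all
        profiled genes with higher expression (0.0 = most highly expressed).
    interval_width_bp: width of the genomic interval in base pairs.
    """
    densities = {}
    for cutoff in cutoffs:
        # Each gene scores 1 if it falls in the upper `cutoff` share, else 0.
        score = sum(1 for r in rank_fractions if r < cutoff)
        densities[cutoff] = score / interval_width_bp
    return densities


# Hypothetical interval of 1,000 bp containing four profiled genes.
example = expression_density([0.05, 0.2, 0.4, 0.9], 1000)
print(example)
```

In this example three genes fall in the upper half, two in the upper quarter, and one in the upper 12.5%, so the three density values decrease as the cutoff becomes stricter.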

Figure 4. ROC Areas Describing the Effects of Genomic Features on Integration Frequency

(A) ROC areas describing the effects of G/C content and CpG islands on integration frequency. CpG islands are on average 764 bp in length. (B) ROC areas describing the relationship between DNase I site density and integration frequency over intervals of different sizes. Each DNase I cleavage site is measured as a single point of cleavage on the human genome. (C) ROC areas describing the effects of proximity to gene boundaries on integration.

Figure 5. Improved Prediction due to Adding Additional Genomic Features to the score.20 ROC Values

(A) Lack of correlation between score.20 and other measures. (B) Diagram of the analytical method, illustrating the improvement of an ROC score by addition of a second predictor to the score.20 value. (C) Box plots describing the improvements in ROC area resulting from combining other genomic features with the score.20 measure. The ROC area increments are modest, because the ROC curve based on the combination of score.20 and another feature can only vary between the value for the score.20 prediction and 1.0. (D) Heat map of increases in ROC areas resulting from adding the gene density measures to the score.20 values for each integrating element. The color code for improvements in ROC areas is indicated at the bottom.

Figure 6. Clustering the Comprehensive BMA Integration Site Models

Red indicates positive correlation and green indicates negative correlation, as illustrated by the key at the bottom of the figure. The models were clustered in both the x- and y-directions, so the graph is symmetrical along a line from lower left to upper right. See Text S2 for more details.
