Accurate detection of recombinant breakpoints in whole-genome alignments

Oscar Westesson et al. PLoS Comput Biol. 2009 Mar.

Abstract

We propose a novel method for detecting sites of molecular recombination in multiple alignments. Our approach is a compromise between two previous extremes: mathematically rigorous but computationally prohibitive methods, and imprecise heuristic methods. Using a combined algorithm for estimating tree structure and hidden Markov model parameters, our program detects changes in phylogenetic tree topology over a multiple sequence alignment. We evaluate our method on benchmark datasets from previous studies on two recombinant pathogens, Neisseria and HIV-1, as well as on simulated data. We show that we can detect not only recombinant regions of vastly different sizes but also the locations of breakpoints with great accuracy. Our method performs well at inferring recombination breakpoints while remaining practical for larger datasets. In all cases, we confirm the breakpoint predictions of previous studies, and in many cases we offer novel predictions.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Accuracy of breakpoint detection varies as a function of simulation and inference parameters.

In each case, we plot both positive predictive value (PPV = TP/(TP+FP)) and sensitivity (TP/(TP+FN)). A correctly predicted breakpoint is defined as one that occurs less than 10 bp from a true breakpoint. Overall accuracy remains high except in situations of high diversity, extremely short recombinant regions (less than 50 bp), or more than 20 taxa. In several cases we were resource-limited and could provide only a few data points for each variable, which is why the plots are sparse. Each data point is the maximum-likelihood outcome of 10 independently run EM trials, each taking on average 15 minutes for small lengths and few taxa, though this varies as seen in Figure 9.
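The two accuracy measures defined in this caption can be illustrated with a short sketch. It is not the authors' evaluation code: the function name `score_breakpoints` and the one-to-one greedy matching of predicted to true breakpoints are assumptions made here for illustration; only the formulas and the "less than 10 bp" criterion come from the caption.

```python
def score_breakpoints(predicted, true, tolerance=10):
    """Return (ppv, sensitivity) for predicted vs. true breakpoint positions.

    A predicted breakpoint counts as a true positive if it lies strictly
    less than `tolerance` bp from a true breakpoint. Each true breakpoint
    is matched to at most one prediction (an assumption of this sketch).
    """
    predicted = sorted(predicted)
    true = sorted(true)
    tp = 0
    i = j = 0
    # Two-pointer sweep over both sorted lists.
    while i < len(predicted) and j < len(true):
        if abs(predicted[i] - true[j]) < tolerance:
            tp += 1
            i += 1
            j += 1
        elif predicted[i] < true[j]:
            i += 1  # this prediction has no partner: false positive
        else:
            j += 1  # this true breakpoint was missed: false negative
    fp = len(predicted) - tp
    fn = len(true) - tp
    ppv = tp / (tp + fp) if predicted else float("nan")
    sensitivity = tp / (tp + fn) if true else float("nan")
    return ppv, sensitivity


# Hypothetical positions, for illustration only.
ppv, sensitivity = score_breakpoints([105, 498, 900], [100, 500, 700])
```

In this made-up example two of three predictions fall within 10 bp of a true breakpoint, so both PPV and sensitivity are 2/3.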

Figure 2. The detection power increases as more trees are added to the model.

Here we analyze alignments with 5 regions, while setting our predicted number of states to various values. The sensitivity increases steadily while PPV tapers off at a fixed value.

Figure 3. Analyses of Neisseria argF (left) and penA (right) data.

The left plot shows the analysis of Neisseria argF data, with predictions from Husmeier and Wright, who used a similar method, shown as red dashed lines. We confirm each of their breakpoints and are able to better characterize uncertain regions, although the region from 0–75 remains difficult to characterize. Different colors represent posterior probabilities of different tree-topology states in the HMM, and sharp changes in color indicate likely recombination breakpoints. The right plot shows the analysis of Neisseria penA data, an alignment of 9 taxa of length circa 1900 bp, demonstrating our ability to analyze many taxa. We confirm with high posterior probabilities the two breakpoints previously found by Bowler et al., shown in red.
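The idea that a sharp change in the dominant posterior state marks a likely breakpoint can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: the function name `call_breakpoints`, the input layout (one list of state probabilities per alignment column), and the rule of simply taking the most probable state at each column are all hypothetical choices made here.

```python
def call_breakpoints(posteriors):
    """Report columns where the most probable tree-topology state changes.

    posteriors[i][k] is the posterior probability of hidden state k at
    alignment column i. Returns (column, previous_state, new_state) tuples.
    """
    # Most probable state at each column (posterior decoding).
    path = [max(range(len(col)), key=col.__getitem__) for col in posteriors]
    return [
        (i, path[i - 1], path[i])
        for i in range(1, len(path))
        if path[i] != path[i - 1]
    ]


# Hypothetical two-state posteriors over five columns.
example = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.2, 0.8], [0.6, 0.4]]
breakpoints = call_breakpoints(example)
```

On real data, very short state switches would likely need extra scrutiny, as the captions for the HIV-1 strains discuss for short spikes.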

Figure 4. The top figure shows our analysis of the strain CRF01_AE/B Malaysian HIV-1 with our recombination phylo-HMM.

We recover 6 previously predicted recombination breakpoints (red), and predict new regions in 6415–6594 and 2360–2553 (green). The gray and black regions correspond to posterior probabilities of the trees shown in the lowest figure. Previous bootscanning analysis of the same data is shown in the middle figure. Since this previous analysis involved removing gaps from the alignment, we provide approximate mappings from our predictions to theirs, shown as the red dashed lines between the two figures. They provided precise breakpoint locations based on the consensus HXB2 strain, which we plot in our figure as the vertical red lines. Note the spike in their plot that appears in our plot around 6500 as a recombinant region. The trees in the lowest figure were those trained as hidden states in our HMM; the black state clearly shows the query strain clustering with CRF_AE, whereas the gray tree shows a closer relationship with subtype B, in accordance with the previous findings.

Figure 5. Analysis of A/C Indian HIV-1 recombinant strain 95IN21301.

In the original paper, gaps were stripped, so mapping its predictions onto our plot is difficult. Instead, we show our confirmations in red, which correspond closely to the predictions seen in Figures 1 and 2 of the original paper. Our new prediction of region 4328–4401 is shown in green. Trees trained as hidden HMM states are shown underneath, with their colored boxes corresponding to the colors in the plot, which in turn denote posterior probabilities of hidden states. Note that in the black tree the query sequence does not cluster with C, but the branch length from the (C,F) clade to the query strain is effectively zero, indicating a star-like topology in these areas.

Figure 6. Brazilian strain BREPM12313.

We confirm Filho et al.'s breakpoints near 1322 and 2571 (red), and predict new recombinant regions in nt 4784–4945 as well as 970–1049 (green). The second of these is short, but present in some form in all three strains analyzed here. The spike at 3851–3909 is even shorter and is not represented in the other two strains, so we do not predict it as a likely recombinant region. Trees trained in hidden states are shown below the plot.

Figure 7. Brazilian strain BREPM16704.

We confirm breakpoints near 1322, 2571, and 5462 (red) and predict recombinations in 9281–9405 and 1017–1085 (green). Trees trained in hidden states are shown below the plot.

Figure 8. Brazilian strain BREPM11871.

We confirm breakpoints near 1322, 2571, and 4782 (red dashed lines). We predict a region common to BREPM16704 at 9238–9361 (green). We also propose that the breakpoint previously estimated at 5462 (red) lies at 5277 (green dashed line). In support of this, we provide bootstrap values (1000 replicates) for the 3 different regions, indicated by horizontal colored lines above the plot. Our prediction (orange) carries the highest value, 99.9%, whereas the previous one (blue) reaches only 85.1%, since it includes a region (purple) that strongly supports BREPM11871 clustering with subtype B, with a value of 98.2%. The small region at 985–1080 is difficult to categorize confidently, but its high posterior probability for clustering with F and its agreement with the other two strains lead us to suspect a recombination. Trees trained in hidden states are shown below the plot.

Figure 9. Resource use of the algorithm increases with model complexity.

The algorithm converges in a reasonable number of EM steps, as seen in the lower right plot. We observed no dependence of the number of iterations to convergence on model complexity, so the lower right histogram represents data pooled from all simulation trials. The final bar in the histogram represents the proportion of trials that took 14 or more iterations to converge.
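A histogram with a final "14 or more" bar, as described in this caption, can be built by capping each trial's iteration count before counting. The sketch below is illustrative only: the function name `iteration_histogram` and the example iteration counts are hypothetical and are not data from the paper.

```python
from collections import Counter


def iteration_histogram(iterations, cap=14):
    """Proportion of trials per iteration count, pooling counts >= cap.

    The key `cap` in the result plays the role of the final bar: it holds
    the proportion of trials that took `cap` or more iterations.
    """
    counts = Counter(min(n, cap) for n in iterations)
    total = len(iterations)
    return {k: counts[k] / total for k in sorted(counts)}


# Hypothetical iteration counts pooled from several trials.
hist = iteration_histogram([3, 5, 5, 14, 20, 30])
```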

Figure 10. Phylo-HMM training algorithm.

Input Alignment ⇒ Model Selection/Parameter Estimation ⇒ Recombination Inference.
