Evidence for widespread reticulate evolution within human duplicons

Michael S Jackson et al. Am J Hum Genet. 2005 Nov.

Abstract

Approximately 5% of the human genome consists of segmental duplications that can cause genomic mutations and may play a role in gene innovation. Reticulate evolutionary processes, such as unequal crossing-over and gene conversion, are known to occur within specific duplicon families, but the broader contribution of these processes to the evolution of human duplications remains poorly characterized. Here, we use phylogenetic profiling to analyze multiple alignments of 24 human duplicon families that span >8 Mb of DNA. Our results indicate that none of them are evolving independently, with all alignments showing sharp discontinuities in phylogenetic signal consistent with reticulation. To analyze these results in more detail, we have developed a quartet method that estimates the relative contribution of nucleotide substitution and reticulate processes to sequence evolution. Our data indicate that most of the duplications show a highly significant excess of sites consistent with reticulate evolution, compared with the number expected by nucleotide substitution alone, with 15 of 30 alignments showing a >20-fold excess over that expected. Using permutation tests, we also show that at least 5% of the total sequence shares 100% sequence identity because of reticulation, a figure that includes 74 independent tracts of perfect identity >2 kb in length. Furthermore, analysis of a subset of alignments indicates that the density of reticulation events is as high as 1 every 4 kb. These results indicate that phylogenetic relationships within recently duplicated human DNA can be rapidly disrupted by reticulate evolution. This finding has important implications for efforts to finish the human genome sequence, complicates comparative sequence analysis of duplicon families, and could profoundly influence the tempo of gene-family evolution.

Figures

Figure 1

Examples of reticulate and bimutational quartets. See description of quartet classification in the “Material and Methods” section.

Figure 2

Estimate of reticulation-event density. A, Cladogram of c9orf36 alignment. Sequences are defined by their RPCI11 BAC clone names. The three partitions that support the tree (18, 19, and 1D) are indicated. B, Partition matrix of proximal 8.2 kb of c9orf36 alignment. Sites support 16 different partitions; the two sequence groups that define each partition are indicated by black and white circles above the matrix, and the partitions that support the tree are to the left of the vertical dashed line. Each informative position is represented by a separate row of squares (numbered on the right). The specific partition defined by each informative site is indicated by a white square containing a black dot. All partitions compatible with this partition are shown as white squares, and all partitions incompatible with it are shown in black. Positions that support alternative partitions are assumed to be the result of reticulation. The four reticulation events inferred from the data are numbered 1–4, and the maximal extents of the sequences affected are indicated by dashed horizontal lines.
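The compatibility test that underlies a partition matrix of this kind can be sketched in a few lines. The following is a minimal illustration only, not the Partimatrix implementation: the function names are hypothetical, sites are assumed to be two-state, and two partitions are treated as compatible when at least one of the four intersections of their sequence groups is empty (the standard criterion for two splits fitting on one tree).

```python
def site_partition(column):
    """Return the split defined by a two-state column as the frozenset of
    sequence indices sharing the first sequence's state, or None if the
    column is not parsimony-informative (each state must occur >= 2 times)."""
    if len(set(column)) != 2:
        return None
    group = frozenset(i for i, s in enumerate(column) if s == column[0])
    if len(group) < 2 or len(column) - len(group) < 2:
        return None
    return group


def compatible(p, q, n):
    """Two splits of n sequences are compatible if at least one of the four
    intersections between their sides is empty."""
    everyone = frozenset(range(n))
    p_c, q_c = everyone - p, everyone - q
    return any(not (a & b) for a in (p, p_c) for b in (q, q_c))


def compatibility_matrix(seqs):
    """Collect the distinct partitions supported by informative sites and
    return them with a boolean matrix of pairwise compatibility."""
    n = len(seqs)
    parts = []
    for col in zip(*seqs):
        part = site_partition(col)
        if part is not None and part not in parts:
            parts.append(part)
    matrix = [[compatible(p, q, n) for q in parts] for p in parts]
    return parts, matrix


# Toy alignment (hypothetical data): columns 1 and 2 support conflicting splits.
toy = ["AAAC", "AGAC", "GAAT", "GGGT"]
parts, matrix = compatibility_matrix(toy)
```

In the toy alignment, the first and last columns support the same partition, the second column supports a conflicting one, and the third column is uninformative, so the matrix has one incompatible pair.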

Figure 3

Identification of reticulation events by use of phylogenetic profiling. A, Control and observed profiles of 21-kb section of 15q25 alignment created using a window size of 30 parsimony-informative sites. The extent of gene-related sequences is indicated. The _X_-axis shows position within alignment (in kb); the _Y_-axis shows correlation. B, NJ trees generated using subalignments from regions 1 and 2. The clades indicated with an asterisk (*) are supported by bootstrap values of 99%–100%. The scale (F84 distance) is the same for both trees. All sequences are indicated by the last three digits of their accession numbers. Sequences included are AC044860, AC127482, AC135735, AC135995, AC005630, and AC010725. AC127482 contains two copies of the duplication, A and B. C, Schematic structure of both SMA alleles (Var1 and Var 2) adapted from Schmutz et al. (2004). The positions of the SMN1 and SMN2 genes are indicated. The extent of duplicated sequence is shown in gray, with the position of the most abundant duplicated segments (V1.1–V2.3) indicated. The gap in the sequences is represented by a pair of dashed lines. The scale is in megabases. D, Control and observed profiles spanning the ∼85-kb SMA-1 alignment, created using a window of 20 parsimony-informative sites. The _X_-axes show informative sites; the _Y_-axes show correlation. E, Parsimony networks of all six repeats within allele 1 (left) and all nine repeats within both alleles (right). Scale is in nucleotide differences. Sequences aligned (in order from V1.1 to V2.3) are AC138957, AC131392, AC138866, AC138959, AC138911, AC140139, AC139500, AC108108, and AC138930. Examples of alignments of informative sites used to generate the profiles are provided in figure 4.

Figure 4

Examples of sequence alignments used to generate profiles. Partial sequence alignments stripped of all invariant and uninformative sites are shown, to highlight changes in phylogenetic signal within the profiles presented in figures 3 and 5. Each alignment is shaded with respect to a reference sequence shown in gray, with all identities to the sequence shown in black. A, 15q24. B, SMA-1. C, 22qter. D, chAB4-2 minimum 1. E, chAB4-2 minimum 2. F, 22q11.1.

Figure 5

Reticulations identified by phylogenetic profiling. For ease of presentation, only parsimony-informative positions are plotted, with a window size of 80, 50, and 40 parsimony-informative positions in panels A, B, and C, respectively. The number of positions identical to a reference sequence within the alignment (used to calculate the correlation) is shown for both windows flanking the numbered minima. Thus, in panel A, AP006327 is identical to the reference at 66 of 80 parsimony-informative positions to the left of the minimum at position 257, but at only 15 of 80 positions to the right of it. All numbered minima are >2 times lower than any observed in the control profiles (not shown). The chromosome 22q11.1 alignment (C) has >7 minima that exceed this control threshold.
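The flanking-window counts quoted in this legend (for example, 66 of 80 versus 15 of 80) can be reproduced with a simple tally. The sketch below is an assumption-laden illustration rather than the profiling software used in the study: it takes two sequences already reduced to parsimony-informative positions, counts identities to the reference in the windows on either side of a position, and reports the position with the largest left-right difference. The function names and the toy data are hypothetical, and no correlation statistic is computed.

```python
def flank_identity(seq, ref, position, window):
    """Count positions identical to the reference in the windows immediately
    left and right of `position` (indices into informative sites only)."""
    left = range(max(0, position - window), position)
    right = range(position, min(len(ref), position + window))
    count = lambda idx: sum(seq[i] == ref[i] for i in idx)
    return count(left), count(right)


def largest_shift(seq, ref, window):
    """Scan every position with two full flanking windows and return the
    (position, difference) where left and right identity counts differ most."""
    best_pos, best_diff = None, -1
    for pos in range(window, len(ref) - window + 1):
        left, right = flank_identity(seq, ref, pos, window)
        diff = abs(left - right)
        if diff > best_diff:
            best_pos, best_diff = pos, diff
    return best_pos, best_diff


# Toy data (hypothetical): identity to the reference is lost after site 5.
ref = "AAAAAAAAAA"
seq = "AAAAACCCCC"
print(flank_identity(seq, ref, 5, 3))
print(largest_shift(seq, ref, 3))
```

A sharp change in the two counts at one position is the kind of discontinuity in phylogenetic signal that the profiles are used to detect.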

Figure 6

Delineation of putative hotspot in SMA-1 region. A, Output from SimPlot (version 3.2) developed by S. Ray (Lole et al. 1999) shows identity of all sequences within the SMA-1 alignment to AC138959 (500-bp window with a 20-bp step size). All nine sequences share ∼99.96% identity within the region of 59–67 kb. B, Detailed view, showing landmarks within the 56–70-kb region. The region of maximal identity between the sequences is defined by an L1PA3 fragment (∼58 kb) and a highly variable AT dinucleotide repeat (68 kb). A further L1PA3 repeat distal to this AT dinucleotide creates a flanking direct repeat (both L1PA3s span positions 5721–6155 of the consensus L1 sequence).
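A sliding-window identity plot of this kind can be approximated as follows. This is a minimal sketch and not SimPlot's own algorithm: it assumes two pre-aligned sequences of equal length, skips columns in which either sequence has a gap character (`-`), and uses the 500-bp window and 20-bp step from the legend as defaults. The function name and the toy sequences are hypothetical.

```python
def sliding_identity(query, reference, window=500, step=20):
    """Return (window_midpoint, percent_identity) pairs for an aligned
    query/reference pair, ignoring columns where either sequence is gapped."""
    if len(query) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, len(reference) - window + 1, step):
        pairs = [
            (q, r)
            for q, r in zip(query[start:start + window],
                            reference[start:start + window])
            if q != "-" and r != "-"
        ]
        if not pairs:
            continue  # window consists entirely of gapped columns
        matches = sum(q == r for q, r in pairs)
        results.append((start + window // 2, 100.0 * matches / len(pairs)))
    return results


# Toy example (hypothetical data) with a short window for readability.
reference = "ACGT" * 5
query = "ACGT" * 4 + "ACGA"
for midpoint, identity in sliding_identity(query, reference, window=4, step=4):
    print(midpoint, identity)
```

Plotting the resulting values against the window midpoints for each sequence in an alignment gives a profile in which extended plateaus of near-100% identity stand out.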

Figure 7

Quartet analysis of multiple alignments. A, Reticulate quartets in CpG-positive data expressed as a percentage of all informative quartets. B, Bimutational quartets in CpG-positive data expressed as a percentage of all informative quartets. The insert shows the same data at a higher resolution. C, Reticulate quartets in CpG-negative data expressed as a percentage of all informative quartets. Bars on observed data show 95% bootstrap values, and bars on simulated data show 95% CIs. In the 22q11.1 CpG-negative alignment, reticulate quartets represent >50% of all informative quartets. This is a result of low bootstrap values within the NJ tree.

Figure 8

Reticulation in relation to sequence identity. Linear regression of log-transformed data is shown as a solid line (_r_²=0.599), and 95% CIs are shown as dashed lines.

Figure 9

Tract length increase in duplicons. The ratio of observed to expected tract lengths is shown for control and duplicon alignments.

Figure 10

Reticulation-event density in duplicons. Analysis of 11 alignments for which the expected frequency of reticulate quartets is negligible. All show a >20-fold excess of reticulate quartets relative to the expectation, with expected frequencies in 100 control alignments <0.5% of the observed value at the 50th percentile and <2.0% of the observed value at the 95th percentile. Analyses were performed on CpG-negative data, and the minimum number of events was estimated as shown in figure 2.

Figure 11

Distribution of sites indicating suboptimal trees. Partimatrix output shows parsimony-informative sites from the central region of the 15q25 alignment (11.0–33.12 kb). The three partitions that support the tree are to the left of the red line (7, 3, and 17), and the clustering of sites supporting each partition is indicated. If the tree is an accurate representation of the phylogenetic relationships, then the positions supporting each partition should be randomly distributed. For explanation of output, see figure 2 legend.

References

Web Resources

    1. Algorithms in Bioinformatics: SplitsTree4, http://www-ab.informatik.uni-tuebingen.de/software/splits/welcome.html (for D. H. Huson and D. Bryant's work on estimating phylogenetic trees and networks using SplitsTree4)
    1. BLAST, http://www.ncbi.nlm.nih.gov/BLAST/
    1. NISC Comparative Vertebrate Sequencing, http://www.nisc.nih.gov/open_page.html?/projects/comp_seq.html (for Target 1)
    1. Partimatrix, http://www.cecalc.ula.ve/BIOINFO/servicios/herr1/PARTIMATRIX/manual.htm
    1. Pairwise FLAG, http://bioinformatics.itri.org.tw/prflag/prflag.php

References

    1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215:403–410
    1. Bagnall RD, Ayres KL, Green PM, Giannelli F (2005) Gene conversion and evolution of Xq28 duplicons involved in recurring inversions causing severe hemophilia A. Genome Res 15:214–223
    1. Bailey JA, Gu Z, Clark RA, Reinert K, Samonte RV, Schwartz S, Adams MD, Myers EW, Li PW, Eichler EE (2002) Recent segmental duplications in the human genome. Science 297:1003–1007
    1. Bosch E, Hurles ME, Navarro A, Jobling MA (2004) Dynamics of a human interparalog gene conversion hotspot. Genome Res 14:835–844
    1. Chen FC, Li WH (2001) Genomic divergences between humans and other hominoids and the effective population size of the common ancestor of humans and chimpanzees. Am J Hum Genet 68:444–456
