Analysis of segmental duplications and genome assembly in the mouse

Jeffrey A Bailey et al. Genome Res. 2004 May.

Abstract

Limited comparative studies suggest that the human genome is particularly enriched for recent segmental duplications. The extent of segmental duplications in other mammalian genomes is unknown and confounded by methodological differences in genome assembly. Here, we present a detailed analysis of recent duplication content within the mouse genome using a whole-genome assembly comparison method and a novel assembly-independent method, designed to take advantage of the reduced allelic variation of the C57BL/6J strain. We conservatively estimate that approximately 57% of all highly identical segmental duplications (≥90%) were misassembled or collapsed within the working draft WGS assembly. The WGS approach often leaves duplications fragmented and unassigned to a chromosome when compared with the clone-ordered-based approach. Our preliminary analysis suggests that 1.7%-2.0% of the mouse genome is part of recent large segmental duplications (about half of what is observed for the human genome). We have constructed a mouse segmental duplication database to aid in the characterization of these regions and their integration into the final mouse genome assembly. This work suggests significant biological differences in the architecture of recent segmental duplications between human and mouse. In addition, our unique method provides the means for improving whole-genome shotgun sequence assembly of mouse and future mammalian genomes.

Figures

Figure 1

Whole-genome assembly comparison for mouse and human. We compared the sum of aligned bases (excluding gaps) for segmental duplications represented by alignments ≥10 kb in both the human genome (build 31) and the draft mouse genome (MGSCv3). Both the human and mouse genomes have alignments at all levels of identity; however, the human genome has a dramatically greater amount of aligned bases relative to the mouse (227,812 kbp vs. 10,042 kbp). The number of alignments increases geometrically relative to the number of copies. Mouse appears relatively rich in intrachromosomal duplications (black) and lacking in interchromosomal duplications (dark gray). However, many alignments are poorly characterized as indicated by the enrichment within the unplaced chromosome (chrUn—light gray).
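
The tally described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Alignment` record, its field names, and the example values are hypothetical, and the aligned-base count stands in for alignment length when applying the cutoff. Only the ≥10 kb threshold and the three categories (intrachromosomal, interchromosomal, chrUn) come from the caption.

```python
from collections import defaultdict
from dataclasses import dataclass

MIN_LENGTH_BP = 10_000  # caption: alignments >= 10 kb


@dataclass
class Alignment:
    # Hypothetical record for one pairwise duplication alignment.
    chrom_a: str
    chrom_b: str
    aligned_bases: int  # aligned bases, excluding gaps


def categorize(aln: Alignment) -> str:
    """Assign an alignment to one of the figure's three categories."""
    if "chrUn" in (aln.chrom_a, aln.chrom_b):
        return "chrUn"
    return "intra" if aln.chrom_a == aln.chrom_b else "inter"


def sum_aligned_bases(alignments):
    """Sum aligned bases per category, keeping alignments >= 10 kb."""
    totals = defaultdict(int)
    for aln in alignments:
        # Assumption: aligned bases are used as a proxy for alignment length.
        if aln.aligned_bases >= MIN_LENGTH_BP:
            totals[categorize(aln)] += aln.aligned_bases
    return dict(totals)


# Made-up example data, not values from the paper.
example = [
    Alignment("chr7", "chr7", 25_000),
    Alignment("chr7", "chr12", 12_000),
    Alignment("chr5", "chrUn", 40_000),
    Alignment("chr1", "chr1", 4_000),  # below the 10 kb cutoff, ignored
]
totals = sum_aligned_bases(example)
```

Treating any alignment that touches chrUn as its own category mirrors the caption's point that such alignments are poorly characterized and cannot be called intra- or interchromosomal.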

Figure 2

Whole-genome assembly comparison (WGAC) statistics of the mouse draft and the build 29 finished genome. Alignment statistics are binned in terms of percent identity or length (≥10 kb). We performed BLAST-based segmental duplication detection on MGSCv3 and the finished portion of build 29. The finished build 29 subset represents 439 Mb (17.7% of the draft assembly size). The abundance of aligned bases between 99.5%–100% that map to the unknown chromosome in MGSCv3 may represent highly similar duplications that require further characterization. The build 29 pairwise alignments were hand curated to remove uncharacterized interspersed transposable elements (Methods).
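
As a quick check on the scale of the comparison, the two figures quoted in this caption imply a draft assembly size. The snippet below is only back-of-the-envelope arithmetic on those two numbers; the variable names are illustrative, and the result is an implied value, not one reported in the caption.

```python
# Values quoted in the caption.
finished_subset_mb = 439.0   # finished build 29 subset, in Mb
fraction_of_draft = 0.177    # 17.7% of the draft assembly size

# Implied draft assembly size (derived here, not stated in the source).
implied_draft_mb = finished_subset_mb / fraction_of_draft

print(f"Implied draft assembly size: about {implied_draft_mb:.0f} Mb")
```

In other words, the finished subset covers a little under one-fifth of a draft of roughly 2.5 Gb, so totals from the two data sets are not directly comparable without scaling.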

Figure 3

Examples of whole-genome shotgun sequence detection (WSSD). The calibration of our WSSD method was performed on a set of unique and duplicated sequences. Unique sequences were drawn from clones shown to be unique by both metaphase and interphase FISH (e.g., AL590991). Examples of duplicated sequence were drawn from recently described pericentromeric duplications (e.g., mmu5; Thomas et al. 2003). Detection parameters were optimized to differentiate unique from duplicated sequence. Black dots represent the similarity and position of individual sequence reads. Masked repetitive regions (LINE elements, purple; ERV elements, green; and simple sequence repeats, red) are shown as vertical bars. From previous studies of the human genome (Bailey et al. 2002a), read depth (blue line) provided the measure for duplication detection. Here, we also took advantage of the reduced level of allelic variation within the C57BL/6J strain to increase our power. Thus, single base-pair differences most likely signify either paralogous sequence or sequencing errors. By excluding errors (through the calculation of read identity using only high quality base positions), we could categorize each read as allelic (≥99.8% identity) or paralogous (<99.8% identity). Regions showing a divergent read ratio (red line) of >0.8 (paralogous: allelic) were deemed duplicated. A divergent read ratio of 1 would suggest one paralogous copy.
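
The read classification rule in this caption can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: the function names, the idea of evaluating one list of reads per region, and the handling of a region with no allelic reads are assumptions. The 99.8% identity cutoff and the 0.8 ratio threshold are taken from the caption; read depth, which the caption also uses, is not modelled here.

```python
ALLELIC_IDENTITY = 0.998  # caption: >= 99.8% identity is allelic
RATIO_THRESHOLD = 0.8     # caption: ratio > 0.8 marks a duplicated region


def read_identity(matches: int, high_quality_bases: int) -> float:
    """Identity of one read, computed over high-quality positions only."""
    if high_quality_bases == 0:
        raise ValueError("read has no high-quality bases")
    return matches / high_quality_bases


def divergent_read_ratio(identities) -> float:
    """Ratio of paralogous reads to allelic reads for one region."""
    paralogous = sum(1 for i in identities if i < ALLELIC_IDENTITY)
    allelic = len(identities) - paralogous
    if allelic == 0:
        # Assumption: no allelic reads means an unbounded ratio.
        return float("inf") if paralogous else 0.0
    return paralogous / allelic


def is_duplicated(identities) -> bool:
    """Apply the caption's > 0.8 paralogous:allelic threshold."""
    return divergent_read_ratio(identities) > RATIO_THRESHOLD


# Made-up identities for two hypothetical regions.
unique_region = [1.0, 0.999, 0.9985, 1.0, 0.997]
duplicated_region = [1.0, 0.999, 0.995, 0.990]
```

With these example values, the first region has one paralogous read against four allelic reads (ratio 0.25) and the second has two against two (ratio 1.0), so only the second crosses the threshold.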

Figure 4

FISH confirmation. An example of (a) metaphase and (b) interphase FISH hybridization with a duplicated BAC clone (RP23–3D2; see Table 4) that was identified by the whole-genome shotgun detection strategy. Increased signal intensity was confirmed using (c) cohybridization with a unique probe (RP21-344N12) in the same nucleus as shown in b. Tandem segmental duplications were most frequently observed (Table 4). The results of all FISH experiments are available online (http://www.biologia.uniba.it/mouse/).

Figure 5

Mouse segmental duplications. Segmental duplications detected by whole-genome shotgun sequence detection (WSSD, black bars) and whole-genome assembly comparison (WGAC, red/blue bars) are drawn to scale within the published mouse genome assembly (MGSC 2002). Chromosome lengths and the centromere positions are shown in purple. These data are available as part of an interactive mouse segmental duplication database (http://mouseparalogy.cwru.edu).

References

    1. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215: 403-410.
    2. Armengol, L., Pujana, M.A., Cheung, J., Scherer, S.W., and Estivill, X. 2003. Enrichment of segmental duplications in regions of breaks of synteny between the human and mouse genomes suggest their involvement in evolutionary rearrangements. Hum. Mol. Genet. 12: 2201-2208.
    3. Bailey, J.A., Yavor, A.M., Massa, H.F., Trask, B.J., and Eichler, E.E. 2001. Segmental duplications: Organization and impact within the current human genome project assembly. Genome Res. 11: 1005-1017.
    4. Bailey, J.A., Gu, Z., Clark, R.A., Reinert, K., Samonte, R.V., Schwartz, S., Adams, M.D., Myers, E.W., Li, P.W., and Eichler, E.E. 2002a. Recent segmental duplications in the human genome. Science 297: 1003-1007.
    5. Bailey, J.A., Yavor, A.M., Viggiano, L., Misceo, D., Horvath, J.E., Archidiacono, N., Schwartz, S., Rocchi, M., and Eichler, E.E. 2002b. Human-specific duplication and mosaic transcripts: The recent paralogous structure of chromosome 22. Am. J. Hum. Genet. 70: 83-100.

WEB SITE REFERENCES

    1. http://mouseparalogy.gene.cwru.edu; Eichler Lab Mouse Segmental Duplication Database.
    2. http://www.biologia.uniba.it/mouse/; FISH experiments of WSSD duplication-positive clones.
    3. http://www.ncbi.nlm.nih.gov/genome/guide/build.html; NCBI's Genome Annotation Pipeline.
    4. http://www.ncbi.nlm.nih.gov/genome/guide/mouse/MmStats.html; Mouse Build 30 Statistics.
    5. http://www.ncbi.nlm.nih.gov/RefSeq/; NCBI Reference Sequence Database.
