Lineage-specific biology revealed by a finished genome assembly of the mouse

PLoS Biol. 2009 May 5;7(5):e1000112.

doi: 10.1371/journal.pbio.1000112. Epub 2009 May 26.

Leo Goodstadt, Ladeana W Hillier, Michael C Zody, Steve Goldstein, Xinwe She, Carol J Bult, Richa Agarwala, Joshua L Cherry, Michael DiCuccio, Wratko Hlavina, Yuri Kapustin, Peter Meric, Donna Maglott, Zoë Birtle, Ana C Marques, Tina Graves, Shiguo Zhou, Brian Teague, Konstantinos Potamousis, Christopher Churas, Michael Place, Jill Herschleb, Ron Runnheim, Daniel Forrest, James Amos-Landgraf, David C Schwartz, Ze Cheng, Kerstin Lindblad-Toh, Evan E Eichler, Chris P Ponting; Mouse Genome Sequencing Consortium

Deanna M Church et al. PLoS Biol. 2009.

Abstract

The mouse (Mus musculus) is the premier animal model for understanding human disease and development. Here we show that a comprehensive understanding of mouse biology is only possible with the availability of a finished, high-quality genome assembly. The finished clone-based assembly of the mouse strain C57BL/6J reported here has over 175,000 fewer gaps and over 139 Mb more of novel sequence, compared with the earlier MGSCv3 draft genome assembly. In a comprehensive analysis of this revised genome sequence, we are now able to define 20,210 protein-coding genes, over a thousand more than predicted in the human genome (19,042 genes). In addition, we identified 439 long, non-protein-coding RNAs with evidence for transcribed orthologs in human. We analyzed the complex and repetitive landscape of 267 Mb of sequence that was missing or misassembled in the previously published assembly, and we provide insights into the reasons for its resistance to sequencing and assembly by whole-genome shotgun approaches. Duplicated regions within newly assembled sequence tend to be of more recent ancestry than duplicates in the published draft, correcting our initial understanding of recent evolution on the mouse lineage. These duplicates appear to be largely composed of sequence regions containing transposable elements and duplicated protein-coding genes; of these, some may be fixed in the mouse population, but at least 40% of segmentally duplicated sequences are copy number variable even among laboratory mouse strains. Mouse lineage-specific regions contain 3,767 genes drawn mainly from rapidly-changing gene families associated with reproductive functions. The finished mouse genome assembly, therefore, greatly improves our understanding of rodent-specific biology and allows the delineation of ancestral biological functions that are shared with human from derived functions that are not.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Graphical representation of the two sequencing strategies used for mouse.

A careful cost/benefit analysis must be performed when approaching a genomic sequencing project. If lineage-specific biology is important, clone-based finishing of some form will be required.

Figure 2. Graphical representation of sequence composition.

Chromosomes are drawn to scale, with MGSCv3 to the left (green) and Build 36 to the right (purple). A female mouse provided the DNA for the MGSCv3, so no Y chromosome was available for this assembly.

Figure 3. A graphical representation of conserved synteny relationships.

The chromosomes of human Build 36 are painted with segments of conserved synteny ≥300 kb long with mouse MGSCv3 (left) and Build 36 (right). Colors indicate mouse chromosomes (see legend bottom right), while lines indicate orientation (top left to bottom right is direct, top right to bottom left is inverted). White regions are not covered by alignments forming a segment ≥300 kb. Red triangles are human centromeres. Note that all undirected blocks (regions of identical color) are identical between the two mouse builds except a region at the centromere of human Chromosome 9, which is itself an artifact in the MGSCv3 map. However, several areas of orientation change, some quite small, can be seen.

Figure 4. The distribution of segmental duplication in MGSCv3 (top) and Build 36 (bottom).

Interchromosomal (red) and intrachromosomal (blue) duplications (>95% identity and >10 kbp in length) are shown for both genome assemblies; for Build 36, pairwise alignments are shown only for regions that are also confirmed by the WGS depth-of-coverage analysis (black vertical bars/ticks). Positions of the centromeres (acrocentric) are shown (purple) for the MGSCv3 build. Initial estimates predicted the amount of segmental duplication to be approximately 1.5–2% of the genome. Calculations performed using Build 36 suggested the amount is much higher, approximately 4.5–5%. In addition, >60% of duplicated sequences were unplaced in the MGSCv3. In Build 36, almost all are assigned to a chromosome.
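
The thresholds in this caption (>95% identity, >10 kbp) can be read as a simple filter over pairwise alignments, followed by a split into intra- and interchromosomal pairs. The following is a minimal sketch of that reading only: the `Alignment` record, its field names, and the sample values are hypothetical and are not taken from the paper's pipeline, which also required confirmation by WGS depth of coverage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """Hypothetical pairwise alignment record (illustrative fields only)."""
    chrom_a: str
    chrom_b: str
    length_bp: int
    identity: float  # fraction of identical aligned bases, 0.0 to 1.0


MIN_IDENTITY = 0.95      # caption threshold: >95% identity
MIN_LENGTH_BP = 10_000   # caption threshold: >10 kbp


def is_segmental_duplication(aln: Alignment) -> bool:
    """Apply both caption thresholds (strictly greater than)."""
    return aln.identity > MIN_IDENTITY and aln.length_bp > MIN_LENGTH_BP


def classify(aln: Alignment) -> str:
    """Label a pair by whether both copies lie on the same chromosome."""
    if aln.chrom_a == aln.chrom_b:
        return "intrachromosomal"
    return "interchromosomal"


# Made-up example alignments, for illustration only.
alignments = [
    Alignment("chr5", "chr5", 42_000, 0.981),
    Alignment("chr1", "chr7", 15_500, 0.962),
    Alignment("chr2", "chr2", 8_000, 0.990),   # rejected: too short
    Alignment("chr3", "chr9", 60_000, 0.930),  # rejected: identity too low
]

kept = [(classify(a), a) for a in alignments if is_segmental_duplication(a)]
labels = [label for label, _ in kept]
```

With the made-up inputs above, only the first two alignments pass both thresholds; the other two fail on length and identity respectively.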

Figure 5. The proportion of exonic sequence disrupted in the MGSCv3.

Mouse lineage-specific gene duplicates are shown in red, and all other genes are shown in blue. The large number of mouse-specific genes that are entirely missing, truncated, or otherwise disrupted in MGSCv3 underscores the value of the finished Build 36 assembly in understanding rodent-specific biology.

Figure 6. Improvement of a region in Build 36, rich in Pramel genes, which is virtually absent in MGSCv3.

(A) The upper left-hand corner shows a dot-matrix view of the Build 36 Chromosome 5 (horizontal axis) aligned to the MGSCv3 Chromosome 5 (vertical axis). The triangle marks the portion of the chromosome shown in the zoomed-in view. The axes are in the same orientation. 1.5 Mb of sequence that was absent from MGSCv3 has been included in Build 36. This region contains 30 Pramel genes (shown in red) and approximately an additional 20 Pramel pseudogenes (in blue). This region consists almost entirely of segmental duplications (represented by blue lines below the dot matrix), which previously confounded the WGA algorithm. Gene models for Build 36 are displayed below the dot-matrix view. SD, segmental duplication; CDS, coding regions; pseudo, pseudogenes. (B) Although the orthologous PRAME and Pramel gene families have expanded independently in the primate and rodent lineages, positive selection has been most intense on equivalent regions of their structures. Positive selection on amino acid substitution has been most intense on one exterior surface (left) and has been virtually absent from the alternate face (right). Amino acids predicted to have been positively selected among human HSA1 PRAME genes (shown in red), mouse MMU4 Pramel (blue), or rat RNO5 Pramel (purple) genes have been mapped onto a homologous structure (Protein Data Bank code 2BNH). Amino acids that are positively selected in two or more species are shown in yellow. Three positively selected sites among mouse Pramel genes are not highlighted, as they occur within insertions relative to the 2BNH sequence.

Figure 7. Cumulative numbers of protein-coding gene duplication events on the human and mouse lineages since their divergence (grey).

Evolutionary time is estimated using dS (the number of synonymous substitutions per synonymous site) and a divergence time of 91 million years. The greater number of genes in the mouse compared with the human genome is largely accounted for by the lower rate of olfactory and vomeronasal receptor gene duplications (red) in the primate lineage.
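
One common way to turn dS into a time estimate is to assume a constant synonymous substitution rate and scale linearly against the calibration point. The sketch below illustrates that assumption only; the caption does not state the exact procedure used, and the dS value at the human-mouse divergence is not given here, so it is left as a caller-supplied parameter (`ds_at_divergence`, a hypothetical name).

```python
DIVERGENCE_TIME_MY = 91.0  # human-mouse divergence time from the caption


def ds_to_million_years(ds: float, ds_at_divergence: float) -> float:
    """Scale a duplication's dS to an age in million years.

    Assumes a constant synonymous rate (molecular clock), so that
    age = 91 My * ds / ds_at_divergence. The calibration value
    ds_at_divergence is an assumption supplied by the caller.
    """
    if ds < 0:
        raise ValueError("ds must be non-negative")
    if ds_at_divergence <= 0:
        raise ValueError("ds_at_divergence must be positive")
    return DIVERGENCE_TIME_MY * ds / ds_at_divergence


# Illustrative values only: a duplicate pair with half the calibration dS
# is dated to half the divergence time under this assumption.
example_age = ds_to_million_years(0.25, 0.5)
```

Under this linear scaling, a duplication whose dS equals the calibration value is dated to the divergence itself, and smaller dS values map to proportionally more recent events.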
