Assembly of polymorphic genomes: algorithms and application to Ciona savignyi

Genome Res. 2005 Aug;15(8):1127-35.

doi: 10.1101/gr.3722605.

Jade P Vinson, David B Jaffe, Keith O'Neill, Elinor K Karlsson, Nicole Stange-Thomann, Scott Anderson, Jill P Mesirov, Nori Satoh, Yutaka Satou, Chad Nusbaum, Bruce Birren, James E Galagan, Eric S Lander

Jade P Vinson et al. Genome Res. 2005 Aug.

Abstract

Whole-genome assembly is now used routinely to obtain high-quality draft sequence for the genomes of species with low levels of polymorphism. However, genome assembly remains extremely challenging for highly polymorphic species. The difficulty arises because two divergent haplotypes are sequenced together, making it difficult to distinguish alleles at the same locus from paralogs at different loci. We present here a method for assembling highly polymorphic diploid genomes that involves assembling the two haplotypes separately and then merging them to obtain a reference sequence. Our method was developed to assemble the genome of the sea squirt Ciona savignyi, which was sequenced to a depth of 12.7× from a single wild individual. By comparing finished clones of the two haplotypes we determined that the sequenced individual had an extremely high heterozygosity rate, averaging 4.6% with significant regional variation and rearrangements at all physical scales. Applied to these data, our method produced a reference assembly covering 157 Mb, with N50 contig and scaffold sizes of 47 kb and 989 kb, respectively. Alignment of ESTs indicates that 88% of loci are present at least once and 81% exactly once in the reference assembly. Our method represented loci in a single copy more reliably and achieved greater contiguity than a conventional whole-genome assembly method.
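
For readers unfamiliar with the N50 statistic quoted above, the following is a minimal sketch of how it is commonly computed: the N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembled length. The function name `n50` and the toy lengths are illustrative assumptions and are not taken from the paper or from Arachne.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Illustrative helper (not from the paper): sort lengths from longest to
    shortest and return the length at which the running sum first reaches
    half of the total.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length


# Toy example with made-up lengths (in kb), not data from the assembly.
toy_contigs_kb = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs_kb))
```

Applied separately to contig lengths and to scaffold lengths, the same calculation gives the two kinds of N50 figure reported in the abstract.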

Figures

Figure 1.

Typical alignments of the two assembled variants to finished clones from the same individual (A) and from a different individual (B). Finished sequence from the same individual aligns very well to one of the two assembled variants but differs significantly from the second variant. In contrast, finished sequence from another individual represents a third variant, and the three pairwise distances are similar. These observations indicate that the two assembled variants are haplotypes, not paralogs. BLASTN alignments are shown as parallelograms, nucleotide mismatches are shown as small dots.

Figure 2.

(A) Alignment of finished sequence from the two haplotypes, superimposed on the annotation of repetitive DNA (gray bars). In many cases the haplotype-specific sequence and its boundaries coincide with repetitive sequence, indicating that a repetitive DNA element (often a SINE) has inserted in one of the two haplotypes. Black lines indicate all BLASTN alignments between finished sequence from the two haplotypes. (B) Illustration of the repetitive sequences and alignments flanking a 6-kb interval specific to haplotype A. This is consistent with a deletion in haplotype B mediated by recombination through SINE elements.

Figure 3.

Our method for diploid genome assembly. The multistep process has two main components, which were motivated by an analysis of the default assembly using Arachne. First, we assemble the two haplotypes separately by applying a new algorithm called “the splitting rule” prior to the formation of contigs and scaffolds. This forces the two haplotypes to assemble separately. Second, we merge the two haplotypes by detecting long-range correspondences between haploid scaffolds, forming diploid scaffolds, and then choosing the best representative at each locus. When we cannot unambiguously determine the partner of a haploid scaffold, we set it aside as an unpaired scaffold.

Figure 4.

The results of applying our method for diploid genome assembly to the ∼13× WGS data set for Ciona savignyi. The majority of ESTs align exactly twice to the default Arachne assembly, indicating a tendency for the two haplotypes to assemble separately. The fraction of ESTs aligning twice is higher in the haplotype assemblies, consistent with the goal of separating the haplotypes more cleanly. The process of merging the haplotype assemblies had three effects: representing the vast majority of loci exactly once instead of twice, increasing the scaffold size, and increasing the contig size. We also report the unpaired scaffolds, whose partners of the opposite haplotype we could not determine; on average they are small, highly repetitive, and depleted for EST alignments.
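
The EST-based comparison described in this caption amounts to tallying, for each EST, how many times it aligns to a given assembly. A minimal sketch of that tally follows; the function name `est_copy_summary`, the input format (a mapping from EST identifier to alignment count), and the toy data are assumptions made for illustration and do not reproduce the authors' pipeline.

```python
from collections import Counter


def est_copy_summary(hit_counts):
    """Summarize how often ESTs align to an assembly.

    hit_counts: dict mapping a hypothetical EST identifier to the number of
    distinct places it aligns in the assembly. Returns fractions of ESTs.
    """
    n = len(hit_counts)
    if n == 0:
        raise ValueError("hit_counts must not be empty")
    # Cap counts at 3 so that everything above two copies shares one bin.
    tally = Counter(min(count, 3) for count in hit_counts.values())
    return {
        "absent": tally[0] / n,
        "exactly_once": tally[1] / n,
        "exactly_twice": tally[2] / n,
        "three_or_more": tally[3] / n,
        "at_least_once": 1 - tally[0] / n,
    }


# Toy input, not real data: four ESTs with 1, 2, 0 and 1 alignments.
toy_hits = {"est1": 1, "est2": 2, "est3": 0, "est4": 1}
print(est_copy_summary(toy_hits))
```

In these terms, a haplotype-separated assembly would be expected to show a large "exactly_twice" fraction, and a merged reference a large "exactly_once" fraction.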

Figure 5.

Experimental design for validating the assembly at critical junctions, sample results, and summary of results from 14 critical junctions. At a critical junction (at which the haploid scaffolds correspond up to a point and then cease to correspond) we select two 40-kb clones from each haplotype spanning the junction. We check that the draft sequences correspond to the clone sequences by designing six PCR assays that are applied to each clone (A). Across the resulting 24 assays (six assays on each of the four clones), we predict a product and its length in 16 assays and the absence of a product in eight assays, as in the sample results (B). We also sequenced the PCR products at 10 of the critical junctions and checked that they align best to the draft sequence of the correct haplotype. Accounting for variations in experimental design, we made predictions for 304 PCR assays (see Supplemental material). We observed only two discrepant PCR product lengths, and the nucleotide sequence of these two products was also discrepant (C). The overwhelming agreement between prediction and observation supports the interpretation that the large-scale haplotype differences are real and that we assembled them correctly.

Figure 6.

The splitting rule (A) and a diploid scaffold (B). (A) Suppose that the red reads are of one haplotype and the blue reads are of the other haplotype, and that all reads have computationally detected overlaps with all other reads of the same color. In principle, the red reads should assemble in one contig, and the blue reads in another contig. However, if A1 and B1 overlap only in a region of haplotype identity, then A1 and B1 will also have a computationally detected overlap. When Arachne builds contigs, this overlap would trigger algorithms designed to prevent assembly through repeats and break both the red and blue scaffolds. The splitting rule was devised to recognize this situation and sever the overlap between A1 and B1 prior to the formation of contigs, thus allowing the red and blue contigs to assemble separately. See text for details. (B) In general, a diploid scaffold is an alternation between regions represented by a single haploid scaffold and regions represented by a collinear block of alignments relating two haploid scaffolds. For example, the diploid scaffold in B has two collinear blocks and three regions represented by a single haplotype. Within each collinear block, the reference path (thick black line) is chosen to minimize the number of contig gaps.
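
The idea in (B) of choosing a reference path that minimizes contig gaps can be illustrated with a deliberately simplified sketch. It assumes that a collinear block has already been cut into consecutive segments at aligned positions, that each segment is described by the number of contig gaps it contains in each of the two haploid scaffolds, and that the path may switch haplotype freely at every segment boundary. These assumptions, the function name `choose_reference_path`, and the toy numbers are illustrative only; the caption does not give the actual procedure.

```python
def choose_reference_path(segments):
    """Pick, per segment, the haplotype with fewer contig gaps.

    segments: list of (gaps_in_a, gaps_in_b) tuples, one per segment of a
    collinear block (hypothetical representation). Ties go to haplotype A.
    Returns the chosen haplotype labels and the total number of gaps.
    """
    path = []
    total_gaps = 0
    for gaps_in_a, gaps_in_b in segments:
        if gaps_in_a <= gaps_in_b:
            path.append("A")
            total_gaps += gaps_in_a
        else:
            path.append("B")
            total_gaps += gaps_in_b
    return path, total_gaps


# Toy block with three segments (made-up gap counts).
toy_block = [(0, 2), (3, 1), (1, 1)]
print(choose_reference_path(toy_block))
```

Because switching is assumed to cost nothing, the per-segment minimum is also the overall minimum here; a model that penalized switches between haplotypes would need a dynamic-programming formulation instead.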
