Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome

Nat Genet. 2017 Apr;49(4):643-650.

doi: 10.1038/ng.3802. Epub 2017 Mar 6.

Derek M Bickhart, Benjamin D Rosen 2, Sergey Koren 3, Brian L Sayre 4, Alex R Hastie 5, Saki Chan 5, Joyce Lee 5, Ernest T Lam 5, Ivan Liachko 6, Shawn T Sullivan 7, Joshua N Burton 6, Heather J Huson 8, John C Nystrom 8, Christy M Kelley 9, Jana L Hutchison 2, Yang Zhou 2 10, Jiajie Sun 11, Alessandra Crisà 12, F Abel Ponce de León 13, John C Schwartz 14, John A Hammond 14, Geoffrey C Waldbieser 15, Steven G Schroeder 2, George E Liu 2, Maitreya J Dunham 6, Jay Shendure 6 16, Tad S Sonstegard 17, Adam M Phillippy 3, Curtis P Van Tassell 2, Timothy P L Smith 9

Derek M Bickhart et al. Nat Genet. 2017 Apr.

Abstract

The decrease in sequencing cost and the increased sophistication of assembly algorithms for short-read platforms have resulted in a sharp increase in the number of species with genome assemblies. However, these assemblies are highly fragmented, with many gaps, ambiguities, and errors, impeding downstream applications. We demonstrate the current state of the art for de novo assembly using the domestic goat (Capra hircus), based on long reads for contig formation, short reads for consensus validation, and scaffolding by optical and chromatin interaction mapping. These combined technologies produced what is, to our knowledge, the most continuous de novo mammalian assembly to date, with chromosome-length scaffolds and only 649 gaps. Our assembly represents a ∼400-fold improvement in continuity due to properly assembled gaps, compared to the previously published C. hircus assembly, and better resolves repetitive structures longer than 1 kb, representing the largest repeat family and immune gene complex yet produced for an individual of a ruminant species.

Conflict of interest statement

Competing Financial Interests: TSS is a current employee of Recombinetics. IL and STS are employees of Phase Genomics. JB, JS and MJD have a vested financial interest in Phase Genomics. ARH, SC, JL, and ETL are employees of Bionano Genomics. All other authors declare no competing financial interests.

Figures

Figure 1

Assembly schema for producing chromosome-length scaffolds. (A) Four different sets of sequencing data (long-read WGS, Hi-C data, optical mapping and short-read WGS) were produced in order to generate the goat reference genome. A tiered scaffolding approach using optical mapping data followed by Hi-C proximity-guided assembly produced the highest-quality genome assembly. (B) In order to correct misassemblies resulting from contig or scaffold errors, a consensus approach was used. An example from the initial optical mapping dataset is shown in the figure. A scaffold fork was identified on contig 3 (a 91 Mbp contig) from the optical mapping data. Mapping of short-read WGS data showed the signature of a misassembly near the 13th megabase of the contig, so the contig was split at this region. Subsequent analysis based on the RH map confirmed this split.

Figure 2

Assembly benchmarking comparisons reveal a high degree of assembly completion. (A) Feature response curves (FRC) showing the error rate as a function of the number of bases in each assembly (CHIR_1.0, CHIR_2.0, and ARS1) and each scaffold test (intermediary assemblies using a combination of Hi-C and Bionano scaffolding). (B) Comparison plots of chromosome 20 sequence between the ARS1 and CHIR_2.0 assemblies reveal several small inversions (light blue circles) and a small insertion of sequence (break in continuity) in the ARS1 assembly. Red circles highlight 9 of the aforementioned inversions and the insertion of sequence in our assembly. The ARS1 assembly contains only 10 gaps on this chromosome scaffold, whereas CHIR_2.0 has 5,651 gaps on the same chromosome assembly (gap density histogram on the Y axis). ARS1 optical map scaffolds and PacBio contigs are represented on the X axis as alternating patterns of blue and green shades, respectively, showing the tiling path that comprises the entire single chromosome scaffold.
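
Gap counts such as the 10 versus 5,651 quoted for chromosome 20 can be tallied by counting runs of unknown bases in a scaffold sequence. The following is a minimal sketch, assuming gaps are encoded as runs of N (the usual FASTA convention, not a detail stated in this caption); the function name count_gaps and the min_run parameter are hypothetical and do not describe the authors' pipeline.

```python
import re


def count_gaps(sequence: str, min_run: int = 1) -> int:
    """Count maximal runs of N/n that are at least min_run bases long.

    Assumption: each assembly gap appears as one contiguous run of N.
    """
    return sum(
        1 for match in re.finditer(r"[Nn]+", sequence)
        if len(match.group()) >= min_run
    )


# Toy scaffold with two gaps (4 bases and 2 bases long).
toy_scaffold = "ACGTNNNNACGTNNAC"
print(count_gaps(toy_scaffold))
print(count_gaps(toy_scaffold, min_run=3))
```

Raising min_run lets very short N runs, which some assemblers use for single ambiguous bases, be excluded from the tally.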

Figure 3

RH probe map shows excellent assembly continuity. ARS1 RH probe mapping locations were plotted against the RH map order. Each ARS1 scaffold corresponds to an RH map chromosome, with the exception of X, which is composed of two scaffolds. Red circles highlight two intrachromosomal (on chrs 1 and 23) and two interchromosomal misassemblies (on chrs 18 and 17) in ARS1 that were difficult to resolve.

Figure 4

Long-read assembly with complementary scaffolding resolves gap regions (A) and long repeats (B) that cause problems for short-read reference annotation. (A) A region of the Mucin gene cluster was resolved by long-read assembly, resulting in a complete gene model for Mucin-5b-Like that could not be built from the CHIR_2.0 assembly because of two assembly gaps. (B) Counts of repetitive elements that covered greater than 75% of the sequence length of, and had greater than 60% identity with, RepBase database entries for ruminant lineages. With the exception of the rRNA cluster (which is present in many repeated copies in the genome), the CHIR_2.0 reference contained a full complement of shorter repeat segments that were also present in our assembly. However, repeats longer than 1 kb were present in higher numbers in our assembly because we could traverse the entire length of each repetitive element.
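
The thresholds in panel (B) amount to a two-condition filter on each repeat hit. The sketch below is illustrative only: the RepeatHit record, its field names, and the family labels are hypothetical, and the caption does not specify how length coverage or identity were computed, so the code assumes coverage is aligned length divided by the length of the database entry.

```python
from dataclasses import dataclass


@dataclass
class RepeatHit:
    family: str               # repeat family label (hypothetical)
    aligned_length: int       # bases of the database entry covered by the hit
    reference_length: int     # full length of the database entry
    percent_identity: float   # identity to the database entry, 0-100


def passes_filter(hit: RepeatHit,
                  min_length_fraction: float = 0.75,
                  min_identity: float = 60.0) -> bool:
    """Keep hits covering >75% of the entry length with >60% identity."""
    coverage = hit.aligned_length / hit.reference_length
    return coverage > min_length_fraction and hit.percent_identity > min_identity


def count_by_family(hits: list[RepeatHit]) -> dict[str, int]:
    """Count the hits that pass the filter, grouped by repeat family."""
    counts: dict[str, int] = {}
    for hit in hits:
        if passes_filter(hit):
            counts[hit.family] = counts.get(hit.family, 0) + 1
    return counts


# Toy data: only the first and last hits meet both thresholds.
toy_hits = [
    RepeatHit("familyA", 900, 1000, 85.0),
    RepeatHit("familyA", 500, 1000, 95.0),   # too short: 50% coverage
    RepeatHit("familyB", 2900, 3000, 55.0),  # too divergent: 55% identity
    RepeatHit("familyB", 2800, 3000, 70.0),
]
print(count_by_family(toy_hits))
```

A coverage requirement of this kind is why a fragmented assembly undercounts long repeats: a copy interrupted by a gap yields only partial hits that fail the length threshold.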

Figure 5

(A) A region of the Natural Killer Cell (NKC) gene cluster was fragmented in the CHIR_2.0 reference genome but is present on a single contig within ARS1. (B) Likewise, the Leukocyte Receptor Complex (LRC) locus was poorly represented in CHIR_2.0, and was missing ~500 kb of sequence. For highly repetitive and polymorphic gene families, our assembly approach provided the best resolution and highest continuity of gene sequence.
