Genetic variation and the de novo assembly of human genomes - PubMed

a | A comparison of sequence coverage versus N50 contig length for 30 mammalian genomes from 25 species deposited into the US National Center for Biotechnology Information (NCBI) genome resource, including 5 human genome assemblies (circles). Colours contrast different sequencing platforms and assembly approaches. GRCh38 (human) and GRCm38 (mouse), generated by Sanger sequencing of bacterial artificial chromosome (BAC) clones, represent the highest-quality genomes. Genomes are enumerated according to species as follows: 1, Ailuropoda melanoleuca GCA_000004335.1; 2, Bos mutus GCA_000298355.1; 3, Bos taurus GCA_000181335.3; 4, Felis silvestris catus GCA_000687225.1; 5, Ursus maritimus GCA_000687225.1; 6, Balaenoptera acutorostrata GCA_000493695.1; 7, Callithrix jacchus GCA_000004665.1; 8, Daubentonia madagascariensis GCA_000241425.1; 9, Lipotes vexillifer GCA_000442215.1; 10, Pteropus alecto GCA_000325575.1; 11 and 12, Mus musculus GCA_000001635.6; 13, Nasalis larvatus GCA_000772465.1; 14, Nomascus leucogenys GCA_000146795.3; 15, Otolemur garnettii GCA_000181295.3; 16, Pan paniscus GCA_000258655.1; 17, Pan troglodytes GCA_000001515.4; 18, Panthera tigris GCA_000464555.1; 19, Papio anubis GCA_000264685.1; 20, Physeter macrocephalus GCA_000472045.1; 21, Pongo abelii GCF_000001545.4; 22, Rattus norvegicus GCA_000001895.4; 23, Saimiri boliviensis GCA_000235385.1; 24, Tarsius syrichta GCA_000164805.2; 25, Tursiops truncatus GCA_000151865.3; 26–30, Homo sapiens (SOAPdenovo, ALLPATHS, HuRef, GRCh38 and MinHash Alignment Process (MHAP), respectively).

b | The amount of duplicated sequence represented in different genome assemblies, as determined by whole-genome assembly comparison (WGAC), is shown for SOAPdenovo (YH, GenBank GCA_000004845.2), ALLPATHS (NA12878, GenBank GCA_000185165.1) and MHAP (CHM1, GenBank GCA_000772585), as well as for the human reference genome (GRCh38).
None of the de novo assemblies achieves the amount of duplication content resolved by the clone-based GRCh38 assembly, although the resolution of segmental duplication in massively parallel sequencing (MPS)-based assemblies (SOAPdenovo and ALLPATHS) is reduced compared with that of the single-molecule real-time (SMRT) sequence-based assembly MHAP.

c | Sequencing read depth is compared to GC composition across the human genome for different platforms: CHM1 Illumina HiSeq (SRP044331), NA12878 Illumina X10 (data from AllSeq) and CHM1 SMRT P5–C3 (SRX533609). (P5–C3 refers to the version of DNA polymerase (P) and chemistry (C) used in the sequencing reaction.) The Illumina bias is decreased in more-modern instruments, whereas the SMRT sequencing coverage is more uniform, with fewer sequence-context gaps. 454, 454 Sequencing; PacBio, Pacific Biosciences.
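
Panel a plots N50 contig length, a statistic the caption does not define. As a minimal sketch, assuming the common definition (the contig length L such that contigs of length L or longer together contain at least half of the assembled bases), it could be computed as follows. The function name `n50` and the toy contig lengths are hypothetical illustrations, not values taken from the figure or from any assembly listed above.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Assumed definition: the length L such that contigs of length >= L
    together contain at least half of the total assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half the bases are covered.
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly (illustrative only, not data from the figure).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # → 70
```

Because the statistic depends only on the distribution of contig lengths, it says nothing about whether those contigs are correct, which is why the figure pairs it with sequence coverage and, in panel b, with duplication content.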