Genome graphs and the evolution of genome inference
Review
Benedict Paten et al. Genome Res. 2017 May.
Abstract
The human reference genome is part of the foundation of modern human biology and a monumental scientific achievement. However, because it excludes a great deal of common human variation, it introduces a pervasive reference bias into the field of human genomics. To reduce this bias, it makes sense to draw on representative collections of human genomes, brought together into reference cohorts. There are a number of techniques to represent and organize data gleaned from these cohorts, many using ideas implicitly or explicitly borrowed from graph-based models. Here, we survey various projects underway to build and apply these graph-based structures, which we collectively refer to as genome graphs, and discuss the improvements in read mapping, variant calling, and haplotype determination that genome graphs are expected to produce.
© 2017 Paten et al.; Published by Cold Spring Harbor Laboratory Press.
Figures
Figure 1.
Schematic representation of two population-level reference structures. (A) A reference cohort, in which there is no attempt to identify homologies between the genome sequences. (B) A genome graph, in which homologies are collapsed and included as alternate paths in the graph.
Figure 2.
Four types of genome graphs, all constructed from the pair of sequences ATCCCCTA and ATGTCTA. (A) De Bruijn graph. (B) Directed acyclic graph. (C) Bidirected graph (a.k.a. sequence graph). (D) Biedged graph (a.k.a. biedged sequence graph).
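As a rough illustration of the first graph type, the following sketch builds a de Bruijn graph from the two sequences in the caption. The choice of k = 3, the representation of nodes as k-mers, and the function name `de_bruijn_graph` are assumptions made for this example; the k used in the figure is not stated here.

```python
from collections import defaultdict

def de_bruijn_graph(sequences, k):
    """Map each k-mer to the set of k-mers that directly follow it in some sequence.

    Hypothetical helper for illustration: nodes are k-mers, and an edge joins
    two k-mers that occur at consecutive offsets (overlapping by k - 1 bases).
    """
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer]  # touch the key so terminal k-mers also appear as nodes
            if i + k < len(seq):
                graph[kmer].add(seq[i + 1:i + k + 1])
    return dict(graph)

# k = 3 is an assumed value, not taken from the figure.
graph = de_bruijn_graph(["ATCCCCTA", "ATGTCTA"], k=3)

for kmer in sorted(graph):
    print(kmer, "->", sorted(graph[kmer]))
```

With these assumptions the two sequences share the node `CTA`, and the run of Cs in the first sequence produces a self-loop on `CCC`, which shows how repeats collapse into cycles in this representation.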
Figure 3.
Ultrabubble sites in a biedged sequence graph. Each arrow shows the terminal node of a site. The color of the arrows indicates the node pairing. Note that the ultrabubble denoted by the gray pair of arrows is nested within the ultrabubble denoted by the purple arrows. (Reprinted from Paten et al. 2017, with permission from the author.)
Figure 4.
A pangenome ordering on a graph constructed from two genomes. The red edges indicate the path of the pangenome through the graph. The solid and dotted edges indicate the adjacencies between nodes in the two source genomes. (Adapted from Nguyen et al. 2015, with permission from the author.)
Figure 5.
A schematic example of an “Array Sequence Graph” of the type used to construct a linearization of the DXZ1 repeat array in the X Chromosome centromere (Miga et al. 2014). A collection of reads (top) shown in the context of a consensus higher-order repeat are converted into a graph representation (bottom). A cycle around the graph represents a higher-order repeat, and the individual repeat units (oblongs) are represented within each node (circles). Edges between individual repeat units represent phasing information from input reads. Transitions between nodes are annotated with probabilities. (Adapted from Miga et al. 2014, with permission from the author.)
Figure 6.
A reference genome graph hierarchy (most collapsed graph at the top, less collapsed lower), with an input graph (bottom) mapped to it. All the graphs in the reference hierarchy are de Bruijn graphs. Dotted red lines show projections between graphs in the hierarchy, whereas solid red lines show mapping of the input sequence graph into the hierarchy. Here, each node has a unique ID, and the L and R strings represent flanking contexts mapping strings required for unique identification. (Reprinted from Paten et al. 2014, with permission from the author.)
Figure 7.
Distinct 1000 Genomes Project haplotypes embedded within a variation subgraph. Haplotypes are shown as colored ribbons with width proportional to the log of their frequency. The number of possible paths traversing left to right is 16, but only five are observed in the 1000 Genomes Project because of linkage disequilibrium. (Figure based on prototype by W Beyer, pers. comm.)
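The count of possible paths in the caption can be reproduced with a small dynamic-programming sketch. The topology below, a chain of four biallelic sites, is an assumption chosen because it yields 16 left-to-right paths; the actual subgraph in the figure may be shaped differently. The node names and both helper functions are hypothetical.

```python
from functools import lru_cache

def chain_of_bubbles(n_sites):
    """Build a DAG of n_sites consecutive two-allele sites (assumed topology)."""
    edges = {}
    for i in range(n_sites):
        edges[f"s{i}"] = [f"ref{i}", f"alt{i}"]   # branch into two alleles
        edges[f"ref{i}"] = [f"s{i + 1}"]          # both alleles rejoin
        edges[f"alt{i}"] = [f"s{i + 1}"]
    return edges

def count_paths(edges, source, sink):
    """Count distinct source-to-sink paths in a DAG using memoized recursion."""
    @lru_cache(maxsize=None)
    def paths_from(node):
        if node == sink:
            return 1
        return sum(paths_from(nxt) for nxt in edges.get(node, ()))
    return paths_from(source)

edges = chain_of_bubbles(4)
total = count_paths(edges, "s0", "s4")
print(total)  # 2 ** 4 = 16 possible paths
```

Under this assumption the number of possible paths doubles with every added site, whereas the number of haplotypes actually observed (five in the caption) is limited by linkage disequilibrium.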
Figure 8.
Bidirected sequence graph (A) being unfolded into a directed acyclic graph (B), in preparation for partial-order alignment. Node 6 is a reversed view of node 1, node 7 is a reversed view of node 2, and node 8 is a reversed view of node 5. (Reprinted from Garrison et al. [in prep], with permission from the author.)
Similar articles
- Fast and accurate genomic analyses using genome graphs.
  Rakocevic G, Semenyuk V, Lee WP, Spencer J, Browning J, Johnson IJ, Arsenijevic V, Nadj J, Ghose K, Suciu MC, Ji SG, Demir G, Li L, Toptaş BÇ, Dolgoborodov A, Pollex B, Spulber I, Glotova I, Kómár P, Stachyra AL, Li Y, Popovic M, Källberg M, Jain A, Kural D. Nat Genet. 2019 Feb;51(2):354-362. doi: 10.1038/s41588-018-0316-4. Epub 2019 Jan 14. PMID: 30643257
- Positional bias in variant calls against draft reference assemblies.
  Briskine RV, Shimizu KK. BMC Genomics. 2017 Mar 28;18(1):263. doi: 10.1186/s12864-017-3637-2. PMID: 28351369 Free PMC article.
- Pan-African genome demonstrates how population-specific genome graphs improve high-throughput sequencing data analysis.
  Tetikol HS, Turgut D, Narci K, Budak G, Kalay O, Arslan E, Demirkaya-Budak S, Dolgoborodov A, Kabakci-Zorlu D, Semenyuk V, Jain A, Davis-Dusenbery BN. Nat Commun. 2022 Aug 4;13(1):4384. doi: 10.1038/s41467-022-31724-3. PMID: 35927245 Free PMC article.
- Pangenome Graphs.
  Eizenga JM, Novak AM, Sibbesen JA, Heumos S, Ghaffaari A, Hickey G, Chang X, Seaman JD, Rounthwaite R, Ebler J, Rautiainen M, Garg S, Paten B, Marschall T, Sirén J, Garrison E. Annu Rev Genomics Hum Genet. 2020 Aug 31;21:139-162. doi: 10.1146/annurev-genom-120219-080406. Epub 2020 May 26. PMID: 32453966 Free PMC article. Review.
- Tools for Predicting the Functional Impact of Nonsynonymous Genetic Variation.
  Tang H, Thomas PD. Genetics. 2016 Jun;203(2):635-47. doi: 10.1534/genetics.116.190033. PMID: 27270698 Free PMC article. Review.
Cited by
- Integrated Analysis of Whole Genome and Epigenome Data Using Machine Learning Technology: Toward the Establishment of Precision Oncology.
  Asada K, Kaneko S, Takasawa K, Machino H, Takahashi S, Shinkai N, Shimoyama R, Komatsu M, Hamamoto R. Front Oncol. 2021 May 12;11:666937. doi: 10.3389/fonc.2021.666937. eCollection 2021. PMID: 34055633 Free PMC article. Review.
- Advances in optical mapping for genomic research.
  Yuan Y, Chung CY, Chan TF. Comput Struct Biotechnol J. 2020 Aug 1;18:2051-2062. doi: 10.1016/j.csbj.2020.07.018. eCollection 2020. PMID: 32802277 Free PMC article. Review.
- Impact of index hopping and bias towards the reference allele on accuracy of genotype calls from low-coverage sequencing.
  Ros-Freixedes R, Battagin M, Johnsson M, Gorjanc G, Mileham AJ, Rounsley SD, Hickey JM. Genet Sel Evol. 2018 Dec 13;50(1):64. doi: 10.1186/s12711-018-0436-4. PMID: 30545283 Free PMC article.
- Sequence tube maps: making graph genomes intuitive to commuters.
  Beyer W, Novak AM, Hickey G, Chan J, Tan V, Paten B, Zerbino DR. Bioinformatics. 2019 Dec 15;35(24):5318-5320. doi: 10.1093/bioinformatics/btz597. PMID: 31368484 Free PMC article.
- Computational graph pangenomics: a tutorial on data structures and their applications.
  Baaijens JA, Bonizzoni P, Boucher C, Della Vedova G, Pirola Y, Rizzi R, Sirén J. Nat Comput. 2022 Mar;21(1):81-108. doi: 10.1007/s11047-022-09882-6. Epub 2022 Mar 4. PMID: 36969737 Free PMC article.