Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (PBWT)
Richard Durbin. Bioinformatics. 2014.
Abstract
Motivation: Over the last few years, methods based on suffix arrays using the Burrows-Wheeler transform have been widely used for DNA sequence read matching and assembly. These provide very fast search algorithms, linear in the search pattern size, on a highly compressible representation of the dataset being searched. Meanwhile, algorithmic development for genotype data has concentrated on statistical methods for phasing and imputation, based on probabilistic matching to hidden Markov model representations of the reference data, which, while powerful, are much less computationally efficient. Here a theory of haplotype matching using suffix array ideas is developed, which should scale to much larger datasets than those currently handled by genotype algorithms.
Results: Given M sequences with N bi-allelic variable sites, an O(NM) algorithm is given to derive a representation of the data based on positional prefix arrays, termed the positional Burrows-Wheeler transform (PBWT). On large datasets, run-length encoding of this representation produces output more than a hundred times smaller than gzip applied to the raw data. Using this representation, a method is given to find all maximal haplotype matches within the set in O(NM) time, rather than the O(NM^2) expected from naive pairwise comparison. A fast algorithm is also given to find maximal matches between a new sequence and the set; its running time is empirically independent of M, given sufficient memory for indexes. The discussion includes some proposals for how these approaches could be used for imputation and phasing.
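As a rough illustration of how positional prefix arrays can be built in O(NM) time, the Python sketch below repeatedly applies a stable partition by allele, which keeps the haplotypes sorted by reversed prefix at every site. The function name `pbwt_prefix_arrays` and the list-of-lists input format are assumptions made for this example, not taken from the paper or its software.

```python
def pbwt_prefix_arrays(haps):
    """Yield (k, order, column) for each site k.

    haps   : list of M equal-length lists of 0/1 alleles (hypothetical input format).
    order  : haplotype indices sorted by their reversed prefix ending just before site k.
    column : the alleles at site k, listed in that sorted order.
    """
    M = len(haps)
    N = len(haps[0]) if M else 0
    order = list(range(M))  # before any site is seen, keep the input order
    for k in range(N):
        column = [haps[i][k] for i in order]
        yield k, order, column
        # Stable partition: haplotypes carrying 0 at site k come first,
        # then those carrying 1, each group keeping its relative order.
        zeros = [i for i in order if haps[i][k] == 0]
        ones = [i for i in order if haps[i][k] == 1]
        order = zeros + ones


if __name__ == "__main__":
    example = [[0, 1, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1], [0, 1, 0, 0]]
    for k, order, column in pbwt_prefix_arrays(example):
        print(k, order, column)
```

Each site costs O(M) work, so the full pass is O(NM). Because haplotypes with similar recent history end up adjacent, the permuted columns tend to contain long runs of identical alleles, which is what makes run-length encoding effective.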
Figures
Fig. 1.
A set of haplotype sequences sorted in order of reversed prefixes at position k, showing the set of values at k isolated from those before and after, and, on the right-hand side, how the order at position (k + 1) is derived from that at k as in Algorithm 1. Maximal substrings shared with the preceding sequence ending at k are shown bold underlined; these start at position d_k[i] as calculated in Algorithm 2.
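The start positions d_k[i] mentioned in the caption can be tracked alongside the sort order in the same linear pass. The Python sketch below is one way to do this under stated assumptions: the function name `pbwt_with_divergence`, the input format, and the convention that the first entry at position k is set to k are choices made for this example rather than details confirmed by this page.

```python
def pbwt_with_divergence(haps):
    """Return a list whose k-th entry is (order_k, div_k), for k = 0..N.

    order_k : haplotype indices sorted by reversed prefix ending just before site k.
    div_k   : div_k[i] is the start of the longest match, ending just before
              site k, between order_k[i] and order_k[i - 1]
              (div_k[0] is set to k by convention in this sketch).
    """
    M = len(haps)
    N = len(haps[0]) if M else 0
    order = list(range(M))
    div = [0] * M
    result = [(order, div)]
    for k in range(N):
        zeros, ones = [], []
        zeros_div, ones_div = [], []
        # p and q hold the latest match start seen since the last haplotype
        # that was placed in the 0-group and the 1-group respectively.
        p = q = k + 1
        for idx, start in zip(order, div):
            p = max(p, start)
            q = max(q, start)
            if haps[idx][k] == 0:
                zeros.append(idx)
                zeros_div.append(p)
                p = 0
            else:
                ones.append(idx)
                ones_div.append(q)
                q = 0
        order = zeros + ones
        div = zeros_div + ones_div
        result.append((order, div))
    return result
```

A small value of `div_k[i]` means that haplotype `order_k[i]` shares a long stretch ending at k with its predecessor in the sorted order, which is the information a maximal-match search needs.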