Hidden Markov Dirichlet process: modeling genetic inference in open ancestral space (original) (raw)

Spectrum: joint bayesian inference of population structure and recombination events

Bioinformatics, 2007

Motivation: While genetic properties such as linkage disequilibrium (LD) and population structure are closely related under a common inheritance process, the statistical methodologies developed so far mostly deal with LD analysis and structural inference separately, using specialized models that do not capture their statistical and genetic relationships. Also, most of these approaches ignore the inherent uncertainty in the genetic complexity of the data and rely on inflexible models built on a closed genetic space. These limitations may make it difficult to infer detailed and consistent structural information from rich genomic data such as populational single nucleotide polymorphisms (SNP) profiles.

A hierarchical Dirichlet process mixture model for haplotype reconstruction from multi-population data

The Annals of Applied Statistics, 2009

The perennial problem of "how many clusters?" remains an issue of substantial interest in data mining and machine learning communities, and becomes particularly salient in large data sets such as populational genomic data where the number of clusters needs to be relatively large and open-ended. This problem gets further complicated in a co-clustering scenario in which one needs to solve multiple clustering problems simultaneously because of the presence of common centroids (e.g., ancestors) shared by clusters (e.g., possible descents from a certain ancestor) from different multiple-cluster samples (e.g., different human subpopulations). In this paper we present a hierarchical nonparametric Bayesian model to address this problem in the context of multi-population haplotype inference.

Bayesian multi-population haplotype inference via a hierarchical dirichlet process mixture

Proceedings of the 23rd international conference on Machine learning - ICML '06, 2006

Uncovering the haplotypes of single nucleotide polymorphisms and their population demography is essential for many biological and medical applications. Methods for haplotype inference developed thus far-including methods based on coalescence, finite and infinite mixtures, and maximal parsimonyignore the underlying population structure in the genotype data. As noted by Pritchard , different populations can share certain portion of their genetic ancestors, as well as have their own genetic components through migration and diversification. In this paper, we address the problem of multipopulation haplotype inference. We capture cross-population structure using a nonparametric Bayesian prior known as the hierarchical Dirichlet process (HDP) , conjoining this prior with a recently developed Bayesian methodology for haplotype phasing known as DP-Haplotyper . We also develop an efficient sampling algorithm for the HDP based on a two-level nested Pólya urn scheme. We show that our model outperforms extant algorithms on both simulated and real biological data.

Bayesian Haplotype Inference via the Dirichlet Process

2004

The problem of inferring haplotypes from genotypes of single nucleotide polymorphisms (SNPs) is essential for the understanding of genetic variation within and among populations, with important applications to the genetic analysis of disease propensities and other complex traits. The problem can be formulated as a mixture model, where the mixture components correspond to the pool of haplotypes in the population. The size of this pool is unknown; indeed, knowing the size of the pool would correspond to knowing something significant about the genome and its history. Thus methods for fitting the genotype mixture must crucially address the problem of estimating a mixture with an unknown number of mixture components. In this paper we present a Bayesian approach to this problem based on a nonparametric prior known as the Dirichlet process. The model also incorporates a likelihood that captures statistical errors in the haplotype/genotype relationship trading off these errors against the size of the pool of haplotypes. We describe an algorithm based on Markov chain Monte Carlo for posterior inference in our model. The overall result is a flexible Bayesian method, referred to as DP-Haplotyper, that is reminiscent of parsimony methods in its preference for small haplotype pools. We further generalize the model to treat pedigree relationships (e.g., trios) between the population's genotypes. We apply DP-Haplotyper to the analysis of both simulated and real genotype data, and compare to extant methods.

A Markov Chain Monte Carlo Sampler for Gene Genealogies Conditional on Haplotype Data

Some Recent Advances in Mathematics and Statistics, 2013

The gene genealogy is a tree describing the ancestral relationships among genes sampled from unrelated individuals. Knowledge of the tree is useful for inference of population-genetic parameters such as migration or recombination rates. It also has potential application in gene-mapping, as individuals with similar trait values will tend to be more closely related genetically at the location of a trait-influencing mutation. One way to incorporate genealogical trees in genetic applications is to sample them conditional on observed genetic data. We have implemented a Markov chain Monte Carlo based genealogy sampler that conditions on observed haplotype data. Our implementation is based on an algorithm sketched by Zöllner and Pritchard but with several differences described herein. We also provide insights from our interpretation of their description that were necessary for efficient implementation. Our sampler can be used to summarize the distribution of tree-based association statistics, such as case-clustering measures.

Maximum likelihood estimation of recombination rates from population data

Genetics, 2000

We describe a method for co-estimating r = C/mu (where C is the per-site recombination rate and mu is the per-site neutral mutation rate) and Theta = 4N(e)mu (where N(e) is the effective population size) from a population sample of molecular data. The technique is Metropolis-Hastings sampling: we explore a large number of possible reconstructions of the recombinant genealogy, weighting according to their posterior probability with regard to the data and working values of the parameters. Different relative rates of recombination at different locations can be accommodated if they are known from external evidence, but the algorithm cannot itself estimate rate differences. The estimates of Theta are accurate and apparently unbiased for a wide range of parameter values. However, when both Theta and r are relatively low, very long sequences are needed to estimate r accurately, and the estimates tend to be biased upward. We apply this method to data from the human lipoprotein lipase locus.

Linkage Disequilibrium and Inference of Ancestral Recombination in 538 Single-Nucleotide Polymorphism Clusters across the Human Genome

The American Journal of Human Genetics, 2003

The prospect of using linkage disequilibrium (LD) for fine-scale mapping in humans has attracted considerable attention, and, during the validation of a set of single-nucleotide polymorphisms (SNPs) for linkage analysis, a set of data for 4,833 SNPs in 538 clusters was produced that provides a rich picture of local attributes of LD across the genome. LD estimates may be biased depending on the means by which SNPs are first identified, and a particular problem of ascertainment bias arises when SNPs identified in small heterogeneous panels are subsequently typed in larger population samples. Understanding and correcting ascertainment bias is essential for a useful quantitative assessment of the landscape of LD across the human genome. Heterogeneity in the population recombination rate, rho=4Nr, along the genome reflects how variable the density of markers will have to be for optimal coverage. We find that ascertainment-corrected rho varies along the genome by more than two orders of magnitude, implying great differences in the recombinational history of different portions of our genome. The distribution of rho is unimodal, and we show that this is compatible with a wide range of mixtures of hotspots in a background of variable recombination rate. Although rho is significantly correlated across the three population samples, some regions of the genome exhibit population-specific spikes or troughs in rho that are too large to be explained by sampling. This result is consistent with differences in the genealogical depth of local genomic regions, a finding that has direct bearing on the design and utility of LD mapping and on the National Institutes of Health HapMap project.

Bayesian Haplotype Inference for Multiple Linked Single-Nucleotide Polymorphisms

The American Journal of Human Genetics, 2002

Haplotypes have gained increasing attention in the mapping of complex-disease genes, because of the abundance of single-nucleotide polymorphisms (SNPs) and the limited power of conventional single-locus analyses. It has been shown that haplotype-inference methods such as Clark's algorithm, the expectation-maximization algorithm, and a coalescence-based iterative-sampling algorithm are fairly effective and economical alternatives to molecular-haplotyping methods. To contend with some weaknesses of the existing algorithms, we propose a new Monte Carlo approach. In particular, we first partition the whole haplotype into smaller segments. Then, we use the Gibbs sampler both to construct the partial haplotypes of each segment and to assemble all the segments together. Our algorithm can accurately and rapidly infer haplotypes for a large number of linked SNPs. By using a wide variety of real and simulated data sets, we demonstrate the advantages of our Bayesian algorithm, and we show that it is robust to the violation of Hardy-Weinberg equilibrium, to the presence of missing data, and to occurrences of recombination hotspots.

Ancestral Population Genomics: The Coalescent Hidden Markov Model Approach

Genetics, 2009

With incomplete lineage sorting (ILS), the genealogy of closely related species differs along their genomes. The amount of ILS depends on population parameters such as the ancestral effective population sizes and the recombination rate, but also on the number of generations between speciation events. We use a hidden Markov model parameterized according to coalescent theory to infer the genealogy along a fourspecies genome alignment of closely related species and estimate population parameters. We analyze a basic, panmictic demographic model and study its properties using an extensive set of coalescent simulations. We assess the effect of the model assumptions and demonstrate that the Markov property provides a good approximation to the ancestral recombination graph. Using a too restricted set of possible genealogies, necessary to reduce the computational load, can bias parameter estimates. We propose a simple correction for this bias and suggest directions for future extensions of the model. We show that the patterns of ILS along a sequence alignment can be recovered efficiently together with the ancestral recombination rate. Finally, we introduce an extension of the basic model that allows for mutation rate heterogeneity and reanalyze human-chimpanzee-gorilla-orangutan alignments, using the new models. We expect that this framework will prove useful for population genomics and provide exciting insights into genome evolution.

A Hidden Markov Technique for Haplotype Reconstruction

Lecture Notes in Computer Science, 2005

We give a new algorithm for the genotype phasing problem. Our solution is based on a hidden Markov model for haplotypes. The model has a uniform structure, unlike most solutions proposed so far that model recombinations using haplotype blocks. In our model, the haplotypes can be seen as a result of iterated recombinations applied on a few founder haplotypes. We find maximum likelihood model of this type by using the EM algorithm. We show how to solve the subtleties of the EM algorithm that arise when genotypes are generated using a haplotype model. We compare our method to the well-known currently available algorithms (phase, hap, gerbil) using some standard and new datasets. Our algorithm is relatively fast and gives results that are always best or second best among the methods compared.