David Balding | University College London
Papers by David Balding
We conducted a two-stage genome-wide association study to identify common genetic variation altering risk of the metabolic syndrome and related phenotypes in Indian Asian men, who have a high prevalence of these conditions. In Stage 1, approximately 317,000 single nucleotide polymorphisms were genotyped in 2700 individuals, from which 1500 SNPs were selected to be genotyped in a further 2300 individuals.
With the completion of phase 1 of the HapMap project (www.hapmap.org), we are now close to the point where genome-wide association studies form a routine tool for trying to identify genes involved in human disease and drug response. In any of three large human populations, the HapMap provides more than 500,000 single nucleotide polymorphisms (SNPs), chosen as far as possible to be evenly spaced and highly polymorphic.
Abstract Motivation: Most genome-wide association studies rely on single nucleotide polymorphism (SNP) analyses to identify causal loci. The increased stringency required for genome-wide analyses (with per-SNP significance threshold typically ≈ 10⁻⁷) means that many real signals will be missed. Thus it is still highly relevant to develop methods with improved power at low type I error.
Abstract Motivation: Copy number variations (CNVs) are increasingly recognized as a substantial source of individual genetic variation, and hence there is a growing interest in investigating the evolutionary history of CNVs as well as their impact on complex disease susceptibility. CNV/SNP haplotypes are critical for this research, but although many methods have been proposed for inferring integer copy number, few have been designed for inferring CNV haplotypic phase and none of these are applicable at genome-wide scale.
Recent studies of human populations suggest that the genome consists of chromosome segments that are ancestrally conserved ('haplotype blocks'; refs. 1–3) and have discrete boundaries defined by recombination hot spots (refs. 4, 5). Using publicly available genetic markers (ref. 6), we have constructed a first-generation haplotype map of chromosome 19. As expected for this marker density (ref. 7), approximately one-third of the chromosome is encompassed within haplotype blocks.
Abstract In recent years, there have been major developments of population genetics methods to estimate both rates of recombination and levels of natural selection. However, genomic variants subject to positive selection are likely to have arisen recently and, consequently, had less opportunity to be affected by recombination. Thus, the two processes have an intimately related impact on genetic variation, and inference of either may be vulnerable to confounding by the other.
Abstract: We investigate two approaches to increase the efficiency of phenotypic prediction from genome-wide markers, which is a key step for genomic selection (GS) in plant and animal breeding. The first approach is feature selection based on Markov blankets, which provide a theoretically-sound framework for identifying non-informative markers. Fitting GS models using only the informative markers results in simpler models, which may allow cost savings from reduced genotyping.
Abstract We present a Bayesian, Markov-chain Monte Carlo method for fine-scale linkage-disequilibrium gene mapping using high-density marker maps. The method explicitly models the genealogy underlying a sample of case chromosomes in the vicinity of a putative disease locus, in contrast with the assumption of a star-shaped tree made by many existing multipoint methods.
Abstract We consider a random tree and introduce a metric in the space of trees to define the “mean tree” as the tree minimizing the average distance to the random tree. When the resulting metric space is compact we have laws of large numbers and central limit theorems for sequences of independent identically distributed random trees. As an application we propose tests to check if two samples of random trees have the same law.
Abstract Mathematical and statistical aspects of constructing ordered-clone physical maps of chromosomes are reviewed. Three broad problems are addressed: analysis of fingerprint data to identify configurations of overlapping clones, prediction of the rate of progress of a mapping strategy and optimal design of pooling schemes for screening large clone libraries.
The interface between life and physical sciences provides an abundant habitat for mathematical models. These are often complex to our feeble minds yet ridiculously simplistic in comparison with nature's subtlety. They nevertheless often succeed in extracting important insights into, and sometimes quantitative measures of, nature's ways. The effectiveness of mathematics in the natural sciences was dubbed 'unreasonable' in the title of a famous essay over 50 years ago [1], and is no less so today.
Abstract We present a new multilocus method for the fine-scale mapping of genes contributing to human diseases. The method is designed for use with multiple biallelic markers—in particular, single-nucleotide polymorphisms for which high-density genetic maps will soon be available. We model disease-marker association in a candidate region via a hidden Markov process and allow for correlation between linked marker loci.
Abstract Despite the success of genome-wide association studies (GWASs) in identifying loci associated with common diseases, a substantial proportion of the causality remains unexplained. Recent advances in genomic technologies have placed us in a position to initiate large-scale studies of human disease-associated epigenetic variation, specifically variation in DNA methylation.
Neuroticism is a moderately heritable personality trait considered to be a risk factor for developing major depression, anxiety disorders and dementia. We performed a genome-wide association study in 2,235 participants drawn from a population-based study of neuroticism, making this the largest association study for neuroticism to date. Neuroticism was measured by the Eysenck Personality Questionnaire.
BACKGROUND: Hay fever or seasonal allergic rhinitis (AR) is a chronic disorder associated with IgE sensitization to grass. The underlying genetic variants have not been studied comprehensively. There is overwhelming evidence that those who have older siblings have less AR, although the mechanism for this remains unclear.
Abstract How best to summarize large and complex datasets is a problem that arises in many areas of science. We approach it from the point of view of seeking data summaries that minimize the average squared error of the posterior distribution for a parameter of interest under approximate Bayesian computation (ABC). In ABC, simulation under the model replaces computation of the likelihood, which is convenient for many complex models.
Abstract Simulation is an invaluable tool for investigating the effects of various population genetics modeling assumptions on resulting patterns of genetic diversity, and for assessing the performance of statistical techniques, for example those designed to detect and measure the genomic effects of selection. It is also used to investigate the effectiveness of various design options for genetic association studies.
Abstract Although commonplace in human disease genetics, genome-wide association (GWA) studies have only relatively recently been applied to plants. Using 32 phenotypes in the inbreeding crop barley, we report GWA mapping of 15 morphological traits across ∼500 cultivars genotyped with 1,536 SNPs. In contrast to the majority of human GWA studies, we observe high levels of linkage disequilibrium within and between chromosomes. Despite this, GWA analysis readily detected common alleles of high penetrance.
Abstract Background: We describe the distribution of indels in the 44 Encyclopedia of DNA Elements (ENCODE) regions (about 1% of the human genome) and evaluate the potential contributions of small insertion and deletion polymorphisms (indels) to human genetic variation. We relate indels to known genomic annotation features and measures of evolutionary constraint.