Qingrun Zhang - Academia.edu
Papers by Qingrun Zhang
bioRxiv (Cold Spring Harbor Laboratory), Mar 13, 2024
Motivation: Deciphering the genetic basis of complex traits via genotype-phenotype association studies is a long-standing theme in genetics. The availability of molecular omics data (such as the transcriptome) has enabled researchers to utilize "in-between-omes" in association studies, for instance in transcriptome-wide association studies. Although many statistical tests and machine learning models integrating omics in genetic mapping are emerging, there is no standard way to simulate phenotype from genotype with the role of in-between-omes incorporated. Moreover, the involvement of in-between-omes usually brings substantial nonlinear architecture (e.g., coexpression networks) that may be non-trivial to simulate. As such, rigorous power estimations, a critical step in testing novel models, may not be conducted fairly. Results: To address the gap between emerging methods development and the unavailability of adequate simulators, we developed OmeSim, a phenotype simulator incorporating genetics, an in-between-ome (e.g., transcriptome), and their complex relationships including nonlinear architectures. OmeSim outputs detailed causality graphs together with original data, correlations, and association structures between phenotypic traits and omes terms as comprehensive gold-standard datasets for the verification of novel tools integrating an in-between-ome in genotype-phenotype association studies. We expect OmeSim to enable rigorous benchmarking for future multi-omics integrations.
PLOS Genetics, Dec 17, 2023
PLOS Computational Biology, Oct 1, 2023
Machine Learning models have been frequently used in transcriptome analyses. In particular, Representation Learning (RL) models, e.g., autoencoders, are effective in learning critical representations in noisy data. However, learned representations, e.g., the "latent variables" in an autoencoder, are difficult to interpret, let alone use to prioritize essential genes for functional follow-up. In contrast, in traditional analyses, one may identify important genes such as Differentially Expressed (DiffEx), Differentially Co-Expressed (DiffCoEx), and Hub genes. Intuitively, complex gene-gene interactions may be beyond the capture of marginal effects (DiffEx) or correlations (DiffCoEx and Hub), indicating the need for powerful RL models. However, the lack of interpretability and of individual target genes is an obstacle to RL's broad use in practice. To facilitate interpretable analysis and gene identification using RL, we propose "Critical genes", defined as genes that contribute highly to learned representations (e.g., latent variables in an autoencoder). As a proof-of-concept, supported by eXplainable Artificial Intelligence (XAI), we implemented eXplainable Autoencoder for Critical genes (XA4C), which quantifies each gene's contribution to latent variables, based on which Critical genes are prioritized. Applying XA4C to gene expression data in six cancers showed that Critical genes capture essential pathways underlying cancers. Remarkably, Critical genes have little overlap with Hub or DiffEx genes; however, they have a higher enrichment in a comprehensive disease gene database (DisGeNET) and a cancer-specific database (COSMIC), evidencing their potential to disclose massive unknown biology. As an example, we discovered five Critical genes sitting in the center of the Lysine degradation (hsa00310) pathway, displaying distinct interaction patterns in tumor and normal tissues.
In conclusion, XA4C facilitates explainable analysis using RL, and Critical genes discovered by explainable RL empower the study of complex interactions.
Frontiers in Genetics, Sep 10, 2023
Editorial on the Research Topic Statistical methods for genome-wide association studies (GWAS) and transcriptome-wide association studies (TWAS) and their applications
Cancer epidemiology, biomarkers & prevention, Mar 13, 2024
Investigative Ophthalmology & Visual Science, Feb 1, 2009
GENETICS
Towards the identification of the genetic basis of complex traits, transcriptome-wide association study (TWAS) is successful in integrating transcriptome data. However, TWAS is only applicable to common variants, excluding rare variants in exome or whole genome sequences. This is partly because of the inherent limitation of TWAS protocols that rely on predicting gene expressions. Our previous research has revealed an insight into TWAS: the two steps in TWAS, building and applying the expression prediction models, are essentially genetic feature selection and aggregation, which do not have to involve predictions. Based on this insight disentangling TWAS, rare variants' inability to predict expression traits is no longer an obstacle. Herein, we developed "rare variant TWAS", or rvTWAS, that first uses a Bayesian model to conduct expression-directed feature selection and then uses a kernel machine to carry out feature aggregation, forming a model leveraging expressions for association ...
bioRxiv (Cold Spring Harbor Laboratory), Jul 17, 2023
Machine Learning models have been frequently used in transcriptome analyses. In particular, Representation Learning (RL) models, e.g., autoencoders, are effective in learning critical representations in noisy data. However, learned representations, e.g., the "latent variables" in an autoencoder, are difficult to interpret, let alone use to prioritize essential genes for functional follow-up. In contrast, in traditional analyses, one may identify important genes such as Differentially Expressed (DiffEx), Differentially Co-Expressed (DiffCoEx), and Hub genes. Intuitively, complex gene-gene interactions may be beyond the capture of marginal effects (DiffEx) or correlations (DiffCoEx and Hub), indicating the need for powerful RL models. However, the lack of interpretability and of individual target genes is an obstacle to RL's broad use in practice. To facilitate interpretable analysis and gene identification using RL, we propose "Critical genes", defined as genes that contribute highly to learned representations (e.g., latent variables in an autoencoder). As a proof-of-concept, supported by eXplainable Artificial Intelligence (XAI), we implemented eXplainable Autoencoder for Critical genes (XA4C), which quantifies each gene's contribution to latent variables, based on which Critical genes are prioritized. Applying XA4C to gene expression data in six cancers showed that Critical genes capture essential pathways underlying cancers. Remarkably, Critical genes have little overlap with Hub or DiffEx genes; however, they have a higher enrichment in a comprehensive disease gene database (DisGeNET), evidencing their potential to disclose massive unknown biology. As an example, we discovered five Critical genes sitting in the center of the Lysine degradation (hsa00310) pathway, displaying distinct interaction patterns in tumor and normal tissues.
In conclusion, XA4C facilitates explainable analysis using RL, and Critical genes discovered by explainable RL empower the study of complex interactions.
European Journal of Cell Biology, Sep 1, 2023
Zenodo (CERN European Organization for Nuclear Research), Aug 20, 2022
The success of transcriptome-wide association studies (TWAS) has led to substantial research towards improving its core component of genetically regulated expression (GReX). GReX links expression information with phenotype by serving as both the outcome of genotype-based expression models and the predictor for downstream association testing. In this work, we demonstrate that current linear models of GReX inadvertently combine two separable steps of machine learning, feature selection and aggregation, which can be independently replaced to improve overall power. We show that the monolithic approach of GReX limits the adaptability of TWAS methodology and practice, especially given low expression heritability.
Science Advances, Dec 21, 2022
Approaches systematically characterizing interactions via transcriptomic data usually follow two systems: (i) coexpression network analyses focusing on correlations between genes and (ii) linear regressions (usually regularized) to select multiple genes jointly. Both suffer from the problem of stability: A slight change of parameterization or dataset could lead to marked alterations of outcomes. Here, we propose Stabilized COre gene and Pathway Election (SCOPE), a tool integrating bootstrapped least absolute shrinkage and selection operator and coexpression analysis, leading to robust outcomes insensitive to variations in data. By applying SCOPE to six cancer expression datasets (BRCA, COAD, KIRC, LUAD, PRAD, and THCA) in The Cancer Genome Atlas, we identified core genes capturing interaction effects in crucial pan-cancer pathways related to genome instability and DNA damage response. Moreover, we highlighted the pivotal role of CD63 as an oncogenic driver and a potential therapeutic target in kidney cancer. SCOPE enables stabilized investigations toward complex interactions using transcriptome data.
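The bootstrapped LASSO idea mentioned in this abstract can be illustrated with a minimal sketch. This is not the SCOPE implementation: the function names (`lasso_cd`, `bootstrap_selection_frequency`), the penalty `alpha`, and the number of resamples are assumptions chosen for illustration, and the coexpression step is omitted. The sketch refits a LASSO on bootstrap resamples and reports how often each feature receives a nonzero coefficient, which is one common way to judge whether a selection is stable.

```python
import numpy as np

def lasso_cd(X, y, alpha, n_iter=100):
    """Coordinate-descent LASSO minimizing (1/2n)||y - Xb||^2 + alpha*||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()                      # residual for beta = 0
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue
            resid += X[:, j] * beta[j]    # remove feature j's current contribution
            rho = X[:, j] @ resid / n
            # Soft-thresholding update for coordinate j
            beta[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]
            resid -= X[:, j] * beta[j]    # add the updated contribution back
    return beta

def bootstrap_selection_frequency(X, y, alpha=0.1, n_boot=20, seed=0):
    """Fraction of bootstrap resamples in which each feature is selected."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        Xb, yb = X[idx], y[idx]
        Xb = (Xb - Xb.mean(axis=0)) / (Xb.std(axis=0) + 1e-12)
        yb = yb - yb.mean()
        counts += lasso_cd(Xb, yb, alpha) != 0
    return counts / n_boot
```

Features with a selection frequency near 1 are chosen in almost every resample; a hypothetical cutoff (for example 0.8) could then define the stable set.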
Nature, Oct 1, 2007
clear outlier. e-h, EDAR. e, Similar evidence for positive selection in JPT + CHB at a chromosome 2 locus: XP-EHH between CEU and JPT + CHB (blue), between YRI and JPT + CHB (red), and between CEU and YRI (grey); iHS in JPT + CHB (green). A valine to alanine polymorphism in EDAR passes all filters: the frequency of derived alleles (f), differences between populations (g) and differences between populations for high-frequency derived alleles (less than 20% in nonselected populations) (h). Three other functional changes, a DRE change in SULT1C2 and two SNPs associated with RANBP2 expression (Methods), have also become common in the selected population.
The contribution of genetic variants to a complex phenotype may be mediated by various forms of complicated interactions. Currently, the discovery of genetic variants underlying interaction is limited, partly because the real interaction patterns are diverse and unknown, whereas exhaustively examining all potential combinations confers the risk of overfitting and instability. We propose IBAS, Interaction-Bridged Association Study, a new model using statistical learning techniques to extract representations of interaction patterns in transcriptome data, which act as a mediator for the next genotype-phenotype association test. Using simulated perturbation experiments, it is demonstrated that IBAS is more robust to noise than similar mediation-based protocols relying on single genes, i.e., transcriptome-wide association studies (TWAS). By applying IBAS to real genotype-phenotype and expression data, we reported additional genes underlying complex traits as well as their biological...
Abstract: Towards the identification of the genetic basis of complex traits, transcriptome-wide association study (TWAS) is successful in integrating transcriptome data. However, TWAS is only applicable to common variants, excluding rare variants in exome or whole genome sequences. This is partly because of the inherent limitation of TWAS protocols that rely on predicting gene expressions. Briefly, a typical TWAS protocol has two steps: it trains an expression prediction model in a reference dataset containing gene expressions and genotypes, and then applies this prediction model to a genotype-phenotype dataset to "impute" the unobserved expression (called GReX), which is then tested for association with the phenotype. In this procedure, rare variants are not used due to their low power in predicting expressions. Our previous research has revealed an insight into TWAS: the two steps are essentially genetic feature selection and aggregation, which do not have to involve predictions. Based on this insight d...
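The two-step TWAS protocol described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code or any published TWAS tool: ridge regression stands in for the expression prediction model, a simple regression t statistic stands in for the association test, and all function names and the `ridge` penalty are hypothetical.

```python
import numpy as np

def fit_expression_model(G_ref, expr_ref, ridge=1.0):
    """Step 1: ridge regression of expression on genotypes in the reference data."""
    mu = G_ref.mean(axis=0)
    Gc = G_ref - mu
    p = Gc.shape[1]
    w = np.linalg.solve(Gc.T @ Gc + ridge * np.eye(p),
                        Gc.T @ (expr_ref - expr_ref.mean()))
    return mu, w

def impute_grex(G, mu, w):
    """Step 2a: predict the unobserved expression (GReX) from genotypes alone."""
    return (G - mu) @ w

def association_t(grex, phenotype):
    """Step 2b: t statistic for the slope in the regression phenotype ~ GReX."""
    x = grex - grex.mean()
    y = phenotype - phenotype.mean()
    n = len(y)
    slope = (x @ y) / (x @ x)
    resid = y - slope * x
    se = np.sqrt((resid @ resid) / (n - 2) / (x @ x))
    return slope / se
```

In use, `fit_expression_model` would be run on a reference panel that has both genotypes and expression, and `impute_grex` followed by `association_t` on a separate genotype-phenotype dataset that has no expression measurements.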
© 2008 Mader et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License
Power of finding at least one causal variant for Models 1, 2, and 3 (depicted in a, b, and c, respectively). The single locus test has the highest power for Model 1, which has explicit marginal effects for both interacting variants; AprioriGWAS has better power for the threshold model, Model 3. The X-axis is the same as in Figure 1 (http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003627#pcbi-1003627-g001).
E disease model. The allele test, if carried out by itself, is most powerful but, if used in conjunction with the genotype test (MaxGA), is somewhat less powerful than the test. All three tests (genotype, , MaxGA) have essentially the same power. Copyright information: Taken from "Combining identity by descent and association in genetic case-control studies", http://www.biomedcentral.com/1471-2156/9/42. BMC Genetics 2008;9:42. Published online 5 Jul 2008. PMCID: PMC2483716.
The American Journal of Pathology, 2019
Nature, 2005
A haplotype map of the human genome. The International HapMap Consortium. Inherited genetic variation has a critical but as yet largely uncharacterized role in human disease. Here we report a public database of common variation in the human genome: more than one million single nucleotide polymorphisms (SNPs) for which accurate and complete genotypes have been obtained in 269 DNA samples from four populations, including ten 500-kilobase regions in which essentially all information about common DNA variation has been extracted. These data document the generality of recombination hotspots, a block-like structure of linkage disequilibrium and low haplotype diversity, leading to substantial correlations of SNPs with many of their neighbours. We show how the HapMap resource can guide the design and analysis of genetic association studies, shed light on structural variation and recombination, and identify loci that may have been subject to natural selection during human evolution.