Momiao Xiong - Academia.edu
Papers by Momiao Xiong
Nature, 2015
A global reference for human genetic variation (The 1000 Genomes Project Consortium). The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.
Genetic Epidemiology, 2018
We develop linear mixed models (LMMs) and functional linear mixed models (FLMMs) for gene-based tests of association between a quantitative trait and genetic variants on pedigrees. The effects of a major gene are modeled as a fixed effect, the contributions of polygenes are modeled as a random effect, and the correlations of pedigree members are modeled via inbreeding/kinship coefficients. F-statistics and χ² likelihood ratio test (LRT) statistics based on the LMMs and FLMMs are constructed to test for association. We show empirically that the F-distributed statistics provide good control of the type I error rate. The F-test statistics of the LMMs have similar or higher power than the FLMMs, kernel-based famSKAT (family-based sequence kernel association test), and burden test famBT (family-based burden test). The F-statistics of the FLMMs perform well when analyzing a combination of rare and common variants. For small samples, the LRT statistics of the FLMMs control the type I error ...
Frontiers in genetics, 2018
The current paradigm of genomic studies of complex diseases is association and correlation analysis. Despite significant progress in dissecting the genetic architecture of complex diseases by genome-wide association studies (GWAS), the genetic variants identified by GWAS can explain only a small proportion of the heritability of complex diseases. A large fraction of genetic variants is still hidden. Association analysis has limited power to unravel the mechanisms of complex diseases. It is time to shift the paradigm of genomic analysis from association analysis to causal inference. Causal inference is an essential component for the discovery of disease mechanisms. This paper will review the major platforms of genomic analysis in the past and discuss the perspectives of causal inference as a general framework of genomic analysis. In genomic data analysis, we usually consider four types of associations: association of discrete variables (DNA variation) with continuous variables (ph...
PLoS computational biology, 2017
Investigating the pleiotropic effects of genetic variants can increase statistical power, provide important information to achieve deep understanding of the complex genetic structures of disease, and offer powerful tools for designing effective treatments with fewer side effects. However, the current multiple phenotype association analysis paradigm lacks breadth (number of phenotypes and genetic variants jointly analyzed at the same time) and depth (hierarchical structure of phenotype and genotypes). A key issue for high dimensional pleiotropic analysis is to effectively extract informative internal representation and features from high dimensional genotype and phenotype data. To explore correlation information of genetic variants, effectively reduce data dimensions, and overcome critical barriers in advancing the development of novel statistical methods and computational algorithms for genetic pleiotropic analysis, we proposed a new statistical method referred to as a quadratically r...
World Journal of Gastroenterology, 2003
AIM: To identify the susceptibility gene(s) for type 2 diabetes in the previously mapped region, 1p36.33-p36.23, in the Han population of North China using single nucleotide polymorphisms (SNPs) and to analyze the haplotypes of the gene(s) related to type 2 diabetes. METHODS: Twenty-three SNPs located in 10 candidate genes in the mapped region were chosen from public SNP domains with bioinformatic methods, and the single base extension (SBE) method was used to genotype the loci for 192 sporadic type 2 diabetes patients and 172 normal individuals, all of Han ethnic origin, to perform this case-control study. The haplotypes with significant difference in the gene(s) were further analyzed. RESULTS: Among the 23 SNPs, 8 were found to be common in the Chinese Han population. The allele frequency of one SNP, rs436045 in the protein kinase C ζ gene (PRKCZ), was statistically different between the case and control groups (P<0.05). Furthermore, haplotypes at five SNP sites of the PRKCZ gene were identified. CONCLUSION: The PRKCZ gene may be associated with type 2 diabetes in the Han population in North China. The haplotypes at five SNP sites in this gene may be responsible for this association.
BMC genomics, May 18, 2017
Epistasis plays an essential role in understanding regulation mechanisms and is an essential component of the genetic architecture of gene expression. However, interaction analysis of gene expression remains fundamentally unexplored due to great computational challenges and limited data availability. Due to variation in splicing, transcription start sites, polyadenylation sites, post-transcriptional RNA editing across the entire gene, and transcription rates of the cells, RNA-seq measurements generate large expression variability and collectively create the observed position-level read count curves. A single number for measuring gene expression, which is widely used for microarray-measured gene expression analysis, is highly unlikely to sufficiently account for large expression variation across the gene. Simultaneously analyzing epistatic architecture using RNA-seq and whole genome sequencing (WGS) data poses enormous challenges. We develop a nonlinear functional regression mode...
European journal of human genetics : EJHG, Feb 21, 2016
To analyze next-generation sequencing data, multivariate functional linear models are developed for a meta-analysis of multiple studies to connect genetic variant data to multiple quantitative traits adjusting for covariates. The goal is to take advantage of both meta-analysis and pleiotropic analysis in order to improve power and to carry out a unified association analysis of multiple studies and multiple traits of complex disorders. Three types of approximate F-distributions based on Pillai-Bartlett trace, Hotelling-Lawley trace, and Wilks's Lambda are introduced to test for association between multiple quantitative traits and multiple genetic variants. Simulation analysis is performed to evaluate false-positive rates and power of the proposed tests. The proposed methods are applied to analyze lipid traits in eight European cohorts. It is shown that it is more advantageous to perform multivariate analysis than univariate analysis in general, and it is more advantageous to...
Genetics, Jan 29, 2015
We developed generalized functional linear models (GFLMs) to perform meta-analysis of multiple case-control studies to evaluate the relationship of genetic data to dichotomous traits adjusting for covariates. Unlike the previously developed MetaSKAT, which is based on mixed effect models that treat the contributions of a major gene locus as random, GFLMs are fixed models, i.e., genetic effects of multiple genetic variants are fixed. Based on the GFLMs, we developed χ²-distributed Rao's efficient score test and likelihood ratio test (LRT) statistics to test for an association between a complex dichotomous trait and multiple genetic variants. We then performed extensive simulations to evaluate the empirical type I error rates and power performance of the proposed tests. The Rao's efficient score test statistics of GFLMs are very conservative, and have higher power than MetaSKAT when some causal variants are rare and some are common. When the causal variants are all rare (i.e., minor ...
European journal of human genetics : EJHG, Jan 15, 2015
The critical barrier in interaction analysis for next-generation sequencing (NGS) data is that the traditional pairwise interaction analysis that is suitable for common variants is difficult to apply to rare variants because of its prohibitive computational time, large number of tests and low power. The great challenges for successful detection of interactions with NGS data are (1) the need for a paradigm change in interaction analysis; (2) severe multiple testing; and (3) heavy computations. To meet these challenges, we shift the paradigm of interaction analysis between two SNPs to interaction analysis between two genomic regions. In other words, we take a gene as a unit of analysis and use functional data analysis techniques as dimension reduction tools to develop a novel statistic to collectively test interaction between all possible pairs of SNPs within two genome regions. By intensive simulations, we demonstrate that the functional logistic regression for interactio...
Genetics, Jan 9, 2015
Meta-analysis of genetic data must account for differences among studies including study designs, markers genotyped, and covariates. The effects of genetic variants may differ from population to population, i.e., heterogeneity. Thus, meta-analysis combining data from multiple studies is difficult. Novel statistical methods for meta-analysis are needed. In this paper, functional linear models are developed for meta-analyses which connect genetic data to quantitative traits adjusting for covariates. The models can be used to analyze rare variants, common variants or a combination of the two. Both likelihood ratio test (LRT) and F-distributed statistics are introduced to test association between quantitative traits and multiple variants in one genetic region. Extensive simulations are performed to evaluate empirical type I error rates and power performance of the proposed tests. The proposed LRT and F-distributed statistics control the type I error very well and have higher power tha...
Genetic epidemiology, Jan 23, 2015
In genetics, pleiotropy describes the genetic effect of a single gene on multiple phenotypic traits. A common approach is to analyze the phenotypic traits separately using univariate analyses and combine the test results through multiple comparisons. This approach may lead to low power. Multivariate functional linear models are developed to connect genetic variant data to multiple quantitative traits adjusting for covariates for a unified analysis. Three types of approximate F-distribution tests based on Pillai-Bartlett trace, Hotelling-Lawley trace, and Wilks's Lambda are introduced to test for association between multiple quantitative traits and multiple genetic variants in one genetic region. The approximate F-distribution tests provide much more significant results than those of F-tests of univariate analysis and optimal sequence kernel association test (SKAT-O). Extensive simulations were performed to evaluate the false positive rates and power performance of the proposed m...
European Journal of Human Genetics, 2014
Although pathway analysis methods have been developed and successfully applied to association studies of common variants, the statistical methods for pathway-based association analysis of rare variants have not been well developed. Many investigators observed highly inflated false-positive rates and low power in pathway-based tests of association of rare variants. The inflated false-positive rates and low true-positive rates of the current methods are mainly due to their inability to account for gametic phase disequilibrium. To overcome these serious limitations, we develop a novel statistic that is based on the smoothed functional principal component analysis (SFPCA) for pathway association tests with next-generation sequencing data. The developed statistic has the ability to capture position-level variant information and account for gametic phase disequilibrium. By intensive simulations, we demonstrate that the SFPCA-based statistic for testing pathway association with either rare or common or both rare and common variants has the correct type 1 error rates. The power of the SFPCA-based statistic and of 22 additional existing statistics is also evaluated. We found that the SFPCA-based statistic has a much higher power than other existing statistics in all the scenarios considered. To further evaluate its performance, the SFPCA-based statistic is applied to pathway analysis of exome sequencing data in the early-onset myocardial infarction (EOMI) project. We identify three pathways significantly associated with EOMI after the Bonferroni correction. In addition, our preliminary results show that the SFPCA-based statistic yields much smaller P-values for identifying pathway association than other existing methods.
Genetic epidemiology, 2014
By using functional data analysis techniques, we developed generalized functional linear models for testing association between a dichotomous trait and multiple genetic variants in a genetic region while adjusting for covariates. Both fixed and mixed effect models are developed and compared. Extensive simulations show that Rao's efficient score tests of the fixed effect models are very conservative since they generate lower type I errors than nominal levels, and global tests of the mixed effect models generate accurate type I errors. Furthermore, we found that the Rao's efficient score test statistics of the fixed effect models have higher power than the sequence kernel association test (SKAT) and its optimal unified version (SKAT-O) in most cases when the causal variants are both rare and common. When the causal variants are all rare (i.e., minor allele frequencies less than 0.03), the Rao's efficient score test statistics and the global tests have similar or slightly l...
PLoS ONE, 2012
The dimension of the population genetics data produced by next-generation sequencing platforms is extremely high. However, the "intrinsic dimensionality" of sequence data, which determines the structure of populations, is much lower. This motivates us to use locally linear embedding (LLE), which projects high dimensional genomic data into a low dimensional, neighborhood-preserving embedding, as a general framework for population structure and historical inference. To facilitate application of the LLE to population genetic analysis, we systematically investigate several important properties of the LLE and reveal the connection between the LLE and principal component analysis (PCA). Identifying a set of markers and genomic regions which could be used for population structure analysis will provide invaluable information for population genetics and association studies. In addition to identifying the LLE-correlated or PCA-correlated structure informative markers, we have developed a new statistic that integrates genomic information content in a genomic region for collectively studying its association with the population structure, and a LASSO algorithm to search for such regions across the genome. We applied the developed methodologies to a low coverage pilot dataset in the 1000 Genomes Project and a PHASE III Mexico dataset of the HapMap. We observed that 25.1%, 44.9% and 21.4% of the common variants and 89.2%, 92.4% and 75.1% of the rare variants were the LLE-correlated markers in CEU, YRI and ASI, respectively. This showed that rare variants, which are often private to specific populations, have much higher power to identify population substructure than common variants. The preliminary results demonstrated that next-generation sequencing offers a rich resource and LLE provides a powerful tool for population structure analysis.
PLoS ONE, 2009
Despite current enthusiasm for investigation of gene-gene interactions and gene-environment interactions, the essential issue of how to define and detect gene-environment interactions remains unresolved. In this report, we define gene-environment interactions as a stochastic dependence in the context of the effects of the genetic and environmental risk factors on the cause of phenotypic variation among individuals. We use mutual information, which is widely used in communication and complex system analysis, to measure gene-environment interactions. We investigate how gene-environment interactions generate the large difference in the information measure of gene-environment interactions between the general population and a diseased population, which motivates us to develop mutual information-based statistics for testing gene-environment interactions. We validated the null distribution and calculated the type 1 error rates for the mutual information-based statistics to test gene-environment interactions using extensive simulation studies. We found that the new test statistics were more powerful than the traditional logistic regression under several disease models. Finally, in order to further evaluate the performance of our new method, we applied the mutual information-based statistics to three real examples. Our results showed that P-values for the mutual information-based statistics were much smaller than those obtained by other approaches including logistic regression models.
PLoS ONE, 2008
Background: Variants in the complement cascade genes and LOC387715/HTRA1 have been widely reported to associate with age-related macular degeneration (AMD), the most common cause of visual impairment in industrialized countries. Methods/Principal Findings: We investigated the association between the LOC387715 A69S and complement component C3 R102G risk alleles in the Finnish case-control material and found a significant association with both variants (OR 2.98, p = 3.75 × 10⁻⁹, non-AMD controls, and OR 2.79, p = 2.78 × 10⁻¹⁹, blood donor controls; and OR 1.83, p = 0.008, non-AMD controls, and OR 1.39, p = 0.039, blood donor controls, respectively). Previously, we have shown a strong association between complement factor H (CFH) Y402H and AMD in the Finnish population. A carrier of at least one risk allele in each of the three susceptibility loci (LOC387715, C3, CFH) had an 18-fold risk of AMD when compared to a non-carrier homozygote in all three loci. A tentative gene-gene interaction between the two major AMD-associated loci, LOC387715 and CFH, was found in this study using a multiplicative (logistic regression) model, a synergy index (departure-from-additivity model) and the mutual information method (MI), suggesting that a common causative pathway may exist for these genes. Smoking (ever vs. never) exerted an extra risk for AMD, but somewhat surprisingly, only in connection with other factors such as sex and the C3 genotype. Population attributable risks (PAR) for the CFH, LOC387715 and C3 variants were 58.2%, 51.4% and 5.8%, respectively, the summary PAR for the three variants being 65.4%. Conclusions/Significance: Evidence for gene-gene interaction between two major AMD-associated loci, CFH and LOC387715, was obtained using three methods: logistic regression, a synergy index and the mutual information (MI) index.
Nucleic Acids Research, 2013
Digital transcriptome analysis by next-generation sequencing discovers substantial mRNA variants. Variation in gene expression underlies many biological processes and holds a key to unravelling the mechanisms of common diseases. However, the current methods for construction of co-expression networks using overall gene expression were originally designed for microarray expression data, and they overlook a large number of variations in gene expression. To use information on exon, genomic positional level and allele-specific expressions, we develop novel component-based methods, single and bivariate canonical correlation analysis, for construction of co-expression networks with RNA-seq data. To evaluate the performance of our methods for co-expression network inference with RNA-seq data, they are applied to lung squamous cell cancer expression data from the TCGA database and our bipolar disorder and schizophrenia RNA-seq study. The preliminary results demonstrate that the co-expression networks constructed by canonical correlation analysis and RNA-seq data provide rich genetic and molecular information to gain insight into biological processes and disease mechanisms. Our new methods substantially outperform the current statistical methods for co-expression network construction with microarray expression data or RNA-seq data based on overall gene expression levels.
Hypertension Research, 2002
To investigate the relationship between 12 candidate genes responsible for water regulation, sodium metabolism and membrane ion transport and essential hypertension (EH) in the Chinese, linkage analysis of EH was performed in 95 Chinese nuclear families including 477 subjects using a technique of fluorescence-based gene scanning with 12 microsatellite markers. Markers were selected on the chromosomal regions covering 12 candidate genes responsible for regulating water and sodium metabolism and membrane ion transport. These candidate genes included sodium hydrogen exchanger 3, sodium hydrogen exchanger 5, chloride bicarbonate exchanger 3, sodium calcium exchanger 1, mineralocorticoid receptor, plasma membrane calcium ATPase 2, ATPase Na/K transporting alpha, α-adducin, SA gene, kidney epithelial sodium channel-, vasopressin receptor 1A, and 11β-hydroxysteroid dehydrogenase type 2 genes. Two-point nonparametric linkage analysis (NPL), maximum LOD score analysis and transmission/disequilibrium test (TDT) were performed using the GENEHUNTER software package. The NPL analysis and LOD score suggested a significant linkage at D12S398 (Z = 2.08, p < 0.05 and LOD score = 1.26, p < 0.01, respectively). TDT indicated a significant disequilibrium of transmission at the locus (χ² = 9.00, p < 0.005). No significant linkages were found at the other loci tested (p > 0.05 or LOD < 1). In conclusion, D12S398, a marker near the vasopressin receptor 1A gene (V1AR), showed a positive linkage with EH based on the results of three statistical methods (NPL, LOD score, and TDT). This region warrants further exploration.
Human Heredity, 2012
Objectives: We aimed at extending the Natural and Orthogonal Interaction (NOIA) framework, developed for modeling gene-gene interactions in the analysis of quantitative traits, to allow for reduced genetic models, dichotomous traits, and gene-environment interactions. We evaluate the performance of the NOIA statistical models using simulated data and lung cancer data. Methods: The NOIA statistical models are developed for additive, dominant, and recessive genetic models as well as for a binary environmental exposure. Using the Kronecker product rule, a NOIA statistical model is built to model gene-environment interactions. By treating the genotypic values as the logarithm of odds, the NOIA statistical models are extended to the analysis of case-control data. Results: Our simulations showed that power for testing associations while allowing for interaction using the NOIA statistical model is much higher than using functional models for most of the scenarios we simulated. When applied...
Human Genetics, 2014
Although inversions have occasionally been found to be associated with disease susceptibility through interrupting a gene or its regulatory region, or by increasing the risk for deleterious secondary rearrangements, no association study has been specifically conducted for risks associated with inversions, mainly because existing approaches to detecting and genotyping inversions do not readily scale to a large number of samples. Based on our recently proposed approach to identifying and genotyping inversions using principal components analysis (PCA), we herein develop a method of detecting association between inversions and disease in a genome-wide fashion. Our method uses genotype data for single nucleotide polymorphisms (SNPs), and is thus cost-efficient and computationally fast. For an inversion polymorphism, local PCA around the inversion region is performed to infer the inversion genotypes of all samples. For many inversions, we found that some of the SNPs inside an inversion region are fixed in the two lineages of different orientations and thus can serve as surrogate markers. Our method can be applied to case-control and quantitative trait association studies to identify inversions that may interrupt a gene or the connection between a gene and its regulatory agents. Our method also offers a new avenue to identify inversions that are responsible for disease-causing secondary rearrangements. We applied our proposed approach to case-control data for psoriasis and identified novel associations with a few inversion polymorphisms.
Nature, 2015
A global reference for human genetic variation The 1000 Genomes Project Consortium* The 1000 Geno... more A global reference for human genetic variation The 1000 Genomes Project Consortium* The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes .99% of SNP variants with a frequency of .1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.
Genetic Epidemiology, 2018
We develop linear mixed models (LMMs) and functional linear mixed models (FLMMs) for gene‐based t... more We develop linear mixed models (LMMs) and functional linear mixed models (FLMMs) for gene‐based tests of association between a quantitative trait and genetic variants on pedigrees. The effects of a major gene are modeled as a fixed effect, the contributions of polygenes are modeled as a random effect, and the correlations of pedigree members are modeled via inbreeding/kinship coefficients. ‐statistics and χ 2 likelihood ratio test (LRT) statistics based on the LMMs and FLMMs are constructed to test for association. We show empirically that the ‐distributed statistics provide a good control of the type I error rate. The ‐test statistics of the LMMs have similar or higher power than the FLMMs, kernel‐based famSKAT (family‐based sequence kernel association test), and burden test famBT (family‐based burden test). The ‐statistics of the FLMMs perform well when analyzing a combination of rare and common variants. For small samples, the LRT statistics of the FLMMs control the type I error ...
Frontiers in genetics, 2018
The current paradigm of genomic studies of complex diseases is association and correlation analys... more The current paradigm of genomic studies of complex diseases is association and correlation analysis. Despite significant progress in dissecting the genetic architecture of complex diseases by genome-wide association studies (GWAS), the identified genetic variants by GWAS can only explain a small proportion of the heritability of complex diseases. A large fraction of genetic variants is still hidden. Association analysis has limited power to unravel mechanisms of complex diseases. It is time to shift the paradigm of genomic analysis from association analysis to causal inference. Causal inference is an essential component for the discovery of mechanism of diseases. This paper will review the major platforms of the genomic analysis in the past and discuss the perspectives of causal inference as a general framework of genomic analysis. In genomic data analysis, we usually consider four types of associations: association of discrete variables (DNA variation) with continuous variables (ph...
PLoS computational biology, 2017
Investigating the pleiotropic effects of genetic variants can increase statistical power, provide... more Investigating the pleiotropic effects of genetic variants can increase statistical power, provide important information to achieve deep understanding of the complex genetic structures of disease, and offer powerful tools for designing effective treatments with fewer side effects. However, the current multiple phenotype association analysis paradigm lacks breadth (number of phenotypes and genetic variants jointly analyzed at the same time) and depth (hierarchical structure of phenotype and genotypes). A key issue for high dimensional pleiotropic analysis is to effectively extract informative internal representation and features from high dimensional genotype and phenotype data. To explore correlation information of genetic variants, effectively reduce data dimensions, and overcome critical barriers in advancing the development of novel statistical methods and computational algorithms for genetic pleiotropic analysis, we proposed a new statistic method referred to as a quadratically r...
World Journal of Gastroenterology, 2003
AIM: To identify the susceptible gene (s) for type 2 diabetes in the prevousely mapped region, 1p... more AIM: To identify the susceptible gene (s) for type 2 diabetes in the prevousely mapped region, 1p36.33-p36.23, in Han population of North China using single nucleotide polymorphisms (SNPs) and to analyze the haplotypes of the gene (s) related to type 2 diabetes. METHODS: Twenty three SNPs located in 10 candidate genes in the mapped region were chosen from public SNP domains with bioinformatic methods, and the single base extension (SBE) method was used to genotype the loci for 192 sporadic type 2 diabetes patients and 172 normal individuals, all with Han ethical origin, to perform this casecontrol study. The haplotypes with significant difference in the gene (s) were further analyzed. RESULTS: Among the 23 SNPs, 8 were found to be common in Chinese Han population. Allele frequency of one SNP, rs436045 in the protein kinase C/ζgene (PRKCZ) was statistically different between the case and control groups (P<0.05). Furthermore, haplotypes at five SNP sites of PRKCZ gene were identified. CONCLUSION: PRKCZ gene may be associated with type 2 diabetes in Han population in North China. The haplotypes at five SNP sites in this gene may be responsible for this association.
BMC genomics, May 18, 2017
Epistasis plays an essential role in understanding the regulation mechanisms and is an essential component of the genetic architecture of the gene expressions. However, interaction analysis of gene expressions remains fundamentally unexplored due to great computational challenges and data availability. Due to variation in splicing, transcription start sites, polyadenylation sites, post-transcriptional RNA editing across the entire gene, and transcription rates of the cells, RNA-seq measurements generate large expression variability and collectively create the observed position level read count curves. A single number for measuring gene expression, which is widely used for microarray measured gene expression analysis, is highly unlikely to sufficiently account for large expression variation across the gene. Simultaneously analyzing epistatic architecture using the RNA-seq and whole genome sequencing (WGS) data poses enormous challenges. We develop a nonlinear functional regression mode...
European journal of human genetics : EJHG, Feb 21, 2016
To analyze next-generation sequencing data, multivariate functional linear models are developed for a meta-analysis of multiple studies to connect genetic variant data to multiple quantitative traits adjusting for covariates. The goal is to take advantage of both meta-analysis and pleiotropic analysis in order to improve power and to carry out a unified association analysis of multiple studies and multiple traits of complex disorders. Three types of approximate F-distributions based on Pillai-Bartlett trace, Hotelling-Lawley trace, and Wilks's Lambda are introduced to test for association between multiple quantitative traits and multiple genetic variants. Simulation analysis is performed to evaluate false-positive rates and power of the proposed tests. The proposed methods are applied to analyze lipid traits in eight European cohorts. It is shown that it is more advantageous to perform multivariate analysis than univariate analysis in general, and it is more advantageous to...
Genetics, Jan 29, 2015
We developed generalized functional linear models (GFLMs) to perform meta-analysis of multiple case-control studies to evaluate the relationship of genetic data to dichotomous traits adjusting for covariates. Unlike the previously developed MetaSKAT, which is based on mixed-effect models that treat the contributions of the major gene locus as random, GFLMs are fixed models, i.e., genetic effects of multiple genetic variants are fixed. Based on the GFLMs, we developed χ²-distributed Rao's efficient score test and likelihood ratio test (LRT) statistics to test for an association between a complex dichotomous trait and multiple genetic variants. We then performed extensive simulations to evaluate the empirical type I error rates and power performance of the proposed tests. The Rao's efficient score test statistics of GFLMs are very conservative, and have higher power than MetaSKAT when some causal variants are rare and some are common. When the causal variants are all rare (i.e., minor ...
European journal of human genetics : EJHG, Jan 15, 2015
The critical barrier in interaction analysis for next-generation sequencing (NGS) data is that the traditional pairwise interaction analysis that is suitable for common variants is difficult to apply to rare variants because of its prohibitive computational time, large number of tests and low power. The great challenges for successful detection of interactions with NGS data are (1) the need for a paradigm shift in interaction analysis; (2) severe multiple testing; and (3) heavy computations. To meet these challenges, we shift the paradigm of interaction analysis between two SNPs to interaction analysis between two genomic regions. In other words, we take a gene as a unit of analysis and use functional data analysis techniques as dimensional reduction tools to develop a novel statistic to collectively test interaction between all possible pairs of SNPs within two genome regions. By intensive simulations, we demonstrate that the functional logistic regression for interactio...
Genetics, Jan 9, 2015
Meta-analysis of genetic data must account for differences among studies including study designs, markers genotyped, and covariates. The effects of genetic variants may differ from population to population, i.e., heterogeneity. Thus, meta-analysis that combines data from multiple studies is difficult. Novel statistical methods for meta-analysis are needed. In this paper, functional linear models are developed for meta-analyses which connect genetic data to quantitative traits adjusting for covariates. The models can be used to analyze rare variants, common variants or a combination of the two. Both likelihood ratio test (LRT) and F-distributed statistics are introduced to test association between quantitative traits and multiple variants in one genetic region. Extensive simulations are performed to evaluate empirical type I error rates and power performance of the proposed tests. The proposed LRT and F-distributed statistics control the type I error very well and have higher power tha...
Genetic epidemiology, Jan 23, 2015
In genetics, pleiotropy describes the genetic effect of a single gene on multiple phenotypic traits. A common approach is to analyze the phenotypic traits separately using univariate analyses and combine the test results through multiple comparisons. This approach may lead to low power. Multivariate functional linear models are developed to connect genetic variant data to multiple quantitative traits adjusting for covariates for a unified analysis. Three types of approximate F-distribution tests based on Pillai-Bartlett trace, Hotelling-Lawley trace, and Wilks's Lambda are introduced to test for association between multiple quantitative traits and multiple genetic variants in one genetic region. The approximate F-distribution tests provide much more significant results than those of F-tests of univariate analysis and optimal sequence kernel association test (SKAT-O). Extensive simulations were performed to evaluate the false positive rates and power performance of the proposed m...
European Journal of Human Genetics, 2014
Although pathway analysis methods have been developed and successfully applied to association studies of common variants, the statistical methods for pathway-based association analysis of rare variants have not been well developed. Many investigators observed highly inflated false-positive rates and low power in pathway-based tests of association of rare variants. The inflated false-positive rates and low true-positive rates of the current methods are mainly due to their lack of ability to account for gametic phase disequilibrium. To overcome these serious limitations, we develop a novel statistic that is based on the smoothed functional principal component analysis (SFPCA) for pathway association tests with next-generation sequencing data. The developed statistic has the ability to capture position-level variant information and account for gametic phase disequilibrium. By intensive simulations, we demonstrate that the SFPCA-based statistic for testing pathway association with either rare or common or both rare and common variants has the correct type 1 error rates. We also evaluate the power of the SFPCA-based statistic and 22 additional existing statistics. We found that the SFPCA-based statistic has a much higher power than other existing statistics in all the scenarios considered. To further evaluate its performance, the SFPCA-based statistic is applied to pathway analysis of exome sequencing data in the early-onset myocardial infarction (EOMI) project. We identify three pathways significantly associated with EOMI after the Bonferroni correction. In addition, our preliminary results show that the SFPCA-based statistic has much smaller P-values to identify pathway association than other existing methods.
Genetic epidemiology, 2014
By using functional data analysis techniques, we developed generalized functional linear models for testing association between a dichotomous trait and multiple genetic variants in a genetic region while adjusting for covariates. Both fixed and mixed effect models are developed and compared. Extensive simulations show that Rao's efficient score tests of the fixed effect models are very conservative since they generate lower type I errors than nominal levels, and global tests of the mixed effect models generate accurate type I errors. Furthermore, we found that the Rao's efficient score test statistics of the fixed effect models have higher power than the sequence kernel association test (SKAT) and its optimal unified version (SKAT-O) in most cases when the causal variants are both rare and common. When the causal variants are all rare (i.e., minor allele frequencies less than 0.03), the Rao's efficient score test statistics and the global tests have similar or slightly l...
PLoS ONE, 2012
The dimension of the population genetics data produced by next-generation sequencing platforms is extremely high. However, the "intrinsic dimensionality" of sequence data, which determines the structure of populations, is much lower. This motivates us to use locally linear embedding (LLE), which projects high dimensional genomic data into a low dimensional, neighborhood preserving embedding, as a general framework for population structure and historical inference. To facilitate application of the LLE to population genetic analysis, we systematically investigate several important properties of the LLE and reveal the connection between the LLE and principal component analysis (PCA). Identifying a set of markers and genomic regions which could be used for population structure analysis will provide invaluable information for population genetics and association studies. In addition to identifying the LLE-correlated or PCA-correlated structure informative markers, we have developed a new statistic that integrates genomic information content in a genomic region for collectively studying its association with the population structure, and a LASSO algorithm to search for such regions across the genomes. We applied the developed methodologies to a low coverage pilot dataset in the 1000 Genomes Project and a PHASE III Mexico dataset of the HapMap. We observed that 25.1%, 44.9% and 21.4% of the common variants and 89.2%, 92.4% and 75.1% of the rare variants were the LLE-correlated markers in CEU, YRI and ASI, respectively. This showed that rare variants, which are often private to specific populations, have much higher power to identify population substructure than common variants. The preliminary results demonstrated that next generation sequencing offers a rich resource and LLE provides a powerful tool for population structure analysis.
PLoS ONE, 2009
Despite current enthusiasm for investigation of gene-gene interactions and gene-environment interactions, the essential issue of how to define and detect gene-environment interactions remains unresolved. In this report, we define gene-environment interactions as a stochastic dependence in the context of the effects of the genetic and environmental risk factors on the cause of phenotypic variation among individuals. We use mutual information, which is widely used in communication and complex system analysis, to measure gene-environment interactions. We investigate how gene-environment interactions generate the large difference in the information measure of gene-environment interactions between the general population and a diseased population, which motivates us to develop mutual information-based statistics for testing gene-environment interactions. We validated the null distribution and calculated the type 1 error rates for the mutual information-based statistics to test gene-environment interactions using extensive simulation studies. We found that the new test statistics were more powerful than the traditional logistic regression under several disease models. Finally, in order to further evaluate the performance of our new method, we applied the mutual information-based statistics to three real examples. Our results showed that P-values for the mutual information-based statistics were much smaller than those obtained by other approaches including logistic regression models.
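As a rough illustration of the information measure this abstract refers to, the sketch below computes the plain mutual information between a genotype and an environmental exposure from a contingency table, once for a diseased group and once for a general-population group. It is a minimal sketch only: the function name `mutual_information` and all counts are hypothetical, and it does not reproduce the test statistics or null distribution developed in the paper.

```python
import math


def mutual_information(table):
    """Mutual information (in nats) between the row and column variables
    of a contingency table given as a list of rows of counts."""
    total = sum(sum(row) for row in table)
    row_p = [sum(row) / total for row in table]
    col_p = [sum(row[j] for row in table) / total for j in range(len(table[0]))]
    mi = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count > 0:  # empty cells contribute nothing
                p = count / total
                mi += p * math.log(p / (row_p[i] * col_p[j]))
    return mi


# Hypothetical counts. Rows: genotype (carrier, non-carrier);
# columns: exposure (exposed, unexposed).
cases = [[40, 10], [10, 40]]      # genotype and exposure are dependent
controls = [[25, 25], [25, 25]]   # genotype and exposure are independent

mi_cases = mutual_information(cases)
mi_controls = mutual_information(controls)
print(mi_cases, mi_controls)
```

With these made-up counts the measure is zero in the general-population table and positive in the diseased table, which is the kind of difference the abstract describes as the basis for a test.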
PLoS ONE, 2008
Background: Variants in the complement cascade genes and the LOC387715/HTRA1 have been widely reported to associate with age-related macular degeneration (AMD), the most common cause of visual impairment in industrialized countries. Methods/Principal Findings: We investigated the association between the LOC387715 A69S and complement component C3 R102G risk alleles in the Finnish case-control material and found a significant association with both variants (OR 2.98, p = 3.75 × 10⁻⁹, non-AMD controls, and OR 2.79, p = 2.78 × 10⁻¹⁹, blood donor controls; and OR 1.83, p = 0.008, non-AMD controls, and OR 1.39, p = 0.039, blood donor controls), respectively. Previously, we have shown a strong association between complement factor H (CFH) Y402H and AMD in the Finnish population. A carrier of at least one risk allele in each of the three susceptibility loci (LOC387715, C3, CFH) had an 18-fold risk of AMD when compared to a non-carrier homozygote in all three loci. A tentative gene-gene interaction between the two major AMD-associated loci, LOC387715 and CFH, was found in this study using a multiplicative (logistic regression) model, a synergy index (departure-from-additivity model) and the mutual information method (MI), suggesting that a common causative pathway may exist for these genes. Smoking (ever vs. never) exerted an extra risk for AMD, but somewhat surprisingly, only in connection with other factors such as sex and the C3 genotype. Population attributable risks (PAR) for the CFH, LOC387715 and C3 variants were 58.2%, 51.4% and 5.8%, respectively, the summary PAR for the three variants being 65.4%. Conclusions/Significance: Evidence for gene-gene interaction between two major AMD associated loci CFH and LOC387715 was obtained using three methods, logistic regression, a synergy index and the mutual information (MI) index.
Nucleic Acids Research, 2013
Digital transcriptome analysis by next-generation sequencing discovers substantial mRNA variants. Variation in gene expression underlies many biological processes and holds a key to unravelling the mechanisms of common diseases. However, the current methods for construction of co-expression networks using overall gene expression were originally designed for microarray expression data, and they overlook a large number of variations in gene expressions. To use information on exon, genomic positional level and allele-specific expressions, we develop novel component-based methods, single and bivariate canonical correlation analysis, for construction of co-expression networks with RNA-seq data. To evaluate the performance of our methods for co-expression network inference with RNA-seq data, they are applied to lung squamous cell cancer expression data from the TCGA database and our bipolar disorder and schizophrenia RNA-seq study. The preliminary results demonstrate that the co-expression networks constructed by canonical correlation analysis and RNA-seq data provide rich genetic and molecular information to gain insight into biological processes and disease mechanisms. Our new methods substantially outperform the current statistical methods for co-expression network construction with microarray expression data or RNA-seq data based on overall gene expression levels.
Hypertension Research, 2002
To investigate the relationship between 12 candidate genes responsible for water regulation, sodium metabolism and membrane ion transport and essential hypertension (EH) in the Chinese. Linkage analysis of EH was performed in 95 Chinese nuclear families including 477 subjects using a technique of fluorescence-based gene scanning with 12 microsatellite markers. Markers were selected on the chromosomal regions covering 12 candidate genes responsible for regulating water and sodium metabolism and membrane ion transport. These candidate genes included sodium hydrogen exchanger 3, sodium hydrogen exchanger 5, chloride bicarbonate exchanger 3, sodium calcium exchanger 1, mineralocorticoid receptor, plasma membrane calcium ATPase 2, ATPase Na/K transporting alpha, -adducin, SA gene, kidney epithelial sodium channel-, vasopressin receptor 1A, and 11β-hydroxysteroid dehydrogenase type 2 genes. Two-point nonparametric linkage analysis (NPL), maximum LOD score analysis and transmission/disequilibrium test (TDT) were performed using the GENEHUNTER software package. The NPL analysis and LOD score suggested a significant linkage at D12S398 (Z = 2.08, p < 0.05 and LOD score = 1.26, p < 0.01, respectively). TDT indicated a significant disequilibrium of transmission at the locus (χ² = 9.00, p < 0.005). No significant linkages were found at the other loci tested (p > 0.05 or LOD < 1). In conclusion, D12S398, a marker near the vasopressin receptor 1A gene (V1AR), showed a positive linkage with EH based on the results of three statistical methods (NPL, LOD score, and TDT). This region warrants further exploration.
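For readers unfamiliar with the transmission/disequilibrium test mentioned in this abstract, the sketch below shows its simplest biallelic form: among heterozygous parents, the count of transmissions of an allele is compared with the count of non-transmissions using a McNemar-type chi-square with one degree of freedom. This is an illustrative sketch only. The function names and the counts are hypothetical, and it is not the multi-allelic microsatellite analysis that GENEHUNTER performs in the study.

```python
import math


def tdt_statistic(transmitted, untransmitted):
    """Biallelic TDT chi-square: (b - c)^2 / (b + c), where b and c are the
    numbers of heterozygous parents who did / did not transmit the allele."""
    b, c = transmitted, untransmitted
    if b + c == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (b - c) ** 2 / (b + c)


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# Hypothetical counts: 30 transmissions versus 10 non-transmissions.
stat = tdt_statistic(30, 10)
p_value = chi2_sf_1df(stat)
print(stat, p_value)
```

Under the null hypothesis of no linkage or association, transmissions and non-transmissions are equally likely, so a large imbalance yields a large statistic and a small p-value.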
Human Heredity, 2012
Objectives: We aimed at extending the Natural and Orthogonal Interaction (NOIA) framework, developed for modeling gene-gene interactions in the analysis of quantitative traits, to allow for reduced genetic models, dichotomous traits, and gene-environment interactions. We evaluate the performance of the NOIA statistical models using simulated data and lung cancer data. Methods: The NOIA statistical models are developed for additive, dominant, and recessive genetic models as well as for a binary environmental exposure. Using the Kronecker product rule, a NOIA statistical model is built to model gene-environment interactions. By treating the genotypic values as the logarithm of odds, the NOIA statistical models are extended to the analysis of case-control data. Results: Our simulations showed that power for testing associations while allowing for interaction using the NOIA statistical model is much higher than using functional models for most of the scenarios we simulated. When applied...
Human Genetics, 2014
Although inversions have occasionally been found to be associated with disease susceptibility through interrupting a gene or its regulatory region, or by increasing the risk for deleterious secondary rearrangements, no association study has been specifically conducted for risks associated with inversions, mainly because existing approaches to detecting and genotyping inversions do not readily scale to a large number of samples. Based on our recently proposed approach to identifying and genotyping inversions using principal components analysis (PCA), we herein develop a method of detecting association between inversions and disease in a genomewide fashion. Our method uses genotype data for single nucleotide polymorphisms (SNPs), and is thus cost-efficient and computationally fast. For an inversion polymorphism, local PCA around the inversion region is performed to infer the inversion genotypes of all samples. For many inversions, we found that some of the SNPs inside an inversion region are fixed in the two lineages of different orientations and thus can serve as surrogate markers. Our method can be applied to case-control and quantitative trait association studies to identify inversions that may interrupt a gene or the connection between a gene and its regulatory agents. Our method also offers a new avenue to identify inversions that are responsible for disease-causing secondary rearrangements. We applied our proposed approach to case-control data for psoriasis and identified novel associations with a few inversion polymorphisms.