Dabao Zhang - Academia.edu

Papers by Dabao Zhang

Genotyping Error Detection in Samples of Unrelated Individuals without Replicate Genotyping

Human Heredity, Dec 15, 2008

...typing error rates and allele frequencies. This work may help researchers to estimate error rates and to use the estimates in their analysis to increase power and decrease bias, without the extra work of genotyping family members or replicates.

Exploring the Effects of Genetic Variants on Clinical Profiles of Parkinson’s Disease Assessed by the Unified Parkinson’s Disease Rating Scale and the Hoehn–Yahr Stage

PLOS ONE, Jun 14, 2016

Many genetic variants have been linked to familial or sporadic Parkinson's disease (PD), among which those identified in the PARK16, BST1, SNCA, LRRK2, GBA and MAPT genes have been demonstrated to be the most common risk factors worldwide. Moreover, complex gene-gene and gene-environment interactions have been highlighted in PD pathogenesis. Compared to studies focusing on the predisposing effects of genes, there is a relative lack of research investigating how these genes and their interactions influence the clinical profiles of PD. In a cohort consisting of 2,011 Chinese Han PD patients, we selected 9 representative variants from the 6 above-mentioned common PD genes to analyze their main and epistatic effects on the Unified Parkinson's Disease Rating Scale (UPDRS) and the Hoehn and Yahr (H-Y) stage of PD. With multiple linear regression models adjusting for medication status, disease duration, gender and age at onset, none of the variants displayed significant main effects on UPDRS or the H-Y scores. However, for gene-gene interaction analyses, 7 out of 37 pairs of variants showed significant or marginally significant associations with these scores. Among these, the GBA rs421016 (L444P)×LRRK2 rs33949390 (R1628P) interaction was consistently significant in relation to UPDRS III and UPDRS total (I+II+III), even after controlling for the family-wise error rate using False Discovery Rate (FDR-corrected p values are 0.0481 and 0.0070, respectively). Although the effects of the remaining pairs of variants did not survive the FDR correction, they showed marginally significant associations with either UPDRS or the H-Y stage (raw p<0.05). Our results highlight the importance of epistatic effects of multiple genes on the determination of ...
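The abstract above reports FDR-corrected p-values but does not say which procedure was used. As a minimal sketch, assuming the common Benjamini-Hochberg adjustment, the correction can be computed as follows. The function name and the p-values are made up for illustration and are not taken from the paper.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values (illustrative only).
raw = [0.001, 0.008, 0.039, 0.041, 0.60]
adj = benjamini_hochberg(raw)
print(adj)
```

A test whose raw p-value is below 0.05 can end up above 0.05 after adjustment, which is the situation the abstract describes for the pairs that "did not survive the FDR correction".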

Orphanin FQ Antagonizes the Inhibition of Ca²⁺ Currents Induced by µ-Opioid Receptors

Journal of Molecular Neuroscience, 2005

Orphanin FQ (OFQ), an endogenous peptide ligand of opioid receptor-like receptors (ORLs), has properties similar to traditional opioids. This peptide inhibits adenylyl cyclase and voltage-gated calcium channels but stimulates inwardly rectifying potassium channels. Among other actions, however, OFQ also has pharmacological functions that are different from, or even opposite to, those of opioids. For example, OFQ antagonizes the behavioral analgesic effects mediated by κ- and µ-opioid receptors. In a previous paper, we reported that OFQ antagonizes inhibition of calcium channels mediated by κ-opioid receptors. We report here that OFQ also antagonizes the inhibition of calcium channels mediated by µ-opioid receptors. Further, single-cell RT-PCR reveals that the antagonistic effect of OFQ is correlated with the presence of ORL1 mRNA in individual cells.

A Coefficient of Determination for Generalized Linear Models

The American Statistician, Oct 2, 2017

The coefficient of determination, a.k.a. R², is well-defined in linear regression models, and measures the proportion of variation in the dependent variable explained by the predictors included in the model. To extend it to generalized linear models, we use the variance function to define the total variation of the dependent variable, as well as the remaining variation of the dependent variable after modeling the predictive effects of the independent variables. Unlike other definitions, which demand complete specification of the likelihood function, our definition of R² only requires the mean and variance functions, so it is applicable to more general quasi-models. It is consistent with the classical measure of uncertainty using variance, and reduces to the classical definition of the coefficient of determination when linear regression models are considered.
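For reference, the classical coefficient of determination that the abstract says the proposed measure reduces to is one minus the ratio of residual to total sum of squares. The sketch below covers only that classical linear-regression case. It is not the variance-function-based definition proposed in the paper, and the data are invented for illustration.

```python
def r_squared(y, y_hat):
    """Classical R^2: 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed values and fitted values from some linear model.
y_obs = [1.0, 2.0, 3.0, 4.0]
y_fit = [1.1, 1.9, 3.2, 3.8]
print(r_squared(y_obs, y_fit))
```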

Generalized Thresholding Estimators for High-Dimensional Location Parameters

Analyzing high-throughput genomic, proteomic, and metabolomic data usually involves estimating high-dimensional location parameters. Thresholding estimators can significantly improve such estimation when many parameters are zero, i.e., parameters are sparse. Several such estimators have been constructed to be adaptive to parameter sparsity. However, they assume that the underlying parameter spaces are symmetric. Since many applications present asymmetric parameter spaces, we introduce a class of generalized thresholding estimators. A construction of these estimators is developed using a Bayes approach, where an important constraint on the hyperparameters is identified. A generalized empirical Bayes implementation is presented for estimating high-dimensional yet sparse normal means. This implementation provides generalized thresholding estimators which are adaptive to both sparsity and asymmetry of high-dimensional parameters.
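To make the idea of thresholding under asymmetry concrete, here is a minimal sketch of a soft-thresholding rule that uses different thresholds for negative and positive observations. This is only an illustration of the general concept. It is not the Bayes or empirical Bayes estimator described in the abstract, and the function name and thresholds `t_neg` and `t_pos` are hypothetical.

```python
def asymmetric_soft_threshold(x, t_neg, t_pos):
    """Shrink x toward zero, using t_pos on the positive side and t_neg on the negative side.

    Observations inside [-t_neg, t_pos] are set exactly to zero (sparsity).
    """
    if x > t_pos:
        return x - t_pos
    if x < -t_neg:
        return x + t_neg
    return 0.0


# Hypothetical noisy observations; most true means are assumed to be zero.
observations = [3.0, -3.0, 1.5, -0.5, 0.2]
estimates = [asymmetric_soft_threshold(x, t_neg=1.0, t_pos=2.0) for x in observations]
print(estimates)
```

Setting `t_neg == t_pos` recovers ordinary symmetric soft thresholding, which corresponds to the symmetric parameter-space assumption that the abstract says earlier estimators rely on.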

Inferring Gene Regulatory Networks from a Population of Yeast Segregants

Scientific Reports, Feb 4, 2019

Constructing gene regulatory networks is crucial to unraveling the genetic architecture of complex traits and to understanding the mechanisms of diseases. On the basis of gene expression and single nucleotide polymorphism data in the yeast, Saccharomyces cerevisiae, we constructed gene regulatory networks using a two-stage penalized least squares method. A large system of structural equations via optimal prediction of a set of surrogate variables was established at the first stage, followed by consistent selection of regulatory effects at the second stage. Using this approach, we identified subnetworks that were enriched in gene ontology categories, revealing directional regulatory mechanisms controlling these biological pathways. Our mapping and analysis of expression-based quantitative trait loci uncovered a known alteration of gene expression within a biological pathway that results in regulatory effects on companion pathway genes in the phosphocholine network. In addition, we identify nodes in these gene ontology-enriched subnetworks that are coordinately controlled by transcription factors driven by trans-acting expression quantitative trait loci. Altogether, the integration of documented transcription factor regulatory associations with subnetworks defined by a system of structural equations using quantitative trait loci data is an effective means to delineate the transcriptional control of biological pathways.

Gene expression is a fundamental step in the flow of information from an organism's genotype to phenotype. The genetic information encoded in an organism's DNA is transferred into a functional gene product (e.g., protein) via the process of gene expression, and gene expression leads to the formation of the organism's phenotype. Gene expression has been found to be associated with a broad range of complex traits and diseases [1], and thus plays an important role in determining an organism's development. Numerous efforts have been made to map phenotypes to gene expression in order to dissect their genetic basis. Genes rarely act in isolation; instead, they interact with each other and make up gene regulatory networks to function as a whole [2]. The study of this mechanism is crucial for understanding the properties and functions of genes, which helps reveal the genetic architecture of complex traits and diseases. Although genetic experiments can be conducted to discover interactions among genes, this approach can be costly and time consuming. Alternatively, measurements of gene expression levels reveal gene expression patterns in a specific condition and can be exploited to infer gene regulatory networks. Various approaches have been proposed to infer gene regulatory networks using gene expression data, such as relevance networks [3-7], Bayesian networks [8-11], Gaussian graphical models [12-15], and many others. Recent advances in sequencing technologies make it feasible to obtain both whole-genome genotype and gene expression for each individual, i.e., genetical genomics data [16]. Combining genetics with gene expression reveals additional information on genetic structure and holds great promise for improving the accuracy of gene regulatory network inference. Numerous genetical genomics experiments, such as the Genotype-Tissue Expression (GTEx) project [17], have been conducted to collect genetical genomics data. Much effort has been devoted to using genetical genomics data for genome-wide association (GWA) analysis of gene expression, i.e., expression quantitative trait loci (eQTL) mapping [18]. Mapping of eQTL intends to elucidate variation of expression traits attributed to genomic variation, and to identify chromosomal loci (i.e., eQTL) ...

Simultaneous genome-wide association studies of anti-cyclic citrullinated peptide in rheumatoid arthritis using penalized orthogonal-components regression

BMC Proceedings, Dec 1, 2009

Genome-wide association analyses between single-nucleotide polymorphisms and clinical traits were conducted simultaneously using penalized orthogonal-components regression. This method was developed to identify the genetic variants controlling phenotypes from a massive number of candidate variants. By investigating the association between all single-nucleotide polymorphisms and the phenotype of antibodies against cyclic citrullinated peptide, using the rheumatoid arthritis data provided by Genetic Analysis Workshop 16, we identified genetic regions which may contribute to the pathogenesis of rheumatoid arthritis. Bioinformatic analysis of these genomic regions showed that most of them harbor protein-coding gene(s).

Case-control genome-wide association study of rheumatoid arthritis from Genetic Analysis Workshop 16 using penalized orthogonal-components regression-linear discriminant analysis

BMC Proceedings, Dec 1, 2009

Currently, genome-wide association studies (GWAS) are conducted by collecting a massive number of single-nucleotide polymorphisms (SNPs) (i.e., large p) for a relatively small number of individuals (i.e., small n), and associations are made between clinical phenotypes and genetic variation one SNP at a time. Univariate association approaches like this ignore the linkage disequilibrium between SNPs in regions of low recombination. This results in a low reliability of candidate gene identification. Here we propose to improve the case-control GWAS approach by implementing linear discriminant analysis (LDA) through penalized orthogonal-components regression (POCRE), a newly developed variable selection method for large p small n data. The proposed POCRE-LDA method was applied to the Genetic Analysis Workshop 16 case-control data for rheumatoid arthritis (RA). In addition to the two regions on chromosomes 6 and 9 previously associated with RA by GWAS, we identified SNPs on chromosomes 10 and 18 as potential candidates for further investigation.

Altered metabolite levels and correlations in patients with colorectal cancer and polyps detected using seemingly unrelated regression analysis

Metabolomics, Sep 15, 2017

Introduction: Metabolomics technologies enable the identification of putative biomarkers for numerous diseases; however, the influence of confounding factors on metabolite levels poses a major challenge in moving forward with such metabolites for pre-clinical or clinical applications.

Large‐scale identification of expression quantitative trait loci in Arabidopsis reveals novel candidate regulators of immune responses and other processes

Journal of Integrative Plant Biology, May 4, 2020

The extensive phenotypic diversity within natural populations of Arabidopsis is associated with differences in gene expression. Transcript levels can be considered as inheritable quantitative traits, and used to map expression quantitative trait loci (eQTL) in genome-wide association studies (GWASs). In order to identify putative genetic determinants for variations in gene expression, we used publicly available genomic and transcript variation data from 665 Arabidopsis accessions and applied the SNP-set (Sequence) Kernel Association Test (SKAT) method for the identification of eQTL. Moreover, we used the penalized orthogonal-components regression (POCRE) method to increase the power of statistical tests. Then, gene annotations were used as test units to identify genes that are associated with natural variations in transcript accumulation, which correspond to candidate regulators, some of which may have a broad impact on gene expression. Besides increasing the chances to identify real associations, the analysis using POCRE and SKAT significantly reduced the computational cost required to analyze large datasets. As a proof of concept, we used this approach to identify eQTL that represent novel candidate regulators of immune responses. The versatility of this approach allows its ...

Multiplicative background correction for spotted microarrays to improve reproducibility

Genetics Research, Jun 1, 2006

These authors contributed equally to this work

Understanding genome and chromosome evolution is important for understanding genetic inheritance and evolution. Universal events comprising DNA replication, transcription, repair, mobile genetic element transposition, chromosome rearrangements, mitosis, and meiosis underlie inheritance and variation of living organisms. Although the genome of a species as a whole is important, chromosomes are the basic units subjected to genetic events that coin evolution to a large extent. Now that many complete genome sequences are available, we can address evolution and variation of individual chromosomes across species. For example, "How are the repeat and nonrepeat proportions of genetic codes distributed among different chromosomes in a multichromosome species?" "Is there a general rule behind the intuitive observation that chromosome lengths tend to be similar in a species, and if so, can we generalize any findings in chromosome content and size across different taxonomic groups?" Here, we s...

Variable selection for large

Differential Analysis of Directed Networks

arXiv (Cornell University), Jul 26, 2018

We developed a novel statistical method to identify structural differences between networks characterized by structural equation models. We propose to reparameterize the model to separate the differential structures from common structures, and then design an algorithm with calibration and construction stages to identify these differential structures. The calibration stage serves to obtain consistent predictions by fitting an ℓ2-regularized regression of each endogenous variable on pre-screened exogenous variables, correcting for potential endogeneity issues. The construction stage consistently selects and estimates both common and differential effects by fitting an ℓ1-regularized regression of each endogenous variable on the predictions of the other endogenous variables as well as on its anchoring exogenous variables. Our method allows easy parallel computation at each stage. Theoretical results are obtained to establish non-asymptotic error bounds of predictions and estimates at both stages, as well as the consistency of identified common and differential effects. Our studies on synthetic data demonstrated that our proposed method performed much better than independently constructing the networks. A real data set is analyzed to illustrate the applicability of our method.

Penalized orthogonal-components regression for large p small n data

Electronic Journal of Statistics, 2009

Here we propose a penalized orthogonal-components regression (POCRE) for large p small n data. Orthogonal components are sequentially constructed to maximize, upon standardization, their correlation to the response residuals. A new penalization framework, implemented via empirical Bayes thresholding, is presented to effectively identify sparse predictors of each component. POCRE is computationally efficient owing to its sequential construction of leading sparse principal components. In addition, such construction offers other properties such as grouping highly correlated predictors and allowing for collinear or nearly collinear predictors. With multivariate responses, POCRE can construct common components and thus build up latent-variable models for large p small n data.

Internet Accessible

Research Article (Open Accession)

Understanding genome and chromosome evolution is important for understanding genetic inheritance and evolution. Universal events comprising DNA replication, transcription, repair, mobile genetic element transposition, chromosome rearrangements, mitosis, and meiosis underlie inheritance and variation of living organisms. Although the genome of a species as a whole is important, chromosomes are the basic units subjected to genetic events that coin evolution to a large extent. Now that many complete genome sequences are available, we can address evolution and variation of individual chromosomes across species. For example, "How are the repeat and nonrepeat proportions of genetic codes distributed among different chromosomes in a multichromosome species?" "Is there a general rule behind the intuitive observation that chromosome lengths tend to be similar in a species, and if so, can we generalize any findings in chromosome content and size across different taxonomic groups?" Here we show that chromosomes within a species do not show dramatic fluctuation in their content of mobile genetic elements as the proliferation of these elements increases from unicellular eukaryotes to vertebrates. Furthermore, we demonstrate that, notwithstanding the remarkable plasticity, there is an upper limit to chromosome size variation in diploid eukaryotes with linear chromosomes. Strikingly, variation in chromosome size for 886 chromosomes in 68 eukaryotic genomes (including 22 human autosomes) can be viably captured by a single model, which predicts that the vast majority of the chromosomes in a species are expected to have a base-pair length between 0.4035 and 1.8626 times the average chromosome length. This conserved boundary of chromosome size variation, which prevails across a wide taxonomic range with few exceptions, indicates that cellular, molecular, and evolutionary mechanisms, possibly together, confine the chromosome lengths around a species-specific average chromosome length.
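The predicted range quoted in the abstract (0.4035 to 1.8626 times the average chromosome length) is straightforward to check against a set of chromosome lengths. The sketch below shows one way to do that. The function name and the lengths are hypothetical and do not correspond to any real genome.

```python
# Multipliers of the average chromosome length quoted in the abstract.
LOWER_FACTOR = 0.4035
UPPER_FACTOR = 1.8626


def within_predicted_range(lengths):
    """For each chromosome length, report whether it lies inside the predicted interval."""
    mean_len = sum(lengths) / len(lengths)
    lower = LOWER_FACTOR * mean_len
    upper = UPPER_FACTOR * mean_len
    return [lower <= length <= upper for length in lengths]


# Hypothetical chromosome lengths in megabases (illustrative only).
chromosome_lengths = [100, 120, 80, 250, 30]
flags = within_predicted_range(chromosome_lengths)
print(flags)
```

In this invented example the average is 116, so the interval is roughly 46.8 to 216.1 and the last two chromosomes fall outside it.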

Serum metabolomic analysis reveals several novel metabolites in association with excessive alcohol use – an exploratory study

Translational Research, 2021

An appropriate screening tool for excessive alcohol use (EAU) is clinically important, as it may help providers encourage early intervention and prevent adverse outcomes. We hypothesized that patients with excessive alcohol use would have distinct serum metabolites when compared to healthy controls. Serum metabolic profiling of 22 healthy controls and 147 patients with a history of EAU was performed. We employed seemingly unrelated regression to identify the unique metabolites and found 67 metabolites (out of 556) that were differentially expressed in patients with EAU. Sixteen metabolites belong to sphingolipid metabolism, 13 belong to phospholipid metabolism, and the remaining 38 were metabolites of 25 different pathways. We also found 93 serum metabolites that were significantly associated with the total quantity of alcohol consumption in the last 30 days. A total of 15 metabolites belong to sphingolipid metabolism, 11 belong to phospholipid metabolism, and 7 metabolites belong to lysolipid. Using a Venn diagram approach, we found the top 10 metabolites that were both differentially expressed in EAU and significantly associated with the quantity of alcohol consumption: sphingomyelin (d18:2/18:1), sphingomyelin (d18:2/21:0, d16:2/23:0), guanosine, S-methylmethionine, 10-undecenoate (11:1n1), sphingomyelin (d18:1/20:1, d18:2/20:0), sphingomyelin (d18:1/17:0, d17:1/18:0, d19:1/16:0), N-acetylasparagine, sphingomyelin (d18:1/19:0, d19:1/18:0), and 1-palmitoyl-2-palmitoleoyl-GPC (16:0/16:1). The diagnostic performance of the top 10 metabolites, measured by the area under the ROC curve, was significantly higher than that of commonly used markers. We have identified a unique metabolic signature among patients with EAU. Future studies to validate and determine the kinetics of these markers as a function of alcohol consumption are needed.
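The abstract summarizes diagnostic performance by the area under the ROC curve (AUC). As a minimal sketch of what that quantity measures, the AUC equals the probability that a randomly chosen case scores higher than a randomly chosen control, with ties counted as one half. The function and scores below are invented for illustration and are not the study's data or analysis code.

```python
def auc(case_scores, control_scores):
    """Area under the ROC curve via pairwise comparison of cases and controls."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5  # ties contribute half a win
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical marker scores (illustrative only).
cases = [0.9, 0.8, 0.4]
controls = [0.7, 0.3, 0.2]
print(auc(cases, controls))
```

An AUC of 0.5 corresponds to a marker that does no better than chance, and 1.0 to perfect separation of cases from controls.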

Integrating Biological Knowledge Into Case–Control Analysis Through Iterated Conditional Modes/Medians Algorithm

Journal of Computational Biology, 2019

Logistic regression is an effective tool in case-control analysis. With advances in high-throughput technology, the search for fast and efficient methods for fitting high-dimensional logistic regression has gained much interest. An empirical Bayes model for logistic regression is considered in this article. A spike-and-slab prior is used for variable selection, which plays a vital role in building an effective predictive model while keeping the model interpretable. To increase the power of variable selection, we incorporate biological knowledge through the Ising prior. An iterated conditional modes/medians (ICM/M) algorithm is proposed to fit the logistic model; it has a computational advantage over Markov chain Monte Carlo (MCMC) algorithms. The implementation of the ICM/M algorithm for both linear and logistic models can be found in the R package icmm, which is freely available on the Comprehensive R Archive Network (CRAN). Simulation studies were carried out to assess the performance of our method, with lasso and adaptive lasso as benchmarks. Overall, the simulation studies show that ICM/M outperforms the others in terms of the number of false positives and has competitive predictive ability. An application to a real data set from a Parkinson's disease study was also carried out for illustration. To identify important variables, our approach provides the flexibility to select variables based on local posterior probabilities while controlling the false discovery rate at a desired level, rather than relying only on regression coefficients.

Advanced Statistical Methods for NMR-Based Metabolomics

NMR-Based Metabolomics, 2019

Despite the increasing popularity and applicability of metabolomics for putative biomarker identification, analysis of the data is challenged by low statistical power resulting from the small sample sizes and large numbers of metabolites and other omics information, as well as confounding demographic and clinical variables. To enhance the statistical power and improve reproducibility of the identified metabolite-based biomarkers, we advocate the use of advanced statistical methods that can simultaneously evaluate the relationship between a group of metabolites and various types of variables including other omics profiles, demographic and clinical data, as well as the complex interactions between them. Accordingly, in this chapter, we describe the method of seemingly unrelated regression that can simultaneously analyze multiple metabolites while controlling the confounding effects of demographic and clinical variables (such as gender, age, BMI, smoking status). We also introduce penalized orthogonal components regression as a screening approach that can handle millions of omics predictors in the model.

Research paper thumbnail of Genotyping Error Detection in Samples of Unrelated Individuals without Replicate Genotyping

Human Heredity, Dec 15, 2008

typing error rates and allele frequencies. This work may help researchers to estimate error rates... more typing error rates and allele frequencies. This work may help researchers to estimate error rates and to use the estimates in their analysis to increase power and decrease bias, without the extra work of genotyping family members or replicates.

Research paper thumbnail of Exploring the Effects of Genetic Variants on Clinical Profiles of Parkinson’s Disease Assessed by the Unified Parkinson’s Disease Rating Scale and the Hoehn–Yahr Stage

PLOS ONE, Jun 14, 2016

Many genetic variants have been linked to familial or sporadic Parkinson's disease (PD), among wh... more Many genetic variants have been linked to familial or sporadic Parkinson's disease (PD), among which those identified in PARK16, BST1, SNCA, LRRK2, GBA and MAPT genes have been demonstrated to be the most common risk factors worldwide. Moreover, complex gene-gene and gene-environment interactions have been highlighted in PD pathogenesis. Compared to studies focusing on the predisposing effects of genes, there is a relative lack of research investigating how these genes and their interactions influence the clinical profiles of PD. In a cohort consisting of 2,011 Chinese Han PD patients, we selected 9 representative variants from the 6 above-mentioned common PD genes to analyze their main and epistatic effects on the Unified Parkinson's Disease Rating Scale (UPDRS) and the Hoehn and Yahr (H-Y) stage of PD. With multiple linear regression models adjusting for medication status, disease duration, gender and age at onset, none of the variants displayed significant main effects on UPDRS or the H-Y scores. However, for gene-gene interaction analyses, 7 out of 37 pairs of variants showed significant or marginally significant associations with these scores. Among these, the GBA rs421016 (L444P)×LRRK2 rs33949390 (R1628P) interaction was consistently significant in relation to UPDRS III and UPDRS total (I+II+III), even after controlling for the family-wise error rate using False Discovery Rate (FDR-corrected p values are 0.0481 and 0.0070, respectively). Although the effects of the remaining pairs of variants did not survive the FDR correction, they showed marginally significant associations with either UPDRS or the H-Y stage (raw p<0.05). Our results highlight the importance of epistatic effects of multiple genes on the determination of PLOS ONE |

Research paper thumbnail of Orphanin FQ Antagonizes the Inhibition of Ca<SUP>2+</SUP> Currents Induced by µ-Opioid Receptors

Journal of Molecular Neuroscience, 2005

Orphanin FQ (OFQ), an endogenous peptide ligand of opioid receptor-like receptors (ORLs), has pro... more Orphanin FQ (OFQ), an endogenous peptide ligand of opioid receptor-like receptors (ORLs), has properties similar to traditional opioids. This peptide inhibits adenylyl cyclase and voltage-gated calcium channels but stimulates inwardly rectifying potassium channels. Among other actions, however, OFQ also has pharmacological functions that are different from, or even opposite to, those of opioids. For example, OFQ antagonizes the behavioral analgesic effects mediated by κand µ-opioid receptors. In a previous paper, we reported that OFQ antagonizes inhibition of calcium channels mediated by κ-opioid receptors. We report here that OFQ also antagonizes the inhibition of calcium channels mediated by µ-opioid receptor. Further, single-cell RT-PCR reveals that the antagonistic effect of OFQ is correlated with the presence of ORL1 mRNA in individual cells.

Research paper thumbnail of A Coefficient of Determination for Generalized Linear Models

The American Statistician, Oct 2, 2017

The coefficient of determination, a.k.a. R 2 , is well-defined in linear regression models, and m... more The coefficient of determination, a.k.a. R 2 , is well-defined in linear regression models, and measures the proportion of variation in the dependent variable explained by the predictors included in the model. To extend it for generalized linear models, we use the variance function to define the total variation of the dependent variable, as well as the remaining variation of the dependent variable after modeling the predictive effects of the independent variables. Unlike other definitions which demand complete specification of the likelihood function, our definition of R 2 only needs to know the mean and variance functions, so applicable to more general quasi-models. It is consistent with the classical measure of uncertainty using variance, and reduces to the classical definition of the coefficient of determination when linear regression models are considered.

Research paper thumbnail of Generalized Thresholding Estimators for High-Dimensional Location Parameters

Analyzing high-throughput genomic, proteomic, and metabolomic data usually involves estimating hi... more Analyzing high-throughput genomic, proteomic, and metabolomic data usually involves estimating high-dimensional location parameters. Thresholding estimators can significantly improve such estimation when many parameters are zero, i.e., parameters are sparse. Several such estimators have been constructed to be adaptive to parameter sparsity. However, they assume that the underlying parameter spaces are symmetric. Since many applications present asymmetry parameter spaces, we introduce a class of generalized thresholding estimators. A construction of these estimators is developed using a Bayes approach, where an important constraint on the hyperparameters is identified. A generalized empirical Bayes implementation is presented for estimating high-dimensional yet sparse normal means. This implementation provides generalized thresholding estimators which are adaptive to both sparsity and asymmetry of high-dimensional parameters.

Research paper thumbnail of Inferring Gene Regulatory Networks from a Population of Yeast Segregants

Scientific Reports, Feb 4, 2019

Constructing gene regulatory networks is crucial to unraveling the genetic architecture of complex traits and to understanding the mechanisms of diseases. On the basis of gene expression and single nucleotide polymorphism data in the yeast, Saccharomyces cerevisiae, we constructed gene regulatory networks using a two-stage penalized least squares method. A large system of structural equations via optimal prediction of a set of surrogate variables was established at the first stage, followed by consistent selection of regulatory effects at the second stage. Using this approach, we identified subnetworks that were enriched in gene ontology categories, revealing directional regulatory mechanisms controlling these biological pathways. Our mapping and analysis of expression-based quantitative trait loci uncovered a known alteration of gene expression within a biological pathway that results in regulatory effects on companion pathway genes in the phosphocholine network. In addition, we identify nodes in these gene ontology-enriched subnetworks that are coordinately controlled by transcription factors driven by trans-acting expression quantitative trait loci. Altogether, the integration of documented transcription factor regulatory associations with subnetworks defined by a system of structural equations using quantitative trait loci data is an effective means to delineate the transcriptional control of biological pathways.

Gene expression is a fundamental step in the flow of information from an organism's genotype to phenotype. The genetic information encoded in an organism's DNA is transferred into a functional gene product (e.g., protein) via the process of gene expression, and gene expression leads to the formation of the organism's phenotype. Gene expression has been found to be associated with a broad range of complex traits and diseases [1], and thus plays an important role in determining an organism's development. Numerous efforts have been made to map phenotypes to gene expression in order to dissect their genetic basis. Genes rarely act in isolation; instead, they interact with each other and make up gene regulatory networks to function as a whole [2]. The study of this mechanism is crucial for understanding the properties and functions of genes, which helps reveal the genetic architecture of complex traits and diseases. Although genetic experiments can be conducted to discover interactions among genes, this approach can be costly and time consuming. Alternatively, measurements of gene expression levels reveal gene expression patterns in a specific condition and can be exploited to infer gene regulatory networks. Various approaches have been proposed to infer gene regulatory networks using gene expression data, such as relevance networks [3-7], Bayesian networks [8-11], Gaussian graphical models [12-15], and many others. Recent advances in sequencing technologies make it feasible to obtain both whole-genome genotype and gene expression for each individual, i.e., genetical genomics data [16]. Combining genetics with gene expression reveals additional information on genetic structure and holds great promise for improving the accuracy of gene regulatory network inference. Numerous genetical genomics experiments, such as the Genotype-Tissue Expression (GTEx) project [17], have been conducted to collect genetical genomics data. Much effort has been devoted to using genetical genomics data for genome-wide association (GWA) analysis of gene expression, i.e., expression quantitative trait loci (eQTL) mapping [18]. Mapping of eQTL intends to elucidate variation of expression traits attributed to genomic variation, and to identify chromosomal loci (i.e., eQTL)

Research paper thumbnail of Simultaneous genome-wide association studies of anti-cyclic citrullinated peptide in rheumatoid arthritis using penalized orthogonal-components regression

BMC Proceedings, Dec 1, 2009

Genome-wide association analyses between single-nucleotide polymorphisms and clinical traits were conducted simultaneously using penalized orthogonal-components regression. This method was developed to identify the genetic variants controlling phenotypes from a massive number of candidate variants. By investigating the associations between all single-nucleotide polymorphisms and the phenotype of antibodies against cyclic citrullinated peptide, using the rheumatoid arthritis data provided by Genetic Analysis Workshop 16, we identified genetic regions which may contribute to the pathogenesis of rheumatoid arthritis. Bioinformatic analysis of these genomic regions showed that most of them harbor protein-coding gene(s).

Research paper thumbnail of Case-control genome-wide association study of rheumatoid arthritis from Genetic Analysis Workshop 16 using penalized orthogonal-components regression-linear discriminant analysis

BMC Proceedings, Dec 1, 2009

Currently, genome-wide association studies (GWAS) are conducted by collecting a massive number of single-nucleotide polymorphisms (SNPs; i.e., large p) for a relatively small number of individuals (i.e., small n), and associations are made between clinical phenotypes and genetic variation one SNP at a time. Univariate association approaches like this ignore the linkage disequilibrium between SNPs in regions of low recombination. This results in a low reliability of candidate gene identification. Here we propose to improve the case-control GWAS approach by implementing linear discriminant analysis (LDA) through penalized orthogonal-components regression (POCRE), a newly developed variable selection method for large p small n data. The proposed POCRE-LDA method was applied to the Genetic Analysis Workshop 16 case-control data for rheumatoid arthritis (RA). In addition to the two regions on chromosomes 6 and 9 previously associated with RA by GWAS, we identified SNPs on chromosomes 10 and 18 as potential candidates for further investigation.

Research paper thumbnail of Altered metabolite levels and correlations in patients with colorectal cancer and polyps detected using seemingly unrelated regression analysis

Metabolomics, Sep 15, 2017

Introduction: Metabolomics technologies enable the identification of putative biomarkers for numerous diseases; however, the influence of confounding factors on metabolite levels poses a major challenge in moving forward with such metabolites for pre-clinical or clinical applications.

Research paper thumbnail of Large‐scale identification of expression quantitative trait loci in <i>Arabidopsis</i> reveals novel candidate regulators of immune responses and other processes

Journal of Integrative Plant Biology, May 4, 2020

The extensive phenotypic diversity within natural populations of Arabidopsis is associated with differences in gene expression. Transcript levels can be considered as inheritable quantitative traits, and used to map expression quantitative trait loci (eQTL) in genome-wide association studies (GWASs). In order to identify putative genetic determinants for variations in gene expression, we used publicly available genomic and transcript variation data from 665 Arabidopsis accessions and applied the SNP-set (Sequence) Kernel Association Test (SKAT) method for the identification of eQTL. Moreover, we used the penalized orthogonal-components regression (POCRE) method to increase the power of statistical tests. Then, gene annotations were used as test units to identify genes that are associated with natural variations in transcript accumulation, which correspond to candidate regulators, some of which may have a broad impact on gene expression. Besides increasing the chances of identifying real associations, the analysis using POCRE and SKAT significantly reduced the computational cost required to analyze large datasets. As a proof of concept, we used this approach to identify eQTL that represent novel candidate regulators of immune responses. The versatility of this approach allows its

Research paper thumbnail of Multiplicative background correction for spotted microarrays to improve reproducibility

Genetics Research, Jun 1, 2006

Research paper thumbnail of These authors contributed equally to this work

Understanding genome and chromosome evolution is important for understanding genetic inheritance and evolution. Universal events comprising DNA replication, transcription, repair, mobile genetic element transposition, chromosome rearrangements, mitosis, and meiosis underlie inheritance and variation of living organisms. Although the genome of a species as a whole is important, chromosomes are the basic units subjected to genetic events that coin evolution to a large extent. Now that many complete genome sequences are available, we can address evolution and variation of individual chromosomes across species. For example, "How are the repeat and nonrepeat proportions of genetic codes distributed among different chromosomes in a multichromosome species?" "Is there a general rule behind the intuitive observation that chromosome lengths tend to be similar in a species, and if so, can we generalize any findings in chromosome content and size across different taxonomic groups?" Here, we s...

Research paper thumbnail of Variable selection for large

Research paper thumbnail of Differential Analysis of Directed Networks

arXiv (Cornell University), Jul 26, 2018

We developed a novel statistical method to identify structural differences between networks characterized by structural equation models. We propose to reparameterize the model to separate the differential structures from common structures, and then design an algorithm with calibration and construction stages to identify these differential structures. The calibration stage serves to obtain consistent prediction by building the ℓ2 regularized regression of each endogenous variable against pre-screened exogenous variables, correcting for potential endogeneity issues. The construction stage consistently selects and estimates both common and differential effects by undertaking ℓ1 regularized regression of each endogenous variable against the predictions of other endogenous variables as well as its anchoring exogenous variables. Our method allows easy parallel computation at each stage. Theoretical results are obtained to establish non-asymptotic error bounds of predictions and estimates at both stages, as well as the consistency of identified common and differential effects. Our studies on synthetic data demonstrated that our proposed method performed much better than independently constructing the networks. A real data set is analyzed to illustrate the applicability of our method.

Research paper thumbnail of Penalized orthogonal-components regression for large p small n data

Electronic Journal of Statistics, 2009

Here we propose a penalized orthogonal-components regression (POCRE) for large p small n data. Orthogonal components are sequentially constructed to maximize, upon standardization, their correlation to the response residuals. A new penalization framework, implemented via empirical Bayes thresholding, is presented to effectively identify sparse predictors of each component. POCRE is computationally efficient owing to its sequential construction of leading sparse principal components. In addition, such construction offers other properties such as grouping highly correlated predictors and allowing for collinear or nearly collinear predictors. With multivariate responses, POCRE can construct common components and thus build up latent-variable models for large p small n data.

Research paper thumbnail of Internet Accessible

Research paper thumbnail of Research Article (Open Accession)

Understanding genome and chromosome evolution is important for understanding genetic inheritance and evolution. Universal events comprising DNA replication, transcription, repair, mobile genetic element transposition, chromosome rearrangements, mitosis, and meiosis underlie inheritance and variation of living organisms. Although the genome of a species as a whole is important, chromosomes are the basic unit subjected to genetic events that coin evolution to a large extent. Now that many complete genome sequences are available, we can address evolution and variation of individual chromosomes across species. For example, "How are the repeat and nonrepeat proportions of genetic codes distributed among different chromosomes in a multichromosome species?" "Is there a general rule behind the intuitive observation that chromosome lengths tend to be similar in a species, and if so, can we generalize any findings in chromosome content and size across different taxonomic groups?" Here we show that chromosomes within a species do not show dramatic fluctuation in their content of mobile genetic elements as the proliferation of these elements increases from unicellular eukaryotes to vertebrates. Furthermore, we demonstrate that, notwithstanding the remarkable plasticity, there is an upper limit to chromosome size variation in diploid eukaryotes with linear chromosomes. Strikingly, variation in chromosome size for 886 chromosomes in 68 eukaryotic genomes (including 22 human autosomes) can be viably captured by a single model, which predicts that the vast majority of the chromosomes in a species are expected to have a base-pair length between 0.4035 and 1.8626 times the average chromosome length. This conserved boundary of chromosome size variation, which prevails across a wide taxonomic range with few exceptions, indicates that cellular, molecular, and evolutionary mechanisms, possibly together, confine the chromosome lengths around a species-specific average chromosome length.
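
The reported interval is easy to apply to a set of chromosome lengths. The sketch below only uses the two multipliers quoted in the abstract (0.4035 and 1.8626 times the average length); it does not reproduce the underlying model. The function names and the example lengths are hypothetical and are not real genome data.

```python
LOWER_FACTOR = 0.4035  # lower multiplier quoted in the abstract
UPPER_FACTOR = 1.8626  # upper multiplier quoted in the abstract


def size_bounds(lengths):
    """Return the (lower, upper) expected length bounds for one species."""
    mean_length = sum(lengths) / len(lengths)
    return LOWER_FACTOR * mean_length, UPPER_FACTOR * mean_length


def outside_bounds(lengths):
    """List the chromosome lengths that fall outside the expected interval."""
    lower, upper = size_bounds(lengths)
    return [x for x in lengths if x < lower or x > upper]


# Hypothetical chromosome lengths (arbitrary units), for illustration only.
lengths = [100, 120, 80, 30, 170]
print(size_bounds(lengths))
print(outside_bounds(lengths))
```

Because the bounds are relative to the species-specific average, the same check applies unchanged to genomes of very different sizes.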

Research paper thumbnail of Serum metabolomic analysis reveals several novel metabolites in association with excessive alcohol use – an exploratory study

Translational Research, 2021

An appropriate screening tool for excessive alcohol use (EAU) is clinically important as it may help providers encourage early intervention and prevent adverse outcomes. We hypothesized that patients with excessive alcohol use will have distinct serum metabolites when compared to healthy controls. Serum metabolic profiling of 22 healthy controls and 147 patients with a history of EAU was performed. We employed seemingly unrelated regression to identify the unique metabolites and found 67 metabolites (out of 556) which were differentially expressed in patients with EAU. Sixteen metabolites belong to sphingolipid metabolism, 13 belong to phospholipid metabolism, and the remaining 38 were metabolites of 25 different pathways. We also found 93 serum metabolites that were significantly associated with the total quantity of alcohol consumption in the last 30 days. A total of 15 metabolites belong to sphingolipid metabolism, 11 belong to phospholipid metabolism, and 7 metabolites belong to lysolipid. Using a Venn diagram approach, we found the top 10 metabolites that were both differentially expressed in EAU and significantly associated with the quantity of alcohol consumption: sphingomyelin (d18:2/18:1), sphingomyelin (d18:2/21:0, d16:2/23:0), guanosine, S-methylmethionine, 10-undecenoate (11:1n1), sphingomyelin (d18:1/20:1, d18:2/20:0), sphingomyelin (d18:1/17:0, d17:1/18:0, d19:1/16:0), N-acetylasparagine, sphingomyelin (d18:1/19:0, d19:1/18:0), and 1-palmitoyl-2-palmitoleoyl-GPC (16:0/16:1). The diagnostic performance of the top 10 metabolites, using the area under the ROC curve, was significantly higher than that of commonly used markers. We have identified a unique metabolic signature among patients with EAU. Future studies to validate and determine the kinetics of these markers as a function of alcohol consumption are needed.

Research paper thumbnail of Integrating Biological Knowledge Into Case–Control Analysis Through Iterated Conditional Modes/Medians Algorithm

Journal of Computational Biology, 2019

Logistic regression is an effective tool in case-control analysis. With advances in high-throughput technology, the quest for a fast and efficient method of fitting high-dimensional logistic regression has gained much interest. An empirical Bayes model for logistic regression is considered in this article. A spike-and-slab prior is used for variable selection, which plays a vital role in building an effective predictive model while keeping the model interpretable. To increase the power of variable selection, we incorporate biological knowledge through the Ising prior. The iterated conditional modes/medians (ICM/M) algorithm is proposed to fit the logistic model, and it has a computational advantage over Markov chain Monte Carlo (MCMC) algorithms. The implementation of the ICM/M algorithm for both linear and logistic models can be found in the R package icmm, which is freely available on the Comprehensive R Archive Network (CRAN). Simulation studies were carried out to assess the performance of our method, with lasso and adaptive lasso as benchmarks. Overall, the simulation studies show that ICM/M outperforms the others in terms of the number of false positives and has competitive predictive ability. An application to a real data set from a Parkinson's disease study was also carried out for illustration. To identify important variables, our approach provides flexibility to select variables based on local posterior probabilities while controlling the false discovery rate at a desired level, rather than relying only on regression coefficients.

Research paper thumbnail of Advanced Statistical Methods for NMR-Based Metabolomics

NMR-Based Metabolomics, 2019

Despite the increasing popularity and applicability of metabolomics for putative biomarker identification, analysis of the data is challenged by low statistical power resulting from the small sample sizes and large numbers of metabolites and other omics information, as well as confounding demographic and clinical variables. To enhance the statistical power and improve reproducibility of the identified metabolite-based biomarkers, we advocate the use of advanced statistical methods that can simultaneously evaluate the relationship between a group of metabolites and various types of variables including other omics profiles, demographic and clinical data, as well as the complex interactions between them. Accordingly, in this chapter, we describe the method of seemingly unrelated regression that can simultaneously analyze multiple metabolites while controlling the confounding effects of demographic and clinical variables (such as gender, age, BMI, smoking status). We also introduce penalized orthogonal components regression as a screening approach that can handle millions of omics predictors in the model.