Xuekui Zhang | University of Victoria
Papers by Xuekui Zhang
Genome-wide association studies (GWASs) aim to detect genetic risk factors for complex human diseases by identifying disease-associated single-nucleotide polymorphisms (SNPs). The traditional SNP-wise approach, combined with multiple-testing adjustment, is over-conservative and lacks power in many GWASs. In this article, we propose a model-based clustering method that transforms the challenging high-dimension-small-sample-size problem into a low-dimension-large-sample-size problem and borrows information across SNPs by grouping them into three clusters. We pre-specify the patterns of the clusters by the minor allele frequencies of SNPs between cases and controls, and enforce the patterns with prior distributions. In simulation studies, our proposed model outperforms the traditional SNP-wise approach, showing better control of the false discovery rate (FDR) and higher sensitivity. We re-analyzed two real studies to identify SNPs associated with severe bortezomib-induced peripheral neuropathy...
As a future trend of healthcare, personalized medicine tailors medical treatments to individual patients. It requires identifying a subset of patients with the best response to treatment. The subset can be defined by a biomarker (e.g. expression of a gene) and its cutoff value. Topics on subset identification have received massive attention; there are over 2 million hits from keyword searches on Google Scholar. However, how to properly incorporate the identified subsets/biomarkers into the design of clinical trials is not trivial and is rarely discussed in the literature, which leads to a gap between research results and real-world drug development. To fill this gap, we formulate the problem of clinical trial design as an optimization problem involving high-dimensional integration, and propose a novel computational solution based on Monte Carlo and smoothing methods. Our method utilizes modern techniques of General-Purpose computing on Graphics Processing Units for large-scale parallel computing...
Proceedings of the 2021 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
Mining dense subgraphs in which vertices connect closely with each other is a common task when analyzing graphs. A very popular notion in subgraph analysis is core decomposition. Recently, Esfahani et al. presented a probabilistic core decomposition algorithm based on graph peeling and the Central Limit Theorem (CLT) that is capable of handling very large graphs. Their proposed peeling algorithm (PA) starts from the lowest-degree vertices and recursively deletes these vertices, assigning core numbers and updating the degrees of neighbouring vertices, until it reaches the maximum core. However, in many applications, particularly in biology, more valuable information can be obtained from dense sub-communities, and we are not interested in small cores where vertices do not interact much with others. To make the previous PA focus more on dense subgraphs, we propose a multi-stage graph peeling algorithm (M-PA) that adds a two-stage data screening procedure before the previous PA. After removing vertices from the graph based on user-defined thresholds, we can greatly reduce the graph complexity without affecting the vertices in the subgraphs we are interested in. We show that M-PA is more efficient than the previous PA and, with properly set filtering thresholds, can produce dense subgraphs very similar if not identical to those of the previous PA (in terms of graph density and clustering coefficient).
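The deterministic skeleton of the peeling algorithm described above can be sketched as follows. This is a minimal illustration of classic k-core peeling only, not the probabilistic CLT-based version or the M-PA screening stages; the edge list is hypothetical.

```python
from collections import defaultdict

def core_numbers(edges):
    """Deterministic k-core peeling: repeatedly delete a minimum-degree
    vertex, recording the largest degree seen at removal time as its
    core number, and update the degrees of its remaining neighbours."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    deg = {v: len(ns) for v, ns in adj.items()}
    core = {}
    k = 0
    remaining = set(adj)
    while remaining:
        v = min(remaining, key=deg.get)   # lowest-degree vertex first
        k = max(k, deg[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                deg[u] -= 1
    return core

# Triangle plus a pendant vertex: the triangle forms the 2-core.
edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
print(sorted(core_numbers(edges).items()))  # [(1, 2), (2, 2), (3, 2), (4, 1)]
```

M-PA's screening stages would simply discard low-degree vertices before this loop runs, which is why the dense cores it reports are unaffected.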
R topics documented: bam2gr, makeRangedDataOutput, pics, pics-class, picsError-class, picsFDR, picsList-class, plot-FDR, segChrRead, segmentPICS, segReads, segReadsList, segReadsListPE, segReadsPE, setParaEM, setParaPrior, show, summary...
MNase-Seq and ChIP-Seq have evolved as popular techniques to study chromatin and histone modification. Although many tools have been developed to identify enriched regions, software tools for nucleosome positioning are still limited. We introduce a flexible and powerful open-source R package, PING 2.0, for nucleosome positioning using MNase-Seq data, or MNase- or sonicated-ChIP-Seq data, combined with either single-end or paired-end sequencing. PING uses a model-based approach, which enables nucleosome predictions even in the presence of low read counts. We illustrate PING using two paired-end datasets from Saccharomyces cerevisiae and compare its performance with nucleR and ChIPseqR.
Collate setClasses.R setMethods.R PING.R postPING.R segmentPING.R
Statistics in Medicine
As a future trend of healthcare, personalized medicine tailors medical treatments to individual patients. It requires identifying a subset of patients with the best response to treatment. The subset can be defined by a biomarker (e.g. expression of a gene) and its cutoff value. Topics on subset identification have received massive attention; there are over 2 million hits from keyword searches on Google Scholar. However, how to properly incorporate the identified subsets/biomarkers into the design of clinical trials is not trivial and is rarely discussed in the literature, which leads to a gap between research results and real-world drug development. To fill this gap, we formulate the problem of clinical trial design as an optimization problem involving high-dimensional integration, and propose a novel computational solution based on Monte Carlo and smoothing methods. Our method utilizes modern techniques of General-Purpose computing on Graphics Processing Units for large-scale parallel computing. Compared to the standard method in three-dimensional problems, our approach is more accurate and 133 times faster. This advantage grows as dimensionality increases. Our method is scalable to higher-dimensional problems, since the precision bound is a finite number not affected by dimensionality. Our software will be available on GitHub and CRAN and can be applied to guide the design of clinical trials that better incorporate the biomarker. Although our research is motivated by the design of clinical trials, the method can be used widely to solve other optimization problems involving high-dimensional integration.
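The Monte Carlo component of such an approach can be illustrated with a minimal sketch. This is a generic Monte Carlo estimate of an integral over the unit hypercube, not the paper's actual objective function; the integrand and dimension are made up for illustration.

```python
import random

def mc_integral(f, dim, n=200_000, seed=0):
    """Estimate the integral of f over [0,1]^dim by averaging f at n
    uniform random points. The standard error shrinks as O(1/sqrt(n))
    regardless of dim, which is why Monte Carlo remains usable in
    high-dimensional problems where grid quadrature does not."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = [rng.random() for _ in range(dim)]
        total += f(x)
    return total / n

# The integral of sum(x_i) over [0,1]^5 is exactly 5/2.
est = mc_integral(lambda x: sum(x), dim=5)
print(round(est, 2))  # close to 2.5
```

The dimension-free error rate is the informal version of the abstract's claim that the precision bound is not affected by dimensionality; in practice, each point evaluation can also be dispatched to a GPU thread, since the samples are independent.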
Background: Single-cell RNA sequencing (scRNA-seq) yields valuable insights about gene expression and gives critical information about complex tissue cellular composition. In the analysis of single-cell RNA sequencing, the annotation of cell subtypes is often done manually, which is time-consuming and irreproducible. Garnett is a cell-type annotation software based on the elastic net method. Besides cell-type annotation, supervised machine learning methods can also be applied to predict other cell phenotypes from genomic data. Despite the popularity of such applications, no existing study systematically investigates the performance of those supervised algorithms across various sizes of scRNA-seq data sets. Methods and Results: This study evaluates 13 popular supervised machine learning algorithms for classifying cell phenotypes, using published real and simulated data sets with diverse cell sizes. The benchmark contained two parts. In the first part, we used real data sets to as...
Microbiology Research
The classification tree is a widely used machine learning method with multiple implementations as R packages: rpart, ctree, evtree, tree, and C5.0. The details of these implementations differ, and hence their performance varies from one application to another. We are interested in their performance in classifying cells using single-cell RNA-sequencing data. In this paper, we conducted a benchmark study using 22 single-cell RNA-sequencing data sets. Using cross-validation, we compared the packages' prediction performance based on Precision, Recall, F1-score, and Area Under the Curve (AUC). We also compared the complexity and run-time of these R packages. Our study shows that rpart and evtree have the best Precision; evtree is best in Recall, F1-score, and AUC; C5.0 prefers more complex trees; and tree is consistently much faster than the others, although its complexity is often higher.
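The evaluation metrics used in such a benchmark can be computed as in the following minimal sketch (Precision, Recall, and F1 from binary predictions; the label vectors are hypothetical, and the benchmark itself was run in R, not Python):

```python
def precision_recall_f1(y_true, y_pred):
    """Compute Precision = TP/(TP+FP), Recall = TP/(TP+FN), and F1
    (their harmonic mean) for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(precision_recall_f1(y_true, y_pred))
```

In a cross-validated comparison, these scores are computed on each held-out fold and averaged per package, which is what makes the per-metric rankings (rpart/evtree on Precision, evtree on Recall/F1/AUC) comparable across implementations.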
Background: Selecting feature genes to predict phenotypes is one of the typical tasks in analyzing genomics data. Though many general-purpose algorithms have been developed for prediction, dealing with highly correlated genes in the prediction model is still not well addressed. High correlation among genes introduces technical problems, such as multi-collinearity, leading to unreliable prediction models. Furthermore, when a causal gene (whose variants have an actual biological effect on a phenotype) is highly correlated with other genes, most algorithms select the feature gene from the correlated group in a purely data-driven manner. Since the correlation structure among genes can change substantially when conditions change, a prediction model based on incorrectly selected feature genes is unreliable. Therefore, we aim to keep the causal biological signal in the prediction process and build a more robust prediction model. Method: We propose a grouping algorithm, which treats...
arXiv: Methodology, 2020
In longitudinal studies, we observe measurements of the same variables at different time points to track the changes in their pattern over time. In such studies, the scheduling of the data collection waves (i.e. the times of participants' visits) is often pre-determined to accommodate ease of project management and compliance. Hence, it is common to schedule those visits at equally spaced time intervals. However, recent publications based on simulated experiments indicate that the power of studies and the precision of model parameter estimators are related to the participants' visiting schemes. In this paper, we consider longitudinal studies that investigate the changing pattern of a disease outcome (e.g. the accelerated cognitive decline of senior adults). Such studies are often analyzed by the broken-stick model, consisting of two segments of linear models connected at an unknown change-point. We formulate this design problem into a high-dimensional optimization problem and der...
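The broken-stick model mentioned above can be sketched as follows: two linear segments joined continuously at a change-point c, written as y = b0 + b1*t + b2*max(t - c, 0) and fitted by least squares with a grid search over c. This is a generic illustration of the model, not the paper's design-optimization method; the data are made up.

```python
def fit_broken_stick(t, y, candidates):
    """Fit y = b0 + b1*t + b2*max(t - c, 0) by ordinary least squares,
    searching the change-point c over a grid and keeping the fit with
    the smallest residual sum of squares."""
    def ols(X, y):
        # Solve the normal equations (X'X) b = X'y by Gaussian elimination.
        k = len(X[0])
        A = [[sum(X[i][p] * X[i][q] for i in range(len(X))) for q in range(k)]
             for p in range(k)]
        rhs = [sum(X[i][p] * y[i] for i in range(len(X))) for p in range(k)]
        for col in range(k):
            piv = max(range(col, k), key=lambda r: abs(A[r][col]))
            A[col], A[piv] = A[piv], A[col]
            rhs[col], rhs[piv] = rhs[piv], rhs[col]
            for r in range(col + 1, k):
                f = A[r][col] / A[col][col]
                for c2 in range(col, k):
                    A[r][c2] -= f * A[col][c2]
                rhs[r] -= f * rhs[col]
        beta = [0.0] * k
        for r in reversed(range(k)):
            beta[r] = (rhs[r] - sum(A[r][c2] * beta[c2]
                                    for c2 in range(r + 1, k))) / A[r][r]
        return beta

    best = None
    for c in candidates:
        X = [[1.0, ti, max(ti - c, 0.0)] for ti in t]
        beta = ols(X, y)
        rss = sum((yi - (beta[0] + beta[1] * ti + beta[2] * max(ti - c, 0.0))) ** 2
                  for ti, yi in zip(t, y))
        if best is None or rss < best[0]:
            best = (rss, c, beta)
    return best  # (rss, change_point, [b0, b1, b2])

# Noise-free example: flat until t = 5, then declining with slope -2.
t = list(range(11))
y = [0.0 if ti <= 5 else -2.0 * (ti - 5) for ti in t]
rss, c, beta = fit_broken_stick(t, y, candidates=[3, 4, 5, 6, 7])
print(c)  # 5
```

In a design-optimization setting, the question becomes where to place the observation times t so that the change-point and the two slopes are estimated as precisely as possible.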
Canadian Journal of Statistics-revue Canadienne De Statistique, 2020
arXiv: Methodology, 2019
The shape of the relationship between a continuous exposure variable and a binary disease variable is often central to epidemiologic investigations. This paper investigates a number of issues surrounding inference and the shape of the relationship. Presuming that the relationship can be expressed in terms of regression coefficients and a shape parameter, we investigate how well the shape can be inferred in settings that might typify epidemiologic investigations and risk assessment. We also consider a suitable definition of the median effect of exposure, and investigate how precisely this can be inferred. This is done both when using a model that acknowledges uncertainty about the shape parameter and when ignoring this uncertainty and using a two-step method, where in step one we transform the predictor and in step two we fit a simple linear model with the transformed predictor. All these investigations require a family of exposure-disease relationships indexed by a shap...
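The two-step method described above can be sketched as follows. For illustration only, the shape is fixed to a log transform and the outcome is treated as continuous with a simple linear model; the data are hypothetical.

```python
import math

def two_step_fit(x, y, transform=math.log):
    """Step one: transform the predictor with a fixed shape (here log).
    Step two: fit the simple linear model y = a + b * transform(x)
    by ordinary least squares."""
    z = [transform(xi) for xi in x]
    n = len(z)
    zbar = sum(z) / n
    ybar = sum(y) / n
    b = (sum((zi - zbar) * (yi - ybar) for zi, yi in zip(z, y))
         / sum((zi - zbar) ** 2 for zi in z))
    a = ybar - b * zbar
    return a, b

# Data generated exactly as y = 1 + 2*log(x): the fit recovers (1, 2).
x = [1.0, 2.0, 4.0, 8.0]
y = [1 + 2 * math.log(xi) for xi in x]
a, b = two_step_fit(x, y)
print(round(a, 6), round(b, 6))  # 1.0 2.0
```

The paper's point is precisely that fixing the transform in step one, as here, ignores uncertainty about the shape parameter, which a model that treats the shape as unknown would propagate into the inference.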
arXiv: Applications, 2020
Motivation: Selecting feature genes and predicting cells' phenotype are typical tasks in the analysis of scRNA-seq data. Many algorithms have been developed for these tasks, but high correlations among genes create challenges specific to scRNA-seq analysis that are not well addressed. Highly correlated genes lead to unreliable prediction models due to technical problems such as multi-collinearity. Most importantly, when a causal gene (whose variants have a true biological effect on the phenotype) is highly correlated with other genes, most algorithms select one of them in a data-driven manner. The correlation structure among genes can change substantially. Hence, it is critical to build a prediction model based on causal genes. Results: To address the issues discussed above, we propose a grouping algorithm that can be integrated into prediction models. Using real benchmark scRNA-seq data and simulated cell phenotypes, we show our novel method significantly outperforms standa...
Genome-wide association studies (GWASs) aim to detect genetic risk factors for complex human diseases by identifying disease-associated single-nucleotide polymorphisms (SNPs). The SNP-wise approach, the standard method for analyzing GWASs, tests each SNP individually, and the P-values are then adjusted for multiple testing. Multiple-testing adjustment (purely based on p-values) is over-conservative and causes a lack of power in many GWASs, because it insufficiently models the relationships among SNPs. To address this problem, we propose a novel method that borrows information across SNPs by grouping them into three clusters. We pre-specify the patterns of the clusters by the minor allele frequencies of SNPs between cases and controls, and enforce the patterns with prior distributions. Therefore, compared with the traditional approach, it better controls the false discovery rate (FDR) and shows higher sensitivity, which is confirmed by our simulation studies. We re-analyzed real data studies on identifying...
Motivation: Selecting feature genes and predicting cells' phenotype are typical tasks in the analysis of scRNA-seq data. Many algorithms have been developed for these tasks, but high correlations among genes create challenges specific to scRNA-seq analysis that are not well addressed. Highly correlated genes lead to collinearity and unreliable model fitting. Highly correlated genes compete with each other in feature selection, which causes underestimation of their importance. Most importantly, when a causal gene is highly correlated with other genes, most algorithms select one of them in a data-driven manner. The correlation structure among genes can change substantially. Hence, it is critical to build a prediction model based on causal genes and not on their highly correlated neighbours. Results: To address the issues discussed above, we propose a grouping algorithm that can be integrated into prediction models. Using real benchmark scRNA-seq data sets and simulated cell phenotypes, we show o...
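A grouping step of this kind can be sketched minimally as follows. This is a greedy correlation-threshold grouping for illustration only, not the paper's actual algorithm; the gene names and expression values are made up.

```python
def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

def group_correlated(expr, threshold=0.9):
    """Greedily group genes whose absolute correlation with a group's
    first member exceeds the threshold, so a downstream model can treat
    each group as one feature instead of letting collinear genes
    compete for selection."""
    groups = []
    for gene, values in expr.items():
        for g in groups:
            if abs(pearson(expr[g[0]], values)) > threshold:
                g.append(gene)
                break
        else:
            groups.append([gene])
    return groups

expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 4.0, 6.2, 8.1],   # nearly proportional to geneA
    "geneC": [4.0, 1.0, 3.0, 2.0],   # unrelated pattern
}
print(group_correlated(expr))  # [['geneA', 'geneB'], ['geneC']]
```

Keeping correlated genes together in one group is what preserves the causal gene's signal: the model no longer has to pick one representative from the group in a purely data-driven way.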
Abstract: Survival analysis is a technique to identify prognostic biomarkers and genetic vulnerabilities in cancer studies. Large-scale consortium-based projects have profiled >11,000 adult and >4,000 paediatric tumor cases with clinical outcomes and multi-omics approaches. This provides a resource for investigating molecular-level cancer etiologies using clinical correlations. Although cancers often arise from multiple genetic vulnerabilities and have deregulated gene sets (GSs), existing survival analysis protocols can report only on individual genes. Additionally, there is no systematic method to connect clinical outcomes with experimental (cell line) data. To address these gaps, we developed cSurvival (https://tau.cmmt.ubc.ca/cSurvival). cSurvival provides a user-adjustable analytical pipeline with a curated, integrated database, and offers three main advances: (a) joint analysis with two genomic predictors to identify interacting biomarkers, including new algorithms to iden...
Abstract: Background: COVID-19 is a highly transmissible infectious disease that has infected over 122 million individuals worldwide. To combat this pandemic, governments around the world have imposed lockdowns. However, the impact of these lockdowns on the rates of COVID-19 transmission in communities is not well known. Here, we used COVID-19 case counts from 3,000+ counties in the United States (US) to determine the relationship between lockdown, as well as other county factors, and the rate of COVID-19 spread in these communities. Methods: We merged county-specific COVID-19 case counts with US census data and the date of lockdown for each of the counties. We then applied a Functional Principal Component (FPC) analysis to this dataset to generate scores that described the trajectory of COVID-19 spread across the counties. We used machine learning methods to identify important factors in the county, including the date of lockdown, that significantly influenced the FPC scores. Findings: We found ...
Genome-wide association studies (GWASs) aim to detect genetic risk factors for complex human dise... more Genome-wide association studies (GWASs) aim to detect genetic risk factors for complex human diseases by identifying disease-associated single-nucleotide polymorphisms (SNPs). The traditional SNP-wise approach along with multiple testing adjustment is over-conservative and lack of power in many GWASs. In this article, we proposed a model-based clustering method that transforms the challenging high-dimension-small-sample-size problem to low-dimension-large-sample-size problem and borrows information across SNPs by grouping SNPs into three clusters. We pre-specify the patterns of clusters by minor allele frequencies of SNPs between cases and controls, and enforce the patterns with prior distributions. In the simulation studies our proposed novel model outperform traditional SNP-wise approach by showing better controls of false discovery rate (FDR) and higher sensitivity. We re-analyzed two real studies to identifying SNPs associated with severe bortezomib-induced peripheral neuropathy...
As a future trend of healthcare, personalized medicine tailors medical treatments to individual p... more As a future trend of healthcare, personalized medicine tailors medical treatments to individual patients. It requires to identify a subset of patients with the best response to treatment. The subset can be defined by a biomarker (e.g. expression of a gene) and its cutoff value. Topics on subset identification have received massive attention. There are over 2 million hits by keyword searches on Google Scholar. However, how to properly incorporate the identified subsets/biomarkers to design clinical trials is not trivial and rarely discussed in the literature, which leads to a gap between research results and real-world drug development. To fill in this gap, we formulate the problem of clinical trial design into an optimization problem involving high-dimensional integration, and propose a novel computational solution based on Monte-Carlo and smoothing methods. Our method utilizes the modern techniques of General-Purpose computing on Graphics Processing Units for large-scale parallel c...
Proceedings of the 2021 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
Mining dense subgraphs where vertices connect closely with each other is a common task when analy... more Mining dense subgraphs where vertices connect closely with each other is a common task when analyzing graphs. A very popular notion in subgraph analysis is core decomposition. Recently, Esfahani et al. presented a probabilistic core decomposition algorithm based on graph peeling and Central Limit Theorem (CLT) that is capable of handling very large graphs. Their proposed peeling algorithm (PA) starts from the lowest degree vertices and recursively deletes these vertices, assigning core numbers, and updating the degree of neighbour vertices until it reached the maximum core. However, in many applications, particularly in biology, more valuable information can be obtained from dense sub-communities and we are not interested in small cores where vertices do not interact much with others. To make the previous PA focus more on dense subgraphs, we propose a multi-stage graph peeling algorithm (M-PA) that has a two-stage data screening procedure added before the previous PA. After removing vertices from the graph based on the user-defined thresholds, we can reduce the graph complexity largely and without affecting the vertices in subgraphs that we are interested in. We show that M-PA is more efficient than the previous PA and with the properly set filtering threshold, can produce very similar if not identical dense subgraphs to the previous PA (in terms of graph density and clustering coefficient).
R topics documented: bam2gr........................................... 2 makeRangedDataOutput....... more R topics documented: bam2gr........................................... 2 makeRangedDataOutput.................................. 3 pics............................................. 4 pics-class.......................................... 6 picsError-class....................................... 8 picsFDR........................................... 9 picsList-class........................................ 10 plot-FDR.......................................... 12 segChrRead......................................... 13 segmentPICS........................................ 13 segReads.......................................... 15 segReadsList........................................ 15 segReadsListPE....................................... 16 segReadsPE......................................... 17 setParaEM.......................................... 17 1 2 bam2gr setParaPrior......................................... 18 show............................................. 19 summary..........................
MNase-Seq and ChIP-Seq have evolved as popular techniques to study chromatin and histone modifica... more MNase-Seq and ChIP-Seq have evolved as popular techniques to study chromatin and histone modification. Although many tools have been developed to identify enriched regions, software tools for nucleosome positioning are still limited. We introduce a flexible and powerful open-source R package, PING 2.0, for nucleosome positioning using MNase-Seq data or MNase-or sonicated-ChIP-Seq data combined with either single-end or paired-end sequencing. PING uses a model-based approach, which enables nucleosome predictions even in the presence of low read counts. We illustrate PING using two paired-end datasets from Saccharomyces cerevisiae and compare its performance with nucleR and ChIPseqR.
Collate setClasses.R setMethods.R PING.R postPING.R segmentPING.R
Collate setClasses.R setMethods.R PING.R postPING.R segmentPING.R
Statistics in Medicine
As a future trend of healthcare, personalized medicine tailors medical treatments to individual p... more As a future trend of healthcare, personalized medicine tailors medical treatments to individual patients. It requires to identify a subset of patients with the best response to treatment. The subset can be defined by a biomarker (e.g. expression of a gene) and its cutoff value. Topics on subset identification have received massive attention. There are over 2 million hits by keyword searches on Google Scholar. However, how to properly incorporate the identified subsets/biomarkers to design clinical trials is not trivial and rarely discussed in the literature, which leads to a gap between research results and real-world drug development. To fill in this gap, we formulate the problem of clinical trial design into an optimization problem involving high-dimensional integration, and propose a novel computational solution based on Monte-Carlo and smoothing methods. Our method utilizes the modern techniques of General-Purpose computing on Graphics Processing Units for large-scale parallel computing. Compared to the standard method in three-dimensional problems, our approach is more accurate and 133 times faster. This advantage increases when dimensionality increases. Our method is scalable to higher-dimensional problems since the precision bound is a finite number not affected by dimensionality. Our software will be available on GitHub and CRAN, which can be applied to guide the design of clinical trials to incorporate the biomarker better. Although our research is motivated by the design of clinical trials, the method can be used widely to solve other optimization problems involving high-dimensional integration.
Background: Single-cell RNA sequencing (scRNA-seq) yields valuable insights about gene expression... more Background: Single-cell RNA sequencing (scRNA-seq) yields valuable insights about gene expression and gives critical information about complex tissue cellular composition. In the analysis of single-cell RNA sequencing, the annotations of cell subtypes are often done manually, which is time-consuming and irreproducible. Garnett is a cell-type annotation software based the on elastic net method. Beside cell-type annotation, supervised machine learning methods can also be applied to predict other cell phenotypes from genomic data. Despite the popularity of such applications, there is no existing study to systematically investigate the performance of those supervised algorithms in various sizes of scRNA-seq data sets. Methods and Results: This study evaluates 13 popular supervised machine learning algorithms to classify cell phenotypes, using published real and simulated data sets with diverse cell sizes. The benchmark contained two parts. In the first part, we used real data sets to as...
Microbiology Research
Classification tree is a widely used machine learning method. It has multiple implementations as ... more Classification tree is a widely used machine learning method. It has multiple implementations as R packages; rpart, ctree, evtree, tree and C5.0. The details of these implementations are not the same, and hence their performances differ from one application to another. We are interested in their performance in the classification of cells using the single-cell RNA-Sequencing data. In this paper, we conducted a benchmark study using 22 Single-Cell RNA-sequencing data sets. Using cross-validation, we compare packages’ prediction performances based on their Precision, Recall, F1-score, Area Under the Curve (AUC). We also compared the Complexity and Run-time of these R packages. Our study shows that rpart and evtree have the best Precision; evtree is the best in Recall, F1-score and AUC; C5.0 prefers more complex trees; tree is consistently much faster than others, although its complexity is often higher than others.
Background: Selecting feature genes to predict phenotypes is one of the typical tasks in analyzin... more Background: Selecting feature genes to predict phenotypes is one of the typical tasks in analyzing genomics data. Though many general-purpose algorithms were developed for prediction, dealing with highly correlated genes in the prediction model is still not well addressed. High correlation among genes introduces technical problems, such as multi-collinearity issues, leading to unreliable prediction models. Furthermore, when a causal gene (whose variants have an actual biological effect on a phenotype) is highly correlated with other genes, most algorithms select the feature gene from the correlated group in a purely data-driven manner. Since the correlation structure among genes could change substantially when condition changes, the prediction model based on not correctly selected feature genes is unreliable. Therefore, we aim to keep the causal biological signal in the prediction process and build a more robust prediction model. Method: We propose a grouping algorithm, which treats...
arXiv: Methodology, 2020
In longitudinal studies, we observe measurements of the same variables at different time points t... more In longitudinal studies, we observe measurements of the same variables at different time points to track the changes in their pattern over time. In such studies, scheduling of the data collection waves (i.e. time of participants' visits) is often pre-determined to accommodate ease of project management and compliance. Hence, it is common to schedule those visits at equally spaced time intervals. However, recent publications based on simulated experiments indicate that the power of studies and the precision of model parameter estimators is related to the participants' visiting schemes. In this paper, we consider the longitudinal studies that investigate the changing pattern of a disease outcome, (e.g. the accelerated cognitive decline of senior adults). Such studies are often analyzed by the broken-stick model, consisting of two segments of linear models connected at an unknown change-point. We formulate this design problem into a high-dimensional optimization problem and der...
Canadian Journal of Statistics-revue Canadienne De Statistique, 2020
arXiv: Methodology, 2019
The shape of the relationship between a continuous exposure variable and a binary disease variabl... more The shape of the relationship between a continuous exposure variable and a binary disease variable is often central to epidemiologic investigations. This paper investigates a number of issues surrounding inference and the shape of the relationship. Presuming that the relationship can be expressed in terms of regression coefficients and a shape parameter, we investigate how well the shape can be inferred in settings which might typify epidemiologic investigations and risk assessment. We also consider a suitable definition of the median effect of exposure, and investigate how precisely this can be inferred. This is done both in the case of using a model acknowledging uncertainty about the shape parameter and in the case of ignoring this uncertainty and using a two-step method, where in step one we transform the predictor and in step two we fit a simple linear model with transformed predictor. All these investigations require a family of exposure-disease relationships indexed by a shap...
arXiv: Applications, 2020
Motivation: Selecting feature genes and predicting cells' phenotype are typical tasks in the analysis of scRNA-seq data. Many algorithms have been developed for these tasks, but high correlations among genes create challenges specific to scRNA-seq analysis that are not well addressed. Highly correlated genes lead to unreliable prediction models due to technical problems such as multi-collinearity. Most importantly, when a causal gene (whose variants have a true biological effect on the phenotype) is highly correlated with other genes, most algorithms select one of them in a data-driven manner. The correlation structure among genes could change substantially. Hence, it is critical to build a prediction model based on causal genes. Results: To address these issues, we propose a grouping algorithm that can be integrated into prediction models. Using real benchmark scRNA-seq data and simulated cell phenotypes, we show our novel method significantly outperforms standa...
Genome-wide association studies (GWASs) aim to detect genetic risk factors for complex human diseases by identifying disease-associated single-nucleotide polymorphisms (SNPs). The SNP-wise approach, the standard method for analyzing GWAS data, tests each SNP individually and then adjusts the P-values for multiple testing. Multiple testing adjustment (based purely on p-values) is over-conservative and causes a lack of power in many GWASs, because it insufficiently models the relationships among SNPs. To address this problem, we propose a novel method that borrows information across SNPs by grouping SNPs into three clusters. We pre-specify the patterns of the clusters by the minor allele frequencies of SNPs between cases and controls, and enforce the patterns with prior distributions. Compared with the traditional approach, our method better controls the false discovery rate (FDR) and shows higher sensitivity, which is confirmed by our simulation studies. We re-analyzed real data studies on identifying...
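The SNP-wise baseline this abstract contrasts against (a per-SNP test followed by multiple-testing adjustment) can be sketched with NumPy. A two-proportion z-test on allele counts stands in for the per-SNP test, and Benjamini-Hochberg for the adjustment; the counts and effect sizes are simulated, not from the paper's data:

```python
import math
import numpy as np

def snp_wise_pvalues(case_alt, ctrl_alt, n_case, n_ctrl):
    """Per-SNP two-sided two-proportion z-test on minor-allele counts
    (a large-sample stand-in for the per-SNP association test)."""
    p1 = case_alt / (2 * n_case)
    p2 = ctrl_alt / (2 * n_ctrl)
    pool = (case_alt + ctrl_alt) / (2 * n_case + 2 * n_ctrl)
    se = np.sqrt(pool * (1 - pool) * (1 / (2 * n_case) + 1 / (2 * n_ctrl)))
    z = (p1 - p2) / se
    return np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])

def benjamini_hochberg(p, q=0.05):
    """Boolean mask of discoveries at FDR level q."""
    m = len(p)
    order = np.argsort(p)
    passed = np.nonzero(p[order] <= q * np.arange(1, m + 1) / m)[0]
    k = passed.max() + 1 if passed.size else 0
    mask = np.zeros(m, dtype=bool)
    mask[order[:k]] = True
    return mask

# 100 SNPs, 500 cases and 500 controls; the first 5 SNPs are truly associated
rng = np.random.default_rng(2)
m, n = 100, 500
maf_ctrl = np.full(m, 0.2)
maf_case = maf_ctrl.copy()
maf_case[:5] = 0.35
cases = rng.binomial(2 * n, maf_case)
ctrls = rng.binomial(2 * n, maf_ctrl)
p = snp_wise_pvalues(cases, ctrls, n, n)
hits = benjamini_hochberg(p, q=0.05)
```

The paper's clustering method replaces the purely p-value-based adjustment in `benjamini_hochberg` with priors that share information across SNPs.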
Motivation: Selecting feature genes and predicting cells’ phenotype are typical tasks in the analysis of scRNA-seq data. Many algorithms have been developed for these tasks, but high correlations among genes create challenges specific to scRNA-seq analysis that are not well addressed. Highly correlated genes lead to collinearity and unreliable model fitting. Highly correlated genes compete with each other in feature selection, which causes underestimation of their importance. Most importantly, when a causal gene is highly correlated with other genes, most algorithms select one of them in a data-driven manner. The correlation structure among genes could change substantially. Hence, it is critical to build a prediction model based on causal genes rather than on their highly correlated genes. Results: To address these issues, we propose a grouping algorithm that can be integrated into prediction models. Using real benchmark scRNA-seq data sets and simulated cell phenotypes, we show o...
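The abstract does not specify the grouping algorithm, but the general idea of grouping highly correlated genes before feature selection can be illustrated with a simple greedy stand-in (the threshold rule and all data below are illustrative assumptions, not the paper's method):

```python
import numpy as np

def group_correlated_genes(expr, threshold=0.8):
    """Greedy grouping: a gene joins the first existing group whose seed
    gene it correlates with (in absolute value) above `threshold`;
    otherwise it seeds a new group."""
    corr = np.abs(np.corrcoef(expr, rowvar=False))
    n_genes = expr.shape[1]
    group_of = -np.ones(n_genes, dtype=int)
    seeds = []
    for g in range(n_genes):
        for k, s in enumerate(seeds):
            if corr[g, s] >= threshold:
                group_of[g] = k
                break
        else:
            seeds.append(g)
            group_of[g] = len(seeds) - 1
    return group_of

# toy expression matrix: 9 genes driven by 3 latent signals (3 genes each)
rng = np.random.default_rng(3)
n_cells = 300
base = rng.normal(size=(n_cells, 3))
expr = np.repeat(base, 3, axis=1) + 0.1 * rng.normal(size=(n_cells, 9))
groups = group_correlated_genes(expr, threshold=0.8)
```

A downstream prediction model would then treat each group as one unit, so a causal gene cannot be silently displaced by a correlated neighbour during feature selection.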
ABSTRACT: Survival analysis is a technique for identifying prognostic biomarkers and genetic vulnerabilities in cancer studies. Large-scale consortium-based projects have profiled >11,000 adult and >4,000 paediatric tumor cases with clinical outcomes and multi-omics approaches. This provides a resource for investigating molecular-level cancer etiologies using clinical correlations. Although cancers often arise from multiple genetic vulnerabilities and have deregulated gene sets (GSs), existing survival analysis protocols can report only on individual genes. Additionally, there is no systematic method to connect clinical outcomes with experimental (cell line) data. To address these gaps, we developed cSurvival (https://tau.cmmt.ubc.ca/cSurvival). cSurvival provides a user-adjustable analytical pipeline with a curated, integrated database, and offers three main advances: (a) joint analysis with two genomic predictors to identify interacting biomarkers, including new algorithms to iden...
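cSurvival's web pipeline cannot be reproduced here, but the survival estimate underlying this kind of biomarker comparison, the Kaplan-Meier curve, can be sketched with NumPy. The two-group scenario (high vs low expression of some biomarker) is hypothetical and, for brevity, ignores censoring:

```python
import numpy as np

def kaplan_meier(time, event):
    """Kaplan-Meier survival estimate S(t) at each distinct event time.
    `event` is 1 for an observed failure, 0 for a censored observation."""
    order = np.argsort(time)
    time, event = time[order], event[order]
    uniq = np.unique(time[event == 1])
    surv, s = [], 1.0
    for t in uniq:
        at_risk = np.sum(time >= t)
        deaths = np.sum((time == t) & (event == 1))
        s *= 1.0 - deaths / at_risk
        surv.append(s)
    return uniq, np.array(surv)

# hypothetical cohorts: high-expression patients fail earlier on average
rng = np.random.default_rng(5)
t_high = rng.exponential(1.0, 150)
t_low = rng.exponential(2.0, 150)
event = np.ones(150, dtype=int)   # no censoring in this sketch
u_high, s_high = kaplan_meier(t_high, event)
u_low, s_low = kaplan_meier(t_low, event)
```

Comparing where each curve crosses 0.5 gives the estimated median survival per group; a tool like cSurvival additionally searches biomarker cutoffs and tests group differences formally.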
ABSTRACT: Background: COVID-19 is a highly transmissible infectious disease that has infected over 122 million individuals worldwide. To combat this pandemic, governments around the world have imposed lockdowns. However, the impact of these lockdowns on the rates of COVID-19 transmission in communities is not well known. Here, we used COVID-19 case counts from 3,000+ counties in the United States (US) to determine the relationship between lockdown, as well as other county factors, and the rate of COVID-19 spread in these communities. Methods: We merged county-specific COVID-19 case counts with US census data and the date of lockdown for each of the counties. We then applied a Functional Principal Component (FPC) analysis to this dataset to generate scores that described the trajectory of COVID-19 spread across the counties. We used machine learning methods to identify important factors in the county, including the date of lockdown, that significantly influenced the FPC scores. Findings: We found ...
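When trajectories are observed on a dense common time grid, the FPC analysis used in this study reduces to ordinary PCA on the centred trajectory matrix. A sketch of that reduction on simulated county-like curves (the growth model and all numbers are illustrative, not the study's data):

```python
import numpy as np

def fpca_scores(trajectories, n_components=2):
    """FPCA on a dense common grid via SVD of the centred matrix
    (rows = counties, columns = time points)."""
    centred = trajectories - trajectories.mean(axis=0)
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    components = vt[:n_components]        # eigenfunctions sampled on the grid
    scores = centred @ components.T       # per-county FPC scores
    explained = s[:n_components] ** 2 / np.sum(s ** 2)
    return scores, components, explained

# toy case-count curves: counties differ mainly in exponential growth rate
rng = np.random.default_rng(4)
t = np.linspace(0, 1, 50)
rates = rng.uniform(1.0, 3.0, 200)        # 200 hypothetical counties
curves = np.exp(np.outer(rates, t)) + 0.05 * rng.normal(size=(200, 50))
scores, comps, explained = fpca_scores(curves, n_components=2)
```

Each county's trajectory is summarized by a few scores, and those scores (rather than the raw curves) become the response that downstream models relate to lockdown dates and census covariates.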