Hufeng Zhou | Harvard University
Papers by Hufeng Zhou
MACIE (Multi-dimensional Annotation Class Integrative Estimation) is an unsupervised multivariate mixed model framework to assess multi-dimensional functional impacts for both coding and non-coding variants in the human genome. MACIE integrates a variety of functional annotations, including protein function scores, evolutionary conservation scores, and epigenetic annotations from ENCODE and Roadmap Epigenomics, and estimates the joint posterior probabilities of each genetic variant being functional. For each non-coding and synonymous coding variant, the MACIE score is a vector of length 4, representing the estimated joint posterior probabilities of "not evolutionarily conserved and regulatory functional" (MACIE01); "evolutionarily conserved and not regulatory functional" (MACIE10); "not evolutionarily conserved and not regulatory functional" (MACIE00); and "both evolutionarily conserved and regulatory functional" (MACIE11). MACIE_conserved is the estima...
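To illustrate how a length-4 joint posterior vector of this kind can be read, the sketch below sums joint classes into marginal probabilities. The function name, the dictionary keys, and the example numbers are hypothetical. The sums are ordinary marginalization of a joint distribution, not a definition taken from the MACIE software, and the truncated definition of MACIE_conserved above is deliberately not filled in.

```python
def marginal_probabilities(macie00, macie01, macie10, macie11):
    """Marginalize a joint posterior over (conserved, regulatory) classes.

    Index convention follows the abstract: first digit = evolutionarily
    conserved, second digit = regulatory functional.
    """
    total = macie00 + macie01 + macie10 + macie11
    if abs(total - 1.0) > 1e-6:
        raise ValueError("joint posterior probabilities must sum to 1")
    return {
        # P(conserved) = P(conserved, not regulatory) + P(conserved, regulatory)
        "conserved": macie10 + macie11,
        # P(regulatory) = P(not conserved, regulatory) + P(conserved, regulatory)
        "regulatory": macie01 + macie11,
        # P(functional in at least one class) = 1 - P(neither)
        "any_functional": 1.0 - macie00,
    }
```

For example, `marginal_probabilities(0.70, 0.10, 0.15, 0.05)` treats a variant whose posterior mass sits mostly in the "neither" class; the returned values are the two class-level marginals and the probability of belonging to at least one functional class.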
Nature Genetics, 2022
Analyses of data from genome-wide association studies on unrelated individuals have shown that, for human traits and diseases, approximately one-third to two-thirds of heritability is captured by common SNPs. However, it is not known whether the remaining heritability is due to the imperfect tagging of causal variants by common SNPs, in particular whether the causal variants are rare, or whether it is overestimated due to bias in inference from pedigree data. Here we estimated heritability for height and body mass index (BMI) from whole-genome sequence data on 25,465 unrelated individuals of European ancestry. The estimated heritability was 0.68 (standard error 0.10) for height and 0.30 (standard error 0.10) for body mass index. Low minor allele frequency variants in low linkage disequilibrium (LD) with neighboring variants were enriched for heritability, to a greater extent for protein-altering variants, consistent with negative selection. Our results imply that rare variants, in particular those in regions of low linkage disequilibrium, are a major source of the still missing heritability of complex traits and disease.
Journal of Virology, 2020
EBV is associated with ∼200,000 cancers each year. In vitro, EBV can transform primary human B lymphocytes into immortalized cell lines. EBV-encoded proteins, along with noncoding RNAs and microRNAs, hijack cellular proteins and pathways to control cell growth. EBV nuclear proteins usurp normal transcriptional programs to activate the expression of key oncogenes, including MYC, to provide a proliferation signal. EBV nuclear antigens also repress CDKN2A to suppress senescence. EBV membrane protein activates NF-κB to provide survival signals. EBV genomes are maintained by EBNA1, which tethers EBV episomes to the host chromosomes during mitosis. However, little is known about where EBV episomes are located in interphase cells. In interphase cells, EBV promoters drive the expression of latency genes, while oriP functions as an enhancer for these promoters. In this study, integrative analyses of published lymphoblastoid cell line (LCL) Hi-C data and our 4C-seq experiments position EBV e...
EBV transcription factors and NF-κB subunits converge into EBV super-enhancers. MYC and BCL2 expression is driven by EBV super-enhancers. EBV super-enhancers are co-occupied by B cell transcription factors and cofactors. EBV super-enhancers are sensitive to perturbations.
First and foremost, I would like to express my immense gratitude to my supervisor, Professor Limsoon Wong. He helped me successfully make the transition from being an experimental biologist to becoming a competent computational biologist and initiated my academic journey. Over the past few years, I have benefited tremendously from his excellent guidance, persistent support, and invaluable advice. Working with him was extremely pleasant. I have learnt a lot from him in many aspects of doing research. His enthusiasm, dedication and preciseness have deeply influenced me. I want to thank my family. I am deeply indebted to my parents Hongcao Zhou and Lifang Hu for their unconditional love, understanding and support. Their love and support are the source of motivation and happiness in my life.
Attempts to identify and prioritize functional DNA elements in coding and noncoding regions, particularly through use of in silico functional annotation data, continue to increase in popularity. However, specific functional roles may vary widely from one variant to another, making it challenging to summarize different aspects of variant function. Here we propose Multi-dimensional Annotation Class Integrative Estimation (MACIE), an unsupervised multivariate mixed model framework capable of integrating annotations of diverse origin to assess multi-dimensional functional roles for both coding and noncoding variants. Unlike existing one-dimensional scoring methods, MACIE views variant functionality as a composite attribute encompassing multiple characteristics, and estimates the joint posterior functional probability vector of each genomic position, a quantity that offers richer and more interpretable information in the presence of multiple aspects of functionality. Applied to a variety ...
bioRxiv, 2021
We developed a computationally efficient method, Ancestral Frequency estimation in Admixed populations (AFA), to estimate the frequencies of bi-allelic variants in admixed populations with an unlimited number of ancestries. AFA uses maximum likelihood estimation by modeling the conditional probability of having an allele given proportions of genetic ancestries. It can be applied using either global or local proportions of genetic ancestries. Simulations mimicking admixture demonstrated the high accuracy of the method. We implemented the method on data from the Hispanic Community Health Study/Study of Latinos (HCHS/SOL), an admixed population with three predominant continental ancestries: Amerindian, European, and African. Comparison of the European and African estimated frequencies to the respective gnomAD frequencies demonstrated high correlations, with Pearson R² = 0.97-0.99. We provide a genome-wide dataset of the estimated three ancestral allele frequencies in HCHS/SOL for all ava...
Large-scale whole-genome sequencing studies have enabled analysis of the associations of noncoding rare variants (RVs) with complex human traits. Variant set analysis is a powerful approach to study RV association, and a key component of it is constructing RV sets for analysis. However, existing methods have limited ability to define analysis units in the noncoding genome. Furthermore, there is a lack of robust pipelines for comprehensive and scalable noncoding RV association analysis. Here we propose a computationally efficient noncoding RV association-detection framework that uses STAAR (variant-set test for association using annotation information) to group noncoding variants in gene-centric analysis based on functional categories. We also propose SCANG (scan the genome)-STAAR, which uses dynamic window sizes and incorporates multiple functional annotations, in a non-gene-centric analysis. We furthermore develop STAARpipeline to perform flexible noncoding RV association analysis, inclu...
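The idea of estimating ancestry-specific allele frequencies by maximum likelihood, as described in the AFA abstract, can be sketched as follows. This is a minimal illustration under stated assumptions, not the AFA implementation: it assumes each of an individual's two allele copies carries the alternate allele with probability equal to the ancestry-proportion-weighted sum of ancestral frequencies, and it swaps in a simple EM iteration as the optimizer. The function name and arguments are hypothetical.

```python
def estimate_ancestral_frequencies(genotypes, proportions, n_iter=200, eps=1e-9):
    """EM estimate of per-ancestry alternate-allele frequencies.

    genotypes   : list of alternate-allele counts (0, 1, or 2) per individual
    proportions : list of per-individual ancestry proportions (each sums to 1)
    Assumed model: each allele copy is alternate with probability
    q_i = sum_k proportions[i][k] * freqs[k].
    """
    n_anc = len(proportions[0])
    freqs = [0.5] * n_anc  # neutral starting point
    for _ in range(n_iter):
        alt = [0.0] * n_anc  # expected alternate copies assigned to ancestry k
        ref = [0.0] * n_anc  # expected reference copies assigned to ancestry k
        for g, p in zip(genotypes, proportions):
            q = sum(pk * fk for pk, fk in zip(p, freqs))
            q = min(max(q, eps), 1.0 - eps)  # guard against division by zero
            for k in range(n_anc):
                # E-step: posterior share of each allele copy from ancestry k
                alt[k] += g * p[k] * freqs[k] / q
                ref[k] += (2 - g) * p[k] * (1.0 - freqs[k]) / (1.0 - q)
        # M-step: frequency = expected alternate copies / expected total copies
        freqs = [
            min(max(a / (a + r), eps), 1.0 - eps) if a + r > 0 else f
            for a, r, f in zip(alt, ref, freqs)
        ]
    return freqs
```

Passing global ancestry proportions gives one estimate per ancestry for the variant; per-locus (local) proportions could be supplied in the same argument. Covariates, relatedness, and uncertainty in the ancestry proportions are ignored in this sketch.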
International Journal of Obesity, 2021
Background/objectives: Neck circumference, an index of upper airway fat, has been suggested to be an important measure of body-fat distribution with unique associations with health outcomes such as obstructive sleep apnea and metabolic disease. This study aims to investigate the genetic bases of neck circumference. Methods: We conducted a multi-ethnic genome-wide association study of neck circumference, adjusted and unadjusted for BMI, in up to 15,090 European Ancestry (EA) and African American (AA) individuals. Because sexually dimorphic associations have been observed for anthropometric traits, we conducted both sex-combined and sex-specific analyses. Results: We identified rs227724 near the Noggin (NOG) gene as a possible quantitative locus for neck circumference in men (N = 8831, P = 1.74 × 10⁻⁹) but not in women (P = 0.08). The association was replicated in men (N = 1554, P = 0.045) in an independent dataset. This locus was previously reported to be associated with human height and with...
Nature Genetics, 2020
An increasing number of whole-genome/exome sequencing (WGS/WES) studies are being conducted to investigate the genetic bases of human diseases and traits, including the Trans-Omics for Precision Medicine Program (TOPMed) of the National Heart, Lung, and Blood Institute and the Genome Sequencing Program (GSP) of the National Human Genome Research Institute. Such studies enable the assessment of associations between complex traits and both coding and noncoding RVs (minor allele frequency (MAF) < 1%) across the genome. However, single-variant analyses typically have low power to identify associations with RVs [1-3]. To improve power, variant set tests have been proposed to jointly test the effects of given sets of multiple RVs. These methods include the burden test [4-7], sequence kernel association test (SKAT) [8] and their various combinations [9-12]. In parallel, external biological information provided by functional annotations, such as conservation scores and predicted enhancer status, has been successfully used to prioritize plausibly causal common variants in fine-mapping studies, partitioning heritability in GWAS and predicting genetic risk [13-17]. It is of substantial interest to incorporate variant functional annotations effectively to boost the power of RV analysis of WGS association studies [18,19].
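The variant set idea mentioned above can be illustrated with a simplified burden test. The sketch collapses the rare variants in a set into one weighted score per individual and then applies a score test for a quantitative trait under an intercept-only null model, with a normal approximation for the p-value. It is an assumption-laden toy, not STAAR or any published implementation: the function names are hypothetical, there are no covariates or relatedness adjustments, and the per-variant weights are supplied by the caller (this is where functional annotations could enter).

```python
import math


def burden_scores(genotype_rows, weights):
    """Collapse each individual's rare-variant genotypes into one weighted sum."""
    return [sum(w * g for w, g in zip(weights, row)) for row in genotype_rows]


def burden_score_test(genotype_rows, weights, trait):
    """Score test of trait ~ burden under an intercept-only null model.

    Returns (z, p): the standardized score statistic and a two-sided p-value
    from the normal approximation.
    """
    burden = burden_scores(genotype_rows, weights)
    n = len(trait)
    y_mean = sum(trait) / n
    b_mean = sum(burden) / n
    # Score: covariance-like sum between burden and null-model residuals
    score = sum(b * (y - y_mean) for b, y in zip(burden, trait))
    # Residual variance under the null (sample variance of the trait)
    sigma2 = sum((y - y_mean) ** 2 for y in trait) / (n - 1)
    var_score = sigma2 * sum((b - b_mean) ** 2 for b in burden)
    if var_score == 0:
        raise ValueError("no variation in trait or burden score")
    z = score / math.sqrt(var_score)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail area
    return z, p
```

A burden test of this kind has most power when the variants in the set act in the same direction; variance-component tests such as SKAT are designed for sets with mixed effect directions.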
Genetic Epidemiology, 2020
Clinical trial results have recently demonstrated that inhibiting inflammation by targeting the interleukin-1β pathway can offer a significant reduction in lung cancer incidence and mortality, highlighting a pressing and unmet need to understand the benefits of inflammation-focused lung cancer therapies at the genetic level. While numerous genome-wide association studies (GWAS) have explored the genetic etiology of lung cancer, there remains a large gap between the type of information that may be gleaned from an association study and the depth of understanding necessary to explain and drive translational findings. Thus, in this work we jointly model and integrate extensive multi-omics data sources, utilizing a total of 40 genome-wide functional annotations that augment previously published results from the International Lung Cancer Consortium (ILCCO) GWAS, to prioritize and characterize single nucleotide polymorphisms (SNPs) that increase risk of squamous cell lung cancer through the inflammatory and immune responses. Our work bridges the gap between correlative analysis and translational follow-up research, refining GWAS association measures in an interpretable and systematic manner. In particular, re-analysis of the ILCCO data highlights the impact of highly associated SNPs from nuclear factor-κB signaling pathway genes as well as major histocompatibility complex-mediated variation in immune responses. One consequence of prioritizing likely functional SNPs is the pruning of variants that might be selected for follow-up work by over an order of magnitude, from potentially tens of thousands to hundreds. The strategies we introduce provide informative and interpretable approaches for incorporating extensive genome-wide annotation data in analysis of genetic association studies.
Cancer Research, 2020
Genome-wide association studies (GWAS) have revealed genetic susceptibility factors for lung cancer, highlighting the role of smoking, family history, and DNA damage repair genes in disease etiology. Many studies have focused on European populations; however, lung cancer is a leading cause of cancer incidence and mortality around the world. Previous GWAS have focused on single-population analyses to exclude confounding effects such as systematic allele frequency differences between populations. Another efficient tool for GWAS of complex genetic diseases and traits is meta-analysis, which provides a practical strategy for detecting genetic variants with modest effect sizes. This study aimed to identify novel genetic susceptibility loci in a large, multiethnic GWAS of lung cancer. The HRC imputation of lung GWAS was carried out on the Sanger Imputation Server. The imputed GWAS with 34,429 cases and 35,732 controls from OncoArray lung cancer GWAS ...
De novo mutations (DNMs), or mutations that appear in an individual despite not being seen in their parents, are an important source of genetic variation whose impact is relevant to studies of human evolution, genetics, and disease. Utilizing high-coverage whole genome sequencing data as part of the Trans-Omics for Precision Medicine (TOPMed) program, we directly estimate and analyze DNM counts, rates, and spectra from 1,465 trios across an array of diverse human populations. Using the resulting call set of 86,865 single nucleotide DNMs, we find a significant positive correlation between local recombination rate and local DNM rate, which together can explain up to 35.5% of the genome-wide variation in population level rare genetic variation from 41K unrelated TOPMed samples. While genome-wide heterozygosity does correlate weakly with DNM count, we do not find significant differences in DNM rate between individuals of European, African, and Latino ancestry, nor across ancestrally dist...
Whole genome sequencing (WGS) studies are being widely conducted to identify rare variants associated with human diseases and disease-related traits. Classical single-marker association analyses for rare variants have limited power, and variant-set based analyses are commonly used to analyze rare variants. However, existing variant-set based approaches need to pre-specify genetic regions for analysis, and hence are not directly applicable to WGS data due to the large number of intergenic and intron regions that consist of a massive number of non-coding variants. The commonly used sliding window method requires pre-specifying fixed window sizes, which are often unknown a priori, are difficult to specify in practice and are subject to limitations given that genetic association region sizes are likely to vary across the genome and phenotypes. We propose a computationally efficient and dynamic scan statistic method (Scan the Genome (SCANG)) for analyzing WGS data that flexibly detects the...
Developmental and Comparative Immunology, 2018
Haemophilus parasuis, an important swine pathogen, was recently shown to be able to invade endothelial or epithelial cells in vitro. NOD1/2 are specialized NLRs that participate in the recognition of pathogens able to invade intracellularly; therefore, we assessed the contribution of NOD1/2 to inflammatory responses during H. parasuis infection. We observed that H. parasuis infection enhanced NOD2 expression and RIP2 phosphorylation in porcine kidney 15 cells. Our results also showed that knockdown of NOD1/2 or RIP2 expression significantly decreased H. parasuis-induced NF-κB activity, while the phosphorylation levels of p38, JNK or ERK were not changed. Moreover, real-time PCR results showed that NOD1, NOD2 and RIP2 were each involved in the expression of CCL4, CCL5 and IL-8. Inhibition of NOD1 and NOD2 significantly reduced CCL5 promoter activity, even more effectively than inhibition of TLR.
Cell Host & Microbe, Jan 11, 2017
Epstein-Barr virus (EBV) transforms B cells to continuously proliferating lymphoblastoid cell lines (LCLs), which represent an experimental model for EBV-associated cancers. EBV nuclear antigens (EBNAs) and LMP1 are EBV transcriptional regulators that are essential for LCL establishment, proliferation, and survival. Starting with the 3D genome organization map of LCL, we constructed a comprehensive EBV regulome encompassing 1,992 viral/cellular genes and enhancers. Approximately 30% of genes essential for LCL growth were linked to EBV enhancers. Deleting EBNA2 sites significantly reduced their target gene expression. Additional EBV super-enhancer (ESE) targets included MCL1, IRF4, and EBF. MYC ESE looping to the transcriptional start site of MYC was dependent on EBNAs. Deleting MYC ESEs greatly reduced MYC expression and LCL growth. EBNA3A/3C altered CDKN2A/B spatial organization to suppress senescence. EZH2 inhibition decreased the looping at the CDKN2A/B loci and reduced LCL growth...
Proceedings of the National Academy of Sciences, 2017
Significance: Epigenetic alterations in nasopharyngeal carcinoma (NPC) are very frequent at the DNA level. Histone modifications are frequently altered in cancers. Because histone modifications are reversible, histone-modifying enzymes or other epigenetic regulators are ideal therapeutic targets, and drugs targeting these enzymes have been proven effective in cancer treatment. Understanding the NPC histone code provides unique insights into NPC pathogenesis and will likely contribute to the identification of unique therapeutics. Using genome-wide analyses of histone modifications, we generated an NPC epigenetic landscape and identified a key oncogene whose expression correlated with patient overall survival, suggesting that epigenetic profiling can effectively identify key oncogenic pathways. These studies provide proof-of-concept strategies for further characterization of the NPC epigenome on a larger scale.
MACIE (Multi-dimensional Annotation Class Integrative Estimation) is an unsupervised multivariate... more MACIE (Multi-dimensional Annotation Class Integrative Estimation) is an unsupervised multivariate mixed model framework to assess multi-dimensional functional impacts for both coding and non-coding variants in the human genome. MACIE integrates a variety of functional annotations, including protein function scores, evolutionary conservation scores, and epigenetic annotations from ENCODE and Roadmap Epigenomics, and estimates the joint posterior probabilities of each genetic variant being functional. For each non-coding and synonymous coding variant, the MACIE score is a vector of length 4, representing the estimated joint posterior probabilities of "not evolutionarily conserved and regulatory functional" (MACIE01); "evolutionarily conserved and not regulatory functional" (MACIE10); "not evolutionarily conserved and not regulatory functional" (MACIE00); "both evolutionarily conserved and regulatory functional (MACIE11). MACIE_conserved is the estima...
Nature Genetics, 2022
Analyses of data from genome-wide association studies on unrelated individuals have shown that, f... more Analyses of data from genome-wide association studies on unrelated individuals have shown that, for human traits and diseases, approximately one-third to two-thirds of heritability is captured by common SNPs. However, it is not known whether the remaining heritability is due to the imperfect tagging of causal variants by common SNPs, in particular whether the causal variants are rare, or whether it is overestimated due to bias in inference from pedigree data. Here we estimated heritability for height and body mass index (BMI) from whole-genome sequence data on 25,465 unrelated individuals of European ancestry. The estimated heritability was 0.68 (standard error 0.10) for height and 0.30 (standard error 0.10) for body mass index. Low minor allele frequency variants in low linkage disequilibrium (LD) with neighboring variants were enriched for heritability, to a greater extent for protein-altering variants, consistent with negative selection. Our results imply that rare variants, in particular those in regions of low linkage disequilibrium, are a major source of the still missing heritability of complex traits and disease.
Journal of Virology, 2020
EBV is associated with ∼200,000 cancers each year. In vitro , EBV can transform primary human B l... more EBV is associated with ∼200,000 cancers each year. In vitro , EBV can transform primary human B lymphocytes into immortalized cell lines. EBV-encoded proteins, along with noncoding RNAs and microRNAs, hijack cellular proteins and pathways to control cell growth. EBV nuclear proteins usurp normal transcriptional programs to activate the expression of key oncogenes, including MYC, to provide a proliferation signal. EBV nuclear antigens also repress CDKN2A to suppress senescence. EBV membrane protein activates NF-κB to provide survival signals. EBV genomes are maintained by EBNA1, which tethers EBV episomes to the host chromosomes during mitosis. However, little is known about where EBV episomes are located in interphase cells. In interphase cells, EBV promoters drive the expression of latency genes, while oriP functions as an enhancer for these promoters. In this study, integrative analyses of published lymphoblastoid cell line (LCL) Hi-C data and our 4C-seq experiments position EBV e...
d EBV transcription factors and NF-kB subunits converge into EBV super-enhancers d MYC and BCL2 e... more d EBV transcription factors and NF-kB subunits converge into EBV super-enhancers d MYC and BCL2 expression is driven by EBV super-enhancers d EBV super-enhancers are co-occupied by B cell transcription factors and cofactors d EBV super-enhancers are sensitive to perturbations
First and foremost, I would like to express my immense gratitude to my supervisor Professor Limso... more First and foremost, I would like to express my immense gratitude to my supervisor Professor Limsoon Wong. He helped me successfully make the transition from being an experimental biologist to become a competent computational biologist and initiated my academic journey. Over the past few years, I have benefited tremendously from his excellent guidance, persistent support, and invaluable advice. Working with him was extremely pleasant. I have learnt a lot from him in many aspects of doing research. His enthusiasm, dedication and preciseness have deeply influenced me. I want to thank my family. I am deeply indebted to my parents Hongcao Zhou and Lifang Hu for their unconditional love, understanding and support. Their love and support are the source of motivation and happiness in my life.
All in-text references underlined in blue are linked to publications on ResearchGate, letting you... more All in-text references underlined in blue are linked to publications on ResearchGate, letting you access and read them immediately.
Attempts to identify and prioritize functional DNA elements in coding and noncoding regions, part... more Attempts to identify and prioritize functional DNA elements in coding and noncoding regions, particularly through use of in silico functional annotation data,continue to increase in popularity. However, specific functional roles may vary widely from one variant to another, making it challenging to summarize different aspects of variant function. Here we propose Multi-dimensional Annotation Class Integrative Estimation (MACIE), an unsupervised multivariate mixed model framework capable of integrating annotations of diverse origin to assess multi-dimensional functional roles for both coding and noncoding variants. Unlike existing one-dimensional scoring methods, MACIE views variant functionality as a composite attribute encompassing multiple characteristics, and estimates the joint posterior functional probability vector of each genomic position, a quantity that offers richer and more interpretable information in the presence of multiple aspects of functionality. Applied to a variety ...
bioRxiv, 2021
We developed a computationally efficient method, Ancestral Frequency estimation in Admixed popula... more We developed a computationally efficient method, Ancestral Frequency estimation in Admixed populations (AFA), to estimate the frequencies of bi-allelic variants in admixed populations with an unlimited number of ancestries. AFA uses maximum likelihood estimation by modeling the conditional probability of having an allele given proportions of genetic ancestries. It can be applied using either global or local proportions of genetic ancestries. Simulations mimicking admixture demonstrated the high accuracy of the method. We implemented the method on data from the Hispanic Community Health Study/Study of Latinos (HCHS/SOL), an admixed population with three predominant continental ancestries: Amerindian, European, and African. Comparison of the European and African estimated frequencies to the respective gnomAD frequencies demonstrated high correlations, with Pearson R2=0.97-0.99. We provide a genome-wide dataset of the estimated three ancestral allele frequencies in HCHS/SOL for all ava...
Large-scale whole-genome sequencing studies have enabled analysis of noncoding rare variants’ (RV... more Large-scale whole-genome sequencing studies have enabled analysis of noncoding rare variants’ (RVs) associations with complex human traits. Variant set analysis is a powerful approach to study RV association, and a key component of it is constructing RV sets for analysis. However, existing methods have limited ability to define analysis units in the noncoding genome. Furthermore, there is a lack of robust pipelines for comprehensive and scalable noncoding RV association analysis. Here we propose a computationally-efficient noncoding RV association-detection framework that uses STAAR (variant-set test for association using annotation information) to group noncoding variants in gene-centric analysis based on functional categories. We also propose SCANG (scan the genome)-STAAR, which uses dynamic window sizes and incorporates multiple functional annotations, in a non-gene-centric analysis. We furthermore develop STAARpipeline to perform flexible noncoding RV association analysis, inclu...
International Journal of Obesity, 2021
Background/objectives Neck circumference, an index of upper airway fat, has been suggested to be ... more Background/objectives Neck circumference, an index of upper airway fat, has been suggested to be an important measure of body-fat distribution with unique associations with health outcomes such as obstructive sleep apnea and metabolic disease. This study aims to study the genetic bases of neck circumference. Methods We conducted a multi-ethnic genome-wide association study of neck circumference, adjusted and unadjusted for BMI, in up to 15,090 European Ancestry (EA) and African American (AA) individuals. Because sexually dimorphic associations have been observed for anthropometric traits, we conducted both sex-combined and sex-specific analysis. Results We identified rs227724 near the Noggin (NOG) gene as a possible quantitative locus for neck circumference in men (N = 8831, P = 1.74 × 10−9) but not in women (P = 0.08). The association was replicated in men (N = 1554, P = 0.045) in an independent dataset. This locus was previously reported to be associated with human height and with...
Nature Genetics, 2020
A n increasing number of whole-genome/exome sequencing (WGS/WES) studies are being conducted to i... more A n increasing number of whole-genome/exome sequencing (WGS/WES) studies are being conducted to investigate the genetic bases of human diseases and traits, including the Trans-Omics for Precision Medicine Program (TOPMed) of the National Heart, Lung, and Blood Institute and the Genome Sequencing Program (GSP) of the National Human Genome Research Institute. Such studies enable the assessment of associations between complex traits and both coding and noncoding RVs (minor allele frequency (MAF) < 1%) across the genome. However, single-variant analyses typically have low power to identify associations with RVs 1-3. To improve power, variant set tests have been proposed to jointly test the effects of given sets of multiple RVs. These methods include the burden test 4-7 , sequence kernel association test (SKAT) 8 and their various combinations 9-12. In parallel, external biological information provided by functional annotations, such as conservation scores and predicted enhancer status, has been successfully used to prioritize plausibly causal common variants in fine-mapping studies, partitioning heritability in GWAS and predicting genetic risk 13-17. It is of substantial interest to incorporate variant functional annotations effectively to boost the power of RV analysis of WGS association studies 18,19 .
Genetic Epidemiology, 2020
Clinical trial results have recently demonstrated that inhibiting inflammation by targeting the i... more Clinical trial results have recently demonstrated that inhibiting inflammation by targeting the interleukin-1b pathway can offer a significant reduction in lung cancer incidence and mortality, highlighting a pressing and unmet need to understand the benefits of inflammation-focused lung cancer therapies at the genetic level. While numerous genome-wide association studies (GWAS) have explored the genetic etiology of lung cancer, there remains a large gap between the type of information that may be gleaned from an association study and the depth of understanding necessary to explain and drive translational findings. Thus, in this work we jointly model and integrate extensive multi-omics data sources, utilizing a total of 40 genome-wide functional annotations that augment previously published results from the International Lung Cancer Consortium (ILCCO) GWAS, to prioritize and characterize single nucleotide polymorphisms (SNPs) that increase risk of squamous cell lung cancer through the inflammatory and immune responses. Our work bridges the gap between correlative analysis and translational follow-up research, refining GWAS association measures in an interpretable and systematic manner. In particular, re-analysis of the ILCCO data highlights the impact of highly-associated SNPs from nuclear factor-κB signaling pathway genes as well as major histocompatibility complex mediated variation in immune responses. One consequence of prioritizing likely functional SNPs is the pruning of variants that might be selected for follow-up work by over an order of magnitude, from potentially tens of thousands to hundreds. The strategies we introduce provide informative and interpretable approaches for incorporating extensive genome-wide annotation data in analysis of genetic association studies.
Cancer Research, 2020
Genome-wide association studies (GWAS) have revealed genetic risk factors for lung cancer, highlighting the role of smoking, family history, and DNA damage repair genes in disease etiology. Many studies have focused on European populations; however, lung cancer is a leading cause of cancer incidence and mortality around the world. Previous GWAS analyses have focused on single-population analyses to exclude confounding effects such as systematic allele frequency differences between populations. Another efficient tool for GWAS of complex genetic diseases and traits is meta-analysis, which provides a practical strategy for detecting genetic variants with modest effect sizes. This study aimed to identify novel genetic susceptibility loci in a large, multiethnic GWAS of lung cancer. HRC imputation of the lung cancer GWAS data was carried out on the Sanger Imputation Server. The imputed GWAS with 34,429 cases and 35,732 controls from OncoArray lung cancer GWAS ...
De novo mutations (DNMs), or mutations that appear in an individual despite not being seen in their parents, are an important source of genetic variation whose impact is relevant to studies of human evolution, genetics, and disease. Utilizing high-coverage whole genome sequencing data as part of the Trans-Omics for Precision Medicine (TOPMed) program, we directly estimate and analyze DNM counts, rates, and spectra from 1,465 trios across an array of diverse human populations. Using the resulting call set of 86,865 single nucleotide DNMs, we find a significant positive correlation between local recombination rate and local DNM rate, which together can explain up to 35.5% of the genome-wide variation in population level rare genetic variation from 41K unrelated TOPMed samples. While genome-wide heterozygosity does correlate weakly with DNM count, we do not find significant differences in DNM rate between individuals of European, African, and Latino ancestry, nor across ancestrally dist...
Whole genome sequencing (WGS) studies are being widely conducted to identify rare variants associated with human diseases and disease-related traits. Classical single-marker association analyses for rare variants have limited power, and variant-set based analyses are commonly used to analyze rare variants. However, existing variant-set based approaches need to pre-specify genetic regions for analysis, and hence are not directly applicable to WGS data due to the large number of intergenic and intronic regions that consist of a massive number of non-coding variants. The commonly used sliding window method requires pre-specifying fixed window sizes, which are often unknown a priori, are difficult to specify in practice, and are subject to limitations given that genetic association region sizes are likely to vary across the genome and phenotypes. We propose a computationally-efficient and dynamic scan statistic method (Scan the Genome (SCANG)) for analyzing WGS data that flexibly detects the...
Developmental and Comparative Immunology, 2018
Haemophilus parasuis, an important swine pathogen, was recently proven able to invade endothelial or epithelial cells in vitro. NOD1/2 are specialized NLRs that participate in the recognition of pathogens able to invade intracellularly; we therefore assessed the contribution of NOD1/2 to inflammatory responses during H. parasuis infection. We observed that H. parasuis infection enhanced NOD2 expression and RIP2 phosphorylation in porcine kidney 15 cells. Our results also showed that knockdown of NOD1/2 or RIP2 expression significantly decreased H. parasuis-induced NF-κB activity, while the phosphorylation level of p38, JNK or ERK was not changed. Moreover, real-time PCR results showed that NOD1, NOD2 and RIP2 were involved in the expression of CCL4, CCL5 and IL-8. Inhibition of NOD1 and NOD2 significantly reduced CCL5 promoter activity, even more effectively than inhibition of TLR.
Cell Host & Microbe, Jan 11, 2017
Epstein-Barr virus (EBV) transforms B cells to continuously proliferating lymphoblastoid cell lines (LCLs), which represent an experimental model for EBV-associated cancers. EBV nuclear antigens (EBNAs) and LMP1 are EBV transcriptional regulators that are essential for LCL establishment, proliferation, and survival. Starting with the 3D genome organization map of LCL, we constructed a comprehensive EBV regulome encompassing 1,992 viral/cellular genes and enhancers. Approximately 30% of genes essential for LCL growth were linked to EBV enhancers. Deleting EBNA2 sites significantly reduced their target gene expression. Additional EBV super-enhancer (ESE) targets included MCL1, IRF4, and EBF. MYC ESE looping to the transcriptional start site of MYC was dependent on EBNAs. Deleting MYC ESEs greatly reduced MYC expression and LCL growth. EBNA3A/3C altered CDKN2A/B spatial organization to suppress senescence. EZH2 inhibition decreased the looping at the CDKN2A/B loci and reduced LCL growth...
Proceedings of the National Academy of Sciences, 2017
Significance: Epigenetic alterations in nasopharyngeal carcinoma (NPC) are very frequent at the DNA level. Histone modifications are frequently altered in cancers. Because histone modifications are reversible, histone-modifying enzymes or other epigenetic regulators are ideal therapeutic targets, and drugs targeting these enzymes have been proven effective in cancer treatment. Understanding the NPC histone code provides unique insights into NPC pathogenesis and will likely contribute to the identification of unique therapeutics. Using genome-wide analyses of histone modifications, we generated an NPC epigenetic landscape and identified a key oncogene whose expression correlated with patient overall survival, suggesting that epigenetic profiling can effectively identify key oncogenic pathways. These studies provide proof-of-concept strategies for further characterization of the NPC epigenome on a larger scale.