Haiyan Huang - Academia.edu
Papers by Haiyan Huang
Journal of the American Statistical Association, 2009
A major task in understanding biological processes is to elucidate the relationships between genes involved in the underlying biological pathways. Microarray data from an increasing number of biologically interrelated experiments now allows for more complete portrayals of functional gene relationships in the pathways. In current studies of gene relationships, the presence of expression dependencies attributable to the biologically interrelated experiments, however, has been widely ignored. When unaccounted for, these (experiment) dependencies can result in inaccurate inferences of functional gene relationships, and hence incorrect biological conclusions. This article contributes a framework consisting of a model and an estimation procedure to infer gene relationships when there are two-way dependencies in the gene expression matrix (the gene-wise and experiment-wise dependencies). The main aspect of the framework is the use of a Kronecker product covariance matrix to model the gene-experiment interactions. The resulting novel gene coexpression measure, named Knorm correlation, can be understood as a natural extension of the widely used Pearson coefficient when the experiment correlation matrix is known. Compared with the Pearson coefficient, the Knorm correlation has a smaller estimation variance. The Knorm is also asymptotically consistent with the Pearson coefficient. When the experiment correlation matrix is unknown, the Knorm correlation is computed based on the estimated experiment correlation matrix by an iterative estimation procedure. We demonstrate the advantages of the Knorm correlation in both simulation studies and real datasets. The Knorm correlation estimation procedure is implemented in an R package (Knorm) that is freely available from the Bioconductor website.
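As a rough illustration of what a Kronecker product covariance means for a gene expression matrix, the block below writes it in standard matrix-normal notation. The symbols X, m, n, Σ_G and Σ_E are assumptions introduced here for illustration and are not necessarily the notation of the article; the block does not reproduce the Knorm estimator itself.

```latex
% X is an m x n expression matrix: m genes (rows), n experiments (columns).
% \Sigma_G (m x m) holds gene-wise dependencies,
% \Sigma_E (n x n) holds experiment-wise dependencies.
\[
  \operatorname{Cov}\left(X_{ij},\, X_{kl}\right)
    = (\Sigma_G)_{ik}\,(\Sigma_E)_{jl},
  \qquad
  \operatorname{Cov}\bigl(\operatorname{vec}(X)\bigr)
    = \Sigma_E \otimes \Sigma_G ,
\]
% where vec(X) stacks the columns of X.
% Setting \Sigma_E = I_n gives uncorrelated experiments.
```

Under this structure, the covariance between any two entries factors into a gene term and an experiment term, which is what allows the two kinds of dependency to be modeled separately.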
Statistics in Biosciences, 2017
One goal of single-cell RNA sequencing (scRNA seq) is to expose possible heterogeneity within cell populations due to meaningful, biological variation. Examining cell-to-cell heterogeneity, and further, identifying subpopulations of cells based on scRNA seq data has been of common interest in life science research. A key component to successfully identifying cell subpopulations (or clustering cells) is the (dis)similarity measure used to group the cells. In this paper, we introduce a novel measure, named SIDEseq, to assess cell-to-cell similarity using scRNA seq data. SIDEseq first identifies a list of putative differentially expressed (DE) genes for each pair of cells. SIDEseq then integrates the information from all the DE gene lists (corresponding to all pairs of cells) to build a similarity measure between two cells. SIDEseq can be implemented in any clustering algorithm that requires a (dis)similarity matrix. This new measure incorporates information from all cells when evaluating the similarity between any two cells, a characteristic not commonly found in existing (dis)similarity measures. This property is advantageous for two reasons: (a) borrowing information from cells of different subpopulations allows for the investigation of pairwise cell relationships from a global perspective and (b) information from other cells of the same subpopulation could help to ensure a robust relationship assessment. We applied SIDEseq to a newly generated human ovarian cancer scRNA seq dataset, a public human embryo scRNA seq dataset, and several simulated datasets. The clustering results suggest that the SIDEseq measure is capable of uncovering important relationships between cells, and outperforms or at least does as well as several popular (dis)similarity measures when used on these datasets.
CHANCE, 2006
The availability of whole genome sequence data has facilitated the development of high-throughput technologies for monitoring biological signals on a genomic scale. The revolutionary microarray technology, first introduced in 1995 (Schena et al., 1995), is now one of the most valuable techniques for global gene expression profiling. Other high-throughput genomic technologies, such as Serial Analysis of Gene Expression (SAGE) (Velculescu et al., 1995), mass spectrometry for protein identification (Henzel et al., 1993) and ChIP-chip for DNA binding (Ren et al., 2000), have also been widely used for different purposes in current biological and medical research.
Methods in Molecular Biology, 2008
To gain insights into the biological function and relevance of genes using serial analysis of gene expression (SAGE) transcription profiles, one essential method is to perform clustering analysis on a group of genes with similar expression patterns. A successful clustering analysis depends on the use of effective distance or similarity measures. For this purpose, by considering the specific properties of SAGE technology, we modeled the SAGE data by Poisson statistics and developed two Poisson-based measures to assess similarity of gene expression profiles. By incorporating these two distances into a K-means clustering procedure, we further developed a software package to perform clustering analysis on SAGE data. The software implementing our Poisson-based algorithms can be downloaded from http://genome.dfci.harvard.edu/sager. Our algorithm is guaranteed to converge to a local maximum when the Poisson likelihood-based measure is used. The results from simulation and experimental mouse retina data demonstrate that the Poisson-based distances are more appropriate and reliable for analyzing SAGE data compared to other commonly used distances or similarity measures.
Genome research, 2014
We report a statistical study to discover transcriptome similarity of developmental stages from D. melanogaster and C. elegans using modENCODE RNA-seq data. We focus on "stage-associated genes" that capture specific transcriptional activities in each stage and use them to map pairwise stages within and between the two species by a hypergeometric test. Within each species, temporally adjacent stages exhibit high transcriptome similarity, as expected. Additionally, fly female adults and worm adults are mapped with fly and worm embryos, respectively, due to maternal gene expression. Between fly and worm, an unexpected strong collinearity is observed in the time course from early embryos to late larvae. Moreover, a second parallel pattern is found between fly prepupae through adults and worm late embryos through adults, consistent with the second large wave of cell proliferation and differentiation in the fly life cycle. The results indicate a partially duplicated developmenta...
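The hypergeometric test mentioned in the abstract above can be sketched generically as an overlap test between two gene sets. This is a minimal illustration, not the study's pipeline: the function name and parameters are hypothetical, and it assumes both gene sets are drawn from one shared universe of genes (for a cross-species comparison that would presumably mean a common set of orthologs, which is an assumption here).

```python
from math import comb


def hypergeom_overlap_pvalue(universe_size, set_a_size, set_b_size, overlap):
    """Upper-tail hypergeometric p-value for the overlap of two gene sets.

    Returns P(X >= overlap), where X counts how many of the set_b_size genes
    drawn without replacement from a universe of universe_size genes fall
    into a marked subset of set_a_size genes.
    (Hypothetical helper; names are illustrative only.)
    """
    total = comb(universe_size, set_b_size)
    largest_possible = min(set_a_size, set_b_size)
    tail = sum(
        comb(set_a_size, k) * comb(universe_size - set_a_size, set_b_size - k)
        for k in range(overlap, largest_possible + 1)
    )
    return tail / total


# Usage: two stages with 300 and 250 stage-associated genes out of a
# universe of 5000 genes, sharing 40 genes (made-up numbers).
p_value = hypergeom_overlap_pvalue(5000, 300, 250, 40)
```

A small p-value says the overlap is larger than expected by chance under random draws; exact integer arithmetic via `math.comb` avoids overflow before the final division.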
Proceedings of the National Academy of Sciences, 2014
Significance: Coexpression analysis is one of the earliest tools for inferring gene associations using expression data but faces new challenges in this “big data” era. In a large heterogeneous dataset, it is likely that gene relationships may change or only exist in a subset of the samples, and they can be nonlinear or nonfunctional. We propose two new robust count statistics to account for local patterns in gene expression profiles. The statistics are generalizable to detect statistical dependence in other application domains. The performance of the statistics is evaluated against a number of popular bivariate dependence measures, showing favorable results. The asymptotic studies of the statistics provide an interesting addition to the combinatorics literature.
Nature, Jan 28, 2014
Despite the large evolutionary distances between metazoan species, they can show remarkable commonalities in their biology, and this has helped to establish fly and worm as model organisms for human biology. Although studies of individual elements and factors have explored similarities in gene regulation, a large-scale comparative analysis of basic principles of transcriptional regulatory features is lacking. Here we map the genome-wide binding locations of 165 human, 93 worm and 52 fly transcription regulatory factors, generating a total of 1,019 data sets from diverse cell types, developmental stages, or conditions in the three species, of which 498 (48.9%) are presented here for the first time. We find that structural properties of regulatory networks are remarkably conserved and that orthologous regulatory factor families recognize similar binding motifs in vivo and show some similar co-associations. Our results suggest that gene-regulatory properties previously observed for ind...
Nature, Jan 28, 2014
The transcriptome is the readout of the genome. Identifying common features in it across distant species can reveal fundamental principles. To this end, the ENCODE and modENCODE consortia have generated large amounts of matched RNA-sequencing data for human, worm and fly. Uniform processing and comprehensive annotation of these data allow comparison across metazoan phyla, extending beyond earlier within-phylum transcriptome comparisons and revealing ancient, conserved features. Specifically, we discover co-expression modules shared across animals, many of which are enriched in developmental genes. Moreover, we use expression patterns to align the stages in worm and fly development and find a novel pairing between worm embryo and fly pupae, in addition to the embryo-to-embryo and larvae-to-larvae pairings. Furthermore, we find that the extent of non-canonical, non-coding transcription is similar in each organism, per base pair. Finally, we find in all three organisms that the gene-ex...
Proceedings of the National Academy of Sciences, 2010
The rapid accumulation of gene expression data has offered unprecedented opportunities to study human diseases. The National Center for Biotechnology Information Gene Expression Omnibus is currently the largest database that systematically documents the genome-wide molecular basis of diseases. However, thus far, this resource has been far from fully utilized. This paper describes the first study to transform public gene expression repositories into an automated disease diagnosis database. Particularly, we have developed a systematic framework, including a two-stage Bayesian learning approach, to achieve the diagnosis of one or multiple diseases for a query expression profile along a hierarchical disease taxonomy. Our approach, including standardizing cross-platform gene expression data and heterogeneous disease annotations, allows analyzing both sources of information in a unified probabilistic system. A high level of overall diagnostic accuracy was shown by cross validation. It was...
Proceedings of the National Academy of Sciences, 2011
Since the inception of next-generation mRNA sequencing (RNA-Seq) technology, various attempts have been made to utilize RNA-Seq data in assembling full-length mRNA isoforms de novo and estimating abundance of isoforms. However, for genes with more than a few exons, the problem tends to be challenging and often involves identifiability issues in statistical modeling. We have developed a statistical method called “sparse linear modeling of RNA-Seq data for isoform discovery and abundance estimation” (SLIDE) that takes exon boundaries and RNA-Seq data as input to discern the set of mRNA isoforms that are most likely to be present in an RNA-Seq sample. SLIDE is based on a linear model with a design matrix that models the sampling probability of RNA-Seq reads from different mRNA isoforms. To tackle the model unidentifiability issue, SLIDE uses a modified Lasso procedure for parameter estimation. Compared with deterministic isoform assembly algorithms (e.g., Cufflinks), SLIDE considers the s...
PLoS Biology, 2004
The vertebrate retina is comprised of seven major cell types that are generated in overlapping but well-defined intervals. To identify genes that might regulate retinal development, gene expression in the developing retina was profiled at multiple time points using serial analysis of gene expression (SAGE). The expression patterns of 1,051 genes that showed developmentally dynamic expression by SAGE were investigated using in situ hybridization. A molecular atlas of gene expression in the developing and mature retina was thereby constructed, along with a taxonomic classification of developmental gene expression patterns. Genes were identified that label both temporal and spatial subsets of mitotic progenitor cells. For each developing and mature major retinal cell type, genes selectively expressed in that cell type were identified. The gene expression profiles of retinal Müller glia and mitotic progenitor cells were found to be highly similar, suggesting that Müller glia might serve to produce multiple retinal cell types under the right conditions. In addition, multiple transcripts that were evolutionarily conserved that did not appear to encode open reading frames of more than 100 amino acids in length ("noncoding RNAs") were found to be dynamically and specifically expressed in developing and mature retinal cell types. Finally, many photoreceptor-enriched genes that mapped to chromosomal intervals containing retinal disease genes were identified. These data serve as a starting point for functional investigations of the roles of these genes in retinal development and physiology.
Planta, 2009
Using transcript profile analysis, we explored the nature of the stem cell niche in roots of maize (Zea mays). Toward assessing a role for specific genes in the establishment and maintenance of the niche, we perturbed the niche and simultaneously monitored the spatial expression patterns of genes hypothesized as essential. Our results allow us to quantify and localize gene activities to specific portions of the niche: to the quiescent center (QC) or the proximal meristem (PM), or to both. The data point to molecular, biochemical and physiological processes associated with the specification and maintenance of the niche, and include reduced expression of metabolism-, redox- and certain cell cycle-associated transcripts in the QC, enrichment of auxin-associated transcripts within the entire niche, controls for the state of differentiation of QC cells, a role for cytokinins specifically in the PM portion of the niche, processes (repair machinery) for maintaining DNA integrity and a role for gene silencing in niche stabilization. To provide additional support for the hypothesized roles of the above-mentioned and other transcripts in niche specification, we overexpressed, in Arabidopsis, homologs of representative genes (eight) identified as highly enriched or reduced in the maize root QC. We conclude that the coordinated changes in expression of auxin-, redox-, cell cycle- and metabolism-associated genes suggest the linkage of gene networks at the level of transcription, thereby providing additional insights into events likely associated with root stem cell niche establishment and maintenance.
Keywords: Quiescent center • Root • Stem cell • Stem cell niche • Zea mays
Abbreviations: AA, Ascorbic acid; AAO, ASCORBATE OXIDASE; ARF, AUXIN RESPONSE FACTOR; AGO4, Argonaute-related gene 4; AHL1, AT-hook motif nuclear localized protein 1; ATH, AT-hook protein gene; APX, ASCORBATE PEROXIDASE; ARR, ARABIDOPSIS RESPONSE REGULATOR; AUX1, Auxin influx transporter; CAF-1, CHROMATIN ASSEMBLY FACTOR-1; CDKA, A-type cyclin-dependent kinase; CKS1, CYCLIN-DEPENDENT KINASE REGULATORY SUBUNIT; CYCD, D-type cyclin; CYCC, C-type cyclin; CYCL, L-type cyclin; DHAR, DEHYDROASCORBATE REDUCTASE; EZ, ELONGATION ZONE; FAS2, FASCIATA 2; GAPDH, GLYCERALDEHYDE-3-PHOSPHATE DEHYDROGENASE
Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 2009
The landscape of genomics has changed drastically in the last two decades. Increasingly inexpensive sequencing has shifted the primary focus from the acquisition of biological sequences to the study of biological function. Assays have been developed to study many intricacies of biological systems, and publicly available databases have given rise to integrative analyses that combine information from many sources to draw complex conclusions. Such research was the focus of the recent workshop at the Isaac Newton Institute, ‘High dimensional statistics in biology’. Many computational methods from modern genomics and related disciplines were presented and discussed. Using, as much as possible, the material from these talks, we give an overview of modern genomics: from the essential assays that make data-generation possible, to the statistical methods that yield meaningful inference. We point to current analytical challenges, where novel methods, or novel applications of extant methods, a...
Nature Biotechnology, 2005
Cancer Cell, 2004
Here we describe the comprehensive gene expression profiles of each cell type composing normal breast tissue and in situ and invasive breast carcinomas using serial analysis of gene expression. Based on these data, we determined that extensive gene expression changes occur in all cell types during cancer progression and that a significant fraction of altered genes encode secreted proteins and receptors. Despite the dramatic gene expression changes in all cell types, genetic alterations were detected only in cancer epithelial cells. The CXCL14 and CXCL12 chemokines overexpressed in tumor myoepithelial cells and myofibroblasts, respectively, bind to receptors on epithelial cells and enhance their proliferation, migration, and invasion. Thus, chemokines may play a role in breast tumorigenesis by acting as paracrine factors.
BMC Bioinformatics, 2009
Background: Disease classification has been an important application of microarray technology. However, most microarray-based classifiers can only handle data generated within the same study, since microarray data generated by different laboratories or with different platforms cannot be compared directly due to systematic variations. This issue has severely limited the practical use of microarray-based disease classification. Results: In this study, we tested the feasibility of disease classification by integrating the large number of heterogeneous microarray datasets from the public microarray repositories. Cross-platform data compatibility is created by deriving expression log-rank ratios within datasets. One may then compare vectors of log-rank ratios across datasets. In addition, we systematically map textual annotations of datasets to concepts in Unified Medical Language System (UMLS), permitting quantitative analysis of the phenotype "distance" between datasets and au...
Bioinformatics, 2012
Motivation: Pathway genes are considered as a group of genes that work cooperatively in the same pathway constituting a fundamental functional grouping in a biological process. Identifying pathway genes has been one of the major tasks in understanding biological processes. However, due to the difficulty in characterizing/inferring different types of biological gene relationships, as well as several computational issues arising from dealing with high-dimensional biological data, deducing genes in pathways remains challenging. Results: In this work, we elucidate higher level gene-gene interactions by evaluating the conditional dependencies between genes, i.e. the relationships between genes after removing the influences of a set of previously known pathway genes. These previously known pathway genes serve as seed genes in our model and will guide the detection of other genes involved in the same pathway. The detailed statistical techniques involve the estimation of a precision matrix whose elements are known to be proportional to partial correlations (i.e. conditional dependencies) between genes under appropriate normality assumptions. Likelihood ratio tests on two forms of precision matrices are further performed to see if a candidate pathway gene is conditionally independent of all the previously known pathway genes. When used effectively, this is a promising approach to recover gene relationships that would have otherwise been missed by standard methods. The advantage of the proposed method is demonstrated using both simulation studies and real datasets. We also demonstrated the importance of taking into account experimental dependencies in the simulation and real data studies.
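The link between a precision matrix and partial correlations that the abstract above relies on can be sketched as follows. This is a minimal illustration of the standard identity (partial correlation between variables i and j equals the negated precision entry divided by the square root of the product of the two diagonal entries); it is not the paper's estimation procedure or its likelihood ratio test, and the function name is hypothetical.

```python
from math import sqrt


def partial_correlations(precision):
    """Convert a precision (inverse covariance) matrix to partial correlations.

    Uses rho_ij = -omega_ij / sqrt(omega_ii * omega_jj) for i != j,
    with ones on the diagonal. `precision` is a square list of lists
    with positive diagonal entries. (Hypothetical helper for illustration.)
    """
    size = len(precision)
    result = [[1.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i != j:
                scale = sqrt(precision[i][i] * precision[j][j])
                result[i][j] = -precision[i][j] / scale
    return result


# Usage: a toy three-gene precision matrix in which genes 0 and 2
# have a zero entry, i.e. no direct conditional dependence.
toy_precision = [
    [2.0, -1.0, 0.0],
    [-1.0, 2.0, -1.0],
    [0.0, -1.0, 2.0],
]
toy_partial = partial_correlations(toy_precision)
```

A zero off-diagonal entry in the precision matrix gives a zero partial correlation, which under normality corresponds to conditional independence of the two genes given the others.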
The Annals of Applied Statistics, 2010
Large-scale statistical analysis of data sets associated with genome sequences plays an important role in modern biology. A key component of such statistical analyses is the computation of p-values and confidence bounds for statistics defined on the genome. Currently such computation is commonly achieved through ad hoc simulation measures. The method of randomization, which is at the heart of these simulation procedures, can significantly affect the resulting statistical conclusions. Most simulation schemes introduce a variety of hidden assumptions regarding the nature of the randomness in the data, resulting in a failure to capture biologically meaningful relationships. To address the need for a method of assessing the significance of observations within large scale genomic studies, where there often exists a complex dependency structure between observations, we propose a unified solution built upon a data subsampling approach. We propose a piecewise stationary model for genome sequences and show that the subsampling approach gives correct answers under this model. We illustrate the method on three simulation studies and two real data examples.
The Annals of Applied Statistics, 2011
Reproducibility is essential to reliable scientific discovery in high-throughput experiments. In this work we propose a unified approach to measure the reproducibility of findings identified from replicate experiments and identify putative discoveries using reproducibility. Unlike the usual scalar measures of reproducibility, our approach creates a curve, which quantitatively assesses when the findings are no longer consistent across replicates. Our curve is fitted by a copula mixture model, from which we derive a quantitative reproducibility score, which we call the "irreproducible discovery rate" (IDR) analogous to the FDR. This score can be computed at each set of paired replicate ranks and permits the principled setting of thresholds both for assessing reproducibility and combining replicates. Since our approach permits an arbitrary scale for each replicate, it provides useful descriptive measures in a wide variety of situations to be explored. We study the performance of the algorithm using simulations and give a heuristic analysis of its theoretical properties. We demonstrate the effectiveness of our method in a ChIP-seq experiment.
Journal of the American Statistical Association, 2009
A major task in understanding biological processes is to elucidate the relationships between gene... more A major task in understanding biological processes is to elucidate the relationships between genes involved in the underlying biological pathways. Microarray data from an increasing number of biologically interrelated experiments now allows for more complete portrayals of functional gene relationships in the pathways. In current studies of gene relationships, the presence of expression dependencies attributable to the biologically interrelated experiments, however, has been widely ignored. When unaccounted for, these (experiment) dependencies can result in inaccurate inferences of functional gene relationships, and hence incorrect biological conclusions. This article contributes a framework consisting of a model and an estimation procedure to infer gene relationships when there are two-way dependencies in the gene expression matrix (the gene-wise and experiment-wise dependencies). The main aspect of the framework is the use of a Kronecker product covariance matrix to model the gene-experiment interactions. The resulting novel gene coexpression measure, named Knorm correlation, can be understood as a natural extension of the widely used Pearson coefficient when the experiment correlation matrix is known. Compared with the Pearson coefficient, the Knorm correlation has a smaller estimation variance. The Knorm is also asymptotically consistent with the Pearson coefficient. When the experiment correlation matrix is unknown, the Knorm correlation is computed based on the estimated experiment correlation matrix by an iterative estimation procedure. We demonstrate the advantages of the Knorm correlation in both simulation studies and real datasets. The Knorm correlation estimation procedure is implemented in an R package (Knorm) that is freely available from the Bioconductor website.
Statistics in Biosciences, 2017
One goal of single-cell RNA sequencing (scRNA seq) is to expose possible heterogeneity within cel... more One goal of single-cell RNA sequencing (scRNA seq) is to expose possible heterogeneity within cell populations due to meaningful, biological variation. Examining cell-to-cell heterogeneity, and further, identifying subpopulations of cells based on scRNA seq data has been of common interest in life science research. A key component to successfully identifying cell subpopulations (or clustering cells) is the (dis)similarity measure used to group the cells. In this paper, we introduce a novel measure, named SIDEseq, to assess cell-to-cell similarity using scRNA seq data. SIDEseq first identifies a list of putative differentially expressed (DE) genes for each pair of cells. SIDEseq then integrates the information from all the DE gene lists (corresponding to all pairs of cells) to build a similarity measure between two cells. SIDEseq can be implemented in any clustering algorithm that requires a (dis)similarity matrix. This new measure incorporates information from all cells when evaluating the similarity between any two cells, a characteristic not commonly found in existing (dis)similarity measures. This property is advantageous for two reasons: (a) borrowing information from cells of different subpopulations allows for the investigation of pairwise cell relationships from a global perspective and (b) information from other cells of the same subpopulation could help to ensure a robust relationship assessment. We applied SIDEseq to a newly generated human ovarian cancer scRNA seq dataset, a public human embryo scRNA seq dataset, and several simulated datasets. The clustering results suggest that the SIDEseq measure is capable of uncovering important relationships between cells, and outperforms or at least does as well as several popular (dis)similarity measures when used on these datasets.
CHANCE, 2006
The availability of whole genome sequence data has facilitated the development of high-throughput... more The availability of whole genome sequence data has facilitated the development of high-throughput technologies for monitoring biological signals on a genomic scale. The revolutionary microarray technology, first introduced in 1995 (Schena et al., 1995), is now one of the most valuable techniques for global gene expression profiling. Other high-throughput genomic technologies, such as Serial Analysis of Gene Expression (SAGE) (Velculescu et al., 1995), mass spectrometry for protein identification (Henzel et al., 1993) and ChIP-chip for DNA binding (Ren et al., 2000), have also been widely used for different purposes in current biological and medical research.
Methods in Molecular Biology, 2008
To gain insights into the biological function and relevance of genes using serial analysis of gen... more To gain insights into the biological function and relevance of genes using serial analysis of gene expression (SAGE) transcription profiles, one essential method is to perform clustering analysis on a group genes with similar expression patterns. A successful clustering analysis depends on the use of effective distance or similarity measures. For this purpose, by considering the specific properties of SAGE technology, we modeled the SAGE data by Poisson statistics and developed two Poisson-based measures to assess similarity of gene expression profiles. By employing these two distances into a K-means clustering procedure, we further developed a software package to perform clustering analysis on SAGE data. The software implementing our Poisson-based algorithms can be downloaded from http://genome.dfci.harvard.edu/sager. Our algorithm is guaranteed to converge to a local maximum when Poisson likelihood-based measure is used. The results from simulation and experimental mouse retina data demonstrate that the Poisson-based distances are more appropriate and reliable for analyzing SAGE data compared to other commonly used distances or similarity measures.
Genome research, 2014
We report a statistical study to discover transcriptome similarity of developmental stages from D... more We report a statistical study to discover transcriptome similarity of developmental stages from D. melanogaster and C. elegans using modENCODE RNA-seq data. We focus on "stage-associated genes" that capture specific transcriptional activities in each stage and use them to map pairwise stages within and between the two species by a hypergeometric test. Within each species, temporally adjacent stages exhibit high transcriptome similarity, as expected. Additionally, fly female adults and worm adults are mapped with fly and worm embryos, respectively, due to maternal gene expression. Between fly and worm, an unexpected strong collinearity is observed in the time course from early embryos to late larvae. Moreover, a second parallel pattern is found between fly prepupae through adults and worm late embryos through adults, consistent with the second large wave of cell proliferation and differentiation in the fly life cycle. The results indicate a partially duplicated developmenta...
Proceedings of the National Academy of Sciences, 2014
Significance Coexpression analysis is one of the earliest tools for inferring gene associations u... more Significance Coexpression analysis is one of the earliest tools for inferring gene associations using expression data but faces new challenges in this “big data” era. In a large heterogeneous dataset, it is likely that gene relationships may change or only exist in a subset of the samples, and they can be nonlinear or nonfunctional. We propose two new robust count statistics to account for local patterns in gene expression profiles. The statistics are generalizable to detect statistical dependence in other application domains. The performance of the statistics is evaluated against a number of popular bivariate dependence measures, showing favorable results. The asymptotic studies of the statistics provide an interesting addition to the combinatorics literature.
Nature, Jan 28, 2014
Despite the large evolutionary distances between metazoan species, they can show remarkable commonalities in their biology, and this has helped to establish fly and worm as model organisms for human biology. Although studies of individual elements and factors have explored similarities in gene regulation, a large-scale comparative analysis of basic principles of transcriptional regulatory features is lacking. Here we map the genome-wide binding locations of 165 human, 93 worm and 52 fly transcription regulatory factors, generating a total of 1,019 data sets from diverse cell types, developmental stages, or conditions in the three species, of which 498 (48.9%) are presented here for the first time. We find that structural properties of regulatory networks are remarkably conserved and that orthologous regulatory factor families recognize similar binding motifs in vivo and show some similar co-associations. Our results suggest that gene-regulatory properties previously observed for ind...
Nature, Jan 28, 2014
The transcriptome is the readout of the genome. Identifying common features in it across distant species can reveal fundamental principles. To this end, the ENCODE and modENCODE consortia have generated large amounts of matched RNA-sequencing data for human, worm and fly. Uniform processing and comprehensive annotation of these data allow comparison across metazoan phyla, extending beyond earlier within-phylum transcriptome comparisons and revealing ancient, conserved features. Specifically, we discover co-expression modules shared across animals, many of which are enriched in developmental genes. Moreover, we use expression patterns to align the stages in worm and fly development and find a novel pairing between worm embryo and fly pupae, in addition to the embryo-to-embryo and larvae-to-larvae pairings. Furthermore, we find that the extent of non-canonical, non-coding transcription is similar in each organism, per base pair. Finally, we find in all three organisms that the gene-ex...
Proceedings of the National Academy of Sciences, 2010
The rapid accumulation of gene expression data has offered unprecedented opportunities to study human diseases. The National Center for Biotechnology Information Gene Expression Omnibus is currently the largest database that systematically documents the genome-wide molecular basis of diseases. However, thus far, this resource has been far from fully utilized. This paper describes the first study to transform public gene expression repositories into an automated disease diagnosis database. Particularly, we have developed a systematic framework, including a two-stage Bayesian learning approach, to achieve the diagnosis of one or multiple diseases for a query expression profile along a hierarchical disease taxonomy. Our approach, including standardizing cross-platform gene expression data and heterogeneous disease annotations, allows analyzing both sources of information in a unified probabilistic system. A high level of overall diagnostic accuracy was shown by cross validation. It was...
Proceedings of the National Academy of Sciences, 2011
Since the inception of next-generation mRNA sequencing (RNA-Seq) technology, various attempts have been made to utilize RNA-Seq data in assembling full-length mRNA isoforms de novo and estimating abundance of isoforms. However, for genes with more than a few exons, the problem tends to be challenging and often involves identifiability issues in statistical modeling. We have developed a statistical method called “sparse linear modeling of RNA-Seq data for isoform discovery and abundance estimation” (SLIDE) that takes exon boundaries and RNA-Seq data as input to discern the set of mRNA isoforms that are most likely to be present in an RNA-Seq sample. SLIDE is based on a linear model with a design matrix that models the sampling probability of RNA-Seq reads from different mRNA isoforms. To tackle the model unidentifiability issue, SLIDE uses a modified Lasso procedure for parameter estimation. Compared with deterministic isoform assembly algorithms (e.g., Cufflinks), SLIDE considers the s...
PLoS Biology, 2004
The vertebrate retina is comprised of seven major cell types that are generated in overlapping but well-defined intervals. To identify genes that might regulate retinal development, gene expression in the developing retina was profiled at multiple time points using serial analysis of gene expression (SAGE). The expression patterns of 1,051 genes that showed developmentally dynamic expression by SAGE were investigated using in situ hybridization. A molecular atlas of gene expression in the developing and mature retina was thereby constructed, along with a taxonomic classification of developmental gene expression patterns. Genes were identified that label both temporal and spatial subsets of mitotic progenitor cells. For each developing and mature major retinal cell type, genes selectively expressed in that cell type were identified. The gene expression profiles of retinal Müller glia and mitotic progenitor cells were found to be highly similar, suggesting that Müller glia might serve to produce multiple retinal cell types under the right conditions. In addition, multiple transcripts that were evolutionarily conserved that did not appear to encode open reading frames of more than 100 amino acids in length ("noncoding RNAs") were found to be dynamically and specifically expressed in developing and mature retinal cell types. Finally, many photoreceptor-enriched genes that mapped to chromosomal intervals containing retinal disease genes were identified. These data serve as a starting point for functional investigations of the roles of these genes in retinal development and physiology.
Planta, 2009
Using transcript profile analysis, we explored the nature of the stem cell niche in roots of maize (Zea mays). Toward assessing a role for specific genes in the establishment and maintenance of the niche, we perturbed the niche and simultaneously monitored the spatial expression patterns of genes hypothesized as essential. Our results allow us to quantify and localize gene activities to specific portions of the niche: to the quiescent center (QC) or the proximal meristem (PM), or to both. The data point to molecular, biochemical and physiological processes associated with the specification and maintenance of the niche, and include reduced expression of metabolism-, redox- and certain cell cycle-associated transcripts in the QC, enrichment of auxin-associated transcripts within the entire niche, controls for the state of differentiation of QC cells, a role for cytokinins specifically in the PM portion of the niche, processes (repair machinery) for maintaining DNA integrity and a role for gene silencing in niche stabilization. To provide additional support for the hypothesized roles of the above-mentioned and other transcripts in niche specification, we overexpressed, in Arabidopsis, homologs of representative genes (eight) identified as highly enriched or reduced in the maize root QC. We conclude that the coordinated changes in expression of auxin-, redox-, cell cycle- and metabolism-associated genes suggest the linkage of gene networks at the level of transcription, thereby providing additional insights into events likely associated with root stem cell niche establishment and maintenance.
Keywords: Quiescent center • Root • Stem cell • Stem cell niche • Zea mays
Abbreviations: AA, Ascorbic acid; AAO, ASCORBATE OXIDASE; ARF, AUXIN RESPONSE FACTOR; AGO4, Argonaute-related gene 4; AHL1, AT-hook motif nuclear localized protein 1; ATH, AT-hook protein gene; APX, ASCORBATE PEROXIDASE; ARR, ARABIDOPSIS RESPONSE REGULATOR; AUX1, Auxin influx transporter; CAF-1, CHROMATIN ASSEMBLY FACTOR-1; CDKA, A-type cyclin-dependent kinase; CKS1, CYCLIN-DEPENDENT KINASE REGULATORY SUBUNIT; CYCD, D-type cyclin; CYCC, C-type cyclin; CYCL, L-type cyclin; DHAR, DEHYDROASCORBATE REDUCTASE; EZ, ELONGATION ZONE; FAS2, FASCIATA 2; GAPDH, GLYCERALDEHYDE-3-PHOSPHATE DEHYDROGENASE
Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 2009
The landscape of genomics has changed drastically in the last two decades. Increasingly inexpensive sequencing has shifted the primary focus from the acquisition of biological sequences to the study of biological function. Assays have been developed to study many intricacies of biological systems, and publicly available databases have given rise to integrative analyses that combine information from many sources to draw complex conclusions. Such research was the focus of the recent workshop at the Isaac Newton Institute, ‘High dimensional statistics in biology’. Many computational methods from modern genomics and related disciplines were presented and discussed. Using, as much as possible, the material from these talks, we give an overview of modern genomics: from the essential assays that make data-generation possible, to the statistical methods that yield meaningful inference. We point to current analytical challenges, where novel methods, or novel applications of extant methods, a...
Nature Biotechnology, 2005
Cancer Cell, 2004
Here we describe the comprehensive gene expression profiles of each cell type composing normal breast tissue and in situ and invasive breast carcinomas using serial analysis of gene expression. Based on these data, we determined that extensive gene expression changes occur in all cell types during cancer progression and that a significant fraction of altered genes encode secreted proteins and receptors. Despite the dramatic gene expression changes in all cell types, genetic alterations were detected only in cancer epithelial cells. The CXCL14 and CXCL12 chemokines overexpressed in tumor myoepithelial cells and myofibroblasts, respectively, bind to receptors on epithelial cells and enhance their proliferation, migration, and invasion. Thus, chemokines may play a role in breast tumorigenesis by acting as paracrine factors.
BMC Bioinformatics, 2009
Background: Disease classification has been an important application of microarray technology. However, most microarray-based classifiers can only handle data generated within the same study, since microarray data generated by different laboratories or with different platforms cannot be compared directly due to systematic variations. This issue has severely limited the practical use of microarray-based disease classification. Results: In this study, we tested the feasibility of disease classification by integrating the large number of heterogeneous microarray datasets from the public microarray repositories. Cross-platform data compatibility is created by deriving expression log-rank ratios within datasets. One may then compare vectors of log-rank ratios across datasets. In addition, we systematically map textual annotations of datasets to concepts in Unified Medical Language System (UMLS), permitting quantitative analysis of the phenotype "distance" between datasets and au...
Bioinformatics, 2012
Motivation: Pathway genes are considered as a group of genes that work cooperatively in the same pathway constituting a fundamental functional grouping in a biological process. Identifying pathway genes has been one of the major tasks in understanding biological processes. However, due to the difficulty in characterizing/inferring different types of biological gene relationships, as well as several computational issues arising from dealing with high-dimensional biological data, deducing genes in pathways remains challenging. Results: In this work, we elucidate higher level gene-gene interactions by evaluating the conditional dependencies between genes, i.e. the relationships between genes after removing the influences of a set of previously known pathway genes. These previously known pathway genes serve as seed genes in our model and will guide the detection of other genes involved in the same pathway. The detailed statistical techniques involve the estimation of a precision matrix whose elements are known to be proportional to partial correlations (i.e. conditional dependencies) between genes under appropriate normality assumptions. Likelihood ratio tests on two forms of precision matrices are further performed to see if a candidate pathway gene is conditionally independent of all the previously known pathway genes. When used effectively, this is a promising approach to recover gene relationships that would have otherwise been missed by standard methods. The advantage of the proposed method is demonstrated using both simulation studies and real datasets. We also demonstrated the importance of taking into account experimental dependencies in the simulation and real data studies.
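The link between a precision matrix and partial correlations mentioned above can be illustrated with the standard identity: the partial correlation between variables i and j is the negative of precision entry (i, j) divided by the square root of the product of the two diagonal entries. The sketch below only applies that identity to a precision matrix that is already given. It does not reproduce the paper's estimation procedure or its likelihood ratio tests, and the function name and the toy matrix are hypothetical.

```python
from math import sqrt


def partial_correlations(precision):
    """Convert a precision (inverse covariance) matrix, given as a list
    of rows, into a matrix of partial correlations.

    Uses rho_ij = -omega_ij / sqrt(omega_ii * omega_jj) for i != j,
    with ones placed on the diagonal by convention.
    """
    size = len(precision)
    result = [[1.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i != j:
                scale = sqrt(precision[i][i] * precision[j][j])
                result[i][j] = -precision[i][j] / scale
    return result


# Toy precision matrix for three genes (hypothetical values).
toy_precision = [
    [2.0, -1.0, 0.0],
    [-1.0, 2.0, -1.0],
    [0.0, -1.0, 2.0],
]
toy_partial = partial_correlations(toy_precision)
```

In the toy matrix the (0, 2) entry is zero, so genes 0 and 2 have zero partial correlation. Under normality that means they are conditionally independent given gene 1, which is the kind of relationship the abstract's tests examine for candidate pathway genes.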
The Annals of Applied Statistics, 2010
Large-scale statistical analysis of data sets associated with genome sequences plays an important role in modern biology. A key component of such statistical analyses is the computation of p-values and confidence bounds for statistics defined on the genome. Currently such computation is commonly achieved through ad hoc simulation measures. The method of randomization, which is at the heart of these simulation procedures, can significantly affect the resulting statistical conclusions. Most simulation schemes introduce a variety of hidden assumptions regarding the nature of the randomness in the data, resulting in a failure to capture biologically meaningful relationships. To address the need for a method of assessing the significance of observations within large scale genomic studies, where there often exists a complex dependency structure between observations, we propose a unified solution built upon a data subsampling approach. We propose a piecewise stationary model for genome sequences and show that the subsampling approach gives correct answers under this model. We illustrate the method on three simulation studies and two real data examples.
The Annals of Applied Statistics, 2011
Reproducibility is essential to reliable scientific discovery in high-throughput experiments. In this work we propose a unified approach to measure the reproducibility of findings identified from replicate experiments and identify putative discoveries using reproducibility. Unlike the usual scalar measures of reproducibility, our approach creates a curve, which quantitatively assesses when the findings are no longer consistent across replicates. Our curve is fitted by a copula mixture model, from which we derive a quantitative reproducibility score, which we call the "irreproducible discovery rate" (IDR) analogous to the FDR. This score can be computed at each set of paired replicate ranks and permits the principled setting of thresholds both for assessing reproducibility and combining replicates. Since our approach permits an arbitrary scale for each replicate, it provides useful descriptive measures in a wide variety of situations to be explored. We study the performance of the algorithm using simulations and give a heuristic analysis of its theoretical properties. We demonstrate the effectiveness of our method in a ChIP-seq experiment.