Richard Sandstrom - Academia.edu

Papers by Richard Sandstrom

An integrated encyclopedia of DNA elements in the human genome

Nature, 2012

The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.

Erratum: Large-scale identification of sequence variants influencing human transcription factor occupancy in vivo

Genome Sequencing of Autism-Affected Families Reveals Disruption of Putative Noncoding Regulatory DNA

The American Journal of Human Genetics, 2015

We performed whole-genome sequencing (WGS) of 208 genomes from 53 families affected by simplex autism. For the majority of these families, no copy-number variant (CNV) or candidate de novo gene-disruptive single-nucleotide variant (SNV) had been detected by microarray or whole-exome sequencing (WES). We integrated multiple CNV and SNV analyses and extensive experimental validation to identify additional candidate mutations in eight families. We report that compared to control individuals, probands showed a significant (p = 0.03) enrichment of de novo and private disruptive mutations within fetal CNS DNase I hypersensitive sites (i.e., putative regulatory regions). This effect was only observed within 50 kb of genes that have been previously associated with autism risk, including genes where dosage sensitivity has already been established by recurrent disruptive de novo protein-coding mutations (ARID1B, SCN2A, NR3C2, PRKCA, and DSCAM). In addition, we provide evidence of gene-disruptive CNVs (in DISC1, WNT7A, RBFOX1, and MBD5), as well as smaller de novo CNVs and exon-specific SNVs missed by exome sequencing in neurodevelopmental genes (e.g., CANX, SAE1, and PIK3CA). Our results suggest that the detection of smaller, often multiple CNVs affecting putative regulatory elements might help explain additional risk of simplex autism.

DNase I hypersensitivity mapping, genomic footprinting, and transcription factor networks in plants

Current Plant Biology, 2015

Large-scale identification of sequence variants influencing human transcription factor occupancy in vivo

Nature Genetics, Jan 26, 2015

The function of human regulatory regions depends exquisitely on their local genomic environment and on cellular context, complicating experimental analysis of common disease- and trait-associated variants that localize within regulatory DNA. We use allelically resolved genomic DNase I footprinting data encompassing 166 individuals and 114 cell types to identify >60,000 common variants that directly influence transcription factor occupancy and regulatory DNA accessibility in vivo. The unprecedented scale of these data enables systematic analysis of the impact of sequence variation on transcription factor occupancy in vivo. We leverage this analysis to develop accurate models of variation affecting the recognition sites for diverse transcription factors and apply these models to discriminate nearly 500,000 common regulatory variants likely to affect transcription factor occupancy across the human genome. The approach and results provide a new foundation for the analysis and interpretation of noncoding variation in complete human genomes and for systems-level investigation of disease-associated variants.

DNase I–hypersensitive exons colocalize with promoters and distal regulatory elements

Nature Genetics, 2013

The precise splicing of genes confers an enormous transcriptional complexity to the human genome. The majority of gene splicing occurs cotranscriptionally, permitting epigenetic modifications to affect splicing outcomes. Here we show that select exonic regions are demarcated within the three-dimensional structure of the human genome. We identify a subset of exons that exhibit DNase I hypersensitivity and are accompanied by 'phantom' signals in chromatin immunoprecipitation and sequencing (ChIP-seq) that result from cross-linking with proximal promoter- or enhancer-bound factors. The capture of structural features by ChIP-seq is confirmed by chromatin interaction analysis that resolves local intragenic loops that fold exons close to cognate promoters while excluding intervening intronic sequences. These interactions of exons with promoters and enhancers are enriched for alternative splicing events, an effect reflected in cell type-specific periexonic DNase I hypersensitivity patterns. Collectively, our results connect local genome topography, chromatin structure and cis-regulatory landscapes with the generation of human transcriptional complexity by cotranscriptional splicing.

An expansive human regulatory lexicon encoded in transcription factor footprints

Nature, 2012

Regulatory factor binding to genomic DNA protects the underlying sequence from cleavage by DNaseI, leaving nucleotide-resolution footprints. Using genomic DNaseI footprinting across 41 diverse cell and tissue types, we detected 45 million factor occupancy events within regulatory regions, representing differential binding to 8.4 million distinct short sequence elements. Here we show that this small genomic sequence compartment, roughly twice the size of the exome, encodes an expansive repertoire of conserved recognition sequences for DNA-binding proteins that nearly doubles the size of the human cis-regulatory lexicon. We find that genetic variants affecting allelic chromatin states are concentrated in footprints, and that these elements are preferentially sheltered from DNA methylation. High-resolution DNaseI cleavage patterns mirror nucleotide-level evolutionary conservation and track the crystallographic topography of protein-DNA interfaces, indicating that transcription factor structure has been evolutionarily imprinted on the human genome sequence. We identify a stereotyped 50 base-pair footprint that precisely defines the site of transcript origination within thousands of human promoters. Finally, we describe a large collection of novel regulatory factor recognition motifs that are highly conserved in both sequence and function, and exhibit cell-selective occupancy patterns that closely parallel major regulators of development, differentiation, and pluripotency.

The accessible chromatin landscape of the human genome

Nature, 2012

DNase I hypersensitive sites (DHSs) are markers of regulatory DNA and have underpinned the discovery of all classes of cis-regulatory elements including enhancers, promoters, insulators, silencers and locus control regions. Here we present the first extensive map of human DHSs identified through genome-wide profiling in 125 diverse cell and tissue types. We identify ∼2.9 million DHSs that encompass virtually all known experimentally validated cis-regulatory sequences and expose a vast trove of novel elements, most with highly cell-selective regulation. Annotating these elements using ENCODE data reveals novel relationships between chromatin accessibility, transcription, DNA methylation and regulatory factor occupancy patterns. We connect ∼580,000 distal DHSs with their target promoters, revealing systematic pairing of different classes of distal DHSs and specific promoter types. Patterning of chromatin accessibility at many regulatory regions is organized with dozens to hundreds of co-activated elements, and the transcellular DNase I sensitivity pattern at a given region can predict cell-type-specific functional behaviours. The DHS landscape shows signatures of recent functional evolutionary constraint. However, the DHS compartment in pluripotent and immortalized cells exhibits higher mutation rates than that in highly differentiated cells, exposing an unexpected link between chromatin accessibility, proliferative potential and patterns of human variation.

Late-replicating heterochromatin is characterized by decreased cytosine methylation in the human genome

Genome Research, 2011

Heterochromatin is believed to be associated with increased levels of cytosine methylation. With the recent availability of genome-wide, high-resolution molecular data reflecting chromatin organization and methylation, such relationships can be explored systematically. As well-defined surrogates for heterochromatin, we tested the relationship between DNA replication timing and DNase hypersensitivity with cytosine methylation in two human cell types, unexpectedly finding the later-replicating, more heterochromatic regions to be less methylated than early replicating regions. When we integrated gene-expression data into the study, we found that regions of increased gene expression were earlier replicating, as previously identified, and that transcription-targeted cytosine methylation in gene bodies contributes to the positive correlation with early replication. A self-organizing map (SOM) approach was able to identify genomic regions with early replication and increased methylation, but lacking annotated transcripts, loci missed in simple two-variable analyses, possibly encoding unrecognized intergenic transcripts. We conclude that the relationship of cytosine methylation with heterochromatin is not simple and depends on whether the genomic context is tandemly repetitive sequences often found near centromeres, which are known to be heterochromatic and methylated, or the remaining majority of the genome, where cytosine methylation is targeted preferentially to the transcriptionally active, euchromatic compartment of the genome.

Zebrafish globin switching occurs in two developmental stages and is controlled by the LCR

Developmental Biology, 2012

Globin gene switching is a complex, highly regulated process allowing expression of distinct globin genes at specific developmental stages. Here, for the first time, we have characterized all of the zebrafish globins based on the completed genomic sequence. Two distinct chromosomal loci, termed major (chromosome 3) and minor (chromosome 12), harbor the globin genes containing α/β pairs in a 5′-3′ to 3′-5′ orientation. Both these loci share synteny with the mammalian α-globin locus. Zebrafish globin expression was assayed during development and demonstrated two globin switches, similar to human development. A conserved regulatory element, the locus control region (LCR), was revealed by analyzing DNase I hypersensitive sites, H3K4 trimethylation marks and GATA1 binding sites. Surprisingly, the position of these sites with relation to the globin genes is evolutionarily conserved, despite a lack of overall sequence conservation. Motifs within the zebrafish LCR include CACCC, GATA, and NFE2 sites, suggesting functional interactions with known transcription factors but not the same LCR architecture. Functional homology to the mammalian α-LCR MCS-R2 region was confirmed by robust and specific reporter expression in erythrocytes of transgenic zebrafish. Our studies provide a comprehensive characterization of the zebrafish globin loci and clarify the regulation of globin switching.

Developmental Fate and Cellular Maturity Encoded in Human Regulatory DNA Landscapes

Cell, 2013

Cellular-state information between generations of developing cells may be propagated via regulatory regions. We report consistent patterns of gain and loss of DNase I-hypersensitive sites (DHSs) as cells progress from embryonic stem cells (ESCs) to terminal fates. DHS patterns alone convey rich information about cell fate and lineage relationships distinct from information conveyed by gene expression. Developing cells share a proportion of their DHS landscapes with ESCs; that proportion decreases continuously in each cell type as differentiation progresses, providing a quantitative benchmark of developmental maturity. Developmentally stable DHSs densely encode binding sites for transcription factors involved in autoregulatory feedback circuits. In contrast to normal cells, cancer cells extensively reactivate silenced ESC DHSs and those from developmental programs external to the cell lineage from which the malignancy derives. Our results point to changes in regulatory DNA landscapes as quantitative indicators of cell-fate transitions, lineage relationships, and dysfunction.

Comprehensive characterization of erythroid-specific enhancers in the genomic regions of human Kruppel-like factors

BMC Genomics, 2013

Background: Mapping of DNase I hypersensitive sites (DHSs) is a powerful tool to experimentally identify cis-regulatory elements (CREs). Among CREs, enhancers are abundant and predominantly act in driving cell-specific gene expression. Krüppel-like factors (KLFs) are a family of eukaryotic transcription factors. Several KLFs have been demonstrated to play important roles in hematopoiesis. However, transcriptional regulation of KLFs via CREs, particularly enhancers, in erythroid cells has been poorly understood. Results: In this study, 23 erythroid-specific or putative erythroid-specific DHSs were identified by DNase-seq in the genomic regions of 17 human KLFs, and their enhancer activities were evaluated using dual-luciferase reporter (DLR) assay. Of the 23 erythroid-specific DHSs, the enhancer activities of 15 DHSs were comparable to that of the classical enhancer HS2 in driving minimal promoter (minP). Fifteen DHSs, some overlapping those that increased minP activities, acted as enhancers when driving the corresponding KLF promoters (KLF-Ps) in erythroid cells; of these, 10 DHSs were finally characterized as erythroid-specific KLF enhancers. These 10 erythroid-specific KLF enhancers were further confirmed using chromatin immunoprecipitation coupled to sequencing (ChIP-seq) data-based bioinformatic and biochemical analyses.

BEDOPS: high-performance genomic feature operations

Bioinformatics, 2012

The large and growing number of genome-wide datasets highlights the need for high-performance feature analysis and data comparison methods, in addition to efficient data storage and retrieval techniques. We introduce BEDOPS, a software suite for common genomic analysis tasks which offers improved flexibility, scalability and execution time characteristics over previously published packages. The suite includes a utility to compress large inputs into a lossless format that can provide greater space savings and faster data extractions than alternatives.

Supporting Information - Probing DNA shape and methylation state on a genomic scale with DNase I

Cell Culture and DNA Extraction. IMR90 human fetal pulmonary fibroblast cells (ATCC) were cultured in a 5% (vol/vol) CO₂ humidified incubator. Cells were passaged to 70% confluence, and harvested using 15 mL Accutase. Cell viability was confirmed using Trypan blue staining. DNA was extracted from 5 × 10⁶ cells using a 1:1 mixture of phenol-chloroform (phase lock, Eppendorf), and cleaned and concentrated using a minielute column (Qiagen).

DNase I hypersensitivity analysis of the mouse brain and retina identifies region-specific regulatory elements

Epigenetics & Chromatin, 2015

The brain, spinal cord, and neural retina comprise the central nervous system (CNS) of vertebrates. Understanding the regulatory mechanisms that underlie the enormous cell-type diversity of the CNS is a significant challenge. Whole-genome mapping of DNase I-hypersensitive sites (DHSs) has been used to identify cis-regulatory elements in many tissues. We have applied this approach to the mouse CNS, including developing and mature neural retina, whole brain, and two well-characterized brain regions, the cerebellum and the cerebral cortex. For the various regions and developmental stages of the CNS that we analyzed, there were approximately the same number of DHSs; however, there were many DHSs unique to each CNS region and developmental stage. Many of the DHSs are likely to mark enhancers that are specific to the specific CNS region and developmental stage. We validated the DNase I mapping approach for identification of CNS enhancers using the existing VISTA Browser database and with ...

Native Elongating Transcript Sequencing Reveals Human Transcriptional Activity at Nucleotide Resolution

Cell, Jan 23, 2015

Major features of transcription by human RNA polymerase II (Pol II) remain poorly defined due to a lack of quantitative approaches for visualizing Pol II progress at nucleotide resolution. We developed a simple and powerful approach for performing native elongating transcript sequencing (NET-seq) in human cells that globally maps strand-specific Pol II density at nucleotide resolution. NET-seq exposes a mode of antisense transcription that originates downstream and converges on transcription from the canonical promoter. Convergent transcription is associated with a distinctive chromatin configuration and is characteristic of lower-expressed genes. Integration of NET-seq with genomic footprinting data reveals stereotypic Pol II pausing coincident with transcription factor occupancy. Finally, exons retained in mature transcripts display Pol II pausing signatures that differ markedly from skipped exons, indicating an intrinsic capacity for Pol II to recognize exons with different proce...

103 Probing DNA shape and methylation state on a genomic scale with DNase I

Resolving the complexity of the human genome using single-molecule sequencing

Nature, 2014

The human genome is arguably the most complete mammalian reference assembly, yet more than 160 euchromatic gaps remain and aspects of its structural variation remain poorly understood ten years after its completion. To identify missing sequence and genetic variation, here we sequence and analyse a haploid human genome (CHM1) using single-molecule, real-time DNA sequencing. We close or extend 55% of the remaining interstitial gaps in the human GRCh37 reference genome--78% of which carried long runs of degenerate short tandem repeats, often several kilobases in length, embedded within (G+C)-rich genomic regions. We resolve the complete sequence of 26,079 euchromatic structural variants at the base-pair level, including inversions, complex insertions and long tracts of tandem repeats. Most have not been previously reported, with the greatest increases in sensitivity occurring for events less than 5 kilobases in size. Compared to the human reference, we find a significant insertional bias (3:1) in regions corresponding to complex insertions and long short tandem repeats. Our results suggest a greater complexity of the human genome in the form of variation of longer and more complex repetitive DNA that can now be largely resolved with the application of this longer-read sequencing technology.

Dynamic reprogramming of chromatin accessibility during Drosophila embryo development

Genome Biology, 2011

Background: The development of complex organisms is believed to involve progressive restrictions in cellular fate. Understanding the scope and features of chromatin dynamics during embryogenesis, and identifying regulatory elements important for directing developmental processes remain key goals of developmental biology.

Research paper thumbnail of An integrated encyclopedia of DNA elements in the human genome

Nature, 2012

The human genome encodes the blueprint of life, but the function of the vast majority of its near... more The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.

Research paper thumbnail of Erratum: Large-scale identification of sequence variants influencing human transcription factor occupancy in vivo

Research paper thumbnail of Genome Sequencing of Autism-Affected Families Reveals Disruption of Putative Noncoding Regulatory DNA

The American Journal of Human Genetics, 2015

We performed whole-genome sequencing (WGS) of 208 genomes from 53 families affected by simplex au... more We performed whole-genome sequencing (WGS) of 208 genomes from 53 families affected by simplex autism. For the majority of these families, no copy-number variant (CNV) or candidate de novo gene-disruptive single-nucleotide variant (SNV) had been detected by microarray or whole-exome sequencing (WES). We integrated multiple CNV and SNV analyses and extensive experimental validation to identify additional candidate mutations in eight families. We report that compared to control individuals, probands showed a significant (p = 0.03) enrichment of de novo and private disruptive mutations within fetal CNS DNase I hypersensitive sites (i.e., putative regulatory regions). This effect was only observed within 50 kb of genes that have been previously associated with autism risk, including genes where dosage sensitivity has already been established by recurrent disruptive de novo protein-coding mutations (ARID1B, SCN2A, NR3C2, PRKCA, and DSCAM). In addition, we provide evidence of gene-disruptive CNVs (in DISC1, WNT7A, RBFOX1, and MBD5), as well as smaller de novo CNVs and exon-specific SNVs missed by exome sequencing in neurodevelopmental genes (e.g., CANX, SAE1, and PIK3CA). Our results suggest that the detection of smaller, often multiple CNVs affecting putative regulatory elements might help explain additional risk of simplex autism.

Research paper thumbnail of DNase I hypersensitivity mapping, genomic footprinting, and transcription factor networks in plants

Current Plant Biology, 2015

Research paper thumbnail of Large-scale identification of sequence variants influencing human transcription factor occupancy in vivo

Nature genetics, Jan 26, 2015

The function of human regulatory regions depends exquisitely on their local genomic environment a... more The function of human regulatory regions depends exquisitely on their local genomic environment and on cellular context, complicating experimental analysis of common disease- and trait-associated variants that localize within regulatory DNA. We use allelically resolved genomic DNase I footprinting data encompassing 166 individuals and 114 cell types to identify >60,000 common variants that directly influence transcription factor occupancy and regulatory DNA accessibility in vivo. The unprecedented scale of these data enables systematic analysis of the impact of sequence variation on transcription factor occupancy in vivo. We leverage this analysis to develop accurate models of variation affecting the recognition sites for diverse transcription factors and apply these models to discriminate nearly 500,000 common regulatory variants likely to affect transcription factor occupancy across the human genome. The approach and results provide a new foundation for the analysis and interpr...

Research paper thumbnail of Large-scale identification of sequence variants influencing human transcription factor occupancy in vivo

Nature Genetics, 2015

The function of human regulatory regions depends exquisitely on their local genomic environment a... more The function of human regulatory regions depends exquisitely on their local genomic environment and on cellular context, complicating experimental analysis of common disease- and trait-associated variants that localize within regulatory DNA. We use allelically resolved genomic DNase I footprinting data encompassing 166 individuals and 114 cell types to identify >60,000 common variants that directly influence transcription factor occupancy and regulatory DNA accessibility in vivo. The unprecedented scale of these data enables systematic analysis of the impact of sequence variation on transcription factor occupancy in vivo. We leverage this analysis to develop accurate models of variation affecting the recognition sites for diverse transcription factors and apply these models to discriminate nearly 500,000 common regulatory variants likely to affect transcription factor occupancy across the human genome. The approach and results provide a new foundation for the analysis and interpretation of noncoding variation in complete human genomes and for systems-level investigation of disease-associated variants.

Research paper thumbnail of DNase I–hypersensitive exons colocalize with promoters and distal regulatory elements

Nature Genetics, 2013

The precise splicing of genes confers an enormous transcriptional complexity to the human genome.... more The precise splicing of genes confers an enormous transcriptional complexity to the human genome. The majority of gene splicing occurs cotranscriptionally, permitting epigenetic modifications to affect splicing outcomes. Here we show that select exonic regions are demarcated within the three-dimensional structure of the human genome. We identify a subset of exons that exhibit DNase I hypersensitivity and are accompanied by 'phantom' signals in chromatin immunoprecipitation and sequencing (ChIP-seq) that result from cross-linking with proximal promoter- or enhancer-bound factors. The capture of structural features by ChIP-seq is confirmed by chromatin interaction analysis that resolves local intragenic loops that fold exons close to cognate promoters while excluding intervening intronic sequences. These interactions of exons with promoters and enhancers are enriched for alternative splicing events, an effect reflected in cell type-specific periexonic DNase I hypersensitivity patterns. Collectively, our results connect local genome topography, chromatin structure and cis-regulatory landscapes with the generation of human transcriptional complexity by cotranscriptional splicing.

Research paper thumbnail of An expansive human regulatory lexicon encoded in transcription factor footprints

Nature, 2012

Regulatory factor binding to genomic DNA protects the underlying sequence from cleavage by DNaseI... more Regulatory factor binding to genomic DNA protects the underlying sequence from cleavage by DNaseI, leaving nucleotide-resolution footprints. Using genomic DNaseI footprinting across 41 diverse cell and tissue types, we detected 45 million factor occupancy events within regulatory regions, representing differential binding to 8.4 million distinct short sequence elements. Here we show that this small genomic sequence compartment, roughly twice the size of the exome, encodes an expansive repertoire of conserved recognition sequences for DNA-binding proteins that nearly doubles the size of the human cis-regulatory lexicon. We find that genetic variants affecting allelic chromatin states are concentrated in footprints, and that these elements are preferentially sheltered from DNA methylation. High-resolution DNaseI cleavage patterns mirror nucleotide-level evolutionary conservation and track the crystallographic topography of protein-DNA interfaces, indicating that transcription factor structure has been evolutionarily imprinted on the human genome sequence. We identify a stereotyped 50 base-pair footprint that precisely defines the site of transcript origination within thousands of human promoters. Finally, we describe a large collection of novel regulatory factor recognition motifs that are highly conserved in both sequence and function, and exhibit cell-selective occupancy patterns that closely parallel major regulators of development, differentiation, and pluripotency.

The accessible chromatin landscape of the human genome

Nature, 2012

DNase I hypersensitive sites (DHSs) are markers of regulatory DNA and have underpinned the discovery of all classes of cis-regulatory elements including enhancers, promoters, insulators, silencers and locus control regions. Here we present the first extensive map of human DHSs identified through genome-wide profiling in 125 diverse cell and tissue types. We identify ∼2.9 million DHSs that encompass virtually all known experimentally validated cis-regulatory sequences and expose a vast trove of novel elements, most with highly cell-selective regulation. Annotating these elements using ENCODE data reveals novel relationships between chromatin accessibility, transcription, DNA methylation and regulatory factor occupancy patterns. We connect ∼580,000 distal DHSs with their target promoters, revealing systematic pairing of different classes of distal DHSs and specific promoter types. Patterning of chromatin accessibility at many regulatory regions is organized with dozens to hundreds of co-activated elements, and the transcellular DNase I sensitivity pattern at a given region can predict cell-type-specific functional behaviours. The DHS landscape shows signatures of recent functional evolutionary constraint. However, the DHS compartment in pluripotent and immortalized cells exhibits higher mutation rates than that in highly differentiated cells, exposing an unexpected link between chromatin accessibility, proliferative potential and patterns of human variation.

Late-replicating heterochromatin is characterized by decreased cytosine methylation in the human genome

Genome Research, 2011

Heterochromatin is believed to be associated with increased levels of cytosine methylation. With the recent availability of genome-wide, high-resolution molecular data reflecting chromatin organization and methylation, such relationships can be explored systematically. As well-defined surrogates for heterochromatin, we tested the relationship between DNA replication timing and DNase hypersensitivity with cytosine methylation in two human cell types, unexpectedly finding the later-replicating, more heterochromatic regions to be less methylated than early replicating regions. When we integrated gene-expression data into the study, we found that regions of increased gene expression were earlier replicating, as previously identified, and that transcription-targeted cytosine methylation in gene bodies contributes to the positive correlation with early replication. A self-organizing map (SOM) approach was able to identify genomic regions with early replication and increased methylation, but lacking annotated transcripts, loci missed in simple two variable analyses, possibly encoding unrecognized intergenic transcripts. We conclude that the relationship of cytosine methylation with heterochromatin is not simple and depends on whether the genomic context is tandemly repetitive sequences often found near centromeres, which are known to be heterochromatic and methylated, or the remaining majority of the genome, where cytosine methylation is targeted preferentially to the transcriptionally active, euchromatic compartment of the genome.

Zebrafish globin switching occurs in two developmental stages and is controlled by the LCR

Developmental Biology, 2012

Globin gene switching is a complex, highly regulated process allowing expression of distinct globin genes at specific developmental stages. Here, for the first time, we have characterized all of the zebrafish globins based on the completed genomic sequence. Two distinct chromosomal loci, termed major (chromosome 3) and minor (chromosome 12), harbor the globin genes containing α/β pairs in a 5′-3′ to 3′-5′ orientation. Both these loci share synteny with the mammalian α-globin locus. Zebrafish globin expression was assayed during development and demonstrated two globin switches, similar to human development. A conserved regulatory element, the locus control region (LCR), was revealed by analyzing DNase I hypersensitive sites, H3K4 trimethylation marks and GATA1 binding sites. Surprisingly, the position of these sites with relation to the globin genes is evolutionarily conserved, despite a lack of overall sequence conservation. Motifs within the zebrafish LCR include CACCC, GATA, and NFE2 sites, suggesting functional interactions with known transcription factors but not the same LCR architecture. Functional homology to the mammalian α-LCR MCS-R2 region was confirmed by robust and specific reporter expression in erythrocytes of transgenic zebrafish. Our studies provide a comprehensive characterization of the zebrafish globin loci and clarify the regulation of globin switching.

Developmental Fate and Cellular Maturity Encoded in Human Regulatory DNA Landscapes

Cell, 2013

Cellular-state information between generations of developing cells may be propagated via regulatory regions. We report consistent patterns of gain and loss of DNase I-hypersensitive sites (DHSs) as cells progress from embryonic stem cells (ESCs) to terminal fates. DHS patterns alone convey rich information about cell fate and lineage relationships distinct from information conveyed by gene expression. Developing cells share a proportion of their DHS landscapes with ESCs; that proportion decreases continuously in each cell type as differentiation progresses, providing a quantitative benchmark of developmental maturity. Developmentally stable DHSs densely encode binding sites for transcription factors involved in autoregulatory feedback circuits. In contrast to normal cells, cancer cells extensively reactivate silenced ESC DHSs and those from developmental programs external to the cell lineage from which the malignancy derives. Our results point to changes in regulatory DNA landscapes as quantitative indicators of cell-fate transitions, lineage relationships, and dysfunction.

Comprehensive characterization of erythroid-specific enhancers in the genomic regions of human Kruppel-like factors

BMC Genomics, 2013

Background: Mapping of DNase I hypersensitive sites (DHSs) is a powerful tool to experimentally identify cis-regulatory elements (CREs). Among CREs, enhancers are abundant and predominantly act in driving cell-specific gene expression. Krüppel-like factors (KLFs) are a family of eukaryotic transcription factors. Several KLFs have been demonstrated to play important roles in hematopoiesis. However, transcriptional regulation of KLFs via CREs, particularly enhancers, in erythroid cells has been poorly understood. Results: In this study, 23 erythroid-specific or putative erythroid-specific DHSs were identified by DNase-seq in the genomic regions of 17 human KLFs, and their enhancer activities were evaluated using dual-luciferase reporter (DLR) assay. Of the 23 erythroid-specific DHSs, the enhancer activities of 15 DHSs were comparable to that of the classical enhancer HS2 in driving minimal promoter (minP). Fifteen DHSs, some overlapping those that increased minP activities, acted as enhancers when driving the corresponding KLF promoters (KLF-Ps) in erythroid cells; of these, 10 DHSs were finally characterized as erythroid-specific KLF enhancers. These 10 erythroid-specific KLF enhancers were further confirmed using chromatin immunoprecipitation coupled to sequencing (ChIP-seq) data-based bioinformatic and biochemical analyses.

BEDOPS: high-performance genomic feature operations

Bioinformatics, 2012

The large and growing number of genome-wide datasets highlights the need for high-performance feature analysis and data comparison methods, in addition to efficient data storage and retrieval techniques. We introduce BEDOPS, a software suite for common genomic analysis tasks which offers improved flexibility, scalability and execution time characteristics over previously published packages. The suite includes a utility to compress large inputs into a lossless format that can provide greater space savings and faster data extractions than alternatives.

Supporting Information - Probing DNA shape and methylation state on a genomic scale with DNase I

Cell Culture and DNA Extraction. IMR90 human fetal pulmonary fibroblast cells (ATCC) were cultured in a 5% (vol/vol) CO₂ humidified incubator. Cells were passaged to 70% confluence, and harvested using 15 mL Accutase. Cell viability was confirmed using Trypan blue staining. DNA was extracted from 5 × 10⁶ cells using a 1:1 mixture of phenol-chloroform (phase lock, Eppendorf), and cleaned and concentrated using a minielute column (Qiagen).

DNase I hypersensitivity analysis of the mouse brain and retina identifies region-specific regulatory elements

Epigenetics & chromatin, 2015

The brain, spinal cord, and neural retina comprise the central nervous system (CNS) of vertebrates. Understanding the regulatory mechanisms that underlie the enormous cell-type diversity of the CNS is a significant challenge. Whole-genome mapping of DNase I-hypersensitive sites (DHSs) has been used to identify cis-regulatory elements in many tissues. We have applied this approach to the mouse CNS, including developing and mature neural retina, whole brain, and two well-characterized brain regions, the cerebellum and the cerebral cortex. For the various regions and developmental stages of the CNS that we analyzed, there were approximately the same number of DHSs; however, there were many DHSs unique to each CNS region and developmental stage. Many of the DHSs are likely to mark enhancers that are specific to the specific CNS region and developmental stage. We validated the DNase I mapping approach for identification of CNS enhancers using the existing VISTA Browser database and with ...

Native Elongating Transcript Sequencing Reveals Human Transcriptional Activity at Nucleotide Resolution

Cell, Jan 23, 2015

Major features of transcription by human RNA polymerase II (Pol II) remain poorly defined due to a lack of quantitative approaches for visualizing Pol II progress at nucleotide resolution. We developed a simple and powerful approach for performing native elongating transcript sequencing (NET-seq) in human cells that globally maps strand-specific Pol II density at nucleotide resolution. NET-seq exposes a mode of antisense transcription that originates downstream and converges on transcription from the canonical promoter. Convergent transcription is associated with a distinctive chromatin configuration and is characteristic of lower-expressed genes. Integration of NET-seq with genomic footprinting data reveals stereotypic Pol II pausing coincident with transcription factor occupancy. Finally, exons retained in mature transcripts display Pol II pausing signatures that differ markedly from skipped exons, indicating an intrinsic capacity for Pol II to recognize exons with different proce...

103 Probing DNA shape and methylation state on a genomic scale with DNase I

Resolving the complexity of the human genome using single-molecule sequencing

Nature, 2014

The human genome is arguably the most complete mammalian reference assembly, yet more than 160 euchromatic gaps remain and aspects of its structural variation remain poorly understood ten years after its completion. To identify missing sequence and genetic variation, here we sequence and analyse a haploid human genome (CHM1) using single-molecule, real-time DNA sequencing. We close or extend 55% of the remaining interstitial gaps in the human GRCh37 reference genome--78% of which carried long runs of degenerate short tandem repeats, often several kilobases in length, embedded within (G+C)-rich genomic regions. We resolve the complete sequence of 26,079 euchromatic structural variants at the base-pair level, including inversions, complex insertions and long tracts of tandem repeats. Most have not been previously reported, with the greatest increases in sensitivity occurring for events less than 5 kilobases in size. Compared to the human reference, we find a significant insertional bias (3:1) in regions corresponding to complex insertions and long short tandem repeats. Our results suggest a greater complexity of the human genome in the form of variation of longer and more complex repetitive DNA that can now be largely resolved with the application of this longer-read sequencing technology.

Dynamic reprogramming of chromatin accessibility during Drosophila embryo development

Genome Biology, 2011

Background: The development of complex organisms is believed to involve progressive restrictions in cellular fate. Understanding the scope and features of chromatin dynamics during embryogenesis, and identifying regulatory elements important for directing developmental processes remain key goals of developmental biology.