Discovery and characterization of chromatin states for systematic annotation of the human genome

Jason Ernst et al. Nat Biotechnol. 2010 Aug.

Abstract

A plethora of epigenetic modifications have been described in the human genome and shown to play diverse roles in gene regulation, cellular differentiation and the onset of disease. Although individual modifications have been linked to the activity levels of various genetic functional elements, their combinatorial patterns are still unresolved and their potential for systematic de novo genome annotation remains untapped. Here, we use a multivariate Hidden Markov Model to reveal 'chromatin states' in human T cells, based on recurrent and spatially coherent combinations of chromatin marks. We define 51 distinct chromatin states, including promoter-associated, transcription-associated, active intergenic, large-scale repressed and repeat-associated states. Each chromatin state shows specific enrichments in functional annotations, sequence motifs and specific experimentally observed characteristics, suggesting distinct biological roles. This approach provides a complementary functional annotation of the human genome that reveals the genome-wide locations of diverse classes of epigenetic function.

Figures

Figure 1. Example of chromatin state annotation

Input chromatin mark information and resulting chromatin state annotation for a 120-kb region of human chromosome 7 surrounding the CAPZA2 gene. For each 200-bp interval, the input ChIP-Seq sequence tag count (black bars) is processed into a binary presence/absence call for each of 18 acetylation marks (light blue), 20 methylation marks (pink), and CTCF/Pol2/H2AZ (brown). The precise combination of these marks in each interval in their spatial context is used to infer the most probable chromatin state assignment (colored boxes). Although chromatin states were learned independently of any prior genome annotation, they correlate strongly with upstream and downstream promoters (red), 5′-proximal and distal transcribed regions (purple), active intergenic regions (yellow), repressed (grey) and repetitive (blue) regions (state descriptions shown in Supplementary Table 1). This example illustrates that even when the signal coming from chromatin marks is noisy, the resulting chromatin state annotation is very robust, directly interpretable, and shows a strong correspondence with the gene annotation. Several spatially coherent transitions are seen from large-scale repressed to active intergenic regions near active genes, from upstream to downstream promoter states surrounding the TSS, and from 5′-proximal to distal transcribed regions along the body of the gene. The frequent transitions to state 16 correlate with annotated Alu elements (57% overlap vs. 4% and 25% for states 13 and 15 respectively). Transitions to state 13 are likely due to enhancer elements in the first intron of CAPZA2, a region where regulatory elements are commonly found, and correlate with several enhancer marks. While maximum-probability state assignments are shown here, the full posterior probability for each state in this region is shown in Supplementary Figure 2.
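The binarization step described in the caption can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes a Poisson background whose rate is the mean tag count over the region, and the `p_threshold` value, the function names, and the toy counts are all hypothetical.

```python
import math

def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)

def binarize(counts, p_threshold=1e-4):
    """Turn per-interval tag counts into 0/1 presence calls.

    Assumption: the background rate is the mean count across the
    supplied intervals; an interval is called present when its count
    is unlikely under that Poisson background.
    """
    lam = sum(counts) / len(counts)
    return [1 if poisson_sf(c, lam) < p_threshold else 0 for c in counts]

# Toy tag counts for ten consecutive 200-bp intervals (made-up data).
calls = binarize([0, 1, 0, 2, 1, 0, 30, 1, 0, 1])
print(calls)
```

Only the interval with 30 tags stands out against the background here, so noisy low counts collapse to absence calls.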

Figure 2. Chromatin state definition and functional interpretation

a. Chromatin mark combinations associated with each state. Each row shows the specific combination of marks associated with each chromatin state, and the frequencies between 0 and 1 with which they occur (color scale). These correspond to the emission probability parameters of the Hidden Markov Model (HMM) learned across the genome during model training (values shown in Supplementary Fig. 2). Marks and states colored as in Figure 1. b. Genomic and functional enrichments of chromatin states. % denotes percentage, xF denotes fold enrichment. In order, the columns are: percentage of the genome assigned to the state; percentage of state that overlaps a 200-bp interval within 2kb of an annotated RefSeq Transcription Start Site (TSS); percentage of RefSeq TSS found in the state; fold enrichment for TSS; percentage of state overlapping a RefSeq transcribed region; average expression level of genomic intervals overlapping the state; fold enrichment for ZNF-named genes; fold enrichment for RefSeq 5′ Untranslated Region (5′-UTR) exons and introns; fold enrichment for RefSeq exons; fold enrichment for spliced exons (2nd exon or later); fold enrichment for RefSeq 3′ Untranslated Region (3′-UTR) exons and introns; fold enrichment for RefSeq transcription end sites (TES); fold enrichment for PhastCons conserved elements; fold enrichment for DNaseI hypersensitive sites; median fold enrichment for transcription factor binding sites over a set of experiments (expanded in Supplementary Fig. 23); fold enrichment for CpG islands; percentage of GC nucleotides; percentage overlapping experimental nuclear lamina data; percentage overlapping a RepeatMasker element (expanded in Supplementary Fig. 31). All enrichments are based on the posterior probability assignments. Genome total indicates the total percentage of 200-bp intervals intersecting the feature, or the genome average for expression and %GC. c. Brief biological state description and interpretation (‘chr’: chromatin, ‘enh’: enhancer, full descriptions in Supplementary Table 1).
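To make the role of the emission parameters in panel a concrete, here is a small sketch of how per-state mark frequencies could yield a probability for an observed mark combination, and how a forward-backward pass turns those into per-interval posteriors. It assumes marks are independent Bernoulli variables given the state; the two-state model, its parameters, and the function names are invented for illustration and are not taken from the paper.

```python
def emission_prob(observed, freqs):
    """Probability of a binary mark vector under one state, assuming
    each mark is an independent Bernoulli variable given the state."""
    p = 1.0
    for x, f in zip(observed, freqs):
        p *= f if x else (1.0 - f)
    return p

def posterior(obs_seq, init, trans, emit):
    """Scaled forward-backward pass; returns one posterior
    distribution over states per interval."""
    n, k = len(obs_seq), len(init)
    e = [[emission_prob(o, emit[s]) for s in range(k)] for o in obs_seq]

    def normalize(v):
        z = sum(v)
        return [x / z for x in v]

    fwd = [normalize([init[s] * e[0][s] for s in range(k)])]
    for t in range(1, n):
        fwd.append(normalize([
            e[t][s] * sum(fwd[t - 1][r] * trans[r][s] for r in range(k))
            for s in range(k)
        ]))

    bwd = [[1.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        bwd[t] = normalize([
            sum(trans[s][r] * e[t + 1][r] * bwd[t + 1][r] for r in range(k))
            for s in range(k)
        ])

    return [normalize([fwd[t][s] * bwd[t][s] for s in range(k)])
            for t in range(n)]

# Hypothetical model: state 0 is "marked", state 1 is "unmarked".
emit = [[0.9, 0.8], [0.05, 0.05]]      # mark frequencies per state
trans = [[0.9, 0.1], [0.1, 0.9]]       # sticky transitions
init = [0.5, 0.5]
obs = [[0, 0], [1, 1], [1, 1], [0, 0]]  # binary calls for two marks
post = posterior(obs, init, trans, emit)
```

Each row of `post` sums to 1; taking the largest entry per row gives a maximum-probability assignment of the kind displayed in Figure 1.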

Figure 3. Promoter and transcribed chromatin states show distinct functional and positional enrichments

a. Distinct Gene Ontology (GO) functional enrichments (fold and corrected p-values) found for genes associated with different promoter states at their Transcription Start Site (TSS). For additional states and GO terms, see Supplementary Figure 29. b. Distinct positional biases of promoter states with respect to nearest RefSeq TSS distinguish states peaking upstream, only downstream, and centered at the TSS. c. Positional biases of Transcribed States with respect to TSS, nearest spliced exon start, and Transcription End Sites (TES). These distinguish 5′-proximal states (12–23, left panel), 5′-distal states (24–28), states strongly enriched for spliced exons (middle panel, see also Supplementary Fig. 24 for plot for States 24–28), and TES-associated states (with state 27 being particularly precisely positioned, right panel).

Figure 4. SNP and GWAS enrichments for chromatin states

a. Several chromatin states show enrichments for disease association datasets. For each state is shown: genome percentage; fold enrichment for SNPs from the HapMap CEU population; fold enrichment from a collection of 1640 genome-wide association study (GWAS) Single Nucleotide Polymorphisms (SNPs) associated with a variety of diseases and traits from numerous studies (Hindorff et al, 2009); fold enrichment of GWAS SNPs relative to the HapMap CEU SNP enrichment; significance of GWAS SNPs relative to the underlying SNP frequency (when the corrected p-value <0.01). b. Example of intergenic SNP in GWAS-enriched state 33, found 40kb downstream of the IKZF2 gene and associated with plasma eosinophil count levels. SNP significance as reported in the supplement of (Gudbjartsson, et al, 2009) is shown for each SNP in the region (blue circles) and associated chromatin state annotation (similar to Fig. 1). Red circle denotes top SNP and its overlap with state 33. In addition to top SNPs, secondary SNPs were also frequently found at or near GWAS-enriched states in several cases.
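The enrichment columns in panel a can be read with the following arithmetic in mind. This is a sketch under the usual definition of fold enrichment (fraction of a feature falling in a state divided by the fraction of the genome in that state); the function name and all numbers are hypothetical and are not values from the paper.

```python
def fold_enrichment(feature_in_state, feature_total, state_fraction):
    """Fraction of the feature found in the state, divided by the
    fraction of the genome that the state covers."""
    return (feature_in_state / feature_total) / state_fraction

# Made-up example: a state covering 0.5% of the genome.
state_fraction = 0.005
snp_fold = fold_enrichment(12_000, 2_000_000, state_fraction)   # background SNPs
gwas_fold = fold_enrichment(24, 1_600, state_fraction)          # GWAS SNPs

# GWAS enrichment relative to the underlying SNP enrichment.
relative_fold = gwas_fold / snp_fold
print(round(snp_fold, 2), round(gwas_fold, 2), round(relative_fold, 2))
```

Dividing by the background SNP enrichment separates a genuine excess of trait-associated variants from a state that simply contains more common SNPs.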

Figure 5. Discovery power of chromatin states for genome annotation

a. Comparison of the power to discover transcription start sites (TSS) for individual chromatin marks (red), chromatin states (blue) ordered by their TSS enrichment, and a directed experimental approach based on CAGE sequence tag read counts from all available cell types (gold); the chromatin states and marks use data from CD4 T cells only. Both chromatin states and CAGE tags are compared using a Receiver Operating Characteristic (ROC) curve that shows the false positive (x-axis) and true positive (y-axis) rates at varying prediction thresholds in the task of predicting whether a 200-bp interval intersects a RefSeq TSS. Thin red curve compares performance of the H3K4me3 mark at varying intensity thresholds. b. Comparison of the power to detect RefSeq transcribed regions for chromatin states and marks as in a, and directed experimental information coming from Expressed Sequence Tag (EST) data (gold) based on sequence counts from all available cell types. c. Independent experimental and comparative information provides support that a significant fraction of ‘false positives’ in panels a and b are genuine novel unannotated TSS and transcribed regions currently missing from RefSeq. Percentage of each state supported by a CAGE tag (column 1), and the same percentage for locations at least 2kb away from a RefSeq TSS (column 2), suggests that many promoter-associated states outside RefSeq promoters are supported by CAGE tag evidence. Similarly, percentage of each state overlapping a GenBank mRNA (column 3), and the same percentage specifically outside RefSeq genes (column 4), suggest that transcription-associated states outside RefSeq genes are supported by mRNA evidence. Similar support is found by GenBank Expressed Sequence Tags (ESTs) and evolutionarily conserved predicted new exons (Supplementary Fig. 33).
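For readers unfamiliar with ROC curves, the comparison in panels a and b can be sketched as below: each 200-bp interval gets a score (for example a tag count or a state ranking) and a label saying whether it intersects an annotated TSS. The scores, labels, and function names here are toy assumptions, not data from the paper.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs obtained
    by lowering the threshold through every distinct score."""
    pos = sum(labels)
    neg = len(labels) - pos
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        j = i
        # Intervals sharing a score cross the threshold together.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            tp += pairs[j][1]
            fp += 1 - pairs[j][1]
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy example: 1 = interval intersects a TSS, 0 = it does not.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
curve = roc_points(scores, labels)
area = auc(curve)
```

A predictor whose curve rises faster toward the top-left corner recovers more true TSS intervals at the same false positive rate.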

Figure 6. Recovery of chromatin states with subsets of marks

a. The figure shows the ordering of marks based on a greedy forward selection algorithm to optimize a squared error penalty on state mis-assignments (Online Methods). Conditioned on all the marks to the left having already been profiled, the mark listed is the optimal selection for one additional mark to be profiled based on the target optimization function. Below each mark is the percentage of a state with identical assignments using the subset of marks. b. Comparison of the percentage of each state recovered between the first 10 marks based on the greedy method and the 10 marks used in (Cui et al, 2009) (Supplementary Fig. 39). The two columns after the state IDs are the proportion of the states recovered using the greedy algorithm and the set used in (Cui et al, 2009). c. The figure shows a progressive decrease in squared error for state mis-assignment as a function of the number of marks selected based on the greedy algorithm.
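The greedy forward selection in panel a can be illustrated with a simplified stand-in. The sketch below is not the paper's objective: instead of the squared error penalty from the Online Methods, it uses the fraction of intervals whose state assignment differs from the all-marks assignment, and it assigns states from emission probabilities alone, ignoring HMM transitions. The three-state, three-mark model and all names are hypothetical.

```python
def assign(obs, emit, marks):
    """Most likely state per interval using only the marks in `marks`
    (independent Bernoulli emissions, no transition model)."""
    out = []
    for o in obs:
        best, best_p = 0, -1.0
        for s, freqs in enumerate(emit):
            p = 1.0
            for m in marks:
                p *= freqs[m] if o[m] else 1.0 - freqs[m]
            if p > best_p:
                best, best_p = s, p
        out.append(best)
    return out

def greedy_order(obs, emit):
    """Add one mark at a time, each time picking the mark that best
    reproduces the assignments obtained with all marks."""
    n_marks = len(emit[0])
    target = assign(obs, emit, list(range(n_marks)))
    chosen, remaining, errors = [], set(range(n_marks)), []

    def mismatch(marks):
        got = assign(obs, emit, marks)
        return sum(g != t for g, t in zip(got, target)) / len(target)

    while remaining:
        # Ties are broken in favor of the lowest mark index.
        best = min(sorted(remaining), key=lambda m: mismatch(chosen + [m]))
        errors.append(mismatch(chosen + [best]))
        chosen.append(best)
        remaining.remove(best)
    return chosen, errors

# Toy model: mark 1 is uninformative (0.5 in every state).
emit = [[0.9, 0.5, 0.1], [0.1, 0.5, 0.9], [0.1, 0.5, 0.1]]
obs = [[1, 0, 0], [0, 1, 1], [0, 0, 0], [1, 1, 0]]
order, errors = greedy_order(obs, emit)
```

In this toy case the uninformative mark is selected last and the mismatch falls to zero once the two informative marks are in, mirroring the decreasing error curve described in panel c.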
