ISMARA: automated modeling of genomic signals as a democracy of regulatory motifs

Piotr J Balwierz et al. Genome Res. 2014 May.

Abstract

Accurate reconstruction of the regulatory networks that control gene expression is one of the key current challenges in molecular biology. Although gene expression and chromatin state dynamics are ultimately encoded by constellations of binding sites recognized by regulators such as transcription factors (TFs) and microRNAs (miRNAs), our understanding of this regulatory code and its context-dependent read-out remains very limited. Given that there are thousands of potential regulators in mammals, it is not practical to use direct experimentation to identify which of these play a key role for a particular system of interest. We developed a methodology that models gene expression or chromatin modifications in terms of genome-wide predictions of regulatory sites and completely automated it into a web-based tool called ISMARA (Integrated System for Motif Activity Response Analysis). Given only gene expression or chromatin state data across a set of samples as input, ISMARA identifies the key TFs and miRNAs driving expression/chromatin changes and makes detailed predictions regarding their regulatory roles. These include predicted activities of the regulators across the samples, their genome-wide targets, enriched gene categories among the targets, and direct interactions between the regulators. Applying ISMARA to data sets from well-studied systems, we show that it consistently identifies known key regulators ab initio. We also present a number of novel predictions, including regulatory interactions in innate immunity, a master regulator of mucociliary differentiation, TFs consistently dysregulated in cancer, and TFs that mediate specific chromatin modifications.

Figures

Figure 1.

Outline of the Integrated System for Motif Activity Response Analysis. (A) ISMARA starts from a curated genome-wide collection of promoters and their associated transcripts. Using a comparative genomic Bayesian methodology (Arnold et al. 2012a), transcription factor binding sites (TFBSs) for ∼200 regulatory motifs are predicted in proximal promoters. Similarly, miRNA target sites for ∼100 seed families are annotated in the 3′ UTRs of transcripts associated with each promoter (Friedman et al. 2009). (B) Users provide measurements of gene expression (microarray, RNA-seq) or chromatin state (ChIP-seq). The raw data are processed automatically, and a signal is calculated for each promoter in each sample. For ChIP-seq data, the signal is calculated from the read density in a region around the transcription start. For gene expression data, the signal is calculated from read densities across the associated transcripts (RNA-seq) or intensities of associated probes (microarray). (C) The site predictions and measured signals are summarized in two large matrices. The components N_pm of matrix N contain the total number of sites for motif m (TF or miRNA) associated with promoter p. The components E_ps of matrix E contain the signal associated with promoter p in sample s. (D) The linear MARA model is used to explain the signal levels E_ps in terms of binding sites N_pm and unknown motif activities A_ms, which are inferred by the model. The constant c_p and a corresponding per-sample constant represent basal levels for each promoter and each sample, respectively. (E) As output, ISMARA provides the inferred motif activity profiles A_ms of all motifs across the samples, s, sorted by the significance of the motifs. A sorted list of all predicted target promoters is provided for each motif, together with the network of known interactions between these targets (provided by the STRING database, http://string-db.org/) and a list of Gene Ontology categories that are enriched among the predicted targets. Finally, for each motif, a local network of predicted direct regulatory interactions with other regulators is provided.
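
The linear model outlined in panel D can be illustrated with a small numerical sketch. It assumes the additive form E_ps ≈ c_p + (per-sample constant) + Σ_m N_pm A_ms, and it swaps in a plain ridge-regularized least-squares fit for whatever inference procedure ISMARA actually uses. The function name `fit_motif_activities` and the `ridge` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np

def fit_motif_activities(E, N, ridge=1.0):
    """Estimate motif activities A (motifs x samples) from a signal matrix
    E (promoters x samples) and a site-count matrix N (promoters x motifs).

    Illustrative ridge-regression sketch, not the ISMARA implementation.
    """
    E = np.asarray(E, dtype=float)
    N = np.asarray(N, dtype=float)

    # Double-centering E removes any additive per-promoter and per-sample
    # constants, so they never need to be estimated explicitly.
    E_c = (E - E.mean(axis=0, keepdims=True)
             - E.mean(axis=1, keepdims=True) + E.mean())

    # Centering the site counts over promoters matches the centering of E.
    N_c = N - N.mean(axis=0, keepdims=True)

    # Ridge solution: A = (N_c^T N_c + ridge * I)^(-1) N_c^T E_c.
    n_motifs = N.shape[1]
    gram = N_c.T @ N_c + ridge * np.eye(n_motifs)
    return np.linalg.solve(gram, N_c.T @ E_c)
```

Because of the centering, the activities returned by this sketch sum to zero across samples for every motif, so they are best read as changes relative to a motif's average activity rather than as absolute levels.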

Figure 2.

Results for the Illumina Body Map 2. Each panel corresponds to a motif (indicated with name and sequence logo) and shows the inferred motif activities across the 16 tissues (activities with error bars in panels A and C, and activity _Z_-values in panels B and D). Tables show Gene Ontology categories enriched among predicted targets of each motif, and individual target promoters (D). The networks (B,C) show direct regulatory interactions between the motif and other regulators. (A) Red and black curves correspond to motif activities from two replicate measurements. The inset shows the correlation between motif activity and HNF1A mRNA levels. (B) The inset shows that MYB is predicted to directly target the RFX4 promoter with target score 8.134. (C) The regulatory network inset and GO table show that hsa-miR-124/hsa-miR-506 is predicted to directly target many TFs. (D) The red bars show _Z_-values of the average motif activity of the SREBF motif for samples coming from older (age 58–86) and younger (age 19–47) donors.

Figure 3.

Analysis of an inflammatory response time series of human umbilical vein endothelial cells responding to TNF. (A) Time-dependent activities of the three most significant motifs, i.e., NFKB1/REL/RELA (red), IRF1/2/7 (black), and XBP1 (blue). Error bars denote standard deviations of the inferred activities. (B) Summary of the inferred core regulatory network. Selected top motifs are shown together with interactions between them and pathways/functional categories that are enriched among the targets of these motifs. The intensity of the color corresponds to the _Z_-score of the motif; its time-dependent activity is indicated inside the node, and the thickness of each edge corresponds to its target score S_pm.

Figure 4.

Mucociliary differentiation. (A) Inferred RFX motif activity profile in mucociliary differentiation of bronchial epithelial cells from three independent donors (black, red, and blue lines). (B) Key predicted regulators and their targets in this system. Selected top motifs are shown together with predicted interactions between them and pathways/functional categories that are enriched among predicted targets of these motifs. The intensity of the color corresponds to the _Z_-score of the motif; its time-dependent activity for each donor is indicated inside the node, and the thickness of each edge corresponds to its target score S_pm. (C) mRNA expression profiles of the RFX2 (solid) and RFX3 (dashed) genes across the differentiation (colors of the donors as in A).

Figure 5.

ISMARA predicts TFs involved in recruiting specific chromatin marks. (A) Activity across cell types of the SNAI1..3 motif in explaining expression (black), and levels of the chromatin marks H3K4me3 (dark green), H3K4me2 (light green), H3K9ac (dark blue), H3K27ac (light blue), and H3K36me3 (brown). (B) First principal component explaining the majority of variation in chromatin mark levels across all cell types. The bars indicate the relative contributions to the principal component of each mark. (C) Motif activities of the SNAI1..3 motif, as in A, but after removal of the first principal component. (D) _Z_-values and specificities (see text) of motifs for explaining H3K27me3 levels. The REST motif, with both highest _Z_-value and highest specificity, is indicated in red. (E) As in D, for H3K9ac levels. The two most significant motifs are shown in red. (F) As in D and E, for H3K27ac levels. (G) Activity, after removal of the first principal component, of the RFX motif for explaining H3K9ac (dark blue) and H3K27ac (light blue) levels. (H) As in G, for the ATF5_CREB motif.
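
The "removal of the first principal component" mentioned in panels C, G, and H can be sketched as follows. This is a generic SVD-based version under stated assumptions: the caption does not specify which matrix is decomposed, so the layout here (rows as observations such as cell types, columns as chromatin marks) and the function name `remove_first_pc` are hypothetical.

```python
import numpy as np

def remove_first_pc(X):
    """Subtract the first principal component from X (observations x marks).

    Returns the residual matrix, the unit-length loading vector of the
    first component, and the fraction of variance that component explains.
    Generic PCA sketch, not the authors' exact procedure.
    """
    X = np.asarray(X, dtype=float)

    # PCA operates on column-centered data.
    X_c = X - X.mean(axis=0, keepdims=True)

    # Rows of Vt are the principal axes, ordered by singular value.
    U, s, Vt = np.linalg.svd(X_c, full_matrices=False)
    pc1 = Vt[0]

    # Project each row onto pc1 and subtract that rank-one part.
    residual = X_c - np.outer(X_c @ pc1, pc1)

    # Assumes X is not constant, so the total variance is nonzero.
    explained = s[0] ** 2 / np.sum(s ** 2)
    return residual, pc1, explained
```

In this sketch the entries of `pc1` play the role of the per-mark contributions shown as bars in panel B, and `residual` holds the variation that remains once the shared component has been taken out.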

References

    1. Aceto N, Sausgruber N, Brinkhaus H, Gaidatzis D, Martiny-Baron G, Mazzarol G, Confalonieri S, Quarto M, Hu G, Balwierz PJ, et al. 2012. Tyrosine phosphatase SHP2 promotes breast cancer progression and maintains tumor-initiating cells via activation of key transcription factors and a positive feedback signaling loop. Nat Med 18: 529–537.
    2. Anders S, Huber W 2010. Differential expression analysis for sequence count data. Genome Biol 11: R106.
    3. Arner E, Mejhert N, Kulyté A, Balwierz PJ, Pachkov M, Cormont M, Lorente-Cebrián S, Ehrlund A, Laurencikiene J, Hedén P, et al. 2012. Adipose tissue microRNAs as regulators of CCL2 production in human obesity. Diabetes 61: 1986–1993.
    4. Arnold P, Erb I, Pachkov M, Molina N, van Nimwegen E 2012a. MotEvo: integrated Bayesian probabilistic methods for inferring regulatory sites and motifs on multiple alignments of DNA sequences. Bioinformatics 28: 487–494.
    5. Arnold P, Schöler A, Pachkov M, Balwierz P, Jørgensen H, Stadler MB, van Nimwegen E, Schübeler D 2012b. Modeling of epigenome dynamics identifies transcription factors that mediate Polycomb targeting. Genome Res 23: 60–73.
