MAGIC: A tool for predicting transcription factors and cofactors driving gene sets using ENCODE data

Avtar Roopra. PLoS Comput Biol. 2020.

Abstract

Transcriptomic profiling is an immensely powerful hypothesis-generating tool. However, accurately predicting the transcription factors (TFs) and cofactors that drive transcriptomic differences between samples is challenging. A number of algorithms draw on ChIP-seq tracks to identify the TFs and cofactors behind gene changes. These approaches assign TFs and cofactors to genes via a binary designation of 'target' or 'non-target', followed by Fisher exact tests to assess enrichment of TFs and cofactors. ENCODE archives 2314 ChIP-seq tracks of 684 TFs and cofactors assayed across 117 human cell lines under a multitude of growth and maintenance conditions. The algorithm presented herein, Mining Algorithm for GenetIc Controllers (MAGIC), uses ENCODE ChIP-seq data to look for statistical enrichment of TFs and cofactors in gene bodies and flanking regions in gene lists without an a priori binary classification of genes as targets or non-targets. When compared to other TF mining resources, MAGIC displayed favourable performance in predicting TFs and cofactors that drive gene changes in 4 settings: 1) a cell line expressing or lacking a single TF; 2) breast tumors divided along PAM50 designations; 3) whole brain samples from WT mice or mice lacking a single TF in a particular neuronal subtype; 4) single-cell RNA-seq analysis of neurons divided by Immediate Early Gene expression levels. In summary, MAGIC is a standalone application that produces meaningful predictions of TFs and cofactors in transcriptomic experiments.
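To make the binary 'target'/'non-target' approach concrete, the following is a minimal sketch of a one-sided Fisher exact (hypergeometric tail) enrichment test, the kind of test the abstract attributes to existing tools. It is not MAGIC's own method, which avoids this binary classification. The function name, argument names and the example counts are hypothetical.

```python
from math import comb


def fisher_enrichment_p(hits_in_list, list_size, targets_in_background, background_size):
    """One-sided Fisher exact p-value for over-representation.

    Returns P(X >= hits_in_list), where X is the number of TF targets seen
    when `list_size` genes are drawn without replacement from a background
    of `background_size` genes containing `targets_in_background` targets.
    """
    max_hits = min(list_size, targets_in_background)
    non_targets = background_size - targets_in_background
    total = comb(background_size, list_size)
    tail = sum(
        comb(targets_in_background, k) * comb(non_targets, list_size - k)
        for k in range(hits_in_list, max_hits + 1)
    )
    return tail / total


# Hypothetical counts: 15 of 200 query genes are binary 'targets' of a TF
# that has 500 targets in a 20000-gene background.
p_value = fisher_enrichment_p(15, 200, 500, 20000)
print(f"enrichment p-value: {p_value:.3g}")
```

Every gene counts equally once it has been labelled a target, so the result depends on where the target/non-target cut-off is placed; this is the a priori classification step that MAGIC is described as avoiding.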

Conflict of interest statement

The author has declared that no competing interests exist.

Figures

Fig 1. Comparing ranks of manipulated factors predicted by MAGIC, CHEA3, TFEA and Enrichr.

MCF7(shCon_vs_shREST), TCGA(Lum_vs_Basal), Brain(WT_vs_CTCFko) and DGC(Quiet_vs_Reactive) datasets were analyzed by MAGIC, CHEA3 (using all available libraries: ARCHS4 co-expression, ENCODE ChIP-seq, Enrichr Queries, GTEx co-expression, Literature mining, ReMAP ChIP-seq, Mean Rank and Top Rank), TFEA and Enrichr. The reciprocal integer ranks for REST, ESR1, CTCF and FOS (the factors manipulated in MCF7(shCon_vs_shREST), TCGA(Lum_vs_Basal), Brain(WT_vs_CTCFko) and DGC(Quiet_vs_Reactive), respectively) are plotted. The top rank = 1, the second rank = 0.5, etc. ND = Not Determined (factor not present in library).
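The reciprocal rank used in this figure can be illustrated with a short sketch. The helper name and the example ranking below are hypothetical and are not output from any of the tools compared.

```python
def reciprocal_rank(ranked_factors, factor):
    """Return 1 / (1-based rank) of `factor` in `ranked_factors`.

    Returns None when the factor is absent, mirroring the caption's
    ND (Not Determined) case for factors missing from a library.
    """
    try:
        return 1.0 / (ranked_factors.index(factor) + 1)
    except ValueError:
        return None


# Hypothetical ranked output, best-scoring factor first.
ranking = ["REST", "CTCF", "FOS"]
print(reciprocal_rank(ranking, "REST"))  # top rank
print(reciprocal_rank(ranking, "CTCF"))  # second rank
print(reciprocal_rank(ranking, "ESR1"))  # not in this ranking
```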

Fig 2. Manipulated transcription factors and associated cofactors are preferentially ranked by MAGIC compared to CHEA3, TFEA and Enrichr.

(A) Empirical cumulative distributions of fractional ranks (1/integer rank) were generated for manipulated factors and associated cofactors for the 4 datasets using all algorithms and libraries as in Fig 1. The difference between the cumulative of all scaled fractional ranks and a uniform distribution for the manipulated factor and associated cofactors is plotted against the fractional rank. Kolmogorov-Smirnov tests of each distribution against a uniform distribution yield p < 10^-10 for all tests. (B) Area Under Curve (AUC) for D(r)-r x r curves in panel A. (C) The D(r)-r curves in panel A were scaled for the rank of the manipulated factor. For each algorithm, D(r)-r was multiplied by the fractional rank of the manipulated factor (FR). (D) AUCs for curves in panel C.
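A generic sketch of the quantity plotted in panel A is given here: an empirical cumulative distribution D(r) evaluated at each observed value r, minus the uniform cumulative (which equals r on [0, 1]). The caption does not specify how fractional ranks were scaled, so the input is simply assumed to lie in [0, 1]; the function names and example values are hypothetical.

```python
def ecdf_minus_uniform(values):
    """Return (r, D(r) - r) pairs for values assumed to lie in [0, 1].

    D(r) is the empirical cumulative distribution evaluated at each
    sorted value; the uniform cumulative on [0, 1] is r itself.
    """
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n - x) for i, x in enumerate(xs)]


def ks_statistic_vs_uniform(values):
    """Two-sided Kolmogorov-Smirnov statistic against Uniform(0, 1)."""
    xs = sorted(values)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


# Hypothetical scaled fractional ranks for a factor and its cofactors.
ranks = [1.0, 0.5, 0.25]
for r, diff in ecdf_minus_uniform(ranks):
    print(f"r = {r:.2f}, D(r) - r = {diff:.3f}")
print(f"KS statistic: {ks_statistic_vs_uniform(ranks):.3f}")
```

The statistic alone does not give a p-value; that requires the Kolmogorov distribution, which a statistics library would normally supply.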

Fig 3. MAGIC demonstrates skill at calling manipulated factors as assessed by Precision Recall and Receiver Operator Characteristics.

(A) Precision Recall curves for the four datasets and all algorithms and libraries. (B) Receiver Operator Characteristic curves for the four datasets and all algorithms and libraries. As in panel A, data was not balanced prior to graphing. (C) ROC versus PR AUCs for all algorithms and libraries. (D) PR AUCs were scaled for fractional rank of the manipulated factor by multiplying PR AUC by FR.
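The scaling in panel D (PR AUC multiplied by FR) can be sketched as follows. This is an illustration under stated assumptions, not the author's analysis code: the caption does not say which factors were labelled positive or how the area was integrated, so the labels below are hypothetical and the area uses a simple trapezoidal rule.

```python
def pr_points(labels_by_rank):
    """Return (recall, precision) after each position of a ranked list.

    `labels_by_rank` holds booleans, best-ranked first; True marks a
    positive (here assumed to be a manipulated factor or its cofactor).
    """
    total_pos = sum(labels_by_rank)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")
    points, true_pos = [], 0
    for k, is_pos in enumerate(labels_by_rank, start=1):
        true_pos += is_pos
        points.append((true_pos / total_pos, true_pos / k))
    return points


def pr_auc(points):
    """Trapezoidal area under the precision-recall points."""
    area, prev_r, prev_p = 0.0, 0.0, points[0][1]
    for r, p in points:
        area += (r - prev_r) * (p + prev_p) / 2
        prev_r, prev_p = r, p
    return area


def scaled_pr_auc(labels_by_rank, fractional_rank):
    """PR AUC multiplied by the manipulated factor's fractional rank (FR)."""
    return pr_auc(pr_points(labels_by_rank)) * fractional_rank


# Hypothetical ranked calls; the manipulated factor is assumed to rank second (FR = 0.5).
print(scaled_pr_auc([True, True, False, True, False], 0.5))
```

Multiplying by FR penalises an algorithm that ranks cofactors well but places the manipulated factor itself far down the list.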

Fig 4. MAGIC requires a valid background gene list for optimal performance.

(A) D(r)-r curves for the 4 datasets generated for MAGIC outputs in the presence or absence of a background list. Kolmogorov-Smirnov statistics for the 2 curves: MCF7(shCon_vs_shREST); D = 0.10, p = 8.6 x 10^-5. TCGA(Lum_vs_Basal); D = 0.15, p = 4.6 x 10^-11. Brain(WT_vs_CTCFko); D = 0.26, p = 1.1 x 10^-16. DGC(Quiet_vs_Reactive); D = 0.11, p = 3.9 x 10^-9. (B) Precision Recall curves for MAGIC outputs in the presence or absence of a background list: MCF7(shCon_vs_shREST); 0.84 vs 0.78, TCGA(Lum_vs_Basal); 0.83 vs 0.81, Brain(WT_vs_CTCFko); 0.91 vs 0.71, DGC(Quiet_vs_Reactive); 0.72 vs 0.67. Black vertical line denotes 80% Recall. (C) Receiver Operator Characteristic curves for MAGIC outputs in the presence or absence of a background list (unbalanced). (D) Empirical cumulative distribution for False Discovery Rates associated with MAGIC outputs in the presence or absence of a background list. For all datasets, Kolmogorov-Smirnov p < 10^-4. (E) The integer ranks for the top 50 factors called by MAGIC in the presence of a background list were compared to their ranks in the absence of a background list.
