Quantifying similarity between motifs - PubMed

Quantifying similarity between motifs

Shobhit Gupta et al. Genome Biol. 2007.

Abstract

A common question within the context of de novo motif discovery is whether a newly discovered, putative motif resembles any previously discovered motif in an existing database. To answer this question, we define a statistical measure of motif-motif similarity, and we describe an algorithm, called Tomtom, for searching a database of motifs with a given query motif. Experimental simulations demonstrate the accuracy of Tomtom's E values and its effectiveness in finding similar motifs.

Figures

Figure 1

An aligned pair of similar motifs. The query and target motifs are both derived from JASPAR motif NF-Y, following the simulation protocol described in the text. Tomtom assigns an E value of 3.81 × 10^-10 to this particular match. The figure was created using a version of seqlogo [26], modified to display aligned pairs of logos.

Figure 2

Score distribution histogram for a query motif of length 12. The figure contains 12 histograms overlaid on top of each other. Each histogram corresponds to the frequency distribution of scores, for an offset of zero relative to a query motif of width 12. The first (red) histogram is for the alignment involving only the first query column, the next (light green) histogram relates to the first two query columns, and so on.
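The overlaid histograms in Figure 2 can be read as the score distribution of a growing prefix of query columns. The following is a minimal sketch of how such prefix distributions could be built, assuming (these are assumptions, not details taken from the caption) that each column score has been discretized to a non-negative integer and that column scores are independent, so that the distribution of a sum is the convolution of the per-column distributions. The function name `cumulative_score_distributions` is hypothetical.

```python
import numpy as np


def cumulative_score_distributions(column_pmfs):
    """Return the score distribution for each prefix of query columns.

    column_pmfs: list of 1-D arrays; entry i of an array is the probability
    that the (integer, discretized) score of that column equals i.
    Assumption: column scores are independent, so sums convolve.
    """
    distributions = []
    running = np.array([1.0])  # empty sum: score 0 with probability 1
    for pmf in column_pmfs:
        running = np.convolve(running, np.asarray(pmf, dtype=float))
        distributions.append(running)
    return distributions


# Two toy columns, each scoring 0 or 1 with equal probability.
toy = cumulative_score_distributions([[0.5, 0.5], [0.5, 0.5]])
print(toy[0])  # distribution over the first column only
print(toy[1])  # distribution over the first two columns
```

For a query of width 12, the returned list would hold 12 distributions, one per histogram in the figure.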

Figure 3

Accuracy of motif comparison P values. The figure plots the computed motif P value as a function of the empirical (rank-based) P value from searching shuffled query motifs against shuffled target motifs. The central line corresponds to y = x, and the two adjacent dotted lines correspond to y = 0.5x and y = 2x. The P values are computed using the Euclidean distance.
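A rank-based P value of the kind plotted on the x-axis of Figure 3 can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: it assumes higher scores indicate better matches, and it adds a pseudocount of one so that no empirical P value is exactly zero. The names `empirical_p_values` and `within_factor_of_two` are hypothetical; the latter checks whether a computed P value falls between the y = 0.5x and y = 2x lines.

```python
import numpy as np


def empirical_p_values(null_scores, observed_scores):
    """Rank-based P value: fraction of null scores >= each observed score.

    Assumptions: higher score = better match; a +1 pseudocount is applied
    to numerator and denominator.
    """
    null_sorted = np.sort(np.asarray(null_scores, dtype=float))
    observed = np.asarray(observed_scores, dtype=float)
    # Count of null scores greater than or equal to each observed score.
    n_ge = len(null_sorted) - np.searchsorted(null_sorted, observed, side="left")
    return (n_ge + 1) / (len(null_sorted) + 1)


def within_factor_of_two(computed_p, empirical_p):
    """True where computed P lies between 0.5x and 2x the empirical P."""
    ratio = np.asarray(computed_p, dtype=float) / np.asarray(empirical_p, dtype=float)
    return (ratio >= 0.5) & (ratio <= 2.0)


p = empirical_p_values([1, 2, 3, 4], [3, 5])
print(p)
print(within_factor_of_two([0.5, 0.05], p))
```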

Figure 4

Measuring retrieval accuracy. Motif retrieval accuracy is estimated using simulated JASPAR motifs, as described in the text. The figure plots the percentage of correct query-target pairs (true positives) as a function of the percentage of incorrect pairs (false positives) as we traverse the list of query-target pairs sorted by Tomtom P value or any of the other three methods of combining column-wise scores. The solid and dashed lines correspond to width-normalized scores (P values, arithmetic mean, and geometric mean), and the green dotted line represents the sum of column scores. This figure is for Euclidean distance (ED) at a sampling rate of S/8.
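The three simple ways of combining column-wise scores named in the caption (sum, arithmetic mean, geometric mean) can be sketched as below for the Euclidean distance. This is an illustrative sketch only: it assumes both motifs are already aligned and have equal width, that each motif is a matrix with one row per position and one column per nucleotide (A, C, G, T), and that a zero column distance makes the geometric mean zero. The function names are hypothetical.

```python
import numpy as np


def column_distances(query, target):
    """Euclidean distance between corresponding columns of two aligned motifs.

    query, target: arrays of shape (width, 4), rows = motif positions.
    """
    q = np.asarray(query, dtype=float)
    t = np.asarray(target, dtype=float)
    return np.sqrt(((q - t) ** 2).sum(axis=1))


def combine_column_scores(scores):
    """Combine per-column scores by sum, arithmetic mean, and geometric mean."""
    s = np.asarray(scores, dtype=float)
    # Geometric mean via logs; defined here as 0 if any score is 0.
    geo = 0.0 if np.any(s == 0) else float(np.exp(np.log(s).mean()))
    return {
        "sum": float(s.sum()),                # grows with motif width
        "arithmetic_mean": float(s.mean()),   # width-normalized
        "geometric_mean": geo,                # width-normalized
    }


query = [[1, 0, 0, 0], [0, 1, 0, 0]]
target = [[0, 1, 0, 0], [0, 1, 0, 0]]
print(combine_column_scores(column_distances(query, target)))
```

The sum is the only one of the three that depends directly on the number of aligned columns, which is why the caption groups the two means with the width-normalized scores.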

Figure 5

E value-based retrieval rate. The figure plots the percentage of query motifs that successfully matched the correct JASPAR target as a function of the number of sites used to create the query motif. Here 'success' means that the top-ranked motif is the correct target and has an E value less than 0.01. ALLR, average log-likelihood ratio; ED, Euclidean distance; FIET, Fisher-Irwin exact test; KLD, Kullback-Leibler divergence; PCC, Pearson correlation coefficient; PCST, Pearson χ² test; SW, Sandelin-Wasserman function.
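The success criterion in this caption can be expressed as a short check. The sketch below assumes the conventional definition of an E value as the P value multiplied by the number of target motifs searched; that definition, the function names, and the toy hit list are assumptions made for illustration and are not taken from the caption.

```python
def e_value(p_value, num_targets):
    """E value under the assumed definition: P value times database size."""
    return p_value * num_targets


def is_successful_retrieval(ranked_hits, correct_target, num_targets, threshold=0.01):
    """Apply the caption's criterion to one query.

    ranked_hits: list of (target_id, p_value) pairs, best match first.
    Success: the top-ranked target is the correct one and its E value
    is below the threshold.
    """
    if not ranked_hits:
        return False
    top_id, top_p = ranked_hits[0]
    return top_id == correct_target and e_value(top_p, num_targets) < threshold


# Hypothetical search result against a database of 100 targets.
hits = [("NF-Y", 1e-6), ("SP1", 0.2)]
print(is_successful_retrieval(hits, "NF-Y", num_targets=100))
```

The retrieval rate plotted in the figure would then be the fraction of queries for which this check returns true.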

Cited by

OVO positively regulates essential maternal pathways by binding near the transcriptional start sites in the Drosophila female germline.
Benner L, Muron S, Gomez JG, Oliver B. Elife. 2024 Sep 18;13:RP94631. doi: 10.7554/eLife.94631. PMID: 39291827. Free PMC article.
29 mammalian genomes reveal novel exaptations of mobile elements for likely regulatory functions in the human genome.
Lowe CB, Haussler D. PLoS One. 2012;7(8):e43128. doi: 10.1371/journal.pone.0043128. Epub 2012 Aug 27. PMID: 22952639. Free PMC article.
Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors.
Wang J, Zhuang J, Iyer S, Lin X, Whitfield TW, Greven MC, Pierce BG, Dong X, Kundaje A, Cheng Y, Rando OJ, Birney E, Myers RM, Noble WS, Snyder M, Weng Z. Genome Res. 2012 Sep;22(9):1798-812. doi: 10.1101/gr.139105.112. PMID: 22955990. Free PMC article.
Transcriptomic characterization of Trichoderma harzianum T34 primed tomato plants: assessment of biocontrol agent induced host specific gene expression and plant growth promotion.
Aamir M, Shanmugam V, Dubey MK, Husain FM, Adil M, Ansari WA, Rai A, Sah P. BMC Plant Biol. 2023 Nov 8;23(1):552. doi: 10.1186/s12870-023-04502-6. PMID: 37940862. Free PMC article.
Predicting gene regulatory regions with a convolutional neural network for processing double-strand genome sequence information.
Onimaru K, Nishimura O, Kuraku S. PLoS One. 2020 Jul 23;15(7):e0235748. doi: 10.1371/journal.pone.0235748. eCollection 2020. PMID: 32701977. Free PMC article.

References

1. Maniatis T, Goodbourn S, Fischer JA. Regulation of inducible and tissue-specific gene expression. Science. 1987;236:1237–1245. doi: 10.1126/science.3296191.
2. Pawson T, Nash P. Assembly of cell regulatory systems through protein interaction domains. Science. 2003;300:445–452. doi: 10.1126/science.1083653.
3. Tompa M, Li N, Bailey T, Church G, Moor BD, Eskin E, Favorov A, Frith M, Fu Y, Kent W, et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol. 2005;23:137–144. doi: 10.1038/nbt1053.
4. Sandelin A, Alkema W, Engstrom P, Wasserman W, Lenhard B. JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res. 2004;32:D91–D94. doi: 10.1093/nar/gkh012.
5. Wingender E, Chen X, Hehl R, Karas H, Liebich I, Matys V, Meinhardt T, Pruss M, Reuter I, Schacherer F. TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Res. 2000;28:316–319. doi: 10.1093/nar/28.1.316.
