Quantifying similarity between motifs

Shobhit Gupta et al. Genome Biol. 2007.

Abstract

A common question within the context of de novo motif discovery is whether a newly discovered, putative motif resembles any previously discovered motif in an existing database. To answer this question, we define a statistical measure of motif-motif similarity, and we describe an algorithm, called Tomtom, for searching a database of motifs with a given query motif. Experimental simulations demonstrate the accuracy of Tomtom's E values and its effectiveness in finding similar motifs.

Figures

Figure 1

An aligned pair of similar motifs. The query and target motifs are both derived from JASPAR motif NF-Y, following the simulation protocol described in the text. Tomtom assigns an E value of 3.81 × 10⁻¹⁰ to this particular match. The figure was created using a version of seqlogo [26], modified to display aligned pairs of Logos.

Figure 2

Score distribution histogram for a query motif of length 12. The figure contains 12 histograms overlaid on top of each other. Each histogram corresponds to the frequency distribution of scores, for an offset of zero relative to a query motif of width 12. The first (red) histogram is for the alignment involving only the first query column, the next (light green) histogram relates to the first two query columns, and so on.
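
The overlaid histograms can be pictured as a running convolution: the score distribution for the first k query columns is the distribution for the first k-1 columns combined with the scores of column k. The following is a minimal sketch of that idea, not the published implementation. It assumes that column scores have been discretized to integers and that target columns are drawn independently; the function name and input layout are hypothetical.

```python
from collections import Counter

def cumulative_score_distributions(column_scores):
    """Return one score distribution per query-column prefix.

    column_scores[i] is a list of integer (discretized) scores of query
    column i against every target column (hypothetical input layout).
    The result dists[k] maps a summed score to its probability when the
    first k + 1 query columns are aligned, assuming independent columns.
    """
    dists = []
    current = {0: 1.0}  # before any column is aligned, the sum is 0
    for scores in column_scores:
        counts = Counter(scores)
        n = len(scores)
        nxt = {}
        # Convolve the running distribution with this column's scores.
        for total, prob in current.items():
            for score, count in counts.items():
                key = total + score
                nxt[key] = nxt.get(key, 0.0) + prob * count / n
        current = nxt
        dists.append(current)
    return dists

# Two query columns, each scoring 0 or 1 against two target columns.
dists = cumulative_score_distributions([[0, 1], [0, 1]])
print(dists[0])  # distribution for the first column only
print(dists[1])  # distribution for the first two columns
```

Each entry of the returned list corresponds to one of the overlaid histograms: the first covers only the first query column, the second covers the first two, and so on.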

Figure 3

Accuracy of motif comparison P values. The figure plots the computed motif P value as a function of the empirical (rank-based) P value from searching shuffled query motifs against shuffled target motifs. The central line corresponds to y = x, and the two adjacent dotted lines correspond to y = 0.5x and y = 2x. The P values are computed using the Euclidean distance.
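
A rank-based empirical P value of the kind plotted on the x-axis can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the computed P values are assumed to come from a null search (shuffled queries against shuffled targets), ties are ignored, and both function names are hypothetical. The second function checks whether each point falls between the y = 0.5x and y = 2x guide lines.

```python
def empirical_p_values(computed_p):
    """Assign each null match the empirical P value rank / N.

    computed_p holds the computed P values of N null matches.
    The smallest computed P value gets rank 1. Ties are not handled
    specially in this sketch.
    """
    n = len(computed_p)
    order = sorted(range(n), key=lambda i: computed_p[i])
    empirical = [0.0] * n
    for rank, i in enumerate(order, start=1):
        empirical[i] = rank / n
    return empirical

def within_factor_of_two(computed, empirical):
    """True where the computed value lies between 0.5x and 2x of the empirical one."""
    return [0.5 * e <= c <= 2.0 * e for c, e in zip(computed, empirical)]

computed = [0.5, 0.1, 0.9, 0.3]
empirical = empirical_p_values(computed)
print(empirical)
print(within_factor_of_two(computed, empirical))
```

With accurate P values, nearly all points should pass the factor-of-two check.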

Figure 4

Measuring retrieval accuracy. Motif retrieval accuracy is estimated using simulated JASPAR motifs, as described in the text. The figure plots the percentage of correct query-target pairs (true positives) as a function of the percentage of incorrect pairs (false positives) as we traverse the list of query-target pairs sorted by Tomtom P value or by any of the other three methods of combining column-wise scores. The solid and dashed lines correspond to width-normalized scores (P values, arithmetic mean, and geometric mean), and the green dotted line represents the sum of column scores. This figure is for Euclidean distance (ED) at a sampling rate of S/8.
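
The three simple ways of combining column-wise scores named in this caption can be written down directly. The sketch below is illustrative only: it assumes non-negative column scores (as with Euclidean distance), and the function name is hypothetical.

```python
import math

def combine_column_scores(column_scores):
    """Combine per-column scores of one alignment in three simple ways.

    column_scores is a non-empty list of non-negative scores, one per
    aligned column pair (an assumption of this sketch).
    """
    n = len(column_scores)
    total = sum(column_scores)
    return {
        "sum": total,                                          # grows with overlap width
        "arithmetic_mean": total / n,                          # width-normalized
        "geometric_mean": math.prod(column_scores) ** (1 / n)  # width-normalized
    }

print(combine_column_scores([1.0, 4.0]))
```

The plain sum depends on how many columns overlap, so alignments of different widths are not directly comparable, whereas the two means divide that dependence out.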

Figure 5

E value-based retrieval rate. The figure plots the percentage of query motifs that successfully matched the correct JASPAR target as a function of the number of sites used to create the query motif. Here 'success' means that the top-ranked motif is the correct target and has an E value less than 0.01. ALLR, average log-likelihood ratio; ED, Euclidean distance; FIET, Fisher-Irwin exact test; KLD, Kullback-Leibler divergence; PCC, Pearson correlation coefficient; PCST, Pearson χ² test; SW, Sandelin-Wasserman function.
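
The success criterion in this caption can be sketched as a small check. One assumption is made that this page does not state: the E value is taken to be the match P value multiplied by the number of target motifs searched. The function name and the example P values are hypothetical.

```python
def is_successful_retrieval(p_values, correct_target, threshold=0.01):
    """Return True if the top-ranked target is correct and its E value is below threshold.

    p_values maps each target motif name to the P value of its match
    with the query. The E value is assumed to be P times the number of
    targets (an assumption of this sketch).
    """
    n_targets = len(p_values)
    e_values = {name: p * n_targets for name, p in p_values.items()}
    best = min(e_values, key=e_values.get)  # smallest E value ranks first
    return best == correct_target and e_values[best] < threshold

# Made-up P values for a query derived from NF-Y.
matches = {"NF-Y": 1e-6, "motif_A": 0.2, "motif_B": 0.5}
print(is_successful_retrieval(matches, "NF-Y"))
```

The retrieval rate on the y-axis would then be the fraction of query motifs for which this check returns True.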

References

    1. Maniatis T, Goodbourn S, Fischer JA. Regulation of inducible and tissue-specific gene expression. Science. 1987;236:1237–1245. doi: 10.1126/science.3296191.
    2. Pawson T, Nash P. Assembly of cell regulatory systems through protein interaction domains. Science. 2003;300:445–452. doi: 10.1126/science.1083653.
    3. Tompa M, Li N, Bailey T, Church G, Moor BD, Eskin E, Favorov A, Frith M, Fu Y, Kent W, et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol. 2005;23:137–144. doi: 10.1038/nbt1053.
    4. Sandelin A, Alkema W, Engstrom P, Wasserman W, Lenhard B. JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res. 2004;32:D91–D94. doi: 10.1093/nar/gkh012.
    5. Wingender E, Chen X, Hehl R, Karas H, Liebich I, Matys V, Meinhardt T, Pruss M, Reuter I, Schacherer F. TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Res. 2000;28:316–319. doi: 10.1093/nar/28.1.316.
