RankMotif++: a motif-search algorithm that accounts for relative ranks of K-mers in binding transcription factors (original) (raw)

Abstract

Motivation: The sequence specificity of DNA-binding proteins is typically represented as a position weight matrix in which each base position contributes independently to relative affinity. Assessment of the accuracy and broad applicability of this representation has been limited by the lack of extensive DNA-binding data. However, new microarray techniques, in which preferences for all possible K-mers are measured, enable a broad comparison of both motif representation and methods for motif discovery. Here, we consider the problem of accounting for all of the binding data in such experiments, rather than the highest affinity binding data. We introduce the RankMotif++, an algorithm designed for finding motifs whenever sequences are associated with a semi-quantitative measure of protein-DNA-binding affinity. RankMotif++ learns motif models by maximizing the likelihood of a set of binding preferences under a probabilistic model of how sequence binding affinity translates into binding preference observations. Because RankMotif++ makes few assumptions about the relationship between binding affinity and the semi-quantitative readout, it is applicable to a wide variety of experimental assays of DNA-binding preference.

Results: By several criteria, RankMotif++ predicts binding affinity better than two widely used motif finding algorithms (MDScan, MatrixREDUCE) or more recently developed algorithms (PREGO, Seed and Wobble), and its performance is comparable to a motif model that separately assigns affinities to 8-mers. Our results validate the PWM model and provide an approximation of the precision and recall that can be expected in a genomic scan.

Availability: RankMotif++ is available upon request.

Contact:quaid.morris@utoronto.ca

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Understanding protein–DNA interactions is critical to understanding genome function. Recognition of specific DNA sequences by proteins is central in processes such as control of gene expression and chromosome replication, and as a consequence plays an extensive role in shaping genome sequence. On the basis of domain structure, over 6% of all human genes encode potential DNA-binding proteins (Messina et al., 2004), and the majority of sequence conserved among verterbrates is non-coding and may be regulatory (Pennacchio et al., 2006; Siepel et al., 2005).

Given the importance of protein–DNA interactions in computational biology, it is remarkable that the simple position weight matrix (PWM) model remains a virtually unchanged standard more than 20 years after its introduction (Berg and von Hippel, 1987). The PWM is a simple model that predicts the binding energy of a protein–DNA complex. It assumes that each DNA base in the binding site of the protein makes an independent, additive contribution to the binding energy of the protein–DNA complex. Other representations of DNA-binding preferences, such as position frequency matrices (PFMs), that assume that bases contribute independently to the binding affinity can be transformed into PWMs.

For most transcription factors (TFs), the PWM is derived from a few dozen sequences that the TF is known to bind. These sequences are most commonly obtained from in vitro selection (Vlieghe, 2006), a procedure that selects for and reports only the most highly specific binding sequences (Roulet et al., 2002). PWMs are increasingly inferred from ChIP-chip data as well (MacIssac et al., 2006), despite binding sites in vivo being influenced by factors other than binding specificity [e.g. protein–protein interactions and chromatin structure (Liu _et al_., 2005)]. In fact, one recent report suggests widespread use of predicted weak DNA-binding sites in vivo (Tanay, 2006), suggesting that stringent in vitro selections may not capture all of the information relevant to in vivo function of TF DNA-binding activities.

To our knowledge, the additive PWM model has not been rigorously challenged in its ability to correctly predict affinity to all possible binding sequences, largely due to the difficulty of making measurements on an exhaustive collection of sequences in vitro and the presence of confounding factors in vivo. Until recently, only a relatively small number of sequences had ever been tested for any given TF (Man and Stormo, 2001; Mukherjee et al., 2004). However, development in the last two years of microarray-based techniques for measuring protein–DNA associations (Berger et al., 2006; Liu et al., 2005; Mukherjee et al., 2004; Warren et al., 2006) promises a dramatic expansion in protein–DNA interaction data.

These data present new opportunities to revisit both the PWM model of protein–DNA affinity as well as the various algorithms designed to learn this model from binding data. The validity of the PWM model can be tested by measuring its ability to reproduce the protein–DNA interaction data. However, fitting a PWM model, or motif, to these microarray-based measurements of binding affinity is difficult. In these data, a microarray intensity level is a semi-quantitative measurement of the affinity of the TF for the DNA sequence represented by the probe. Traditional motif finding algorithms like Bioprospector (Liu et al., 2001) and MEME (Bailey and Elkan, 1994) mostly ignore this intensity and treat all probe sequences above a given threshold equally. As a result the motifs produced by these algorithms are very sensitive to the ad hoc choice of the threshold.

Two newer popular motif finding algorithms, MDScan (Liu et al., 2002) and MatrixREDUCE (Foat et al., 2005, 2006), incorporate intensity data into their search but make different assumptions about the relationship between binding affinity and the microarray intensity. MDScan uses the intensity data to classify bound probes into two groups based on the rank of their intensity. The higher ranked group contributes more to the PWM. MatrixREDUCE uses the microarray intensity as a surrogate for binding affinity and fits a motif model by assuming a linear relationship between the two. Though microarray intensity can be to used to predict the relative binding preferences of a TF for given pairs of probes, the exact functional form of the relationship between the TF-binding affinity for a probe sequence and its intensity level remains unclear and may vary depending on the experimental technique.

To address this problem, algorithms have recently been developed that fit motif models to intensity rankings rather than the intensities themselves (Berger et al., 2006; Chua et al., 2006; Tanay, 2006). One of these, PREGO (Tanay, 2006), was designed originally for ChIP-chip data and fits a PWM model by attempting to maximize the Spearman rank correlation between binding affinities and the microarray intensities using hill-climbing optimization. Another, Seed and Wobble (Berger et al., 2006), was designed for protein binding microarray (PBM) data (Mukherjee et al., 2004) and calculates an enrichment score similar to a Mann-Whitney U statistic for each 8mer and gapped 8mer. A PWM model is constructed by combining the enrichment scores of the best ‘seed’ 8mer and those of single base ‘wobbles’ around the seed using a Boltzmann distribution. These and other algorithms based on intensity rankings may be sensitive to noise in the microarray intensity levels that can translate into large changes in the relative rankings of probes with similar TF-binding affinities.

Here we introduce a new motif finding algorithm, RankMotif++, which uses the microarray data to infer relative binding preferences of the TF for pairs of probes and learns a motif model that is consistent with these preferences. By predicting binding preferences inferred from microarray intensities rather than the intensities themselves, RankMotif++ is applicable for a wide variety of relationships between the TF-binding affinity and semi-quantitative readout of that affinity. Because it relies on inferred binding preferences rather than absolute rankings, RankMotif++ can be less sensitive to noise in probe intensity measurements. We evaluate our algorithm and test the general validity of the PWM model of binding affinity using data from a recent study by Berger et al. (2006).

Berger et al. (2006) used PBMs to analyze five different TFs. These were each analyzed on two different array designs containing independent de Bruijn sequences, such that each 10mer was represented once on each array, but embedded in a different 35-base context from the other array. We have used these data to both gauge the reproducibility of the individual K-mer-binding affinities inferred from the PBM data for the TFs, as well as the accuracy of the PWM model derived from several state-of-the-art approaches. The results of this analysis support the general validity of the PWM model, highlight some cases where it may or may not be advantageous to use individual K-mer scores, and also serve as a guide to the accuracy of genomic scans using either a PWM or a library of K-mers.

2 ALGORITHM

The RankMotif++ algorithm is based on a probabilistic model of binding preferences. The binding preferences of the TF can be inferred from the microarray intensity data using standard statistical techniques. In our model, the probability that one probe will be observed to be preferentially bound by the TF over another probe depends on the ratio of the affinities assigned to each probe sequence by the motif model. RankMotif++ fits PWM by searching for those that maximize the likelihood of the observed preference data under this probabilistic model.

2.1 Probabilistic model of binding preferences

Let a potentially noisy observation of the binding preferences of a TF be represented by a set of binary random variables formula⁠, where N is the number of probes and X ij = 1 if probe sequence i is observed to be preferred over sequence j. By definition we set X ij = 1 − X ji.

Let the outcome of a particular experimental assay, for example a PBM experiment be represented by a configuration of a subset of the binary random variables in X. Though, in general, a given experiment may result in an observation of only a subset of the possible preferences, to simplify the presentation of the model, we will assume that all possible preferences are observed. Later on we will describe how to relax this assumption. Without loss of generality, we will also assume for the time being that the probes are indexed so that if i > j then X ij = 1.

Let si be the sequence associated with probe i and formula be the set of all probe sequences, and let Θ = {θ 1, θ 2, … , θ M} represent the parameters of the motif model. Our model also includes a scaling factor formula that governs the effect that changes in predicted binding affinity have on the probability of observing a preference in the microarray data. The value w is also fit by maximizing the log likelihood.

Given Θ, w and S, our model assumes that the preferences are conditionally independent, so the log likelihood of Θ and w, formula⁠, can be written as

formula

We model the probability of observing a given preference P(X ij = 1 | Θ, w, S) as depending only upon the scaled log ratio of the binding affinities of the motif model for sequences si and sj. If we set formula where g(s i, Θ ) is the binding affinity of motif model Θ for si, then we write this probability as:

formula

(1)

where σ (x) = 1/(1+e_−_x) is the well-known logistic sigmoid function. Note that we can rewrite equation (1) as

formula

In the following when the values of Θ, w and S are obvious, we will use the notation z ij to represent P(X ij = 1 |Θ, w, S).

2.2 Maximum likelihood estimation of the motif model

We maximize the likelihood of the binding model with respect to Θ and w using conjugate gradient descent (Press et al., 2002) on the log likelihood of the binding data. In this section, we describe how the gradient of the log likelihood is calculated.

Since X ij = 1−X ji, we can write

formula

(2)

and thus

formula

(3)

Further noting that

formula

and that

formula

after taking the derivative of equation (3) with respect to an element θ m of Θ, we get

formula

(4)

which can be simplified to

formula

(5)

where r i = 1+∑ ji X ij is simply the rank of probe i in a list of probe intensities sorted in increasing order, and formula can be interpreted as the expected rank of i under the binding preference model. So the partial derivative of the cost function with respect to θ m depends upon the difference between the actual and expected rank of probe i, weighted by the derivative of f(s i,Θ ) with respect to θ m. Similarly, the partial derivative with respect to w is

formula

(6)

2.3 Definition of the motif model and composition function

None of the above derivation depends on the particular form of our motif model Θ or our method of estimating the binding affinity f (· , · ). As such, our binding preference model can easily be generalized to more complex motif models, so long as formula exists for all elements θ m of Θ.

To investigate the accuracy of the PWM model and allow direct comparison against other PWM-based motif finding algorithms, we take Θ as representing a PWM, whose (b,k)_th_element, Θ bk, is the weight of base formula at position k in the binding site. Each column, Θ k, of the PWM can be used to define a multinomial distribution over D as follows:

formula

(7)

and the probability of an arbitrary K-mer s under the model is

formula

(8)

where sk is the _k_-th element of s.

A given sequence can contain multiple K-mers that are good matches to the PWM. A number of methods have been proposed to combine the probabilities assigned to the individual K-mers within a sequence into a composite measurement of the binding affinity of the PWM for the whole sequence. We use a combination mechanism similar to the GOMER (Granek and Clarke, 2005) scoring function. In particular, we define formula to be the probability under the PWM that the TF binds to at least one of the K-mers that are subsequences of si, i.e.

formula

(9)

where Li is the length of sequence si, and formula is the subsequence of si that starts at the (t+1)-th element and ends at the (t+K)-th element inclusive. Assuming that all the K-mers in si can be bound independently, g(s i, Θ ) is the probability that at least one of them will be bound. Our combination mechanism differs from GOMER in how we calculate the probability of binding to a given K-mer using the PWM. We made this change in order to ensure that our K-mer probability function had continuous partial derivatives.

The partial derivatives of f(s i,Θ ) with respect to Θ bk are:

formula

(10)

where δ s t+k,b = 1 if s t+k = b, otherwise δ s t+k,b = 0. Inserting equation (10) into equation (5) gives the partial derivatives necessary to complete the computation of the gradient of L(Θ, w).

2.4 Partial observations of binding preferences

When the difference (or ratio) between the observed intensities of two probes is within the noise level of the assay, we treat the TF-binding preference between those two probes as being unobserved. These unobserved preferences are easily handled in the binding preference model. Because the observed preferences are conditionally independent given the parameters of the model and the probe sequences, unobserved preferences can simply be marginalized out of the likelihood. The only change that needs to be made to the model is in the definition of ri and formula in equation (6). If T is the set of probe pairs (i, j) for which binding preferences have been observed, then the updated definition of ri and formula are

formula

Note that if (i,j) ∈ T then (j,i) ∈ T.

3 METHODS

3.1 PBM intensity data

We downloaded the PBM intensity measures from the Supplementary Material section of Berger et al. (2006). These data contain measurements from two different universal microarrays for each of the five transcription factors Cbf1, Ceh22, Oct1, Rap1 and Zif268.

We normalized the microarray intensity data in two steps. We first translated the microarray intensities by adding a constant to each measurement so that the minimum intensity was equal to one. We then log-transformed the translated data. We call these log-intensities the ‘normalized intensities’ and use yi to denote the normalized intensity of probe i.

3.2 Deriving protein-binding preferences from normalized intensities

Because microarray intensity data is noisy, the relative probe intensity levels are not always a reliable indicator of transcription factor binding preferences. As such, to define a reliable set of protein-binding preferences, we separately estimated the intensity noise level for each array and only predict a binding preference when one of the probes has a significantly higher normalized intensity than the other.

Typically in microarray-based assays of transcription factor binding affinity, only a small proportion of probes detect sequence-specific binding. As such, we can estimate the SD of the noise in the normalized intensity of non-specific binding using a robust estimate of the SD, σ, of the entire distribution of normalized intensity levels. In our experience, σ is also an empirical upper bound on the SD of the normalized intensity noise affecting probes detecting sequence-specific binding. As such, in the following, we will use σ as an estimate of the SD of the noise affecting all probes.

Our robust estimate is calculated by computing the median absolute deviation (MAD) of the distribution of normalized intensities and setting σ to the MAD divided by 0.6745 (the MAD of the unit normal distribution).

We define the set of positive (i.e. bound) probes as those whose normalized intensity formula⁠, where my is the median of {_y_1, _y_2, … , y N}. These are probes whose normalized intensity we deem to be significantly different, given the noise level σ, from the distribution of intensities of probes detecting non-specific binding. The proportion of positives varies between 0.9 and 5.4% among the ten arrays. Probes that are not positives are called negatives.

We set X ij = 1 if and only if yi has significantly higher normalized intensity than yj given σ. We represent that condition as formula⁠. If by the former condition neither X ij = 1 or X ji = 1, i.e. formula⁠, then we deem the intensity difference between the two probes to be within the noise level of the array. In this case, we represent X ij as being unobserved and thus do not try to predict the binding preference between these two probes.

3.3 Fitting RankMotif++ motif models

We use all the positives and a random subset of the negatives to learn the motif model in RankMotif++. The negative subset is a random subsampling of all negatives with size of 800.

Like many other motif finders, the log likelihood optimized by RankMotif++ is not convex, so different initializations can generate different results. For the experiments reported here, we used three different initialization for each motif length and we found that these three were enough to locate good local optima.

To initialize RankMotif++ for motifs of length seven, we identified the three 7mers most associated high intensity probes. We used the Wilcoxon–Mann–Whitney test on each 7mer to score the distribution of intensities of the probes that did contain the 7mer against the distribution of those that did not. From the three 7mers with the lowest _P_-values, we generate three initial position frequency matrices to seed the motif search. For each of these 7mers, we initialize a PFM by putting 0.7 for the consensus base in each column and 0.1 for the non-consensus base. We transform the PFM to a PWM by taking the log of each entry. In some but not all cases, the final PWM that our optimization procedure converges on has a similar consensus to the initial PWM.

To initialize the search for motifs of length K > 7, we generate two PFMs derived from the highest scoring RankMotif++ PFM of length K − 1 by adding a column of 0.25 on either side. We also generate one random PWM with weights uniformly distributed between −0.05 and 0.05. Occasionally the search starting from the randomly initialized PWM leads to the highest scoring PWM.

We computed motifs with widths from 7 to 13, and we used the motif model with the highest log likelihood for the test array evaluations.

3.4 Fitting the MatrixREDUCE motif models

We used MatrixREDUCE to generate one motif of length range between 7 and 13 by setting parameters formula between 7 and 13 and setting formula to 1. In computing this one motif, MatrixREDUCE considers a number of initial seeds and uses the highest scoring one among them, and it automatically chooses the best motif width among the range input. The transcription factors in the PBM study are unlikely to bind as homodimers so we set formula and formula to 0. We use the default settings for the other parameters.

We applied MatrixREDUCE to both the PBM intensity data directly and to the normalized intensities of the positive and negative probe sets used by RankMotif++. The overall performance of MatrixREDUCE on the benchmarks was similar for the two different inputs. The PWMs in Figure 1 are generated from raw PBM intensities but for consistency with other methods, we benchmark MatrixREDUCE on the normalized intensities inputs.

Sequence logo representations of the PWMs generated by the five motif finding algorithms for the five TFs on each of the two arrays.

Fig. 1.

Sequence logo representations of the PWMs generated by the five motif finding algorithms for the five TFs on each of the two arrays.

3.5 Fitting the MDScan motif models

We also vary the MDScan motif widths w between 7 and 13. MDScan uses a ‘top’ set and a refinement set to fit its motif model. We set the size of the top set to 20 which is the largest size used in Liu et al. (2002). We set the refinement set size to be equal to the number of positive probes on each array, and we use the default settings for all the other parameters. For each length, we only evaluate the motif model assigned the top score by MDScan.

The motif scores provided by MDScan always increased for the shorter motifs. As such, we used an alternative method for determining the optimal motif width. Using the top scoring motif model for each width, we scored each positive probe sequence of the training array using equation (9), and then we selected the motif model whose sequence scores had the highest Spearman correlation with the normalized intensities for the positive probes.

3.6 Fitting the PREGO motif models

The input probe sets for PREGO were the same as those used by RankMotif++.

We set the PREGO parameters formula to 7 and formula to 14. With these settings, PREGO reports the highest scoring motif of length range between 7 and 13. In order to force PREGO to run on our normalized intensities, we set formula to 1. We use the default settings for the other parameters.

3.7 Fitting the Seed and Wobble models

We were supplied with Seed and Wobble PWMs by Anthony Philippakis. We asked that these PWMs be fit separately to each array. For one of the arrays, Ceh22 Array 1, he supplied us with two slightly different PWM models.

3.8 Fitting the fully specified 8mer motif model

Each 8mer is associated with the median normalized intensity of the probes in which it appears in the training array.

3.9 Scoring test set probe sequences under each motif model

Note that all the motif models were only fit to training arrays, here we describe how these motif models were used to score the probe sequences on the test arrays.

  1. 8mer: The score assigned to a test set probe is simply the maximum of training array median intensities associated with the 8mers present in the probe.
  2. RankMotif++, MDScan, PREGO: We use equation (9) where s is the test probe sequence and Θ is the motif model fit to the training array.
  3. Seed and Wobble: We use equation (9) as above except for the two Ceh22 Array 1 PWMs. For Ceh22 Array 1, we scored each probe separately under the two PWMs [using equation (9)] and then set the composite score for the probe to be s = 1 − (1−_s_1)(1−_s_2) where _s_1 and _s_2 were its scores under the two PWMs respectively.
  4. MatrixREDUCE: The MatrixREDUCE motif model is neither a position weight matrix nor a position frequency matrix. As such, we use the MatrixREDUCE scoring function formula⁠, where θ is the MatrixREDUCE motif model, si is the test probe sequence, Li is the length of si and formula is the (t+K)-th element of sequence si.

3.10 Computing performance metrics

To test the generalization ability of each of the motif models, we train each model on intensity data from one of the array designs and tested generalization performance on the other. Each design was used once as a training array and once as a test array.

We score each model's ability to correctly identify and rank the positive probes (as previously defined) in the test array using the methods as described in Section 3.9.

For each TF and each method, we calculate the Spearman correlation between the sequence scores and the intensities for the positive probes, which measures how well each motif model can reproduce the rankings of the intensities of the positives.

By thresholding the sequence scores at a given threshold, we can make predictions of which test set probes will be positive. For each threshold, we can calculate the true positive rate (⁠formula⁠, also called the recall or the sensitivity), the false positive rate (⁠formula⁠) and the precision (⁠formula⁠). By decreasing the sequence score threshold and computing the true and false positive rates and the precision at these decreasing thresholds, we plot the ROC curves and the precision/recall curves shown in Figure 2.

The ROC curves and the Precision/Recall curves for array #1 for each of the six methods and each of the five TFs.

Fig. 2.

The ROC curves and the Precision/Recall curves for array #1 for each of the six methods and each of the five TFs.

4 RESULTS

Figure 1 shows the PWMs produced by each method for the five different transcription factors on each of the two array designs using sequence logos generated using enoLogo (Workman et al., 2005). There is a general overlap in the consensus sequences of the PWM but also consistent differences between the PWMs from different methods. In general, MatrixREDUCE and RankMotif++ produce more degenerate PWMs than MDScan. Seed and Wobble motifs are almost always the widest and the PREGO motifs are always between 7 and 8 columns. In general, PWMs from the same method run on data from different arrays are more similar than PWMs from different methods run on the same array data.

To evaluate whether these differences in PWMs would lead to differences in genome-wide predictions of TF binding, we evaluated the accuracy with which each PWM could reproduce the ranking of probe intensities based on the probe sequence.

Each probe sequence was scored by each PWM, and the rankings of the probe scores were compared to the rankings of the measured probe intensities. In order to judge the validity of the position-independent model, we included a fully specified model of 8mer TF-binding preferences in our evaluation. To ensure that the performance measures were not influenced by probes not bound by the TF, we defined a set of ‘positive’ probes that had intensities at least four SDs above the median intensity. For each PWM, we determined how highly the positive probes ranked in the score rankings (Table 1 and Figure 2) and how well the scoring ranking reproduced the observed intensity rankings (Table 2).

Table 1.

Spearman rank correlation coefficients

graphic

graphic

Table 1.

Spearman rank correlation coefficients

graphic

graphic

Table 2.

True positive rates at 1% false positive rate

graphic

graphic

Table 2.

True positive rates at 1% false positive rate

graphic

graphic

Table 1 shows the Spearman correlation coefficients between the probe scores and the probe intensities for each of the six methods. All of these correlations are statistically significant under a one-sided test, the largest _P_-value of any element in the table is less than 10−5 and only four _P_-values were larger than 10−10. RankMotif++ has the highest correlation among the PWMs in eight out of the ten conditions with MatrixREDUCE and Seed and Wobble having the highest correlation for both Oct1 arrays. The low MDScan correlations are likely due to the lack of degeneracy of the MDScan PWMs: between 50 and 75% of positive probes do not have any hits to the MDScan PWMs and are thus assigned the same score as most of the non-positive probes. The degeneracy of the MDScan model is governed by one of the algorithm's parameters. Though we set this parameter to the highest level used in Liu et al. (2002), it is possible that the performance of MDScan would improve with a different setting.

Table 2 shows the sensitivity (true positive rate) of each PWM at 99% specificity (1% false positive rate). RankMotif++ has highest sensitivity in seven out of the ten conditions. At this specificity level, the MDScan PWMs still have hits in the positive probes. Figure 2 provides a more detailed view of the predictive performance of each method on the positive probes from the array design # 1.

RankMotif++ recovers probe rankings better any of the other four methods though does not perform quite as well as the fully specified 8mer model. However, RankMotif++ has higher correlations and higher sensitivities than the 8mer model in three of the ten conditions, particularly for Rap1 which has binding profiles with more than eight conserved bases. On Cbf1 and Ceh22, 8mer median and RankMotif++ have similar correlations and sensitivities. Though the 8mer median performance is better on Oct1 and Zif268.

5 DISCUSSION

We have made an investigation into the general validity of the PWM model of transcription factor DNA binding, and in the same analysis introduced a new method for learning PWMs. As a gold standard, we used a fully specified model of 8mer- binding preferences.

Surprisingly, the predictive accuracy of the 8-mer model at predicting the 35mer probe intensities was not as high as we had expected. Even for Cbf1, which has a core motif of six bases, a value of 50% recall is associated with only 60% precision by 8mer scores. Spearman correlations among the highest intensity spots (positives) are also far from absolute, suggesting that our stringent definition of positives is not causing a threshold effect. This phenomenon could represent a performance limit imposed by the reproducibility or noise level of the assay or the way the 8mer scores are calculated, or an effect of sequence or structural properties of flanking bases in the 35mer probe. Indeed Berger et al. (2006) report an effect of both the position of the consensus on the probe and its sequence context on the intensities of probes containing the Zif268 consensus. However, they also report that 8mer median intensities calculated from different arrays are highly reproducible. In our hands, the Spearman correlations between the highest intensity 8mer scores calculated separately on each array average 0.7593 over the five TFs. However, because the 8mer scores each are calculated from 16 separate 35mer intensity measurements, this correlation may represent an overestimate of their performance on a genome-wide scan.

By several criteria, our method, RankMotif++, performed favorably in comparison to currently used tools. In some circumstances, RankMotif++ even performed better than the 8mer model, for example in the case of Rap1. We note that the binding profile of Rap1 is greater than 11 bp, so the PWM models are inferring the binding affinities of Rap1 to 12 and 13mers not represented in the training array and thus cannot be fully captured by the 8mer model. On the whole, PWMs learned by RankMotif++ on PBM microarray array data are more accurate predictors of transcription factor binding preferences than those learned by two widely used motif finders for use with semi-quantitative data: MatrixREDUCE and MDScan and two new algorithms that use intensity rankings: PREGO and Seed and Wobble. We suggest that these differences in performance are due differences in assumptions that each method makes about the relationship between transcription binding affinity and semi-quantitative readout and differences in each method's sensitivity to intensity noise.

ACKNOWLEDGEMENTS

We would like to acknowledge helpful discussions with Anthony Philippakis, Mike Berger and Martha Bulyk and thank Anthony for supplying us with Seed and Wobble PWMs. X.C. was supported by a grant from Genome Canada through the Ontario Genomics Institute. This work was supported in part by an NSERC operating grant to Quaid Morris.

Conflict of Interest: none declared.

REFERENCES

Fitting a mixture model by expectation maximization to discover motifs in biopolymers

,

Proc. Int. Conf. Intell. Syst. Mol. Biol.

,

1994

, vol.

2

(pg.

28

-

36

)

Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters

,

J. Mol. Biol

,

1987

, vol.

193

(pg.

723

-

750

)

et al.

Compact, universal DNA microarrays to comprehensively determine transcriptionfactor binding site specificities

,

Nat. Biotechnol

,

2006

, vol.

24

(pg.

1429

-

1435

)

et al.

Identifying transcription factor functions and targets by phenotypic activation

,

Proc. Natl. Acad. Sci. USA

,

2006

, vol.

103

(pg.

12045

-

12050

)

et al.

Profiling conditionspecific, genome-wide regulation of mRNA stability in yeast

,

Proc. Natl. Acad. Sci. USA

,

2005

, vol.

102

(pg.

17675

-

17680

)

et al.

Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE

,

Bioinformatics

,

2006

, vol.

22

(pg.

e141

-

e149

)

et al.

An improved map of conserved regulatory sites for Saccharomyces cerevisiae

,

BMC Bioinformatics

,

2006

, vol.

7

pg.

113

Explicit equilibrium modeling of transcription-factor binding and gene regulation

,

Genome. Biol

,

2005

, vol.

6

pg.

R87

et al.

BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes

,

Pac. Symp. Biocomput

,

2001

(pg.

127

-

138

)

2001

et al.

An algorithm for finding protein–DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments

,

Nat. Biotechnol

,

2002

, vol.

20

(pg.

835

-

839

)

et al.

DIP-chip: rapid and accurate determination of DNA-binding specificity

,

Genome Res

,

2005

, vol.

15

(pg.

421

-

427

)

et al.

Whole-genome comparison of Leu3 binding in vitro and in vivo reveals the importance of nucleosome occupancy in target site selection

,

Genome Res

,

2006

, vol.

16

(pg.

1517

-

1528

)

Non-independence of Mnt repressor-operator interaction determined by a new quantitative multiple fluorescence relative affinity (QuMFRA) assay

,

Nucleic Acids Res

,

2001

, vol.

29

(pg.

2471

-

2478

)

et al.

An ORFeome-based analysis of human transcription factor genes and the construction of a microarray to interrogate their expression

,

Genome Res

,

2004

, vol.

14

(pg.

2041

-

2047

)

et al.

Rapid analysis of the DNA-binding specificities of transcription factors with DNA microarrays

,

Nat. Genet

,

2004

, vol.

36

(pg.

1331

-

1339

)

et al.

Numerical Recipes in C++

,

2002

2nd edn

Cambridge University Press

et al.

Highthroughput SELEX SAGE method for quantitative modeling of transcription-factor binding sites

,

Nat. Biotechnol

,

2002

, vol.

8

(pg.

831

-

835

)

et al.

In vivo enhancer analysis of human conserved non-coding sequences

,

Nature

,

2006

, vol.

444

(pg.

499

-

502

)

et al.

Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes

,

Genome. Res

,

2005

, vol.

8

(pg.

1034

-

1050

)

Extensive low-affinity transcriptional interactions in the yeast genome

,

Genome. Res

,

2006

, vol.

16

(pg.

962

-

972

)

et al.

A new generation of JASPAR, the open-access repository for transcription factor binding site profiles

,

Nucleic Acids Res

,

2006

, vol.

34

(pg.

D95

-

D97

)

et al.

Defining the sequence-recognition profile of DNA-binding molecules

,

Proc. Natl. Acad. Sci. USA

,

2006

, vol.

103

(pg.

867

-

872

)

et al.

enoLOGOS: a versatile web tool for energy normalized sequence logos

,

Nucleic Acids Res

,

2005

, vol.

33

(pg.

W389

-

W392

)

© 2007 The Author(s)

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.