Bayesian search of functionally divergent protein subgroups and their function specific residues

Abstract

Motivation: The rapid increase in the amount of protein sequence data has created a need for automated identification of evolutionarily related subgroups from large datasets. The existing methods typically require a priori specification of the number of putative groups, which defines the resolution of the classification solution.

Results: We introduce a Bayesian model-based approach to simultaneous identification of evolutionary groups and conserved parts of the protein sequences. The model-based approach provides an intuitive and efficient way of determining the number of groups from the sequence data, in contrast to the ad hoc methods often exploited for similar purposes. Our model recognizes the areas in the sequences that are relevant for the clustering and regards other areas as noise. We have implemented the method using a fast stochastic optimization algorithm which yields a clustering associated with the estimated maximum posterior probability. The method has been shown to have high specificity and sensitivity in simulated and real clustering tasks. With real datasets the method also highlights the residues close to the active site.

Availability: Software ‘kPax’ is available at Author Webpage

Contact: pekka.marttinen@helsinki.fi

Supplementary information: Author Webpage

1 INTRODUCTION

Many finished and ongoing genome projects have produced massive amounts of potential protein sequence data, increasing the demand for automated sequence analysis methods. One particularly critical task in this context is the functional annotation of sequences. Such annotation is based on the observation that similar sequences are usually involved in similar biological processes in different organisms, even though the function of a sequence is often more conserved in evolution than the sequence itself. An additional aim is to identify the active site of the protein [the region of a protein that makes contact with its substrate(s)], and/or to find the most critical amino acids that define the function of the active site (Casari et al., 1995; Lichtarge et al., 2003). These positions are the ones where the functional constraints on the sequence are the strongest, such that there is very little or no variation at all in functional forms of the proteins. Small variations with these characteristics can often be linked to different substrate specificities. The remaining parts of the protein sequence usually have a considerably higher evolutionary mutation rate. The above two aims of sequence analysis are interconnected. Grouping sequences into functionally similar groups enables the search for conserved amino acids within a multiple sequence alignment. Conversely, the conserved amino acids are more informative for the assignment of sequences to functionally conserved groups.

One way to create a functional grouping is to cluster the sequences. Clustering differs from other methods, such as classification, in that it does not require any a priori knowledge about the functions of the proteins (Mirny and Gelfand, 2002). Still, existing clustering methods often require the selection of some parameter, such as the number of clusters, which has a strong effect on the results. It is therefore standard procedure to run the clustering with various parameter settings and to use one of the many possible clustering evaluation methods to find the optimal result. For a comparison of various approaches to protein clustering, see Donald and Shakhnovich (2005a).

A drawback of standard clustering (and clustering evaluation) approaches is that they usually consider all positions in the alignment equally important, whereas in reality only a small portion is associated with the active site of a protein. Often, only the amino acids within the active site show sequence similarity between distant relatives that still have the same biochemical function. Regular clustering would assign the same importance to all positions and would therefore be mainly influenced by the uninformative positions in the sequences.

We propose to separate the sequence positions into those that are informative (signal) and those that are uninformative (noise) for clustering purposes. The clustering is allowed to optimize the results using only the positions selected as signal. Similar ideas have been utilized in other fields under names such as feature selection (e.g. Law et al., 2004) and subspace clustering (Parsons et al., 2004). We present a fully Bayesian model that allows the simultaneous optimization of the number of clusters, the clustering of the sequences and the selection of the signal-preserving positions. The selected positions are those that optimally illuminate the diversity between the sequences, and are therefore the potential specificity-determining residues of the active site (e.g. Donald and Shakhnovich, 2005b). We consider using information provided by the model to further identify the specificity-determining residues. We validate the fundamental properties of the method with simulated datasets where the underlying cluster structure is known, and additionally with two protein families for which an expert-derived classification into subfamilies and a Pfam (Bateman et al., 2002) classification are available. In addition, we analyze the reporting of the functionally important residues and find it to correlate with the distance from the active site. A close correspondence can be seen between our method and that of Vaithyanathan and Dom (2000), who developed a Bayesian method for text classification with data divided into noise and signal dimensions. An earlier Bayesian method for protein sequence classification was introduced by Qu et al. (1998); however, they treated the number of classes as an a priori fixed quantity.

2 METHODS

Consider a set of n proteins whose sequences are aligned, e.g. by the ClustalW algorithm available at EBI (Thompson et al., 1994) or the more recent Maxflow approach introduced by Heger et al. (2004). Assume that the sequences are converted into binary strings of length d, where each element equal to unity reflects the existence of a particular amino acid at a specific location in the sequence. The resulting binary strings are in theory 20 times longer than the original amino acid sequences; however, columns consisting only of zeros are removed. Thus, the characteristic string $x_h$ of protein h takes values in the space

$$x_h = (x_h(1), \ldots, x_h(d)) \in \{0,1\}^d, \qquad h = 1, \ldots, n. \tag{1}$$
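For concreteness, the encoding can be sketched as follows. This is our own illustration rather than the kPax implementation, and the toy alignment is hypothetical; gaps simply produce no attribute.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the standard 20-letter alphabet

def encode_alignment(aligned_seqs):
    """Expand each alignment column into 20 binary attributes (one per
    amino acid type) and drop the columns consisting only of zeros."""
    n, length = len(aligned_seqs), len(aligned_seqs[0])
    X = np.zeros((n, 20 * length), dtype=np.uint8)
    for h, seq in enumerate(aligned_seqs):
        for pos, aa in enumerate(seq):
            col = AMINO_ACIDS.find(aa)   # gaps and unknowns return -1
            if col >= 0:
                X[h, 20 * pos + col] = 1
    keep = X.any(axis=0)                 # remove all-zero columns
    return X[:, keep]

# Toy example: three aligned, gapped sequences of length four
X = encode_alignment(["ACD-", "ACE-", "GCD-"])
print(X.shape)  # (3, 5): d = 5 attribute columns survive
```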

Here the elements 1,…,d of the binary strings are called attributes. The primary target of our analysis is the identification of a suitable integer k and a corresponding unsupervised classification (i.e. clustering) of the proteins into non-overlapping classes $s_1, s_2, \ldots, s_k$. This is conveniently represented in mathematical terms by a partition S of the n proteins. The biological relevance of this representation stems from the assumed evolutionary relatedness among the proteins allocated to the same class of S, which in turn is determined by statistical similarity in the characteristic strings. The general mathematical basis of unsupervised classification using a partition representation of strings from a finite alphabet has been investigated by Corander et al. (2006b). Specific applications of such a framework to the classification of molecular marker data have been considered earlier by Corander et al. (2004, 2006a).

On the basis of our prior knowledge we expect that for most attributes the probability of observing the value unity is the same in all classes. Attributes of this kind are not informative for classification purposes, and hence we call them uninformative. On the other hand, if the probability of observing some attribute equal to unity differs between classes, the attribute is called informative. Let D ⊆ {1,…,d} denote the set of all informative attributes. Furthermore, we expect that for each class $s_i$, $i = 1, \ldots, k$, there exists a subgroup of informative attributes $T_i$ that are characteristic for the class. These attributes correspond to conserved or group-specific attributes for which the attribute value is very likely to equal unity. We denote the groups of characteristic attributes by $T = (T_1, \ldots, T_k)$.

Let $x = (x_1, \ldots, x_n)$ represent all the observed data. Our predictive Bayesian model $M \in \mathcal{M}$ for the data consists of the three components specified above, $M = (S, D, T)$, where the model space $\mathcal{M}$ contains all triplets of the specified type. We proceed by defining the joint probability

$$p(x, M) = p(x \mid M)\, p(M), \tag{2}$$

where $p(\cdot \mid M)$ is the predictive distribution of the data given model M and $p(M)$ is a probability distribution representing our prior uncertainty about different models. Our goal is to find the model M that maximizes the posterior probability:

$$p(M \mid x) = \frac{p(x \mid M)\, p(M)}{\sum_{M' \in \mathcal{M}} p(x \mid M')\, p(M')}. \tag{3}$$

Thus, our classification framework aims at a simultaneous identification of the number of classes k, allocation of proteins into the classes and the characteristic attributes of the classes that are best able to predict the patterns observed in the data.

With the assumption of conditional independence of the observations over different classes and attributes, we obtain the predictive representation

$$p(x \mid M) = \int_{\Theta} \left( \prod_{i=1}^{k} \prod_{j \in D \cap T_i} \prod_{l=0}^{1} p_{ijl}^{\,n_{ijl}} \right) \left( \prod_{i=1}^{k} \prod_{j \in D \setminus T_i} \prod_{l=0}^{1} q_{ijl}^{\,n_{ijl}} \right) \times \left( \prod_{j \notin D} \prod_{l=0}^{1} w_{jl}^{\,n_{jl}} \right) p(\theta)\, d\theta. \tag{4}$$

Three distinct types of conditional probabilities are involved in this representation. First, $p_{ijl}$, $i = 1, \ldots, k$, $j \in D \cap T_i$, $l = 0, 1$, is the probability that attribute j equals l in class i when attribute j is known to be characteristic for class i. Second, $q_{ijl}$, $i = 1, \ldots, k$, $j \in D \setminus T_i$, $l = 0, 1$, are the corresponding probabilities for the uncharacteristic attributes in class i. Finally, $w_{jl}$, $j \notin D$, $l = 0, 1$, is the probability that an uninformative attribute equals l for any protein. The parameter θ jointly represents all the quantitative parameters $p_{ijl}$, $q_{ijl}$ and $w_{jl}$. The observed counts of zeros and ones for an informative attribute j in class i are denoted by $n_{ijl}$, $i = 1, \ldots, k$, $j \in D$, $l = 0, 1$; the observed counts of zeros and ones for an uninformative attribute j among all the proteins are denoted by $n_{jl}$, $j \notin D$, $l = 0, 1$. Note that for a fixed θ the integrand in (4) is proportional to a product of binomial distributions. Thus the prior beliefs p(θ) for the parameters of the model are most naturally represented by beta distributions, the conjugate family for the binomial likelihood (Bernardo and Smith, 1994):

$$p_{ij1} \sim \mathrm{Beta}(\lambda_1, \lambda_0), \qquad p_{ij0} = 1 - p_{ij1}, \tag{5}$$

$$q_{ij1} \sim \mathrm{Beta}(\mu_1, \mu_0), \qquad q_{ij0} = 1 - q_{ij1}, \quad \text{and} \tag{6}$$

$$w_{j1} \sim \mathrm{Beta}(\nu_1, \nu_0), \qquad w_{j0} = 1 - w_{j1}, \tag{7}$$

where $\lambda_l$, $\mu_l$, $\nu_l$, $l = 0, 1$, are parameters whose values are given by the user and represent the belief about how many zeros and ones we expect to observe in the three different situations. Given these prior beliefs, the integral in (4) can be computed analytically, analogously to Corander et al. (2004, 2006a,b), yielding:

$$p(x \mid M) = \prod_{i=1}^{k} \prod_{j \in D \cap T_i} \frac{\Gamma\left(\sum_{l=0}^{1} \lambda_l\right)}{\Gamma\left(\sum_{l=0}^{1} (\lambda_l + n_{ijl})\right)} \prod_{l=0}^{1} \frac{\Gamma(\lambda_l + n_{ijl})}{\Gamma(\lambda_l)} \times \prod_{i=1}^{k} \prod_{j \in D \setminus T_i} \frac{\Gamma\left(\sum_{l=0}^{1} \mu_l\right)}{\Gamma\left(\sum_{l=0}^{1} (\mu_l + n_{ijl})\right)} \prod_{l=0}^{1} \frac{\Gamma(\mu_l + n_{ijl})}{\Gamma(\mu_l)} \times \prod_{j \notin D} \frac{\Gamma\left(\sum_{l=0}^{1} \nu_l\right)}{\Gamma\left(\sum_{l=0}^{1} (\nu_l + n_{jl})\right)} \prod_{l=0}^{1} \frac{\Gamma(\nu_l + n_{jl})}{\Gamma(\nu_l)}. \tag{8}$$
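To make (8) concrete, the following sketch (our illustration, not the authors' code) evaluates the logarithm of one factor of the product, i.e. the contribution of a single attribute with n0 zeros and n1 ones under a Beta(a1, a0) prior; the full log p(x | M) is the sum of such terms over attributes and classes:

```python
from scipy.special import gammaln

def log_marginal_attribute(n0, n1, a0, a1):
    """Log of one factor in (8): the beta-binomial marginal likelihood
    of n0 zeros and n1 ones under a Beta(a1, a0) prior on P(value = 1)."""
    return (gammaln(a0 + a1) - gammaln(a0 + a1 + n0 + n1)
            + gammaln(a1 + n1) - gammaln(a1)
            + gammaln(a0 + n0) - gammaln(a0))

# An attribute with mostly ones scores far higher under a
# 'characteristic' prior (mean 0.95) than under a noise-like prior:
print(log_marginal_attribute(n0=1, n1=19, a0=0.05 * 20, a1=0.95 * 20))
print(log_marginal_attribute(n0=1, n1=19, a0=0.95 * 20, a1=0.05 * 20))
```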

To complete the classification model, we need to assign a prior distribution for the layer (S,D,T). This is determined by

$$p(S, D, T) = p(S)\, p(T, D \mid S), \tag{9}$$

where we set the probability

$$p(S) = c_1 \tag{10}$$

according to an uninformative choice; that is, we give equal probability to every partition of the n proteins, $c_1$ being the corresponding normalizing constant. We further specify

$$p(T, D \mid S) = c_2\, \xi^{\sum_{i=1}^{k} |D \cap T_i|}\, (1 - \xi)^{\sum_{i=1}^{k} |D \setminus T_i|}, \tag{11}$$

where $c_2$ is a constant, ξ is the user-specified probability that a given attribute is characteristic for some cluster, and |·| is the cardinality operator. This prior can be interpreted in the following way. Suppose we are examining an informative attribute in an arbitrary class. The event that the attribute is characteristic in this particular class has a Bernoulli distribution with success probability ξ. Formula (11) is then simply the product probability of such Bernoulli random variables over all the informative attributes and classes. This prior penalizes complex models with many classes and many informative attributes.

Bayes factors (Kass and Raftery, 1995) allow us to compare the relative goodness of any two models. The Bayes factor (BF) in favor of $M_1$ and against $M_2$ is given by:

$$B_{12} = \frac{p(x \mid M_1)\, p(M_1)}{p(x \mid M_2)\, p(M_2)}. \tag{12}$$

The uncertainty related to the allocation of any particular protein h in the optimal solution can be estimated in a similar fashion. Assume that in the optimal model protein h belongs to group i, and we would like to compare this with a case where the groups are kept the same except that protein h now belongs to some other group, say j. In this situation $M_1 = (S_1, D_1, T_1)$ refers to the optimal model and $M_2 = (S_2, D_2, T_2)$ to a competing model where the partition $S_2$ is the same as $S_1$ except for the allocation of protein h; $D_2$ and $T_2$ are chosen to correspond to partition $S_2$. The BF (12) in favor of $M_1$ and against $M_2$ can then be used as a measure of the plausibility of allocating protein h into group i instead of group j. BFs can also be computed analogously attribute-wise. In this case they provide a measure of an attribute being informative, i.e. characteristic for some group, against the alternative hypothesis that the attribute is uninformative. Thereby, the values of the BFs can be used to detect the functional areas in the protein sequences. The values of the mentioned BFs are provided in our implementation.
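On the log scale the BF (12) is simply a difference, so the reallocation comparison above reduces to the following minimal sketch (the score values are hypothetical):

```python
import math

def log10_bayes_factor(log_score_m1, log_score_m2):
    """Log10 of BF (12), where each argument is the log of
    p(x | M) p(M) for the corresponding model; positive values
    favor M1, e.g. the allocation in the optimal solution."""
    return (log_score_m1 - log_score_m2) / math.log(10)

# Comparing the optimal allocation of protein h (M1) with the same
# partition after moving h to a competing group (M2):
print(log10_bayes_factor(-10412.3, -10420.9))  # hypothetical values
```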

The interpretations of the different model parameters and the choice of their actual values are further considered in the Implementation section.

3 ALGORITHM

Here we provide details of an algorithm that searches the model space $\mathcal{M}$ to identify the optimal Bayesian solution to the classification problem, i.e. the triplet (S,D,T) maximizing the posterior probability (3). Since the denominator in (3) is the same for all models, it is sufficient to find the model maximizing the numerator, which is given by (2). As both terms $p(x \mid M)$ and $p(M)$ can be evaluated analytically, the problem can be viewed generally as an optimization problem for which numerous algorithms have been proposed in the literature. We have revised a greedy stochastic optimization algorithm developed in Corander et al. (2006a) for unsupervised classification of molecular marker data. The algorithm of Corander et al. (2006a), implemented in the BAPS 3 software, has been shown in the literature to perform very well in a wide variety of challenging situations.

The central search space of our algorithm contains all possible partitions S of the n proteins. The search can be initialized either manually or with a state obtained by fast heuristic classification methods. The algorithm proceeds by repeatedly applying different kinds of transition operators to the current state (corresponding to a particular value of M). Changes that lead to an improvement in the posterior probability (3) are accepted. After each proposed change to the partition S, the optimal groups of informative (D) and characteristic (T) attributes are identified, and using these, the putative partition is evaluated. The search terminates when no changes leading to an improvement in the posterior probability are available. The search operators available in our algorithm are:

  1. Move a protein from one class to another.
  2. Join two classes.
  3. Using some heuristic distance measure between the proteins, divide a class into smaller classes.

Owing to the relatively complex objective function, the sole use of local changes such as those provided by the first operator may easily yield only a local maximum of the posterior. Therefore, it is important to have a variety of informative distance measures in the last operator to guide the search and facilitate larger jumps in the search space. The distance measures that we utilize in the splitting step are as follows (a sketch of the first two is given after the list):

  1. Hamming distance between the sequences of the proteins.
  2. Hamming distance between the protein sequences computed using only such attributes that equal unity for at least one of the two considered proteins.
  3. Hamming distance between the sequences of the proteins, using only those attributes that are most diverse in the class (i.e. attributes for which the numbers of zeros and ones are about equal). This distance is defined only between the members of the class currently being processed.
  4. Hamming distance computed using only the attributes for which the total number of ones among all the proteins is above the general noise level.
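The following sketch illustrates measures 1 and 2 on the binary attribute vectors. It is our reading of the text, with the assumption (not stated explicitly in the paper) that each measure is normalized by the number of attributes it considers, which is what makes measure 2 differ from measure 1:

```python
import numpy as np

def distance1(x, y):
    """Measure 1: fraction of all attributes at which x and y differ."""
    return float(np.mean(x != y))

def distance2(x, y):
    """Measure 2: as measure 1, but restricted to attributes that equal
    unity for at least one of the two proteins (assumed normalized by
    the size of that attribute set)."""
    mask = (x == 1) | (y == 1)
    return float(np.mean(x[mask] != y[mask]))

x = np.array([1, 0, 0, 1, 0], dtype=np.uint8)
y = np.array([1, 0, 1, 0, 0], dtype=np.uint8)
print(distance1(x, y), distance2(x, y))  # 0.4 versus ~0.67
```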

At its core, the algorithm is a greedy stochastic search, as shown in the skeleton below. Once a partition S is fixed, the values of D and T can be obtained exactly. However, as with most greedy algorithms, there is no guarantee that the final solution S is really the optimal one. To obtain a higher level of certainty, the algorithm should be started from different initial assignments. If the algorithm ends up in different maxima on different runs, the obtained solutions can be compared using Bayes factors. Alternative initial values of S can easily be sought by using some of the distance measures described earlier and dividing the proteins into some putative number of clusters on the basis of their heterogeneity. Also, the more search operators the greedy algorithm has available at a given point, the more likely it is to find the optimal solution. Our algorithm provides local operators moving a single protein as well as global operators that treat groups of proteins.
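The overall strategy can be summarized by the following skeleton; the operator and scoring functions are placeholders standing for the routines described above, not the actual kPax code:

```python
def greedy_search(x, init_partition, operators, log_posterior):
    """Greedy stochastic search over partitions S. For every candidate
    partition, log_posterior is assumed to recompute the optimal D and
    T exactly before scoring, as described in the text; only changes
    improving the posterior (3) are accepted."""
    S = init_partition
    best = log_posterior(x, S)
    improved = True
    while improved:                    # stop when no operator helps
        improved = False
        for propose in operators:      # move, join and split operators
            for S_new in propose(x, S):
                score = log_posterior(x, S_new)
                if score > best:       # accept only improvements
                    S, best, improved = S_new, score, True
    return S, best
```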

The run-time of the algorithm depends strongly on the initial state of the search and the order in which different search operators are carried out. The most time-consuming search operator is the first one, which consecutively tries to move every protein to every other cluster. The time-complexity of this search operator is proportional to the product of the size of the dataset and the number of clusters. This can be used as a rule of thumb to estimate the total run-time of the algorithm. Table 1 shows approximate run-times of the algorithm on a 1.3 GHz P4 processor for datasets of varying size.

The most obvious alternative to a greedy search in the present setting would be MCMC simulation, see e.g. Robert and Casella (2005). However, the size of the search space and the amount of data are often so vast that MCMC is very unlikely even to find the best solution in a reasonable time, let alone yield an acceptable approximation to the posterior distribution over the solutions, unless applied in advanced facilities with a massively parallel computing architecture. The relative time efficiency is indeed the greatest benefit of the greedy algorithm approach.

4 IMPLEMENTATION

We now consider the prior specification for our Bayesian model using biologically relevant knowledge about proteins.

The value of ξ, the prior probability that an arbitrary attribute is characteristic for some class, is needed in the definition of the joint prior (9) for the qualitative layer of the model. In our default implementation ξ is chosen to equal 0.01. This reflects our prior knowledge that each evolutionary group has its characteristic attributes, numbering in the order of tens, among all the attributes, of which there are several thousand. In preliminary simulation studies all values between 0.001 and 0.02 resulted in the same optimal partition. The reason for introducing this parameter was to prevent the emergence of erroneous singleton classes. As the amount of data increases, the influence of this prior becomes negligible, and the exact value of ξ is of no crucial importance.

We also need to specify the prior distributions for the model parameters $p_{ij1}$, $q_{ij1}$ and $w_{j1}$, which define the probabilities of an attribute being equal to unity when the attribute is known to be characteristic, uncharacteristic or uninformative, respectively. A straightforward choice for $\mu_l$, $\nu_l$, $l = 0, 1$, defining the distributions of $q_{ij1}$ and $w_{j1}$, forces the expected values of $q_{ij1}$ and $w_{j1}$ to equal the observed frequency of ones in the whole data. Similarly, $\lambda_1$ and $\lambda_0$, the hyperparameters of the beta distribution for $p_{ij1}$, should be chosen so that the expected value of $p_{ij1}$ equals a reasonably high probability. The exact value of this probability does not seem to be essential, as preliminary simulations showed that values between 0.9 and 0.98 yield highly similar results. In our implementation we have used the value 0.95 for the expected value of $p_{ij1}$. However, these restrictions do not yet determine the distributions of $p_{ij1}$, $q_{ij1}$ and $w_{j1}$ completely. A crucial question is how to choose the sums $\lambda_1 + \lambda_0$, $\mu_1 + \mu_0$ and $\nu_1 + \nu_0$, which determine the level of variation in the prior. For example, the choices $(\lambda_1, \lambda_0) = (0.95, 0.05)$ and $(\lambda_1, \lambda_0) = (95, 5)$ both define the expected value of $p_{ij1}$ to be 0.95.

The question of how to choose the sums $\lambda_1 + \lambda_0$, $\mu_1 + \mu_0$ and $\nu_1 + \nu_0$ can be approached theoretically. The value of the sum is related to the amount of prior information fed into the model. If the sum is small, say one, the prior information becomes negligible after only a few observations. Such a choice may lead to contradictory conclusions, where the model indicates that some attribute is characteristic in a class although every protein allocated there has the attribute value equal to zero. This is because the first observed zeros update the distribution of the characteristic attribute such that it actually becomes preferable to observe even more zeros. Therefore the values of the sums should not be too small. Choosing the values of the sums to be much larger than the number of observations corresponds to a situation where the probabilities $p_{ij1}$, $q_{ij1}$ and $w_{j1}$ have a point distribution with the probability mass completely concentrated at the expected value. Such an approach might be reasonable, although it does not utilize all the flexibility allowed by a model-based approach.
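The effect of the prior strength is easy to verify numerically. In the brief sketch below (our illustration), both priors have mean 0.95, but after five observed zeros the weak prior is already overwhelmed by the data:

```python
# Under a Beta(a1, a0) prior, the posterior predictive probability of
# observing a one after n0 zeros is a1 / (a1 + a0 + n0).
def predictive_one(a1, a0, n0):
    return a1 / (a1 + a0 + n0)

print(predictive_one(0.95, 0.05, n0=5))  # weak prior:   ~0.16
print(predictive_one(95.0, 5.0, n0=5))   # strong prior: ~0.90
```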

We have analyzed more than 200 simulated datasets with different choices of the values of the sums $\lambda_1 + \lambda_0$, $\mu_1 + \mu_0$ and $\nu_1 + \nu_0$. The simulations suggested that these values should not be constant but should depend on the size of the corresponding evolutionary group identified from the data. This led us to the following prior distributions for $p_{ij1}$, $q_{ij1}$ and $w_{j1}$:

$$p_{ij1} \sim \mathrm{Beta}(0.95\,\phi_i,\ 0.05\,\phi_i), \tag{13}$$

$$q_{ij1} \sim \mathrm{Beta}(f\phi_i,\ (1 - f)\phi_i), \quad \text{and} \tag{14}$$

$$w_{j1} \sim \mathrm{Beta}(fn,\ (1 - f)n), \tag{15}$$

where n is the total number of proteins and f is the overall frequency of ones in the data. The scaling factor $\phi_i$ depends on the number of proteins in class i. We set

$$\phi_i = \max(|s_i|, 10), \tag{16}$$

where $|s_i|$ is the number of proteins in class i. This choice of the hyperparameters preserves the expected values formulated earlier. The value of the sums $\lambda_1 + \lambda_0$ and $\mu_1 + \mu_0$ is thus the number of proteins in the underlying group, except for groups of size <10, for which the constant value 10 is used. This choice allows some flexibility for the observed counts of zeros and ones while preventing the contradictory behavior of the model described above. In the simulations this strategy proved to have higher resolution and a better ability to handle very small evolutionary groups than the alternative strategy with the sums of the parameters fixed to some large value.

Recall that the amount of prior information fed into the model corresponds to the sum of the corresponding hyperparameters. Thus, the amount of prior information in the model of an informative attribute is obtained by summing over the groups the values $\lambda_1 + \lambda_0$ or $\mu_1 + \mu_0$, depending on whether or not the attribute is characteristic in a group. If all the groups have sizes >10, it follows from (13), (14) and (16) that the amount of prior information in the model of an informative attribute equals the total number of proteins in the data. To make the comparison fair with the competing model in which attribute j is uninformative, so that all the observations are described by a single pair of parameters $(w_{j1}, w_{j0})$, the sum of the corresponding hyperparameters $\nu_1 + \nu_0$ in (15) was also chosen to equal the number of proteins in the data. The simulations support this choice, although they also show that it is not the only possibility. In fact, any value of the sum $\nu_1 + \nu_0$ that is about the same size as, or larger than, the number of possible observations would yield quite similar results.
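A sketch of the resulting hyperparameter scheme (13)-(16), reflecting our reading of this section rather than the released implementation:

```python
def hyperparameters(group_size, n_proteins, f, p_char=0.95):
    """Hyperparameters for one class following (13)-(16): the prior
    means are fixed (p_char for characteristic attributes, the overall
    frequency f of ones otherwise), while the prior strength scales
    with the group size, floored at 10, and with n for the
    uninformative attributes."""
    phi = max(group_size, 10)                      # equation (16)
    lam = (p_char * phi, (1.0 - p_char) * phi)     # (lambda1, lambda0)
    mu = (f * phi, (1.0 - f) * phi)                # (mu1, mu0)
    nu = (f * n_proteins, (1.0 - f) * n_proteins)  # (nu1, nu0)
    return lam, mu, nu

print(hyperparameters(group_size=4, n_proteins=296, f=0.05))
```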

5 RESULTS

Our method has been tested with real data and under different simulation scenarios. The results from the analyses of simulated data are presented collectively in Table 2. Each row in Table 2 summarizes the results from analyses of five different simulated datasets. The datasets were created by simulating proteins from five different groups, each having its own characteristic attributes and its own frequencies of ones for characteristic and uncharacteristic attributes. The sizes of the groups were chosen randomly; however, the sizes were normalized to sum to a total of 100 proteins. In some of the test scenarios the groups also had a number of characteristic attributes in common. We use the simple Rand index (Hubert and Arabie, 1985) as a measure of similarity between the true and the obtained partition. The Rand index between two partitions is defined as

$$R = \frac{a + b}{\binom{n}{2}},$$

where a is the number of pairs of proteins that are in the same class in both partitions and b is the number of pairs that are in different classes in both partitions. All the simulated datasets were also analyzed with the k-means algorithm (e.g. Ripley, 1996) for reference. k-means was run with the correct number of clusters. We also tried k-means with two different distance measures, distances 1 and 2 as described in the Algorithm section.
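For reference, the index can be computed directly from its definition (a standard formula, not specific to this paper):

```python
from itertools import combinations

def rand_index(labels1, labels2):
    """Rand index: the fraction of protein pairs on which two
    partitions agree (same class in both, or different in both)."""
    pairs = list(combinations(range(len(labels1)), 2))
    agree = sum((labels1[i] == labels1[j]) == (labels2[i] == labels2[j])
                for i, j in pairs)
    return agree / len(pairs)

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0 (labels renamed)
```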

Table 1

Running times of the algorithm

Data        Proteins  Attributes  Clusters  Run-time
Simulated   100       7000        5         20 min
Ureases     296       7344        23        2 h
Crotonases  915       18020       157       5 days

The sizes of the analyzed datasets with running times of the algorithm on a 1.3 GHz P4 processor. The run-time depends strongly on the initial state of the search algorithm. The presented run-times are approximations for a standard run using a default initial state.


The results in Table 2 show that our method is capable of inferring the underlying structure with very high accuracy. The first four rows show results from scenarios where the probability that a characteristic or uncharacteristic attribute equals one varies randomly within a small interval between groups. In such situations our method worked optimally: the true underlying structure was inferred correctly in all the test simulations. Changing the total number of attributes or the number of characteristic attributes did not seem to affect the performance, nor did small variations in the intervals for the frequencies of ones. The algorithm was able to identify small as well as larger classes correctly.

Table 2

Results with simulated data

   Attr  Own  Common  Freq.1   Freq.2    Avg RI  k-means  k-means2  False  True
1  7000  5    0       [0.8 1]  [0 0.15]  1.00    0.57     0.96      0.04   4.94
2  7000  3    0       [0.8 1]  [0 0.15]  1.00    0.71     0.93      0.06   2.95
3  3000  5    5       [0.8 1]  [0 0.15]  1.00    0.78     0.98      0.01   9.88
4  7000  5    5       [0.7 1]  [0 0.2]   1.00    0.78     0.90      0.03   9.10
5  7000  5    5       0.95     0.07      0.86    0.64     0.91      0.08   9.40
6  7000  10   5       0.95     0.07      0.90    0.57     0.93      0.08   14.5

Each row presents a summary from analyses of five simulated datasets. Each dataset had 100 proteins simulated from five different groups; the sizes of the groups were randomized. The columns are: Attr, total number of attributes in the simulated data; Own, number of characteristic attributes in each group that are not characteristic for any other group; Common, number of attributes that are characteristic in every group; Freq.1, the probability that a characteristic attribute equals unity, chosen uniformly from the shown interval for each group (when only a single value is shown, the same value was used for every characteristic attribute in every group); Freq.2, the probability that an uncharacteristic attribute equals unity, chosen uniformly from the shown interval for each group; Avg RI, average Rand index between the true and estimated partitions in five simulations; k-means/k-means2, average Rand index between the true partition and the partition obtained by k-means using the correct number of clusters and distance measure 1/2, described in the Algorithm section; False, the average number of erroneous characteristic attributes (per individual) inferred by the algorithm; True, the average number of correctly inferred characteristic attributes.


Rows 5 and 6 of Table 2 show results from a more challenging scenario in which the frequencies of ones in characteristic and uncharacteristic attributes were fixed, so that the same values were used for every attribute in every group. In this situation small differences in the frequencies provide no help for the classification, and information is obtained only through the characteristic attributes. Even then the performance of the algorithm was surprisingly good, with average Rand index values of 0.86 and 0.90 in the two settings, respectively. A closer inspection of the obtained partitions revealed that even in these simulations distinct classes were still identified with high accuracy. However, the number of classes was slightly overestimated, so that some of the true underlying groups were further divided into subgroups. Dividing one large or two medium-sized groups into two yields a Rand index value somewhere around 0.9. Comparison with k-means shows that in most cases our method outperforms both versions of the k-means algorithm. In the last two rows, where our method slightly overestimated the number of clusters, the second version of the k-means algorithm benefited from knowing the true number of clusters and achieved slightly higher Rand index values than the presented method.

The last two columns of Table 2 show the average numbers of falsely and correctly inferred characteristic attributes. Inspection of these columns shows that our method is capable of inferring the correct characteristic attributes for the groups with high accuracy. The lowest recognition rate of the correct attributes occurred on row 4, and even then, on average 9.1 attributes out of 10 were identified. The number of falsely identified characteristic attributes is also very low. Even when the number of clusters was slightly overestimated on rows 5 and 6, the inferred clusters were subgroups of the correct clusters and had almost the same groups of characteristic attributes as the correct clusters.

Besides the simulated datasets, we also tested the method with two protein families for which we had an expert-curated classification into subfamilies: ureases (Heger and Holm, 2003) and crotonases (A. Sivakumar, personal communication). The sizes of the datasets are shown in Table 1. We evaluated the obtained partitions by comparing them to the expert classifications using the Rand index. We also used the Pfam classification (Bateman et al., 2002) as another reference whenever it was available for a sequence. For comparison, we analyzed the same datasets with two simple standard clustering methods, k-means and linkage clustering (e.g. Ripley, 1996), and one state-of-the-art method, SPC (Tetko et al., 2005). k-means and linkage clustering were run using the expert opinion of the number of underlying groups. We also tried k-means and linkage clustering with different distance measures, of which distance 2, described in the Algorithm section, had the best performance. SPC was run with the default settings and with small variations of them; from all the results only the most reasonable were chosen for the analysis. The results are presented collectively in Tables 3 and 4. The comparison of our method with SPC is the most interesting, since these are the two methods capable of inferring the underlying number of clusters. Table 3 shows that the performances of the two methods do not differ much when measured with the Rand index. However, SPC did not assign a classification to all of the proteins: for ureases, SPC classified 264/296 proteins, and for crotonases 623/915 proteins. The results in Tables 3 and 4 were calculated using only these proteins. An alternative strategy would have been to consider all non-clustered proteins as singleton clusters, but this would have led to a severe overestimation of the number of clusters. Table 4 shows that the number of clusters inferred by our method is closer to the expert opinion than that obtained from the SPC analysis. However, it must be acknowledged that SPC is extremely fast compared with our method; for instance, the analysis of the crotonases was completed within tens of seconds.

Table 3

Comparison of different clustering methods with reference clusterings

Method             Urease (man.)  Urease (Pfam)  Crotonase (man.)  Crotonase (Pfam)
Introduced method  0.96           0.64           0.87              0.75
SPC                0.89           0.62           0.87              0.76
Kmeans2            0.96           0.64           0.89              0.78
Single linkage     0.68           0.61           0.84              0.70
Complete linkage   0.87           0.52           0.89              0.78

Results given by five different clustering methods were compared with reference clusterings using the Rand index. The comparison was performed using two sets of proteins, ureases and crotonases, and the reference clustering was either the expert hand-curated clustering (man.) or the Pfam clustering. Kmeans2 refers to the k-means algorithm using distance measure 2, as described in the Algorithm section, with the correct number of clusters. Both linkage-based methods were also run with the correct number of clusters.


Table 4

Inferred numbers of clusters for the urease and crotonase datasets by the introduced method and SPC, together with the expert opinion of the true number

                   Urease  Crotonase
Expert opinion     23      157
Introduced method  24      103
SPC                17      63


We also wanted to evaluate the selection of functionally important features from the protein sequences. Owing to space constraints we limit this analysis to the urease family. Our program enables two ways of monitoring the important amino acid positions: a cluster-specific feature representation and a global feature representation. The cluster-specific representation shows the sequences as separate clusters, with the cluster-specific over-represented features. Amino acids in an important position in the alignment are shown in capital letters, whereas the rest of the alignment is shown in lower-case letters. Owing to its large size this is shown only in the Supplementary Material.

The global feature representation plots the Bayes factor (BF) between the signal model and the noise model for each position of the amino acid sequence, combining the importance of the position over all the clusters. The clustering actually produces 20 attribute values (one for each amino acid type) for each position. To extract the most distinguishing features from this information we combined two measures: (1) the maximum attribute value and (2) the sum of all attribute values for each alignment position. The results are shown in Figure 1a: positions that were considered important by the clustering process receive high values. Notice the two types of well-performing areas: regions where both the maximum value and the sum are high, and regions where the sum is high although the maximum is not. The first case corresponds to regions where the entire signal is carried by one conserved amino acid type (such as positions 184 and 55), which are therefore functionally very critical positions. The latter case represents positions where a few amino acid types occurring in combination contain a considerable signal (such as positions 293 and 268), making them the strongest candidates for specificity-determining residues.
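Concretely, given the per-attribute log BFs produced by the model, the two profiles can be assembled as in the sketch below (ours; it assumes that the all-zero attribute columns removed during encoding are restored with a log BF of zero, so that each alignment position owns exactly 20 attributes):

```python
import numpy as np

def bf_profiles(log_bf, n_positions):
    """Combine per-attribute log Bayes factors (signal versus noise)
    into the two per-position profiles: the sum over the 20 amino acid
    attributes of each position, and the maximum among them."""
    per_pos = np.asarray(log_bf).reshape(n_positions, 20)
    return per_pos.sum(axis=1), per_pos.max(axis=1)

# log_bf holds one value per binary attribute in position-major order
log_bf = np.random.rand(20 * 5)  # hypothetical values for 5 positions
bf_sum, bf_max = bf_profiles(log_bf, n_positions=5)
```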

Fig. 1

(a) Bayes factors of the signal model versus the noise model, plotted for each position in the alignment. The blue line shows the sum of the logarithms of the BFs of the 20 different amino acids per position; the green line shows the logarithm of the maximum BF at a position. Asterisks mark the known important positions from the active site, and circles mark the positions that were close to the active site in the 3D structure, showing the first and the last (or the only) position that falls within the closest 30 positions [for details see (d)]. Notice that in some places an asterisk and a circle point to the same position. Asterisks and circles are shown only for the sum profile. Observe how all the circles are clearly elevated from the profile baseline. (b) Negative Shannon entropy (SE) per position in the alignment, calculated directly from the original alignment. The continuous profile shows the results obtained with method 1 and the dotted line those with method 2 of the SE calculation; method 1 is used in the analysis. Asterisks and circles have the same meaning as in (a). Note that although most asterisks show a clear peak, many circles lie practically at the profile baseline. (c) Negative SE per position in the alignment, calculated using the obtained clusters. Both methods of SE calculation are shown with continuous lines, as they give almost identical results. Asterisks and circles are shown for method 1 and again have the same meaning as in (a). Notice how, differently from (a) and (b), the baseline of these profiles is significantly higher at the first positions, the last positions and around position 200. (d) Negative distance to a ligand or metal ion, showing the positions close to the active site. Areas without information are caused by gaps in the structure-target sequence alignment. Notice that all the asterisks, except the one at position 147 (more details on this position in the text), map very close to the active site. Also notice how the circles mark the borders of the active-site areas, although around position 246 only a single position was selected as a close position.

Similar information can be produced by calculating the Shannon entropy (SE) from the original sequence alignment (e.g. Weiss et al., 2000) without any clustering (Fig. 1b), or by calculating it separately for each cluster and averaging the cluster scores (Fig. 1c) (Mihalek et al., 2004). We used the obtained clustering result to create the profile in Figure 1c, so the differences between Figures 1a and 1c are caused only by the different measure. There are also two ways of calculating the SE: comparing the number of each amino acid type to the total number of amino acids at the position, or comparing it to the total number of sequences in the dataset. The profile in Figure 1b is calculated using the first method, as it gave more reasonable results than the second (shown with a dotted line in Fig. 1b), whereas Figure 1c shows both methods, as they give similar results.

To ease the comparison we have marked six residues from the active site (five metal-binding, one catalytic; Swiss-Prot entry PYRC_METJA) with asterisks in all the profiles. These positions differ in that 61, 63, 184, 222 and 295 are invariant, staying similar in most clusters, whereas position 147 varies, having either lysine, glutamate or no feature at all in different clusters. Entropy calculated from the original alignment (Fig. 1b) finds the invariant positions (ranks 6, 5, 2, 11 and 1 for 61, 63, 184, 222 and 295) whereas position 147 is totally lost (rank 138). SE obtained with clustering (Fig. 1c) gives mainly the same peaks as Figure 1b; now position 147 obtains a weak peak. Still, the most striking feature in Figure 1c is the strong local bias in the baseline (the average of local minima), especially in the region around position 200. The varying baseline is seen with both methods. Analysis of the alignment (see the additional file urease_output.txt) revealed that this region consists mostly of gaps. SE is based on a maximum-likelihood model, and such models can have problems when a varying amount of information is available, as at positions with a larger number of gaps.

The maximum BF profile picks out most invariant positions nicely (ranks 3, 4, 1 and 5 for 61, 63, 184 and 295, respectively), with the surprising exception of position 222 (rank 26), which does not perform as well as with SE. Still, the largest benefit of the BF approach is seen in the sum profile, which can monitor the combinatorial signal: now position 147 receives a clearer signal (rank 38). Notice also that the BF profiles do not suffer from the varying baseline level observed in the SE with clustering (Fig. 1c).

Invariant positions like those above are often easy to find from an alignment simply by eye (Holm and Sander, 1998). The positions that are conserved only within clusters and vary significantly between them are harder to find. As these positions are not as well known, we used structural information (from PDB entry 1xge, PYRC_ECOLI) and calculated the minimum distance from every residue to any heteroatom (metal ion and ligands). The obtained negative distance profile is shown in Figure 1d. We mapped the known active-site positions to this profile, showing that, apart from position 147, all are placed within 2 Å of a heteroatom. The exception in the results in the area around position 150 is due to the poor quality of the alignment, and the distance results for this region seemed unreliable. The correct distance is manually added as another, higher asterisk at position 147.

The obtained distance profile shows many regions close to the active site. We further highlighted the positions within the top 30 results by marking the first and last position of each close region with a red circle in the profiles. Note that these circles occasionally coincide with known active-site positions. It is now clear that the BF sum profile gives better scores for areas close to the active site, although some of the positions are also picked up by the SE methods. This can be seen in the regions at positions 63 and 295, where the SE scores drop dramatically (practically to the minimum) whereas the BF sum stays quite high, agreeing better with the distance profile. The region at position 90 receives a clear peak from the BF sum whereas the SE profiles show no signal at all. The areas at positions 123 and 269 are close to stronger peaks in the BF sum but are also elevated themselves, whereas in the SE profiles these areas have no signal. The SE methods correctly pick the point at 246, but only the BF sum also considers the neighborhood of this point important, again agreeing better with the distance profile. The low scores in Figure 1b indicate that many of these positions cannot be found from the alignment without the clustering (see alignment.txt in the Supplementary Material). To further highlight the correlation between the distance and the BF sum, we show a scatter plot of these two values (Fig. 2). Notice that all the positions closer than 5 Å to the heteroatoms get a score >50. Also, most of the positions with a score >150 are within 10 Å of the active site. We observed an approximate border at the 10 Å distance: within it most of the signal lies between 50 and 250, whereas beyond 10 Å the signal ranges from ∼0 to 150. Especially the positions that receive a strong signal from the BF and lie within 10 Å of the active site seem like good candidates for functional analysis (for a detailed analysis of some positions see the Supplementary Material). The other methods did not show a similar correlation in the scatter plots (see the figure in the Supplementary Material).

Fig. 2

Scatter plot showing the correlation between the (negative) distance to the active site (X) and the sum of log(BF) (Y). The vertical border line highlights the difference in the scores when the distance goes from <10 Å to >10 Å. The horizontal border at 150 highlights that most positions with this score are within 10 Å of the active site. The horizontal border at 50 similarly highlights that none of the positions within 5 Å, and few within 10 Å, get a BF sum <50.

6 DISCUSSION

We have presented a method for the simultaneous clustering of proteins into evolutionary groups and the identification of the active areas in the sequences of the proteins in the different groups. The method has been shown to perform well in a variety of simulated scenarios and with two empirical datasets. The potential of the method for identifying functionally important residues has been highlighted by showing the correlation of the BF sum with the distance from the heteroatoms at the active site. Recognition of functionally important residues is one of the key problems in the biosciences. Although the results obviously need evaluation with other structures, they show that the BF sum should be among the methods considered when analyzing key residues. The greatest advantage of our method over other methods comes from the possibility of a probabilistic (predictive) comparison of any two solutions. The fully Bayesian approach also provides the user with an estimate of the most likely number of different groups present in the data. The statistical power to detect different groups in the data is increased by the fact that the model is able to learn which areas in the sequences are most relevant for classification purposes. Also, biological information concerning the characteristics of the functional areas in the sequences is used in the formulation of the model.

It should be noted that the presented method can be applied at three different levels: (1) use of the proposed algorithm and model to find the needed clustering solution and function-specific residues, as done here; (2) evaluation of a large group of clustering results with the specified model and selection of the optimal clustering and the corresponding function-specific residues; and (3) analysis of a manually selected optimal cluster solution for the selection of functionally important residues.

A considerable effort was devoted to the specification of the hyperparameters of the model. However, it should be stressed that, although this may appear a tedious task, at least in our experiments all reasonable values for the parameters gave results that were in accordance with each other. We have managed to define the parameters in a way that can be justified theoretically and is supported by a large number of simulations. One advantage of scaling the hyperparameters as described in Section 4 is that the resulting model can be expected to perform well with datasets of widely varying magnitude. This will allow us to develop the software further by simply improving the computational efficiency of the algorithm, while the model itself remains the same. A further advantage of our strategy is that it does not require any input from the user regarding the parameters of the model. This saves the time potentially needed for testing a range of possible alternatives for the model (hyper)parameters.

Owing to the overwhelming size of the model space in a typical application, it seems to be beyond the capabilities of any conceivable method to provide a reliable estimate of the probability distribution over the whole model space. Therefore we suggested the use of Bayes factors to locally estimate the uncertainty related to the optimal solution. Bayes factors provide a way to compare any two models; they also allow us to estimate the uncertainty related to the allocation of any single protein into the classes. The properties of the obtained solution can also be assessed through the sets of characteristic attributes inferred for each group. First, they provide a way to detect a putative overestimation of the number of classes: if two classes have nearly equal sets of characteristic attributes, it is clear that in reality they are very close to each other. Second, if some class has no or very few characteristic attributes, it may indicate that the class does not form an evolutionary group but may consist of outlier proteins for which the classification is unclear. This latter kind of behavior has been observed in some of the analyses that we have performed.

One drawback of our method is the requirement of a reliable multiple sequence alignment. Although some practical solutions for multiple sequence alignment exist, the problem remains open for further research. In particular, no truly Bayesian model-based methods have been presented for the problem, and yet these would seem to provide a natural way to estimate the uncertainty. One of our future goals is to take the uncertainty related to the alignment better into account. We also consider the possibility of performing multiple sequence alignment and clustering of the sequences simultaneously. The model-based approach could be further utilized in the comparison of the different clusters present in the estimated optimal model. In particular, it would be feasible to do a search (e.g. MCMC) in the space of different ways to partition the clusters into larger groups. The results of such an analysis could be summarized using a dendrogram based on probabilities, in a similar manner as in Corander et al. (2004), yielding a hierarchical clustering. Another issue where our method could be improved is to better utilize the biologically meaningful information related to the various types of conservation observed in amino acids. This kind of detailed information could be brought into our model by a careful modification of the prior distributions. However, even without the suggested improvements our method already shows a reasonable performance.

Another area where our method lags behind the distance-based methods (e.g. SPC) is computation time. This is a consequence of the model-based approach and the resulting need to search through a vast space of different models. However, the current implementation of the method is capable of dealing with challenging real-life data in a reasonable time (e.g. the crotonases). Also, we feel that the advantages of the model-based approach are worth the introduced computational cost. We plan to improve the computational efficiency of our method by implementing a parallel version of the algorithm that could utilize a multi-processor computer architecture. At the same time we will introduce more intelligent steps into the search algorithm, possibly by combining our method with the NMF method presented in Heger and Holm (2003). These advances will allow users to analyze datasets on a much larger scale than presented here. We believe that the method presented in this paper has the potential to become a standard tool in the analysis of protein sequence data.

This work was supported by the research funds of the University of Helsinki and Tekes (grant no. 40672/01).

Conflict of Interest: none declared.

REFERENCES

Bateman,A. et al. (2002) The Pfam protein families database. Nucleic Acids Res., 30, 276-280.
Bernardo,J.M. and Smith,A.F.M. (1994) Bayesian Theory. Wiley, Chichester.
Casari,G. et al. (1995) A method to predict functional residues in proteins. Nat. Struct. Biol., 2, 171-178.
Corander,J. et al. (2004) BAPS 2: enhanced possibilities for the analysis of genetic population structure. Bioinformatics, 20, 2363-2369.
Corander,J. et al. (2006a) Bayesian identification of stock mixtures from molecular marker data. Fish. Bull., in press.
Corander,J. et al. (2006b) Bayesian unsupervised classification algorithms based on parallel search strategy. Pattern Recogn., under revision.
Donald,J.E. and Shakhnovich,E.I. (2005a) Determining functional specificity from protein sequences. Bioinformatics, 21, 2629-2635.
Donald,J.E. and Shakhnovich,E.I. (2005b) Predicting specificity-determining residues in two large eukaryotic transcription factor families. Nucleic Acids Res., 33, 4455-4465.
Heger,A. and Holm,L. (2003) Sensitive pattern discovery with 'fuzzy' alignments of distantly related proteins. Bioinformatics, 19, i130-i139.
Heger,A. et al. (2004) Accurate detection of very sparse motifs. J. Comput. Biol., 11, 848-857.
Holm,L. and Sander,C. (1998) An evolutionary treasure: unification of a broad set of amidohydrolases related to urease. Proteins, 28, 72-82.
Hubert,L. and Arabie,P. (1985) Comparing partitions. J. Classif., 2, 193-218.
Kass,R.E. and Raftery,A.E. (1995) Bayes factors. J. Am. Stat. Assoc., 90, 773-795.
Law,M.H.C. et al. (2004) Simultaneous feature selection and clustering using mixture models. IEEE Trans. Pattern Anal. Mach. Intell., 26, 1154-1166.
Lichtarge,O. et al. (2003) Accurate and scalable identification of functional sites by evolutionary tracing. J. Struct. Funct. Genomics, 4, 159-166.
Mihalek,I. et al. (2004) A family of evolution-entropy hybrid methods for ranking protein residues by importance. J. Mol. Biol., 336, 1265-1282.
Mirny,L.A. and Gelfand,M.S. (2002) Using orthologous and paralogous proteins to identify specificity-determining residues in bacterial transcription factors. J. Mol. Biol., 321, 7-20.
Parsons,L. et al. (2004) Subspace clustering for high dimensional data: a review. ACM SIGKDD Explor. Newslett., 6, 90-105.
Qu,K. et al. (1998) Bayesian protein family classifier. Proc. Int. Conf. Intell. Syst. Mol. Biol., 6, 131-139.
Ripley,B.D. (1996) Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge.
Robert,C.P. and Casella,G. (2005) Monte Carlo Statistical Methods, 2nd edn. Springer, New York.
Tetko,I.V. et al. (2005) Super paramagnetic clustering of protein sequences. BMC Bioinformatics, 6, 82.
Thompson,J.D. et al. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673-4680.
Vaithyanathan,S. and Dom,B. (2000) Model-based hierarchical clustering. In Proceedings of the 16th Annual Conference on Uncertainty in Artificial Intelligence (UAI-00). Morgan Kaufmann Publishers, San Francisco, CA, pp. 599-608.
Weiss,O. et al. (2000) Information content of protein sequences. J. Theor. Biol., 206, 379-386.

Author notes

Associate Editor: Dmitrij Frishman

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org