Multiple alignment by sequence annealing (original) (raw)

Abstract

Motivation: We introduce a novel approach to multiple alignment that is based on an algorithm for rapidly checking whether single matches are consistent with a partial multiple alignment. This leads to a sequence annealing algorithm, which is an incremental method for building multiple sequence alignments one match at a time. Our approach improves significantly on the standard progressive alignment approach to multiple alignment.

Results: The sequence annealing algorithm performs well on benchmark test sets of protein sequences. It is not only sensitive, but also specific, drastically reducing the number of incorrectly aligned residues in comparison to other programs. The method allows for adjustment of the sensitivity/specificity tradeoff and can be used to reliably identify homologous regions among protein sequences.

Availability: An implementation of the sequence annealing algorithm is available at Author Webpage

Contact: sariel@cs.berkeley.edu

1 INTRODUCTION

The multiple alignment problem is to find all homologous amino acids, or nucleotides, among multiple sequences. This problem is similar in some respects to the protein folding problem: each multiple alignment can be evaluated according to the homology relationships it specifies in the same way that the Gibbs free energy can be computed for a protein conformation. However the differences between the problems reveal distinct fundamental hurdles that lead to very different computational problems. In protein folding the free energy of a conformation is derived from a clear understanding of the underlying physics, whereas in multiple alignment the mechanisms of evolution are not well understood leading to uncertainty over how to evaluate multiple alignments. There is another fundamental difference: while the space of protein conformations is infinite and difficult to explore completely, there are natural ‘moves’ in the space based on changing backbone torsion angles or side chain conformations. The space of multiple alignments, while also difficult to explore completely, is discrete and finite. However it has been unclear how to perform ‘small’ steps in the space, making it difficult to accurately align single amino acids or nucleotides. In this paper we show that it is possible to efficiently explore the space of multiple alignments using the smallest possible steps.

Existing programs for multiple sequence alignment are mostly based on the idea of progressive alignment (Feng and Doolittle, 1987). We refer the reader to Batzoglou (2005) for a review of this approach to alignment. Here we note that it is widely used, by first generation programs such as CLUSTALW (Thompson et al., 1994) and T-Coffee (Notre-dame et al., 2000), as well as by more recent programs such as Align-m (Van Walle et al., 2005), MUSCLE (Edgar, 2004) and ProbCons (Do et al., 2005). In terms of the space of multiple alignments, we can think of progressive alignment as beginning with the ‘null alignment’ where no two sequences are aligned and proceeding via large steps to arrive at a complete alignment. Each step consists of aligning either a pair of sequences to each other or the alignment of a two partial alignments of subsequences to each other. This is a fundamental weakness of progressive alignment. For example, at the very first step two entire sequences are aligned, possibly incorrectly, because other informative sequences are not used. This has been addressed by iterative refinement (Gotoh, 1996), which improves on progressive alignment but still often fails to correct complex alignment errors involving multiple sequences.

Another approach to multiple sequence alignment was introduced by Morgenstern et al. (1996) and pursued in a series of papers developing the DIALIGN program (Morgenstern et al., 1998; Subramanian et al., 2005). The main idea was to address the problems of progressive alignment by incrementally aligning matching segments to each other, while preserving the consistency of the alignment. This segment-to-segment alignment approach effectively reduced the size of the steps taken by progressive alignment methods. A key ingredient was the careful formulation of the multiple alignment problem via precise definitions of partial and global multiple alignments (Morgenstern et al., 1999). We discuss these ideas in detail in Section 2.

In Section 3 we introduce the method of sequence annealing. We build alignments one match at a time, ‘annealing’ positions that are more likely to be homologous, first. Using fast algorithms for online topological ordering (Katriel and Bodlaender, 2006), we are able to rapidly construct a consistent global multiple alignment. We remove the requirement of DIALIGN so that alignment proceed by segment-to-segment comparison and allow for segments of size 1. This eliminates the need for many of the heuristics incorporated in DIALIGN. In order to correctly align single residues, we compute posterior probabilities for matches and gaps, using methods described in Holmes and Durbin (1998), Durbin et al. (1998), Do et al. (2005) and Schwartz et al. (2006).

In Section 4 we show that sequence annealing improves on all existing methods for protein multiple sequence alignment. Not only is sequence annealing more accurate, it is also very fast, even though it is based on performing very small steps in multiple alignment space.

2 ALIGNMENT POSETS

We use the notation formula to denote the i_-th element of a sequence of length n. By a set of sequence characters S = {σ1, … , σ_k} we mean the set of _n_1 + n_2 ⋯ + nk sequence characters that form the sequences σ1, σ2, … , σ_k of lengths _n_1, n_2, … , nk respectively. A partial global multiple alignment of sequence characters S = {σ1, … , σ_k} is a partially ordered set P = {_c_1, … , cm} together with a surjective function ϕ : S → P such that formula if i < j. The elements of P correspond to columns of the multiple alignment, and the partial order specifies the order in which columns must appear. We call P an alignment poset, and note that unless P is a total order, there are columns of the partial multiple alignment whose order is unspecified. A linear extension of a partially ordered set P = {_c_1, … , cm} is a permutation of the elements _c_1, … , cm such that whenever ci < cj, i < j. A global multiple alignment is a partial global multiple alignment together with a linear extension of the alignment poset P (Fig. 1). A similar representation using directed acyclic graphs instead of postes is used by the Partial Order Alignment program (Lee et al., 2002), and has also been used in the A-Bruijn alignment program (Raphael et al., 2004).

A set of four sequences, an alignment poset together with a linear extension and a global multiple alignment. The function from the set of sequence elements to the alignment poset that specifies the multiple alignment is not shown, but is fully specified by the diagram on the right. For example, the second element in the first sequence is corresponds to the fourth column of the multiple alignment.

Fig. 1

A set of four sequences, an alignment poset together with a linear extension and a global multiple alignment. The function from the set of sequence elements to the alignment poset that specifies the multiple alignment is not shown, but is fully specified by the diagram on the right. For example, the second element in the first sequence is formula corresponds to the fourth column of the multiple alignment.

There is an important (trivial) case of a partial multiple alignment:

Example (null alignment): A null global multiple alignment of k sequence S = {σ1, … , σk} of lengths _n_1, … , nk is a partial global multiple alignment _M_Null where the alignment poset P has size formula ⁠. Note that in this case P must be a disjoint union of k chains.

Let formula be the set of all partial multiple alignments of a set of sequences S. A scoring function for multiple alignments is a function f : formula which assigns a ‘score’ to each partial multiple alignment. The problem of specifying biologically realistic scoring functions that can be evaluated efficiently is an important open problem that we do not address in this paper. In what follows, we make use of a scoring function defined for alignments of pairs of sequences, namely the alignment metric accuracy (Schwartz et al., 2006). The score of a multiple alignment is defined to be the sum-of-pairs score for all of the pairwise alignments. This scoring function has useful properties explained in Schwartz et al. (2006) most notably that the sum-of-pairs score is meaningful for producing an alignment ‘close’ to the true alignment.

3 SEQUENCE ANNEALING

The goal of global multiple alignment is to find argmax_M_∈ formula (f(M)) where M ranges over all partial multiple alignments and f is a scoring function. In principle, f can be evaluated at every partial multiple alignment but this is not feasible in practice, as the set formula is large (Dress et al., 1998).

We formalize a greedy approach to multiple alignment as follows. Let S be a collection of k sequences of length _n_1, … , nk and let formula ⁠. A sequence annealing for S with scoring function f is a nesting of partial global multiple alignments ML ⊃ ML − 1 ⋯ ⊃ Mr where the alignment poset Pi associated to Mi has ∣_Pi_∣ = i for all i, PL is a null multiple alignment of S and f (M i+1) ≤ f(Mi). Note that each multiple alignment Mi consists of a function formula

By definition, the sequence annealing process can only proceed by minimal steps that reduce the number of columns in the (partial) alignment by one. This can only be done by merging two existing columns formula into formula ⁠, such that formula if formula ⁠, or formula otherwise. The difficulty in finding a sequence annealing is that not all pairs of columns can necessarily be merged in a partial multiple alignment. However it is natural to consider the algorithm in Figure 2.

A general sequence annealing algorithm.

Fig. 2

A general sequence annealing algorithm.

The time complexity of the algorithm depends on (1) the number of iterations of the main loop, (2) the time it takes to check the condition in line 3 and (3) the time it takes to merge the two columns. The time for (1) is simply L − r + 1. (3) can be done in O(k) time, since there are at most k − 1 sequence elements that are affected by each merge operation. The challenging step of the algorithm is (2).

In order to perform step (2) efficiently a weight formula is assigned to each pair of columns in ML. Candidate pairs with positive weights are placed in a heap in O(Lk log(Lk)) time.1 At every iteration of the algorithm the candidate pair p with the highest weight is drawn from the heap in O(1) time. The weight of p is recalculated to account for merge operations that involve the positions in p. If the new weight _w_′ is lower than the weight of the current top candidate in the heap the edge is reinserted into the heap with the new weight _w_′ in O(log(Lk)) time, otherwise merge(p) is performed.

The merge operation can be done efficiently using an online topological ordering algorithm. Given a directed acyclic graph G, the topological ordering problem is to find a function ord: V → N such that if i → j then ord(i) < ord(j). Directed acyclic graphs are equivalent to partially ordered sets, and the topological ordering problem is just the problem of finding a linear extension of the poset (the former terminology is used in computer science, and the latter terminology in mathematics). It is easy to see that the topological ordering problem is trivial (Tarjan, 1972). The online topological ordering problem is the topological ordering problem where edges appear one at a time. This problem was first considered in Alpern et al. (1990) and Marchetti-Spaccarnela et al. (1996). Significantly for our application, the problem admits efficient algorithms. We omit a detailed complexity analysis in this paper and refer the reader to Ajwani et al. (2006). In our implementation we adopted the algorithm of Pearch and Kelly (2004), which has good time complexity and is relatively easy to implement.

The correctness of the algorithm requires that the weight function w(p) has the property that merging a pair with positive weight will result in M _i_−1 for which f(M _i_−1) ≥ f(Mi). In addition, a good weight function should be correlated with δ = f(M _i_−1) − f(Mi), such that pairs that have a higher potential contribution to the objective function will be merged earlier in the sequence annealing process. Since w can change after each merge operation, a naive algorithm will have to update the weights for all the affected pairs after every such operation and reinsert them to the heap with the updated weights. To reduce the complexity of the algorithm we currently restrict w to be independent of merge operations or to have the property that the weight of a pair can only be reduced after a merge operation. The second property allows for a rapid evaluation of w. Only when a pair pi is fetched from the top of the heap, its weight _w_′(pi) is recalculated. Since _w_′(pl) ≤ w(pl), formula if and only if _w_′(pl) ≥ w(ph), where ph is the next top candidate pair in the heap H.

In order to specify a weighting function w, we first need to define an objective function for the global multiple alignment problem. Such an objective function should be derived from the quality measures used to evaluate alignments. The most widely used accuracy measure for multiple sequence alignment is the developer score (fD) (Sauder et al., 2000), which is equivalent to the sum-of-pairs score (SP) (Thompson et al., 1994). fD is a sensitivity measure, defined as the number of correctly aligned residue pairs in the evaluated alignment divided by the number of aligned residue pairs in the reference alignment. In Schwartz et al. (2006) we show that use of fD to derive the objective function leads to the over alignment problem, in which many unrelated positions are aligned without much support, resulting in low specificity, especially when aligning unrelated sequences or sequences with unalignable regions. Moreover, sine most alignment programs are compared based on fD alone, programs that are considered to be very accurate can actually produce poor alignments that do not distinguish between related (homologous) and unrelated sequence-regions.

To solve this problem we introduced in Schwartz et al. (2006) a new accuracy measure for global multiple sequence alignment named alignment metric accuracy (AMA), which is based on a metric for multiple alignments. AMA is defined as the fraction of residues that are aligned correctly to another residue or to a gap, summed over all sequence pairs. We refer the reader to the original paper for a formal definition of the alignment metric, the AMA score, and further discussion about their properties.

Given a probabilistic model for pairwise alignments, the match posterior probabilities formula assign a probability to the event that formula is aligned to formula given the probabilistic model specified by the parameters θ and the data (Durbin et al., 1998). Although gap posterior probabilities formula and formula are not discussed in Durbin et al. (1998), they are important and easy to derive.

A key point is that there is a family of objective functions, based on posterior decoding, that can be used to optimize the expected fD score, as well as the expected AMA score:

formula

(1)

In fact, these objective functions can be used to obtain a wide range of alignments, from the most specific to the most sensitive. When the gap-factor parameter Gf is set to 0, _f_0 evaluates the expected fD score. _f_0.5 evaluates the expected AMA score. In general, setting Gf to higher values results in alignments with higher specificity, while reducing the value of Gf results in higher sensitivity.

We propose two weight functions that are derived from f as defined in (1). Both have a static and a dynamic version. The static weights are computed once using the null alignment and stay constant during the sequence annealing process, while the dynamic weights are recomputed (rapidly) for each top candidate column pair (ck, ci):

formula

then

formula

(3)

and

formula

(4)

The first weight function formula assigns the highest weight to the pair of columns p max, for which the increase in the objective function formula divided by the number of new matched pairs is maximal. It is easy to see that this weight function satisfies the requirement that weights can only decrease following a merge operation, since it is calculated by averaging over all newly matched residue pairs.

While formula is a good hill-climbing heuristic and is useful for finding a local maxima of formula at the final alignment Mr, it has no guarantees for the alignments produced during the sequence annealing process. The second weight function formula addresses this issue. It is motivated by the empirical observation that when the optimal pairwise alignment is found using dynamic programming, 99.8% of the pairs that are matched in alignments that use higher gap-factor values for increased specificity also appear in alignments that use lower gap-factors for better sensitivity. Sequence annealing with wtgf emulates the process of slowly reducing the temperature (gap-factor in our analogy), allowing pairs of columns whose weights become positive to align. Therefore at any step of the sequence annealing process the scoring function that is being optimized is f temp(M), where formula ⁠.

The following lemma shows that formula can only decrease after a merge operation.

Lemma:

If _x_1, … , xn and _y_1, … , yn are positive numbers then

formula

Proof: We may assume without loss of generality that formula ⁠. If xk < yk for all k, then formula ⁠, which is a contradiction.□

4 RESULTS

We implemented sequence annealing by extending our existing protein alignment software that already performs posterior decoding based alignments. The new program, AMAP 2.0, is abbreviated to AMAP in this paper2.

The following variations of AMAP were tested:

AMAP—the program with its default settings. with dynamic weights.
AMAP_sn_—adjusted to maximize sensitivity. with static weights and two rounds of consistency transformations. This version of the program is best suited for comparison with existing multiple alignment programs that focus on maximizing sensitivity.
AMAP_sp_—adjusted for better specificity. with dynamic weights.

We compared the performance of AMAP with other alignment programs on the SABmark 1.65 datasets (Van Walle et al., 2005). SABmark includes two sets of reference alignments with a known structure from the ASTRAL (Brenner et al., 2000) database. The Twilight Zone set contains 1740 sequences with <25% identity divided into 209 groups based on SCOP folds (Murzin et al., 1995). The Superfamilies set contains 3280 sequences with <50% identity divided into 425 groups. In addition, each dataset has a ‘false positives’ version, which contains unrelated sequences with the same degree of sequence similarity in addition to the related sequences. Programs tested include Align-m 2.3 (Van Walle et al., 2005), CLUSTALW 1.83 (Thompson et al., 1994), DIALIGN-T 0.2.1 (Subramanian et al., 2005), MUSCLE 3.52 (Edgar, 2004), ProbCons 1.1 (Do et al., 2005) and T-Coffee 3.84 (Notredame et al., 2000). Our main results are as follows:

The sensitivity of AMAP_sn_ averaged over all positions in both SABmark datasets (with and without false-positives) is higher than all tested programs and is achieved with the highest specificity of all tested programs. Remarkably, the specificity of AMAP_sn_ is almost three times higher than the closest competitors in sensitivity, when false positives are introduced.
Both AMAP and AMAP_sp_ get better modeler and AMA scores than all other programs, including Align-m, which is designed to optimize specificity rather than sensitivity.
The running time of AMAP is comparable to (and in some cases faster than) that of DIALIGN-T, MUSCLE, ProbCons and T-Coffee. This was achieved without any optimization for speed in the current prototype.

These results are detailed in Tables 1 and 2 that show the performance of all the alignment programs we tested measured with the developer, modeler and AMA accuracy measures on the SABmark 1.65 datasets without and with false positives, respectively. The developer score (fD) measures sensitivity and the modeler score (fM) measures specificity. The AMA measure, discussed earlier in the paper, provides a balanced assessment of the overall accuracy of an alignment.

Table 1

Comparison of protein alignment programs on the SABmark datasets with no false-positives

Twilight (209)	Superfamilies (425)	Overall by alignments	Overall by positions	Average time
Program	f D	f M	AMA	f D	fM	AMA	fD	fM	AMA	f D	fM	AMA	Seconds
Align-m	21.6	23.6	51.7	49.2	45.6	56.9	40.1	38.3	55.2	35.2	45.4	56.6	12.7
CLUSTALW	25.6	14.7	24.9	54.0	38.1	43.8	44.7	30.4	37.6	33.6	19.5	28.2	0.4
DIALIGN-T	21.3	19.8	45.5	49.9	44.9	54.8	40.4	36.6	51.7	33.9	38.6	52.5	1.4
MUSCLE	27.3	16.4	27.6	56.3	40.3	46.4	46.8	32.4	40.2	37.5	22.5	31.7	2.1
ProbCons	32.1	21.1	37.4	59.8	44.4	51.8	50.7	36.7	47.0	43.0	34.3	47.0	4.5
T-Coffee	26.7	18.1	35.2	56.5	42.8	50.3	46.7	34.7	45.3	39.4	31.5	44.5	11.3
AMAPsn	30.9	22.4	40.9	58.8	45.3	53.3	49.6	37.8	49.2	43.3	39.1	51.9	2.4
AMAP	24.0	28.3	51.2	52.8	54.6	59.5	43.3	45.9	56.8	32.5	59.7	59.6	1.7
AMAPsp	14.5	41.5	56.5	38.7	69.4	60.2	30.7	60.2	59.0	20.7	78.1	58.9	1.4

Twilight (209)	Superfamilies (425)	Overall by alignments	Overall by positions	Average time
Program	f D	f M	AMA	f D	fM	AMA	fD	fM	AMA	f D	fM	AMA	Seconds
Align-m	21.6	23.6	51.7	49.2	45.6	56.9	40.1	38.3	55.2	35.2	45.4	56.6	12.7
CLUSTALW	25.6	14.7	24.9	54.0	38.1	43.8	44.7	30.4	37.6	33.6	19.5	28.2	0.4
DIALIGN-T	21.3	19.8	45.5	49.9	44.9	54.8	40.4	36.6	51.7	33.9	38.6	52.5	1.4
MUSCLE	27.3	16.4	27.6	56.3	40.3	46.4	46.8	32.4	40.2	37.5	22.5	31.7	2.1
ProbCons	32.1	21.1	37.4	59.8	44.4	51.8	50.7	36.7	47.0	43.0	34.3	47.0	4.5
T-Coffee	26.7	18.1	35.2	56.5	42.8	50.3	46.7	34.7	45.3	39.4	31.5	44.5	11.3
AMAPsn	30.9	22.4	40.9	58.8	45.3	53.3	49.6	37.8	49.2	43.3	39.1	51.9	2.4
AMAP	24.0	28.3	51.2	52.8	54.6	59.5	43.3	45.9	56.8	32.5	59.7	59.6	1.7
AMAPsp	14.5	41.5	56.5	38.7	69.4	60.2	30.7	60.2	59.0	20.7	78.1	58.9	1.4

Entries show the average developer (fD), modeler (fM) and AMA scores. Best results are shown in boldface. All numbers have been multiplied by 100.

Table 1

Comparison of protein alignment programs on the SABmark datasets with no false-positives

Twilight (209)	Superfamilies (425)	Overall by alignments	Overall by positions	Average time
Program	f D	f M	AMA	f D	fM	AMA	fD	fM	AMA	f D	fM	AMA	Seconds
Align-m	21.6	23.6	51.7	49.2	45.6	56.9	40.1	38.3	55.2	35.2	45.4	56.6	12.7
CLUSTALW	25.6	14.7	24.9	54.0	38.1	43.8	44.7	30.4	37.6	33.6	19.5	28.2	0.4
DIALIGN-T	21.3	19.8	45.5	49.9	44.9	54.8	40.4	36.6	51.7	33.9	38.6	52.5	1.4
MUSCLE	27.3	16.4	27.6	56.3	40.3	46.4	46.8	32.4	40.2	37.5	22.5	31.7	2.1
ProbCons	32.1	21.1	37.4	59.8	44.4	51.8	50.7	36.7	47.0	43.0	34.3	47.0	4.5
T-Coffee	26.7	18.1	35.2	56.5	42.8	50.3	46.7	34.7	45.3	39.4	31.5	44.5	11.3
AMAPsn	30.9	22.4	40.9	58.8	45.3	53.3	49.6	37.8	49.2	43.3	39.1	51.9	2.4
AMAP	24.0	28.3	51.2	52.8	54.6	59.5	43.3	45.9	56.8	32.5	59.7	59.6	1.7
AMAPsp	14.5	41.5	56.5	38.7	69.4	60.2	30.7	60.2	59.0	20.7	78.1	58.9	1.4

Twilight (209)	Superfamilies (425)	Overall by alignments	Overall by positions	Average time
Program	f D	f M	AMA	f D	fM	AMA	fD	fM	AMA	f D	fM	AMA	Seconds
Align-m	21.6	23.6	51.7	49.2	45.6	56.9	40.1	38.3	55.2	35.2	45.4	56.6	12.7
CLUSTALW	25.6	14.7	24.9	54.0	38.1	43.8	44.7	30.4	37.6	33.6	19.5	28.2	0.4
DIALIGN-T	21.3	19.8	45.5	49.9	44.9	54.8	40.4	36.6	51.7	33.9	38.6	52.5	1.4
MUSCLE	27.3	16.4	27.6	56.3	40.3	46.4	46.8	32.4	40.2	37.5	22.5	31.7	2.1
ProbCons	32.1	21.1	37.4	59.8	44.4	51.8	50.7	36.7	47.0	43.0	34.3	47.0	4.5
T-Coffee	26.7	18.1	35.2	56.5	42.8	50.3	46.7	34.7	45.3	39.4	31.5	44.5	11.3
AMAPsn	30.9	22.4	40.9	58.8	45.3	53.3	49.6	37.8	49.2	43.3	39.1	51.9	2.4
AMAP	24.0	28.3	51.2	52.8	54.6	59.5	43.3	45.9	56.8	32.5	59.7	59.6	1.7
AMAPsp	14.5	41.5	56.5	38.7	69.4	60.2	30.7	60.2	59.0	20.7	78.1	58.9	1.4

Entries show the average developer (fD), modeler (fM) and AMA scores. Best results are shown in boldface. All numbers have been multiplied by 100.

Table 2

Comparison of protein alignment programs on the SABmark datasets with false-positives

Twilight-FP (209)	Superfamilies-FP (425)	Overall by alignments	Overall by positions	Average time
Program	fD	fM	AMA	fD	fM	AMA	fD	fM	AMA	fD	fM	AMA	Seconds
Align-m	17.8	6.4	81.5	44.8	16.8	77.5	35.9	13.4	78.9	28.6	17.6	83.3	158.9
CLUSTALW	20.4	2.4	35.5	50.9	7.4	37.0	40.8	5.7	36.5	31.2	4.0	34.2	1.7
DIALIGN-T	17.0	4.5	74.1	46.7	14.0	71.5	36.9	10.9	72.4	30.2	13.5	78.5	5.7
MUSCLE	19.4	2.3	37.1	49.7	7.5	38.9	39.7	5.8	38.3	30.7	4.1	35.7	19.2
ProbCons	26.8	4.4	55.6	56.0	10.9	55.0	46.4	8.8	55.2	34.8	6.5	54.2	28.5
T-Coffee	13.0	2.3	56.5	42.5	9.3	56.6	32.8	7.0	56.6	26.7	6.6	62.6	61.2
AMAPsn	27.3	6.6	68.3	56.1	14.1	63.8	46.6	11.6	65.3	35.7	17.7	81.1	13.5
AMAP	19.2	9.8	84.4	46.4	27.0	84.2	37.4	21.4	84.2	27.4	30.1	88.5	11.2
AMAPsp	12.7	17.3	91.0	35.7	45.9	91.1	28.1	36.5	91.1	19.1	50.3	91.4	10.0

Twilight-FP (209)	Superfamilies-FP (425)	Overall by alignments	Overall by positions	Average time
Program	fD	fM	AMA	fD	fM	AMA	fD	fM	AMA	fD	fM	AMA	Seconds
Align-m	17.8	6.4	81.5	44.8	16.8	77.5	35.9	13.4	78.9	28.6	17.6	83.3	158.9
CLUSTALW	20.4	2.4	35.5	50.9	7.4	37.0	40.8	5.7	36.5	31.2	4.0	34.2	1.7
DIALIGN-T	17.0	4.5	74.1	46.7	14.0	71.5	36.9	10.9	72.4	30.2	13.5	78.5	5.7
MUSCLE	19.4	2.3	37.1	49.7	7.5	38.9	39.7	5.8	38.3	30.7	4.1	35.7	19.2
ProbCons	26.8	4.4	55.6	56.0	10.9	55.0	46.4	8.8	55.2	34.8	6.5	54.2	28.5
T-Coffee	13.0	2.3	56.5	42.5	9.3	56.6	32.8	7.0	56.6	26.7	6.6	62.6	61.2
AMAPsn	27.3	6.6	68.3	56.1	14.1	63.8	46.6	11.6	65.3	35.7	17.7	81.1	13.5
AMAP	19.2	9.8	84.4	46.4	27.0	84.2	37.4	21.4	84.2	27.4	30.1	88.5	11.2
AMAPsp	12.7	17.3	91.0	35.7	45.9	91.1	28.1	36.5	91.1	19.1	50.3	91.4	10.0

Entries show the average developer (fD), modeler (fM) and AMA scores. Best results are shown in boldface. All numbers have been multiplied by 100.

Table 2

Comparison of protein alignment programs on the SABmark datasets with false-positives

Twilight-FP (209)	Superfamilies-FP (425)	Overall by alignments	Overall by positions	Average time
Program	fD	fM	AMA	fD	fM	AMA	fD	fM	AMA	fD	fM	AMA	Seconds
Align-m	17.8	6.4	81.5	44.8	16.8	77.5	35.9	13.4	78.9	28.6	17.6	83.3	158.9
CLUSTALW	20.4	2.4	35.5	50.9	7.4	37.0	40.8	5.7	36.5	31.2	4.0	34.2	1.7
DIALIGN-T	17.0	4.5	74.1	46.7	14.0	71.5	36.9	10.9	72.4	30.2	13.5	78.5	5.7
MUSCLE	19.4	2.3	37.1	49.7	7.5	38.9	39.7	5.8	38.3	30.7	4.1	35.7	19.2
ProbCons	26.8	4.4	55.6	56.0	10.9	55.0	46.4	8.8	55.2	34.8	6.5	54.2	28.5
T-Coffee	13.0	2.3	56.5	42.5	9.3	56.6	32.8	7.0	56.6	26.7	6.6	62.6	61.2
AMAPsn	27.3	6.6	68.3	56.1	14.1	63.8	46.6	11.6	65.3	35.7	17.7	81.1	13.5
AMAP	19.2	9.8	84.4	46.4	27.0	84.2	37.4	21.4	84.2	27.4	30.1	88.5	11.2
AMAPsp	12.7	17.3	91.0	35.7	45.9	91.1	28.1	36.5	91.1	19.1	50.3	91.4	10.0

Twilight-FP (209)	Superfamilies-FP (425)	Overall by alignments	Overall by positions	Average time
Program	fD	fM	AMA	fD	fM	AMA	fD	fM	AMA	fD	fM	AMA	Seconds
Align-m	17.8	6.4	81.5	44.8	16.8	77.5	35.9	13.4	78.9	28.6	17.6	83.3	158.9
CLUSTALW	20.4	2.4	35.5	50.9	7.4	37.0	40.8	5.7	36.5	31.2	4.0	34.2	1.7
DIALIGN-T	17.0	4.5	74.1	46.7	14.0	71.5	36.9	10.9	72.4	30.2	13.5	78.5	5.7
MUSCLE	19.4	2.3	37.1	49.7	7.5	38.9	39.7	5.8	38.3	30.7	4.1	35.7	19.2
ProbCons	26.8	4.4	55.6	56.0	10.9	55.0	46.4	8.8	55.2	34.8	6.5	54.2	28.5
T-Coffee	13.0	2.3	56.5	42.5	9.3	56.6	32.8	7.0	56.6	26.7	6.6	62.6	61.2
AMAPsn	27.3	6.6	68.3	56.1	14.1	63.8	46.6	11.6	65.3	35.7	17.7	81.1	13.5
AMAP	19.2	9.8	84.4	46.4	27.0	84.2	37.4	21.4	84.2	27.4	30.1	88.5	11.2
AMAPsp	12.7	17.3	91.0	35.7	45.9	91.1	28.1	36.5	91.1	19.1	50.3	91.4	10.0

Entries show the average developer (fD), modeler (fM) and AMA scores. Best results are shown in boldface. All numbers have been multiplied by 100.

It is important to note that all programs except for Align-m aim at maximizing sensitivity at the expense of specificity. It is therefore not surprising that these programs clearly have better fD scores than fM scores. On the other hand, AMAP allows to control the sensitivity/specificity tradeoff and is able to achieve best results on both measures. This is clear by examining Figure 3, which shows sensitivity and specificity of AMAP at various stages of the sequence annealing on the SABmark dataset with no false-positives (averaged over all positions). All the points on the AMAP curve were produced with AMAP, except for the most sensitive point, which was produced by AMAP_sn_. The figure depicts one crucial difference between AMAP and all the other current alignment programs. While every other program produces a single point, the sequence annealing method used in AMAP produces a series of alignments, starting with very specific alignments with low sensitivity and ending at very sensitive alignments with lower specificity. It is clear that the AMAP line dominates all other programs, since for every alignment produced by some other program, there is a point on the AMAP line that has better sensitivity and specificity.

Comparison of different alignment programs with AMAP.

Fig. 3

Comparison of different alignment programs with AMAP.

Although we disagree that alignment programs should be evaluated based on sensitivity alone, we will note that the fD scores of AMAP_sn_ are comparable to ProbCons, which is arguably the most accurate program in terms of sensitivity. This is not surprising since AMAP and ProbCons use similar posterior probabilities. While ProbCons has slightly better sensitivity than AMAP_sn_ on the sets without false-positives when scores are averaged over alignments, AMAP_sn_ has a slightly better score when the scores are averaged over all positions in all alignments together. This indicates that the sequence annealing method finds better alignments when the total number of residues in an alignment increases. The same trend is evident in the false-positives sets, in which AMAP_sn_ achieves better scores than ProbCons on all measures and datasets. This is again due to the fact that the false-positives sets are larger and therefore the alignments are less robust to errors made in early stages of the progressive alignment procedure used in ProbCons. Even more striking is the fact that while both programs achieve very similar fD scores, AMAP_sn_ has much better fM and AMA scores than ProbCons (81.1 AMA compared with 54.2 on the false-positive sets averaged over all alignments), even though it is tuned to optimized sensitivity alone. Again, this can be explained by the fact that sequence annealing takes the smallest step in the space of multiple alignments and fixes matches based on their overall contribution to the objective function score. In progressive alignment, once two sequences are chosen to be aligned, their entire alignment is determined without regard to the fact that the alignment of some residues may be unclear until later in the procedure.

Both AMAP and AMAP_sp_ get better fM and AMA scores than all other programs, including Align-m, which is designed to optimize specificity rather than sensitivity. AMAP achieves high specificity while still producing respectable sensitivity scores. On most datasets, AMAP_sp_ has better AMA scores than AMAP, despite the fact that AMAP is tuned to optimize the AMA scores. This can be explained by the fact that the parameters used to produce the posterior probabilities do not fit the SABmark data very well and probably underestimate the gap probabilities.

AMAP compares favorably with most other programs in terms of running time, as well. It takes only 13.5 seconds on average to align a group of sequences from the false-positives sets with current version of AMAP, which is still a prototype that has not yet been optimized for speed, compared with 19.2 seconds per query with MUSCLE, which is known to be optimized for fast running time. In fact, most of AMAP's running time is spent on computing the posterior probabilities and not on the actual sequence annealing. Using sequence annealing also allows AMAP to produce in one run of the algorithm a range of multiple sequence alignments, ranging from very specific to very sensitive alignments, since at every point of the algorithm a valid multiple sequence alignment can be produced from the current (partial) alignments. This allows users to see how the alignment is being built and decide which alignment looks best for their purpose.

5 SUMMARY AND FUTURE DIRECTIONS

We have shown that sequence annealing is a practical alternative to progressive alignment that can be used for multiple alignment of protein sequences. In addition to being more accurate than existing methods, sequence annealing has the advantage that the intermediate alignments produced during the annealing process are of use in identifying reliable portions of the multiple alignment. They provide the user with the ability to explore the tradeoff between sensitivity and specificity.

Although we have only considered multiple alignment of protein sequences in this paper there is no reason to expect that the methods we have developed cannot be applied to DNA multiple alignment. In the case of DNA, it may be necessary to further develop the sequence annealing approach to allow for melting (in analogy with simulated annealing algorithms). Indeed, by revisiting certain annealing steps, it may be possible to avoid local maxima of the scoring function. Indeed, we do not, at this time, have a clear understanding of the landscape of the scoring function we have proposed. This is a promising direction for future research.

Sequence annealing can also be used for local multiple alignment. A partial local multiple alignment of sequence characters S = {σ1, … , σ_k_} is a partially ordered set P = {_c_1, … , cm} together with a surjective function ϕ : S → P such that if x, y ∈ P with x < y then if formula and formula ⁠, i < j. Note that every partial global multiple alignment is a partial local multiple alignment (but not vice versa). A null local multiple alignment is the poset consisting of a disjoint union of singletons. Sequences can be annealed with the added ‘move’ of connecting components of the alignment posets without collapsing elements.

In summary, sequence annealing opens up the possibility of exploring the space of multiple alignments and allows for global optimization methods to be applied directly to the multiple alignment problem.

Ariel Schwartz was supported by NSF grant EF 03-31494. Lior Pachter was supported by NIH grant R01HG2362 and NSF grant CCF0347992.

REFERENCES

An o(_n_2.75) algorithm for online topological ordering

2006

arXiv:cs.DS/0602073

et al.

Incremental evaluation of computational circuits

1990

Proceedings of the 1st Annual ACM-SIAM Symposium on Discrete Algorithms

(pg.

)

The many faces of sequence alignment

Brief. Bioinform.

2005

, vol.

(pg.

)

et al.

The ASTRAL compendium for protein structure and sequence analysis

Nucleic Acids Res.

2000

, vol.

(pg.

254

256

)

et al.

ProbCons: probabilistic consistency-based multiple sequence alignment

Genome Res.

2005

, vol.

(pg.

330

340

)

et al.

The number of standard and of effective multiple alignments

Appl. Math. Lett.

1998

, vol.

(pg.

)

et al. ,

Biological sequence analysis. Probablistic models of proteins and nucleic acids

1998

Cambridge University Press

MUSCLE: multiple sequence alignment with high accuracy and high throughput

Nucleic Acids Res.

2004

, vol.

(pg.

1792

1797

)

Progressive alignment of amino acid sequences as a prerequisite to correct phylogenetic trees

J. Mol. Evol.

1987

, vol.

(pg.

351

360

)

Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments

J. Mol. Biol.

1996

, vol.

264

(pg.

823

838

)

Dynamic programming alignment accuracy

J. Comput. Biol.

1998

, vol.

(pg.

493

504

)

Online topological ordering

ACM Trans. Algorithm.

2006

in press

et al.

Multiple sequence alignment using partial order graphs

Bioinformatics

2002

, vol.

(pg.

452

464

)

et al.

Maintaining a topological order under edge insertions

Inform. Process. Lett.

1996

, vol.

(pg.

)

et al.

Multiple DNA and protein sequence alignment based on segment-to-segment comparison

Proc. Natl Acad. Sci. USA

1996

, vol.

(pg.

12098

12103

)

et al.

DIALIGN: finding local similarities by multiple sequence alignment

Bioinformatics

1998

, vol.

(pg.

290

294

)

Consistent equivalence relations: a set-theoretical framework for multiple sequence alignment

Technical Report Materialien und Preprints 133

1999

University of Bielefeld

et al.

SCOP: a structural classification of proteins database for the investigation of sequences and structures

J. Mol. Biol.

1995

, vol.

247

pg.

536

et al.

T-Coffee: a novel method for multiple sequence alignments

J. Mol. Biol.

2000

, vol.

302

(pg.

205

217

)

A dynamic algorithm for topologically sorting directed acyclic graphs

2004

, vol.

Vol. 3059

Proceedings of the Workshop on Efficient and experimental Algorithms, of Lecture Notes in Computer Science

Springer-Verlag

(pg.

383

398

)

et al.

A novel method for multiple alignment of sequences with repeated and shuffled elements

Genome Res.

2004

, vol.

(pg.

2336

2346

)

et al.

Large-scale comparison of protein sequence alignment algorithms with structure alignments

Proteins

2000

, vol.

(pg.

)

Alignment metric accuracy

2006

arXiv: q-bio.QM/0510052.

et al.

DIALIGN-T: an improved algorithm for segment-based multiple alignment

BMC Bioinformatics

2005

, vol.

pg.

Depth first search and linear graph algorithms

SIAM J. Comput.

1972

, vol.

(pg.

146

160

)

et al.

CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice

Nucleic Acids Res.

1994

, vol.

(pg.

4673

4680

)

et al.

SABmark—a benchmark for sequence alignment that covers the entire known fold space

Bioinformatics

2005

, vol.

(pg.

1267

1268

)

1Here we assume that the number of candidate pairs with positive weights per sequence position is of the order O(k).

2AMAP uses the ProbCons parameters with a single pair of gap states to generate pairwise posterior probabilities. The latest version of ProbCons uses two additional gap states, which we do not use in the current version of AMAP. We also slightly modified the initial state probabilities to {πmatch, πinsert, πdelete} = {0.4,0.3,0.3}.