Kalign--an accurate and fast multiple sequence alignment algorithm

Timo Lassmann et al. BMC Bioinformatics. 2005.

Abstract

Background: The alignment of multiple protein sequences is a fundamental step in the analysis of biological data. It has traditionally been applied to analyzing protein families for conserved motifs, phylogeny, structural properties, and to improve sensitivity in homology searching. The availability of complete genome sequences has increased the demands on multiple sequence alignment (MSA) programs. Current MSA methods suffer from being either too inaccurate or too computationally expensive to be applied effectively in large-scale comparative genomics.

Results: We developed Kalign, a method employing the Wu-Manber string-matching algorithm, to improve both the accuracy and speed of multiple sequence alignment. We compared the speed and accuracy of Kalign to other popular methods using Balibase, Prefab, and a new large test set. Kalign was as accurate as the best other methods on small alignments, but significantly more accurate when aligning large and distantly related sets of sequences. In our comparisons, Kalign was about 10 times faster than ClustalW and, depending on the alignment size, up to 50 times faster than popular iterative methods.

Conclusion: Kalign is a fast and robust alignment method. It is especially well suited for the increasingly important task of aligning large numbers of sequences.

Figures

Figure 1

Balibase 2.01 reference alignment 1tvxA ref1 viewed by Belvu [37], showing conservation by "average similarity by BLOSUM62".

Figure 2

Analysis of the contribution to alignment accuracy made by different algorithmic variants. Kalign-default uses Wu-Manber approximate string matching, while Kalign-ktuple, Mafft-fast, Muscle-fast, and ClustalW-quicktree use exact k-tuple matching. The default Kalign Wu-Manber based algorithm becomes more accurate than other methods at high evolutionary distances. The alignments consisted of 50 simulated sequences.
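
To illustrate the difference between exact k-tuple matching and approximate matching, the following is a minimal sketch of the bit-parallel scheme usually associated with Wu and Manber, allowing up to `k` edit errors (substitutions, insertions, deletions). It is not taken from the Kalign source; the function name `approx_find` and its parameters are assumptions made for this example, and how Kalign turns such matches into pairwise distances is not shown here.

```python
def approx_find(pattern, text, k):
    """Return 0-based end positions in `text` where `pattern` matches
    with at most `k` edit errors (illustrative sketch, not Kalign code)."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")

    # masks[c] has bit i set when pattern[i] == c.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    full = (1 << m) - 1      # keep only the m pattern bits
    accept = 1 << (m - 1)    # bit that signals a full-pattern match

    # state[d]: bit i is set when pattern[:i+1] matches some suffix of the
    # text read so far with at most d errors. Before any text is read, a
    # prefix of length <= d matches the empty string via d deletions.
    state = [((1 << d) - 1) & full for d in range(k + 1)]

    hits = []
    for j, ch in enumerate(text):
        b = masks.get(ch, 0)
        prev_old = state[0]                      # old value of state[d-1]
        state[0] = ((state[0] << 1) | 1) & b     # exact matching row
        for d in range(1, k + 1):
            cur_old = state[d]
            state[d] = (
                (((cur_old << 1) | 1) & b)   # next character matches
                | prev_old                   # insertion: extra text character
                | (prev_old << 1)            # substitution
                | (state[d - 1] << 1)        # deletion: skip a pattern character
                | 1                          # one-character prefix, one error
            ) & full
            prev_old = cur_old
        if state[k] & accept:
            hits.append(j)
    return hits


if __name__ == "__main__":
    print(approx_find("ACD", "WWACDWW", 0))   # exact matching, as with a k-tuple
    print(approx_find("ACD", "WWAEDWW", 1))   # tolerates one substitution
```

With `k = 0` the search behaves like exact k-tuple matching; with `k = 1` a short pattern can still be found after a single residue change, which is one plausible reason approximate matching retains signal between distantly related sequences.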

Figure 3

A 2D plot indicating the situations in which each method performs best on the large test set. The accuracy of the most accurate versions of Kalign, Muscle, and Mafft was measured for each combination of average evolutionary distance (in PAM units) and number of sequences. Each cell is colored according to the most accurate program: Kalign, red; Muscle, blue; Mafft, yellow. If two or more methods tie, the cell is black. In (a) it is enough to win by the smallest margin, whereas in (b) the program must win by a margin of 2%. Up to 200 PAM no program stands out as a clear winner, while above this distance Kalign dominates.
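
The coloring rule in this caption can be sketched as a small function. This is only an illustration: the name `cell_color`, the score dictionary, and the reading of "a margin of 2%" as an absolute difference of at least 2 accuracy percentage points are assumptions, not details given in the caption.

```python
# Colors as stated in the caption.
COLORS = {"Kalign": "red", "Muscle": "blue", "Mafft": "yellow"}


def cell_color(scores, margin=0.0):
    """Pick the color for one (distance, number of sequences) cell.

    `scores` maps program name to accuracy in percent. The best program
    wins only if it beats the runner-up by a positive difference that is
    at least `margin` (assumed to be in percentage points); otherwise the
    cell counts as a tie and is black.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (_, second_score) = ranked[0], ranked[1]
    diff = best_score - second_score
    if diff > 0 and diff >= margin:
        return COLORS[best]
    return "black"


if __name__ == "__main__":
    cell = {"Kalign": 81.0, "Muscle": 80.0, "Mafft": 78.0}
    print(cell_color(cell))              # panel (a): any positive margin wins
    print(cell_color(cell, margin=2.0))  # panel (b): a 1-point lead counts as a tie
```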

Figure 4

Plots of the accuracy (a) and speed (b) achieved by Kalign, Mafft (FFTNSI), Muscle, and ClustalW on the large test set with increasing average evolutionary distance. The number of sequences (300) and the average sequence length (500 residues) are kept constant.

Figure 5

Plots of the accuracy (a) and speed (b) achieved by Kalign, Mafft (FFTNSI), Muscle, and ClustalW on the large test set with increasing number of sequences. The evolutionary distance (300 PAM) and the average sequence length (500 residues) are kept constant.

References

    1. Notredame C. Recent progress in multiple sequence alignment: a survey. Pharmacogenomics. 2002;3:131–144.
    2. Felsenstein J. PHYLIP – Phylogeny Inference Package (Version 3.2). Cladistics. 1989;5:164–166.
    3. Sjolander K. Phylogenomic inference of protein molecular function: advances and challenges. Bioinformatics. 2004;20:170–179.
    4. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer ELL, Studholme DJ, Yeats C, Eddy SR. The Pfam protein families database. Nucleic Acids Res. 2004. pp. 138–141.
    5. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970;48:443–453.