Kalign--an accurate and fast multiple sequence alignment algorithm
Timo Lassmann et al. BMC Bioinformatics. 2005.
Abstract
Background: The alignment of multiple protein sequences is a fundamental step in the analysis of biological data. It has traditionally been applied to analyzing protein families for conserved motifs, phylogeny, structural properties, and to improve sensitivity in homology searching. The availability of complete genome sequences has increased the demands on multiple sequence alignment (MSA) programs. Current MSA methods suffer from being either too inaccurate or too computationally expensive to be applied effectively in large-scale comparative genomics.
Results: We developed Kalign, a method employing the Wu-Manber string-matching algorithm, to improve both the accuracy and speed of multiple sequence alignment. We compared the speed and accuracy of Kalign to other popular methods using Balibase, Prefab, and a new large test set. Kalign was as accurate as the best other methods on small alignments, but significantly more accurate when aligning large and distantly related sets of sequences. In our comparisons, Kalign was about 10 times faster than ClustalW and, depending on the alignment size, up to 50 times faster than popular iterative methods.
Conclusion: Kalign is a fast and robust alignment method. It is especially well suited for the increasingly important task of aligning large numbers of sequences.
Figures
Figure 1
Balibase 2.01 reference alignment 1tvxA ref1 viewed by Belvu [37], showing conservation by "average similarity by BLOSUM62".
Figure 2
Analysis of the contribution to alignment accuracy made by different algorithmic variants. Kalign-default uses Wu-Manber approximate string matching, while Kalign-ktuple, Mafft-fast, Muscle-fast, and ClustalW-quicktree use exact k-tuple matching. The default Kalign Wu-Manber based algorithm becomes more accurate than other methods at high evolutionary distances. The alignments consisted of 50 simulated sequences.
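The difference between exact k-tuple matching and approximate matching can be illustrated with a minimal sketch. The code below is not Kalign's implementation: it shows only the simplest bit-parallel variant in the style of Wu and Manber, allowing substitutions but no insertions or deletions, and the function names, the example pattern, and the example sequence are all hypothetical.

```python
def exact_ktuple_hits(pattern, text):
    """Start positions where pattern occurs exactly in text."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


def bitap_mismatch_hits(pattern, text, max_mismatches):
    """Start positions where pattern matches text with at most
    max_mismatches substitutions (bit-parallel shift-and search)."""
    m = len(pattern)
    # masks[ch] has bit i set when pattern[i] == ch.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    full = (1 << m) - 1      # keep only the m low bits
    accept = 1 << (m - 1)    # bit that signals a complete match
    # state[d] bit i set: pattern[:i+1] matches the text ending at the
    # current position with at most d mismatches.
    state = [0] * (max_mismatches + 1)
    hits = []
    for pos, ch in enumerate(text):
        mask = masks.get(ch, 0)
        prev = state[0]  # previous-step value of state[d - 1]
        state[0] = ((state[0] << 1) | 1) & mask
        for d in range(1, max_mismatches + 1):
            old = state[d]
            extend_match = ((old << 1) | 1) & mask   # current char matches
            spend_mismatch = (prev << 1) | 1         # current char substituted
            state[d] = (extend_match | spend_mismatch) & full
            prev = old
        if state[max_mismatches] & accept:
            hits.append(pos - m + 1)
    return hits


sequence = "GGACDGGAEDGG"
print(exact_ktuple_hits("ACD", sequence))        # exact 3-tuple hits
print(bitap_mismatch_hits("ACD", sequence, 1))   # hits with <= 1 mismatch
```

In this toy sequence the exact search finds only the literal occurrence of ACD, while the one-mismatch search also reports the diverged copy AED, which is the kind of signal exact k-tuples lose as sequences drift apart.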
Figure 3
A 2D plot indicating in which situations different methods perform better on the large test set. The accuracy of the most accurate versions of Kalign, Muscle, and Mafft was measured for each combination of average evolutionary distance (in PAM units) and number of sequences. Each cell is colored according to the most accurate program: Kalign, red; Muscle, blue; Mafft, yellow. If two or more methods tie, the cell is black. In (a) it is enough to win by the smallest margin, whereas in (b) the program must win by a margin of 2%. Up to 200 PAM no program stands out as a clear winner, while above this distance Kalign dominates.
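The coloring rule in this caption can be sketched as a small function. This is an illustration only, under two assumptions the caption does not settle: the 2% margin is read as two percentage points of accuracy, and a lead smaller than the required margin is treated like a tie (black). The function name and the accuracy values are hypothetical.

```python
# Color assigned to each program, as stated in the caption.
COLORS = {"Kalign": "red", "Muscle": "blue", "Mafft": "yellow"}


def cell_color(accuracy, margin=0.0):
    """Return the color for one cell of the plot.

    accuracy: dict mapping program name to accuracy in percent.
    margin:   required lead over the runner-up, in percentage points
              (0.0 for panel a, 2.0 for panel b; an assumed reading).
    """
    ranked = sorted(accuracy.items(), key=lambda item: item[1], reverse=True)
    (best, best_acc), (_, second_acc) = ranked[0], ranked[1]
    lead = best_acc - second_acc
    # A strict lead is always required; panel (b) additionally needs
    # the lead to reach the margin.
    if lead > 0 and lead >= margin:
        return COLORS[best]
    return "black"


cell = {"Kalign": 81.0, "Muscle": 80.0, "Mafft": 79.0}
print(cell_color(cell))              # panel (a): any lead wins
print(cell_color(cell, margin=2.0))  # panel (b): lead must reach the margin
```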
Figure 4
Plots of the accuracy (a) and speed (b) achieved by Kalign, Mafft (FFTNSI), Muscle, and ClustalW on the large test set with increasing average evolutionary distance. The number of sequences (300) and the average sequence length (500 residues) are kept constant.
Figure 5
Plots of the accuracy (a) and speed (b) achieved by Kalign, Mafft (FFTNSI), Muscle, and ClustalW on the large test set with increasing number of sequences. The evolutionary distance (300 PAM) and the average sequence length (500 residues) are kept constant.
Similar articles
- transAlign: using amino acids to facilitate the multiple alignment of protein-coding DNA sequences.
Bininda-Emonds OR. BMC Bioinformatics. 2005 Jun 22;6:156. doi: 10.1186/1471-2105-6-156. PMID: 15969769. Free PMC article.
- Grammar-based distance in progressive multiple sequence alignment.
Russell DJ, Otu HH, Sayood K. BMC Bioinformatics. 2008 Jul 10;9:306. doi: 10.1186/1471-2105-9-306. PMID: 18616828. Free PMC article.
- TM-Aligner: Multiple sequence alignment tool for transmembrane proteins with reduced time and improved accuracy.
Bhat B, Ganai NA, Andrabi SM, Shah RA, Singh A. Sci Rep. 2017 Oct 2;7(1):12543. doi: 10.1038/s41598-017-13083-y. PMID: 28970546. Free PMC article.
- Multiple sequence alignment.
Edgar RC, Batzoglou S. Curr Opin Struct Biol. 2006 Jun;16(3):368-73. doi: 10.1016/j.sbi.2006.04.004. Epub 2006 May 5. PMID: 16679011. Review.
- Differences between pair-wise and multi-sequence alignment methods affect vertebrate genome comparisons.
Margulies EH, Chen CW, Green ED. Trends Genet. 2006 Apr;22(4):187-93. doi: 10.1016/j.tig.2006.02.005. Epub 2006 Feb 24. PMID: 16499991. Review.
Cited by
- SARS-CoV-2 Genotyping Highlights the Challenges in Spike Protein Drift Independent of Other Essential Proteins.
Prokop JW, Alberta S, Witteveen-Lane M, Pell S, Farag HA, Bhargava D, Vaughan RM, Frisch A, Bauss J, Bhatti H, Arora S, Subrahmanya C, Pearson D, Goodyke A, Westgate M, Cook TW, Mitchell JT, Zieba J, Sims MD, Underwood A, Hassouna H, Rajasekaran S, Tamae Kakazu MA, Chesla D, Olivero R, Caulfield AJ. Microorganisms. 2024 Sep 9;12(9):1863. doi: 10.3390/microorganisms12091863. PMID: 39338537. Free PMC article.
- RNAi-Mediated Silencing of Laccase 2 in Culex pipiens Pupae via Dehydration and Soaking Results in Multiple Defects in Cuticular Development.
Naumenko AN, Fritz ML. Insects. 2024 Mar 14;15(3):193. doi: 10.3390/insects15030193. PMID: 38535388. Free PMC article.
- Human pangenome analysis of sequences missing from the reference genome reveals their widespread evolutionary, phenotypic, and functional roles.
Wu Z, Li T, Jiang Z, Zheng J, Gu Y, Liu Y, Liu Y, Xie Z. Nucleic Acids Res. 2024 Mar 21;52(5):2212-2230. doi: 10.1093/nar/gkae086. PMID: 38364871. Free PMC article.
- Genomics of the expanding pine pathogen Lecanosticta acicola reveals patterns of ongoing genetic admixture.
Marcet-Houben M, Cruz F, Gómez-Garrido J, Alioto TS, Nunez-Rodriguez JC, Mesanza N, Gut M, Iturritxa E, Gabaldon T. mSystems. 2024 Mar 19;9(3):e0092823. doi: 10.1128/msystems.00928-23. Epub 2024 Feb 16. PMID: 38364101. Free PMC article.
References
- Notredame C. Recent progress in multiple sequence alignment: a survey. Pharmacogenomics. 2002;3:131–144.
- Felsenstein J. PHYLIP – Phylogeny Inference Package (Version 3.2). Cladistics. 1989;5:164–166.
- Sjolander K. Phylogenomic inference of protein molecular function: advances and challenges. Bioinformatics. 2004;20:170–179.
- Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970;48:443–453.