ParAlign: a parallel sequence alignment algorithm for rapid and sensitive database searches

T Rognes. Nucleic Acids Res. 2001.

Abstract

There is a need for faster and more sensitive algorithms for sequence similarity searching in view of the rapidly increasing amounts of genomic sequence data available. Parallel processing capabilities in the form of the single instruction, multiple data (SIMD) technology are now available in common microprocessors and enable a single microprocessor to perform many operations in parallel. The ParAlign algorithm has been specifically designed to take advantage of this technology. The new algorithm initially exploits parallelism to perform a very rapid computation of the exact optimal ungapped alignment score for all diagonals in the alignment matrix. Then, a novel heuristic is employed to compute an approximate score of a gapped alignment by combining the scores of several diagonals. This approximate score is used to select the most interesting database sequences for a subsequent Smith-Waterman alignment, which is also parallelised. The resulting method represents a substantial improvement compared to existing heuristics. The sensitivity and specificity of ParAlign were found to be as good as those of Smith-Waterman implementations when the same method for computing the statistical significance of the matches was used. In terms of speed, only the significantly less sensitive NCBI BLAST 2 program was found to outperform the new approach. Online searches are available at http://dna.uio.no/search/
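The final stage named in the abstract, Smith-Waterman alignment, can be sketched as follows. This is a plain scalar version for illustration only, not the parallelised implementation the abstract refers to. It assumes a gap of length k costs q + r·k, which is consistent with the "11 + k" gap penalty and the parameters q = 11, r = 1 quoted in the figure captions. The `match` and `mismatch` values are hypothetical stand-ins for a substitution matrix such as BLOSUM62.

```python
def smith_waterman_score(a, b, q=11, r=1, match=5, mismatch=-4):
    """Optimal local alignment score with affine gaps (gap of length k costs q + r*k).

    match/mismatch are illustrative assumptions replacing a real substitution matrix.
    """
    neg_inf = float("-inf")
    n = len(b)
    h_prev = [0] * (n + 1)        # best scores for the previous row of the matrix
    f = [neg_inf] * (n + 1)       # best score ending in a vertical gap, per column
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * (n + 1)
        e = neg_inf               # best score ending in a horizontal gap, this row
        for j in range(1, n + 1):
            # Either extend an existing gap (cost r) or open a new one (cost q + r).
            e = max(e - r, h_cur[j - 1] - q - r)
            f[j] = max(f[j] - r, h_prev[j] - q - r)
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: the score is never allowed to drop below zero.
            h = max(0, h_prev[j - 1] + s, e, f[j])
            h_cur[j] = h
            best = max(best, h)
        h_prev = h_cur
    return best


if __name__ == "__main__":
    print(smith_waterman_score("ACGTACGT", "ACGTACGT"))
```

Only two rows of state are kept, so memory grows with the length of one sequence rather than with the full matrix. That is sufficient when only the score is needed to rank database sequences.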

Figures

Figure 1

Computation of the diagonal scores using SIMD technology. Computation of the diagonal scores is performed efficiently in the order indicated using bands that are 32 diagonals wide.
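The quantity computed in this first stage can be illustrated with a scalar sketch. It does not reproduce the SIMD layout or the 32-diagonal bands of Figure 1. For each diagonal it keeps a running temporary score that is reset to zero whenever it would go negative, and records the maximum as that diagonal's optimal ungapped score. The function name, the diagonal numbering d = j - i, and the `match`/`mismatch` values are assumptions made for this example; a real search would use a substitution matrix.

```python
def diagonal_scores(query, subject, match=2, mismatch=-1):
    """Return {d: best ungapped local score on diagonal d}, with d = j - i.

    Scalar illustration only; match/mismatch are hypothetical scoring values.
    """
    m, n = len(query), len(subject)
    scores = {}
    for d in range(-(m - 1), n):
        i = max(0, -d)            # first query position on this diagonal
        j = i + d                 # matching subject position
        running = 0               # temporary score along the diagonal
        best = 0                  # optimal ungapped score for the diagonal
        while i < m and j < n:
            s = match if query[i] == subject[j] else mismatch
            running = max(running + s, 0)   # restart when the score drops below zero
            best = max(best, running)
            i += 1
            j += 1
        scores[d] = best
    return scores


if __name__ == "__main__":
    result = diagonal_scores("ACGT", "TTACGT")
    print(max(result, key=result.get), max(result.values()))
```

Because each diagonal is independent of the others, many diagonals can be processed at once, which is what makes this step a natural fit for SIMD instructions.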

Figure 2

Computation of the estimated gapped alignment score. The numbers within the matrix are the temporary scores (e_i) along the diagonals from the initial computation of the diagonal scores. The numbers directly outside the matrix are the optimal ungapped alignment scores (S_d) for each diagonal. The second numbers outside the matrix are the temporary scores (u_d) used in the calculation of the estimated gapped alignment score (T), which in this example is 3 + 11 + 1 + 22 = 37. The BLOSUM62 matrix was used in combination with the parameters q = 11, r = 1 and c = 3 in this example. The calculations were performed in order of increasing diagonal numbers.

Figure 3

Comparison of database search sensitivity and selectivity. The sensitivity (coverage) versus the selectivity (EPQ) is plotted for a range of database search programs using either (A) the BLOSUM50 matrix and a 10 + 2k gap penalty or (B) the BLOSUM62 matrix and an 11 + k gap penalty.

Figure 4

Comparison of database search speed. Search time versus query sequence length is plotted for the different search programs and the 11 query sequences (see Results). The search time used is the total CPU time of the fastest of three consecutive runs on a minimally loaded computer. With a database of only 29 MB and 128 MB of RAM, all of the database was cached in the computer’s RAM; disk reading time should then be negligible.
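The timing protocol in this caption (CPU time of the fastest of several consecutive runs) can be sketched as below. The helper name `fastest_cpu_time` and the callable `search` are hypothetical; the paper's own benchmarking scripts are not shown here. `time.process_time` reports CPU time of the current process only, so an external search program would need a different measurement.

```python
import time


def fastest_cpu_time(search, runs=3):
    """Run `search` several times and return the smallest CPU time in seconds.

    `search` is any zero-argument callable; the name is an assumption for this sketch.
    """
    times = []
    for _ in range(runs):
        start = time.process_time()       # CPU time, not wall-clock time
        search()
        times.append(time.process_time() - start)
    return min(times)                     # keep the run least disturbed by other load


if __name__ == "__main__":
    print(fastest_cpu_time(lambda: sum(range(1_000_000))))
```

Taking the minimum rather than the mean reduces the influence of background activity on the reported figure.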

References

    1. Smith,T.F. and Waterman,M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147, 195–197.
    2. Pearson,W.R. and Lipman,D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448.
    3. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
    4. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
    5. Hughey,R. (1996) Parallel hardware for sequence comparison and alignment. Comput. Appl. Biosci., 12, 473–479.
