HH-suite3 for fast remote homology detection and deep protein annotation - PubMed

HH-suite3 for fast remote homology detection and deep protein annotation

Martin Steinegger et al. BMC Bioinformatics. 2019.

Abstract

Background: HH-suite is a widely used open source software suite for sensitive sequence similarity searches and protein fold recognition. It is based on pairwise alignment of profile Hidden Markov models (HMMs), which represent multiple sequence alignments of homologous proteins.

Results: We developed a single-instruction multiple-data (SIMD) vectorized implementation of the Viterbi algorithm for profile HMM alignment and introduced various other speed-ups. These accelerated the search methods HHsearch by a factor of 4 and HHblits by a factor of 2 over the previous version 2.0.16. HHblits3 is ∼10× faster than PSI-BLAST and ∼20× faster than HMMER3. Jobs to perform HHsearch and HHblits searches with many query profile HMMs can be parallelized over cores and over cluster servers using OpenMP and the message passing interface (MPI). The free, open-source, GPLv3-licensed software is available at https://github.com/soedinglab/hh-suite.

Conclusion: The added functionalities and increased speed of HHsearch and HHblits should facilitate their use in large-scale protein structure and function prediction, e.g. in metagenomics and genomics projects.

Keywords: Algorithm; Functional annotation; Homology detection; Profile HMM; Protein alignment; SIMD; Sequence search.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1

HMM-HMM alignment of query and target. The alignment is represented as a red path through both HMMs. The corresponding pair state sequence is MM, MM, MI, MM, MM, DG, MM

Fig. 2

SIMD parallelization over target profile HMMs. Batches of 4 or 8 database profile HMMs are aligned together by the vectorized Viterbi algorithm. Each cell (i,j) in the dynamic programming matrix is processed in parallel for 4 or 8 target HMMs
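
The lockstep processing described in this caption can be emulated in plain Python, with each list of length 4 standing in for one SIMD register. This is only a sketch under stated assumptions: the recurrence is a simplified ungapped local score rather than the five-pair-state HMM-HMM Viterbi recurrence, and the names `vmax`, `vadd`, `batched_best` and `scalar_best` are hypothetical, not part of HH-suite.

```python
def vmax(a, b):
    """Lane-wise maximum of two emulated SIMD registers."""
    return [max(x, y) for x, y in zip(a, b)]


def vadd(a, b):
    """Lane-wise sum of two emulated SIMD registers."""
    return [x + y for x, y in zip(a, b)]


def scalar_best(query, target, match=1.0, mismatch=-1.0):
    """Reference: best ungapped local score of one query against one target."""
    best = 0.0
    prev = [0.0] * (len(target) + 1)
    for q in query:
        curr = [0.0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            sub = match if target[j - 1] == q else mismatch
            curr[j] = max(0.0, prev[j - 1] + sub)
            best = max(best, curr[j])
        prev = curr
    return best


def batched_best(query, targets, match=1.0, mismatch=-1.0):
    """Same recurrence, but every cell (i, j) is computed for all targets at once."""
    lanes = len(targets)
    width = max(len(t) for t in targets)
    # Pad shorter targets with None, which never matches a query symbol.
    padded = [list(t) + [None] * (width - len(t)) for t in targets]
    zero = [0.0] * lanes
    row = [zero for _ in range(width + 1)]  # one register per column j
    best = zero
    for q in query:
        diag = zero  # S(i-1, 0) for every lane
        for j in range(1, width + 1):
            up = row[j]  # S(i-1, j), needed as the next diagonal
            sub = [match if padded[k][j - 1] == q else mismatch
                   for k in range(lanes)]
            row[j] = vmax(zero, vadd(diag, sub))
            best = vmax(best, row[j])
            diag = up
    return best
```

Padding the batch to a common length wastes some work in the shorter lanes, which suggests that batching targets of similar length would be preferable; whether HH-suite3 does so is not stated in this caption.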

Fig. 3

The layout of the log transition probabilities (top) and emission probabilities (bottom) in memory for single-instruction single-data (SISD) and SIMD algorithms. For the SIMD algorithm, 4 (using SSE2) or 8 (using AVX2) target profile HMMs (t1–t4 in the SSE2 case) are stored together in interleaved fashion: the 4 or 8 transition or emission values at position i in these HMMs are stored consecutively (indicated by the same color). In this way, a single cache line read of 64 bytes can fill four SSE2 or two AVX2 SIMD registers with 4 or 8 values each
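
A minimal sketch of the interleaved layout follows, assuming 4-byte values (which is what "four SSE2 or two AVX2 registers per 64-byte cache line" implies). The helper names `interleave` and `lane_value` and the padding value are illustrative assumptions, not HH-suite identifiers.

```python
CACHE_LINE_BYTES = 64
VALUE_BYTES = 4              # assumption: 32-bit values
SSE2_LANES = 128 // 32       # 4 values per 128-bit register
AVX2_LANES = 256 // 32       # 8 values per 256-bit register

VALUES_PER_LINE = CACHE_LINE_BYTES // VALUE_BYTES
SSE2_REGS_PER_LINE = VALUES_PER_LINE // SSE2_LANES
AVX2_REGS_PER_LINE = VALUES_PER_LINE // AVX2_LANES


def interleave(targets, pad=0.0):
    """Store the values at position i of all targets next to each other."""
    max_len = max(len(t) for t in targets)
    flat = []
    for i in range(max_len):
        for t in targets:
            # Shorter targets are padded so every position has one value per lane.
            flat.append(t[i] if i < len(t) else pad)
    return flat


def lane_value(flat, lanes, i, k):
    """Value of target k at position i in the interleaved array."""
    return flat[i * lanes + k]
```

With this layout, the slice `flat[i * lanes:(i + 1) * lanes]` is exactly what one register load at position i would fetch.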

Fig. 4

Two approaches to reduce the memory requirement for the DP score matrices from $O(L_q L_t)$ to $O(L_t)$, where $L_q$ and $L_t$ are the lengths of the query and target profile, respectively. (Top) One vector holds the scores of the previous row, $S_{XY}(i-1,\cdot)$, for pair state XY ∈ {MM, MI, IM, GD, DG}, and the other holds the scores of the current row, $S_{XY}(i,\cdot)$, for pair state XY ∈ {MM, MI, IM, GD, DG}. Vector pointers are swapped after each row has been processed. (Bottom) A single vector per pair state XY holds the scores of the current row up to $j-1$ and of the previous row for $j$ to $L_t$. The second approach is somewhat faster and was chosen for HH-suite3
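
Both memory-saving schemes can be illustrated on a much simpler dynamic program. The sketch below uses the longest-common-subsequence recurrence as a stand-in, since it has the same dependence on the cells (i−1, j−1), (i−1, j) and (i, j−1); it is not the HMM-HMM Viterbi recurrence, and the function names are hypothetical.

```python
def lcs_two_rows(a, b):
    """Top scheme: keep previous and current row, swap them after each row."""
    prev = [0] * (len(b) + 1)
    curr = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev, curr = curr, prev  # pointer swap, no copying
    return prev[len(b)]


def lcs_single_row(a, b):
    """Bottom scheme: one vector holds row i up to j-1 and row i-1 from j on."""
    row = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        diag = 0  # S(i-1, 0), saved before it would be overwritten
        for j in range(1, len(b) + 1):
            up = row[j]  # still S(i-1, j) at this point
            if a[i - 1] == b[j - 1]:
                row[j] = diag + 1
            else:
                row[j] = max(up, row[j - 1])
            diag = up  # becomes S(i-1, j-1) for the next column
    return row[len(b)]
```

The single-row variant needs one scalar (`diag`) to preserve the diagonal predecessor that the in-place update would otherwise destroy.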

Fig. 5

Predecessor pair states for backtracing the Viterbi alignments are stored in a single byte of the backtrace matrix in HH-suite3 to reduce memory requirements. Bits 0 to 2 (blue) store the predecessor state of the MM state, and bits 3 to 6 store the predecessors of the GD, IM, DG and MI pair states. The last bit marks cells that are not allowed to be part of a suboptimal alignment because they are near a cell that was part of a better-scoring alignment
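
The byte layout in this caption can be sketched with plain bit operations. The exact assignment of bits 3 to 6 (here GD = 3, IM = 4, DG = 5, MI = 6, following the order in which the caption lists them) and the meaning of each single bit are assumptions made for illustration; the names below are hypothetical.

```python
MM_MASK = 0b00000111                         # bits 0-2: predecessor of MM (0..7)
GD_BIT, IM_BIT, DG_BIT, MI_BIT = 3, 4, 5, 6  # assumed order, one bit per state
EXCLUDED_BIT = 7                             # cell barred from suboptimal alignments


def pack_cell(mm_pred, gd, im, dg, mi, excluded):
    """Pack all backtrace information for one DP cell into a single byte."""
    if not 0 <= mm_pred <= MM_MASK:
        raise ValueError("MM predecessor must fit in 3 bits")
    byte = mm_pred
    byte |= int(bool(gd)) << GD_BIT
    byte |= int(bool(im)) << IM_BIT
    byte |= int(bool(dg)) << DG_BIT
    byte |= int(bool(mi)) << MI_BIT
    byte |= int(bool(excluded)) << EXCLUDED_BIT
    return byte


def unpack_cell(byte):
    """Inverse of pack_cell."""
    return (
        byte & MM_MASK,
        bool(byte >> GD_BIT & 1),
        bool(byte >> IM_BIT & 1),
        bool(byte >> DG_BIT & 1),
        bool(byte >> MI_BIT & 1),
        bool(byte >> EXCLUDED_BIT & 1),
    )
```

Three bits suffice for the MM predecessor because MM can be entered from only a handful of states, while each of the other four pair states needs just one bit if it has two possible predecessors.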

Fig. 6

Speed comparisons. a Runtime versus query profile length for 1644 searches with profile HMMs randomly sampled from UniProt. These queries were searched against the PDB70 database containing 35 000 profile HMMs of average length 234. The average speedup over HHsearch 2.0.16 is 3.2-fold for SSE2-vectorized HHsearch and 4.2-fold for AVX2-vectorized HHsearch. b Box plot for the distribution of total runtimes (in logarithmic scale) for one, two, or three search iterations using the 1644 profile HMMs as queries. PSI-BLAST and HMMER3 searches were done against the UniProt database (version 2015_06) containing 49 293 307 sequences. HHblits searches were done against the uniprot20 database, a clustered version of UniProt containing profile HMMs for each of its 7 313 957 sequence clusters. Colored numbers: speed-up factors relative to HMMER3

Fig. 7

Sensitivity of sequence search tools. a We searched with 6616 SCOP20 domain sequences through the UniProt plus SCOP20 database using one to three search iterations. The sensitivity to detect homologous sequences is measured by the cumulative distribution of the Area Under the Curve 1 (AUC1), the fraction of true positives ranked better than the first false positive match. True positive matches are defined as being from the same SCOP superfamily [25]; false positives have different SCOP folds, except for known cases of inter-fold homologies. b Sensitivity of HHsearch with and without scoring secondary structure similarity, measured by the cumulative distribution of AUC1 for a comparison of 6616 profile HMMs built from SCOP20 domain sequences. Query HMMs include predicted secondary structure, target HMMs include actual secondary structure annotated by DSSP. True and false positives are defined as in a
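
The AUC1 measure defined in this caption can be computed from a ranked hit list as sketched below. The label strings, the handling of matches that are neither true nor false positives (same fold but different superfamily, labelled `"IGNORE"` here), and the choice of denominator are assumptions for illustration; `auc1` is a hypothetical helper.

```python
def auc1(ranked_labels, total_true=None):
    """Fraction of true positives ranked above the first false positive.

    ranked_labels: labels ordered from best to worst hit, each one of
        "TP", "FP" or "IGNORE".
    total_true: number of true positives that could have been found;
        defaults to the number of "TP" labels in the list (an assumption).
    """
    if total_true is None:
        total_true = sum(1 for label in ranked_labels if label == "TP")
    if total_true == 0:
        return 0.0
    found = 0
    for label in ranked_labels:
        if label == "FP":
            break  # everything below the first false positive is not counted
        if label == "TP":
            found += 1
    return found / total_true
```

For example, `auc1(["TP", "TP", "IGNORE", "FP", "TP"])` counts two of three true positives; the cumulative distribution plotted in the figure would then be taken over the AUC1 values of all queries.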

Cited by

Near-atomic structure of an atadenovirus reveals a conserved capsid-binding motif and intergenera variations in cementing proteins.
Marabini R, Condezo GN, Krupovic M, Menéndez-Conejero R, Gómez-Blanco J, San Martín C. Sci Adv. 2021 Mar 31;7(14):eabe6008. doi: 10.1126/sciadv.abe6008. PMID: 33789897
Comparative Computational Analysis of Spike Protein Structural Stability in SARS-CoV-2 Omicron Subvariants.
Balupuri A, Kim JM, Choi KE, No JS, Kim IH, Rhee JE, Kim EJ, Kang NS. Int J Mol Sci. 2023 Nov 8;24(22):16069. doi: 10.3390/ijms242216069. PMID: 38003257
Analysis of Pseudomonas aeruginosa Isolates from Patients with Cystic Fibrosis Revealed Novel Groups of Filamentous Bacteriophages.
Evseev P, Bocharova J, Shagin D, Chebotar I. Viruses. 2023 Nov 5;15(11):2215. doi: 10.3390/v15112215. PMID: 38005892
Antibacterial properties and urease suppression ability of Lactobacillus inhibit the development of infectious urinary stones caused by Proteus mirabilis.
Szczerbiec D, Bednarska-Szczepaniak K, Torzewska A. Sci Rep. 2024 Jan 10;14(1):943. doi: 10.1038/s41598-024-51323-0. PMID: 38200115
Antibacterial T6SS effectors with a VRR-Nuc domain are structure-specific nucleases.
Hespanhol JT, Sanchez-Limache DE, Nicastro GG, Mead L, Llontop EE, Chagas-Santos G, Farah CS, de Souza RF, Galhardo RDS, Lovering AL, Bayer-Santos E. Elife. 2022 Oct 13;11:e82437. doi: 10.7554/eLife.82437. PMID: 36226828

References

1. Howe AC, Jansson JK, Malfatti SA, Tringe SG, Tiedje JM, Brown CT. Tackling soil diversity with the assembly of large, complex metagenomes. Proc Natl Acad Sci USA. 2014;111(13):4904–4909. doi: 10.1073/pnas.1402564111.
2. Söding J, Remmert M. Protein sequence comparison and fold recognition: progress and good-practice benchmarking. Curr Opin Struct Biol. 2011;21(3):404–411. doi: 10.1016/j.sbi.2011.03.005.
3. Eddy SR. A new generation of homology search tools based on probabilistic inference. Genome Inform. 2009;23(1):205–211.
4. Eddy SR. Accelerated Profile HMM Searches. PLOS Comput Biol. 2011;7(10):e1002195. doi: 10.1371/journal.pcbi.1002195.
5. Remmert M, Biegert A, Hauser A, Söding J. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods. 2012;9(2):173–175. doi: 10.1038/nmeth.1818.
