Hidden Markov model speed heuristic and iterative HMM search procedure

L Steven Johnson et al. BMC Bioinformatics. 2010.

Abstract

Background: Profile hidden Markov models (profile-HMMs) are sensitive tools for remote protein homology detection, but the main scoring algorithms, Viterbi or Forward, require considerable time to search large sequence databases.

Results: We have designed a series of database filtering steps, HMMERHEAD, that are applied prior to the scoring algorithms, as implemented in the HMMER package, in an effort to reduce search time. Using this heuristic, we obtain a 20-fold decrease in Forward and a 6-fold decrease in Viterbi search time with a minimal loss in sensitivity relative to the unfiltered approaches. We then implemented an iterative profile-HMM search method, JackHMMER, which employs the HMMERHEAD heuristic. Due to our search heuristic, we eliminated the subdatabase creation that is common in current iterative profile-HMM approaches. On our benchmark, JackHMMER detects 14% more remote protein homologs than SAM's iterative method T2K.

Conclusions: Our search heuristic, HMMERHEAD, significantly reduces the time needed to score a profile-HMM against large sequence databases. This search heuristic allowed us to implement an iterative profile-HMM search method, JackHMMER, which detects significantly more remote protein homologs than SAM's T2K and NCBI's PSI-BLAST.
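The general shape of an iterative profile search can be sketched as follows. This is a generic illustration, not JackHMMER's implementation: the `build_model` and `search` callables, the k-mer "model", and the toy sequences are all hypothetical stand-ins, and a real profile-HMM search would apply score or E-value inclusion thresholds that are not modeled here. Note that each round searches the full database directly, which loosely mirrors the abstract's point about avoiding subdatabase creation.

```python
def iterative_search(query, database, build_model, search, max_iterations=5):
    """Generic iterative search loop (hypothetical sketch).

    Each round rebuilds the model from all sequences included so far and
    searches the whole database again, stopping when no new hits appear.
    """
    included = {query}
    model = build_model(included)
    for _ in range(max_iterations):
        hits = set(search(model, database))
        new_included = included | hits
        if new_included == included:
            break  # converged: the latest round added nothing new
        included = new_included
        model = build_model(included)
    return model, included


# Toy stand-ins (assumptions for illustration only): the "model" is the set
# of 3-mers seen in the included sequences, and a database sequence is a
# "hit" if it shares at least one 3-mer with the model.
def kmers(seq, k=3):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_kmer_model(sequences):
    model = set()
    for seq in sequences:
        model |= kmers(seq)
    return model


def kmer_search(model, database):
    return [seq for seq in database if kmers(seq) & model]


toy_database = ["BCDE", "CDEF", "XYZW"]
toy_model, toy_included = iterative_search(
    "ABCD", toy_database, build_kmer_model, kmer_search
)
```

In the toy run, "CDEF" shares no 3-mer with the query "ABCD" and is only reached in the second round, after "BCDE" has been added to the model. That is the basic reason iteration can reach more remote relatives than a single search.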

Figures

Figure 1

Minimum Error Rate Relative to Mean Search Time. 250 randomly chosen benchmarking models were used in HMMERHEAD Viterbi searches of NRDB90 and the test database over a range of θ and η values. All other parameters were kept at their default values (the defaults are θ = 6, δ = 2, μ = 7, and η = 20). Plotting minimum error rate versus mean search time for these searches reveals a dramatic increase in minimum error rate, in exchange for only a minor decrease in search time, when θ and η exceed 6 and 20, respectively. This further supports our choice of these values as HMMERHEAD's default settings. The total number of true homologous pairs between those 250 models and the test benchmark was 3,617. At these parameter settings, 938 true positives are identified at 0 false positives, or 26% of the possible true positives.
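A minimal sketch of how a minimum error rate could be computed from a ranked hit list is given below. It assumes the common definition (the smallest value of false positives plus false negatives over all score thresholds); the caption does not spell out the definition used, so treat this as an assumption. The function name and the toy hit list are hypothetical, and tied scores are not handled specially.

```python
def minimum_error_rate(hits, total_true_pairs):
    """Smallest (false positives + false negatives) over all score cutoffs.

    hits: iterable of (score, is_true_homolog); a higher score is better.
    total_true_pairs: number of true homologous pairs in the benchmark,
    including any that never appear in the hit list.
    """
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)
    best = total_true_pairs  # cutoff above every hit: all true pairs missed
    tp = fp = 0
    for _score, is_true in ranked:
        if is_true:
            tp += 1
        else:
            fp += 1
        errors = fp + (total_true_pairs - tp)
        best = min(best, errors)
    return best


# Toy data (made up for illustration): 5 hits, 4 true pairs in total.
toy_hits = [(9.0, True), (8.0, False), (7.0, True), (6.0, True), (5.0, False)]
toy_mer = minimum_error_rate(toy_hits, total_true_pairs=4)

# Coverage quoted in the caption: 938 of 3,617 true pairs at 0 false positives.
coverage_percent = round(100 * 938 / 3617)
```

In the toy case, the best cutoff accepts the top four hits, giving one false positive and one missed true pair.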

Figure 2

Remote Homology Detection of HMMERHEAD and HMMER 2.5.1 Forward and Viterbi. Each of the 2,521 benchmarking models was scored against the test database using either HMMER 2.5.1 or HMMERHEAD with Forward or Viterbi scoring. The results of the searches were combined and scored. This procedure was then repeated for each of the 1000 bootstrapping replicate test sets. The average number of true positives was plotted versus errors per query. Minimum and maximum numbers of true positives from the replicates are shown as error bars. HMMERHEAD Forward (Red) performance is shown relative to default HMMER 2.5.1 Forward (Blue), WU-BLAST (Black dashed), and WU-BLAST Family Pairwise Search (Black). Default HMMERHEAD Forward detects an average of 269 fewer true positives than default HMMER 2.5.1 Forward and detects significantly more true positives than either pairwise sequence comparison method. HMMERHEAD Viterbi (Red dashed) performance is shown relative to default HMMER 2.5.1 Viterbi (Blue dashed). Default HMMERHEAD Viterbi detects an average of 173 fewer true positives than default HMMER 2.5.1 Viterbi and again outperforms either pairwise comparison method. The total number of true homologous pairs between the 2,521 models and the test database is 39,733, and thus 8,000 true positives correspond to identifying 20% of the homologs.
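The evaluation described in this caption (true positives versus errors per query, with bootstrap replicates giving the mean, minimum, and maximum) can be sketched roughly as follows. The caption does not say what unit is resampled, so this sketch assumes that test-database sequences are resampled with replacement; the function names, the hit tuple layout, and the toy data are all hypothetical.

```python
import random
from collections import Counter


def true_positives_at_epq(hits, n_queries, errors_per_query):
    """Count true positives ranked above the cutoff at which the number of
    false positives would exceed errors_per_query * n_queries.

    hits: iterable of (score, is_true_homolog) pooled over all queries.
    """
    allowed_fp = errors_per_query * n_queries
    tp = fp = 0
    for _score, is_true in sorted(hits, key=lambda h: h[0], reverse=True):
        if is_true:
            tp += 1
        else:
            fp += 1
            if fp > allowed_fp:
                break
    return tp


def bootstrap_true_positives(hits, targets, n_queries, errors_per_query,
                             n_replicates=1000, seed=0):
    """Mean, min, and max true positives over bootstrap replicates.

    hits: list of (score, is_true_homolog, target_id).
    targets: list of all target ids in the test database.
    Assumption: each replicate resamples targets with replacement, and a
    target drawn k times contributes each of its hits k times.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_replicates):
        draw = Counter(rng.choices(targets, k=len(targets)))
        replicate = [(score, is_true)
                     for score, is_true, target in hits
                     for _ in range(draw[target])]
        counts.append(
            true_positives_at_epq(replicate, n_queries, errors_per_query))
    return sum(counts) / len(counts), min(counts), max(counts)


# Toy data (made up): pooled hits for one query against five targets.
toy_hits = [(9, True, "t1"), (8, False, "t2"), (7, True, "t3"),
            (6, False, "t4"), (5, True, "t5")]
toy_targets = ["t1", "t2", "t3", "t4", "t5"]
mean_tp, min_tp, max_tp = bootstrap_true_positives(
    toy_hits, toy_targets, n_queries=1, errors_per_query=1, n_replicates=200)
```

Repeating `true_positives_at_epq` over a grid of errors-per-query values would trace out one curve of the kind plotted in the figure, with the bootstrap minimum and maximum supplying the error bars.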

Figure 3

Performance of Iterative Methods. Each of the 2,521 benchmarking sequences was used to iteratively search a non-redundant version of NCBI's NR database. The iterative models created from this process were then scored against the test database. The results of the searches were combined and scored. This procedure was then repeated for each of the 1000 bootstrapping replicate test sets. The average number of true positives was plotted versus errors per query. Minimum and maximum numbers of true positives from the replicates are shown as error bars. JackHMMER (Red) detects an average of 1,337 more homologs than SAM 3.5's target2k (Blue) and an average of 2,476 more homologs than NCBI's PSI-BLAST (Black) across the errors per query range. This represents an increase of 14% and 28%, respectively, in remote protein homologs detected. As before, the total number of true homologous pairs between the 2,521 models and the test database is 39,733, and thus 12,000 true positives correspond to identifying 30% of the homologs.
