UProC: tools for ultra-fast protein domain classification

Peter Meinicke. Bioinformatics. 2015.

Abstract

Motivation: With rapidly increasing volumes of biological sequence data, the functional analysis of new sequences in terms of similarities to known protein families challenges classical bioinformatics.

Results: The ultrafast protein classification (UProC) toolbox implements a novel algorithm ('Mosaic Matching') for large-scale sequence analysis. UProC is three orders of magnitude faster than profile-based methods and, in a metagenome simulation study, achieved up to 80% higher sensitivity on unassembled 100 bp reads.

Availability and implementation: UProC is available as open-source software at https://github.com/gobics/uproc. Precompiled databases (Pfam) are linked on the UProC homepage: http://uproc.gobics.de/.

Contact: peter@gobics.de.

Supplementary information: Supplementary data are available at Bioinformatics online.

© The Author 2014. Published by Oxford University Press.

Figures

Fig. 1.

UProC workflow and Mosaic Matching sketch. For DNA input sequences, all ORFs of at least 60 bp are first identified, filtered and translated. The protein sequences are then analysed with the Mosaic Matching algorithm, which compares all oligopeptides in the query sequence with oligopeptides from reference sequences in the database. From all matching reference oligopeptides with the same family label, a maximum substitution score is computed for each residue and summed over the whole sequence to give the total Mosaic Matching score. If this score exceeds a length-dependent noise threshold, the protein hit and the corresponding score are written to the output. The substitution scores that result from oligopeptide comparisons using the PSSM are indicated by heatmap color (red: high, blue: low). The example shows all matching oligopeptides that contribute to the total score of Pfam family PF01370.
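
The scoring step described in this caption can be illustrated with a small Python sketch. It is not the UProC implementation. The word length of 4, the toy identity-based substitution score (standing in for the PSSM), the rule that a word pair "matches" when its summed score is positive, and the linear noise threshold are all assumptions chosen only to keep the example short. The names `mosaic_scores`, `noise_threshold` and `classify` are hypothetical.

```python
WORD_LEN = 4  # hypothetical toy word length, not taken from the paper


def substitution_score(a, b):
    """Toy stand-in for the PSSM: +2 for identical residues, -1 otherwise."""
    return 2 if a == b else -1


def words(seq, k=WORD_LEN):
    """Return (start, oligopeptide) pairs for every overlapping k-mer in seq."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def mosaic_scores(query, database):
    """Compute one total score per family.

    database maps a family label to a list of reference oligopeptides
    of length WORD_LEN (an assumed, simplified database layout).
    """
    totals = {}
    for family, ref_words in database.items():
        # Best substitution score seen so far for each query residue.
        best = [None] * len(query)
        for start, qword in words(query):
            for rword in ref_words:
                per_res = [substitution_score(q, r) for q, r in zip(qword, rword)]
                if sum(per_res) <= 0:
                    continue  # assumed match criterion: positive word score
                for offset, score in enumerate(per_res):
                    pos = start + offset
                    if best[pos] is None or score > best[pos]:
                        best[pos] = score
        # Sum the per-residue maxima over the whole sequence.
        totals[family] = sum(s for s in best if s is not None)
    return totals


def noise_threshold(length, per_residue=0.5):
    """Assumed length-dependent cutoff: grows linearly with sequence length."""
    return per_residue * length


def classify(query, database):
    """Keep only the families whose total score exceeds the noise threshold."""
    cutoff = noise_threshold(len(query))
    return {fam: s for fam, s in mosaic_scores(query, database).items() if s > cutoff}


if __name__ == "__main__":
    toy_db = {"FAM_A": ["MKVL", "VLAG"], "FAM_B": ["WWWW"]}
    print(classify("MKVLAG", toy_db))
```

In this toy run the two overlapping reference words of `FAM_A` together cover all six query residues, so each residue contributes its maximum score once, which is the "mosaic" idea the caption describes. A real database would hold far more words than this brute-force comparison could handle.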

Fig. 2.

Contributions of different word positions to the PSSM in terms of the SSW obtained from regularized least-squares classifier training (see text).
