UProC: tools for ultra-fast protein domain classification

Peter Meinicke. Bioinformatics. 2015.

Abstract

Motivation: With rapidly increasing volumes of biological sequence data, the functional analysis of new sequences in terms of similarities to known protein families challenges classical bioinformatics.

Results: The ultrafast protein classification (UProC) toolbox implements a novel algorithm ('Mosaic Matching') for large-scale sequence analysis. UProC is three orders of magnitude faster than profile-based methods and, in a metagenome simulation study, achieved up to 80% higher sensitivity on unassembled 100 bp reads.

Availability and implementation: UProC is available as open-source software at https://github.com/gobics/uproc. Precompiled databases (Pfam) are linked on the UProC homepage: http://uproc.gobics.de/.

Contact: peter@gobics.de.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.

UProC workflow and Mosaic Matching sketch. For DNA input sequences, all ORFs of at least 60 bp are first identified, filtered and translated. The protein sequences are then analysed with the Mosaic Matching algorithm, which compares all oligopeptides in the query sequence with oligopeptides from reference sequences in the database. From all matching reference oligopeptides with the same family label, a maximum substitution score is computed for each residue, and these maxima are summed over the whole sequence to give the total Mosaic Matching score. If this score exceeds a length-dependent noise threshold, the protein hit and the corresponding score are written to the output. The substitution scores that result from oligopeptide comparisons using PSSM are indicated by heatmap color (red: high, blue: low). The example shows all matching oligopeptides that contribute to the total score of Pfam family PF01370.
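
The scoring scheme described in this caption can be illustrated with a minimal sketch. It is not the UProC implementation: the word length, the toy identity/mismatch scores, the rule that a reference word "matches" when its summed score is positive, the brute-force comparison against every reference word, and the linear form of the noise threshold are all assumptions made for illustration. The names `mosaic_scores`, `noise_threshold` and `classify` are hypothetical.

```python
from collections import defaultdict


def words(seq, w):
    """Yield (start, word) for every overlapping word of length w in seq."""
    for i in range(len(seq) - w + 1):
        yield i, seq[i:i + w]


def position_scores(query_word, ref_word, weights):
    """Toy position-specific scores (assumption): +2 identity, -1 mismatch,
    each scaled by a per-position weight."""
    return [wt * (2.0 if q == r else -1.0)
            for q, r, wt in zip(query_word, ref_word, weights)]


def mosaic_scores(query, database, weights, min_word_score=0.0):
    """Return {family: total score} for a query protein.

    database is a list of (family_label, reference_word) pairs.
    For each family, every query residue keeps the maximum score it receives
    from any matching reference word; the total is the sum of these maxima.
    """
    best = defaultdict(lambda: [0.0] * len(query))
    for start, qword in words(query, len(weights)):
        for family, rword in database:
            scores = position_scores(qword, rword, weights)
            # Assumed matching rule: ignore words whose summed score is not positive.
            if sum(scores) <= min_word_score:
                continue
            residue_best = best[family]
            for offset, s in enumerate(scores):
                pos = start + offset
                residue_best[pos] = max(residue_best[pos], s)
    return {family: sum(per_residue) for family, per_residue in best.items()}


def noise_threshold(length, per_residue=0.5):
    """Hypothetical length-dependent threshold: grows linearly with length."""
    return per_residue * length


def classify(query, database, weights):
    """Keep only families whose total score exceeds the noise threshold."""
    cutoff = noise_threshold(len(query))
    return {f: s for f, s in mosaic_scores(query, database, weights).items()
            if s > cutoff}


if __name__ == "__main__":
    db = [("FAM_A", "ACDE"), ("FAM_A", "DEFG"), ("FAM_B", "WWWW")]
    print(classify("ACDEFG", db, weights=[1.0, 1.0, 1.0, 1.0]))
```

In this toy run, two overlapping reference words with the same family label jointly cover the query, and each residue counts only once with its best score, which is the "mosaic" idea. A real tool would need an indexed lookup instead of the nested loop to reach the reported speed.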

Fig. 2.

Contributions of different word positions to PSSM in terms of the SSW obtained from regularized least-squares classifier training (see text)

Cited by

Tax4Fun2: prediction of habitat-specific functional profiles and functional redundancy based on 16S rRNA gene sequences.
Wemheuer F, Taylor JA, Daniel R, Johnston E, Meinicke P, Thomas T, Wemheuer B. Environ Microbiome. 2020 May 18;15(1):11. doi: 10.1186/s40793-020-00358-7. PMID: 33902725. Free PMC article.
The green impact: bacterioplankton response toward a phytoplankton spring bloom in the southern North Sea assessed by comparative metagenomic and metatranscriptomic approaches.
Wemheuer B, Wemheuer F, Hollensteiner J, Meyer FD, Voget S, Daniel R. Front Microbiol. 2015 Aug 11;6:805. doi: 10.3389/fmicb.2015.00805. eCollection 2015. PMID: 26322028. Free PMC article.
Metagenomic Profiling of Ocular Surface Microbiome Changes in Meibomian Gland Dysfunction.
Zhao F, Zhang D, Ge C, Zhang L, Reinach PS, Tian X, Tao C, Zhao Z, Zhao C, Fu W, Zeng C, Chen W. Invest Ophthalmol Vis Sci. 2020 Jul 1;61(8):22. doi: 10.1167/iovs.61.8.22. PMID: 32673387. Free PMC article.
16S rDNA profiling of Loach (Misgurnus anguillicus) fed with soybean fermented powder intestinal flora in response to Lipopolysaccharide (LPS) infection.
Dai W, Liu Y, Zhang X, Dai L. Heliyon. 2023 Nov 11;9(11):e22369. doi: 10.1016/j.heliyon.2023.e22369. eCollection 2023 Nov. PMID: 38053882. Free PMC article.
Comprehensive Longitudinal Microbiome Analysis of the Chicken Cecum Reveals a Shift From Competitive to Environmental Drivers and a Window of Opportunity for Campylobacter.
Ijaz UZ, Sivaloganathan L, McKenna A, Richmond A, Kelly C, Linton M, Stratakos AC, Lavery U, Elmi A, Wren BW, Dorrell N, Corcionivoschi N, Gundogdu O. Front Microbiol. 2018 Oct 15;9:2452. doi: 10.3389/fmicb.2018.02452. eCollection 2018. PMID: 30374341. Free PMC article.

