Accurate prediction of enzyme subfamily class using an adaptive fuzzy k-nearest neighbor method
Related papers
Efficiency analysis of KNN and minimum distance-based classifiers in enzyme family prediction
Computational Biology and Chemistry / Computers & Chemistry, 2009
Nearly all enzymes are proteins. They are biological catalysts that accelerate cellular reactions. According to the type of reaction they catalyze, they are divided into six classes: oxidoreductases (EC-1), transferases (EC-2), hydrolases (EC-3), lyases (EC-4), isomerases (EC-5), and ligases (EC-6). Predicting the enzyme class is of great importance in identifying which class a protein belongs to. Because the number of known enzyme sequences grows day by day, data mining techniques offer a useful and time-saving alternative to experimental analysis when predicting the class of a newly found enzyme sequence.
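The two classifier families named in this entry's title can be contrasted in a few lines. The sketch below is illustrative only: the 2-D feature vectors are invented, Euclidean distance is an assumed metric, and the function names `knn_predict` and `min_distance_predict` are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter, defaultdict

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=3):
    """Majority vote among the k training samples closest to the query."""
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def min_distance_predict(train, query):
    """Assign the label whose class centroid (mean vector) is closest."""
    groups = defaultdict(list)
    for vec, label in train:
        groups[label].append(vec)
    centroids = {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }
    return min(centroids, key=lambda label: euclidean(centroids[label], query))

# Toy data (assumption): two well-separated classes in a 2-D feature space.
train = [
    ([0.0, 0.0], "EC-1"), ([0.2, 0.1], "EC-1"), ([0.1, 0.3], "EC-1"),
    ([1.0, 1.0], "EC-3"), ([0.9, 1.2], "EC-3"), ([1.1, 0.8], "EC-3"),
]
query = [0.15, 0.2]
knn_label = knn_predict(train, query, k=3)
centroid_label = min_distance_predict(train, query)
```

k-NN keeps every training sample and compares the query against all of them, whereas the minimum-distance rule stores only one centroid per class, which is cheaper but assumes each class is reasonably compact.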
Prediction of enzyme classification from protein sequence without the use of sequence similarity
Proceedings / ... International Conference on Intelligent Systems for Molecular Biology (ISMB), 1997
We describe a novel approach for predicting the function of a protein from its amino-acid sequence. Given features that can be computed from the amino-acid sequence in a straightforward fashion (such as pI, molecular weight, and amino-acid composition), the technique allows us to answer questions such as: Is the protein an enzyme? If so, in which Enzyme Commission (EC) class does it belong? Our approach uses machine learning (ML) techniques to induce classifiers that predict the EC class of an enzyme from features extracted from its primary sequence. We report on a variety of experiments in which we explored the use of three different ML techniques in conjunction with training datasets derived from PDB and from Swiss-Prot. We also explored the use of several different feature sets. Our method is able to predict the first EC number of an enzyme with 74% accuracy (thereby assigning the enzyme to one of six broad categories of enzyme function), and to predict the second EC number of an...
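As a rough illustration of a feature "computed from the amino-acid sequence in a straightforward fashion", the sketch below computes the 20-dimensional amino-acid composition. It is not the authors' code: the function name `amino_acid_composition` is hypothetical, and pI and molecular weight are omitted because they need per-residue lookup tables that the abstract does not give.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes

def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in the sequence.

    Non-standard symbols (e.g. X, B, Z) are ignored, which is an assumption
    made here for simplicity.
    """
    counted = [c for c in sequence.upper() if c in AMINO_ACIDS]
    if not counted:
        raise ValueError("sequence contains no standard amino acids")
    total = len(counted)
    return {aa: counted.count(aa) / total for aa in AMINO_ACIDS}

composition = amino_acid_composition("AACD")
```

The resulting dictionary (or its values in a fixed order) can be passed to any standard classifier as a fixed-length feature vector, regardless of sequence length.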
Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes
Bioinformatics, 2005
With protein sequences entering databanks at an explosive pace, it is important to determine in a timely manner the family or subfamily class of a newly found enzyme molecule, because this is directly related to detailed information about the specific target it acts on, as well as to its catalytic process and biological function. Unfortunately, it is both time-consuming and costly to do so by experiments alone. In a previous study, the covariant-discriminant algorithm was introduced to identify the 16 subfamily classes of oxidoreductases. Although the results were quite encouraging, the entire prediction process was based on the amino acid composition alone, without including any sequence-order information. Therefore, it is worthy of further investigation.
Bioinformation, 2007
Predicting enzymes and non-enzymes from protein sequence information is still an open problem in bioinformatics. It is becoming more important as the amount of sequence information grows exponentially over time. We describe a novel approach for predicting whether a protein is an enzyme or a non-enzyme from its amino-acid sequence using an artificial neural network (ANN). Using 61 sequence-derived features alone, we have been able to achieve 79% correct prediction of enzymes/non-enzymes (in a set of 660 proteins). For the complete set of 61 parameters, using 5-fold cross-validated classification, the ANN model proves to be a superior model (accuracy = 78.79 ± 6.86%, Q(pred) = 74.734 ± 17.08%, sensitivity = 84.48 ± 6.73%, specificity = 77.13 ± 13.39%). The second ANN module is based on the PSSM matrix. Using the same 5-fold cross-validation set, this ANN model predicts enzymes/non-enzymes with higher accuracy (accuracy = 80.37 ± 6.59%, Q(pred) = 67.466 ± 12.41%, sensitivity = 0.9070 ± 3.37%, specificity = 74.66 ± 7.17%).
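For reference, accuracy, sensitivity, and specificity for a two-class enzyme/non-enzyme problem follow directly from the confusion-matrix counts. The sketch below uses the standard textbook definitions; the labels and toy predictions are invented, and Q(pred) is left out because the abstract does not define it.

```python
def binary_metrics(y_true, y_pred, positive="enzyme"):
    """Accuracy, sensitivity and specificity from true and predicted labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1

    def ratio(numerator, denominator):
        # Undefined when a class is absent; report NaN rather than crash.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
    }

# Toy labels (assumption): 3 enzymes and 2 non-enzymes.
y_true = ["enzyme", "enzyme", "enzyme", "non-enzyme", "non-enzyme"]
y_pred = ["enzyme", "enzyme", "non-enzyme", "non-enzyme", "enzyme"]
metrics = binary_metrics(y_true, y_pred)
```

In a 5-fold cross-validation such as the one described above, these metrics would be computed once per fold and then summarized as a mean and a spread.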
Journal of Artificial Intelligence and Systems, 2020
The last decade has witnessed an unprecedented accumulation of proteins in large online databases which has led to the need for automatic prediction of protein function essential for massive and timely annotations of the proteins in these datasets. Protein databases, combined with functional annotations and machine learning (ML) techniques, offer many potential benefits, including significantly facilitating rapid pharmacological target identification. The main objective of this study is to identify, for the problem of enzyme classification, the most powerful combinations of descriptors taken from different protein representations. To achieve this objective, four approaches for representing the Position-Specific Scoring Matrix (PSSM) combined with three methods for representing the Amino Acid Sequence (AAS) are evaluated with the aim of experimentally producing a powerful ensemble of descriptors for enzyme function prediction. Each protein descriptor is classified by a Support Vector Machine (SVM), with the set of SVMs finally combined by sum rule. Cross-validation experiments using these descriptors on single-functional enzymes (n=44,661) extracted from the PDB database demonstrate that the ensemble proposed here achieves superior classification rates compared to state-of-the-art ML techniques reported in the literature on the same dataset. Although the proposed ensemble strongly outperforms these other techniques, it is computationally much heavier, mainly because the PSSM extraction process is time consuming. However, there is a growing repository of proteins where PSSM has already been extracted, making the proposed method more practical and attractive. The MATLAB code and the dataset used in the experiments reported here are available at https://github.com/LorisNanni.
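The sum-rule fusion mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' MATLAB code: it assumes each classifier emits one score per class on a comparable scale, and the function name `sum_rule` and the scores are hypothetical.

```python
def sum_rule(score_sets):
    """Fuse per-class scores from several classifiers by summing them.

    score_sets: list of dicts mapping class label -> score, one dict per
    classifier. Returns the winning label and the summed scores.
    """
    totals = {}
    for scores in score_sets:
        for label, score in scores.items():
            totals[label] = totals.get(label, 0.0) + score
    best = max(totals, key=totals.get)
    return best, totals

# Toy scores (assumption): one classifier per descriptor, two candidate classes.
scores_per_descriptor = [
    {"EC-1": 0.6, "EC-2": 0.4},   # this classifier alone would pick EC-1
    {"EC-1": 0.3, "EC-2": 0.7},
    {"EC-1": 0.2, "EC-2": 0.8},
]
fused_label, fused_totals = sum_rule(scores_per_descriptor)
```

Because the scores are added rather than voted on, a classifier that is confident can outweigh one that is only marginally in favor of another class.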
A top-down approach to classify enzyme functional classes and sub-classes using random forest
EURASIP Journal on Bioinformatics and Systems Biology, 2012
Advancements in sequencing technologies have led to an exponential rise in the number of newly found enzymes. Enzymes are proteins that catalyze biochemical reactions and play an important role in metabolic pathways. Commonly, the function of such enzymes is determined by experiments that can be time-consuming and costly. Hence, there is a need for a computational method that can distinguish enzyme sequences from non-enzyme sequences and reliably predict the function of the former. To address this problem, approaches that cluster enzymes based on their sequence and structural similarity have been presented. However, these approaches are known to fail for proteins that perform the same function but are dissimilar in sequence and structure. In this article, we present a supervised machine learning model to predict the function class and sub-class of enzymes based on a set of 73 sequence-derived features. The functional classes are as defined by the International Union of Biochemistry and Molecular Biology. Using an efficient data mining algorithm called random forest, we construct a top-down three-layer model in which the top layer classifies a query protein sequence as an enzyme or non-enzyme, the second layer predicts the main function class, and the bottom layer further predicts the sub-function class. The model reported overall classification accuracy of 94.87% for the first level, 87.7% for the second, and 84.25% for the bottom level. Our results compare very well with existing methods, and in many cases report better performance. Using feature selection methods, we have shown the biological relevance of a few of the top-ranked attributes.
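The control flow of such a top-down, three-layer model can be sketched as below. The random forests are replaced here by placeholder callables, and every name, feature, threshold, and label in the example is hypothetical; only the layering (enzyme or not, then main class, then sub-class) follows the abstract.

```python
def predict_top_down(features, is_enzyme, main_class, sub_class_models):
    """Route a query through three layers of classifiers.

    is_enzyme:        callable(features) -> bool          (top layer)
    main_class:       callable(features) -> label         (second layer)
    sub_class_models: dict label -> callable(features)    (bottom layer,
                      one model per main class)
    """
    if not is_enzyme(features):
        return ("non-enzyme", None, None)
    main = main_class(features)
    sub_model = sub_class_models.get(main)
    sub = sub_model(features) if sub_model is not None else None
    return ("enzyme", main, sub)

# Placeholder "models" (assumption): simple rules standing in for trained forests.
def toy_is_enzyme(f):
    return f["enzyme_score"] > 0.5

def toy_main_class(f):
    return "EC-3" if f["hydrolase_score"] > 0.5 else "EC-1"

toy_sub_models = {"EC-3": lambda f: "EC-3.4"}

result_a = predict_top_down(
    {"enzyme_score": 0.9, "hydrolase_score": 0.8},
    toy_is_enzyme, toy_main_class, toy_sub_models,
)
result_b = predict_top_down(
    {"enzyme_score": 0.1, "hydrolase_score": 0.8},
    toy_is_enzyme, toy_main_class, toy_sub_models,
)
```

One consequence of this design is that errors propagate downward: a query rejected at the top layer never reaches the class and sub-class predictors.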
Computational approaches for automated classification of enzyme sequences
Journal of Proteomics and Bioinformatics, 2011
Determining the functional role(s) of enzymes is very important to build the metabolic blueprint of an organism and to identify the potential roles enzymes may play in metabolic and disease pathways. With exponential growth in gene and protein sequence data, it is not feasible to experimentally characterize the function(s) of all enzymes. Alternatively, computational methods can be used to annotate the enormous amount of unannotated enzyme sequences.
ArXiv, 2021
Enzymes and proteins are live driven biochemicals that have a dramatic impact on the environment in which they are active. It is therefore highly desirable to build a robust, automatic computational model that accurately predicts the nature of enzymes. In this study, a novel split amino acid composition model named piSAAC is proposed. In this model, the protein sequence is discretized in equal and balanced terminus to fully evaluate the intrinsic correlation properties of the sequence. Several state-of-the-art algorithms have been employed to evaluate the proposed model. A 10-fold cross-validation is used to assess the authenticity and robustness of the model using different statistical measures, e.g. accuracy, sensitivity, specificity, F-measure, and area under the ROC curve. The experimental results show that the probabilistic neural network algorithm with piSAAC feature extraction yields an accuracy of 98.01%, sensitivity of 97.12%, specificity of ...
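The general idea of a split amino acid composition can be sketched as follows. This is not piSAAC itself, whose exact segmentation is not given in the abstract. It is a generic version that assumes the sequence is cut into a fixed number of near-equal segments, with the composition computed per segment. The function names and the default of three parts are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def composition(segment):
    """20 fractions, one per standard amino acid (all zeros if empty)."""
    if not segment:
        return [0.0] * len(AMINO_ACIDS)
    return [segment.count(aa) / len(segment) for aa in AMINO_ACIDS]

def split_composition(sequence, parts=3):
    """Concatenate the composition of `parts` near-equal consecutive segments."""
    seq = sequence.upper()
    size, extra = divmod(len(seq), parts)
    features = []
    start = 0
    for i in range(parts):
        # The first `extra` segments get one additional residue each.
        end = start + size + (1 if i < extra else 0)
        features.extend(composition(seq[start:end]))
        start = end
    return features

features = split_composition("AAACCCDDD", parts=3)
```

Unlike a whole-sequence composition, this vector distinguishes a protein whose alanines cluster at the N-terminus from one with the same residues at the C-terminus.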
Data mining of enzymes using specific peptides
BMC Bioinformatics, 2009
Background: Predicting the function of a protein from its sequence is a long-standing challenge of bioinformatic research, typically addressed using either sequence-similarity or sequence-motifs. We employ the novel motif method that consists of Specific Peptides (SPs) that are unique to specific branches of the Enzyme Commission (EC) functional classification. We devise the Data Mining of Enzymes (DME) methodology that allows for searching SPs on arbitrary proteins, determining from its sequence whether a protein is an enzyme and what the enzyme's EC classification is.
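Searching a protein for peptides that are each tied to one EC branch can be sketched as a substring scan. The peptide table below is invented for illustration and does not reproduce the published SP list; the majority-of-hits decision rule and the function names are also assumptions, not the DME procedure.

```python
from collections import Counter

def scan_specific_peptides(sequence, sp_table):
    """Return (peptide, EC label) pairs for every table peptide found."""
    seq = sequence.upper()
    return [(pep, ec) for pep, ec in sp_table.items() if pep in seq]

def classify_by_hits(sequence, sp_table):
    """Label the sequence with the EC branch that has the most peptide hits.

    Returns None when no peptide matches, i.e. no evidence of an enzyme.
    """
    hits = scan_specific_peptides(sequence, sp_table)
    if not hits:
        return None
    return Counter(ec for _, ec in hits).most_common(1)[0][0]

# Hypothetical peptide table (assumption): peptide -> EC branch.
toy_sp_table = {
    "ACDEFGH": "EC 1.1",
    "KLMNPQR": "EC 3.4",
    "STVWYAC": "EC 3.4",
}
query = "MMKLMNPQRGGSTVWYACLL"
hits = scan_specific_peptides(query, toy_sp_table)
label = classify_by_hits(query, toy_sp_table)
```

A sequence with no hits returns `None`, which mirrors the two questions in the abstract: whether the protein is an enzyme at all, and if so, which EC classification it receives.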