Hierarchical fuzzy classifier for bioinformatics data

Rule Generation for Protein Secondary Structure Prediction With Support Vector Machines and Decision Tree

IEEE Transactions on Nanobioscience, 2006

Support vector machines (SVMs) have shown strong generalization ability in a number of application areas, including protein structure prediction. However, their poor comprehensibility hinders their adoption for protein structure prediction. An explanation of how a decision is made is important for the acceptance of machine learning technology, especially in applications such as bioinformatics. A reasonable interpretation is useful for guiding the "wet experiments," and the extracted rules help to integrate computational intelligence with symbolic AI systems for advanced deduction. A decision tree, on the other hand, has good comprehensibility. In this paper, a novel approach to rule generation for protein secondary structure prediction that integrates the merits of both the SVM and the decision tree is presented. The approach combines the SVM with a decision tree into a new algorithm called SVM_DT, which proceeds in three steps. First, the algorithm trains an SVM. Then, a new training set is generated through careful selection from the output of the SVM. Finally, the obtained training set is used to train a decision tree learning system and to extract the corresponding rule sets. Experiments on protein secondary structure prediction with the RS126 data set show that the comprehensibility of SVM_DT is much better than that of the SVM. Moreover, the generalization ability of SVM_DT is better than that of C4.5 decision trees and is similar to that of the SVM. Hence, SVM_DT can be used not only for prediction, but also for guiding biological experiments.
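The second and third steps described above can be illustrated with a minimal sketch. Several things here are assumptions rather than details from the abstract: the trained SVM is replaced by a hypothetical fixed scoring function (`svm_decision`), "careful selection" is taken to mean keeping samples whose decision value lies at least `margin` away from the boundary, and C4.5 is replaced by a one-level decision tree (a stump) so that the example stays short.

```python
def svm_decision(x):
    """Hypothetical stand-in for the signed decision value of a trained SVM."""
    return 2.0 * x[0] - 1.0 * x[1] - 0.5


def select_confident(samples, decision_fn, margin=0.5):
    """Step 2 (assumed criterion): keep samples far from the decision
    boundary and label them with the SVM's own prediction."""
    selected = []
    for x in samples:
        score = decision_fn(x)
        if abs(score) >= margin:
            selected.append((x, 1 if score > 0 else -1))
    return selected


def fit_stump(labelled):
    """Step 3 (simplified): fit a one-level decision tree by exhaustive
    search over features and midpoint thresholds.

    Returns (errors, feature_index, threshold, label_for_left_branch).
    """
    best = None
    n_features = len(labelled[0][0])
    for f in range(n_features):
        values = sorted({x[f] for x, _ in labelled})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2.0
            for left_label in (-1, 1):
                errors = sum(
                    1
                    for x, y in labelled
                    if (left_label if x[f] <= threshold else -left_label) != y
                )
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, left_label)
    if best is None:
        raise ValueError("need at least two distinct values in some feature")
    return best


def stump_to_rule(stump):
    """Render the fitted stump as a human-readable IF/THEN rule."""
    _, f, threshold, left_label = stump
    return (
        f"IF x[{f}] <= {threshold:g} THEN class {left_label:+d} "
        f"ELSE class {-left_label:+d}"
    )
```

Calling `stump_to_rule(fit_stump(select_confident(samples, svm_decision)))` on a list of feature tuples yields a single readable rule. A full decision tree learner would produce a rule set instead, but the flow from SVM output to extracted rules is the same.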

A STUDY OF INTELLIGENT TECHNIQUES FOR PROTEIN SECONDARY STRUCTURE PREDICTION

Protein secondary structure prediction has been and will continue to be a rich research field, because the structure and shape of a protein directly affect its behavior. Moreover, the number of known secondary and tertiary structures is small relative to the number of known primary structures. Although secondary structure prediction started in the seventies, it has remained, together with tertiary structure prediction, an active research topic. This paper presents a technical study of recent methods for predicting secondary structure from the amino acid sequence. The methods are studied along with their accuracy levels. The best-known methods, such as Neural Networks and Support Vector Machines, are covered together with other techniques. The approaches surveyed report accuracies ranging from 50% to over 90%. The most commonly used technique is Neural Networks. However, Case Based Reasoning and Mixed Integer Linear Optimization showed the best accuracy among the machine learning techniques, at approximately 83%.

Generating fuzzy rules for protein classification

2008

This paper considers the generation of interpretable fuzzy rules for assigning an amino acid sequence to the appropriate protein superfamily. Since the main objective of this classifier is the interpretability of the rules, we have used the distribution of amino acids in the protein sequences as features. These features are the occurrence probabilities of six exchange groups in the sequences. To generate the fuzzy rules, we have used modified versions of a common approach. The generated rules are simple and understandable, especially for biologists. To evaluate our fuzzy classifiers, we have used four protein superfamilies from the UniProt database. Experimental results show the comprehensibility of the generated fuzzy rules with comparable classification accuracy.
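The exchange-group features can be illustrated with a short sketch. The abstract does not list the six groups, so the grouping below is an assumption: it is the commonly used set of six amino acid exchange groups, and the names `EXCHANGE_GROUPS` and `exchange_group_features` are hypothetical.

```python
from collections import Counter

# Assumed grouping (not given in the abstract): six commonly used
# amino acid exchange groups.
EXCHANGE_GROUPS = {
    "e1": "HRK",
    "e2": "DENQ",
    "e3": "C",
    "e4": "STPAG",
    "e5": "MILV",
    "e6": "FYW",
}

# Reverse lookup: residue letter -> exchange group name.
RESIDUE_TO_GROUP = {
    residue: group
    for group, members in EXCHANGE_GROUPS.items()
    for residue in members
}


def exchange_group_features(sequence):
    """Return the occurrence probability of each exchange group.

    Residues outside the 20 standard amino acids are ignored.
    """
    groups = [
        RESIDUE_TO_GROUP[residue]
        for residue in sequence.upper()
        if residue in RESIDUE_TO_GROUP
    ]
    if not groups:
        raise ValueError("sequence contains no standard amino acids")
    counts = Counter(groups)
    total = len(groups)
    return {group: counts.get(group, 0) / total for group in EXCHANGE_GROUPS}
```

Each sequence is thus reduced to a six-element vector that sums to one, which keeps the antecedents of any fuzzy rule built on it small enough to read.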

Prediction of Protein Function Using Learning Classifier Systems

Bioinformatics studies several problems, and one that stands out is the prediction of protein functions. This paper presents a novel solution for hierarchical classification problems based on Learning Classifier Systems. The proposed algorithm, HLCS-Flat, was designed for predicting protein functions in ontologies structured as a directed acyclic graph, and it provides positive results when compared to the well-known rule-based classification method RIPPER. This paper presents the concepts of hierarchical classification and classifier systems, as well as the HLCS-Flat model and its computational results.

Logical Analysis of Data Approach to the Prediction of Protein Secondary Structures

This problem is important because the structure of a protein is directly related to its function. Experimental structure determination, or structure prediction, aids the elucidation of protein function; conversely, synthetic protein sequences might be designed so that the protein performs a desired function. The study of protein structure is therefore not only of fundamental scientific interest in terms of understanding biochemical processes, but also produces very valuable practical benefits. Results: The obtained accuracy of over 70% for three classes of secondary structure is similar to or better than that of other methods for protein structure prediction. A comparison has been made with the PHD algorithm and with an algorithm based on Rough Set theory. During the experiment, the set of the most promising amino acid properties for describing secondary structure was extracted. LAD generated simple and strong rules that can be easily interpreted by biologists. Availability:

Protein secondary structure prediction using logic-based machine learning

Protein Engineering, 1992

Many attempts have been made to solve the problem of predicting protein secondary structure from the primary sequence but the best performance results are still disappointing. In this paper, the use of a machine learning algorithm which allows relational descriptions is shown to lead to improved performance. The Inductive Logic Programming computer program, Golem, was applied to learning secondary structure prediction rules for α/α domain type proteins. The input to the program consisted of 12 non-homologous proteins (1612 residues) of known structure, together with background knowledge describing the chemical and physical properties of the residues. Golem learned a small set of rules that predict which residues are part of the α-helices based on their positional relationships and chemical and physical properties. The rules were tested on four independent non-homologous proteins (416 residues) giving an accuracy of 81% (±2%). This is an improvement, on identical data, over the previously reported result of 73% by King and Sternberg (1990, J. Mol. Biol., 216, 441-457) using the machine learning program PROMIS, and of 72% using the standard Garnier-Osguthorpe-Robson method. The best previously reported result in the literature for the α/α domain type is 76%, achieved using a neural net approach. Machine learning also has the advantage over neural network and statistical methods in producing more understandable results.

The usage of machine learning paradigms on protein secondary structure prediction

The significance of secondary structure prediction is hard to deny, because proteins are central to every function of the human body. Proteins build every element of the body from their amino acids. These amino acids bond together to form higher-order protein structures. Many diseases can be diagnosed by checking for deformations of these structures. The problem is that it takes a lot of effort to get from the primary protein structure (the amino acid sequence) to the secondary, tertiary and quaternary structures it forms. Over the past decade, many machine learning methods have emerged that predict the secondary structure and then predict the tertiary structure from it. Most of these methods were based on the Neural Networks paradigm only. This paper aims to show how other machine learning techniques have been used to predict the secondary structure. The techniques used are Case Based Reasoning, Bayes Network, Decision Tables and Decision Trees. The highest accuracy, 75.89%, was reached when using a Bayes network to predict the Beta secondary structure only.

Artificial Intelligence in Prediction of Secondary Protein Structure Using CB513 Database

Summit on translational bioinformatics, 2009

In this paper we describe CB513, a non-redundant dataset suitable for developing algorithms for the prediction of secondary protein structure. A program was written in Borland Delphi to transform the data from our dataset into a form suitable for training a neural network for secondary protein structure prediction, implemented in the MATLAB Neural-Network Toolbox. Learning (training and testing) of the neural network on the CB513 dataset is studied with different window sizes, different numbers of neurons in the hidden layer and different numbers of training epochs.
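The role of the window size can be illustrated with a sketch of a sliding-window input encoding. The abstract does not specify the encoding, so the one-hot scheme, the padding symbol, and the names `one_hot` and `window_inputs` are assumptions. The sketch is in Python rather than the Delphi and MATLAB tools the paper used.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD = "-"  # assumed symbol for window positions beyond the chain ends


def one_hot(residue):
    """Encode one residue as a 21-element vector.

    The last slot is shared by padding and any non-standard residue.
    """
    vector = [0] * (len(AMINO_ACIDS) + 1)
    index = AMINO_ACIDS.find(residue)
    vector[index if index >= 0 else len(AMINO_ACIDS)] = 1
    return vector


def window_inputs(sequence, window_size):
    """Build one network input per residue from a centred window.

    Returns a list with len(sequence) rows, each of length
    window_size * 21.
    """
    if window_size < 1 or window_size % 2 == 0:
        raise ValueError("window_size must be a positive odd number")
    half = window_size // 2
    padded = PAD * half + sequence.upper() + PAD * half
    inputs = []
    for i in range(len(sequence)):
        window = padded[i:i + window_size]
        inputs.append([bit for residue in window for bit in one_hot(residue)])
    return inputs
```

With this encoding, a wider window gives the network more sequence context for the central residue at the cost of a larger input layer, which is the trade-off explored when the window size is varied.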

Profiles and fuzzy k-nearest neighbor algorithm for protein secondary structure prediction

2005

We introduce a new approach for predicting the secondary structure of proteins using profiles and the Fuzzy K-Nearest Neighbor algorithm. K-Nearest Neighbor methods give relatively better performance than Neural Networks or Hidden Markov models when the query protein has few homologs in the sequence database from which to build a sequence profile. Although the traditional K-Nearest Neighbor algorithms are a good choice in this situation, one difficulty in using them is that all the labeled samples are given equal importance when deciding the secondary structure class of a protein residue, and once a class has been assigned to a residue, there is no indication of the confidence in that class. In this paper, we propose a system based on the Fuzzy K-Nearest Neighbor algorithm that addresses these issues and outperforms earlier K-Nearest Neighbor methods that use multiple sequence alignments. We also introduce a new distance measure for calculating the distance between two protein sequences, and a new method for assigning membership values to the nearest neighbors in each of the Helix, Strand and Coil classes. We also propose a novel heuristic-based filter to smooth the prediction. A particularly attractive feature of our filter is that it does not require retraining when new structures are added to the database. We have achieved a sustained three-state overall accuracy of 75.75% with our system. The software is available upon request. * Corresponding author. Dong Xu can be contacted at dong@cecs.missouri.edu.
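The difference between crisp and fuzzy nearest-neighbor voting can be seen in a minimal sketch. It does not reproduce the paper's own distance measure or membership assignment, which the abstract does not detail. Instead it assumes the standard fuzzy K-Nearest Neighbor formulation with Euclidean distance, crisp training labels, and a fuzzifier `m`. The function name `fuzzy_knn` is hypothetical.

```python
import math

CLASSES = ("H", "E", "C")  # helix, strand, coil


def fuzzy_knn(query, samples, k=3, m=2.0):
    """Return a class membership value in [0, 1] for each class.

    samples is a list of (feature_vector, class_label) pairs. Each of
    the k nearest neighbours votes with weight 1 / d**(2 / (m - 1)),
    so closer neighbours count more, and the normalised votes act as
    a confidence for every class.
    """
    if m <= 1.0:
        raise ValueError("the fuzzifier m must be greater than 1")
    nearest = sorted(
        (math.dist(query, features), label) for features, label in samples
    )[:k]
    exponent = 2.0 / (m - 1.0)
    eps = 1e-9  # avoids division by zero when the query matches a sample
    weights = [(1.0 / (d ** exponent + eps), label) for d, label in nearest]
    total = sum(weight for weight, _ in weights)
    return {
        cls: sum(weight for weight, label in weights if label == cls) / total
        for cls in CLASSES
    }
```

Unlike a majority vote, the returned memberships indicate how strongly the neighbourhood supports each class, so a residue with memberships near 0.5 and 0.5 can be treated as less certain than one near 1.0 and 0.0.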

Prediction of protein secondary structure by mining structural fragment database

Polymer, 2005

A new method for predicting protein secondary structure from amino acid sequence has been developed. The method is based on multiple sequence alignment of the query sequence with all other sequences of known structure from the Protein Data Bank (PDB) by using BLAST. The fragments of the alignments belonging to proteins from the PDB are then used for further analysis. We have studied various schemes of assigning weights for matching segments and calculated normalized scores to predict one of the three secondary structures: α-helix, β-sheet, or coil. We applied several artificial intelligence techniques, namely decision trees (DT), neural networks (NN) and support vector machines (SVM), to improve the accuracy of predictions and found that SVM gave the best performance. Preliminary data show that combining the fragment mining approach with GOR V (Kloczkowski et al, Proteins 49 (2002) 154-166) for regions of low sequence similarity improves the prediction accuracy.
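The idea of weighting matching segments and normalizing the scores can be sketched as a weighted vote per residue. The abstract does not describe the weighting schemes, so this is only an illustration under assumptions: each fragment is a hypothetical `(start, structure, weight)` triple whose weight is supplied by the caller (for example from alignment similarity), and positions without any matching fragment default to coil.

```python
STATES = ("H", "E", "C")  # alpha-helix, beta-sheet, coil


def predict_from_fragments(query_length, fragments):
    """Predict a three-state structure string by weighted voting.

    fragments is a list of (start, structure, weight) triples, where
    structure is the known secondary structure of a matching segment
    aligned to the query starting at position start (0-based).
    """
    scores = [{state: 0.0 for state in STATES} for _ in range(query_length)]
    for start, structure, weight in fragments:
        for offset, state in enumerate(structure):
            position = start + offset
            if 0 <= position < query_length and state in STATES:
                scores[position][state] += weight

    prediction = []
    for position_scores in scores:
        total = sum(position_scores.values())
        if total == 0.0:
            # Assumed default when no fragment covers this position.
            prediction.append("C")
            continue
        normalized = {s: v / total for s, v in position_scores.items()}
        prediction.append(max(STATES, key=lambda s: normalized[s]))
    return "".join(prediction)
```

The normalized scores per position could also be passed on as features to a decision tree, neural network, or SVM instead of taking the maximum directly.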