Protein Family Identification using Markov Chain as Feature Extraction and Probabilistic Neural Network (PNN) as Classifier

An efficient technique for protein classification using feature extraction by artificial neural networks

2010 Annual IEEE India Conference (INDICON), 2010

Classification, or supervised learning, is one of the major data mining processes. Protein classification focuses on predicting the function or the structure of new proteins. This can be done by assigning a new protein to a given family with previously known characteristics. There are many approaches available for classification tasks, such as statistical techniques, decision trees and neural networks. In this paper, three types of neural networks are implemented: the feedforward neural network, the probabilistic neural network and the radial basis function neural network. The main objective of the paper is to build an efficient classifier using neural networks. The measures used to estimate the performance of the classifier are Precision, Sensitivity and Specificity.
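As an illustration of the three performance measures named in this abstract, the following is a minimal sketch for the binary case. The function name `binary_metrics` and the label convention (1 for the positive class) are assumptions made for the example and are not taken from the paper.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, sensitivity and specificity for a binary task.

    The name and signature are hypothetical; they are not from the paper.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard against empty denominators by returning 0.0 in that case.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
    }
```

For a multi-family problem, one common choice is to compute these measures per family in a one-versus-rest fashion and then average them; whether the paper does so is not stated in the abstract.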

A probabilistic neural network approach for protein superfamily classification

2005

The protein superfamily classification problem, which consists of determining the superfamily membership of a given unknown protein sequence, is very important to biologists for many practical reasons, such as drug discovery, prediction of molecular function and medical diagnosis. In this work, we propose a new approach for protein classification based on a Probabilistic Neural Network and feature selection. Our goal is to predict the functional family of novel protein sequences based on features extracted from the protein's primary structure, i.e., the sequence only. For this purpose, datasets extracted from the Protein Data Bank (PDB), a curated protein family database, are used as training datasets. In the experiments conducted, the performance of the classifier is compared to other known data mining approaches and sequence comparison methods. The computational results show that the proposed method performs better than the others and looks promising for problems with similar characteristics.
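The general idea of a probabilistic neural network can be sketched as a Gaussian Parzen-window classifier: each training vector contributes a kernel to its class, and the class with the highest average kernel response wins. The code below is an illustrative sketch only; the function name `pnn_predict`, the smoothing parameter `sigma` and the toy data are assumptions, and the abstract does not describe the authors' actual features or settings.

```python
import math
from collections import defaultdict


def pnn_predict(train_X, train_y, x, sigma=0.5):
    """Classify x with a basic PNN (hypothetical helper, not the paper's code).

    Pattern layer: one Gaussian kernel per training vector.
    Summation layer: average kernel response per class.
    Decision layer: class with the largest average response.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for xi, yi in zip(train_X, train_y):
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
        sums[yi] += math.exp(-sq_dist / (2.0 * sigma ** 2))
        counts[yi] += 1
    scores = {label: sums[label] / counts[label] for label in sums}
    best = max(scores, key=scores.get)
    return best, scores


# Toy two-family example with made-up 2-D feature vectors.
TRAIN_X = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
TRAIN_Y = ["A", "A", "B", "B"]
```

A PNN of this kind has no iterative training phase: the training vectors themselves act as the pattern layer, so the only quantity to tune is the kernel width `sigma`.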

Protein superfamily classification using Kernel Principal Component Analysis and Probabilistic Neural Networks

2011 Annual IEEE India Conference, 2011

This paper implements a Probabilistic Neural Network (PNN) for the protein superfamily classification problem. The classification task organizes proteins into their superfamilies and helps in the correct prediction of the structure and function of newly discovered proteins. The two main steps for any pattern classification problem are feature selection and feature extraction. A bi-gram hashing function is used to extract and count the occurrences of bi-gram patterns in long strings of amino acid sequences. The bi-gram method maps sequences of different lengths into input vectors of the same length, but its major drawback is that the input feature vector tends to be very large. Selecting the optimal number of features remains a critical issue for any pattern classification problem. Principal Component Analysis (PCA), a very powerful statistical technique, is used to reduce the dimension of the large input vector without much loss of information, thereby identifying patterns in high-dimensional data. Traditional PCA performs a linear transformation, whereas Kernel PCA (KPCA) is used when the data are distributed nonlinearly. Numerical simulations have shown that for protein data distributed nonlinearly, KPCA outperforms PCA in terms of accuracy, sensitivity and specificity.
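The bi-gram counting step can be sketched as follows. This is an assumption-laden illustration: it uses the 20 standard amino acid letters and a direct index table in place of whatever hashing function the authors used, and the names `BIGRAM_INDEX` and `bigram_vector` are hypothetical.

```python
from itertools import product

# The 20 standard amino acids in one-letter code (assumed alphabet).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Map every ordered pair of residues to a fixed position: 20 * 20 = 400 slots.
BIGRAM_INDEX = {
    "".join(pair): i for i, pair in enumerate(product(AMINO_ACIDS, repeat=2))
}


def bigram_vector(sequence):
    """Count overlapping bi-grams; the output length is always 400."""
    vector = [0] * len(BIGRAM_INDEX)
    seq = sequence.upper()
    for i in range(len(seq) - 1):
        index = BIGRAM_INDEX.get(seq[i:i + 2])
        if index is not None:  # skip pairs containing non-standard letters
            vector[index] += 1
    return vector
```

Sequences of any length map to a 400-dimensional vector, which shows why the abstract calls the input large and motivates a dimensionality reduction step such as PCA or KPCA before classification.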

Protein classification artificial neural system

Protein Science, 1992

A neural network classification method is developed as an alternative approach to the large database search/organization problem. The system, termed Protein Classification Artificial Neural System (ProCANS), has been implemented on a Cray supercomputer for rapid superfamily classification of unknown proteins based on the information content of the neural interconnections. The system employs an n-gram hashing function that is similar to the k-tuple method for sequence encoding. A collection of modular back-propagation networks is used to store the large amount of sequence patterns. The system has been trained and tested with the first 2,148 of the 8,309 entries of the annotated Protein Identification Resource protein sequence database (release 29). The entries included the electron transfer proteins and the six enzyme groups (oxidoreductases, transferases, hydrolases, lyases, isomerases, and ligases), with a total of 620 superfamilies. After a total training time of seven Cray central processing unit (CPU) hours, the system has reached a predictive accuracy of 90%. The classification is fast (i.e., 0.1 Cray CPU second per sequence), as it only involves a forward-feeding through the networks. The classification time on a full-scale system embedded with all known superfamilies is estimated to be within 1 CPU second. Although the training time will grow linearly with the number of entries, the classification time is expected to remain low even if there is a 10-100-fold increase of sequence entries. The neural database, which consists of a set of weight matrices of the networks, together with the ProCANS software, can be ported to other computers and made available to the genome community. The rapid and accurate superfamily classification would be valuable to the organization of protein sequence databases and to the gene recognition in large sequencing projects.

Protein Family Recognition based on Fuzzy Logic

With the rapid rise of research in biometrics, bioinformatics and genomics, many research areas and issues still involve considerable uncertainty. One of the most active areas in this field is protein informatics, which relates protein data to modern information technology and includes protein mapping and classification. This paper contributes an intelligent system based on adaptive neuro-fuzzy computation that is able to recognize proteins and classify them into families. An intelligent trainer is built on a Perceptron neural network in order to construct a fuzzy inference system capable of predicting and classifying the data into categories according to the function of each protein. The system preprocesses the data set and extracts unique features from it. The system was built using a high-level programming language. The results show that the system achieves about 92% accuracy when more than 1000 input sequences from the validation sample are processed.

Protein Structure Prediction using Artificial Neural Network

2011

Protein secondary structure prediction is a problem in structural bioinformatics, which deals with the prediction and analysis of macromolecules, i.e. DNA, RNA and protein. It is an important step towards elucidating a protein's three-dimensional structure, as well as its function. The secondary structure of a protein can be predicted from its primary structure, i.e. from the amino acid sequence or the residues, though challenges exist. Four methods are used for this purpose: the Statistical Approach, the Nearest Neighbor method, the Neural Network Approach and the Hidden Markov Model Approach. The Artificial Neural Network (ANN) approach is the most successful of these methods for predicting protein secondary structure. In this method, ANNs are trained to recognize amino acid patterns in known secondary structure units, and these patterns are used to distinguish between the different types of secondary structures. This work concerns the prediction of protein secondary structure using an artificial neural network, though it is initially restricted to three structures only.

Prediction of protein structural classes by neural network

Biochimie, 2000

Protein structures can be classified as all-α, all-β, α/β, α+β and ζ according to protein chain folding topologies. Previous studies have shown evidence that some correlation between the protein structural class and amino acid composition does exist, and the protein structural class can be predicted to some extent according to amino acid composition alone. In this study we apply Kohonen's self-organization neural network to approach this problem. The results obtained show that the structural class of a protein is considerably correlated with its amino acid composition, and the neural network is a useful tool for predicting the structural classes of proteins.
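The amino acid composition used as input in this kind of study can be sketched as a 20-dimensional frequency vector. The helper name `amino_acid_composition` and the choice to ignore non-standard letters are assumptions for illustration; the abstract does not give the exact encoding.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (assumed alphabet).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard residue, in AMINO_ACIDS order."""
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # No standard residues found: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]
```

Vectors of this form discard residue order entirely, which is why they can only predict structural class "to some extent", as the abstract puts it.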

Mining protein database using machine learning techniques

Journal of integrative bioinformatics, 2008

With a large amount of information relating to proteins accumulating in databases widely available online, it is of interest to apply machine learning techniques that, by extracting underlying statistical regularities in the data, make predictions about the functional and evolutionary characteristics of unseen proteins. Such predictions can help in achieving a reduction in the space over which experiment designers need to search in order to improve our understanding of the biochemical properties. Previously it has been suggested that an integration of features computable by comparing a pair of proteins can be achieved by an artificial neural network, hence predicting the degree to which they may be evolutionarily related and homologous. We compiled two datasets of pairs of proteins, each pair being characterised by seven distinct features. We performed an exhaustive search through all possible combinations of features, for the problem of separating remote homologous from an...

Neural networks for protein classification.

Applied bioinformatics, 2004

This paper describes a biomolecular classification methodology based on multilayer perceptron neural networks. The system developed is used to classify enzymes found in the Protein Data Bank. The primary goal of classification, here, is to infer the function of an (unknown) enzyme by analysing its structural similarity to a given family of enzymes. A new codification scheme was devised to convert the primary structure of enzymes into a real-valued vector. The system was tested with a different number of neural networks, training set sizes and training epochs. For all experiments, the proposed system achieved a higher accuracy rate when compared with profile hidden Markov models. Results demonstrated the robustness of this approach and the possibility of implementing fast and efficient biomolecular classification using neural networks.

Protein Motif Extraction Using Hidden Markov Model

Genome Informatics, 1993

In this paper, we study the application of HMMs to the problem of representing protein sequences by a stochastic motif. A stochastic (protein) motif represents the portions of protein sequences that have a certain function or structure, where conditional probabilities are used to deal with the stochastic nature of the motif. We propose the iterative duplication method for HMM network learning. HMMs are much more expressive than symbolic patterns and are better suited to represent the variety of protein sequences. As an experiment, we constructed HMMs for the leucine zipper motif using 112 protein sequences as a training set, and obtained an accuracy of 79.3 percent in the prediction of protein sequences, compared with an accuracy of 14.8 percent when using a symbolic representation. Our approach can also be used for the validation of protein databases; the automatically constructed HMM has indicated that one protein sequence annotated as "leucine-zipper like sequence" in the database is quite different from other leucine-zipper sequences in terms of likelihood.
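Comparing sequences "in terms of likelihood" under an HMM is usually done with the forward algorithm, sketched below. The two-state model and its probabilities are entirely hypothetical toy values; they are not the leucine zipper model from the paper, and the iterative duplication learning method is not shown.

```python
def forward_likelihood(sequence, start, trans, emit):
    """Return P(sequence | model), summed over all hidden state paths."""
    if not sequence:
        return 1.0
    n_states = len(start)
    # alpha[j]: probability of the symbols seen so far, ending in state j.
    alpha = [start[j] * emit[j][sequence[0]] for j in range(n_states)]
    for symbol in sequence[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][symbol]
            for j in range(n_states)
        ]
    return sum(alpha)


# Hypothetical toy model: state 0 favours "L" (leucine), state 1 favours
# "X" (standing in for any other residue). All numbers are made up.
START = [0.5, 0.5]
TRANS = [[0.5, 0.5], [0.5, 0.5]]
EMIT = [{"L": 0.8, "X": 0.2}, {"L": 0.2, "X": 0.8}]
```

A sequence whose likelihood under a family's trained HMM is far lower than that of the other members is a candidate for a mislabelled database entry, which is the validation use the abstract describes. For long sequences, a practical implementation would work with log probabilities or scaling to avoid numerical underflow.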