A Family of Feed-Forward Models for Protein Sequence Classification (original) (raw)

Motif-Based Protein Sequence Classification Using Neural Networks

Journal of Computational Biology, 2005

We present a system for multi-class protein classification based on neural networks. The basic issue concerning the construction of neural network systems for protein classification is the sequence encoding scheme that must be used in order to feed the neural network. To deal with this problem we propose a method that maps a protein sequence into a numerical feature space using the matching scores of the sequence to groups of conserved patterns (called motifs) into protein families. We consider two alternative ways for identifying the motifs to be used for feature generation and provide a comparative evaluation of the two schemes. We also evaluate the impact of the incorporation of background features (2-grams) on the performance of the neural system. Experimental results on real datasets indicate that the proposed method is highly efficient and is superior to other well-known methods for protein classification.

Protein Sequence Classification Using Probabilistic Motifs and Neural Networks

Lecture Notes in Computer Science, 2003

The basic issue concerning the construction of neural network systems for protein classification is the sequence encoding scheme that must be used in order to feed the network. To deal with this problem we propose a method that maps a protein sequence into a numerical feature space using the matching local scores of the sequence to groups of conserved patterns (called motifs). We consider two alternative schemes for discovering a group of D motifs within a set of K-class sequences. We also evaluate the impact of the background features (2-grams) to the performance of the neural system. Experimental results on real datasets indicate that the proposed method is superior to other known protein classification approaches.

Neural networks for molecular sequence classification

Mathematics and Computers in Simulation, 1995

A neural network classification method has been developed as an alternative approach to the search/ organization problem of large molecular databases. Two artificial neural systems have been implemented on a Cray supercomputer for rapid protein/nucleic acid sequence classifications. The neural networks used are three-layered, feed-forward networks that employ back-propagation learning algorithm. The molecular sequences are encoded into neural input vectors by applying an n-gram hashing method or a SVD (singular value decomposition) method. Once trained with known sequences in the molecular databases, the nettral system becomes an associative memory capable of classifying unknown sequences based on the class information embedded in its neural interconnections. The protein system, which classifies proteins into PIR (Protein Identification Resource) superfamilies, showed a 82% to a close to 100% sensitivity at a speed that is about an order of magnitude faster than other search methods. The pilot nucleic acid system, which classifies ribosomal RNA sequences according to phylogenetic groups, has achieved a 100% classification accuracy. The system could be used to reduce the database search time and help organize the molecular sequence databases. The tool is generally applicable to any databases that are organized according to family relationships.

g-MARS: Protein Classification Using Gapped Markov Chains and Support Vector Machines

2008

Classifying protein sequences has important applications in areas such as disease diagnosis, treatment development and drug design. In this paper we present a highly accurate classifier called the g-MARS (gapped Markov Chain with Support Vector Machine) protein classifier. It models the structure of a protein sequence by measuring the transition probabilities between pairs of amino acids. This results in a Markov chain style model for each protein sequence. Then, to capture the similarity among non-exactly matching protein sequences, we show that this model can be generalized to incorporate gaps in the Markov chain. We perform a thorough experimental study and compare g-MARS to several other state-of-the-art protein classifiers. Overall, we demonstrate that g-MARS has superior accuracy and operates efficiently on a diverse range of protein families.

Sequence-based protein structure prediction using a reduced state-space hidden Markov model

Computers in Biology and Medicine, 2007

This work describes the use of a hidden Markov model (HMM), with a reduced number of states, which simultaneously learns amino acid sequence and secondary structure for proteins of known three-dimensional structure and it is used for two tasks: protein class prediction and fold recognition. The Protein Data Bank and the annotation of the SCOP database are used for training and evaluation of the proposed HMM for a number of protein classes and folds. Results demonstrate that the reduced state-space HMM performs equivalently, or even better in some cases, on classifying proteins than a HMM trained with the amino acid sequence. The major advantage of the proposed approach is that a small number of states is employed and the training algorithm is of low complexity and thus relatively fast. ᭧

Subsequence-based feature map for protein function classification

Computational Biology and Chemistry, 2008

Automated classification of proteins is indispensable for further in vivo investigation of excessive number of unknown sequences generated by large scale molecular biology techniques. This study describes a discriminative system based on feature space mapping, called subsequence profile map (SPMap) for functional classification of protein sequences. SPMap takes into account the information coming from the subsequences of a protein. A group of protein sequences that belong to the same level of classification is decomposed into fixed-length subsequences and they are clustered to obtain a representative feature space mapping. Mapping is defined as the distribution of the subsequences of a protein sequence over these clusters. The resulting feature space representation is used to train discriminative classifiers for functional families. The aim of this approach is to incorporate information coming from important subregions that are conserved over a family of proteins while avoiding the difficult task of explicit motif identification. The performance of the method was assessed through tests on various protein classification tasks. Our results showed that SPMap is capable of high accuracy classification in most of these tasks. Furthermore SPMap is fast and scalable enough to handle large datasets.

Data mining for building neural protein sequence classification systems with improved performance

Proceedings of the International Joint Conference on Neural Networks, 2003., 2003

Abstract Traditionally, two protein sequences are classified into the same class if their feature patterns have high homology. These feature patterns were originally extracted by sequence alignment algorithms, which measure similarity between an unseen protein sequence and identified protein sequences. Neural network approaches, while reasonably accurate at classification, give no information about the relationship between the unseen case and the classified items that is useful to biologist. In contrast, in this paper we use a ...

New techniques for extracting features from protein sequences

IBM Systems Journal, 2000

In this paper we propose new techniques to extract features from protein sequences. We then use the features as inputs for a Bayesian neural network (BNN) and apply the BNN to classifying protein sequences obtained from the PIR protein database maintained at the National Biomedical Research Foundation. To evaluate the performance of the proposed approach, we compare it with other protein classiers built based on sequence alignment and machine learning methods. Experimental results show the high precision of the proposed classi er and the complementarity of the bioinformatics tools studied in the paper.

Protein Sequence Classification Through Relevant Sequence Mining and Bayes Classifiers

Lecture Notes in Computer Science, 2005

We tackle the problem of sequence classification using relevant subsequences found in a dataset of protein labelled sequences. A subsequence is relevant if it is frequent and has a minimal length. For each query sequence a vector of features is obtained. The features consist in the number and average length of the relevant subsequences shared with each of the protein families. Classification is performed by combining these features in a Bayes Classifier. The combination of these characteristics results in a multi-class and multi-domain method that is exempt of data transformation and background knowledge. We illustrate the performance of our method using three collections of protein datasets. The performed tests showed that the method has an equivalent performance to state of the art methods in protein classification.

Protein Secondary Structure Prediction Using Support Vector Machines and a New Feature Representation

International Journal of Computational Intelligence and Applications, 2006

Prediction of protein secondary structure is an important step towards elucidating its three dimensional structure and its function. This is a challenging problem in bioinformatics. Segmental semi Markov models (SSMMs) are one of the best studied methods in this field. However, incorporating evolutionary information to these methods is somewhat difficult. On the other hand, the systems of multiple neural networks (NNs) are powerful tools for multi-class pattern classification which can easily be applied to take these sorts of information into account.