Protein classification artificial neural system

Neural networks for molecular sequence classification

Mathematics and Computers in Simulation, 1995

A neural network classification method has been developed as an alternative approach to the search/organization problem of large molecular databases. Two artificial neural systems have been implemented on a Cray supercomputer for rapid protein/nucleic acid sequence classification. The neural networks used are three-layered, feed-forward networks that employ the back-propagation learning algorithm. The molecular sequences are encoded into neural input vectors by applying an n-gram hashing method or an SVD (singular value decomposition) method. Once trained with known sequences in the molecular databases, the neural system becomes an associative memory capable of classifying unknown sequences based on the class information embedded in its neural interconnections. The protein system, which classifies proteins into PIR (Protein Identification Resource) superfamilies, showed a sensitivity ranging from 82% to close to 100% at a speed about an order of magnitude faster than other search methods. The pilot nucleic acid system, which classifies ribosomal RNA sequences according to phylogenetic groups, has achieved a 100% classification accuracy. The system could be used to reduce database search time and help organize the molecular sequence databases. The tool is generally applicable to any database that is organized according to family relationships.
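The n-gram encoding step mentioned in this abstract can be illustrated with a minimal Python sketch. It assumes a 20-letter amino acid alphabet, n = 2, and scaling of the counts to [0, 1]; the abstract does not specify these choices, and the function names `ngram_index` and `encode` are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet: the 20 standard residues

def ngram_index(n=2, alphabet=AMINO_ACIDS):
    """Assign every possible n-gram over the alphabet a fixed vector position."""
    return {"".join(p): i for i, p in enumerate(product(alphabet, repeat=n))}

def encode(sequence, index, n=2):
    """Count n-gram occurrences, then scale counts to [0, 1] (assumed scaling)."""
    counts = [0] * len(index)
    for i in range(len(sequence) - n + 1):
        pos = index.get(sequence[i:i + n])
        if pos is not None:  # ignore n-grams with non-standard letters
            counts[pos] += 1
    peak = max(counts) or 1  # avoid dividing by zero for very short sequences
    return [c / peak for c in counts]

index = ngram_index(2)
vector = encode("MKVLAAGIVKV", index)
print(len(vector), vector[index["KV"]])
```

The resulting vector has a fixed length regardless of sequence length, which is what lets sequences of different sizes feed the same network input layer.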

Neural networks for full-scale protein sequence classification: Sequence encoding with singular value decomposition

Machine Learning, 1995

A neural network classification method has been developed as an alternative approach to the search/organization problem of protein sequence databases. The neural networks used are three-layered, feed-forward, back-propagation networks. The protein sequences are encoded into neural input vectors by a hashing method that counts occurrences of n-gram words. A new SVD (singular value decomposition) method, which compresses the long and sparse n-gram input vectors and captures semantics of n-gram words, has improved the generalization capability of the network. A full-scale protein classification system has been implemented on a Cray supercomputer to classify unknown sequences into 3311 PIR (Protein Identification Resource) superfamilies/families at a speed of less than 0.05 CPU second per sequence. The sensitivity is close to 90% overall, and approaches 100% for large superfamilies. The system could be used to reduce database search time and is being used to help organize the PIR protein sequence database.
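As a rough illustration of how SVD can compress long, sparse n-gram vectors, the following sketch projects count vectors onto the top-k right singular vectors of a training matrix. The matrix orientation (sequences as rows), the value of k, and the random toy data are assumptions made here for illustration; the abstract does not give these details.

```python
import numpy as np

def fit_svd_basis(X, k):
    """Return the top-k right singular vectors of X (rows: sequences, cols: n-grams)."""
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:k].T  # shape (n_ngrams, k)

def reduce(X, basis):
    """Project n-gram vectors onto the reduced k-dimensional space."""
    return X @ basis

# Toy data standing in for sparse n-gram counts (not real sequences).
rng = np.random.default_rng(0)
X_train = rng.poisson(0.1, size=(50, 400)).astype(float)

basis = fit_svd_basis(X_train, k=20)
Z = reduce(X_train, basis)
print(Z.shape)
```

A new sequence would be encoded with the same basis, so the network sees 20 inputs instead of 400 in this toy setting.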

ProGreSS: Simultaneous Searching of Protein Databases by Sequence and Structure

PSB Session: Joint Learning from Multiple Types of Genomic Data

We consider the problem of similarity searches on protein databases based on both sequence and structure information simultaneously. Our program extracts feature vectors from both the sequence and structure components of the proteins. These feature vectors are then combined and indexed using a novel multi-dimensional index structure. For a given query, we employ this index structure to find candidate matches from the database. We develop a new method for computing the statistical significance of these candidates. The candidates with high significance are then aligned to the query protein using the Smith-Waterman technique to find the optimal alignment. The experimental results show that our method can classify up to 97% of the superfamilies and up to 100% of the classes correctly according to the SCOP classification. Our method is up to 37 times faster than CTSS, a recent structure search technique, combined with the Smith-Waterman technique for sequences.
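The Smith-Waterman step used to align high-significance candidates can be sketched as follows. This is a minimal score-only version with a linear gap penalty and simple match/mismatch scores; those parameters are illustrative assumptions, and a practical protein aligner would typically use a substitution matrix and affine gaps.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman("ACDEFG", "XXCDEFYY"))
```

Because the dynamic program is quadratic in sequence length, running it only on candidates that survive the index and significance filters is what keeps the overall search fast.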

ProtoMap: automatic classification of protein sequences and hierarchy of protein families

Nucleic Acids Research, 2000

The ProtoMap site offers an exhaustive classification of all proteins in the SWISS-PROT database into groups of related proteins. The classification is based on analysis of all pairwise similarities among protein sequences. The analysis makes essential use of transitivity to identify homologies among proteins. Within each group of the classification, every two members are either directly or transitively related. However, transitivity is applied restrictively in order to prevent unrelated proteins from clustering together. The classification is done at different levels of confidence, and yields a hierarchical organization of all proteins. The resulting classification splits the protein space into well-defined groups of proteins, which are closely correlated with natural biological families and superfamilies. Many clusters contain protein sequences that are not classified by other databases. The hierarchical organization suggested by our analysis may help in detecting finer subfamilies in families of known proteins. In addition, it brings forth interesting relationships between protein families, upon which local maps for the neighborhood of protein families can be sketched. The ProtoMap web server can be accessed at http://www.protomap.cs.huji.ac.il
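The idea of grouping proteins that are directly or transitively related, at several confidence levels, can be illustrated with a plain union-find sketch. Note that this is ordinary single-linkage grouping over a similarity threshold, not ProtoMap's restricted transitivity procedure, whose details the abstract does not give; the scores and thresholds below are made up for illustration.

```python
def cluster(n_items, similarities, threshold):
    """Group items linked, directly or transitively, by scores >= threshold."""
    parent = list(range(n_items))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for (i, j), score in similarities.items():
        if score >= threshold:
            parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for x in range(n_items):
        groups.setdefault(find(x), []).append(x)
    return sorted(groups.values())

# Hypothetical pairwise similarity scores between five proteins.
sims = {(0, 1): 0.9, (1, 2): 0.8, (2, 3): 0.3, (3, 4): 0.95}
strict = cluster(5, sims, threshold=0.85)
loose = cluster(5, sims, threshold=0.5)
print(strict, loose)
```

Lowering the threshold merges groups found at the stricter level, which is how a hierarchy of clusters emerges from repeated runs at different confidence levels.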

Protein sequences as computer programs: an application to enzyme classification

Identifying the cellular functions and biochemical behavior of proteins is still an open problem in bioinformatics. It is becoming more important as the amount of sequence information grows exponentially over time. Alignment methods are a useful approach to functional annotation, but their use is sometimes limited, prompting the development and use of machine learning methods. Recent efforts have given promising results. However, current approaches have not used the information contained in the order of the amino acids in the peptidic sequence, relying instead on global parameters derived from peptidic composition and available structural information. Results: A novel methodology, peptidic programs, is presented and described. This technique consists of adjusting a set of minimal computer programs to the amino acids of a peptidic sequence, in order to retrieve knowledge directly from the primary sequence without any further information. The basic concepts of peptidic programs are described, namely a proposed instruction set, virtual machine, evaluation procedures and convergence methods. This methodology is tested on 33,500 enzymes divided into 182 distinct Enzyme Commission (EC) classes by creating an individual binary classifier for each class. More than 95% of all classifiers showed accuracies above 90% on a cross-validation set. The Matthews correlation coefficient was above 60% for 68% of all classification problems. Conclusions: Overall results suggest that the tested methodology may be able to give meaningful classification results, in several cases detecting distant homologues. Peptidic programs also use very few computational resources, taking on average about 31 s on common hardware to assess whether a protein belongs to a given class, making the approach competitive for extensive data searches.
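Accuracy and the Matthews correlation coefficient can diverge sharply for one-vs-rest binary classifiers, where negatives far outnumber positives. The sketch below computes both from a confusion matrix; the counts are invented for illustration and do not come from the paper.

```python
from math import sqrt

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Hypothetical imbalanced problem: 95 positives, 905 negatives.
counts = dict(tp=5, tn=900, fp=5, fn=90)
acc = accuracy(**counts)
mcc = matthews_corrcoef(**counts)
print(round(acc, 3), round(mcc, 3))
```

In this toy case the classifier misses most positives yet still reports an accuracy above 90%, while the MCC stays low, which is why the two metrics are reported separately.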

PFDB: a protein families database for Macintosh computers. The effectiveness of its organization in searching for protein similarity

Journal of protein chemistry, 1997

A protein sequence database (PFDB) containing about 11,000 entries is available for Macintosh computers. The PFDB can be easily updated by importing sequences from the PIR collection through the internet. The most important feature of the database is its organization in families of closely related sequences, each family being characterized by its average dipeptide composition [Petrilli (1993), Comput. Appl. Biosci. 2, 89-93]. This allows one to perform a rapid and sensitive protein similarity search by comparing the precalculated family dipeptide composition with that of the query sequence by a linear correlation coefficient. An example of an application in which a new protein was classified by using a sequence of a fragment just 19 residues long is reported.
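The family search described here, comparing a query's dipeptide composition with precalculated family averages by a linear correlation coefficient, might look roughly like the following sketch. The toy sequences, family names, and helper names are hypothetical, and the Pearson coefficient is assumed to be the linear correlation meant in the abstract.

```python
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def dipeptide_composition(seq):
    """Relative frequency of each of the 400 dipeptides in seq."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    total = sum(counts.values()) or 1
    return [counts[d] / total for d in DIPEPTIDES]

def family_profile(sequences):
    """Average dipeptide composition over a family's member sequences."""
    comps = [dipeptide_composition(s) for s in sequences]
    return [sum(col) / len(comps) for col in zip(*comps)]

def pearson(x, y):
    """Linear (Pearson) correlation coefficient between two vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return 0.0 if vx == 0 or vy == 0 else cov / sqrt(vx * vy)

def best_family(query, profiles):
    """Return the family whose profile correlates best with the query."""
    q = dipeptide_composition(query)
    return max(profiles, key=lambda name: pearson(q, profiles[name]))

# Toy families; real profiles would be precalculated from database entries.
profiles = {
    "fam_A": family_profile(["ACACACAC", "ACACAC"]),
    "fam_B": family_profile(["GGWGGWGG", "GWGGWG"]),
}
print(best_family("CACACA", profiles))
```

Since each family is reduced to a single 400-element profile, a query needs one correlation per family rather than one comparison per database sequence.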

A novel Fibonacci hash method for protein family identification by using recurrent neural networks

Turkish Journal of Electrical Engineering & Computer Sciences, 2021

Identification and classification of protein families is one of the most significant problems in bioinformatics and protein studies. It is essential to specify the family of a protein, since proteins are widely used in smart drug therapies, protein functions, and, in some cases, phylogenetic trees. Some sequencing techniques allow researchers to identify the biological similarities of protein families and functions. Yet determining these families with sequencing applications requires a huge amount of time. Thus, a computer- and artificial-intelligence-based classification system is needed to save time and avoid complexity in the protein classification process. In order to designate protein families with computer-aided systems, protein sequences need to be converted to numerical representations. In this paper, we provide a novel protein mapping method based on Fibonacci numbers and a hashing table (FIBHASH). Each amino acid code is assigned a Fibonacci number based on its integer representation. These amino acid codes are then inserted into a hashing table of size 20 to be classified with recurrent neural networks. To determine the performance of the proposed mapping method, we used accuracy, F1-score, recall, precision, and AUC as evaluation criteria. In addition, the results are compared with those of other protein mapping techniques, including EIIP, hydrophobicity, CPNR, Atchley factors, BLOSUM62, PAM250, binary one-hot encoding, and randomly encoded representations. The proposed method showed promising results, with an accuracy of 92.77% and an AUC score of 0.98.

Associative database of protein sequences

Bioinformatics, 1999

Motivation: We present a new concept that combines data storage and data analysis in genome research, based on an associative network memory. As an illustration, 115 000 conserved regions from over 73 000 published sequences (i.e. from the entire annotated part of the SWISSPROT sequence database) were identified and clustered by a self-organizing network. Similarity and kinship, as well as degree of distance between the conserved protein segments, are visualized as neighborhood relationships on a two-dimensional topographical map. Results: Such a display overcomes the restrictions of linear list processing and allows local and global sequence relationships to be studied visually. Families are memorized as prototype vectors of conserved regions. On a massively parallel machine, clustering and updating of the database take only a few seconds; a rapid analysis of incoming data such as protein sequences or ESTs is carried out on present-day workstations.
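A self-organizing network that stores prototype vectors on a two-dimensional map can be sketched in a few lines. The grid size, learning-rate and neighborhood schedules, and the synthetic three-dimensional data below are all assumptions for illustration; the abstract does not describe the network's parameters or how conserved regions are encoded.

```python
import numpy as np

def best_matching_unit(weights, x):
    """Grid coordinates of the prototype vector closest to x."""
    d = ((weights - x) ** 2).sum(axis=-1)
    return np.unravel_index(np.argmin(d), d.shape)

def train_som(data, rows=4, cols=4, epochs=50, lr0=0.5, sigma0=2.0, seed=0):
    """Fit a rows x cols map of prototype vectors to the data."""
    rng = np.random.default_rng(seed)
    weights = rng.random((rows, cols, data.shape[1]))
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    steps, t = epochs * len(data), 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            frac = t / steps
            lr = lr0 * (1 - frac)                # learning rate decays to 0
            sigma = sigma0 * (1 - frac) + 0.5    # neighborhood shrinks over time
            bmu = np.array(best_matching_unit(weights, x))
            dist2 = ((grid - bmu) ** 2).sum(axis=-1)
            influence = np.exp(-dist2 / (2 * sigma ** 2))
            # Pull the winner and its grid neighbors toward the sample.
            weights += lr * influence[..., None] * (x - weights)
            t += 1
    return weights

# Two synthetic, well-separated clusters standing in for two families.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 0.05, (20, 3)), rng.normal(1.0, 0.05, (20, 3))])
som = train_som(data)
unit_a = best_matching_unit(som, np.zeros(3))
unit_b = best_matching_unit(som, np.ones(3))
print(unit_a, unit_b)
```

After training, classifying a new item only requires finding its best matching unit, which is the sense in which the map acts as an associative memory.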