TOWARDS CLASSIFYING ORGANISMS BASED ON THEIR PROTEIN PHYSICOCHEMICAL PROPERTIES USING COMPARATIVE INTELLIGENT TECHNIQUES.

An efficient technique for protein classification using feature extraction by artificial neural networks

2010 Annual IEEE India Conference (INDICON), 2010

Classification, or supervised learning, is one of the major data mining processes. Protein classification focuses on predicting the function or the structure of new proteins. This can be done by assigning a new protein to a family with previously known characteristics. Many approaches are available for classification tasks, such as statistical techniques, decision trees, and neural networks. In this paper, three types of neural networks are implemented: a feedforward neural network, a probabilistic neural network, and a radial basis function neural network. The main objective of the paper is to build an efficient classifier using neural networks. The measures used to estimate the performance of the classifier are Precision, Sensitivity and Specificity.
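The three performance measures named in this abstract can be computed from the counts of true and false positives and negatives. The following is a minimal sketch in plain Python, not the paper's implementation; the function name `binary_metrics` and the one-vs-rest treatment of a single positive class are assumptions made for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return precision, sensitivity and specificity for one positive class.

    Hypothetical helper: every label other than `positive` is treated as
    negative (a one-vs-rest view of a multi-class problem).
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # correctly predicted positive
        elif pred == positive:
            fp += 1  # predicted positive, actually negative
        elif truth == positive:
            fn += 1  # missed positive
        else:
            tn += 1  # correctly predicted negative

    def safe_div(num, den):
        # Report 0.0 instead of failing when a denominator is empty.
        return num / den if den else 0.0

    return {
        "precision": safe_div(tp, tp + fp),    # TP / (TP + FP)
        "sensitivity": safe_div(tp, tp + fn),  # TP / (TP + FN), also called recall
        "specificity": safe_div(tn, tn + fp),  # TN / (TN + FP)
    }


if __name__ == "__main__":
    truth = [1, 1, 1, 0, 0, 0, 0, 1]
    preds = [1, 0, 1, 0, 1, 0, 0, 1]
    print(binary_metrics(truth, preds))
```

For a protein family classifier with several families, the same function could be called once per family, passing that family's label as `positive`.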

Protein Classification using Machine Learning and Statistical Techniques

Recent Advances in Computer Science and Communications

Background: Predicting the enzyme class of an unknown protein is one of the challenging tasks in bioinformatics. The number of known proteins grows daily, which makes clinical verification and classification difficult; as a result, enzyme class prediction offers a new opportunity to bioinformatics researchers. Machine learning classification techniques help in protein classification and prediction, but it is imperative to know which technique is best suited to the task. This study used human protein data extracted from the UniProtKB databank. A total of 4368 protein records with 45 identified features were used for experimental analysis. Objective: The prime objective of this article is to find an appropriate classification technique for classifying reviewed as well as unreviewed human enzyme-class protein data, and to determine the significance of different features in protein classification and prediction. Method: In this ...

Application of Intelligent Techniques for Classification of Bacteria Using Protein Sequence-Derived Features

Standard molecular experimental methodologies and mathematical procedures often fail to answer many phylogeny and classification related issues. Modern artificial intelligence-based techniques, such as radial basis function networks, genetic algorithms, artificial neural networks, and support vector machines, have ample potential in this regard. Relying on a large number of essential parameters, as opposed to a single molecular parameter, improves robustness, reliability, and accuracy. This study was conducted with a dataset of computed protein physicochemical properties belonging to 20 different bacterial genera. A total of 57 sequential and structural parameters derived from protein sequences were considered for the initial classification. Feature selection techniques were employed to find the most important features influencing the dataset. Various amino acids, hydrophobicity, relative sulfur percentage, and codon number were selected as important parameters during the study. Comparative analyses were performed using the RapidMiner data mining platform. The support vector machine proved to be the best method, with a maximum accuracy of more than 91%.
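To make the idea of sequence-derived physicochemical features concrete, the sketch below computes a few of them from a raw amino acid string. It is an illustration only: the abstract does not define its 57 parameters, so the sulfur measure used here (percentage of Cys and Met residues) and the choice of the Kyte-Doolittle hydropathy scale are assumptions, and the function name `sequence_features` is hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = sorted(KYTE_DOOLITTLE)


def sequence_features(seq):
    """Return a dict of simple composition-based features for one protein."""
    # Keep only the 20 standard residues; ambiguous codes such as X are skipped.
    residues = [aa for aa in seq.upper() if aa in KYTE_DOOLITTLE]
    n = len(residues)
    if n == 0:
        raise ValueError("sequence contains no standard amino acids")

    # Percentage of each amino acid in the sequence.
    feats = {f"pct_{aa}": 100.0 * residues.count(aa) / n for aa in AMINO_ACIDS}
    # Assumed stand-in for a sulfur measure: share of Cys and Met residues.
    feats["pct_sulfur_residues"] = feats["pct_C"] + feats["pct_M"]
    # Mean hydropathy over the sequence (often called GRAVY).
    feats["mean_hydropathy"] = sum(KYTE_DOOLITTLE[aa] for aa in residues) / n
    return feats


if __name__ == "__main__":
    features = sequence_features("MCAA")
    print(features["pct_sulfur_residues"], features["mean_hydropathy"])
```

One feature vector per protein, built this way, could then be passed to any of the classifiers compared in the study.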

Feature selection and comparison of classifiers for predicting protein class

Journal of Information and Data Management

Knowing the function of proteins is essential for understanding several biological systems. Laboratory experiments to determine protein class are costly and time-consuming. Therefore, it is necessary to provide efficient computational models to identify the class to which a protein belongs. Nowadays, a significant volume of information regarding proteins and their structure is continually being made available in public data repositories. For example, the STING_DB database holds a large amount of information extracted from all protein structural levels (primary, secondary, tertiary, and quaternary), which is frequently used in classification models for this type of problem. However, it is unknown which physical-chemical properties contribute most to predicting the class. Therefore, there is a need to identify the most suitable subset of properties. In this work, we propose an approach based on a multi-objective genetic algorithm with the c...

Machine learning techniques in biological data classification and clustering: Initiation of a scientific voyage

Zenodo (CERN European Organization for Nuclear Research), 2020

Machine learning (ML) techniques have revolutionized data classification, clustering, segregation, and novel element identification. ML techniques are gaining tremendous momentum in the classification of complex biological data. A number of studies have reported novel data classification methods, complex biological element classification, and clustering. The present article summarizes our experience in classifying biological species based on biomarker genes and important proteins using state-of-the-art machine learning algorithms, including artificial neural networks, support vector machines, decision trees, and Bayesian methods. Increased complexity warranted thorough human investigation and inspection to achieve a better classification on a case-by-case basis. The outcomes were satisfactory and yielded novel strategies, along with identifying the comparative superiority of specific algorithms for specific datasets. However, obtaining a universal method or strategy remains a future objective. Automating the process and increasing the precision of classification and clustering for multi-parametric complex biological datasets are the other future goals.

Improved Automatic Classification of Biological

In this paper several neural network classification algorithms have been applied to a real-world case of electron microscopy image classification in which the existence of two differentiated views of the same specimen was known a priori. Using several labeled sets as a reference, the parameters and architecture of the classifier (both LVQ-trained codebooks and BP-trained neural nets) were optimized using a genetic algorithm. The automatic process of training and optimization is implemented using a new version of the G-LVQ (genetic learning vector quantization) and G-PROP algorithms, and compared to non-optimized versions of the algorithms: Kohonen's LVQ (learning vector quantization) and an MLP trained with QP. Dividing all the available samples into three sets, for training, testing and validation, the results presented here show a low average error for unknown samples. G-PROP usually outperforms G-LVQ, but G-LVQ obtains codebooks with fewer parameters than the perceptrons obtained by G-PROP. The implications of this kind of automatic classification algorithm for determining the three-dimensional structure of biological particles are finally discussed.

Protein classification artificial neural system

Protein Science, 1992

A neural network classification method is developed as an alternative approach to the large database search/organization problem. The system, termed Protein Classification Artificial Neural System (ProCANS), has been implemented on a Cray supercomputer for rapid superfamily classification of unknown proteins based on the information content of the neural interconnections. The system employs an n-gram hashing function that is similar to the k-tuple method for sequence encoding. A collection of modular back-propagation networks is used to store the large amount of sequence patterns. The system has been trained and tested with the first 2,148 of the 8,309 entries of the annotated Protein Identification Resource protein sequence database (release 29). The entries included the electron transfer proteins and the six enzyme groups (oxidoreductases, transferases, hydrolases, lyases, isomerases, and ligases), with a total of 620 superfamilies. After a total training time of seven Cray central processing unit (CPU) hours, the system has reached a predictive accuracy of 90%. The classification is fast (i.e., 0.1 Cray CPU second per sequence), as it only involves a forward-feeding through the networks. The classification time on a full-scale system embedded with all known superfamilies is estimated to be within 1 CPU second. Although the training time will grow linearly with the number of entries, the classification time is expected to remain low even if there is a 10- to 100-fold increase of sequence entries. The neural database, which consists of a set of weight matrices of the networks, together with the ProCANS software, can be ported to other computers and made available to the genome community. The rapid and accurate superfamily classification would be valuable to the organization of protein sequence databases and to the gene recognition in large sequencing projects.
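The n-gram encoding mentioned in this abstract turns a variable-length sequence into a fixed-length numeric vector that a network can take as input. The sketch below shows one way such an encoding could look; it is not the ProCANS hashing function, whose details the abstract does not give. The base-20 index, the modulo folding into `size` slots, and the name `ngram_hash_vector` are all assumptions for illustration.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def ngram_hash_vector(seq, n=2, size=400):
    """Encode a protein sequence as normalized counts of hashed n-grams.

    Hypothetical scheme: each overlapping n-gram is read as a base-20
    number and folded into `size` slots with a modulo.
    """
    vec = [0.0] * size
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - n + 1):
        gram = seq[i:i + n]
        if any(ch not in INDEX for ch in gram):
            continue  # skip n-grams containing non-standard residues
        code = 0
        for ch in gram:
            code = code * len(ALPHABET) + INDEX[ch]
        vec[code % size] += 1.0
        total += 1
    if total:
        # Normalize so sequences of different lengths are comparable.
        vec = [v / total for v in vec]
    return vec


if __name__ == "__main__":
    v = ngram_hash_vector("ACAC")
    print(len(v), v[1], v[20])
```

With `n=2` and `size=400` every dipeptide gets its own slot; a smaller `size` trades collisions for a shorter input vector, which matters when the vector feeds a network of fixed width.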

A more appropriate Protein Classification using Data Mining

2011

Research in bioinformatics is complex because it spans two knowledge domains, namely biological and computer sciences. This paper introduces an efficient data mining approach for classifying proteins into useful groups by representing them in a hierarchical tree structure. Several techniques are used to classify proteins, but most of them have some drawbacks in their grouping. Among them, the most efficient grouping technique is the one used by PSIMAP (Protein Structural Interactome Map). Although PSIMAP successfully incorporates most proteins, it fails to classify proteins with the scale-free property. Our technique overcomes this drawback and successfully maps all proteins into groups, including the scale-free property proteins that PSIMAP fails to group. Our approach selects six major attributes of a protein, a) Structure comparison b) Sequence comparison c) Connectivity d) Cluster index e) Interactivity f) Taxonomic, to group the proteins from the databank by generating a hierarchical tree structure. The proposed approach calculates the degree (probability) of similarity of each protein newly entered into the system against the existing proteins in the system by applying probability theory to each of the six properties.

An empirical study of different approaches for protein classification

TheScientificWorldJournal, 2014

Many domains would benefit from reliable and efficient systems for automatic protein classification. An area of particular interest in recent studies on automatic protein classification is the exploration of new methods for extracting features from a protein that work well for specific problems. These methods, however, are not generalizable and have proven useful in only a few domains. Our goal is to evaluate several feature extraction approaches for representing proteins by testing them across multiple datasets. Different types of protein representations are evaluated: those starting from the position-specific scoring matrix (PSSM) of the proteins, those derived from the amino acid sequence, two matrix representations, and features taken from the 3D tertiary structure of the protein. We also test new variants of protein descriptors. We develop our system experimentally by comparing and combining different descriptors taken from the protein representations. Each descriptor is used ...

Protein Classification with Multiple Algorithms

Advances in Informatics, 2005

The number of protein sequences stored in central protein databases by labs all over the world is constantly increasing. Only a fraction of these proteins has been experimentally analyzed to detect their structure and hence their function in the corresponding organism, because experimental determination of structure is labor-intensive and time-consuming. There is therefore a need for automated tools that can classify new proteins into structural families. This paper presents a comparative evaluation of several algorithms that learn such classification models from data concerning patterns of proteins with known structure. In addition, several approaches that combine multiple learning algorithms to increase the accuracy of predictions are evaluated. The results of the experiments provide insights that can help biologists and computer scientists design high-performance protein classification systems.
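One simple way to combine multiple learning algorithms is plurality voting over their predictions. The abstract does not say which combination schemes the paper evaluates, so the sketch below is only an illustration of the general idea; the function name `majority_vote` and the tie-breaking rule (earliest model wins) are assumptions.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one list by plurality vote.

    `predictions_per_model` holds one list of predicted labels per model,
    all of the same length. Ties go to the label voted by the earliest
    model in the list (an assumed convention).
    """
    if not predictions_per_model:
        raise ValueError("need at least one model's predictions")
    n_samples = len(predictions_per_model[0])
    if any(len(p) != n_samples for p in predictions_per_model):
        raise ValueError("all models must predict the same number of samples")

    combined = []
    for votes in zip(*predictions_per_model):
        counts = Counter(votes)
        top = max(counts.values())
        # First label, in model order, that reaches the top vote count.
        combined.append(next(v for v in votes if counts[v] == top))
    return combined


if __name__ == "__main__":
    model_a = ["alpha", "beta", "alpha/beta"]
    model_b = ["alpha", "alpha/beta", "alpha/beta"]
    model_c = ["beta", "beta", "alpha"]
    print(majority_vote([model_a, model_b, model_c]))
```

Ordering the models from most to least trusted makes the tie-breaking rule act as a lightweight priority scheme.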