Why neural networks should not be used for HIV-1 protease cleavage site prediction

Machine learning for HIV-1 protease cleavage site prediction

Pattern Recognition Letters, 2006

Recently, several works have approached the HIV-1 protease specificity problem by applying classifier creation and combination methods, known as ensemble methods, from the field of machine learning. However, researchers still find it difficult to choose the best method because no effective comparison exists. For the first time, we present an extensive study of methods for feature extraction, feature transformation and multiclassifier systems (MCS) for the HIV-1 protease problem. In this work we report an experimental comparison of several learning systems coupled with different feature representations.

Support vector machines for predicting HIV protease cleavage sites in protein

Journal of Computational Chemistry, 2002

Knowledge of the polyprotein cleavage sites by HIV protease will refine our understanding of its specificity, and the information thus acquired is useful for designing specific and efficient HIV protease inhibitors. The search for suitable HIV protease inhibitors will be greatly expedited if one can find an accurate, robust, and rapid method for predicting the cleavage sites in proteins by HIV protease. In this article, a Support Vector Machine is applied to predict the cleavability of oligopeptides by proteases with multiple and extended specificity subsites. We selected HIV-1 protease as the subject of the study. Two hundred ninety-nine oligopeptides were chosen for the training set, while the other 63 oligopeptides were taken as a test set. Because of its high rate of self-consistency (299/299 = 100%), a good result in the jackknife test (286/299 = 95%) and correct prediction rate (55/63 = 87%), the Support Vector Machine method is expected to be a useful supporting technique for finding effective inhibitors of HIV protease, which is one of the targets in designing potential drugs against AIDS. The principle of the Support Vector Machine method can also be applied to analyzing the specificity of other multisubsite enzymes.
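The three rates quoted in this abstract are plain ratios of correct predictions to peptides tested. The short Python sketch below recomputes them from the counts given above; the helper name `percent` is hypothetical and not taken from the paper. The jackknife test is a leave-one-out procedure: each of the 299 training peptides is held out in turn and predicted by a model trained on the remaining 298.

```python
def percent(correct: int, total: int) -> float:
    """Return the share of correct predictions as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


# Counts are copied from the abstract; the labels are informal descriptions.
results = {
    "self-consistency (training set)": (299, 299),
    "jackknife (leave-one-out)": (286, 299),
    "independent test set": (55, 63),
}

for name, (correct, total) in results.items():
    print(f"{name}: {correct}/{total} = {percent(correct, total):.1f}%")
```

To one decimal place the jackknife rate is 95.7% and the test-set rate is 87.3%, so the whole-number figures in the abstract (95% and 87%) appear to be truncated rather than rounded.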

Support Vector Machines for HIV-1 Protease Cleavage Site Prediction

Lecture Notes in Computer Science, 2005

Recently, several works have approached the HIV-1 protease specificity problem by applying classifier creation and combination methods from the field of machine learning. In this work we propose a hierarchical classifier (HC) architecture. Moreover, we show that radial basis function support vector machines may obtain a lower error rate than linear support vector machines if a feature selection step and a feature transformation step are performed. The error rate decreases from 9.1% using linear support vector machines to 6.85% using the new hierarchical classifier.

A reliable method for HIV-1 protease cleavage site prediction

Neurocomputing, 2006

Recently, several works have approached the HIV-1 protease specificity problem by applying techniques from machine learning. In this work, an encoding scheme based on the BLOSUM50 matrix is investigated. We show that by combining a linear discriminant classifier and a radial basis function support vector machine we obtain higher performance than previously published in the literature.

State of the art prediction of HIV-1 protease cleavage sites

Bioinformatics

Understanding the substrate specificity of HIV-1 protease is important when designing effective HIV-1 protease inhibitors. Furthermore, characterizing and predicting the cleavage profile of HIV-1 protease is essential to generate and test hypotheses of how HIV-1 affects proteins of the human host. Currently available tools for predicting cleavage by HIV-1 protease can be improved. The linear support vector machine with orthogonal encoding is shown to be the best predictor for HIV-1 protease cleavage. It is considerably better than current publicly available predictor services. It is also found that schemes using physicochemical properties do not improve over the standard orthogonal encoding scheme. Some issues with the currently available data are discussed. The data sets used, which are the most important part, are available at the UCI Machine Learning Repository. The tools used are all standard and easily available. Contact: thorsteinn.rognvaldsson@hh.se. © The Author (2014). Published by ...
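The orthogonal encoding named in this abstract is what is usually called one-hot encoding: each residue becomes a 20-dimensional indicator vector, and the vectors for all positions are concatenated. The sketch below is a minimal illustration, not the authors' code. It assumes 8-residue peptides (so 8 × 20 = 160 features) and an alphabetical ordering of the one-letter amino acid codes; the function name `orthogonal_encode` and the example string are illustrative choices.

```python
from __future__ import annotations

# The 20 standard amino acids in one-letter code; the ordering is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def orthogonal_encode(peptide: str) -> list[int]:
    """Concatenate one 20-dimensional indicator vector per residue."""
    features: list[int] = []
    for residue in peptide.upper():
        if residue not in INDEX:
            raise ValueError(f"unknown residue: {residue!r}")
        one_hot = [0] * len(AMINO_ACIDS)
        one_hot[INDEX[residue]] = 1
        features.extend(one_hot)
    return features


# Example octamer, used here only as an input string.
x = orthogonal_encode("SQNYPIVQ")
print(len(x), sum(x))  # 160 features, exactly one set per position
```

A vector built this way can be passed to any linear classifier. Because every position gets its own block of 20 inputs, a linear model learns one weight per (position, residue) pair.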

No Algorithm Beats the Simple Perceptron on HIV Protease Function Prediction

We review past work for predicting and understanding HIV protease function, using both machine learning algorithms and other classification algorithms. We show that the best algorithm for solving the task is the simple Perceptron, which has never before been applied to this problem. The simple Perceptron is efficient because the peptide data set is linearly separable, a fact that previous researchers seem to have overlooked. We also discuss the issue of data set size in relation to the size of the feature space, and classifier bias and variance and how this relates to the HIV protease function prediction problem.
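The claim rests on a standard property of the perceptron: when the two classes are linearly separable, the mistake-driven update rule reaches a separating hyperplane after finitely many updates. The sketch below shows that rule on a tiny two-dimensional toy set, which is hypothetical and not peptide data. The function names, the learning rate of 1, and the epoch cap are illustrative assumptions.

```python
from __future__ import annotations


def train_perceptron(samples: list[tuple[float, ...]], labels: list[int],
                     max_epochs: int = 100) -> tuple[list[float], float]:
    """Train a perceptron with labels in {-1, +1}; stop once an epoch has no mistakes."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * score <= 0:  # misclassified, or exactly on the boundary
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:
            break
    return weights, bias


def predict(weights: list[float], bias: float, x: tuple[float, ...]) -> int:
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else -1


# Toy, linearly separable data (logical AND), purely for illustration.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
y = [-1, -1, -1, 1]

w, b = train_perceptron(X, y)
print([predict(w, b, x) for x in X] == y)
```

On data that is not linearly separable the loop would simply run until `max_epochs` without reaching a mistake-free epoch, which is why separability of the peptide data matters to the argument.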

Comparison among feature extraction methods for HIV-1 protease cleavage site prediction

Pattern Recognition, 2006

Recently, several works have approached the HIV-1 protease specificity problem by applying methods from the field of machine learning. However, researchers still find it difficult to choose the best method because no effective comparison exists. For the first time, we present an extensive study of feature extraction methods for the HIV-1 protease problem. We show that a fusion of classifiers trained in different feature spaces yields a drastic error reduction with respect to the state of the art.
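The abstract does not say which fusion rule is used. As one common possibility, the sketch below averages the per-sample scores of several classifiers (often called the sum or mean rule) and thresholds the result. The function names, the 0.5 threshold, and all scores are hypothetical values chosen for illustration.

```python
from __future__ import annotations


def fuse_by_mean(score_lists: list[list[float]]) -> list[float]:
    """Average the scores that several classifiers assign to the same samples."""
    n_samples = len(score_lists[0])
    if any(len(scores) != n_samples for scores in score_lists):
        raise ValueError("every classifier must score the same samples")
    return [sum(column) / len(score_lists) for column in zip(*score_lists)]


def decide(scores: list[float], threshold: float = 0.5) -> list[int]:
    """Label a sample 1 (cleaved) when its fused score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


# Hypothetical scores from three classifiers, each trained on a different
# feature representation of the same three peptides.
clf_a = [0.9, 0.4, 0.2]
clf_b = [0.7, 0.7, 0.1]
clf_c = [0.8, 0.3, 0.4]

fused = fuse_by_mean([clf_a, clf_b, clf_c])
print(decide(fused))
```

For the second peptide, one classifier alone would vote for cleavage, but the fused score stays below the threshold. This is the kind of disagreement that fusion across feature spaces is meant to resolve.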

Neural Network and Bioinformatic Methods for Predicting HIV-1 Protease Inhibitor Resistance

CAS/CNS Technical Report Series, 2010

This article presents a new method for predicting viral resistance to seven protease inhibitors from the HIV-1 genotype, and for identifying the positions in the protease gene at which the specific nature of the mutation affects resistance. The neural network Analog ARTMAP predicts protease inhibitor resistance from viral genotypes. A feature selection method detects genetic positions that contribute to resistance both alone and through interactions with other positions. This method has identified positions 35, 37, 62, and 77, where traditional feature selection methods have not detected a contribution to resistance. At several positions in the protease gene, mutations confer differing degrees of resistance, depending on the specific amino acid to which the sequence has mutated. To find these positions, an Amino Acid Space is introduced to represent genes in a vector space that captures the functional similarity between amino acid pairs. Feature selection identifies several new positions, including 36, 37, and 43, with amino acid-specific contributions to resistance. Analog ARTMAP networks applied to inputs that represent specific amino acids at these positions perform better than networks that use only mutation locations.

Artificial neural network model for predicting HIV protease cleavage sites in protein

Advances in Engineering Software, 1998

This study presents an artificial neural network (ANN) model to predict the values of the longitudinal dispersion coefficient ($D_l$) in rivers from their main hydraulic parameters. The model can be considered a useful aid to water quality monitoring in rivers. The ANN model is a relatively new and promising technique that can make use of the river width, depth, velocity, and shear velocity for predicting $D_l$. The ANN model used is based on a back-propagation algorithm to train a multi-layer feedforward network. The proposed model was verified using 116 sets of field data collected from 62 streams ranging from straight manmade canals to sinuous natural rivers. The ANN model predicts $D_l$, where more than 83% of the calculated values range from 0.50 to 2.0 times the observed values in the field. A comparison of the ANN model estimates with the outputs of the most recent and accurate equations in the literature for the longitudinal dispersion coefficient, using three different statistical methods for analysis, has shown that the accuracy of the ANN model compares favourably with other equations. Finally, a new accurate predictor for the values of $D_l$ in polluted streams that is based on readily measurable hydraulic quantities is presented.