GOASVM: A subcellular location predictor by incorporating term-frequency gene ontology into the general form of Chou's pseudo-amino acid composition (original) (raw)

GOASVM: Protein subcellular localization prediction based on Gene ontology annotation and SVM

ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 2012

Protein subcellular localization is an essential step to annotate proteins and to design drugs. This paper proposes a functionaldomain based method-GOASVM-by making full use of Gene Ontology Annotation (GOA) database to predict the subcellular locations of proteins. GOASVM uses the accession number (AC) of a query protein and the accession numbers (ACs) of homologous proteins returned from PSI-BLAST as the query strings to search against the GOA database. The occurrences of a set of predefined GO terms are used to construct the GO vectors for classification by support vector machines (SVMs). The paper investigated two different approaches to constructing the GO vectors. Experimental results suggest that using the ACs of homologous proteins as the query strings can achieve an accuracy of 94.68%, which is significantly higher than all published results based on the same dataset. As a userfriendly web-server, GOASVM is freely accessible to the public at http://bioinfo.eie.polyu.edu.hk/mGoaSvmServer/GOASVM.html.

ProLoc-GO: Utilizing informative Gene Ontology terms for sequence-based prediction of protein subcellular localization

BMC Bioinformatics, 2008

Background: Gene Ontology (GO) annotation, which describes the function of genes and gene products across species, has recently been used to predict protein subcellular and subnuclear localization. Existing GO-based prediction methods for protein subcellular localization use the known accession numbers of query proteins to obtain their annotated GO terms. An accurate prediction method for predicting subcellular localization of novel proteins without known accession numbers, using only the input sequence, is worth developing.

An Ensemble Classifier for Eukaryotic Protein Subcellular Location Prediction Using Gene Ontology Categories and Amino Acid Hydrophobicity

PLoS ONE, 2012

With the rapid increase of protein sequences in the post-genomic age, it is challenging to develop accurate and automated methods for reliably and quickly predicting their subcellular localizations. Till now, many efforts have been tried, but most of which used only a single algorithm. In this paper, we proposed an ensemble classifier of KNN (k-nearest neighbor) and SVM (support vector machine) algorithms to predict the subcellular localization of eukaryotic proteins based on a voting system. The overall prediction accuracies by the one-versus-one strategy are 78.17%, 89.94% and 75.55% for three benchmark datasets of eukaryotic proteins. The improved prediction accuracies reveal that GO annotations and hydrophobicity of amino acids help to predict subcellular locations of eukaryotic proteins.

A new hybrid approach to predict subcellular localization of proteins by incorporating gene ontology

Biochemical and Biophysical Research Communications, 2003

Based on the recent development in the gene ontology and functional domain databases, a new hybridization approach is developed for predicting protein subcellular location by combining the gene product, functional domain, and quasi-sequence-order effects. As a showcase, the same prokaryotic and eukaryotic datasets, which were studied by many previous investigators, are used for demonstration. The overall success rate by the jackknife test for the prokaryotic set is 94.7% and that for the eukaryotic set 92.9%. These are so far the highest success rates achieved for the two datasets by following a rigorous cross-validation test procedure, suggesting that such a hybrid approach may become a very useful high-throughput tool in the area of bioinformatics, proteomics, as well as molecular cell biology. The very high success rates also reflect the fact that the subcellular localization of a protein is closely correlated with: (1) the biological objective to which the gene or gene product contributes, (2) the biochemical activity of a gene product, and (3) the place in the cell where a gene product is active.

mGOASVM: Multi-label protein subcellular localization based on gene ontology and support vector machines

BMC Bioinformatics, 2012

Background Although many computational methods have been developed to predict protein subcellular localization, most of the methods are limited to the prediction of single-location proteins. Multi-location proteins are either not considered or assumed not existing. However, proteins with multiple locations are particularly interesting because they may have special biological functions, which are essential to both basic research and drug discovery. Results This paper proposes an efficient multi-label predictor, namely mGOASVM, for predicting the subcellular localization of multi-location proteins. Given a protein, the accession numbers of its homologs are obtained via BLAST search. Then, the original accession number and the homologous accession numbers of the protein are used as keys to search against the Gene Ontology (GO) annotation database to obtain a set of GO terms. Given a set of training proteins, a set of T relevant GO terms is obtained by finding all of the GO terms in the...

A novel representation of protein sequences for prediction of subcellular location using support vector machines

Protein Science, 2005

As the number of complete genomes rapidly increases, accurate methods to automatically predict the subcellular location of proteins are increasingly useful to help their functional annotation. In order to improve the predictive accuracy of the many prediction methods developed to date, a novel representation of protein sequences is proposed. This representation involves local compositions of amino acids and twin amino acids, and local frequencies of distance between successive (basic, hydrophobic, and other) amino acids. For calculating the local features, each sequence is split into three parts: Nterminal, middle, and C-terminal. The N-terminal part is further divided into four regions to consider ambiguity in the length and position of signal sequences. We tested this representation with support vector machines on two data sets extracted from the SWISS-PROT database. Through fivefold crossvalidation tests, overall accuracies of more than 87% and 91% were obtained for eukaryotic and prokaryotic proteins, respectively. It is concluded that considering the respective features in the Nterminal, middle, and C-terminal parts is helpful to predict the subcellular location.

HybridGO-Loc: Mining hybrid features on gene ontology for predicting subcellular localization of multi-location proteins

2014

Protein subcellular localization prediction, as an essential step to elucidate the functions in vivo of proteins and identify drugs targets, has been extensively studied in previous decades. Instead of only determining subcellular localization of single-label proteins, recent studies have focused on predicting both single-and multi-location proteins. Computational methods based on Gene Ontology (GO) have been demonstrated to be superior to methods based on other features. However, existing GO-based methods focus on the occurrences of GO terms and disregard their relationships. This paper proposes a multi-label subcellular-localization predictor, namely HybridGO-Loc, that leverages not only the GO term occurrences but also the inter-term relationships. This is achieved by hybridizing the GO frequencies of occurrences and the semantic similarity between GO terms. Given a protein, a set of GO terms are retrieved by searching against the gene ontology database, using the accession numbers of homologous proteins obtained via BLAST search as the keys. The frequency of GO occurrences and semantic similarity (SS) between GO terms are used to formulate frequency vectors and semantic similarity vectors, respectively, which are subsequently hybridized to construct fusion vectors. An adaptivedecision based multi-label support vector machine (SVM) classifier is proposed to classify the fusion vectors. Experimental results based on recent benchmark datasets and a new dataset containing novel proteins show that the proposed hybridfeature predictor significantly outperforms predictors based on individual GO features as well as other state-of-the-art predictors. For readers' convenience, the HybridGO-Loc server, which is for predicting virus or plant proteins, is available online at http://bioinfo.eie.polyu.edu.hk/HybridGoServer/.

Supervised learning method for the prediction of subcellular localization of proteins using amino acid and amino acid pair composition

BMC Genomics, 2008

Background: Occurrence of protein in the cell is an important step in understanding its function. It is highly desirable to predict a protein's subcellular locations automatically from its sequence. Most studied methods for prediction of subcellular localization of proteins are signal peptides, the location by sequence homology, and the correlation between the total amino acid compositions of proteins. Taking amino-acid composition and amino acid pair composition into consideration helps improving the prediction accuracy.

Predicting subcellular localization of proteins based on their N-terminal amino acid sequence

Journal of molecular …, 2000

The localization of a protein in a cell is closely correlated with its biological function. With the number of sequences entering into databanks rapidly increasing, the importance of developing a powerful high-throughput tool to determine protein subcellular location has become selfevident. In view of this, the Nearest Neighbour Algorithm was developed for predicting the protein subcellular location using the strategy of hybridizing the information derived from the recent development in gene ontology with that from the functional domain composition as well as the pseudo amino acid composition. Results: As a showcase, the same plant and non-plant protein datasets as investigated by the previous investigators were used for demonstration. The overall success rate of the jackknife test for the plant protein dataset was 86%, and that for the non-plant protein dataset 91.2%. These are the highest success rates achieved so far for the two datasets by following a rigorous cross-validation test procedure, suggesting that such a hybrid approach (particularly by incorporating the knowledge of gene ontology) may become a very useful highthroughput tool in the area of bioinformatics, proteomics, as well as molecular cell biology. Availability: The software would be made available on sending a request to the authors. Contact: y.cai@umist.ac.uk Bioinformatics 20

Predicting subcellular localization of multi-location proteins by improving support vector machines with an adaptive-decision scheme

International Journal of Machine Learning and Cybernetics, 2015

From the perspective of machine learning, predicting subcellular localization of multi-location proteins is a multi-label classification problem. Conventional multi-label classifiers typically compare some pattern-matching scores with a fixed decision threshold to determine the number of subcellular locations in which a protein will reside. This simple strategy, however, may easily lead to over-prediction due to a large number of false positives. To address this problem, this paper proposes a more powerful multi-label predictor, namely AD-SVM, which incorporates an adaptive-decision (AD) scheme into multi-label support vector machine (SVM) classifiers. Specifically, given a query protein, a termfrequency based gene ontology vector is constructed by successively searching the gene ontology annotation database. Subsequently, the feature vector is classified by AD-SVM, which extends the binary relevance method with an adaptive decision scheme that essentially converts the linear SVMs to piecewise linear SVMs. Experimental results on two stringent benchmark datasets suggest that AD-SVM impressively outperforms existing state-of-the-art multi-location predictors. Results also show that the adaptive-decision scheme can effectively reduce over-prediction while having insignificant effect on the correctly predicted ones.