Vasilis Promponas | University of Cyprus
Papers by Vasilis Promponas
IEEE Access, 2022
Trying to extract features from complex sequential data for classification and prediction problems is an extremely difficult task. This task is even more challenging when both the upstream and downstream information of a time-series is important to process the sequence at a specific time-step. One typical problem which falls in this category is Protein Secondary Structure Prediction (PSSP). Recurrent Neural Networks (RNNs) have been successful in handling sequential data, but these methods are demanding in terms of time and space. On the other hand, simple Feed-Forward Neural Networks (FFNNs) can be trained very quickly with the Backpropagation algorithm, but in practice they give poor results in this category of problems. The Hessian Free Optimization (HFO) algorithm is one of the latest developments in the field of Artificial Neural Network (ANN) training algorithms which can converge faster and more accurately. In this paper, we present the implementation of simple FFNNs trained with the powerful HFO second-order learning algorithm for the PSSP problem. In our approach, a single FFNN trained with the HFO learning algorithm can achieve approximately 79.6% per-residue (Q3) accuracy on the PISCES dataset. Despite the simplicity of our method, the results are comparable to some of the state-of-the-art methods which have been designed for this problem. A majority voting ensemble method and filtering with Support Vector Machines have also been applied, which increase our results to 80.4% per-residue (Q3) accuracy. Finally, our method has been tested on the CASP13 independent dataset and achieved 78.14% per-residue (Q3) accuracy. Moreover, HFO does not require tuning of any parameters, which makes training much faster than other state-of-the-art methods; this is a very important feature with big datasets and facilitates fast training of FFNN ensembles.
INDEX TERMS Hessian free optimization, neural networks, protein secondary structure prediction, second order learning algorithms.
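The two evaluation ideas named in the abstract above, per-residue Q3 accuracy and majority voting over an ensemble, can be illustrated with a minimal Python sketch. The function names, the three-state alphabet (H, E, C) and the toy strings are assumptions made for illustration; this is not the authors' code, and the tie-breaking rule is an arbitrary choice.

```python
from collections import Counter

def q3_accuracy(predicted, observed):
    """Fraction of residues whose predicted state (H, E or C) matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have equal length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)

def majority_vote(predictions):
    """Combine equal-length per-residue predictions, one string per ensemble member."""
    consensus = []
    for column in zip(*predictions):
        # Ties are resolved in favour of the state seen first in the column
        # (an arbitrary, hypothetical rule chosen for this sketch).
        consensus.append(Counter(column).most_common(1)[0][0])
    return "".join(consensus)

# Toy data, invented for illustration only.
observed = "HHHHCCEEEE"
members = ["HHHCCCEEEE", "HHHHCCCEEE", "HHHHCEEEEE"]
consensus = majority_vote(members)
print(consensus, q3_accuracy(consensus, observed))
```

In this toy case each member makes one error at a different position, so the vote recovers the observed string; real ensembles gain far less because member errors are correlated.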
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2012
Filtering of protein secondary structure prediction aims to provide physicochemically realistic results, while it usually improves the predictive performance. We performed a comparative study on this challenging problem, utilising both machine learning techniques and empirical rules, and we found that combinations of the two lead to the highest improvement.
Scientific Reports, 2016
Eukaryotic cells are defined by compartments through which the trafficking of macromolecules is mediated by large complexes, such as the nuclear pore, transport vesicles and intraflagellar transport. The assembly and maintenance of these complexes is facilitated by endomembrane coatomers, long suspected to be divergently related on the basis of structural and more recently phylogenomic analysis. By performing supervised walks in sequence space across coatomer superfamilies, we uncover subtle sequence patterns that have remained elusive to date, ultimately unifying eukaryotic coatomers by divergent evolution. The conserved residues shared by 3,502 endomembrane coatomer components are mapped onto the solenoid superhelix of nucleoporin and COPII protein structures, thus determining the invariant elements of coatomer architecture. This ancient structural motif can be considered as a universal signature connecting eukaryotic coatomers involved in multiple cellular processes across cell p...
Standards in Genomic Sciences, 2015
The function annotation process in computational biology has increasingly shifted from the traditional characterization of individual biochemical roles of protein molecules to the system-wide detection of entire metabolic pathways and genomic structures. The so-called genome-aware methods broaden misannotation inconsistencies in genome sequences beyond protein function assignments, encompassing phylogenetic anomalies and artifactual genomic regions. We outline three categories of error propagation in databases by providing striking examples, at various levels of appreciation by the community from traditional to emerging, thus raising awareness for future solutions.
Background / Purpose: This work is about employing modules and EasyBuild to assist with the software complexity challenge on HPC/Grid/Cloud platforms that support bioinformatics and computational biology activity. Common tools such as BLAST, HMMER, Bowtie, BWA and many more are readily supported. Also, R builds that include Bioconductor and complex Intel-compiler builds are easily possible. Main conclusion: This is possibly the only known long-term supportable solution for shared environments, whereby multiple applications to serve users must co-exist and, at the same time, top performance must be achievable.
Methods in molecular biology (Clifton, N.J.), 2014
Nowadays, it is possible to identify terms corresponding to biological entities within passages in biomedical text corpora: critically, their potential relationships then need to be detected. These relationships are typically detected by co-occurrence analysis, revealing associations between bioentities through their coexistence in single sentences and/or entire abstracts. These associations implicitly define networks, whose nodes represent terms/bioentities/concepts being connected by relationship edges; edge weights might represent confidence for these semantic connections. This chapter provides a review of current methods for co-occurrence analysis, focusing on data storage, analysis, and representation. We highlight scenarios of these approaches implemented by useful tools for information extraction and knowledge inference in the field of systems biology. We illustrate the practical utility of two online resources providing services of this type, namely STRING and BioTextQuest-co...
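As a rough illustration of the co-occurrence idea described in this abstract, the following Python sketch counts how many text units (sentences or abstracts) mention each pair of terms and treats those counts as edge weights. The function name, the placeholder term names and the naive case-insensitive substring matching are all assumptions for illustration; real tools rely on proper bioentity recognition, and nothing here reflects how STRING or BioTextQuest are implemented.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(documents, terms):
    """Count how many documents mention each pair of terms together.

    Returns a Counter mapping alphabetically sorted (term_a, term_b)
    pairs to the number of documents containing both terms.
    """
    edges = Counter()
    for text in documents:
        lowered = text.lower()
        # Naive matching: a term counts as present if it occurs as a substring.
        present = sorted(t for t in terms if t.lower() in lowered)
        for pair in combinations(present, 2):
            edges[pair] += 1
    return edges

# Placeholder sentences and term names, invented for illustration only.
documents = [
    "GeneA interacts with GeneB during stress response.",
    "GeneB regulates GeneC.",
    "GeneA and GeneB are co-expressed.",
]
terms = ["GeneA", "GeneB", "GeneC"]
edges = cooccurrence_edges(documents, terms)
for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```

Raw counts are the simplest possible edge weight; a confidence score would normally also account for how often each term appears on its own.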
2012 IEEE 12th International Conference on Bioinformatics & Bioengineering (BIBE), 2012
A considerable research effort has already been put into the identification (and consequently filtering) of local segments of “unusual” composition (Compositionally Biased or Low Complexity Regions; CBRs or LCRs) in protein sequences. This interest was mainly initiated due to the fact that CBR existence is known to create artifacts (i.e. biologically irrelevant hits) in sequence database search methods. Even though no general biological significance has been demonstrated for CBRs so far, they are often associated with the lack of regular structure. However, application of commonly used methods for CBR detection illustrates that instances of CBRs can be found in proteins with experimentally determined three-dimensional structures. In this work, we highlight sequential properties of CBRs detected by two of the most widely used CBR detection algorithms in carefully compiled datasets of proteins with experimentally determined structures. Our goal is to shed light on the properties of CBR sequences, with the future prospect of elucidating their relation to protein three-dimensional structure.
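To make the notion of a low complexity region concrete, here is a minimal Python sketch that flags sequence windows with low Shannon entropy. It is a generic illustration and not either of the detection algorithms evaluated in the paper; the function names, the window width of 12 and the threshold of 2.2 bits are illustrative assumptions.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Shannon entropy (in bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def low_entropy_windows(sequence, width=12, threshold=2.2):
    """Start indices of all windows whose entropy falls below the threshold."""
    return [
        i
        for i in range(len(sequence) - width + 1)
        if shannon_entropy(sequence[i:i + width]) < threshold
    ]

# Toy sequence, invented for illustration: a poly-Q run between two mixed segments.
sequence = "MKTAYIAKQR" + "Q" * 12 + "DEGHILKMNP"
hits = low_entropy_windows(sequence)
print(hits)
```

Overlapping hits would normally be merged into contiguous regions; that step is omitted to keep the sketch short.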
Proteins: Structure, Function, and Genetics, 2001
A cascading system of hierarchical artificial neural networks is presented, for the generalized classification of proteins into four distinct classes: Transmembrane, Fibrous, Globular and 'Mixed', from information solely encoded in their amino acid sequences. This system, named PRED-CLASS, is a direct descendant of the recently published PRED-TMR2 algorithm, which initially discriminates transmembrane (TM) from globular, water soluble proteins with considerable success for several representative data sets. The architecture of the individual component networks is kept very simple, reducing the number of free parameters (network synaptic weights) for faster training, improved generalization and avoiding overfitting the data. Capturing information from as few as 50 protein sequences spread across the 4 target classes (6 TM, 10 Fibrous, 13 Globular and 17 Mixed), PRED-CLASS was able to obtain 371 correct predictions out of a set of 387 proteins (success rate ~96%) unambiguously assigned into one of the target classes. Application of PRED-CLASS to several test sets and complete proteomes of several organisms demonstrates that such a method could serve as a valuable tool in the annotation of genomic ORFs with no functional assignment or as a preliminary step in fold recognition and 'ab initio' structure prediction methods. Detailed results obtained on various data sets and completed genomes, along with a web server running the PRED-CLASS algorithm, can be accessed over the World Wide Web at the URL:
Briefings in Bioinformatics, 2012
More than a decade ago, a number of methods were proposed for the inference of protein interactions, using whole-genome information from gene clusters, gene fusions and phylogenetic profiles. This structural and evolutionary view of entire genomes has provided a valuable approach for the functional characterization of proteins, especially those without sequence similarity to proteins of known function. Furthermore, this view has raised the real possibility to detect functional associations of genes and their corresponding proteins for any entire genome sequence. Yet, despite these exciting developments, there have been relatively few cases of real use of these methods outside the computational biology field, as reflected by citation analysis. These methods have the potential to be used in high-throughput experimental settings in functional genomics and proteomics to validate results with very high accuracy and good coverage. In this critical survey, we provide a comprehensive overview of the 30 most prominent examples of single pairwise protein interaction cases in small-scale studies, where protein interactions have either been detected by gene fusion or yielded additional, corroborating evidence from biochemical observations. Our conclusion is that with the derivation of a validated gold-standard corpus and better data integration with big experiments, gene fusion detection can truly become a valuable tool for large-scale experimental biology.
Bioinformatics, 1998
FT is a tool written in C++, which implements the Fourier analysis method to locate periodicities in amino acid or DNA sequences. It is provided for free public use on a WWW server with a Java interface.
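The general idea of locating a periodicity by Fourier analysis can be sketched in a few lines of Python: encode the sequence as a 0/1 indicator signal for one symbol, compute the power spectrum with a direct discrete Fourier transform, and report the period of the strongest peak. This is a generic illustration and not the FT tool's implementation; the function names, the indicator encoding and the toy sequence are assumptions.

```python
import cmath

def power_spectrum(signal):
    """Return {k: |DFT_k|^2} for frequency indices k = 1 .. N//2."""
    n = len(signal)
    if n < 2:
        raise ValueError("signal must contain at least two points")
    mean = sum(signal) / n
    centred = [x - mean for x in signal]  # remove the constant offset
    spectrum = {}
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(centred)
        )
        spectrum[k] = abs(coeff) ** 2
    return spectrum

def dominant_period(sequence, symbol):
    """Period, in residues, of the strongest spectral peak for `symbol`."""
    signal = [1.0 if s == symbol else 0.0 for s in sequence]
    spectrum = power_spectrum(signal)
    k_best = max(spectrum, key=spectrum.get)
    return len(sequence) / k_best

# Toy sequence, invented for illustration: glycine at every third position.
sequence = "GPA" * 8
print(dominant_period(sequence, "G"))
```

The direct transform costs O(N^2) operations, which is acceptable for single sequences; a fast Fourier transform would be the usual choice for longer inputs.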
Bioinformatics (Oxford, England), Jan 20, 2015
Local compositionally biased and low complexity regions (LCRs) in amino acid sequences have initially attracted the interest of researchers due to their implication in generating artifacts in sequence database searches. There is accumulating evidence of the biological significance of LCRs both in physiological and in pathological situations. Nonetheless, LCR-related algorithms and tools have not gained wide appreciation across the research community, partly due to the fact that only a handful of user-friendly software is currently freely available. We developed LCR-eXXXplorer, an extensible online platform attempting to fill this gap. LCR-eXXXplorer offers tools for displaying LCRs from the UniProt/SwissProt knowledgebase, in combination with other relevant protein features, predicted or experimentally verified. Moreover, users may perform powerful queries against a custom designed sequence/LCR-centric database. We anticipate that LCR-eXXXplorer will be a useful starting point in re...
Bioinformatics (Oxford, England), Jan 15, 2015
The iterative process of finding relevant information in biomedical literature and performing bioinformatics analyses might result in an endless loop for an inexperienced user, considering the exponential growth of scientific corpora and the plethora of tools designed to mine PubMed® and related biological databases. Herein, we describe BioTextQuest+, a web-based interactive knowledge exploration platform with significant advances to its predecessor (BioTextQuest), aiming to bridge processes such as bioentity recognition, functional annotation, document clustering and data integration towards literature mining and concept discovery. BioTextQuest+ enables PubMed and OMIM querying, retrieval of abstracts related to a targeted request and optimal detection of genes, proteins, molecular functions, pathways and biological processes within the retrieved documents. The front-end interface facilitates the browsing of document clustering per subject, the analysis of term co-occurrence, the generation of tag clouds containing highly represented terms per cluster and at-a-glance popup windows with information about relevant genes and proteins. Moreover, to support experimental research, BioTextQuest+ addresses integration of its primary functionality with biological repositories and software tools able to deliver further bioinformatics services. The Google-like interface extends beyond simple use by offering a range of advanced parameterization for expert users. We demonstrate the functionality of BioTextQuest+ through several exemplary research scenarios including author disambiguation, functional term enrichment, knowledge acquisition and concept discovery linking major human diseases, such as obesity and ageing. Availability: The service is accessible at http://bioinformatics.med.
IEEE journal of biomedical and health informatics, 2014
International Journal on Artificial Intelligence Tools, 2014
This paper presents an in-depth look at how FPGA computing can offer substantial speedups in the execution of bioinformatics algorithms, with specific results achieved to date for a broad range of algorithms. Examples and case studies are presented for sequence comparison (BLAST, CAST), multiple sequence alignment (MAFFT, T-Coffee), RNA and protein secondary structure prediction (Zuker, Predator), gene prediction (Glimmer/GlimmerHMM) and phylogenetic tree computation (RAxML), running on mainstream FPGA technologies as well as high-end FPGA-based systems (Convey HC1, BeeCube). This work also presents technological and other obstacles that need to be overcome in order for FPGA computing to become a mainstream technology in Bioinformatics.