Giovanni Felici | Consiglio Nazionale delle Ricerche (CNR)
Papers by Giovanni Felici
Information Processing & Management, 2014
Much of the valuable information supporting decision-making processes originates in text-based documents. Although these documents can be effectively searched and ranked by modern search engines, actionable knowledge needs to be extracted and transformed into a structured form before being used in a decision process. In this paper we describe how the discovery of semantic information embedded in natural language documents can be viewed as an optimization problem aimed at assigning a sequence of labels (hidden states) to a set of interdependent variables (textual tokens). Dependencies among variables are efficiently modeled through Conditional Random Fields, an undirected graphical model able to represent the distribution of labels given a set of observations. The Markov property of these models prevents them from taking into account long-range dependencies among variables, which are indeed relevant in Natural Language Processing. To overcome this limitation we propose an inference method based on an Integer Programming formulation of the problem, where long-distance dependencies are included through non-deterministic soft constraints.
BIOMAT 2008 - International Symposium on Mathematical and Computational Biology, 2009
Lecture Notes in Computer Science, 1996
SNPs are positions of the DNA sequences where the differences among individuals are embedded. The knowledge of such SNPs is crucial for disease association studies, but even if the number of such positions is low (about 1% of the entire sequence), the cost of extracting the complete information is very high. Recent studies have shown that DNA sequences are structured into blocks of positions that are conserved during evolution and in which there is strong correlation among the values (alleles) of different loci. To reduce the cost of extracting SNP information, this block structure suggests limiting the process to a subset of SNPs, the so-called Tag SNPs, that retain most of the information contained in the whole sequence. In this paper, we apply a technique for feature selection based on integer programming to the problem of Tag SNP selection. Moreover, to test the quality of our approach, we also consider the problem of SNP reconstruction, i.e. the problem of deriving unknown SNPs from the values of Tag SNPs, and propose two reconstruction methods, one based on a majority vote and the other on a machine learning approach. We test our algorithm on two public data sets of different nature, providing results that are, when comparable, in line with the related literature. An interesting aspect of the proposed method is its capability to deal simultaneously with very large SNP sets and, in addition, to provide highly informative reconstruction rules in the form of logic formulas.
BMC Research Notes, 2014
Next Generation Sequencing (NGS) machines extract from a biological sample a large number of short DNA fragments (reads). These reads are then used for several applications, e.g., sequence reconstruction, DNA assembly, gene expression profiling, and mutation analysis. We propose a method to evaluate the similarity between reads. This method does not rely on the alignment of the reads; it is based on the distance between the frequencies of their substrings of fixed length (k-mers). We compare this alignment-free distance with the similarity measures derived from two alignment methods: Needleman-Wunsch and BLAST. The comparison is based on a simple assumption: the most correct distance is obtained by knowing the reference sequence in advance. Therefore, we first align the reads on the original DNA sequence, compute the overlap between the aligned reads, and use this overlap as an ideal distance. We then verify how the alignment-free and the alignment-based distances reproduce this ...
In this paper we present a new bound obtained with the probabilistic method for the solution of the Set Covering problem with unit costs. The bound is valid for problems of fixed dimension, thus extending previous similar asymptotic results, and it depends only on the number of rows of the coefficient matrix and the row densities. We also consider the particular case of matrices that are almost block decomposable, and show how the bound may improve according to the particular decomposition adopted. This final result may provide interesting indications for comparing different matrix decomposition strategies.
Alzheimer's Disease (AD) and its preliminary stage, Mild Cognitive Impairment (MCI), are the most widespread neurodegenerative disorders, and their investigation remains an open challenge. Electroencephalography (EEG) appears to be a non-invasive and repeatable technique for diagnosing brain abnormalities. Despite technical advances, the analysis of EEG spectra is usually carried out by experts who must manually perform laborious interpretations. Computational methods may lead to a quantitative analysis of these signals and hence to a characterization of EEG time series. The aim of this work is to achieve an automatic classification of patients from the EEG biomedical signals involved in AD and MCI, in order to support medical doctors in formulating the correct diagnosis. The analysis of biological EEG signals requires effective and efficient computer science methods to extract relevant information. Data mining, which guides the automated knowledge discovery process, is a natural way to approach EEG data analysis. Specifically, in our work we apply the following analysis steps: (i) pre-processing of EEG data; (ii) processing of the EEG signals by the application of time-frequency transforms; and (iii) classification by means of machine learning methods. We obtain promising results from the classification of AD, MCI, and control samples that can assist medical doctors in identifying the pathology.
Objective: Alzheimer's Disease (AD) is the most common form of dementia, for which no cure is currently known [1]. Different studies have shown that AD has (at least) three major effects on electroencephalography (EEG) signals: enhanced complexity, slowing of signals, and perturbations in EEG synchrony [2]. The aim of this work is to achieve an automatic classification of patients from the EEG biomedical signals involved in AD and Mild Cognitive Impairment (MCI), in order to support physicians in reaching a more accurate individual diagnosis.
Background. DNA assembly consists of reconstructing the unknown primary structure of a DNA sequence from a large number of its fragments, called reads, that are obtained in the sequencing process. The need for fast assembly methods has increased with the introduction of next generation sequencing (NGS) machines, which can produce and extract, at low cost, a large number of short reads from a genomic source. A large class of DNA assembly methods relies on a filtering step, where promising read pairs are separated from non-promising ones in order to reduce the computational burden of the main assembly algorithm. Faster filtering can thus contribute significantly to speeding up the reconstruction of sequenced DNA. Methods. We propose a filtering method for read pairs based on an alignment-free distance. The similarity of two reads is assessed by comparing the frequencies of their substrings of fixed length (k-mers). We compare this alignment-free distance with the Needleman-Wunsch edit distance and with the quality of the BLAST alignment. Our comparison is based on a very simple assumption: the most correct distance is the one obtained by knowing in advance the reference sequence that we are trying to align. We compute the overlap between the reads that is obtained once they have been aligned on the original DNA sequence, and use that as a reference distance; then, we verify how well the alignment-free and the alignment-based distances reproduce this ideal distance. The capability of correctly reproducing this ideal distance is evaluated over samples of read pairs from Saccharomyces cerevisiae (yeast), Escherichia coli, and Homo sapiens (human) genomic sequences. Comparisons are based on the correctness of threshold predictors and are measured and cross-validated over different samples from the same sets of reads. Results. We show that, for the considered sequences, the adopted alignment-free distance performs as well as, or better than, the more time-consuming distances that require the alignment of the reads. This assessment is based on the prediction precision of the analyzed distances on both training and test sets. Conclusions. We present computational results that show the efficacy of an alignment-free distance in estimating a good read-to-read distance measure. We conclude that filtering read pairs based on alignment-free distances may significantly accelerate the assembly process without a substantial loss in accuracy for the reconstruction of the DNA sample sequence.
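The k-mer comparison described in the abstract above can be illustrated with a minimal sketch. The abstract does not state which distance is applied to the k-mer frequencies, so the Euclidean distance used here is an assumption, as are the function names, the default k, and the threshold value; this is not the authors' implementation.

```python
from collections import Counter
from math import sqrt


def kmer_frequencies(read, k):
    """Return the relative frequency of every k-mer occurring in a read."""
    n = len(read) - k + 1
    if n <= 0:
        raise ValueError("read is shorter than k")
    counts = Counter(read[i:i + k] for i in range(n))
    return {kmer: c / n for kmer, c in counts.items()}


def kmer_distance(read_a, read_b, k=3):
    """Alignment-free distance between two reads.

    Assumption: Euclidean distance between the k-mer frequency vectors;
    the abstract only says that k-mer frequencies are compared.
    """
    fa = kmer_frequencies(read_a, k)
    fb = kmer_frequencies(read_b, k)
    kmers = fa.keys() | fb.keys()
    return sqrt(sum((fa.get(m, 0.0) - fb.get(m, 0.0)) ** 2 for m in kmers))


def is_promising(read_a, read_b, k=3, threshold=0.5):
    """Hypothetical threshold predictor: keep pairs whose distance is small."""
    return kmer_distance(read_a, read_b, k) <= threshold
```

A filtering step built on this sketch would call `is_promising` on each candidate read pair and pass only the accepted pairs to the assembly algorithm; in practice the threshold would be tuned on training samples rather than fixed by hand.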
In this work we consider a method for the extraction of knowledge expressed in Disjunctive Normal Form (DNF) from data. The method is mainly designed for classification purposes, and is based on three main steps: Discretization, Feature Selection, and Formula Extraction. The three steps are formulated as optimization problems and solved with ad hoc algorithmic strategies. When used for classification purposes, the proposed approach is designed to perform exact separation of training data and can thus be exposed to overfitting when a significant amount of noise is present. We analyze the main problems that may arise when this method deals with noisy data and propose extensions for the three steps of the method.
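To make the notion of a DNF classification formula and of exact separation concrete, here is a minimal sketch. It covers only how an already extracted formula is evaluated on discretized (boolean) features; the representation, the function names, and the feature names are hypothetical, and the optimization steps that produce the formula are not shown.

```python
def clause_holds(clause, sample):
    """A clause is a conjunction: every listed feature must take its required value."""
    return all(sample.get(feature) == value for feature, value in clause.items())


def dnf_predict(formula, sample):
    """A DNF formula is a disjunction of clauses: true if any clause holds."""
    return any(clause_holds(clause, sample) for clause in formula)


def separates_exactly(formula, positives, negatives):
    """Check exact separation: true on all positive samples, false on all negative ones."""
    return (all(dnf_predict(formula, s) for s in positives)
            and not any(dnf_predict(formula, s) for s in negatives))


# Hypothetical formula: (x1 AND NOT x2) OR x3
formula = [{"x1": True, "x2": False}, {"x3": True}]
```

With noisy training data, a formula that passes `separates_exactly` may need many narrow clauses to cover mislabeled samples, which is one way to see the overfitting risk mentioned in the abstract.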
Microarray Logic Analyzer (MALA) is clustering and classification software engineered specifically for microarray gene expression analysis. The aims of MALA are to cluster the microarray gene expression profiles, in order to reduce the amount of data to be analyzed, and to classify the microarray experiments. To fulfil this objective MALA uses a machine-learning-based methodology that relies on 1) Discretization, 2) Gene clustering, 3) Feature selection, 4) Formula computation, 5) Classification. In this paper we describe the methodology, the software design, and the different releases and user interfaces of MALA. We also emphasize its main strength: the identification of classification formulas that describe the different classes of the microarray experiments precisely and compactly. Finally, we show the experimental results obtained on a real microarray data set coming from Alzheimer-diseased versus control mice microarray probes, and conclude that MALA is a powerful and reliable tool for microarray gene expression analysis.
2014 Complexity in Engineering (COMPENG), 2014
2014 25th International Workshop on Database and Expert Systems Applications, 2014