Giuseppe Jurman - Profile on Academia.edu
Papers by Giuseppe Jurman
ArXiv, 2015
Different strategies have been considered to extract information from social media about how similarly people react to the same news or event. In this context, a powerful method is offered by the application of graph techniques to the contents produced by social network users. In particular, large events typically attract enough content traffic over time to enable an analysis that explicitly models the dependence on the time dimension. Here we demonstrate how it is possible to extend the application of community detection strategies in complex networks to the case of time-dependent multilayer networks, whenever the connection between consecutive time layers is non-trivial. We apply the method to 400K Twitter posts related to the Expo event held in Milan (Italy) between May and October 2015.
TAASRAD19 Radar Scans 2010-2016
TAASRAD19 (Trentino-Alto Adige/Südtirol Radar 2019) is a high-resolution radar reflectivity dataset collected by the Civil Protection weather radar of the Trentino South Tyrol Region, in the Italian Alps. The dataset includes 894,916 scans of precipitation from more than 9 years of data, offering a novel resource to develop and benchmark analog ensemble models and machine learning solutions for precipitation nowcasting. Data are expressed as 2D images, considering the maximum reflectivity on the vertical section and a 5-minute sampling rate, covering an area 240 km in diameter at 500 m horizontal resolution. The TAASRAD19 distribution also includes a curated set of 1,732 sequences, for a total of 362,233 radar images, labeled with precipitation type tags assigned by expert meteorologists. We validated TAASRAD19 as a benchmark for nowcasting using a deep learning model to forecast reflectivity and a procedure based on the UMAP dimensionality reduction method for interactive exploration. Software methods for data pre-processing, model training and inference, and a pre-trained model are publicly available at https://github.com/MPBA/TAASRAD19 for replication and reproducibility.
TAASRAD19 Radar Sequences 2010-2019 NetCDF
TAASRAD19 (Trentino-Alto Adige/Südtirol Radar 2019) is a high-resolution radar reflectivity dataset collected by the Civil Protection weather radar of the Trentino South Tyrol Region, in the Italian Alps. The dataset includes 894,916 scans of precipitation from more than 9 years of data, offering a novel resource to develop and benchmark analog ensemble models and machine learning solutions for precipitation nowcasting. Data are expressed as 2D images, considering the maximum reflectivity on the vertical section and a 5-minute sampling rate, covering an area 240 km in diameter at 500 m horizontal resolution. The TAASRAD19 distribution also includes a curated set of 1,732 sequences, for a total of 362,233 radar images, labeled with precipitation type tags assigned by expert meteorologists. We validated TAASRAD19 as a benchmark for nowcasting using a deep learning model to forecast reflectivity and a procedure based on the UMAP dimensionality reduction method for interactive exploration. Software methods for data pre-processing, model training and inference, and a pre-trained model are publicly available at https://github.com/MPBA/TAASRAD19 for replication and reproducibility.
Genome Biology, 2021
Background: Oncopanel genomic testing, which identifies important somatic variants, is increasingly common in medical practice and especially in clinical trials. Currently, there is a paucity of reliable genomic reference samples having a suitably large number of pre-identified variants for properly assessing oncopanel assay analytical quality and performance. The FDA-led Sequencing and Quality Control Phase 2 (SEQC2) consortium analyzes ten diverse cancer cell lines individually and their pool, termed Sample A, to develop a reference sample with suitably large numbers of coding positions with known (variant) positives and negatives for properly evaluating oncopanel analytical performance. Results: In reference Sample A, we identify more than 40,000 variants down to 1% allele frequency, with more than 25,000 variants having less than 20% allele frequency and 1653 variants in COSMIC-related genes. This is 5–100× more than existing commercially available samples. We also identify an unpr...
IEEE Transactions on Network Science and Engineering, 2020
A variety of complex systems exhibit different types of relationships simultaneously that can be modeled by multiplex networks. A typical problem is to determine the community structure of such systems, which, in general, depends on one or more parameters to be tuned. In this study we propose a measure, grounded in information theory, to find the optimal value of the relax rate characterizing Multiplex Infomap, the generalization of the Infomap algorithm to the realm of multilayer networks. We evaluate our methodology on synthetic networks, to show that the most representative community structure can be reliably identified when the most appropriate relax rate is used. Capitalizing on these results, we use this measure to identify the most reliable meso-scale functional organization in the human protein-protein interaction multiplex network and compare the observed clusters against a collection of independently annotated gene sets from the Molecular Signatures Database (MSigDB). Our analysis reveals that modules obtained with the optimal value of the relax rate are biologically significant and, remarkably, have higher functional content than the ones obtained from the aggregate representation of the human proteome. Our framework allows us to characterize the meso-scale structure of those multilayer systems whose layers are not explicitly interconnected with each other (as in the case of edge-colored models), the ones describing most biological networks, from proteomes to connectomes.
Journal of Environmental Science and Health, Part C, 2018
We introduce here ML4Tox, a framework offering Deep Learning and Support Vector Machine models to predict agonist, antagonist, and binding activities of chemical compounds, in this case for the estrogen receptor ligand-binding domain. The ML4Tox models have been developed with a 10 × 5-fold cross-validation schema on the training portion of the CERAPP ToxCast dataset, formed by 1677 chemicals, each described by 777 molecular features. On the CERAPP "All Literature" evaluation set (agonist: 6319 compounds; antagonist: 6539; binding: 7283), ML4Tox significantly improved sensitivity over published results on all three tasks, with agonist: 0.78 vs 0.56; antagonist: 0.69 vs 0.11; binding: 0.66 vs 0.26.
Heliyon, 2016
Evolving multiplex networks are a powerful model for representing the dynamics over time of different phenomena, such as social networks, power grids, and biological pathways. However, exploring the structure of the multiplex network time series is still an open problem. Here we propose a two-step strategy to tackle this problem based on the concept of distance (metric) between networks. Given a multiplex graph, first a network of networks is built for each time step, and then a real-valued time series is obtained from the sequence of (simple) networks by evaluating the distance from the first element of the series. The effectiveness of this approach in detecting the changes occurring along the original time series is shown on a synthetic example first, and then on the Gulf dataset of political events.
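The second step described above (turning a sequence of simple networks into a real-valued series of distances from the first element) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name the network metric, so a normalized Hamming (edit) distance on undirected edge sets is swapped in here, and the function names `hamming_distance` and `distance_series` are hypothetical. The first step (collapsing each multiplex snapshot into a network of networks) is not covered.

```python
def hamming_distance(edges_a, edges_b, n_nodes):
    """Normalized Hamming distance between two undirected simple graphs.

    ASSUMPTION: this metric is a stand-in chosen for illustration; the
    abstract only says "distance (metric) between networks".
    """
    max_edges = n_nodes * (n_nodes - 1) // 2
    a = {frozenset(e) for e in edges_a}
    b = {frozenset(e) for e in edges_b}
    # Edges present in exactly one of the two graphs, over all possible edges.
    return len(a ^ b) / max_edges


def distance_series(snapshots, n_nodes):
    """Distance of every snapshot from the first element of the series."""
    first = snapshots[0]
    return [hamming_distance(first, snap, n_nodes) for snap in snapshots]


# Toy series on 4 nodes: the structure drifts away from the first snapshot.
snapshots = [
    [(0, 1), (1, 2)],
    [(0, 1), (1, 2)],
    [(0, 1), (2, 3)],
    [(2, 3), (1, 3), (0, 2)],
]
series = distance_series(snapshots, n_nodes=4)
print(series)
```

A jump in the resulting series marks a time step at which the network structure departs from the initial state, which is the kind of change the approach is meant to expose.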
A Machine Learning Pipeline for Identification of Discriminant Pathways
Springer Handbook of Bio-/Neuroinformatics, 2014
Identifying the molecular pathways more prone to disruption during a pathological process is a key task in network medicine and, more generally, in systems biology. This chapter describes a pipeline that couples a machine learning solution for molecular profiling with a recent network comparison method. The pipeline can identify changes occurring between specific sub-modules of networks built in a case-control biomarker study, discriminating key groups of genes whose interactions are modified by an underlying condition. Different algorithms can be chosen to implement the workflow steps. Three applications on genome-wide data are presented regarding the susceptibility of children to air pollution, and early and late onset of Parkinson's and Alzheimer's diseases.
Nature biotechnology, 2014
The concordance of RNA-sequencing (RNA-seq) with microarrays for genome-wide analysis of differential gene expression has not been rigorously assessed using a range of chemical treatment conditions. Here we use a comprehensive study design to generate Illumina RNA-seq and Affymetrix microarray data from the same liver samples of rats exposed in triplicate to varying degrees of perturbation by 27 chemicals representing multiple modes of action (MOAs). The cross-platform concordance in terms of differentially expressed genes (DEGs) or enriched pathways is linearly correlated with treatment effect size (R² ≈ 0.8). Furthermore, the concordance is also affected by transcript abundance and biological complexity of the MOA. RNA-seq outperforms microarray (93% versus 75%) in DEG verification as assessed by quantitative PCR, with the gain mainly due to its improved accuracy for low-abundance transcripts. Nonetheless, classifiers to predict MOAs perform similarly when developed using data fro...
PloS one, 2012
The traditional staging system is inadequate to identify those patients with stage II colorectal cancer (CRC) at high risk of recurrence or with stage III CRC at low risk. A number of gene expression signatures to predict CRC prognosis have been proposed, but none is routinely used in the clinic. The aim of this work was to assess the prediction ability and potential clinical usefulness of these signatures in a series of independent datasets. A literature review identified 31 gene expression signatures that used gene expression data to predict prognosis in CRC tissue. The search was based on the PubMed database and was restricted to papers published from January 2004 to December 2011. Eleven CRC gene expression datasets with outcome information were identified and downloaded from public repositories. A Random Forest classifier was used to build predictors from the gene lists. The Matthews correlation coefficient was chosen as a measure of classification accuracy and its associated p-value...
PLoS ONE, 2012
The identification of robust lists of molecular biomarkers related to a disease is a fundamental step for early diagnosis and treatment. However, methodologies for the discovery of biomarkers using microarray data often provide results with limited overlap. These differences are attributable to 1) dataset size (few subjects with respect to the number of features); 2) heterogeneity of the disease; 3) heterogeneity of experimental protocols and computational pipelines employed in the analysis. In this paper, we focus on the first two issues and assess, both on simulated (through an in silico regulation network model) and real clinical datasets, the consistency of candidate biomarkers provided by a number of different methods. We extensively simulated the effect of heterogeneity characteristic of complex diseases on different sets of microarray data. Heterogeneity was reproduced by simulating both intrinsic variability of the population and the alteration of regulatory mechanisms. Population variability was simulated by modeling the evolution of a pool of subjects; then, a subset of them underwent alterations in regulatory mechanisms so as to mimic the disease state. The simulated data allowed us to outline advantages and drawbacks of different methods across multiple studies and varying numbers of samples and to evaluate the precision of feature selection on a benchmark with known biomarkers. Although comparable classification accuracy was reached by different methods, the use of external cross-validation loops is helpful in finding features with a higher degree of precision and stability. Application to real data confirmed these results.
Nature Biotechnology, 2010
Gene expression data from microarrays are being applied to predict preclinical and clinical endpoints, but the reliability of these predictions has not been established. In the MAQC-II project, 36 independent teams analyzed six microarray data sets to generate predictive models for classifying a sample with respect to one of 13 endpoints indicative of lung or liver toxicity in rodents, or of breast cancer, multiple myeloma or neuroblastoma in humans. In total, >30,000 models were built using many combinations of analytical methods. The teams generated predictive models without knowing the biological meaning of some of the endpoints and, to mimic clinical reality, tested the models on data that had not been used for training. We found that model performance depended largely on the endpoint and team proficiency and that different approaches generated models of similar performance. The conclusions and recommendations from MAQC-II should be useful for regulatory agencies, study committees and independent investigators that evaluate methods for global gene expression analysis. ... 'modeling factors', used to construct each model and the 'internal' and 'external' performance of each model. Internal performance measures the ability of the model to classify the training samples, based on cross-validation exercises. External performance measures the ability of the model to classify the blinded independent validation data. We considered several performance metrics, including Matthews Correlation Coefficient (MCC), accuracy, sensitivity, specificity, area under the receiver operating characteristic curve (AUC) and root mean squared error (r.m.s.e.). These two tables contain data on >30,000 models. Here we report performance based on MCC because...
International Journal of Cancer, 2006
We analyzed the expression signatures of 14 tumor biopsies from children affected by alveolar rhabdomyosarcoma (ARMS) to identify genes correlating with biological features of this tumor. Seven of these patients were positive for the PAX3-FKHR fusion gene and 7 were negative. We used a cDNA platform containing a large majority of probes derived from muscle tissues. The comparison of transcription profiles of tumor samples with fetal skeletal muscle identified 171 differentially expressed genes common to all ARMS patients. The functional classification analysis of altered genes led to the identification of a group of transcripts (LGALS1, BIN1) that may be relevant for the tumorigenic processes. The muscle-specific microarray platform was able to distinguish PAX3-FKHR positive and negative ARMS through the expression pattern of a limited number of genes (RAC1, CFL1, CCND1, IGFBP2) that might be biologically relevant for the different clinical behavior and aggressiveness of the 2 ARMS subtypes. Expression levels for selected candidate genes were validated by quantitative real-time reverse-transcription PCR.
Briefings in Bioinformatics, 2007
The search for predictive biomarkers of disease from high-throughput mass spectrometry (MS) data requires a complex analysis path. Preprocessing and machine-learning modules are pipelined, starting from raw spectra, to set up a predictive classifier based on a shortlist of candidate features. As a machine-learning problem, proteomic profiling on MS data requires the same caution as the microarray case. The risk of overfitting and of selection bias effects is pervasive: not only do potential features easily outnumber samples by 10^3 times, but it is also easy to neglect information-leakage effects during preprocessing from spectra to peaks. The aim of this review is to explain how to build a general purpose design analysis protocol (DAP) for predictive proteomic profiling: we show how to limit leakage due to parameter tuning and how to organize classification and ranking on large numbers of replicate versions of the original data to avoid selection bias. The DAP can be used with alternative components, i.e. with different preprocessing methods (peak clustering or wavelet based), classifiers (e.g. Support Vector Machine) or feature ranking methods (recursive feature elimination or I-Relief). A procedure for assessing the stability and predictive value of the resulting list of biomarkers is also provided. The approach is exemplified with experiments on synthetic datasets (from the Cromwell MS simulator) and with publicly available datasets from cancer studies.
Bioinformatics, 2007
Motivation: We propose a method for studying the stability of biomarker lists obtained from functional genomics studies. It is common to adopt resampling methods to tune and evaluate marker-based diagnostic and prognostic systems in order to prevent selection bias. Such caution promotes honest estimation of class prediction, but leads to alternative sets of solutions. In microarray studies, the difference in lists may be bewildering, also due to the presence of modules of functionally related genes. Methods for assessing stability help to understand the dependency of the markers on the data or on the predictor's type and to select solutions. Results: A computational framework for comparing sets of ranked biomarker lists is presented. Notions and algorithms are based on concepts from permutation group theory. We introduce several algebraic indicators and metric methods for symmetric groups, including the Canberra distance, a weighted version of Spearman's footrule. We also consider distances between partial lists and an aggregation of sets of lists into an optimal list based on voting theory (Borda count). The stability indicators are applied in practical situations to several synthetic, cancer microarray and proteomics datasets. The addressed issues are predictive classification, presence of modules, comparison of alternative biomarker lists, outlier removal, control of selection bias by randomization techniques and enrichment analysis.
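The two ingredients named above, the Canberra distance between ranked lists and Borda-count aggregation, can be illustrated with a short sketch. It assumes complete lists over the same items and uses the textbook form of the Canberra distance on rank positions, sum of |r1 - r2| / (r1 + r2) with ranks starting at 1; partial lists, which the abstract also mentions, are not handled. The function names are hypothetical and this is not the authors' implementation.

```python
def ranks(ordered_items):
    """Map each item to its 1-based position in a ranked list."""
    return {item: pos for pos, item in enumerate(ordered_items, start=1)}


def canberra_distance(list_a, list_b):
    """Canberra distance between two complete rankings of the same items.

    Swaps near the top of the lists weigh more than swaps near the bottom,
    because the denominator (sum of the two ranks) is smaller there.
    """
    ra, rb = ranks(list_a), ranks(list_b)
    if set(ra) != set(rb):
        raise ValueError("this sketch only handles lists over the same items")
    return sum(abs(ra[x] - rb[x]) / (ra[x] + rb[x]) for x in ra)


def borda_aggregate(ranked_lists):
    """Aggregate rankings by total position (lower total ranks first).

    ASSUMPTION: ties are broken by item name, purely for determinism.
    """
    totals = {}
    for ranked in ranked_lists:
        for item, pos in ranks(ranked).items():
            totals[item] = totals.get(item, 0) + pos
    return sorted(totals, key=lambda item: (totals[item], item))


reference = ["geneA", "geneB", "geneC"]
print(canberra_distance(reference, ["geneB", "geneA", "geneC"]))  # top swap
print(canberra_distance(reference, ["geneA", "geneC", "geneB"]))  # bottom swap
print(borda_aggregate([reference, ["geneB", "geneA", "geneC"], ["geneA", "geneC", "geneB"]]))
```

In this toy case the swap at the top of the list yields a larger distance than the swap at the bottom, which is the weighting behavior that distinguishes the Canberra distance from the unweighted footrule.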
Prediction of SRS From Genotype in Autism
Background: Autism Spectrum Disorder (ASD) symptoms are heterogeneous and hard to discriminate into distinct subtypes. Although candidate loci have been recently identified by integration of large ASD cohorts (Wang et al. 2009), new bioinformatics methods are needed to cope with high individual variability. The l1l2 regularization is a feature selection technique capable of generating a specific signature in biologically complex settings. It was applied to detect markers of transcriptional response of neuroblastoma to hypoxia (Fardin et al. 2009), and proposed for predicting quantitative phenotypic traits from high dimensional genetic data (Guzzetta et al. 2009). Here we studied its first large scale application to whole genome association data from the AGRE research program. Objectives: We aim to predict Social Responsiveness Scale (SRS) levels by means of a new bioinformatics platform for quantitative phenotype prediction. Although currently there is a limited coverage of the SRS phenotypes in the AGRE cohort, this is a powerful set of indicators that can be used to determine individual trajectories, the ultimate goal for our analysis. Here we set up a bioinformatics experiment in which all unfiltered variant positions in the genome are used as potential markers and training is based on extreme value cases. Methods: Given the 2,883 AGRE samples genotyped by the Broad Institute with the Affymetrix 5.0 platform (399,197 SNPs), we first identified 803 individuals with only ADI or ADOS-confirmed autism diagnosis and 1446 healthy controls not tested for ADI. Individuals having a teacher-administered SRS questionnaire were then selected, leaving 144 cases and 19 controls. We considered the highest 17 and lowest 18 SRS total scores (respectively, only cases and only controls).
A linear l1l2-regularized regression model was trained on all features, using the SRS total score as target. The experimental protocol was based on the 10x5 procedure of the FDA's MAQC-II (5-fold cross-validation repeated 10 times). For the l1l2 parameter set having the best average R^2 score computed from CV test portions, we evaluated the Area under the Curve (AUC) for classification from real predictions (Wilcoxon Mann-Whitney) and ranked the weights corresponding to each selected SNP. Results: AUC was 0.723 (95% CI: 0.684-0.768), with a fit of R^2 = 0.237 (95% CI: 0.155-0.331). The same 51,744 SNPs were consistently selected in all experiments. Ranked by regression weights, the top 30 markers all had an average position higher than 150. Of these, 24 belong to only five regions: 3p12.2 (3), 8q21.11 (10), 11p12 (3), 11p14.1 (5), Xp11.4 (3). Nearby loci on chromosomes 8 and 11 had been previously identified for SRS by Duvall et al. (2007). Within the top 500 SNPs, we also found 12 SNPs at loci 5p14.1 (2), 14q21.1 (4) and Xq21.1 (6) indicated as candidate markers for autism (Wang et al. 2009). Conclusions: This study represents the first application of a regression method to an autism-related quantitative phenotype. When trained on extreme values of the SRS score, the l1l2 method fairly discriminated cases from controls and explained 23.7% of variance. Moreover, selected markers were stable and consistent with literature. Top ranked markers are being investigated.
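The 10x5 resampling scheme mentioned above (5-fold cross-validation repeated 10 times) can be sketched as an index generator in plain Python. This is a minimal illustration, not the study's pipeline: the function name `repeated_kfold`, the fixed seed, and the unstratified shuffling are all assumptions, and the regression model itself is left out.

```python
import random


def repeated_kfold(n_samples, n_splits=5, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) tuples.

    ASSUMPTION: plain (unstratified) shuffling with a fixed seed; the
    abstract does not say how folds were drawn.
    """
    rng = random.Random(seed)
    for rep in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random partition for every repeat
        folds = [indices[i::n_splits] for i in range(n_splits)]
        for k, test in enumerate(folds):
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield rep, k, train, test


# 35 samples, matching the 17 highest plus 18 lowest SRS scores in the text.
splits = list(repeated_kfold(35))
print(len(splits))  # number of train/test splits: 10 repeats x 5 folds
```

Each of the 50 splits would train a model on `train` and score it on `test`; averaging the test scores over all splits gives the kind of cross-validated R^2 estimate the abstract refers to.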
IEEE Access, 2021
Even if measuring the outcome of binary classifications is a pivotal task in machine learning and statistics, no consensus has been reached yet about which statistical rate to employ to this end. In the last century, the computer science and statistics communities have introduced several scores summing up the correctness of the predictions with respect to the ground truth values. Among these scores, the Matthews correlation coefficient (MCC) was shown to have several advantages over confusion entropy, accuracy, F1 score, balanced accuracy, bookmaker informedness, markedness, and diagnostic odds ratio: MCC, in fact, produces a high score only if the majority of the predicted negative data instances and the majority of the positive data instances are correct, and therefore it proves very trustworthy on imbalanced datasets. In this study, we compare MCC with two other popular scores: Cohen's Kappa, a metric that originated in social sciences, and the Brier score, a strictly proper scoring function which emerged in weather forecasting studies. After explaining the mathematical properties and the relationships between MCC and each of these two rates, we report some use cases where these scores generate different values, which lead to discordant outcomes, where MCC provides a more truthful and informative result. We highlight the reasons why it is more advisable to use MCC rather than Cohen's Kappa and the Brier score to evaluate binary classifications.
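The three scores compared above can be written down from their standard definitions. The sketch below is an illustration, not the paper's code: the function names are hypothetical, labels are assumed to be 0/1, and returning 0 when a denominator vanishes is a common convention adopted here as an assumption.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # ASSUMPTION: an undefined MCC (zero denominator) is reported as 0.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def cohen_kappa(tp, tn, fp, fn):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    n = tp + tn + fp + fn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return 0.0 if expected == 1 else (observed - expected) / (1 - expected)


def brier_score(y_true, probabilities):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, probabilities)) / len(y_true)


# Imbalanced toy case: 95 positives, 5 negatives, everything predicted positive.
y_true = [1] * 95 + [0] * 5
y_pred = [1] * 100
counts = confusion_counts(y_true, y_pred)
accuracy = (counts[0] + counts[1]) / len(y_true)
print(accuracy, mcc(*counts), cohen_kappa(*counts))
```

In this toy case accuracy looks high while both MCC and Cohen's Kappa stay at zero, which illustrates why chance-corrected scores are preferred on imbalanced data; the cases where MCC and Kappa disagree with each other are the subject of the paper itself and are not reproduced here.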
Nature, 2014
Regulated transcription controls the diversity, developmental pathways and spatial organization of the hundreds of cell types that make up a mammal. Using single-molecule cDNA sequencing, we mapped transcription start sites (TSSs) and their usage in human and mouse primary cells, cell lines and tissues to produce a comprehensive overview of mammalian gene expression across the human body. We find that few genes are truly 'housekeeping', whereas many mammalian promoters are composite entities composed of several closely separated TSSs, with independent cell-type-specific expression profiles. TSSs specific to different cell types evolve at different rates, whereas promoters of broadly expressed genes are the most conserved. Promoter-based expression analysis reveals key transcription factors defining cell states and links them to binding-site motifs. The functions of identified novel transcripts can be predicted by coexpression and sample...
ArXiv, 2015
Different strategies have been considered to extract information from social media about how simi... more Different strategies have been considered to extract information from social media about how similarly people react to the same news or event. In this context, a powerful method is offered by the application of graph techniques to the contents produced by social network users. In particular, large events typically attract enough content traffic along time to enable an analysis that explicitly models a dependence from the time dimension. Here we demonstrate how it is possible to extend the application of community detection strategies in complex networks to the case of time-dependent multilayer networks, whenever the connection between consecutive time layers is non-trivial. We apply the method to 400K Twitter post related to the Expo event held in Milan (Italy) between May and October 2015.
TAASRAD19 Radar Scans 2010-2016
TAASRAD19 (Trentino-Alto Adige/Südtirol Radar 2019) is a high-resolution radar reflectivity datas... more TAASRAD19 (Trentino-Alto Adige/Südtirol Radar 2019) is a high-resolution radar reflectivity dataset collected by the Civil Protection weather radar of the Trentino South Tyrol Region, in the Italian Alps.<br> The dataset includes 894,916 scans of precipitation from more than 9 years of data, offering a novel resource to develop and benchmark analog ensemble models and machine learning solutions for precipitation nowcasting. Data are expressed as 2D images, considering the maximum reflectivity on the vertical section and 5 minutes sampling rate, covering an area of 240km of diameter at 500m horizontal resolution. The TAASRAD19 distribution also includes a curated set of 1,732 sequences, for a total of 362,233 radar images, labeled with precipitation type tags assigned by expert meteorologists. We validated TAASRAD19 as a benchmark for nowcasting using deep learning model to forecast reflectivity and a procedure based on the UMAP dimensionality reduction method for interactive exploration.<br> Software methods for data pre-processing, model training and inference, and a pre-trained model are<br> publicly available at https://github.com/MPBA/TAASRAD19 for replication and reproducibility.
TAASRAD19 Radar Sequences 2010-2019 NetCDF
TAASRAD19 (Trentino-Alto Adige/Südtirol Radar 2019) is a high-resolution radar reflectivity datas... more TAASRAD19 (Trentino-Alto Adige/Südtirol Radar 2019) is a high-resolution radar reflectivity dataset collected by the Civil Protection weather radar of the Trentino South Tyrol Region, in the Italian Alps.<br> The dataset includes 894,916 scans of precipitation from more than 9 years of data, offering a novel resource to develop and benchmark analog ensemble models and machine learning solutions for precipitation nowcasting. Data are expressed as 2D images, considering the maximum reflectivity on the vertical section and 5 minutes sampling rate, covering an area of 240km of diameter at 500m horizontal resolution. The TAASRAD19 distribution also includes a curated set of 1,732 sequences, for a total of 362,233 radar images, labeled with precipitation type tags assigned by expert meteorologists. We validated TAASRAD19 as a benchmark for nowcasting using deep learning model to forecast reflectivity and a procedure based on the UMAP dimensionality reduction method for interactive exploration.<br> Software methods for data pre-processing, model training and inference, and a pre-trained model are<br> publicly available at https://github.com/MPBA/TAASRAD19 for replication and reproducibility.
Genome Biology, 2021
Background Oncopanel genomic testing, which identifies important somatic variants, is increasingl... more Background Oncopanel genomic testing, which identifies important somatic variants, is increasingly common in medical practice and especially in clinical trials. Currently, there is a paucity of reliable genomic reference samples having a suitably large number of pre-identified variants for properly assessing oncopanel assay analytical quality and performance. The FDA-led Sequencing and Quality Control Phase 2 (SEQC2) consortium analyze ten diverse cancer cell lines individually and their pool, termed Sample A, to develop a reference sample with suitably large numbers of coding positions with known (variant) positives and negatives for properly evaluating oncopanel analytical performance. Results In reference Sample A, we identify more than 40,000 variants down to 1% allele frequency with more than 25,000 variants having less than 20% allele frequency with 1653 variants in COSMIC-related genes. This is 5–100× more than existing commercially available samples. We also identify an unpr...
IEEE Transactions on Network Science and Engineering, 2020
A variety of complex systems exhibit different types of relationships simultaneously that can be ... more A variety of complex systems exhibit different types of relationships simultaneously that can be modeled by multiplex networks. A typical problem is to determine the community structure of such systems that, in general, depend on one or more parameters to be tuned. In this study we propose one measure, grounded on information theory, to find the optimal value of the relax rate characterizing Multiplex Infomap, the generalization of the Infomap algorithm to the realm of multilayer networks. We evaluate our methodology on synthetic networks, to show that the most representative community structure can be reliably identified when the most appropriate relax rate is used. Capitalizing on these results, we use this measure to identify the most reliable meso-scale functional organization in the human protein-protein interaction multiplex network and compare the observed clusters against a collection of independently annotated gene sets from the Molecular Signatures Database (MSigDB). Our analysis reveals that modules obtained with the optimal value of the relax rate are biologically significant and, remarkably, with higher functional content than the ones obtained from the aggregate representation of the human proteome. Our framework allows us to characterize the meso-scale structure of those multilayer systems whose layers are not explicitly interconnected each other -as in the case of edge-colored models -the ones describing most biological networks, from proteomes to connectomes.
Journal of Environmental Science and Health, Part C, 2018
We introduce here ML4Tox, a framework offering Deep Learning and Support Vector Machine models to predict agonist, antagonist, and binding activities of chemical compounds, in this case for the estrogen receptor ligand-binding domain. The ML4Tox models have been developed with a 10 × 5-fold cross-validation schema on the training portion of the CERAPP ToxCast dataset, formed by 1677 chemicals, each described by 777 molecular features. On the CERAPP "All Literature" evaluation set (agonist: 6319 compounds; antagonist: 6539; binding: 7283), ML4Tox significantly improved sensitivity over published results on all three tasks, with agonist: 0.78 vs 0.56; antagonist: 0.69 vs 0.11; binding: 0.66 vs 0.26.
Heliyon, 2016
Evolving multiplex networks are a powerful model for representing the dynamics over time of different phenomena, such as social networks, power grids, and biological pathways. However, exploring the structure of the multiplex network time series is still an open problem. Here we propose a two-step strategy to tackle this problem based on the concept of distance (metric) between networks. Given a multiplex graph, first a network of networks is built for each time step, and then a real-valued time series is obtained from the sequence of (simple) networks by evaluating the distance of each from the first element of the series. The effectiveness of this approach in detecting the changes occurring along the original time series is shown first on a synthetic example, and then on the Gulf dataset of political events.
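The second step of the strategy described above (turning a sequence of simple networks into a real-valued series of distances from the first snapshot) can be sketched as follows. This is only an illustration under stated assumptions: the normalized Hamming (edge-difference) distance is swapped in as a simple stand-in, since the abstract does not name the metric used, and the helper names `edge`, `normalized_hamming` and `distance_series` are hypothetical. The first step (collapsing each multiplex snapshot into a network of networks) is not covered.

```python
def edge(u, v):
    """Undirected edge as an order-independent, hashable pair."""
    return frozenset((u, v))


def normalized_hamming(edges_a, edges_b, n_nodes):
    """Fraction of node pairs on which two undirected graphs disagree.

    Assumption: this simple edge-difference distance stands in for
    whatever network metric the paper actually uses.
    """
    max_edges = n_nodes * (n_nodes - 1) // 2
    if max_edges == 0:
        return 0.0
    return len(edges_a ^ edges_b) / max_edges


def distance_series(snapshots, n_nodes):
    """Distance of every snapshot from the first element of the series."""
    first = snapshots[0]
    return [normalized_hamming(first, s, n_nodes) for s in snapshots]


# Three toy snapshots on 4 nodes; the last one rewires the graph entirely.
snapshots = [
    {edge(0, 1), edge(1, 2)},
    {edge(0, 1), edge(1, 2), edge(2, 3)},
    {edge(2, 3), edge(0, 3)},
]
series = distance_series(snapshots, n_nodes=4)
print(series)
```

A jump in the resulting series, as between the second and third toy snapshots, is the kind of signal one would inspect for a structural change in the original time series.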
A Machine Learning Pipeline for Identification of Discriminant Pathways
Springer Handbook of Bio-/Neuroinformatics, 2014
Identifying the molecular pathways more prone to disruption during a pathological process is a key task in network medicine and, more generally, in systems biology. This chapter describes a pipeline that couples a machine learning solution for molecular profiling with a recent network comparison method. The pipeline can identify changes occurring between specific sub-modules of networks built in a case-control biomarker study, discriminating key groups of genes whose interactions are modified by an underlying condition. Different algorithms can be chosen to implement the workflow steps. Three applications on genome-wide data are presented regarding the susceptibility of children to air pollution, and early and late onset of Parkinson's and Alzheimer's diseases.
Nature Biotechnology, 2014
The concordance of RNA-sequencing (RNA-seq) with microarrays for genome-wide analysis of differential gene expression has not been rigorously assessed using a range of chemical treatment conditions. Here we use a comprehensive study design to generate Illumina RNA-seq and Affymetrix microarray data from the same liver samples of rats exposed in triplicate to varying degrees of perturbation by 27 chemicals representing multiple modes of action (MOAs). The cross-platform concordance in terms of differentially expressed genes (DEGs) or enriched pathways is linearly correlated with treatment effect size (R² ≈ 0.8). Furthermore, the concordance is also affected by transcript abundance and biological complexity of the MOA. RNA-seq outperforms microarray (93% versus 75%) in DEG verification as assessed by quantitative PCR, with the gain mainly due to its improved accuracy for low-abundance transcripts. Nonetheless, classifiers to predict MOAs perform similarly when developed using data fro...
PLoS ONE, 2012
The traditional staging system is inadequate to identify those patients with stage II colorectal cancer (CRC) at high risk of recurrence or with stage III CRC at low risk. A number of gene expression signatures to predict CRC prognosis have been proposed, but none is routinely used in the clinic. The aim of this work was to assess the prediction ability and potential clinical usefulness of these signatures in a series of independent datasets. A literature review identified 31 gene expression signatures that used gene expression data to predict prognosis in CRC tissue. The search was based on the PubMed database and was restricted to papers published from January 2004 to December 2011. Eleven CRC gene expression datasets with outcome information were identified and downloaded from public repositories. A Random Forest classifier was used to build predictors from the gene lists. The Matthews correlation coefficient was chosen as a measure of classification accuracy and its associated p-value...
PLoS ONE, 2012
The identification of robust lists of molecular biomarkers related to a disease is a fundamental step for early diagnosis and treatment. However, methodologies for the discovery of biomarkers using microarray data often provide results with limited overlap. These differences are attributable to 1) dataset size (few subjects with respect to the number of features); 2) heterogeneity of the disease; 3) heterogeneity of experimental protocols and computational pipelines employed in the analysis. In this paper, we focus on the first two issues and assess, both on simulated (through an in silico regulation network model) and real clinical datasets, the consistency of candidate biomarkers provided by a number of different methods. We extensively simulated the effect of heterogeneity characteristic of complex diseases on different sets of microarray data. Heterogeneity was reproduced by simulating both intrinsic variability of the population and the alteration of regulatory mechanisms. Population variability was simulated by modeling evolution of a pool of subjects; then, a subset of them underwent alterations in regulatory mechanisms so as to mimic the disease state. The simulated data allowed us to outline advantages and drawbacks of different methods across multiple studies and varying numbers of samples and to evaluate precision of feature selection on a benchmark with known biomarkers. Although comparable classification accuracy was reached by different methods, the use of external cross-validation loops is helpful in finding features with a higher degree of precision and stability. Application to real data confirmed these results.
Nature Biotechnology, 2010
Gene expression data from microarrays are being applied to predict preclinical and clinical endpoints, but the reliability of these predictions has not been established. In the MAQC-II project, 36 independent teams analyzed six microarray data sets to generate predictive models for classifying a sample with respect to one of 13 endpoints indicative of lung or liver toxicity in rodents, or of breast cancer, multiple myeloma or neuroblastoma in humans. In total, >30,000 models were built using many combinations of analytical methods. The teams generated predictive models without knowing the biological meaning of some of the endpoints and, to mimic clinical reality, tested the models on data that had not been used for training. We found that model performance depended largely on the endpoint and team proficiency and that different approaches generated models of similar performance. The conclusions and recommendations from MAQC-II should be useful for regulatory agencies, study committees and independent investigators that evaluate methods for global gene expression analysis. 'modeling factors', used to construct each model and the 'internal' and 'external' performance of each model. Internal performance measures the ability of the model to classify the training samples, based on cross-validation exercises. External performance measures the ability of the model to classify the blinded independent validation data. We considered several performance metrics, including Matthews Correlation Coefficient (MCC), accuracy, sensitivity, specificity, area under the receiver operating characteristic curve (AUC) and root mean squared error (r.m.s.e.). These two tables contain data on >30,000 models. Here we report performance based on MCC because
International Journal of Cancer, 2006
We analyzed the expression signatures of 14 tumor biopsies from children affected by alveolar rhabdomyosarcoma (ARMS) to identify genes correlating with biological features of this tumor. Seven of these patients were positive for the PAX3-FKHR fusion gene and 7 were negative. We used a cDNA platform containing a large majority of probes derived from muscle tissues. The comparison of transcription profiles of tumor samples with fetal skeletal muscle identified 171 differentially expressed genes common to all ARMS patients. The functional classification analysis of altered genes led to the identification of a group of transcripts (LGALS1, BIN1) that may be relevant for the tumorigenic processes. The muscle-specific microarray platform was able to distinguish PAX3-FKHR positive and negative ARMS through the expression pattern of a limited number of genes (RAC1, CFL1, CCND1, IGFBP2) that might be biologically relevant for the different clinical behavior and aggressiveness of the 2 ARMS subtypes. Expression levels for selected candidate genes were validated by quantitative real-time reverse-transcription PCR.
Briefings in Bioinformatics, 2007
The search for predictive biomarkers of disease from high-throughput mass spectrometry (MS) data requires a complex analysis path. Preprocessing and machine-learning modules are pipelined, starting from raw spectra, to set up a predictive classifier based on a shortlist of candidate features. As a machine-learning problem, proteomic profiling on MS data needs caution, as in the microarray case. The risk of overfitting and of selection bias effects is pervasive: not only do potential features easily outnumber samples by 10^3 times, but it is also easy to neglect information-leakage effects during preprocessing from spectra to peaks. The aim of this review is to explain how to build a general purpose design analysis protocol (DAP) for predictive proteomic profiling: we show how to limit leakage due to parameter tuning and how to organize classification and ranking on large numbers of replicate versions of the original data to avoid selection bias. The DAP can be used with alternative components, i.e. with different preprocessing methods (peak clustering or wavelet based), classifiers (e.g. Support Vector Machine) or feature ranking methods (recursive feature elimination or I-Relief). A procedure for assessing stability and predictive value of the resulting biomarkers' list is also provided. The approach is exemplified with experiments on synthetic datasets (from the Cromwell MS simulator) and with publicly available datasets from cancer studies.
Bioinformatics, 2007
Motivation: We propose a method for studying the stability of biomarker lists obtained from functional genomics studies. It is common to adopt resampling methods to tune and evaluate marker-based diagnostic and prognostic systems in order to prevent selection bias. Such caution promotes honest estimation of class prediction, but leads to alternative sets of solutions. In microarray studies, the difference in lists may be bewildering, also due to the presence of modules of functionally related genes. Methods for assessing stability help to understand the dependency of the markers on the data or on the predictor's type and to select solutions. Results: A computational framework for comparing sets of ranked biomarker lists is presented. Notions and algorithms are based on concepts from permutation group theory. We introduce several algebraic indicators and metric methods for symmetric groups, including the Canberra distance, a weighted version of Spearman's footrule. We also consider distances between partial lists and an aggregation of sets of lists into an optimal list based on voting theory (Borda count). The stability indicators are applied in practical situations to several synthetic, cancer microarray and proteomics datasets. The addressed issues are predictive classification, presence of modules, comparison of alternative biomarker lists, outlier removal, control of selection bias by randomization techniques and enrichment analysis.
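To make the two list-comparison ideas named in the abstract concrete, here is a minimal sketch of the Canberra distance between two complete rankings and of a Borda-count aggregation. It assumes the usual textbook definitions, with ranks as positions starting at 1 and each term of Spearman's footrule weighted by the sum of the two ranks; the paper's normalization and its treatment of partial lists are not reproduced, and the function names are hypothetical.

```python
def canberra_rank_distance(ranks_a, ranks_b):
    """Canberra distance between two rankings of the same items.

    ranks_a[i] and ranks_b[i] are the positions (1 = top) given to item i.
    Each |difference| is divided by the sum of the two ranks, so
    disagreements near the top of the lists weigh more than at the bottom.
    """
    if len(ranks_a) != len(ranks_b):
        raise ValueError("rankings must cover the same items")
    return sum(abs(a - b) / (a + b) for a, b in zip(ranks_a, ranks_b))


def borda_aggregate(rank_lists):
    """Order items by their total rank across lists (lower total = better).

    Ties are broken by item index; this tie rule is an assumption.
    """
    n_items = len(rank_lists[0])
    totals = [sum(ranks[i] for ranks in rank_lists) for i in range(n_items)]
    return sorted(range(n_items), key=lambda i: (totals[i], i))


# Three rankings of the same three candidate markers.
lists = [[1, 2, 3], [1, 3, 2], [2, 1, 3]]
print(canberra_rank_distance(lists[0], lists[1]))
print(borda_aggregate(lists))
```

Averaging the pairwise distances over all lists produced by resampling would give one simple stability indicator: the lower the mean distance, the more stable the ranking.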
Prediction of SRS From Genotype in Autism
Background: Autism Spectrum Disorder (ASD) symptoms are heterogeneous and hard to discriminate into distinct subtypes. Although candidate loci have been recently identified by integration of large ASD cohorts (Wang et al. 2009), new bioinformatics methods are needed to cope with high individual variability. The l1-l2 regularization is a feature selection technique capable of generating a specific signature in biologically complex settings. It was applied to detect markers of transcriptional response of neuroblastoma to hypoxia (Fardin et al. 2009), and proposed for predicting quantitative phenotype traits from high dimensional genetic data (Guzzetta et al. 2009). Here we studied its first large scale application to whole genome association data from the AGRE research program. Objectives: We aim to predict Social Responsiveness Scale (SRS) levels by means of a new bioinformatics platform for quantitative phenotype prediction. Although currently there is a limited coverage of the SRS phenotypes in the AGRE cohort, this is a powerful set of indicators that can be used to determine individual trajectories, the ultimate goal for our analysis. Here we set up a bioinformatics experiment in which all unfiltered variant positions in the genome are used as potential markers and training is based on extreme value cases. Methods: Given the 2,883 AGRE samples genotyped by the Broad Institute with the Affymetrix 5.0 platform (399,197 SNPs), we first identified 803 individuals with only ADI or ADOS-confirmed autism diagnosis and 1446 healthy controls not tested for ADI. Individuals having a teacher-administered SRS questionnaire were then selected, leaving 144 cases and 19 controls. We considered the highest 17 and lowest 18 SRS total scores (respectively, only cases and only controls).
A linear l1-l2 regularized regression model was trained on all features, using the SRS total score as target. The experiment protocol was based on the 10×5 procedure of the FDA's MAQC-II (5-fold cross-validation repeated 10 times). For the l1-l2 parameter set having the best average R^2 score computed from CV test portions, we evaluated the Area under the Curve (AUC) for classification from real-valued predictions (Wilcoxon-Mann-Whitney) and ranked the weights corresponding to each selected SNP. Results: AUC was 0.723 (95% CI: 0.684-0.768), with a fit of R^2 = 0.237 (95% CI: 0.155-0.331). The same 51,744 SNPs were consistently selected in all experiments. Ranked by regression weights, the top 30 markers all had an average position higher than 150. Of these, 24 belong to only five regions: 3p12.2 (3), 8q21.11 (10), 11p12 (3), 11p14.1 (5), Xp11.4 (3). Nearby loci on chromosomes 8 and 11 had been previously identified for SRS by Duvall et al. (2007). Within the top 500 SNPs, we also found 12 SNPs at loci 5p14.1 (2), 14q21.1 (4) and Xq21.1 (6) indicated as candidate markers for autism (Wang et al. 2009). Conclusions: This study represents the first application of a regression method to an autism-related quantitative phenotype. When trained on extreme values of the SRS score, the l1-l2 method fairly discriminated cases from controls and explained 23.7% of variance. Moreover, selected markers were stable and consistent with the literature. Top ranked markers are being investigated.
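The resampling scheme described in the Methods (5-fold cross-validation repeated 10 times, giving 50 train/test splits) can be sketched with the standard library alone. This is an illustrative assumption of how such splits might be generated, not the MAQC-II code: the function name `repeated_kfold`, the seed, and the plain (non-stratified) shuffling are all hypothetical choices.

```python
import random


def repeated_kfold(n_samples, n_folds=5, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for a 10x5-style protocol.

    Each repeat reshuffles the sample indices and splits them into
    n_folds disjoint test folds; no stratification is attempted here.
    """
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Deal shuffled indices round-robin into folds of near-equal size.
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for k, test_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield repeat, k, train_idx, test_idx


# 35 samples, matching the 17 highest plus 18 lowest SRS scores in the abstract.
splits = list(repeated_kfold(35))
print(len(splits))
```

In such a protocol the model parameters would be chosen by the score averaged over the 50 test portions, so that no single split drives the selection.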
IEEE Access, 2021
Even if measuring the outcome of binary classifications is a pivotal task in machine learning and statistics, no consensus has been reached yet about which statistical rate to employ to this end. In the last century, the computer science and statistics communities have introduced several scores summing up the correctness of the predictions with respect to the ground truth values. Among these scores, the Matthews correlation coefficient (MCC) was shown to have several advantages over confusion entropy, accuracy, F1 score, balanced accuracy, bookmaker informedness, markedness, and diagnostic odds ratio: MCC, in fact, produces a high score only if the majority of the predicted negative data instances and the majority of the positive data instances are correct, and therefore it proves very trustworthy on imbalanced datasets. In this study, we compare MCC with two other popular scores: Cohen's Kappa, a metric that originated in social sciences, and the Brier score, a strictly proper scoring function which emerged in weather forecasting studies. After explaining the mathematical properties and the relationships between MCC and each of these two rates, we report some use cases where these scores generate different values, which lead to discordant outcomes, where MCC provides a more truthful and informative result. We highlight the reasons why it is more advisable to use MCC rather than Cohen's Kappa and the Brier score to evaluate binary classifications.
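For reference, the three scores compared in the abstract can be computed from their standard definitions as in the sketch below. The function names and the convention of returning 0.0 when a denominator vanishes are assumptions made for this illustration, and the toy confusion matrix is invented.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention for a degenerate confusion matrix
    return (tp * tn - fp * fn) / denom


def cohen_kappa(tp, tn, fp, fn):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    n = tp + tn + fp + fn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if expected == 1:
        return 0.0  # assumed convention when chance agreement is total
    return (observed - expected) / (1 - expected)


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Invented, heavily imbalanced example: 95 positives and only 5 negatives.
print(mcc(tp=90, tn=1, fp=4, fn=5))
print(cohen_kappa(tp=90, tn=1, fp=4, fn=5))
print(brier_score([1, 0], [0.8, 0.3]))
```

MCC and Kappa share the same numerator up to scaling and coincide when the numbers of false positives and false negatives are equal; they diverge as the marginals become asymmetric.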
Nature, 2014
Regulated transcription controls the diversity, developmental pathways and spatial organization of the hundreds of cell types that make up a mammal. Using single-molecule cDNA sequencing, we mapped transcription start sites (TSSs) and their usage in human and mouse primary cells, cell lines and tissues to produce a comprehensive overview of mammalian gene expression across the human body. We find that few genes are truly 'housekeeping', whereas many mammalian promoters are composite entities composed of several closely separated TSSs, with independent cell-type-specific expression profiles. TSSs specific to different cell types evolve at different rates, whereas promoters of broadly expressed genes are the most conserved. Promoter-based expression analysis reveals key transcription factors defining cell states and links them to binding-site motifs. The functions of identified novel transcripts can be predicted by coexpression and sample...